Hi, I'm Sarabjeet Singh, and I'll be talking about our work, Hyena. Hyena is a step towards practical privacy-preserving inference. It is based on homomorphic encryption, a cryptographic tool that allows computation over encrypted data. Homomorphic encryption is hence a vital tool in realizing sensitive applications like medical imaging, genomics, compute outsourcing, etc.

However, practical adoption of homomorphic encryption has been limited by its performance and complexity. Homomorphic encryption works with large-degree polynomials that are megabytes in size, leading to significant memory accesses and hence poor performance. Additionally, modular arithmetic over large finite fields and complex operations like domain transformations and automorphisms make this tool extremely difficult to implement. As a result, researchers have seen many orders of magnitude of slowdown in private inference compared to unencrypted inference.

Packing is a widely adopted mechanism to alleviate the cost of large polynomials: instead of encrypting a single value, a vector of values is packed into the slots of a polynomial.
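As a toy illustration of the packing idea, here is a plaintext sketch (no real encryption, and the slot count of 8 is a made-up toy size; real schemes offer thousands of slots): many values share one "ciphertext", so one homomorphic operation acts on all of them at once.

```python
# Plaintext stand-in for a packed ciphertext: one "polynomial" whose
# slots each hold an independent value. SLOTS = 8 is a toy size only.
SLOTS = 8

def pack(values):
    """Place a vector of values into the slots of one 'ciphertext'."""
    assert len(values) <= SLOTS
    return list(values) + [0] * (SLOTS - len(values))

def slot_mul(a, b):
    """One homomorphic multiply acts on every slot at once (SIMD-style),
    so the cost of a single ciphertext operation is amortized."""
    return [x * y for x, y in zip(a, b)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(slot_mul(a, b))  # [10, 40, 90, 160, 0, 0, 0, 0]
```

One multiply produced four useful products; in a real scheme the same effect holds across thousands of slots.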
This amortizes the cost of a single ciphertext operation across many values. However, in real applications these slots might need to be rotated in order to fully exploit them; as you can see from this pictorial representation, the colors, i.e. the slots, have moved. Rotations are expensive, since they require additional metadata, called keys, to perform key switching, and these keys are megabytes in size. Moreover, a non-cyclic permutation of slots requires a series of many such cyclic rotations and is hence even more expensive.

In this work we target homomorphic-encryption-based convolutional neural networks, and we identify these rotations as the major performance bottleneck in encrypted inference.

A typical convolution looks something like this: each output pixel is generated by a convolution of the weights and inputs along the input channels. Similar colors are multiplied, and the results across colors are added to generate one complete output pixel of the output feature map, represented by the red output pixel here. A few things change when this is performed with encrypted ciphertexts.
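The earlier point, that a non-cyclic permutation decomposes into many cyclic rotations, can be sketched as follows. This is an illustrative plaintext model, not any library's API: each distinct shift amount stands in for one key-switch, and the masking that combines the rotated copies is ignored for simplicity.

```python
def rotate(slots, k):
    """Cyclic left-rotation of the slots. In real HE, every distinct
    rotation amount needs a key switch with a megabyte-scale key."""
    k %= len(slots)
    return slots[k:] + slots[:k]

def permute(slots, perm):
    """Arbitrary slot permutation (out[dst] = slots[perm[dst]]), built
    from cyclic rotations plus masking: one rotation per distinct shift."""
    n = len(slots)
    out = [None] * n
    # group destination slots by how far their source has to rotate
    groups = {}
    for dst, src in enumerate(perm):
        groups.setdefault((src - dst) % n, []).append(dst)
    rotations = 0
    for shift, dsts in groups.items():
        rotated = rotate(slots, shift)
        if shift:
            rotations += 1  # each non-trivial shift is one key switch
        for dst in dsts:    # a masked add keeps only the wanted slots
            out[dst] = rotated[dst]
    return out, rotations

# a pure cyclic shift costs one rotation, but this arbitrary
# permutation decomposes into three distinct cyclic rotations
print(permute(["a", "b", "c", "d"], [2, 0, 3, 1]))  # (['c', 'a', 'd', 'b'], 3)
```

A single cyclic shift needs one key switch; the arbitrary permutation above needs three, which is why permute-type rotations are the expensive case.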
With encrypted ciphertexts, first, inputs can be batched together into polynomial slots using packing, as we saw on the previous slide. Second, the multiply-accumulate step now has an additional step, rotation, to be able to fully accumulate the products.

On the left, I'm presenting one way to pack multiple inputs together. This example strategy packs all the inputs that are required to complete one output pixel. This packing is very intuitive; however, it requires many rotations per multiplication, which is bad. In fact, after each multiplication it requires n, or n minus one, rotations, which is a huge cost.

The other strategy, which I present on the right, attempts to achieve almost zero rotations. However, such packing is also inefficient, as it underutilizes the polynomial slots, leading to ineffectual computations, which can be seen in the white boxes here. Both these strategies have been explored in prior work; in fact, the left packing is called channel packing and the right packing is called LoLa packing.
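The rotation cost of the channel-style packing on the left can be illustrated with a plaintext rotate-and-sum sketch (again a toy model, not real HE): once the n products for one output pixel sit in n slots, accumulating them costs n - 1 cyclic rotations.

```python
def rotate(slots, k):
    """Cyclic left-rotation; in real HE each call is one key switch."""
    k %= len(slots)
    return slots[k:] + slots[:k]

def rotate_and_sum(packed):
    """Channel-packing sketch: after the multiply, the n products sit in
    n slots and must be summed into one slot, costing n - 1 rotations.
    (A log2(n)-depth tree is also possible; rotations still dominate.)"""
    acc = list(packed)
    rotations = 0
    for k in range(1, len(packed)):
        shifted = rotate(packed, k)
        acc = [x + y for x, y in zip(acc, shifted)]
        rotations += 1
    return acc[0], rotations

print(rotate_and_sum([1, 2, 3, 4]))  # (10, 3): full sum, n - 1 rotations
```

Paying n - 1 (or log2 n) key switches after every single multiplication is exactly the per-multiplication rotation cost the slide calls out.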
Both these works optimize one particular metric, either slot utilization or rotations. Managing this trade-off between slot utilization and rotations is the target of our work.

We propose Hyena, a packing and dataflow for private CNN inference. The key objectives while designing Hyena are listed on the left. These are, first, to maximize slot utilization: by maximizing slot utilization we make sure that the total number of ciphertexts we are working with is lower, and hence the working set and memory activity are smaller. Second, to minimize rotations: as we learned, rotations are expensive because of costly key-switching operations, so we would like to minimize them. More importantly, we would like to minimize the permute-type rotations; as we remember, permute-type rotations require many cyclic rotations and hence many key-switching operations. Lastly, to maximize on-chip reuse.

We start by packing a single phase of the weights and the corresponding phase of the input feature map.
The output of this multiplication contributes to the first output pixel's partial sum, which is highlighted in dark red here. Next, we pack the neighboring, non-overlapping phase of the input feature map into the polynomial, which is highlighted in blue here. Instead of packing the same weight, we pack a different weight phase, from a different output channel, which is the brown weight here. By doing this, we generate output pixels across different output channels, represented by the brown output feature map here. This avoids the redundant weight packing of LoLa-style packing, hence improving slot utilization.

Once we exhaust the input phases or output channels, we go across input channels with the same weight values, represented by the lighter colors here. These products accumulate into the same output feature map, since input channels are added together.

This highlights Hyena's packing strategy. It is complemented by a dataflow that fully exploits the slots with minimum rotations.
So far we have only talked about a single multiplication; let's talk about what happens next. First, we reorder the slots so that the rotations required to accumulate partial sums are only cyclic. Next, using the same input feature and weight polynomials, we can generate different output pixels by rotating our weight polynomial across different output feature maps; notice the color change in the weight polynomial here, which has generated new colors in the output feature maps as well. This allows us to reuse the same operands to generate results across many output channels, hence minimizing off-chip movement.

When we go across all our input channels, all their results accumulate to complete the same output pixels. Notice how, after step B, we did not immediately rotate all our output partial sums and add the similar colors. Instead, we perform this rotation-assisted partial-sum accumulation only after we have added across all separately packed input channels.
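A rough way to see the savings from delaying the rotation is to count key switches under a log-depth rotate-and-sum (the channel count and slot count below are illustrative, not the paper's configuration):

```python
import math

def rotations_immediate(c_in, slots_per_psum):
    """Rotate-and-accumulate right after each input channel's multiply:
    every one of the c_in partial products pays its own rotation tree."""
    return c_in * math.ceil(math.log2(slots_per_psum))

def rotations_delayed(c_in, slots_per_psum):
    """Delayed accumulation: first add all c_in packed partial products
    (ciphertext additions are cheap and need no key switch), then run a
    single rotation tree over the combined ciphertext."""
    return math.ceil(math.log2(slots_per_psum))

# illustrative sizes: 64 input channels, 8 slots per partial sum
print(rotations_immediate(64, 8), rotations_delayed(64, 8))  # 192 3
```

The ratio between the two counts is the input channel count, matching the reduction "by a factor of almost the input channel count" described next.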
By delaying this rotation, we reduce rotation calls by a factor of almost the input channel count.

Next, we complete the output pixels by going over all output channels. We reuse the input features to generate new combinations using permute-type rotations, and finally we perform these operations for all output pixels.

Overall, the key strategies that have gone into this packing and dataflow are: first, carefully tailored slot packing that helps us convert permutes into cyclic shift rotations while maximizing slot utilization; second, a dataflow that delays partial-sum accumulation until all input channels have been added, which enables the reduced rotation counts; and lastly, generating new inputs from the inputs already cached on chip, which minimizes memory activity.

We evaluate Hyena on an ASIC design running the ResNet-50, MobileNet, and GNMT models. We employ a hybrid strategy where MPC is used to preserve model privacy while the client performs the nonlinear layers.
More importantly, we have released a tool for the community to explore novel packing and dataflow strategies and hardware designs for private inference. This tool is currently public on GitHub and is called HPacSim. More details about how we modeled the prior baseline packings and the architectural implementations are presented in our full paper.

Moving on to the evaluation results: on the top we present rotation counts, and below we present the data footprint in gigabytes. The rotation count is broken down into permutes and cyclic shifts. We compare against the baselines LoLa pack, channel pack, and C2PC pack. We note that, compared to the baselines, we have a very small count of rotations and a small data footprint. In fact, Hyena has almost zero permute-type rotations and at least 10 times lower memory activity than the other packings. This is mainly due to our efficiently packed and efficiently managed slots.

As a result, our encrypted latencies are only in the tens to hundreds of milliseconds, which is at least 2x
faster than the best baseline, and much more energy efficient. This is shown by the two graphs here: the left-hand side presents the inference latency across the different benchmarks, and the right-hand side is a stacked graph showing the distribution of energy consumption across the hardware components. As noted on the previous slide, the high data footprint of the prior packings leads them to high memory traffic, which is also reflected in the blue bar in the right figure. Overall, carefully tailored slots and a well-managed dataflow have allowed Hyena to reduce both compute and data-movement costs.

With that, I would like to conclude my presentation. In conclusion, we demonstrate a practical encrypted inference solution; we identify rotations as the key bottleneck; we explore various packing and dataflow strategies; and we publish an open-source tool, HPacSim, for further exploratory studies. For more information about this work, please refer to our full paper at the IEEE Symposium on Security and Privacy.
Thank you.