1
00:00:02,639 --> 00:00:04,960
Hi, I'm Sarjit Singh, and I'll be talking

2
00:00:04,960 --> 00:00:08,000
about our work, Hina. Hina is a step

3
00:00:08,000 --> 00:00:10,719
towards practical privacy-preserving

4
00:00:10,719 --> 00:00:13,400
inference. Hina is based on homomorphic

5
00:00:13,400 --> 00:00:16,400
encryption which is a cryptographic tool

6
00:00:16,400 --> 00:00:20,560
that allows computation over encrypted data.

7
00:00:20,560 --> 00:00:23,240
Hence, homomorphic encryption is a vital tool

8
00:00:23,240 --> 00:00:26,679
in realizing sensitive applications like

9
00:00:26,679 --> 00:00:28,960
medical imaging, genomics, compute

10
00:00:28,960 --> 00:00:31,960
outsourcing,

11
00:00:32,520 --> 00:00:35,840
etc. However, practical adoption of homomorphic

12
00:00:35,840 --> 00:00:37,760
encryption has been limited by its

13
00:00:37,760 --> 00:00:39,680
performance and

14
00:00:39,680 --> 00:00:42,879
complexity. Homomorphic encryption works with

15
00:00:42,879 --> 00:00:45,360
large-degree polynomials, which are

16
00:00:45,360 --> 00:00:48,360
megabytes in size, leading to significant

17
00:00:48,360 --> 00:00:51,399
memory accesses and hence poor

18
00:00:51,399 --> 00:00:54,039
performance. Additionally, modular

19
00:00:54,039 --> 00:00:56,719
arithmetic over large finite fields and

20
00:00:56,719 --> 00:00:58,559
complex operations like domain

21
00:00:58,559 --> 00:01:01,719
transformations and automorphisms make this

22
00:01:01,719 --> 00:01:04,159
tool extremely difficult to

23
00:01:04,159 --> 00:01:07,040
implement. As a result, researchers have

24
00:01:07,040 --> 00:01:10,280
seen many orders of magnitude slowdown in private

25
00:01:10,280 --> 00:01:14,040
inference as compared to an unencrypted

26
00:01:14,040 --> 00:01:16,720
one. Packing is a widely adopted

27
00:01:16,720 --> 00:01:19,720
mechanism to alleviate the cost of large

28
00:01:19,720 --> 00:01:22,720
polynomials. Instead of encrypting a single

29
00:01:22,720 --> 00:01:26,439
value, a vector of values is packed into

30
00:01:26,439 --> 00:01:29,720
the slots of a polynomial, hence amortizing the

31
00:01:29,720 --> 00:01:33,280
cost of a single cipher

32
00:01:33,280 --> 00:01:36,840
operation. However, in real applications,

33
00:01:36,840 --> 00:01:39,000
these slots might need to be rotated in

34
00:01:39,000 --> 00:01:41,280
order to fully exploit them as you can

35
00:01:41,280 --> 00:01:43,640
see from this pictorial representation,

36
00:01:43,640 --> 00:01:47,040
the colors or the slots have

37
00:01:47,040 --> 00:01:50,719
moved. Rotations are expensive since they

38
00:01:50,719 --> 00:01:53,479
require additional metadata called keys

39
00:01:53,479 --> 00:01:56,200
to perform key switching and these keys

40
00:01:56,200 --> 00:01:58,240
are in megabytes of

41
00:01:58,240 --> 00:02:00,640
size. Moreover,

42
00:02:00,640 --> 00:02:03,799
a non-cyclic permutation of slots

43
00:02:03,799 --> 00:02:06,079
requires a series of many such cyclic

44
00:02:06,079 --> 00:02:09,399
rotations and hence is even more

45
00:02:09,399 --> 00:02:12,640
expensive. In this work, we target homomorphic

46
00:02:12,640 --> 00:02:14,879
encryption-based convolutional neural

47
00:02:14,879 --> 00:02:18,640
networks. We identify these rotations as

48
00:02:18,640 --> 00:02:20,800
the major performance bottleneck in

49
00:02:20,800 --> 00:02:23,239
encrypted

50
00:02:23,239 --> 00:02:26,080
inference. A typical convolution looks

51
00:02:26,080 --> 00:02:29,920
something like this. Each output pixel is

52
00:02:29,920 --> 00:02:32,160
being generated by a convolution of

53
00:02:32,160 --> 00:02:36,400
weights and inputs along the input

54
00:02:36,400 --> 00:02:39,400
channels. Similar colors are multiplied

55
00:02:39,400 --> 00:02:42,080
and the results across colors are added

56
00:02:42,080 --> 00:02:44,640
to generate one single complete output

57
00:02:44,640 --> 00:02:47,760
feature map which is represented by the

58
00:02:47,760 --> 00:02:51,560
red output pixel

59
00:02:51,560 --> 00:02:54,400
here. A few things change when performed

60
00:02:54,400 --> 00:02:56,120
with encrypted

61
00:02:56,120 --> 00:02:59,599
ciphers. First, inputs can be batched

62
00:02:59,599 --> 00:03:03,440
together into polynomial slots using packing,

63
00:03:03,440 --> 00:03:06,680
as we saw on a previous slide.

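As a toy illustration of the packing just described, here is a plaintext simulation. Nothing here is actually encrypted; the slot count and values are made up, and the function names are mine. It only shows how a vector occupies polynomial slots and how one cipher operation acts on all slots at once:

```python
# Plaintext sketch of slot packing: one "ciphertext" holds a vector of
# values in its slots, and a single operation touches every slot at once
# (SIMD style). Real HE schemes do this over encrypted polynomials.

def pack(values, num_slots):
    """Place values into the slots of one simulated ciphertext,
    zero-padding the unused slots."""
    assert len(values) <= num_slots
    return values + [0] * (num_slots - len(values))

def mul(ct_a, ct_b):
    """One ciphertext-ciphertext multiply acts on all slots together."""
    return [a * b for a, b in zip(ct_a, ct_b)]

def rotate(ct, k):
    """Cyclic left rotation of the slots by k positions."""
    k %= len(ct)
    return ct[k:] + ct[:k]

ct_x = pack([1, 2, 3, 4], 8)
ct_w = pack([10, 20, 30, 40], 8)
print(mul(ct_x, ct_w))   # [10, 40, 90, 160, 0, 0, 0, 0]
print(rotate(ct_x, 1))   # [2, 3, 4, 0, 0, 0, 0, 1]
```

In a real HE library each `rotate` call would also need a key-switching key, which is what makes rotations so costly in practice.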
64
00:03:06,680 --> 00:03:10,440
Second, the multiply-accumulate step now

65
00:03:10,440 --> 00:03:12,920
has an additional step called rotation

66
00:03:12,920 --> 00:03:15,560
to be able to fully accumulate

67
00:03:15,560 --> 00:03:19,560
things. On the left, I'm presenting one

68
00:03:19,560 --> 00:03:22,959
way to pack multiple inputs together

69
00:03:22,959 --> 00:03:26,519
This example strategy packs all inputs

70
00:03:26,519 --> 00:03:28,720
that are required to complete one output

71
00:03:28,720 --> 00:03:31,840
pixel. This packing is very

72
00:03:31,840 --> 00:03:34,879
intuitive; however, it requires many

73
00:03:34,879 --> 00:03:38,239
rotations per multiplication which is

74
00:03:38,239 --> 00:03:41,720
bad. In fact, after each multiplication it

75
00:03:41,720 --> 00:03:44,840
requires n or n minus one

76
00:03:44,840 --> 00:03:47,760
rotations which is a huge

77
00:03:47,760 --> 00:03:51,000
cost. Another strategy, which I present

78
00:03:51,000 --> 00:03:54,200
on the right, attempts to achieve

79
00:03:54,200 --> 00:03:55,560
almost zero

80
00:03:55,560 --> 00:03:58,120
rotations. However, such packing is also

81
00:03:58,120 --> 00:04:00,360
inefficient as it

82
00:04:00,360 --> 00:04:03,200
underutilizes the polynomial slots, leading to

83
00:04:03,200 --> 00:04:05,439
ineffectual computations which can be

84
00:04:05,439 --> 00:04:09,159
seen from these white boxes over

85
00:04:09,159 --> 00:04:11,599
here. Both these strategies have been

86
00:04:11,599 --> 00:04:14,920
explored in prior work. In fact, the left

87
00:04:14,920 --> 00:04:17,000
packing is called channel

88
00:04:17,000 --> 00:04:18,959
packing and the right packing is called

89
00:04:18,959 --> 00:04:22,800
LoLa packing. Both these works try to

90
00:04:22,800 --> 00:04:27,120
optimize one

91
00:04:27,120 --> 00:04:30,080
particular metric: either slot utilization or

92
00:04:30,080 --> 00:04:32,639
rotations. Managing this trade-off between

93
00:04:32,639 --> 00:04:35,080
slot utilization and rotations is the

94
00:04:35,080 --> 00:04:38,039
target of our

95
00:04:38,080 --> 00:04:42,400
work. We propose Hina, a packing and data flow

96
00:04:42,400 --> 00:04:45,440
for private CNN inference. The key

97
00:04:45,440 --> 00:04:48,479
objectives while designing Hina are

98
00:04:48,479 --> 00:04:51,960
listed on the left. These are: first, to

99
00:04:51,960 --> 00:04:53,680
maximize slot

100
00:04:53,680 --> 00:04:56,479
utilization. By maximizing slot

101
00:04:56,479 --> 00:04:58,560
utilization, we make sure that the total

102
00:04:58,560 --> 00:05:00,440
number of ciphers that we're working

103
00:05:00,440 --> 00:05:03,960
with is lower; hence the working set and

104
00:05:03,960 --> 00:05:06,520
memory activity are

105
00:05:06,520 --> 00:05:10,120
smaller. Second, to minimize rotations. As

106
00:05:10,120 --> 00:05:12,080
we learned, rotations are expensive

107
00:05:12,080 --> 00:05:14,800
because of expensive key-switching

108
00:05:14,800 --> 00:05:16,600
operations; hence we would like to

109
00:05:16,600 --> 00:05:19,160
minimize them. More importantly, we would

110
00:05:19,160 --> 00:05:21,639
like to minimize the permute type of

111
00:05:21,639 --> 00:05:24,199
rotations. As we remember, permute-type

112
00:05:24,199 --> 00:05:27,160
rotations require many cyclic rotations

113
00:05:27,160 --> 00:05:30,520
and hence many key-switching operations.

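The cost gap between a single cyclic shift and a general permutation can be shown with a plaintext sketch. The rotate-and-mask decomposition below is one standard way to realize an arbitrary slot permutation; the rotation count stands in for the number of key-switching operations a real HE scheme would pay (the function names are mine):

```python
# Plaintext sketch: a general (non-cyclic) slot permutation is built from
# several cyclic rotations, each combined with a plaintext mask. Every
# distinct rotation amount would need its own key-switching key in real HE.

def rotate(ct, k):
    """Cyclic left rotation of the slots by k positions."""
    k %= len(ct)
    return ct[k:] + ct[:k]

def permute(ct, perm):
    """perm[i] = source slot for destination slot i.
    Returns the permuted slots and the number of cyclic rotations used."""
    n = len(ct)
    out = [0] * n
    # each destination needs the rotation amount that moves src -> dst
    amounts = {(src - dst) % n for dst, src in enumerate(perm)}
    for k in sorted(amounts):
        rot = rotate(ct, k)
        # mask keeps only the slots this rotation places correctly
        mask = [1 if (perm[i] - i) % n == k else 0 for i in range(n)]
        out = [o + r * m for o, r, m in zip(out, rot, mask)]
    return out, len(amounts)

ct = [10, 20, 30, 40]
res, rots = permute(ct, [2, 0, 3, 1])   # an arbitrary reshuffle
print(res, rots)                        # [30, 10, 40, 20] 3
```

A cyclic shift such as `perm = [1, 2, 3, 0]` collapses to a single rotation, which is exactly why cyclic rotations are so much cheaper than general permutes.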
114
00:05:30,520 --> 00:05:33,919
Lastly, to maximize on-chip

115
00:05:33,919 --> 00:05:37,840
reuse. We start by packing a single patch

116
00:05:37,840 --> 00:05:40,160
of weights and its corresponding patch in

117
00:05:40,160 --> 00:05:43,160
the input feature map. The output of this

118
00:05:43,160 --> 00:05:45,720
multiplication contributes to the first

119
00:05:45,720 --> 00:05:48,319
output pixel's partial sum, which is

120
00:05:48,319 --> 00:05:52,199
highlighted by the dark red over

121
00:05:52,199 --> 00:05:56,400
here. Next, we pack the

122
00:05:56,400 --> 00:05:59,080
neighboring non-overlapping patch of the

123
00:05:59,080 --> 00:06:02,039
input feature-map polynomial, which is

124
00:06:02,039 --> 00:06:04,199
highlighted by the blue color

125
00:06:04,199 --> 00:06:07,280
here. Instead of packing the

126
00:06:07,280 --> 00:06:10,520
same weight, we pack a different

127
00:06:10,520 --> 00:06:12,440
weight patch from a different output

128
00:06:12,440 --> 00:06:15,639
channel, which is the brown weight

129
00:06:15,639 --> 00:06:19,800
here. By doing this, we generate output

130
00:06:19,800 --> 00:06:22,319
pixels across different output channels,

131
00:06:22,319 --> 00:06:24,520
which is represented by the brown

132
00:06:24,520 --> 00:06:28,000
output feature map here. This avoids

133
00:06:28,000 --> 00:06:31,440
redundant weight packing,

134
00:06:31,440 --> 00:06:34,440
as seen in LoLa packing,

135
00:06:34,440 --> 00:06:37,479
hence improving slot

136
00:06:38,000 --> 00:06:41,039
utilization. Once we exhaust input patches

137
00:06:41,039 --> 00:06:44,199
or output channels, we go across input

138
00:06:44,199 --> 00:06:47,319
channels for the same weight values, which

139
00:06:47,319 --> 00:06:50,720
is represented by the lighter colors

140
00:06:50,720 --> 00:06:53,160
here. These products accumulate to the

141
00:06:53,160 --> 00:06:55,599
same output feature map since input

142
00:06:55,599 --> 00:07:00,440
channels are added together.

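The packing order walked through above can be sketched as a slot-layout generator. This is my plaintext reading of the slides, not the paper's exact layout; the function name, coordinate scheme, and sizes are all illustrative assumptions:

```python
# Hypothetical sketch of the described packing order: slots are filled with
# non-overlapping patches, neighbouring patches carry weights of *different*
# output channels (so weight values are not duplicated, unlike a LoLa-style
# packing), and only then do we move across input channels.

def hina_like_weight_layout(num_patches, out_channels, in_channels):
    """Return the (in_ch, out_ch, patch) coordinate stored in each slot."""
    layout = []
    for ic in range(in_channels):            # outermost: input channels
        for p in range(num_patches):         # then patches within a channel
            # neighbouring patches serve different output channels, so one
            # polynomial contributes to several output feature maps at once
            oc = p % out_channels
            layout.append((ic, oc, p))
    return layout

layout = hina_like_weight_layout(num_patches=4, out_channels=2, in_channels=2)
for slot, coord in enumerate(layout):
    print(slot, coord)
```

Because products from different input channels land in matching slot positions, they can later be added slot-wise before any rotation is spent, which is the property the data flow exploits next.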
143
00:07:00,440 --> 00:07:02,800
So this highlights Hina's packing

144
00:07:02,800 --> 00:07:06,199
strategy. It is complemented by a data

145
00:07:06,199 --> 00:07:08,639
flow that fully exploits the slots with

146
00:07:08,639 --> 00:07:11,319
minimal rotations. So far, we have only

147
00:07:11,319 --> 00:07:13,440
talked about a single multiplication, so

148
00:07:13,440 --> 00:07:17,080
let's talk about what happens next. First,

149
00:07:17,080 --> 00:07:20,199
we reorder these slots so that the

150
00:07:20,199 --> 00:07:23,000
rotations required to accumulate partial

151
00:07:23,000 --> 00:07:25,479
sums are only

152
00:07:25,479 --> 00:07:27,000
cyclic.

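Once the slots are ordered so that only cyclic rotations are needed, the partial sums can be folded with a standard rotate-and-sum reduction. A plaintext sketch (values and slot count made up, function names mine):

```python
# Plaintext sketch of cyclic rotate-and-sum: once partial products sit in
# contiguous slots, log2(n) cyclic rotations suffice to add all n of them,
# with no general permutation needed.

def rotate(ct, k):
    """Cyclic left rotation of the slots by k positions."""
    k %= len(ct)
    return ct[k:] + ct[:k]

def rotate_and_sum(ct):
    """Fold all slots into every position using only cyclic rotations
    (slot count must be a power of two). Returns (result, #rotations)."""
    rotations = 0
    step = len(ct) // 2
    while step >= 1:
        ct = [a + b for a, b in zip(ct, rotate(ct, step))]
        rotations += 1
        step //= 2
    return ct, rotations

ct = [1, 2, 3, 4]               # four partial products in one ciphertext
res, rots = rotate_and_sum(ct)
print(res[0], rots)             # 10 2  (1+2+3+4 in two rotations)
```

The halving step sizes are what keep every rotation cyclic, so only a small, fixed set of rotation keys is ever needed.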
153
00:07:27,000 --> 00:07:29,520
Next, using the same input

154
00:07:29,520 --> 00:07:31,960
features and weight polynomials, we can

155
00:07:31,960 --> 00:07:34,360
generate different output pixels by

156
00:07:34,360 --> 00:07:36,680
rotating our weight polynomial across

157
00:07:36,680 --> 00:07:38,960
different output feature maps. Notice the

158
00:07:38,960 --> 00:07:42,319
color change in the weight polynomial

159
00:07:42,319 --> 00:07:44,840
here; this has

160
00:07:44,840 --> 00:07:47,800
generated new colors in the output

161
00:07:47,800 --> 00:07:50,000
feature Maps as

162
00:07:50,000 --> 00:07:52,960
well. This allows us to reuse the same

163
00:07:52,960 --> 00:07:56,199
operands to generate results across many

164
00:07:56,199 --> 00:07:58,879
output channels, hence minimizing

165
00:07:58,879 --> 00:08:01,319
off-chip data movement.

166
00:08:01,319 --> 00:08:02,759
When we go

167
00:08:02,759 --> 00:08:05,960
across all our input channels, all their

168
00:08:05,960 --> 00:08:09,520
results accumulate to complete the

169
00:08:09,520 --> 00:08:13,080
same output pixels. Notice how after step

170
00:08:13,080 --> 00:08:17,080
B, we don't immediately rotate all

171
00:08:17,080 --> 00:08:19,199
our output partial sums and add the

172
00:08:19,199 --> 00:08:21,879
similar colors.

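The pay-off of delaying that rotation can be sketched in plaintext. Rotation counts stand in for key-switching cost; the two-channel example values are made up and the function names are mine:

```python
# Plaintext sketch of delayed accumulation: instead of rotate-and-summing
# each input channel's product right away (one rotate-reduce per channel),
# first add all channel products slot-wise (rotation-free), then do a
# single rotate-and-sum at the end. With C channels this cuts rotation
# calls by roughly a factor of C.

def rotate(ct, k):
    """Cyclic left rotation of the slots by k positions."""
    k %= len(ct)
    return ct[k:] + ct[:k]

def rotate_reduce(ct):
    """Cyclic rotate-and-sum of all slots; returns (result, #rotations)."""
    rots = 0
    step = len(ct) // 2
    while step >= 1:
        ct = [a + b for a, b in zip(ct, rotate(ct, step))]
        rots += 1
        step //= 2
    return ct, rots

products = [[1, 2, 3, 4], [5, 6, 7, 8]]   # per-input-channel products

# Eager: rotate-reduce every channel's product, then add the scalars.
eager_total, eager_rots = 0, 0
for p in products:
    red, r = rotate_reduce(p)
    eager_total += red[0]
    eager_rots += r

# Delayed: slot-wise add first, rotate-reduce once.
acc = [sum(col) for col in zip(*products)]
delayed, delayed_rots = rotate_reduce(acc)

print(eager_total, eager_rots)     # 36 4
print(delayed[0], delayed_rots)    # 36 2
```

Both orders produce the same sum; only the number of (expensive) rotations differs, which is the whole point of deferring the accumulation.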
173
00:08:21,879 --> 00:08:25,080
Instead, we perform this rotation

174
00:08:25,080 --> 00:08:27,960
assisted partial-sum accumulation only

175
00:08:27,960 --> 00:08:30,440
after we have added across all

176
00:08:30,440 --> 00:08:33,240
separately packed input channels. By

177
00:08:33,240 --> 00:08:35,799
delaying this rotation, we have reduced

178
00:08:35,799 --> 00:08:38,399
rotation calls by a factor of almost the

179
00:08:38,399 --> 00:08:41,440
input-channel

180
00:08:41,440 --> 00:08:46,080
count. Next, we complete the output pixels

181
00:08:46,080 --> 00:08:49,399
by going over all output channels. We

182
00:08:49,399 --> 00:08:53,279
reuse the input features to generate new

183
00:08:53,279 --> 00:08:56,200
combinations using permute-type

184
00:08:56,200 --> 00:08:59,079
rotations. Finally, we perform these

185
00:08:59,079 --> 00:09:03,880
operations for all output

186
00:09:04,800 --> 00:09:07,519
pixels. Overall, the key strategies that

187
00:09:07,519 --> 00:09:10,079
have gone into this packing and data

188
00:09:10,079 --> 00:09:14,000
flow are: first, carefully tailored slot

189
00:09:14,000 --> 00:09:16,440
packing that helps us convert permutes

190
00:09:16,440 --> 00:09:19,800
to cyclic shifts while maximizing

191
00:09:19,800 --> 00:09:21,560
slot

192
00:09:21,560 --> 00:09:23,880
utilization. Second, a data flow that

193
00:09:23,880 --> 00:09:26,120
delays partial-sum accumulation until

194
00:09:26,120 --> 00:09:28,560
all input channels have been added; this

195
00:09:28,560 --> 00:09:31,000
allows for reduced rotation

196
00:09:31,000 --> 00:09:33,920
counts. Lastly, generating new inputs

197
00:09:33,920 --> 00:09:36,920
using inputs already cached on chip;

198
00:09:36,920 --> 00:09:41,079
this minimizes memory

199
00:09:41,079 --> 00:09:45,160
activity. We evaluate Hina on an ASIC

200
00:09:45,160 --> 00:09:48,519
design to run ResNet-50, MobileNet, and

201
00:09:48,519 --> 00:09:52,519
GNMT models. We employ a hybrid strategy

202
00:09:52,519 --> 00:09:55,160
where MPC is used to preserve model

203
00:09:55,160 --> 00:09:57,480
privacy while the client performs the

204
00:09:57,480 --> 00:09:59,720
nonlinear layers.

205
00:09:59,720 --> 00:10:02,399
More importantly, we have released a tool

206
00:10:02,399 --> 00:10:05,079
for the community to explore novel packing,

207
00:10:05,079 --> 00:10:07,160
data flow strategies, and hardware

208
00:10:07,160 --> 00:10:10,079
designs for private inference. This tool

209
00:10:10,079 --> 00:10:13,480
is currently public on GitHub and is

210
00:10:13,480 --> 00:10:16,000
called

211
00:10:16,000 --> 00:10:18,760
H-PackSim. More details about how we modeled

212
00:10:18,760 --> 00:10:20,560
prior baseline packings and

213
00:10:20,560 --> 00:10:22,440
architectural implementations are

214
00:10:22,440 --> 00:10:24,240
presented in our full

215
00:10:24,240 --> 00:10:27,839
paper. Moving on to the evaluation

216
00:10:27,839 --> 00:10:30,120
results. On the top, we present

217
00:10:30,120 --> 00:10:33,160
rotation call counts, and below we

218
00:10:33,160 --> 00:10:35,440
present data footprint in

219
00:10:35,440 --> 00:10:38,440
gigabytes. The rotation count is

220
00:10:38,440 --> 00:10:41,399
broken down into permutes and cyclic

221
00:10:41,399 --> 00:10:44,320
shifts. We have compared against the baselines

222
00:10:44,320 --> 00:10:48,680
LoLa pack, channel pack, and C2PC pack. We

223
00:10:48,680 --> 00:10:51,079
note that compared to the baselines, we

224
00:10:51,079 --> 00:10:53,560
see a very small count of rotations and

225
00:10:53,560 --> 00:10:57,480
data footprint. In fact, Hina has almost

226
00:10:57,480 --> 00:11:00,279
zero permute-type rotations and at least

227
00:11:00,279 --> 00:11:03,000
10 times lower memory activity than the

228
00:11:03,000 --> 00:11:07,320
other packings. This is mainly due to our

229
00:11:07,320 --> 00:11:09,720
efficiently packed and efficiently

230
00:11:09,720 --> 00:11:13,000
managed slots using our data

231
00:11:13,000 --> 00:11:16,399
flow. As a result, our encrypted latencies

232
00:11:16,399 --> 00:11:18,680
are in tens to hundreds of milliseconds

233
00:11:18,680 --> 00:11:22,079
only, which is at least 2x faster than

234
00:11:22,079 --> 00:11:24,959
the best baseline and far more energy

235
00:11:24,959 --> 00:11:27,480
efficient. This is represented by the two

236
00:11:27,480 --> 00:11:29,600
graphs here: the left-hand side

237
00:11:29,600 --> 00:11:32,120
represents the inference latency across

238
00:11:32,120 --> 00:11:33,480
different

239
00:11:33,480 --> 00:11:36,079
benchmarks; the right-hand side graph is

240
00:11:36,079 --> 00:11:38,800
a stacked graph that demonstrates the

241
00:11:38,800 --> 00:11:41,240
distribution of energy consumption by

242
00:11:41,240 --> 00:11:42,639
each hardware

243
00:11:42,639 --> 00:11:45,040
component. As noticed from the previous

244
00:11:45,040 --> 00:11:48,200
slide, the high data footprint of prior

245
00:11:48,200 --> 00:11:50,959
packings has led them to high memory

246
00:11:50,959 --> 00:11:53,519
accesses, which is also represented by

247
00:11:53,519 --> 00:11:57,079
the blue bar in the right

248
00:11:57,079 --> 00:12:00,160
figure. Overall, carefully tailored

249
00:12:00,160 --> 00:12:03,000
slots and a well-managed data flow have

250
00:12:03,000 --> 00:12:06,920
allowed Hina to reduce both compute and

251
00:12:06,920 --> 00:12:08,560
data movement

252
00:12:08,560 --> 00:12:11,200
costs. With that, I would like to conclude

253
00:12:11,200 --> 00:12:14,279
my presentation. In conclusion, we

254
00:12:14,279 --> 00:12:17,199
demonstrate a practical encrypted

255
00:12:17,199 --> 00:12:20,680
inference solution. We identify rotations

256
00:12:20,680 --> 00:12:24,360
as the key bottleneck. We explore various

257
00:12:24,360 --> 00:12:26,959
packing and data flow strategies. We

258
00:12:26,959 --> 00:12:29,600
finally publish an open-source tool called

259
00:12:29,600 --> 00:12:33,519
H-PackSim for more exploratory studies. For

260
00:12:33,519 --> 00:12:36,440
more information about this work please

261
00:12:36,440 --> 00:12:40,959
refer to our full paper at the

262
00:12:40,959 --> 00:12:44,800
symposium. Thank you.