1 00:00:02,560 --> 00:00:09,340 and so this is gpu-accelerated security 2 00:00:05,770 --> 00:00:09,940 so first off standards go mine hey I'm 3 00:00:09,340 --> 00:00:11,590 Andrew 4 00:00:09,940 --> 00:00:14,759 that's my website you're interested to 5 00:00:11,590 --> 00:00:17,710 have some notable projects and a before 6 00:00:14,759 --> 00:00:19,810 so before we begin felt it's worth 7 00:00:17,710 --> 00:00:22,330 getting up a disclaimer my knowledge is 8 00:00:19,810 --> 00:00:24,788 absolutely not absolute and they're very 9 00:00:22,330 --> 00:00:26,979 likely things I've missed and my 10 00:00:24,789 --> 00:00:29,170 research primarily focused on intrusion 11 00:00:26,980 --> 00:00:30,760 detection systems or its focusing on 12 00:00:29,170 --> 00:00:32,830 intrusion detection systems I've just 13 00:00:30,760 --> 00:00:35,890 recognized similarities and other areas 14 00:00:32,830 --> 00:00:38,019 that could be applied to and there's a 15 00:00:35,890 --> 00:00:39,460 lot of things that I would like to 16 00:00:38,020 --> 00:00:41,829 explore by simply do not have time for 17 00:00:39,460 --> 00:00:43,239 and it's 53 days left 18 00:00:41,829 --> 00:00:46,329 sorry to all the fourth-year syndrome 19 00:00:43,239 --> 00:00:47,890 and and if I'm to explain doubt in 20 00:00:46,329 --> 00:00:49,570 perlier general questions please just 21 00:00:47,890 --> 00:00:51,160 weave in to the end cuz I've got way too 22 00:00:49,570 --> 00:00:55,090 many slights to cover for the time flaw 23 00:00:51,160 --> 00:00:56,620 and another thing is it took me hours to 24 00:00:55,090 --> 00:00:59,649 fully comprehend some of the topics I'm 25 00:00:56,620 --> 00:01:01,000 going to go through so I'm gonna try and 26 00:00:59,649 --> 00:01:02,559 explain them the best I can but don't 27 00:01:01,000 --> 00:01:05,259 want it if at the end you're included 28 00:01:02,559 --> 00:01:06,250 confused so the other thing is I can 29 00:01:05,259 --> 00:01:07,720 remember being a first-tier and 30 00:01:06,250 --> 00:01:09,280 completely glazing over some talks 31 00:01:07,720 --> 00:01:11,650 because there was just key terms or 32 00:01:09,280 --> 00:01:14,770 acronyms not used and or that I didn't 33 00:01:11,650 --> 00:01:17,680 know so uh there are some of the things 34 00:01:14,770 --> 00:01:19,360 I might say feathers talk and if you 35 00:01:17,680 --> 00:01:25,509 don't know what they are have a quick 36 00:01:19,360 --> 00:01:28,150 look and most certainly is the head fat 37 00:01:25,509 --> 00:01:30,630 algorithm and I'm gonna say Colonel I 38 00:01:28,150 --> 00:01:35,650 don't mean Linux kernel I mean a 39 00:01:30,630 --> 00:01:38,560 function execute on GPU so I want to 40 00:01:35,650 --> 00:01:41,259 start this off with a preface lots of 41 00:01:38,560 --> 00:01:43,570 applications try and implement a GPU 42 00:01:41,259 --> 00:01:46,960 support or there's initially a we will 43 00:01:43,570 --> 00:01:47,860 do GPU as you can see the special in 44 00:01:46,960 --> 00:01:49,570 case of Surakarta 45 00:01:47,860 --> 00:01:54,640 is that things won't always go to plan 46 00:01:49,570 --> 00:01:55,809 and there's often at least from the open 47 00:01:54,640 --> 00:01:58,360 source projects I've been able to come 48 00:01:55,810 --> 00:02:04,000 across they've been issue with regards 49 00:01:58,360 --> 00:02:06,490 to continuing support for GPU I'm on a 50 00:02:04,000 --> 00:02:08,079 night with open source a lot of things 51 00:02:06,490 --> 00:02:13,480 just here's a paper on this awesome 52 00:02:08,079 --> 00:02:15,340 concept and pay well and are the core 53 00:02:13,480 --> 00:02:15,970 there's no source code or anything even 54 00:02:15,340 --> 00:02:18,100 terrific 55 00:02:15,970 --> 00:02:23,560 that provided to back up the claims are 56 00:02:18,100 --> 00:02:27,130 making so however before we continue 57 00:02:23,560 --> 00:02:29,170 their notes GPUs aren't necessarily a 58 00:02:27,130 --> 00:02:31,540 complete replacement for all of these 59 00:02:29,170 --> 00:02:34,359 tasks a lot of things you're better off 60 00:02:31,540 --> 00:02:38,650 having balance at blend of CPU and GPU 61 00:02:34,360 --> 00:02:40,090 and and for my particular examples and 62 00:02:38,650 --> 00:02:42,010 intrusion detection system does a lot 63 00:02:40,090 --> 00:02:43,750 more than just signature matching that's 64 00:02:42,010 --> 00:02:47,709 just the area focuses on because of that 65 00:02:43,750 --> 00:02:49,959 time on the base of it so first off my 66 00:02:47,710 --> 00:02:52,690 methodology in the direction started off 67 00:02:49,959 --> 00:02:55,990 with when should I actually be using the 68 00:02:52,690 --> 00:02:58,079 GPU and effectively you need to do a 69 00:02:55,990 --> 00:03:00,610 computation on a huge number of items 70 00:02:58,080 --> 00:03:02,320 computation doesn't rely on the 71 00:03:00,610 --> 00:03:06,720 computation of other items so it's 72 00:03:02,320 --> 00:03:11,170 fairly independent and is fairly simple 73 00:03:06,720 --> 00:03:12,040 put a star in there and basically as 74 00:03:11,170 --> 00:03:13,600 long as you've not got too many 75 00:03:12,040 --> 00:03:17,010 branching operations we find difficult 76 00:03:13,600 --> 00:03:19,660 lots of branching operations the 77 00:03:17,010 --> 00:03:22,359 processing of a GPUs the cores are 78 00:03:19,660 --> 00:03:24,730 fairly stupid and so not as fast as see 79 00:03:22,360 --> 00:03:26,500 if you definitely I'm going to skip this 80 00:03:24,730 --> 00:03:27,940 bottom half because it's basically the 81 00:03:26,500 --> 00:03:31,690 inverse of the top part but quite a lot 82 00:03:27,940 --> 00:03:33,370 to get through and so to summarize what 83 00:03:31,690 --> 00:03:35,550 the GPU before performing simple 84 00:03:33,370 --> 00:03:39,130 operations on a huge volume of data and 85 00:03:35,550 --> 00:03:42,010 what's Picabia does in the GPU the data 86 00:03:39,130 --> 00:03:43,510 must be copied to the GPU and we want to 87 00:03:42,010 --> 00:03:45,910 get your results back off the GPU age 88 00:03:43,510 --> 00:03:48,940 pocket or back half again and that 89 00:03:45,910 --> 00:03:51,100 introducing overhead so my methodology 90 00:03:48,940 --> 00:03:53,290 for how to go about approaching this to 91 00:03:51,100 --> 00:03:55,650 speed things up was to maximize minimize 92 00:03:53,290 --> 00:03:57,670 and then just generally optimize and 93 00:03:55,650 --> 00:04:00,730 again now I apply these principles 94 00:03:57,670 --> 00:04:04,958 somewhat to them like file carving and 95 00:04:00,730 --> 00:04:10,600 intrusion detection systems so first off 96 00:04:04,959 --> 00:04:13,870 a maximizing so when you copy data to 97 00:04:10,600 --> 00:04:16,029 the GPU there is there's overhead and as 98 00:04:13,870 --> 00:04:19,899 if you were compared to if you were just 99 00:04:16,029 --> 00:04:21,700 to execute that same function on CPU so 100 00:04:19,899 --> 00:04:23,530 the more the more often you have to copy 101 00:04:21,700 --> 00:04:26,740 data the more often you're going to 102 00:04:23,530 --> 00:04:29,409 encounter that overhead stuff and that 103 00:04:26,740 --> 00:04:31,689 by maximizing the 104 00:04:29,409 --> 00:04:33,459 you copy at once you minimize the number 105 00:04:31,689 --> 00:04:36,819 of copies that you have to make so you 106 00:04:33,459 --> 00:04:39,519 encountered that overhead laughs and so 107 00:04:36,819 --> 00:04:43,300 my solution to this was buffer copies 108 00:04:39,519 --> 00:04:46,689 and effectively as long as you don't 109 00:04:43,300 --> 00:04:49,869 need as long as the day isn't 110 00:04:46,689 --> 00:04:51,550 time-sensitive and basically hold the 111 00:04:49,869 --> 00:04:54,459 back of it and then try and send over as 112 00:04:51,550 --> 00:04:56,860 much at once and with time sensitive 113 00:04:54,459 --> 00:04:58,479 data and you would just have a timeout 114 00:04:56,860 --> 00:05:00,610 and but you would actually have that 115 00:04:58,479 --> 00:05:02,438 with all data because you wouldn't just 116 00:05:00,610 --> 00:05:03,909 be like I'm gonna wait until I get 117 00:05:02,439 --> 00:05:05,279 another packet so I can finish off this 118 00:05:03,909 --> 00:05:10,899 buffer and you could just be waiting 119 00:05:05,279 --> 00:05:14,379 indefinitely that depends and so have I 120 00:05:10,899 --> 00:05:16,449 implemented this I have a mock-up 121 00:05:14,379 --> 00:05:18,669 intrusion detection system nowhere near 122 00:05:16,449 --> 00:05:20,769 advanced as like snort or sir cat or 123 00:05:18,669 --> 00:05:27,008 anything you get it's more just to prove 124 00:05:20,769 --> 00:05:28,719 the things I've 130 so I basically every 125 00:05:27,009 --> 00:05:30,189 time I capture packets I'm loading them 126 00:05:28,719 --> 00:05:31,748 into a buffer and I'm copying that 127 00:05:30,189 --> 00:05:33,399 buffer over and once it reaches a 128 00:05:31,749 --> 00:05:38,050 sufficient size or a certain time I feed 129 00:05:33,399 --> 00:05:40,119 it and reach so second part this sort to 130 00:05:38,050 --> 00:05:44,679 parts list they're kind of similar I am 131 00:05:40,119 --> 00:05:47,439 this is minimizing so similarly as to 132 00:05:44,679 --> 00:05:49,688 maximizing when the copies introduce 133 00:05:47,439 --> 00:05:51,189 overhead the more data that needs to be 134 00:05:49,689 --> 00:05:53,079 copied the longer it takes or I should 135 00:05:51,189 --> 00:05:55,209 be maybe save more bytes it takes to 136 00:05:53,079 --> 00:05:59,319 represent that data the longer it takes 137 00:05:55,209 --> 00:06:02,769 I'm so if you can represent the same 138 00:05:59,319 --> 00:06:08,259 information in less bytes it's usually 139 00:06:02,769 --> 00:06:10,629 going to be faster um so my solution for 140 00:06:08,259 --> 00:06:12,939 that one was what I've called bit 141 00:06:10,629 --> 00:06:14,769 indexing there's probably another term 142 00:06:12,939 --> 00:06:16,419 for it it's a simple enough idea there's 143 00:06:14,769 --> 00:06:17,619 probably been done before I just could 144 00:06:16,419 --> 00:06:20,289 not find anything on it so I've just 145 00:06:17,619 --> 00:06:24,069 come to that and the other one is 146 00:06:20,289 --> 00:06:26,110 basically minimizing the members and so 147 00:06:24,069 --> 00:06:27,579 if you're cut copying over objects and 148 00:06:26,110 --> 00:06:31,329 you've got lots of unnecessary members 149 00:06:27,579 --> 00:06:33,039 you're you're just wasting time and if 150 00:06:31,329 --> 00:06:38,229 you don't use those except for debugging 151 00:06:33,039 --> 00:06:40,959 do not put them in the final person okay 152 00:06:38,229 --> 00:06:43,020 so these are some more like just general 153 00:06:40,959 --> 00:06:45,150 summary of optimizations so 154 00:06:43,020 --> 00:06:47,250 there's texture memory shared memory and 155 00:06:45,150 --> 00:06:50,159 global memory they were on that first 156 00:06:47,250 --> 00:06:52,530 flight of your some terms might cover it 157 00:06:50,160 --> 00:06:55,410 in case you didn't catch that the extra 158 00:06:52,530 --> 00:06:59,039 memory is effectively a cache for 159 00:06:55,410 --> 00:07:01,560 texture data and 64 kilobytes it's 160 00:06:59,040 --> 00:07:03,510 fairly small if you can fit your data in 161 00:07:01,560 --> 00:07:07,680 that you've got complete control over 162 00:07:03,510 --> 00:07:09,330 that cache I'm I'm sorry that can lead 163 00:07:07,680 --> 00:07:11,430 to some massive performance improvements 164 00:07:09,330 --> 00:07:12,780 in your application especially in the 165 00:07:11,430 --> 00:07:14,850 area of string searching where you might 166 00:07:12,780 --> 00:07:17,489 have thousands of patterns to search 167 00:07:14,850 --> 00:07:21,200 there and shared memory is kind of 168 00:07:17,490 --> 00:07:25,020 similar except it is the level 1 cache 169 00:07:21,200 --> 00:07:28,229 off the GPU and effectively you fence 170 00:07:25,020 --> 00:07:30,030 off a section of it you say hey that 32 171 00:07:28,230 --> 00:07:33,210 kilobytes that's mine you don't care to 172 00:07:30,030 --> 00:07:34,950 take that I get stick tailor and then 173 00:07:33,210 --> 00:07:38,010 this global which is just like the GPU 174 00:07:34,950 --> 00:07:39,930 equivalent of RAM and still fairly fast 175 00:07:38,010 --> 00:07:42,770 but not as fast as l2 methods and 176 00:07:39,930 --> 00:07:46,140 buffering kudamon copies cover that I'm 177 00:07:42,770 --> 00:07:48,060 minimizing maximizing data one that's 178 00:07:46,140 --> 00:07:49,530 often missed out at least from the 179 00:07:48,060 --> 00:07:50,880 research that I've done is using 180 00:07:49,530 --> 00:07:53,369 multiple GPUs if they're available 181 00:07:50,880 --> 00:07:55,350 certainly in the CUDA standard is really 182 00:07:53,370 --> 00:07:58,170 really easy to make use of multiple GPUs 183 00:07:55,350 --> 00:08:01,820 lots of applications just fail to do 184 00:07:58,170 --> 00:08:04,170 that and they sync all the things and 185 00:08:01,820 --> 00:08:06,120 asynchronous tasks have existed for 186 00:08:04,170 --> 00:08:10,320 quite a while now had the support 187 00:08:06,120 --> 00:08:13,890 columns trade goods and it's often not 188 00:08:10,320 --> 00:08:16,740 used use your threads so although you'll 189 00:08:13,890 --> 00:08:18,120 be doing a lot on the GPU it doesn't 190 00:08:16,740 --> 00:08:20,370 mean you can just go off single third 191 00:08:18,120 --> 00:08:23,400 task I will literally run everything on 192 00:08:20,370 --> 00:08:25,860 one thread I've seen that in a lot of 193 00:08:23,400 --> 00:08:28,229 applications as well and if you can work 194 00:08:25,860 --> 00:08:30,150 out how I've not been able to so far a 195 00:08:28,230 --> 00:08:32,250 single large kernels it seemed like a 196 00:08:30,150 --> 00:08:34,020 really great idea the idea is 197 00:08:32,250 --> 00:08:35,789 effectively that you launch the function 198 00:08:34,020 --> 00:08:37,679 once and you just keep copying data to 199 00:08:35,789 --> 00:08:38,280 it instead of relaunching function with 200 00:08:37,679 --> 00:08:41,069 new data 201 00:08:38,280 --> 00:08:43,650 so you're minimizing that little kernel 202 00:08:41,070 --> 00:08:45,330 launch overhead and there's another one 203 00:08:43,650 --> 00:08:46,800 as well that I only just remembered like 204 00:08:45,330 --> 00:08:49,470 two minutes before this talk which is 205 00:08:46,800 --> 00:08:50,609 memory padding so if you've got the 206 00:08:49,470 --> 00:08:53,640 space to allow it 207 00:08:50,610 --> 00:08:56,220 you should pad your data so it basically 208 00:08:53,640 --> 00:08:56,819 lines in memory trunks so when you fetch 209 00:08:56,220 --> 00:08:59,430 that 210 00:08:56,820 --> 00:09:01,170 and you're not having to get another 211 00:08:59,430 --> 00:09:03,300 chunk with her and then get another 212 00:09:01,170 --> 00:09:05,280 trunk you cannot try and make it so it's 213 00:09:03,300 --> 00:09:10,020 just one request rather than three or 214 00:09:05,280 --> 00:09:12,089 more and so applying what I know so I 215 00:09:10,020 --> 00:09:13,680 felt there's better to start off with 216 00:09:12,090 --> 00:09:16,920 quick representation of what Apple 217 00:09:13,680 --> 00:09:19,739 Korsak is because although it's fairly 218 00:09:16,920 --> 00:09:21,300 common in multi string managing it's not 219 00:09:19,740 --> 00:09:22,410 one I'd encountered before before 220 00:09:21,300 --> 00:09:24,569 starting this so I figured there might 221 00:09:22,410 --> 00:09:25,589 be other people and encountered it not 222 00:09:24,570 --> 00:09:27,300 really going to talk through this too 223 00:09:25,590 --> 00:09:29,790 much because I think the visualization 224 00:09:27,300 --> 00:09:32,640 speaks for itself bugs get to it so we 225 00:09:29,790 --> 00:09:34,530 start off with our patterns and we 226 00:09:32,640 --> 00:09:35,730 affect we build a tree out of them I'm 227 00:09:34,530 --> 00:09:37,199 just basically to click through this 228 00:09:35,730 --> 00:09:40,470 kind of see what it's doing 229 00:09:37,200 --> 00:09:42,120 I'll just know that at this point I gave 230 00:09:40,470 --> 00:09:44,340 up because I was doing this on my laptop 231 00:09:42,120 --> 00:09:48,060 with a touchpad and it was just in the 232 00:09:44,340 --> 00:09:49,590 arch to build this diagram except once 233 00:09:48,060 --> 00:09:53,579 and then when you're searching for a 234 00:09:49,590 --> 00:09:56,760 pattern just works like that and so 235 00:09:53,580 --> 00:09:58,590 that's how you would represent it the 236 00:09:56,760 --> 00:10:02,400 CPU with nodes and pointers and all 237 00:09:58,590 --> 00:10:03,990 those lovely luxuries and on the GPU 238 00:10:02,400 --> 00:10:07,079 you've not really got the luxury of all 239 00:10:03,990 --> 00:10:09,510 that and particularly my case all that 240 00:10:07,080 --> 00:10:12,330 it's a single dimensional array and I've 241 00:10:09,510 --> 00:10:15,330 got to represent that sometime and so 242 00:10:12,330 --> 00:10:16,980 enter the head-fuck algorithm and it's 243 00:10:15,330 --> 00:10:19,740 based on Apple Korsak it's just an 244 00:10:16,980 --> 00:10:23,460 extension that effectively compresses 245 00:10:19,740 --> 00:10:25,790 some things and so same principles root 246 00:10:23,460 --> 00:10:28,950 node n nodes and patterns are stored 247 00:10:25,790 --> 00:10:31,410 alphabetically and the key part here 248 00:10:28,950 --> 00:10:33,870 which you will help in the next slide is 249 00:10:31,410 --> 00:10:37,350 the notes are made up of an integer 250 00:10:33,870 --> 00:10:41,580 offset which is the location of the 251 00:10:37,350 --> 00:10:45,330 first child both handles and and an 252 00:10:41,580 --> 00:10:48,840 8-month array of 32 bit bitmaps and that 253 00:10:45,330 --> 00:10:50,760 those bitmaps collectively represent all 254 00:10:48,840 --> 00:10:53,420 of the ASCII letters and we'll see why 255 00:10:50,760 --> 00:10:55,590 that matters again on the next slide and 256 00:10:53,420 --> 00:10:59,280 there's a population count 257 00:10:55,590 --> 00:11:02,070 fun and which allows you to basically 258 00:10:59,280 --> 00:11:05,220 take the number of children find the 259 00:11:02,070 --> 00:11:09,939 location of a child and also check 260 00:11:05,220 --> 00:11:13,599 whether it exists and or whether 261 00:11:09,939 --> 00:11:14,978 exist in a given order so this is what 262 00:11:13,599 --> 00:11:18,189 that looks like 263 00:11:14,979 --> 00:11:21,039 I've tried again to explain this as best 264 00:11:18,189 --> 00:11:24,098 as I can but it's this this one took me 265 00:11:21,039 --> 00:11:29,139 quite a long time to get right and so 266 00:11:24,099 --> 00:11:34,629 for the example of pattern a CD or AC 267 00:11:29,139 --> 00:11:38,739 DVD and you would start off with your 268 00:11:34,629 --> 00:11:41,589 written or 0 and you look for the 269 00:11:38,739 --> 00:11:43,779 population count up to a is the first 270 00:11:41,589 --> 00:11:46,299 one in that so the population count up 271 00:11:43,779 --> 00:11:47,739 to a is zero and then you add the offset 272 00:11:46,299 --> 00:11:50,529 which is the location of the first child 273 00:11:47,739 --> 00:11:52,509 and that takes you down to node 1 and 274 00:11:50,529 --> 00:11:54,759 you basically can repeat this process 275 00:11:52,509 --> 00:11:58,329 and I've got the calculations down the 276 00:11:54,759 --> 00:12:00,879 sights because I felt that the address 277 00:11:58,329 --> 00:12:02,919 themselves will justify or explain well 278 00:12:00,879 --> 00:12:04,239 not entirely what was going on but you 279 00:12:02,919 --> 00:12:06,249 can effectively repeat that same process 280 00:12:04,239 --> 00:12:09,279 to work your way down through the tree 281 00:12:06,249 --> 00:12:10,720 and it's worth noting that this 282 00:12:09,279 --> 00:12:12,909 implementation of it is the 283 00:12:10,720 --> 00:12:15,819 implementation where you're only looking 284 00:12:12,909 --> 00:12:18,399 for the index of that pattern not 285 00:12:15,819 --> 00:12:20,289 specifically what pattern is if you want 286 00:12:18,399 --> 00:12:22,209 to know what the pattern was instead of 287 00:12:20,289 --> 00:12:24,939 having one final note you would have a 288 00:12:22,209 --> 00:12:27,598 final node for each branch of the tree I 289 00:12:24,939 --> 00:12:31,089 am which I'm instead of it just being 290 00:12:27,599 --> 00:12:33,659 there was a match that final note would 291 00:12:31,089 --> 00:12:37,239 basically represent what the pattern was 292 00:12:33,659 --> 00:12:38,619 mmm so previous slide you might notice 293 00:12:37,239 --> 00:12:41,229 that there's d EF and there's also 294 00:12:38,619 --> 00:12:43,509 another d EF right at the bottom those 295 00:12:41,229 --> 00:12:45,720 nodes can actually be merged merging is 296 00:12:43,509 --> 00:12:48,009 a bit more complicated than it looks 297 00:12:45,720 --> 00:12:50,559 looks fairly simple and that's example 298 00:12:48,009 --> 00:12:52,629 but it's pretty awful that I've got 200 299 00:12:50,559 --> 00:12:56,679 line function that's not pretty at all 300 00:12:52,629 --> 00:12:58,929 and the key points though is virtual 301 00:12:56,679 --> 00:13:02,019 nodes share the same children merging is 302 00:12:58,929 --> 00:13:04,899 done in reverse order to maintain the 303 00:13:02,019 --> 00:13:06,759 alphabetical requirements and the 304 00:13:04,899 --> 00:13:09,129 alphabetical requirement is so that the 305 00:13:06,759 --> 00:13:13,720 population count method that will work 306 00:13:09,129 --> 00:13:17,470 fine finding the children and this one 307 00:13:13,720 --> 00:13:18,909 here is that indexing again there's 308 00:13:17,470 --> 00:13:22,059 probably another name for it it's just 309 00:13:18,909 --> 00:13:23,680 I've not encountered it before and it 310 00:13:22,059 --> 00:13:29,620 effectively allows you to 311 00:13:23,680 --> 00:13:34,770 represents a single index in one bit 312 00:13:29,620 --> 00:13:38,050 rather than in one bite and so I've 313 00:13:34,770 --> 00:13:40,990 tried to give it a decent example and 314 00:13:38,050 --> 00:13:43,000 this string a string of data along the 315 00:13:40,990 --> 00:13:45,610 top so random data hello world 316 00:13:43,000 --> 00:13:47,410 exclamation mark and wouldn't be broken 317 00:13:45,610 --> 00:13:48,940 up like this it's just broken up so that 318 00:13:47,410 --> 00:13:53,469 you can visualize better what's going on 319 00:13:48,940 --> 00:13:55,660 and there's a little bit of bitwise 320 00:13:53,470 --> 00:13:58,420 magic going on in there it's effectively 321 00:13:55,660 --> 00:14:01,420 saying which which for a would this be 322 00:13:58,420 --> 00:14:04,589 located in and so for that first one the 323 00:14:01,420 --> 00:14:07,360 arrays you know and and then it's saying 324 00:14:04,589 --> 00:14:10,390 which bit should I set according to the 325 00:14:07,360 --> 00:14:13,200 index and that one there is effectively 326 00:14:10,390 --> 00:14:17,439 just shifting the bit across by the 327 00:14:13,200 --> 00:14:20,080 index relative to that ring so the real 328 00:14:17,440 --> 00:14:22,720 index is seven but the way that we 329 00:14:20,080 --> 00:14:24,970 interpret it would be like seven on this 330 00:14:22,720 --> 00:14:28,209 side and hopefully that's an OK 331 00:14:24,970 --> 00:14:29,440 explanation but again it's this one to 332 00:14:28,209 --> 00:14:33,760 me a lot playing about let's get right 333 00:14:29,440 --> 00:14:37,360 and so the one thing this is if you use 334 00:14:33,760 --> 00:14:39,880 a byte to represent this bit index you 335 00:14:37,360 --> 00:14:42,040 also get 8 bits in a byte this went for 336 00:14:39,880 --> 00:14:43,750 a short you get 16 and went for an 337 00:14:42,040 --> 00:14:46,569 integer or any other authority that 338 00:14:43,750 --> 00:14:48,970 types 32 bits and the nice thing about 339 00:14:46,570 --> 00:14:50,470 that though is you can actually check to 340 00:14:48,970 --> 00:14:52,600 see if you've got any matches in any of 341 00:14:50,470 --> 00:14:55,209 those 32 bits when you get your results 342 00:14:52,600 --> 00:14:56,980 back by just comparing it to 0 so if 343 00:14:55,209 --> 00:14:59,380 it's 0 you didn't get any matches 344 00:14:56,980 --> 00:15:01,720 because none of those bits are set that 345 00:14:59,380 --> 00:15:05,260 if any little bits are set that is 346 00:15:01,720 --> 00:15:09,400 hopefully 0 and so implementation 347 00:15:05,260 --> 00:15:12,339 experimentation etc and digital 348 00:15:09,400 --> 00:15:15,339 forensics file carving immensely data 349 00:15:12,339 --> 00:15:19,570 intensive and when you're searching for 350 00:15:15,339 --> 00:15:22,329 file headers a basically a chunk of data 351 00:15:19,570 --> 00:15:23,740 will be copied to the GPU you would 352 00:15:22,329 --> 00:15:25,630 search that and then a result array 353 00:15:23,740 --> 00:15:28,839 typically of the same size would be 354 00:15:25,630 --> 00:15:30,670 copied back so if you're copying 2 gigs 355 00:15:28,839 --> 00:15:34,890 of data to the GPU for searching you 356 00:15:30,670 --> 00:15:34,890 would be copying two gigs back and 357 00:15:35,160 --> 00:15:38,610 that'll take a bit time 358 00:15:39,240 --> 00:15:45,960 so for Matt that's assuming as well 359 00:15:43,170 --> 00:15:50,020 that's assuming as well but it's a 360 00:15:45,960 --> 00:15:51,580 single bite a data structure that you're 361 00:15:50,020 --> 00:15:54,970 using to represent results if you're 362 00:15:51,580 --> 00:15:57,070 wanting to get the index for example of 363 00:15:54,970 --> 00:16:00,970 the results and you were using an 364 00:15:57,070 --> 00:16:04,960 integer to do that then that would be 365 00:16:00,970 --> 00:16:06,460 the same size to represent that and by 366 00:16:04,960 --> 00:16:10,150 comparison if you were just getting the 367 00:16:06,460 --> 00:16:13,150 index you can use a single byte or a 368 00:16:10,150 --> 00:16:16,510 single integer depending on what you're 369 00:16:13,150 --> 00:16:19,439 doing and and that would be good for in 370 00:16:16,510 --> 00:16:22,450 the case of the byte eight locations 371 00:16:19,440 --> 00:16:24,340 okay so in that example there if you're 372 00:16:22,450 --> 00:16:26,530 offloading identification to CPU and 373 00:16:24,340 --> 00:16:29,200 using bit indexing instead of a two 374 00:16:26,530 --> 00:16:31,689 gigabyte result or a about 0.25 Gekas 375 00:16:29,200 --> 00:16:35,500 all right and alternatively you could 376 00:16:31,690 --> 00:16:38,350 use that in addition for fast access and 377 00:16:35,500 --> 00:16:41,350 I thought I'd pick up a bit cord but 378 00:16:38,350 --> 00:16:44,740 it's pretty disgusting so again don't 379 00:16:41,350 --> 00:16:46,570 worry if you don't get em but it's the 380 00:16:44,740 --> 00:16:48,130 point this called down the bottom is 381 00:16:46,570 --> 00:16:51,370 effectively doing the same as the one 382 00:16:48,130 --> 00:16:54,670 above or has the same outcome just that 383 00:16:51,370 --> 00:16:57,220 the one above can be significantly 384 00:16:54,670 --> 00:16:58,870 faster and so around some benchmarks 385 00:16:57,220 --> 00:17:00,460 last night because I didn't wanna come 386 00:16:58,870 --> 00:17:03,610 here with like this is faster honest 387 00:17:00,460 --> 00:17:05,620 just trust me and so this is Titan in 388 00:17:03,610 --> 00:17:08,470 microseconds I deal with text one tiny 389 00:17:05,619 --> 00:17:11,589 little bit idea and I was just like 390 00:17:08,470 --> 00:17:13,720 bashing together some benchmarks to test 391 00:17:11,589 --> 00:17:16,780 with so that's only one mega theater 392 00:17:13,720 --> 00:17:20,110 that was testing it against a bar 393 00:17:16,780 --> 00:17:23,680 I've got comparisons of the indexing 394 00:17:20,109 --> 00:17:27,040 versus boolean indexing a for every 395 00:17:23,680 --> 00:17:29,110 single locations match and one in 20 396 00:17:27,040 --> 00:17:30,940 locations is about one in 50 and you can 397 00:17:29,110 --> 00:17:33,070 see for yourself the time differences 398 00:17:30,940 --> 00:17:35,020 the only one that didn't perform as well 399 00:17:33,070 --> 00:17:35,860 in is when all locations were match and 400 00:17:35,020 --> 00:17:41,020 that would just be the extra overhead 401 00:17:35,860 --> 00:17:43,540 because of the multiplicity M so future 402 00:17:41,020 --> 00:17:44,590 work for that particular section working 403 00:17:43,540 --> 00:17:46,990 out with there's no other name for it 404 00:17:44,590 --> 00:17:48,879 again it probably is a testing on much 405 00:17:46,990 --> 00:17:51,290 larger data again that was only one 406 00:17:48,880 --> 00:17:54,170 megabyte I was testing with I would 407 00:17:51,290 --> 00:17:57,710 this up 200 some eggs and possibly even 408 00:17:54,170 --> 00:17:58,940 gigabytes and determine if it's faster 409 00:17:57,710 --> 00:17:59,840 to separate the indexing and 410 00:17:58,940 --> 00:18:01,370 identification 411 00:17:59,840 --> 00:18:05,480 that's something I've not yet tested I 412 00:18:01,370 --> 00:18:07,070 walk into that I'm testing more and 413 00:18:05,480 --> 00:18:09,350 fewer bytes for the storage methods of 414 00:18:07,070 --> 00:18:11,629 that particular method I was using was a 415 00:18:09,350 --> 00:18:14,540 integer so it was testing 32 locations 416 00:18:11,630 --> 00:18:17,210 at once and it'd be interesting to test 417 00:18:14,540 --> 00:18:19,010 a 64-bit data structure of see if it 418 00:18:17,210 --> 00:18:21,800 sped up or if that would cause 419 00:18:19,010 --> 00:18:24,910 additional slowdowns because they like 420 00:18:21,800 --> 00:18:27,649 more results to check through um and 421 00:18:24,910 --> 00:18:30,620 then I think it'd be faster than running 422 00:18:27,650 --> 00:18:33,020 to searches because the assuming you 423 00:18:30,620 --> 00:18:34,850 were able to copy over more data that's 424 00:18:33,020 --> 00:18:37,840 the GP on that initial search because 425 00:18:34,850 --> 00:18:39,889 the space saved with the result rate and 426 00:18:37,840 --> 00:18:43,240 by again I need to test that 427 00:18:39,890 --> 00:18:45,710 so clearly theory at the moment and so 428 00:18:43,240 --> 00:18:48,560 bit about specifically what I'm doing 429 00:18:45,710 --> 00:18:50,120 the my dissertation is focused on 430 00:18:48,560 --> 00:18:51,860 network intrusion detection system 431 00:18:50,120 --> 00:18:54,500 optimization and I want to make 432 00:18:51,860 --> 00:18:57,320 something more so general because as I 433 00:18:54,500 --> 00:19:01,660 said if this problem applies to a lot of 434 00:18:57,320 --> 00:19:04,040 things and so I'm actually developing a 435 00:19:01,660 --> 00:19:05,420 head fact library so it's an 436 00:19:04,040 --> 00:19:09,020 implementation of the algorithms that 437 00:19:05,420 --> 00:19:11,240 I've just discussed and sort of general 438 00:19:09,020 --> 00:19:15,379 purposes its use I'm just gonna say it's 439 00:19:11,240 --> 00:19:17,600 really prototype and there's probably 440 00:19:15,380 --> 00:19:19,550 things in it that get done better but I 441 00:19:17,600 --> 00:19:21,790 like to think that the concepts are 442 00:19:19,550 --> 00:19:25,100 there that can be built upon and 443 00:19:21,790 --> 00:19:27,020 head-fuck algorithm it's not mine and is 444 00:19:25,100 --> 00:19:28,159 described in that paper there so if 445 00:19:27,020 --> 00:19:30,560 you're interested in the algorithm 446 00:19:28,160 --> 00:19:32,510 itself check that out the only that I've 447 00:19:30,560 --> 00:19:36,379 added to it is the bit indexing and the 448 00:19:32,510 --> 00:19:38,930 other optimization methods and so I also 449 00:19:36,380 --> 00:19:40,460 have to show some benchmark data from 450 00:19:38,930 --> 00:19:44,780 the strength searching algorithm that I 451 00:19:40,460 --> 00:19:46,280 was using and this is extremely old 452 00:19:44,780 --> 00:19:51,129 extremely unoptimized this was my 453 00:19:46,280 --> 00:19:53,570 feasibility demo last year and I've not 454 00:19:51,130 --> 00:19:55,490 done anything on that one since and 455 00:19:53,570 --> 00:19:59,149 because I've been working I the ideas 456 00:19:55,490 --> 00:20:01,670 but all my final results will happen 457 00:19:59,150 --> 00:20:03,980 website once we've done and as well as 458 00:20:01,670 --> 00:20:05,300 all the research that most of this talk 459 00:20:03,980 --> 00:20:08,960 is based on is already on the 460 00:20:05,300 --> 00:20:11,450 and so some important things to know if 461 00:20:08,960 --> 00:20:13,370 it's prey fast but it's not as fast as 462 00:20:11,450 --> 00:20:15,680 it could be because there's no texture 463 00:20:13,370 --> 00:20:18,050 memory in use there's no branch 464 00:20:15,680 --> 00:20:21,620 minimization this is like this was bad 465 00:20:18,050 --> 00:20:24,740 and lots and lots and lots of excessive 466 00:20:21,620 --> 00:20:27,229 copying needless members in each node 467 00:20:24,740 --> 00:20:30,740 I'd so many debug member cells like that 468 00:20:27,230 --> 00:20:32,600 I just need results and but again I just 469 00:20:30,740 --> 00:20:34,640 want to show that I'm not just speaking 470 00:20:32,600 --> 00:20:38,659 entirely theory and I have actually done 471 00:20:34,640 --> 00:20:41,750 some things and point if not the time 472 00:20:38,660 --> 00:20:43,820 there includes the memory copy time 473 00:20:41,750 --> 00:20:45,050 whereas a lot results just glaze over 474 00:20:43,820 --> 00:20:46,790 that entirely and go 475 00:20:45,050 --> 00:20:48,470 yeah the processing times fast we've got 476 00:20:46,790 --> 00:20:51,080 27 gigabytes per second we're going to 477 00:20:48,470 --> 00:20:52,970 completely ignore the fact that we 478 00:20:51,080 --> 00:20:55,399 needed to copy all idea to the GPU to 479 00:20:52,970 --> 00:20:57,470 actually do it and so I think it's 480 00:20:55,400 --> 00:21:00,020 better to represent the results in a 481 00:20:57,470 --> 00:21:03,820 realistic fashion like that's it with 482 00:21:00,020 --> 00:21:07,580 the memory copy time and so this is my 483 00:21:03,820 --> 00:21:10,939 final design for now and I've got the 484 00:21:07,580 --> 00:21:12,620 bit indexing for the return type packet 485 00:21:10,940 --> 00:21:14,870 identification offloaded to CPU 486 00:21:12,620 --> 00:21:17,120 separately again I need to test furthest 487 00:21:14,870 --> 00:21:19,429 actually faster I've got my buffer 488 00:21:17,120 --> 00:21:22,610 system and for copying the packets to 489 00:21:19,430 --> 00:21:24,530 the GPU I'm using a head pack algorithm 490 00:21:22,610 --> 00:21:26,149 to build my tree for failure States 491 00:21:24,530 --> 00:21:29,570 which is copied to texture memory which 492 00:21:26,150 --> 00:21:33,560 is then compared and that's pretty much 493 00:21:29,570 --> 00:21:48,110 it at the moment I am but there could be 494 00:21:33,560 --> 00:21:49,700 more added and any questions programmers 495 00:21:48,110 --> 00:21:51,669 who designed things like there are 496 00:21:49,700 --> 00:21:53,500 institutions this is the tank it says 497 00:21:51,670 --> 00:21:55,120 they're not really going to want to use 498 00:21:53,500 --> 00:21:57,730 that they're going to want to use an 499 00:21:55,120 --> 00:22:02,459 equal sign in the color of the plant and 500 00:21:57,730 --> 00:22:07,750 had much of that hitting from from 501 00:22:02,460 --> 00:22:10,059 German programmers so and if I that way 502 00:22:07,750 --> 00:22:12,220 yeah so I go back through that the 503 00:22:10,059 --> 00:22:13,350 contrast and that's really bad and if 504 00:22:12,220 --> 00:22:16,120 I'd known it was gonna be slow the bear 505 00:22:13,350 --> 00:22:18,820 are that bad I would have done it with 506 00:22:16,120 --> 00:22:21,850 light theme and I've effectively go down 507 00:22:18,820 --> 00:22:23,710 to this is the source that you want this 508 00:22:21,850 --> 00:22:28,889 is what you want to search against go 509 00:22:23,710 --> 00:22:31,960 and it's fully I've made it so that the 510 00:22:28,890 --> 00:22:33,700 the default implementation for one 511 00:22:31,960 --> 00:22:37,360 result found it's not implemented you 512 00:22:33,700 --> 00:22:38,980 have to override that damn so if I'm 513 00:22:37,360 --> 00:22:40,899 trying to make it so it'd be easy enough 514 00:22:38,980 --> 00:22:43,470 for anyone just to pick up and go okay 515 00:22:40,900 --> 00:22:47,679 on result phones we won't do this and 516 00:22:43,470 --> 00:22:49,600 again I'm a student not developer so I'm 517 00:22:47,679 --> 00:22:51,730 kind of just guessing as to what I 518 00:22:49,600 --> 00:22:52,689 should do rather than knowing exactly 519 00:22:51,730 --> 00:22:55,510 what I should do 520 00:22:52,690 --> 00:22:58,480 I'm so I'm trying to build an interface 521 00:22:55,510 --> 00:23:01,620 that's easy to use but again it's just 522 00:22:58,480 --> 00:23:01,620 guesswork for me at the moment 523 00:23:08,580 --> 00:23:11,629 [Music] 524 00:23:18,419 --> 00:23:23,890 so really what I would if I had the time 525 00:23:22,240 --> 00:23:27,040 I would have really liked have 526 00:23:23,890 --> 00:23:31,120 integrated this and snort or Sorocaba 527 00:23:27,040 --> 00:23:34,780 and the circuit CUDA code was just too 528 00:23:31,120 --> 00:23:38,889 broken for me to touch em and I found 529 00:23:34,780 --> 00:23:43,450 the phone smart really hard to read 530 00:23:38,890 --> 00:23:46,090 I'm just because of the sheer scale of 531 00:23:43,450 --> 00:23:50,049 it I think if I was if I was doing this 532 00:23:46,090 --> 00:23:52,649 as a thesis absolutely I would have no 533 00:23:50,049 --> 00:23:55,418 problem implementing it I just think I'm 534 00:23:52,650 --> 00:23:58,870 but we're being an honest project is 535 00:23:55,419 --> 00:24:00,669 didn't have the time to do such a wide 536 00:23:58,870 --> 00:24:04,510 scope is actually even things on that 537 00:24:00,669 --> 00:24:05,320 I've had to just go I originally had DP 538 00:24:04,510 --> 00:24:07,450 DK 539 00:24:05,320 --> 00:24:09,939 let me tap on PF ring as like all those 540 00:24:07,450 --> 00:24:12,250 collection methods call it and but they 541 00:24:09,940 --> 00:24:15,669 just simply was not time to implement 542 00:24:12,250 --> 00:24:17,230 all of those and so that is the 543 00:24:15,669 --> 00:24:25,769 collection effort for that is just love 544 00:24:17,230 --> 00:24:25,769 pcap and yeah okay