1 00:00:02,530 --> 00:00:10,840 so hello my name is Helena and I will 2 00:00:07,600 --> 00:00:12,518 not do many interruptions like so I do 3 00:00:10,840 --> 00:00:14,440 cyber security and forensics at 4 00:00:12,519 --> 00:00:17,470 Edinburgh Napier University I am 5 00:00:14,440 --> 00:00:20,470 currently doing my well pop of my third 6 00:00:17,470 --> 00:00:22,770 year in a financial institution during 7 00:00:20,470 --> 00:00:29,109 placement I'm going to tell you they are 8 00:00:22,770 --> 00:00:32,800 often no I used to be part of the any 9 00:00:29,109 --> 00:00:37,630 SEC committee any second is the rival to 10 00:00:32,800 --> 00:00:44,110 average a hackers over nature you should 11 00:00:37,630 --> 00:00:46,180 join I also do magic tricks and I am a 12 00:00:44,110 --> 00:00:48,870 TEDx organizer very happy to talk about 13 00:00:46,180 --> 00:00:51,550 either of those things after the talk 14 00:00:48,870 --> 00:00:53,440 the only reason that I'm doing this talk 15 00:00:51,550 --> 00:00:55,989 is because I want to get more Twitter 16 00:00:53,440 --> 00:00:59,050 followers than my dad and currently 17 00:00:55,989 --> 00:01:03,459 about 450 people off so if you can 18 00:00:59,050 --> 00:01:06,429 please follow me there that's me please 19 00:01:03,460 --> 00:01:11,229 I really really need it like my I 20 00:01:06,430 --> 00:01:13,420 thought about going on alright so a 21 00:01:11,229 --> 00:01:16,390 short disclaimer and I will take my card 22 00:01:13,420 --> 00:01:18,640 south I am woefully under qualified to 23 00:01:16,390 --> 00:01:21,040 get this hope I have not done machine 24 00:01:18,640 --> 00:01:23,170 learning in an academic environment ever 25 00:01:21,040 --> 00:01:27,280 and I didn't think I ever level unless I 26 00:01:23,170 --> 00:01:29,560 do a masters which is possible I did 27 00:01:27,280 --> 00:01:31,540 machine learning for the first time as a 28 00:01:29,560 --> 00:01:34,360 project network it was given to me and 29 00:01:31,540 --> 00:01:37,030 this tool aims to be the talk that I 30 00:01:34,360 --> 00:01:39,700 wish I had seen before I actually start 31 00:01:37,030 --> 00:01:41,380 to project because I found it was a 32 00:01:39,700 --> 00:01:45,009 little bit of an uphill struggle 33 00:01:41,380 --> 00:01:46,810 especially at the start if you have done 34 00:01:45,009 --> 00:01:48,909 machine learning at work or university 35 00:01:46,810 --> 00:01:50,920 go away you're not going to learn 36 00:01:48,909 --> 00:01:54,130 anything here this is very much an 37 00:01:50,920 --> 00:01:57,640 introduction you might see a whole 38 00:01:54,130 --> 00:02:01,420 satara sniffs but that's basically all 39 00:01:57,640 --> 00:02:03,780 you're gonna learn so yeah I won't feel 40 00:02:01,420 --> 00:02:05,310 offended if you do get up and leave so 41 00:02:03,780 --> 00:02:09,459 [Music] 42 00:02:05,310 --> 00:02:11,830 oh right yes so this is a Twitter poll I 43 00:02:09,459 --> 00:02:15,340 got is anybody that voted in this 44 00:02:11,830 --> 00:02:17,830 Twitter poll here like Ford 45 00:02:15,340 --> 00:02:19,750 okay well I guess so this is your hope 46 00:02:17,830 --> 00:02:23,280 so we're gonna do an interactive magic 47 00:02:19,750 --> 00:02:23,280 trick I need a volunteer 48 00:02:24,150 --> 00:02:39,390 okey what's your name meat let the 49 00:02:47,610 --> 00:02:53,850 mortgagor done it doesn't matter 50 00:02:50,170 --> 00:02:53,850 everybody's gonna do the trick anyways 51 00:02:54,660 --> 00:03:00,190 right so what I would you to do is to 52 00:02:58,720 --> 00:03:01,000 put your hands up with this in front of 53 00:03:00,190 --> 00:03:03,160 you 54 00:03:01,000 --> 00:03:04,900 brilliant and now I would like everybody 55 00:03:03,160 --> 00:03:05,859 else in the room to do exactly the same 56 00:03:04,900 --> 00:03:08,319 as chump 57 00:03:05,860 --> 00:03:09,790 if anybody isn't doing it you can put 58 00:03:08,319 --> 00:03:14,380 down your coat it's fine this one never 59 00:03:09,790 --> 00:03:16,739 mind oh I'm sorry you're really injured 60 00:03:14,380 --> 00:03:16,739 okay 61 00:03:21,650 --> 00:03:30,620 if you're injured please don't do this 62 00:03:25,470 --> 00:03:33,660 coking this unlike this and a line go 63 00:03:30,620 --> 00:03:36,450 okay so now when I count to three what I 64 00:03:33,660 --> 00:03:38,609 want you to do is to go off right but 65 00:03:36,450 --> 00:03:45,660 hoping will take not to plus me 66 00:03:38,610 --> 00:03:48,450 yes exactly okay so one two three 67 00:03:45,660 --> 00:03:53,480 and everybody's poking each other right 68 00:03:48,450 --> 00:03:53,480 and now magics gonna begin abracadabra 69 00:04:04,520 --> 00:04:12,900 right so onto the boring stuff oh no no 70 00:04:09,810 --> 00:04:17,570 we're good okay so what is machine 71 00:04:12,900 --> 00:04:23,010 learning machine learning is a subset 72 00:04:17,570 --> 00:04:25,440 keep trying I'll teach you later machine 73 00:04:23,010 --> 00:04:28,980 learning is a subset of AI but those 74 00:04:25,440 --> 00:04:31,770 algorithms that allow computers to learn 75 00:04:28,980 --> 00:04:34,830 to perform tasks from day tens of being 76 00:04:31,770 --> 00:04:36,870 explicitly programmed to do things so 77 00:04:34,830 --> 00:04:38,580 the reason that I've put these things 78 00:04:36,870 --> 00:04:41,700 out this because a they're pretty funny 79 00:04:38,580 --> 00:04:43,830 in my opinion but also the most common 80 00:04:41,700 --> 00:04:45,570 kind of trend that I've noticed in terms 81 00:04:43,830 --> 00:04:47,280 of security people is a lack of 82 00:04:45,570 --> 00:04:49,890 understanding about the difference 83 00:04:47,280 --> 00:04:51,510 between what a if statement is or a lot 84 00:04:49,890 --> 00:05:00,300 of misstatements and an actual machine 85 00:04:51,510 --> 00:05:03,960 learning algorithm so um the difference 86 00:05:00,300 --> 00:05:09,419 is that while if-else statements are 87 00:05:03,960 --> 00:05:11,219 very good at dealing with with sorry 88 00:05:09,419 --> 00:05:14,250 with data that doesn't change machine 89 00:05:11,220 --> 00:05:17,220 learning is very good at acting in 90 00:05:14,250 --> 00:05:19,320 stochastic or randomized environments so 91 00:05:17,220 --> 00:05:21,330 it's very good at taking new data and 92 00:05:19,320 --> 00:05:23,820 figuring out what to do is that even if 93 00:05:21,330 --> 00:05:25,200 it hasn't seen it before so let's take 94 00:05:23,820 --> 00:05:27,840 an example of a dump 95 00:05:25,200 --> 00:05:29,669 great if we've got an if-else statement 96 00:05:27,840 --> 00:05:31,599 that is trying to see if something is ID 97 00:05:29,669 --> 00:05:34,869 or not maybe it's 98 00:05:31,599 --> 00:05:36,339 to have a question about how many likes 99 00:05:34,869 --> 00:05:40,689 the doll counts right 100 00:05:36,339 --> 00:05:43,809 so if dog has four legs then don't but 101 00:05:40,689 --> 00:05:46,659 what if the dog had a car accident and 102 00:05:43,809 --> 00:05:48,639 has only three legs so the default 103 00:05:46,659 --> 00:05:50,319 statement is going to fail at that point 104 00:05:48,639 --> 00:05:51,969 unless it's a little bit of a smart one 105 00:05:50,319 --> 00:05:54,099 none has seen this kind of dog before 106 00:05:51,969 --> 00:05:58,719 and a developer has gone in and changed 107 00:05:54,099 --> 00:06:01,800 to change the algorithm to a purses 108 00:05:58,719 --> 00:06:05,439 cabinet but a machine learning algorithm 109 00:06:01,800 --> 00:06:07,839 because of how similar this dog is to 110 00:06:05,439 --> 00:06:11,289 other dolls or because of other features 111 00:06:07,839 --> 00:06:14,559 of the dog like its fur or you know 112 00:06:11,289 --> 00:06:18,490 velocity of zoomies and things like that 113 00:06:14,559 --> 00:06:21,819 thank you for laughing it's going to be 114 00:06:18,490 --> 00:06:23,649 able to classify this dog has a dog even 115 00:06:21,819 --> 00:06:26,819 though it doesn't fulfill the exact 116 00:06:23,649 --> 00:06:30,330 criteria of a dog that it seemed before 117 00:06:26,819 --> 00:06:30,330 the slicker 118 00:06:31,169 --> 00:06:37,808 so um one question that I've heard a lot 119 00:06:35,019 --> 00:06:42,610 as well is what is AI what is machine 120 00:06:37,809 --> 00:06:44,409 learning how many different so um you 121 00:06:42,610 --> 00:06:46,149 might not see at the back so I will do 122 00:06:44,409 --> 00:06:48,879 the terrible thing at breathing outsides 123 00:06:46,149 --> 00:06:50,909 artificial intelligence is a program 124 00:06:48,879 --> 00:06:53,709 that can sense reason act and adapt 125 00:06:50,909 --> 00:07:01,449 while machine learning is an application 126 00:06:53,709 --> 00:07:03,789 of AI so one thing that we need to 127 00:07:01,449 --> 00:07:05,649 understand when we're talking about 128 00:07:03,789 --> 00:07:07,779 machine learning it's the difference 129 00:07:05,649 --> 00:07:10,599 between deep and shallow learning as 130 00:07:07,779 --> 00:07:13,990 well so deep learning is a little bit 131 00:07:10,599 --> 00:07:15,490 out of scope for this talk today because 132 00:07:13,990 --> 00:07:17,979 we wouldn't thirty minutes and I 133 00:07:15,490 --> 00:07:22,269 completely don't understand 134 00:07:17,979 --> 00:07:24,849 but essentially deep learning is good at 135 00:07:22,269 --> 00:07:27,009 figuring out what features to select on 136 00:07:24,849 --> 00:07:29,919 its own whenever it's trying to 137 00:07:27,009 --> 00:07:32,619 determine for example a classification 138 00:07:29,919 --> 00:07:35,378 of a certain object while shallow 139 00:07:32,619 --> 00:07:38,409 learning needs experts like security 140 00:07:35,379 --> 00:07:41,079 professionals like all of us to explain 141 00:07:38,409 --> 00:07:42,998 to the algorithm which features of a 142 00:07:41,079 --> 00:07:44,169 certain object are going to be important 143 00:07:42,999 --> 00:07:45,610 when we're going to try and determine 144 00:07:44,169 --> 00:07:52,090 will anything 145 00:07:45,610 --> 00:07:55,840 it's um yeah everyone is going to use a 146 00:07:52,090 --> 00:07:58,359 more accurate representation of machine 147 00:07:55,840 --> 00:08:02,469 learning and AI and this whole ecosystem 148 00:07:58,360 --> 00:08:04,840 of trying to use applied this area of 149 00:08:02,469 --> 00:08:07,930 knowledge to security this is more what 150 00:08:04,840 --> 00:08:09,549 this is what it would look like but we 151 00:08:07,930 --> 00:08:13,569 don't have to worry about that this is a 152 00:08:09,550 --> 00:08:16,990 short talk so my friend Gordon who is 153 00:08:13,569 --> 00:08:19,419 somewhere in here I hope oh hi on a 154 00:08:16,990 --> 00:08:21,819 drunken night out before I had ever 155 00:08:19,419 --> 00:08:25,599 touched machine learning at an any sec 156 00:08:21,819 --> 00:08:27,969 social which is great he said something 157 00:08:25,599 --> 00:08:29,560 along the lines of this which I will 158 00:08:27,969 --> 00:08:31,419 read out for the people at the back I 159 00:08:29,560 --> 00:08:33,399 don't understand unsupervised machine 160 00:08:31,419 --> 00:08:34,689 learning can he just can't you just use 161 00:08:33,399 --> 00:08:36,339 the supervisor that supervised a 162 00:08:34,690 --> 00:08:38,729 supervised machine learning to supervise 163 00:08:36,339 --> 00:08:43,630 the unsupervised machine learning the 164 00:08:38,729 --> 00:08:45,550 answer to that is no but let me explain 165 00:08:43,630 --> 00:08:47,019 a little bit more so it's kind of the 166 00:08:45,550 --> 00:08:49,930 wrong question to ask in the first place 167 00:08:47,019 --> 00:08:52,899 because supervised learning requires 168 00:08:49,930 --> 00:08:55,689 training on a large labeled data set so 169 00:08:52,899 --> 00:08:58,600 what you're doing is you are feeding the 170 00:08:55,690 --> 00:09:00,610 machine learning algorithm data where it 171 00:08:58,600 --> 00:09:02,079 will have some kind of way of verifying 172 00:09:00,610 --> 00:09:03,850 at the end if it's making the right 173 00:09:02,079 --> 00:09:06,010 choices or not 174 00:09:03,850 --> 00:09:07,870 now the issue with supervised learning 175 00:09:06,010 --> 00:09:09,550 is that you need to have a labelled data 176 00:09:07,870 --> 00:09:11,649 set and these are actually pretty hard 177 00:09:09,550 --> 00:09:15,670 to come by especially in computer 178 00:09:11,649 --> 00:09:18,010 security there are and I will touch on 179 00:09:15,670 --> 00:09:19,870 this a little later as well but the most 180 00:09:18,010 --> 00:09:22,510 popular machine learning data sets for 181 00:09:19,870 --> 00:09:25,930 example detecting malicious network 182 00:09:22,510 --> 00:09:28,990 connections are from like 1998 that's 183 00:09:25,930 --> 00:09:31,239 pretty old at the moment so we do have a 184 00:09:28,990 --> 00:09:36,970 significant black of labeled dates which 185 00:09:31,240 --> 00:09:38,440 is why um a more common use case well 186 00:09:36,970 --> 00:09:40,269 actually I don't know the statistics 187 00:09:38,440 --> 00:09:42,850 that might not be more common but a 188 00:09:40,269 --> 00:09:45,100 common use case is unsupervised machine 189 00:09:42,850 --> 00:09:49,230 learning which does not require labeled 190 00:09:45,100 --> 00:09:51,279 HR at a higher level what an 191 00:09:49,230 --> 00:09:53,110 unsupervised machine learning algorithms 192 00:09:51,279 --> 00:09:55,360 do is it will look at the similarities 193 00:09:53,110 --> 00:09:58,899 and similarities and dissimilarities 194 00:09:55,360 --> 00:10:00,760 between different data points and it 195 00:09:58,899 --> 00:10:05,680 will group them together based on that 196 00:10:00,760 --> 00:10:08,350 so if it's got a red cap and a green cap 197 00:10:05,680 --> 00:10:10,479 it might put them separately but if it's 198 00:10:08,350 --> 00:10:13,260 got an orange cat in might get near to 199 00:10:10,480 --> 00:10:16,300 the red cap because it's fairly someone 200 00:10:13,260 --> 00:10:18,100 um but one thing I should probably 201 00:10:16,300 --> 00:10:23,050 mention here is that supervised learning 202 00:10:18,100 --> 00:10:24,940 is usually easily has two categories one 203 00:10:23,050 --> 00:10:28,029 of which is classification and one of 204 00:10:24,940 --> 00:10:30,190 them is regression classification is 205 00:10:28,029 --> 00:10:33,610 trying to determine if something is for 206 00:10:30,190 --> 00:10:34,930 example red or blue and regression is 207 00:10:33,610 --> 00:10:38,560 trying to predict some kind of 208 00:10:34,930 --> 00:10:40,870 continuous or a numeric variable while 209 00:10:38,560 --> 00:10:42,910 unsupervised learning is going to be 210 00:10:40,870 --> 00:10:45,880 devided usually up until clustering or 211 00:10:42,910 --> 00:10:47,620 Association clustering being looking for 212 00:10:45,880 --> 00:10:51,910 inherit groups in the data so for 213 00:10:47,620 --> 00:10:53,980 example dogs cats an association being 214 00:10:51,910 --> 00:10:58,319 trying to look for rule surf horizontal 215 00:10:53,980 --> 00:10:58,320 people that click on X will also do what 216 00:10:58,770 --> 00:11:05,260 it's really warning here 217 00:11:01,680 --> 00:11:07,359 right I know this looks scary don't 218 00:11:05,260 --> 00:11:09,160 worry I want to point able to this 219 00:11:07,360 --> 00:11:10,420 really good cheat sheet just because if 220 00:11:09,160 --> 00:11:12,040 you do ever end up doing machine 221 00:11:10,420 --> 00:11:13,899 learning and you're using scikit-learn 222 00:11:12,040 --> 00:11:16,180 in Python you will probably come across 223 00:11:13,899 --> 00:11:18,430 this and this is brilliant especially if 224 00:11:16,180 --> 00:11:22,319 you're starting out but for most of us 225 00:11:18,430 --> 00:11:25,680 we're like my world from Fawlty Towers 226 00:11:22,320 --> 00:11:30,070 but let me go through each little bit of 227 00:11:25,680 --> 00:11:32,829 this diagram so regression the classic 228 00:11:30,070 --> 00:11:36,160 example of regression you can see here 229 00:11:32,829 --> 00:11:40,510 so if you've got 1,000 houses or who 230 00:11:36,160 --> 00:11:45,730 says I suppose in dundee francs you can 231 00:11:40,510 --> 00:11:49,870 fold them on a graph where X is the 232 00:11:45,730 --> 00:11:54,100 price and Y is the highest sorry all the 233 00:11:49,870 --> 00:11:57,310 way around I'm really good at completely 234 00:11:54,100 --> 00:12:00,070 right but anyway so 235 00:11:57,310 --> 00:12:02,439 you plot the price and the size on a 236 00:12:00,070 --> 00:12:04,450 graph and you put each house that you've 237 00:12:02,440 --> 00:12:06,820 got and undie that you know about on to 238 00:12:04,450 --> 00:12:08,950 the graph so one is here what is there 239 00:12:06,820 --> 00:12:10,779 etcetera etcetera then what you will do 240 00:12:08,950 --> 00:12:13,600 is you will try to use an algorithm to 241 00:12:10,779 --> 00:12:15,880 draw a line that will best represent all 242 00:12:13,600 --> 00:12:19,150 of this data which means that then if 243 00:12:15,880 --> 00:12:22,270 you've got a new house that is of a 244 00:12:19,150 --> 00:12:24,160 certain size you can go okay well it's 245 00:12:22,270 --> 00:12:28,089 there going up to this line it is 246 00:12:24,160 --> 00:12:33,600 probably going to cost around plus very 247 00:12:28,089 --> 00:12:33,600 cheap but like 1,300 pounds apartment 248 00:12:35,520 --> 00:12:39,939 and here I will quickly go into 249 00:12:37,870 --> 00:12:41,950 overfitting so one of the issues that 250 00:12:39,940 --> 00:12:44,830 you will encounter if you are going to 251 00:12:41,950 --> 00:12:46,900 use this kind of algorithm is that you 252 00:12:44,830 --> 00:12:49,600 might overfit the data what this means 253 00:12:46,900 --> 00:12:51,970 is that when you draw this line you can 254 00:12:49,600 --> 00:12:55,300 make it fit the data points that you've 255 00:12:51,970 --> 00:12:57,580 got so well like here that if you change 256 00:12:55,300 --> 00:13:00,579 the data your line won't be 257 00:12:57,580 --> 00:13:05,589 representative of the new data anymore 258 00:13:00,580 --> 00:13:07,240 so for example if we put a house here if 259 00:13:05,589 --> 00:13:10,030 you had a new house work that with the 260 00:13:07,240 --> 00:13:11,680 size that is here on the graph you would 261 00:13:10,030 --> 00:13:14,949 see that it's maybe not going to be as 262 00:13:11,680 --> 00:13:17,050 representative as one that you would put 263 00:13:14,950 --> 00:13:19,690 on a line that is just right and then 264 00:13:17,050 --> 00:13:23,199 underfitting is the other side of the 265 00:13:19,690 --> 00:13:24,940 spectrum where obviously you're not 266 00:13:23,200 --> 00:13:27,010 going to have any meaningful results if 267 00:13:24,940 --> 00:13:32,410 your line is just basically drawing 268 00:13:27,010 --> 00:13:34,810 wherever but next bit of the lovely 269 00:13:32,410 --> 00:13:39,670 python diner so we've got classification 270 00:13:34,810 --> 00:13:42,939 I touched on this briefly before so I'm 271 00:13:39,670 --> 00:13:45,400 going to do this fast classification is 272 00:13:42,940 --> 00:13:48,459 a supervised learning approach where is 273 00:13:45,400 --> 00:13:50,709 a computer learns from labelled data and 274 00:13:48,459 --> 00:13:53,410 uses this to classify new observations 275 00:13:50,709 --> 00:13:54,699 so it could be by class so for example 276 00:13:53,410 --> 00:13:56,770 malicious not malicious but it also 277 00:13:54,700 --> 00:14:00,430 could also be a number of different 278 00:13:56,770 --> 00:14:02,140 classes now the why a classification 279 00:14:00,430 --> 00:14:04,630 actually has loads of applications 280 00:14:02,140 --> 00:14:08,770 especially in computer security not only 281 00:14:04,630 --> 00:14:11,140 but here as well you know is it malware 282 00:14:08,770 --> 00:14:12,760 is it phishing is it malicious 283 00:14:11,140 --> 00:14:15,100 all of these things can be solved using 284 00:14:12,760 --> 00:14:17,730 machine learning algorithms the problem 285 00:14:15,100 --> 00:14:20,590 is finding good labelled ixelles 286 00:14:17,730 --> 00:14:22,780 clustering now clusterings are really 287 00:14:20,590 --> 00:14:25,090 interesting and the reasons that you do 288 00:14:22,780 --> 00:14:28,089 it are actually quite cool so it's an 289 00:14:25,090 --> 00:14:29,260 unsupervised learning method where what 290 00:14:28,090 --> 00:14:31,270 you're going to try and do is find 291 00:14:29,260 --> 00:14:34,830 meaningful structure or groupings in a 292 00:14:31,270 --> 00:14:37,060 set of data again like I said before 293 00:14:34,830 --> 00:14:39,370 uses the similarity and dissimilarity 294 00:14:37,060 --> 00:14:43,510 between objects to put them into certain 295 00:14:39,370 --> 00:14:45,880 groups now you use this on data that you 296 00:14:43,510 --> 00:14:50,800 usually don't know much about this is 297 00:14:45,880 --> 00:14:53,080 called explorative analysis so the 298 00:14:50,800 --> 00:14:55,479 reason that this is hard is because for 299 00:14:53,080 --> 00:14:56,830 most algorithms firstly you need to 300 00:14:55,480 --> 00:14:59,860 determine how many groups you're going 301 00:14:56,830 --> 00:15:01,780 to group it into and for that you need 302 00:14:59,860 --> 00:15:03,930 to basically look at the data I boa and 303 00:15:01,780 --> 00:15:06,939 have some kind of intuition or use 304 00:15:03,930 --> 00:15:08,500 complicated and complex statistical 305 00:15:06,940 --> 00:15:14,230 analysis to figure out how many groups 306 00:15:08,500 --> 00:15:18,160 you want to create and the other issue 307 00:15:14,230 --> 00:15:19,920 is that it's kind of hard to interpret 308 00:15:18,160 --> 00:15:22,660 the data sometimes because humans are 309 00:15:19,920 --> 00:15:24,339 very good at understanding data that is 310 00:15:22,660 --> 00:15:27,790 in multiple dimensions because we can't 311 00:15:24,340 --> 00:15:31,240 visualize it then we can't see it now 312 00:15:27,790 --> 00:15:32,800 apart from finding groups of things you 313 00:15:31,240 --> 00:15:35,440 can also use clustering for anomaly 314 00:15:32,800 --> 00:15:37,660 detection so if you've got three very 315 00:15:35,440 --> 00:15:39,400 clear groups like here but then you put 316 00:15:37,660 --> 00:15:40,870 some little blobs that are outside of 317 00:15:39,400 --> 00:15:42,880 one of the groups you can think oh well 318 00:15:40,870 --> 00:15:47,080 maybe that's an anomaly which is useful 319 00:15:42,880 --> 00:15:54,150 in security the last bit of the graph is 320 00:15:47,080 --> 00:15:56,350 dimensionality reduction which is the 321 00:15:54,150 --> 00:15:59,740 10-minutes dama pairing in a bit quick 322 00:15:56,350 --> 00:16:04,120 which is which is reducing the number of 323 00:15:59,740 --> 00:16:06,880 random variables in a dataset so that in 324 00:16:04,120 --> 00:16:08,470 a dataset where you will have a number 325 00:16:06,880 --> 00:16:09,939 of relevant features that are going to 326 00:16:08,470 --> 00:16:12,700 solve your problem so for example if 327 00:16:09,940 --> 00:16:15,690 you've got a email that will have 328 00:16:12,700 --> 00:16:18,010 features such as the lengths of the 329 00:16:15,690 --> 00:16:20,700 actual content of the email the time it 330 00:16:18,010 --> 00:16:23,130 was sent and let's say this under domain 331 00:16:20,700 --> 00:16:25,380 and you might think 332 00:16:23,130 --> 00:16:27,689 for fishing the thing that you want to 333 00:16:25,380 --> 00:16:29,070 use to classify as fishing or not would 334 00:16:27,690 --> 00:16:30,240 be this under domain rather than the 335 00:16:29,070 --> 00:16:33,210 time that it was sent because that could 336 00:16:30,240 --> 00:16:35,040 be random not going into it in detail 337 00:16:33,210 --> 00:16:37,110 dimensionality reduction is basically 338 00:16:35,040 --> 00:16:39,390 trying to figure out which of the 339 00:16:37,110 --> 00:16:41,850 features are less important than others 340 00:16:39,390 --> 00:16:43,860 and taking them away or making them less 341 00:16:41,850 --> 00:16:47,670 impactful on the machine learning model 342 00:16:43,860 --> 00:16:53,400 and yeah I'm not going to be honest so 343 00:16:47,670 --> 00:16:55,589 why do we need this let's look at this 344 00:16:53,400 --> 00:16:57,000 from a kind of enterprise security point 345 00:16:55,590 --> 00:16:59,100 of view because I assume that most 346 00:16:57,000 --> 00:17:02,130 people here are going to move into that 347 00:16:59,100 --> 00:17:03,780 space at some point in their life so we 348 00:17:02,130 --> 00:17:06,300 have a lot of data and we have to 349 00:17:03,780 --> 00:17:10,230 protect organisations from threats and 350 00:17:06,300 --> 00:17:12,889 we have to detect them as well so we use 351 00:17:10,230 --> 00:17:15,390 lots of for intelligence feeds 352 00:17:12,890 --> 00:17:16,980 rule-based IDS's rules about file 353 00:17:15,390 --> 00:17:19,470 downloads outside should to try and do 354 00:17:16,980 --> 00:17:24,000 this but with a growing number of data 355 00:17:19,470 --> 00:17:27,569 and excuse me of the term but the cyber 356 00:17:24,000 --> 00:17:30,960 threat landscape ever opened and the 357 00:17:27,569 --> 00:17:33,629 infamous skill shortage in security 358 00:17:30,960 --> 00:17:34,860 we need our people to be charging Prats 359 00:17:33,630 --> 00:17:37,380 that actually exists 360 00:17:34,860 --> 00:17:38,669 and there is a lot of false positives in 361 00:17:37,380 --> 00:17:40,800 socks thank you 362 00:17:38,670 --> 00:17:43,500 socks isn't security operation centers 363 00:17:40,800 --> 00:17:49,320 no socks my mom is going to watch this 364 00:17:43,500 --> 00:17:52,410 one day and she'll be like more so the 365 00:17:49,320 --> 00:17:52,950 vision for this is that sorry I'm going 366 00:17:52,410 --> 00:17:54,780 to stay here 367 00:17:52,950 --> 00:17:57,060 it's not easy there are lots of 368 00:17:54,780 --> 00:17:58,950 accessible libraries and toolkits to do 369 00:17:57,060 --> 00:18:00,510 machine learning in general but we still 370 00:17:58,950 --> 00:18:03,120 need skilled developers to be able to 371 00:18:00,510 --> 00:18:04,590 not only understand security but also 372 00:18:03,120 --> 00:18:05,939 understand how to choose the different 373 00:18:04,590 --> 00:18:09,740 features that we're going to use to 374 00:18:05,940 --> 00:18:12,720 build our machine learning models now 375 00:18:09,740 --> 00:18:14,340 machine learning security use cases are 376 00:18:12,720 --> 00:18:17,340 usually divided into two different 377 00:18:14,340 --> 00:18:20,010 things one is pattern recognition and 378 00:18:17,340 --> 00:18:22,320 one is anomaly detection it's easy to 379 00:18:20,010 --> 00:18:23,280 conflate both of them but they have 380 00:18:22,320 --> 00:18:26,429 slightly different goals 381 00:18:23,280 --> 00:18:28,920 so in pattern recognition we're going to 382 00:18:26,430 --> 00:18:31,410 detect characteristics and data and 383 00:18:28,920 --> 00:18:33,930 these characters which can be used later 384 00:18:31,410 --> 00:18:35,130 to teach an algorithm to recognize forms 385 00:18:33,930 --> 00:18:36,070 of data that are the same or different 386 00:18:35,130 --> 00:18:38,140 from 387 00:18:36,070 --> 00:18:39,850 characteristics well an anomaly 388 00:18:38,140 --> 00:18:43,330 detection you're trying to baseline 389 00:18:39,850 --> 00:18:45,610 normal and find a notion of normality 390 00:18:43,330 --> 00:18:48,669 that statistically covers most of the 391 00:18:45,610 --> 00:18:50,199 data that you have and then anything 392 00:18:48,670 --> 00:18:55,420 that differs from that will be classed 393 00:18:50,200 --> 00:18:59,320 as an anomaly so this is kind of the 394 00:18:55,420 --> 00:19:00,790 most important bit of this hope so if 395 00:18:59,320 --> 00:19:02,080 you want to learn more about machine 396 00:19:00,790 --> 00:19:04,240 learning it is actually pretty 397 00:19:02,080 --> 00:19:10,020 accessible even though the Massa's heart 398 00:19:04,240 --> 00:19:12,940 so I started off with a this course on 399 00:19:10,020 --> 00:19:14,260 plural sight sorry you have to pay for 400 00:19:12,940 --> 00:19:16,960 plural tighteners also loads of really 401 00:19:14,260 --> 00:19:19,270 good three ones and this machine 402 00:19:16,960 --> 00:19:21,490 learning by Stanford is really good but 403 00:19:19,270 --> 00:19:24,550 you have to do it in like MATLAB or R or 404 00:19:21,490 --> 00:19:27,940 something isn't Python so I was like no 405 00:19:24,550 --> 00:19:29,830 um but scikit-learn and Python is really 406 00:19:27,940 --> 00:19:31,480 good even just reading the documentation 407 00:19:29,830 --> 00:19:33,639 is really good for looking at 408 00:19:31,480 --> 00:19:38,380 applications in these cases although not 409 00:19:33,640 --> 00:19:39,910 necessarily for security and last but 410 00:19:38,380 --> 00:19:42,520 not least machine learning is difficult 411 00:19:39,910 --> 00:19:44,380 but don't get discouraged if I can do a 412 00:19:42,520 --> 00:19:46,660 little bit of it you can all definitely 413 00:19:44,380 --> 00:19:49,180 do a little bit I think is a really 414 00:19:46,660 --> 00:19:51,070 valuable and rewarding area to get into 415 00:19:49,180 --> 00:19:53,590 especially as it makes you think about 416 00:19:51,070 --> 00:19:55,750 what is actually important in security 417 00:19:53,590 --> 00:19:58,419 and how we can make you know the output 418 00:19:55,750 --> 00:20:01,870 of security professionals useful to a 419 00:19:58,420 --> 00:20:04,480 business or a country or whoever you 420 00:20:01,870 --> 00:20:05,979 work for and also you can really impress 421 00:20:04,480 --> 00:20:09,400 your friends and family by saying oh 422 00:20:05,980 --> 00:20:11,980 yeah I do machine learning so that was 423 00:20:09,400 --> 00:20:15,240 following me on Twitter once again and 424 00:20:11,980 --> 00:20:15,240 thank you so much for your attention 425 00:20:35,150 --> 00:20:42,330 good question so the I'm not really 426 00:20:40,890 --> 00:20:46,710 allowed to talk in depth about the 427 00:20:42,330 --> 00:20:48,360 projects that I did it work however the 428 00:20:46,710 --> 00:20:50,760 stuff I was working on also required 429 00:20:48,360 --> 00:20:53,850 domain expertise in networks and network 430 00:20:50,760 --> 00:20:55,170 security so I think that if I hadn't 431 00:20:53,850 --> 00:20:56,550 known that and I had a new mission 432 00:20:55,170 --> 00:21:00,450 learning I would have been completely 433 00:20:56,550 --> 00:21:06,060 screwed for lack of a better word um but 434 00:21:00,450 --> 00:21:07,980 I think intuition about maths and just 435 00:21:06,060 --> 00:21:10,470 kind of in the village he I did not set 436 00:21:07,980 --> 00:21:14,490 the school I don't we did a one course 437 00:21:10,470 --> 00:21:21,330 in uni but it was really well great this 438 00:21:14,490 --> 00:21:23,100 is being filmed so I think like having a 439 00:21:21,330 --> 00:21:25,409 bit of an aptitude for mathematics was 440 00:21:23,100 --> 00:21:27,120 also really useful but apart from that I 441 00:21:25,410 --> 00:21:28,800 think as with everything is just grit 442 00:21:27,120 --> 00:21:30,870 and determination you know whenever 443 00:21:28,800 --> 00:21:32,820 you're well maybe I'm going to lose your 444 00:21:30,870 --> 00:21:36,629 job but whenever your boss is telling 445 00:21:32,820 --> 00:21:39,750 you to do it you just do it yeah sorry 446 00:21:36,630 --> 00:21:45,120 right yes that is a good point good for 447 00:21:39,750 --> 00:21:51,840 coming I couldn't to be honest any other 448 00:21:45,120 --> 00:21:54,149 questions okay oh so a big problem with 449 00:21:51,840 --> 00:21:59,070 machine learning anything that you do 450 00:21:54,150 --> 00:22:01,050 I'm still just as we are - so just the 451 00:21:59,070 --> 00:22:04,379 five decisions that are all who makes 452 00:22:01,050 --> 00:22:07,500 you have any negative implications if 453 00:22:04,380 --> 00:22:09,900 you cyber security so for example unlike 454 00:22:07,500 --> 00:22:11,940 in historical an assault can I have this 455 00:22:09,900 --> 00:22:15,240 machine learning that box reinstated 456 00:22:11,940 --> 00:22:17,370 it's giving me all my nice through 457 00:22:15,240 --> 00:22:20,720 positives what about the false positive 458 00:22:17,370 --> 00:22:26,129 sort of things in helping catch DCU 459 00:22:20,720 --> 00:22:29,040 agent up so I'm going to answer this 460 00:22:26,130 --> 00:22:34,320 question to both as well as I can 461 00:22:29,040 --> 00:22:36,750 so to the gdpr thing there is a lot of 462 00:22:34,320 --> 00:22:40,260 attention put into processing people's 463 00:22:36,750 --> 00:22:43,920 data in general and GDP or as and I'm 464 00:22:40,260 --> 00:22:47,490 not an expert by my shirt or don't you 465 00:22:43,920 --> 00:22:49,020 GPR but there is a lot of attention 466 00:22:47,490 --> 00:22:51,080 button to things like dates and lasting 467 00:22:49,020 --> 00:22:54,600 so making sure the data is anonymous and 468 00:22:51,080 --> 00:22:56,610 in terms of I don't know how to DPI 469 00:22:54,600 --> 00:22:59,419 works for kind of regulatory compliance 470 00:22:56,610 --> 00:23:04,830 for not responding to cyber incidents 471 00:22:59,420 --> 00:23:06,690 but I mean yes see whenever you deploy 472 00:23:04,830 --> 00:23:08,159 an algorithm a machine learning 473 00:23:06,690 --> 00:23:09,960 algorithm that you don't really 474 00:23:08,160 --> 00:23:12,720 understand on to about tenets telling 475 00:23:09,960 --> 00:23:14,310 you this is true the schools you have to 476 00:23:12,720 --> 00:23:18,330 be able to justify the decision that you 477 00:23:14,310 --> 00:23:20,250 made and I know that in the financial 478 00:23:18,330 --> 00:23:22,679 sector regulators are incredibly strict 479 00:23:20,250 --> 00:23:24,300 and it probably get really screwed if 480 00:23:22,680 --> 00:23:28,050 you make the wrong decision at some 481 00:23:24,300 --> 00:23:29,790 point but yeah I mean I think that's the 482 00:23:28,050 --> 00:23:32,970 reason that we employ experts that will 483 00:23:29,790 --> 00:23:35,750 own the risk for like anything any other 484 00:23:32,970 --> 00:23:35,750 questions 485 00:23:35,870 --> 00:23:46,340 manual police a national excitation um 486 00:23:40,580 --> 00:23:48,899 so I think it will definitely help I 487 00:23:46,340 --> 00:23:50,370 don't think I'm really qualified to 488 00:23:48,900 --> 00:23:53,430 answer it but we'll replace it 489 00:23:50,370 --> 00:23:55,110 completely I think you know machine 490 00:23:53,430 --> 00:23:57,660 learning algorithms are already being 491 00:23:55,110 --> 00:23:59,129 used to navies and things like that that 492 00:23:57,660 --> 00:24:01,440 doesn't mean we don't have relay systems 493 00:23:59,130 --> 00:24:03,030 as well I think at the point that 494 00:24:01,440 --> 00:24:05,960 they're battering more effective but 495 00:24:03,030 --> 00:24:07,830 yeah we'll use them but I don't think 496 00:24:05,960 --> 00:24:09,900 because of the complexity of 497 00:24:07,830 --> 00:24:12,060 implementing these things in the first 498 00:24:09,900 --> 00:24:15,450 place I think you know rather than there 499 00:24:12,060 --> 00:24:17,250 being a bunch of people that write rules 500 00:24:15,450 --> 00:24:18,480 for assault will be a bunch of people 501 00:24:17,250 --> 00:24:20,310 that write machine learning I'll bring 502 00:24:18,480 --> 00:24:22,110 this Chris offering this will have to be 503 00:24:20,310 --> 00:24:23,639 tested as well you know the whole 504 00:24:22,110 --> 00:24:27,449 problems just going to repeat itself is 505 00:24:23,640 --> 00:24:29,190 just going to be more complex any other 506 00:24:27,450 --> 00:24:31,610 questions oh I'm out of time today thank 507 00:24:29,190 --> 00:24:31,610 you so much