1 00:00:00,480 --> 00:00:03,000 hello everyone here I am going to 2 00:00:03,000 --> 00:00:05,560 introduce my recent work Kronos a 3 00:00:05,560 --> 00:00:07,759 testing framework for detecting timeout 4 00:00:07,759 --> 00:00:10,280 bugs in distributed systems by Deep 5 00:00:10,280 --> 00:00:13,759 priority fuzzing with transient 6 00:00:13,759 --> 00:00:16,560 delay distributed systems are prone to 7 00:00:16,560 --> 00:00:19,240 various faults that may occur at runtime 8 00:00:19,240 --> 00:00:21,600 one of the most common faults is delays 9 00:00:21,600 --> 00:00:23,560 which can be caused by uncontrollable 10 00:00:23,560 --> 00:00:26,480 factors such as Network traffic resource 11 00:00:26,480 --> 00:00:29,400 monopolization software bugs and so on 12 00:00:29,400 --> 00:00:31,920 in comp Le Lex distributed systems to 13 00:00:31,920 --> 00:00:34,160 mitigate these unexpected failures 14 00:00:34,160 --> 00:00:36,239 various timeout mechanisms have been 15 00:00:36,239 --> 00:00:38,719 proposed ensuring the stability and 16 00:00:38,719 --> 00:00:42,680 reliability of distributed 17 00:00:43,039 --> 00:00:46,039 systems when object 01 sends a request 18 00:00:46,039 --> 00:00:49,239 to another object O2 01 sets a timeout 19 00:00:49,239 --> 00:00:51,719 value and waits for the response from O2 20 00:00:51,719 --> 00:00:54,879 until the time expires in case O2 fails 21 00:00:54,879 --> 00:00:57,520 or a message loss occurs 01 can break 22 00:00:57,520 --> 00:00:59,199 out of the waiting State triggered by 23 00:00:59,199 --> 00:01:00,559 the timeout of 24 00:01:00,559 --> 00:01:03,199 execute timeout handling code and take 25 00:01:03,199 --> 00:01:07,479 proper actions retrying or skipping 26 00:01:07,479 --> 00:01:10,040 accordingly in general the timeout 27 00:01:10,040 --> 00:01:12,400 mechanism in a distributed system mainly 28 00:01:12,400 --> 00:01:15,040 distributes on two parts components 29 00:01:15,040 --> 00:01:17,759 interact locally through local IO and 30 00:01:17,759 --> 00:01:22,240 nodes communicate remotely via Network 31 00:01:22,400 --> 00:01:25,640 IO distributed systems are inherently 32 00:01:25,640 --> 00:01:28,560 complex involving numerous resources and 33 00:01:28,560 --> 00:01:30,200 nodes that communicate with with each 34 00:01:30,200 --> 00:01:33,200 other to handle unexpected delays and 35 00:01:33,200 --> 00:01:36,240 maintain uninterrupted service provision 36 00:01:36,240 --> 00:01:38,479 these systems require a significant 37 00:01:38,479 --> 00:01:41,640 number of timeout mechanisms however due 38 00:01:41,640 --> 00:01:44,159 to this complexity it can be challenging 39 00:01:44,159 --> 00:01:46,159 to avoid incorrect handling or 40 00:01:46,159 --> 00:01:48,759 implementation bugs in these timeout 41 00:01:48,759 --> 00:01:52,880 Logics which we call them timeout 42 00:01:52,880 --> 00:01:55,960 bugs to effectively test these timeout 43 00:01:55,960 --> 00:01:58,439 mechanisms it is necessary to trigger 44 00:01:58,439 --> 00:02:00,560 more abnormal delays than those 45 00:02:00,560 --> 00:02:03,159 occurring naturally during regular use 46 00:02:03,159 --> 00:02:06,520 by injecting plenty of 47 00:02:06,640 --> 00:02:09,679 delays here I show an example of timeout 48 00:02:09,679 --> 00:02:13,120 bug in hdfs where incorrect handling of 49 00:02:13,120 --> 00:02:15,800 the timeout mechanism led to critical 50 00:02:15,800 --> 00:02:18,519 issues the data streamer throws an IO 51 00:02:18,519 --> 00:02:21,040 exception and incorrectly sets its state 52 00:02:21,040 --> 00:02:24,560 to interrupted as shown in line 18 53 00:02:24,560 --> 00:02:26,280 however when the thread then performs 54 00:02:26,280 --> 00:02:28,680 other actions such as an RPC remote 55 00:02:28,680 --> 00:02:31,120 procedure call to name node by the 56 00:02:31,120 --> 00:02:33,599 function abandon block it first checks 57 00:02:33,599 --> 00:02:35,599 the current state and fails immediately 58 00:02:35,599 --> 00:02:37,519 since its state is interrupted 59 00:02:37,519 --> 00:02:39,920 triggering this bug this bug obstructs 60 00:02:39,920 --> 00:02:42,959 the RPC to other nodes causing service 61 00:02:42,959 --> 00:02:45,560 Hang-Ups and affecting the availability 62 00:02:45,560 --> 00:02:47,920 of the 63 00:02:48,800 --> 00:02:51,280 hdfs however there are three main 64 00:02:51,280 --> 00:02:54,080 challenges first code implementation of 65 00:02:54,080 --> 00:02:56,959 distributed systems is huge and various 66 00:02:56,959 --> 00:02:59,280 it is hard to precisely identify all the 67 00:02:59,280 --> 00:03:03,159 code location ations where to inject 68 00:03:03,480 --> 00:03:06,120 delays the Second Challenge lies in 69 00:03:06,120 --> 00:03:07,879 effectively triggering instrumented 70 00:03:07,879 --> 00:03:11,040 delays in deep paths where timeout bugs 71 00:03:11,040 --> 00:03:13,560 tend to remain hidden in distributed 72 00:03:13,560 --> 00:03:15,959 systems the business logic is usually 73 00:03:15,959 --> 00:03:18,400 divided into multiple phases each with 74 00:03:18,400 --> 00:03:21,120 its corresponding timeout mechanism 75 00:03:21,120 --> 00:03:23,200 frequently executing delays at the 76 00:03:23,200 --> 00:03:25,360 shallow execution paths prevents 77 00:03:25,360 --> 00:03:27,599 exploring timeout mechanisms in deep 78 00:03:27,599 --> 00:03:31,720 paths resulting in ineffective 79 00:03:31,720 --> 00:03:34,239 testing the third challenge is that 80 00:03:34,239 --> 00:03:36,400 executing delays will introduce 81 00:03:36,400 --> 00:03:38,519 significant overhead to the testing 82 00:03:38,519 --> 00:03:41,120 process since actual timeouts are 83 00:03:41,120 --> 00:03:43,879 usually timeconsuming directly inserting 84 00:03:43,879 --> 00:03:46,200 delays reduces testing speed and 85 00:03:46,200 --> 00:03:48,920 consequently testing 86 00:03:48,920 --> 00:03:51,159 performance to overcome these three 87 00:03:51,159 --> 00:03:53,640 challenges we propose Kronos as shown in 88 00:03:53,640 --> 00:03:57,400 this picture which consists of two main 89 00:03:57,400 --> 00:04:00,079 process the first process is the delay 90 00:04:00,079 --> 00:04:02,920 instrument process in this process given 91 00:04:02,920 --> 00:04:06,200 a DSU the delay injector first decides 92 00:04:06,200 --> 00:04:08,599 where to inject delays then precisely 93 00:04:08,599 --> 00:04:11,319 injects the fine grain delay logic and 94 00:04:11,319 --> 00:04:14,560 finally outputs the instrumented DSU 95 00:04:14,560 --> 00:04:17,160 with a set of delay 96 00:04:17,160 --> 00:04:20,040 blocks the second process is the testing 97 00:04:20,040 --> 00:04:23,360 process which involves the following 98 00:04:23,360 --> 00:04:26,320 steps Kronos first loads workloads to 99 00:04:26,320 --> 00:04:30,080 send requests to the distributed system 100 00:04:30,080 --> 00:04:32,440 then delay selector dynamically selects 101 00:04:32,440 --> 00:04:34,479 a subset of the delay blocks according 102 00:04:34,479 --> 00:04:37,039 to the runtime context such as call 103 00:04:37,039 --> 00:04:40,280 Trace execution depth and so 104 00:04:40,280 --> 00:04:43,120 on after that Kronos combines the 105 00:04:43,120 --> 00:04:45,440 selected delay blocks and generates a 106 00:04:45,440 --> 00:04:46,440 delay 107 00:04:46,440 --> 00:04:48,919 sequence and then delay selector 108 00:04:48,919 --> 00:04:52,440 activates all delay blocks in the 109 00:04:52,440 --> 00:04:55,919 sequence dut executes the workload with 110 00:04:55,919 --> 00:04:58,840 the activated delay blocks to speed up 111 00:04:58,840 --> 00:05:01,120 the testing process C the delay 112 00:05:01,120 --> 00:05:03,400 executive proposes the transient delay 113 00:05:03,400 --> 00:05:07,160 mechanism to quickly trigger the timeout 114 00:05:07,160 --> 00:05:09,800 mechanisms finally the timeout bug 115 00:05:09,800 --> 00:05:12,120 analyzer monitors the runtime states of 116 00:05:12,120 --> 00:05:14,639 distributed systems in real time and 117 00:05:14,639 --> 00:05:16,800 identifies if there are node crashes or 118 00:05:16,800 --> 00:05:18,479 service 119 00:05:18,479 --> 00:05:21,440 hangs the timeout bugs are reported once 120 00:05:21,440 --> 00:05:24,080 they are detected Kronos proceeds to the 121 00:05:24,080 --> 00:05:26,639 next iteration from step one to step 122 00:05:26,639 --> 00:05:28,720 seven of the testing process until 123 00:05:28,720 --> 00:05:30,400 termination 124 00:05:30,400 --> 00:05:32,680 first the delay injector determines 125 00:05:32,680 --> 00:05:34,960 where to inject delay which code 126 00:05:34,960 --> 00:05:37,120 locations in delay blocks are eligible 127 00:05:37,120 --> 00:05:39,360 to inject delays there are three 128 00:05:39,360 --> 00:05:41,400 alternative delay injection methods as 129 00:05:41,400 --> 00:05:43,319 shown in 130 00:05:43,319 --> 00:05:46,240 figure first we can directly modify the 131 00:05:46,240 --> 00:05:48,160 source code of the distributed system 132 00:05:48,160 --> 00:05:50,960 under test however it is hard to 133 00:05:50,960 --> 00:05:53,560 identify all timeout mechanisms since 134 00:05:53,560 --> 00:05:55,800 their code implementation are complex 135 00:05:55,800 --> 00:05:56,639 and 136 00:05:56,639 --> 00:05:59,680 various or we can inject delays in local 137 00:05:59,680 --> 00:06:02,960 IO and network iio by the Linux kernel 138 00:06:02,960 --> 00:06:05,199 however it is hard for the OS kernel to 139 00:06:05,199 --> 00:06:07,199 distinguish which collar which delay 140 00:06:07,199 --> 00:06:11,759 blocks D1 D2 or D3 calls through the 141 00:06:11,759 --> 00:06:15,039 EO since timeout mechanisms in local 142 00:06:15,039 --> 00:06:17,319 interactions and remote Communications 143 00:06:17,319 --> 00:06:19,560 are executed through local IO and 144 00:06:19,560 --> 00:06:23,120 network EO usually distributed systems 145 00:06:23,120 --> 00:06:25,840 do not directly operate these IO but 146 00:06:25,840 --> 00:06:28,759 rather utilize existing common runtime 147 00:06:28,759 --> 00:06:31,520 libraries so we can inject delays in the 148 00:06:31,520 --> 00:06:33,080 runtime Library 149 00:06:33,080 --> 00:06:35,840 layer the Second Challenge is to decide 150 00:06:35,840 --> 00:06:38,720 when to activate delays each delay block 151 00:06:38,720 --> 00:06:42,199 has two states activated or 152 00:06:42,199 --> 00:06:44,360 deactivated because there are so many 153 00:06:44,360 --> 00:06:46,840 delay blocks in a distributed system the 154 00:06:46,840 --> 00:06:48,880 input space is 155 00:06:48,880 --> 00:06:51,360 huge and one our assumption is that 156 00:06:51,360 --> 00:06:53,599 delay mechanisms in the Deep path are 157 00:06:53,599 --> 00:06:55,599 hardly triggered in normal usage or 158 00:06:55,599 --> 00:06:57,960 testing and therefore may have more 159 00:06:57,960 --> 00:07:00,120 timeout bugs 160 00:07:00,120 --> 00:07:02,280 then we use a deep priority guided 161 00:07:02,280 --> 00:07:04,680 algorithm for selecting delay sequences 162 00:07:04,680 --> 00:07:07,160 that can trigger new found deep delays 163 00:07:07,160 --> 00:07:10,000 in this way Kronos constantly generates 164 00:07:10,000 --> 00:07:12,240 highquality delay sequences as test 165 00:07:12,240 --> 00:07:15,080 inputs and explores as many deep delay 166 00:07:15,080 --> 00:07:16,840 blocks as 167 00:07:16,840 --> 00:07:19,599 possible the third challenge is how to 168 00:07:19,599 --> 00:07:22,599 execute delays in real world distributed 169 00:07:22,599 --> 00:07:24,680 systems timeouts are typically set 170 00:07:24,680 --> 00:07:26,759 between 1 second to 10 171 00:07:26,759 --> 00:07:29,680 minutes however injecting such long 172 00:07:29,680 --> 00:07:32,199 delays directly can significantly slow 173 00:07:32,199 --> 00:07:33,960 down the testing 174 00:07:33,960 --> 00:07:36,919 process to address it we propose a 175 00:07:36,919 --> 00:07:39,599 transient delay mechanism that directly 176 00:07:39,599 --> 00:07:42,360 sets the timeout value to zero this 177 00:07:42,360 --> 00:07:44,520 enables quick triggering of delay Logics 178 00:07:44,520 --> 00:07:47,759 in the system and speed up the testing 179 00:07:47,759 --> 00:07:50,360 process we have implemented Kronos on 180 00:07:50,360 --> 00:07:52,680 four widely used distributed systems 181 00:07:52,680 --> 00:07:57,039 including zookeeper MySQL cluster hdfs 182 00:07:57,039 --> 00:07:59,520 and go ethereum in total we have 183 00:07:59,520 --> 00:08:03,000 detected 28 new bugs in these systems we 184 00:08:03,000 --> 00:08:05,159 compared Kronos with other state-of-art 185 00:08:05,159 --> 00:08:07,080 methods for fault injection Tech on 186 00:08:07,080 --> 00:08:09,879 alies including random method brute 187 00:08:09,879 --> 00:08:12,720 force method and coverage guided fuzzing 188 00:08:12,720 --> 00:08:14,479 the results show Kronos always 189 00:08:14,479 --> 00:08:16,919 outperforms others on all four targets 190 00:08:16,919 --> 00:08:18,840 distributed 191 00:08:18,840 --> 00:08:21,639 systems to evaluate the effectiveness of 192 00:08:21,639 --> 00:08:24,720 the transient delay we also run Kronos 193 00:08:24,720 --> 00:08:27,159 minus the version of Kronos which 194 00:08:27,159 --> 00:08:29,680 disables the the transient delay and 195 00:08:29,680 --> 00:08:33,000 executes the actual delays in conclusion 196 00:08:33,000 --> 00:08:36,240 Kronos can detect all 28 bugs within 24 197 00:08:36,240 --> 00:08:39,200 hours while Kronos minus only detects 14 198 00:08:39,200 --> 00:08:44,000 of them and Kronos covers 18% more delay 199 00:08:44,000 --> 00:08:47,080 blocks our future work mainly focuses on 200 00:08:47,080 --> 00:08:49,440 the following points first we would 201 00:08:49,440 --> 00:08:51,920 explore more oracles such as privacy 202 00:08:51,920 --> 00:08:54,320 leaking problem or performance issues to 203 00:08:54,320 --> 00:08:56,920 further improve kronos's bug 204 00:08:56,920 --> 00:08:59,440 analyzer in addition we will try to 205 00:08:59,440 --> 00:09:01,399 enhance Kronos with finer feedback 206 00:09:01,399 --> 00:09:04,200 guidance in finding more timeout bugs at 207 00:09:04,200 --> 00:09:06,839 last we will try to adapt Kronos to more 208 00:09:06,839 --> 00:09:08,880 distributed 209 00:09:08,880 --> 00:09:11,680 systems thank you for listening please 210 00:09:11,680 --> 00:09:16,000 contact me if you have any questions