Static Analysis- Introduction part 1
A key element of a security-conscious software development process is performing code reviews with the assistance of tools. Increasingly, such tools employ a technology called static analysis. In this unit we'll look at static analysis in detail. We begin with an introduction: we'll look at what static analysis is and why it is useful. Then we'll look at how static analysis works. To understand the basics, we will develop a flow analysis that tries to understand how tainted values flow around a program. Our goal is to find bugs. Then we will look at how to increase the precision of the basic analysis by making it context, flow, and path sensitive. We will finish up by filling in some missing details: we'll consider how an analysis handles more challenging programming language features, how it can be customized to different settings, and how it can be made more impactful in practice. Current practice for software assurance tends to involve testing. The idea here is to make sure that the program runs correctly on a set of inputs that seem relevant to the operation of the program. So we provide those inputs to the program, the program produces some outputs, and an oracle that the tester constructs confirms that the outputs are the ones that are expected. Testing is great because, if a test fails, you have a concrete failure that you can use to help find what the issue is, and also to establish whether or not the failure is a true one. On the other hand, coming up with test cases, actually running them, and doing so in a way that covers all of the relevant code paths is very difficult and provides no guarantees. The no-guarantee part is especially troubling for security, because it takes only one failure, one failed code path, to lead to a vulnerability that could take down an entire system.
Another approach that's complementary is manual code auditing: the idea here is to convince someone else on your software development team that the source code is correct. The benefit is that humans, when they review code, can generalize beyond single runs; they can do better than just single test cases, imagining executions in their head that maybe they haven't written test cases for. On the other hand, human time is very expensive, especially compared to computer time. And once again, this is a difficult task that provides no guarantees of coverage. Of course, we are worried about security, and so a malicious adversary is going to exploit those things that we miss. So: static analysis to the rescue. The idea here is to use a computer program to analyze a program's source code without running it. In a sense, we're asking a computer to do what a human might have done during a code review. The benefit is much higher code coverage. The static analyzer can reason about many possible runs of the program, and in some cases even reason about all possible runs, and thereby provide a guarantee that the property we've attempted to establish, hopefully a reasonable and important security property, is in fact true all of the time. We can also reason about incomplete programs, like libraries, that would otherwise be challenging to test. Now, as it turns out, static analysis, surprise surprise, has drawbacks too. For example, it tends to be able to analyze only limited properties, rather than complete functional correctness. It may miss some errors or raise false alarms for engineering reasons. And it may also be time consuming to run. Even so, static analysis can have significant impact on a security-oriented development process. Because static analysis can thoroughly check limited but useful properties and thereby eliminate entire categories of errors, it frees up developers to concentrate on deeper reasoning.
Static analysis tool usage can also encourage better development practices. Programmers who use tools start to develop programming models that avoid mistakes in the first place. Programmers also end up thinking about and making manifest their assumptions. For example, they will use annotations that tools understand in order to improve tool precision by specifying what they intend for a program to do. Static analysis tools are becoming a commercial reality. And so companies are starting to feel this impact today.
Static Analysis- Introduction part 2
What is static analysis? To answer that question, let's first ask a different one: what can static analysis do? A classic problem is the halting problem. It asks: can we write an analyzer that can prove, for any program P and any input to it, whether P will terminate? The problem is depicted visually as follows. We start with the program P. We feed this program and its input to our analyzer, and it determines whether or not P terminates. To solve the halting problem, we would have to build such an analyzer. The question is whether doing so is even possible. As it turns out, the answer is no: it has been proved that the halting problem is undecidable. That is to say, it is impossible to write a general analyzer that can answer the halting question for all programs and all inputs. While we cannot establish termination behavior, maybe we can establish other security-relevant properties. For example, perhaps we can use static analysis to prove that all array accesses are in bounds. If we could do this, we could eliminate a large source of memory safety violations, that is, buffer overruns. Unfortunately, this question is undecidable too, as is essentially every other interesting property. We can prove this claim by showing that an analyzer for array bounds checking could be used to decide the halting problem, which we know is undecidable; as such, array bounds checking must also be undecidable. Here are some more example questions that are undecidable; that all such properties are undecidable follows from Rice's theorem. Is an SQL query constructed from untrusted input? Is a pointer dereferenced after it is freed? Is accessing a variable a possible source of a data race? And of course, there are many other such questions. So let's look at how the halting problem can be viewed as equivalent to the question of whether an array index is in bounds.
This is a kind of proof by transformation, showing that interesting program analysis problems are equivalent to the halting problem. First, let's take all indexing expressions a[i] in the program and convert them to exit instead. We see here what that transformation is: we replace a[i] with a bounds check, asking whether i is greater than or equal to 0 and less than a's length. If so, we perform a[i] as normal; otherwise, we exit. Now an array bounds error is instead a termination. Next, we change all program exit points to instead be out-of-bounds accesses. So if the program would exit normally, by completing the main function, for example, or by calling the exit command, or by throwing an exception that's never caught, then at those exit points we would put a[a.length + 10], that is, an access with an index that's out of bounds. Now we take this transformed program and pass it to our analyzer, which we hypothesize can perfectly check whether or not an array access is in bounds. If the array bounds checker finds an error, then the original program must have halted. On the other hand, if it claims there are no such errors, the original program does not halt. But this is a contradiction, because we've now shown that an array bounds checker can decide the halting problem, which we know is undecidable. So does this mean that static analysis is impossible? Not exactly. What it means is that perfect static analysis is not possible. Useful static analysis is perfectly possible, on the other hand, despite the fact that the analyzer may fail to terminate itself, may emit false alarms, which are claimed errors that are not really errors, or may have missed errors, where the analyzer reports no problems but in fact the program is not error free. As a practical matter, because non-terminating analyses are confusing to users, tools that exhibit only false alarms or missed errors are the norm.
In particular, these tools fall somewhere between sound and complete analyses. Let's go into what I mean by that. Soundness is a property stating that if the analysis says X is true, then X is actually true. In other words, the things the analysis says are contained in the true things. This means that a trivially sound analysis is one that says nothing. Completeness states that if X is true, then the analysis says X is true: the true things are contained within the things it says. So a trivially complete analysis says everything. A sound and complete analysis is one where the things it says are exactly the true things. And of course, what we have shown is that, for problems like the halting problem, such sound and complete analyses do not exist; you have to sacrifice one or the other. You can go one of two ways. A sound analysis is one where, if the program is claimed to be error free, then it really is; alarms do not imply erroneousness. A complete analysis is one where, if the program is claimed to be erroneous, then it really is; silence does not imply error freedom. In practice, the most interesting analyses are neither sound nor complete, but they usually lean toward soundness or toward completeness. Because perfect static analysis is impossible in general, our goal is simply to make a tool that is useful. Doing so is as much art as it is science. In particular, there are many different elements of an analysis that trade off with one another, and a practical tool must decide which elements are most important. Some of these elements are the following. A precise analysis aims to model program behavior very closely, to make the most informed decision about whether it has found a bug, so as to avoid false alarms. A scalable analysis will successfully analyze large programs without unreasonable resource requirements, that is, space and time.
Oftentimes, scalability is a direct tradeoff with precision. An understandable analysis takes its human user into account; for example, the analysis is designed so that alarms are easy to understand and are actionable. An analysis that enjoys all three features is easier to produce if its focus is clean code. The intuition is that code that is hard for a human to understand is also hard for a tool to understand, and vice versa. As such, it may be reasonable to issue false alarms, or to run more slowly, when doing so is because of confusing or convoluted code patterns. In this way, we can view such poor performance of the static analyzer positively: if programmers clean up their code, they will reduce the total number of false alarms and perhaps improve the running time, while also making their code easier for humans to understand.
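Before moving on, the reduction from the halting problem described earlier in this section can be made concrete. Below is a toy sketch in Python over a hypothetical mini-language in which a program is just a list of instruction tuples; the instruction names and the array length A_LEN are invented for illustration, and real reductions operate on real program syntax.

```python
# A toy version of the transformation: ('index', i) accesses a[i],
# ('exit',) terminates normally, and anything else is ordinary work.
A_LEN = 100  # the (invented) length of the array a

def transform(prog):
    out = []
    for instr in prog:
        if instr[0] == 'index':
            # Guard every access: an out-of-bounds index now causes a
            # clean exit instead of a bounds error.
            out.append(('checked_index_or_exit', instr[1]))
        elif instr[0] == 'exit':
            # Every normal exit becomes a deliberately out-of-bounds
            # access, a[A_LEN + 10].
            out.append(('index', A_LEN + 10))
        else:
            out.append(instr)
    return out

# The transformed program commits a bounds error exactly when the original
# halts, so a perfect bounds checker would decide the halting problem.
print(transform([('work',), ('exit',)]))
```

The point of the sketch is only the shape of the argument: bounds errors and normal termination trade places, so deciding one property decides the other.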
Flow Analysis
We are now going to look in depth at one particular kind of static analysis, called flow analysis. Roughly speaking, a flow analysis tracks how values might flow between the different memory locations in a program. This kind of analysis is interesting because it can help us find bugs whose root cause is improperly trusting user input, and in particular, input that has not been validated. In a tainted flow analysis, untrusted input is considered tainted. Various operations in the program, on the other hand, expect to operate only on untainted data. If the flow analysis finds that tainted data could be used where untainted data is expected, there could be a potential security vulnerability. Examples include the source string given to strcpy, whose length is assumed to be no greater than that of the target buffer; format strings given to printf, which could otherwise be used in format string attacks; and form fields from web pages used in the construction of SQL strings, which could be used in SQL injection attacks. As a motivating application of tainted flow analysis, recall format string attacks. Here, the adversary controls the format string. fgets reads from the network; suppose the adversary is on the other side, sending data along, which is stored by the program in the name variable. Then this variable is passed as the format string to printf. The problem is that the attacker can set this format string in such a way as to exploit the program. For example, by setting it to %s%s%s, printf will attempt to read from arguments that it assumes are on the stack. But of course, in this case, they are not, which results in the program crashing. Even worse, the attacker can set the format string to include %n, which will write to assumed arguments on the stack, which again are not there. As a result, the adversary can actually perform code injection attacks.
And these bugs still do occur in the wild, because we cannot always simply specify the format string as a constant string. Another way to look at this problem, using types, is to imagine that the origin of a string, its level of trust, is specified as a type qualifier, either tainted or untainted. So here we specify that printf expects an untainted format string, and by untainted we mean that it is not controlled by a potential adversary. On the other hand, fgets could return data that is tainted, that an adversary could control. As such, the program that we saw before is illegal, because fgets returns a tainted string, but printf expects an untainted one. These qualifiers don't match: tainted does not equal untainted. And so we have found a possible bug, in fact a security-relevant bug, in the program. Let's carefully state the problem we want our analysis to solve; it is the no-tainted-flows problem. The goal is to establish that no tainted data is used where untainted data is expected. We identify trusted data using the untainted annotation and untrusted data using the tainted annotation. Un-annotated data can be either, and the analysis will figure out which. A solution to the problem considers all possible flows of data through the program: that is, what sources can reach what sinks, and whether a tainted data source can reach an untainted sink. We will aim to develop a sound analysis. For the flow problem, this means that if the analysis finds no illegal flows, then indeed there are none. So which flows are legal? Well, if f expects a tainted parameter, it's actually okay to pass it an untainted value instead. Here we see a function f that takes a tainted int, and we're passing it an untainted int instead. If f can handle something that's tainted, surely f can handle something that's untainted. On the other hand, if g expects an untainted argument, then we should not pass it a tainted argument.
That's the situation we've been considering with format strings: g should accept only untainted data. We can view the allowed flows as a kind of lattice, where untainted is less than tainted. That is, untainted is less than tainted, but tainted is not less than untainted. Visually, we can look at the lattice as a kind of graph, where higher values in the lattice are higher up in the graph. For our analysis, we can think of a flow analysis as a kind of type inference. If no qualifier is present, then we must infer it. To perform type inference, our algorithm takes the following steps. First, we create a name for each missing qualifier; that is, for each type that does not have a qualifier, we generate a fresh name for it, such as alpha or beta. Then, for each statement in the program, we generate constraints of the form qualifier one is less than or equal to qualifier two, where each qualifier may be a constant or a variable. For example, if we have the statement x = y in the program, then we would generate the constraint qy is less than or equal to qx, where qy is y's qualifier and qx is x's qualifier. These qualifiers could be constants like untainted or tainted, or variables like alpha or beta if they were missing. After we have generated the constraints for the entire program, we solve the constraints to produce solutions for alpha, beta, and so on. A solution is simply a substitution of qualifiers, like tainted or untainted, for names, like alpha or beta, such that all of the resulting substituted constraints are legal. If no solution exists, then we may have spotted an illegal flow. So here's an example analysis. Recall the declarations of our printf and fgets functions that we've seen already, and then consider this program. The first step is to generate qualifier names for those variables that do not have qualifiers attached to them, in this case name and x.
The next step is to generate constraints for each of the statements in the program. For the first statement, fgets returns something that is tainted and assigns it to the name variable; therefore, we generate the constraint tainted is less than or equal to alpha. Next, we assign name to x, and as a result we generate the constraint alpha is less than or equal to beta. Finally, we pass x to printf, which means that beta is less than or equal to untainted, the qualifier constant on the argument to printf. Now we can look at the constraints and see if a solution exists. The first constraint requires that alpha equal tainted: the only thing that tainted can be less than or equal to is something that's tainted. Therefore, to satisfy the second constraint, we must also have beta equal to tainted, because, again, tainted is only less than or equal to itself. But then the third constraint is illegal, because nothing allows tainted to be less than or equal to untainted. As a result, we have identified a potentially illegal flow, because there is no possible solution for alpha and beta.
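The constraint generation and solving just described can be sketched concretely. Below is a minimal brute-force solver over the two-point lattice, written in Python for illustration; the function and variable names are our own invention, and a real tool would solve constraints by graph propagation rather than enumeration.

```python
from itertools import product

def legal(lo, hi):
    # The two-point lattice untainted <= tainted: a flow lo <= hi is legal
    # unless it sends tainted data where untainted data is expected.
    return lo == 'untainted' or hi == 'tainted'

def solve(constraints, variables):
    """Brute-force search for an assignment of tainted/untainted to each
    qualifier variable that satisfies every constraint (lo, hi), read as
    lo <= hi. Returns a solution dict, or None (a potentially illegal flow)."""
    for values in product(('untainted', 'tainted'), repeat=len(variables)):
        env = dict(zip(variables, values))
        look = lambda q: env.get(q, q)  # constants stand for themselves
        if all(legal(look(lo), look(hi)) for lo, hi in constraints):
            return env
    return None

# The worked example: name = fgets(...); x = name; printf(x);
constraints = [('tainted', 'alpha'),    # fgets's result is tainted
               ('alpha', 'beta'),       # x = name
               ('beta', 'untainted')]   # printf expects untainted

print(solve(constraints, ['alpha', 'beta']))  # → None: an illegal flow
```

The enumeration mirrors the reasoning in the text: alpha is forced to tainted, which forces beta to tainted, which violates the final constraint, so no assignment works.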
Flow Analysis- Adding Sensitivity
Now let's consider improving our analysis by adding sensitivity, which increases the analysis's precision. Here's another example. We have the same printf and fgets prototypes as before, and the same start to the program: we read into the name variable from fgets. But we've now changed the program to branch, so that in the true branch we assign name to x, and in the false branch we assign x the constant value hello. Finally, after the branch concludes, we print out x. Let's analyze the program, one statement at a time, considering the constraints that each statement generates. The first statement generates tainted is less than or equal to alpha, just like the example we saw a moment ago. Next we consider the conditional. If we were to take the true branch, then assigning name to x causes us to generate the constraint alpha is less than or equal to beta. On the other hand, if we were to take the false branch, we would take an untainted value, the constant hello, and assign it to x, generating untainted is less than or equal to beta. Finally, we print out the result: because x's qualifier is beta, beta flows into the untainted argument of printf, generating beta is less than or equal to untainted. Taken together, these four constraints are unsolvable, and therefore we have a potentially illegal flow. This makes sense, because if the true branch were ever executed, we would in fact use something read from the network as a format string for printf. Now let's drop the conditional. It's a very similar program, except in this case, after we assign name to x, we immediately overwrite x with the constant hello. If we go through the program, it turns out we will generate exactly the same set of constraints. In this case, however, those constraints mean something very different, because we are sure that the second assignment to x will overwrite the value assigned by the first, rather than being an alternative to it.
As such, the analysis will complain that there is an error when in fact there is none; it has produced a false alarm. The problem here is that our analysis is flow insensitive: each variable has one qualifier, which abstracts the taintedness of all values it will ever contain. A flow sensitive analysis, on the other hand, would account for variables whose contents change: each assignment to a variable would effectively get a different qualifier. So, for example, x's qualifier at line one would be given the name alpha-1, while x's qualifier at line two would be given the name alpha-2. As such, the solutions for alpha-1 and alpha-2 can differ. We could implement such a flow sensitive analysis by transforming the program so that each variable is assigned at most once; this is converting the program to what is called static single assignment form, or SSA. Here is our example again, reworked into SSA form. Notice that we have replaced the single variable x with two variables, x1 and x2, where each one has a distinct qualifier name, in this case beta and gamma. If we look at the constraints the program generates, the first one is the same, but now the second one constrains the qualifier variable for x1, and the third constrains the qualifier variable for x2. These are now different qualifier names, and as such, there is no alarm: we can set gamma to be untainted, and alpha and beta to be tainted, and this is a valid solution for the program. So no problem exists. Now let's consider a variation of our example with multiple conditionals. First, if x is true, we assign to y the constant string hello. Hello is untainted and y's qualifier variable is alpha, so we produce the constraint untainted is less than or equal to alpha. In the else branch, we assign to y the tainted output of fgets, so we produce the constraint tainted is less than or equal to alpha.
On the next line, if x is true, we call printf with y. printf expects an untainted argument, and y, whose qualifier variable is alpha, flows into that untainted argument, so we produce the constraint alpha is less than or equal to untainted. These constraints have no solution; but this lack of a solution is, in fact, a false alarm. Why? Well, the two constraints that create the problem are impossible to produce in the same execution. That is, if x is false, then we will execute the else branch, which produces the tainted constraint; but if x is false, then we will not execute the second conditional's printf, which produces the other constraint. Conversely, if x were true, we would not have generated the constraint tainted is less than or equal to alpha, and so we would not have created the conflict that made a solution impossible. The issue here is that the constraints that produced the problem do not correspond to a feasible path. An analysis that is path sensitive considers path feasibility when determining whether or not there is a potential problem in the program. To illustrate this, we've labelled some of the statements in the little code snippet on the right. When x is not 0, the program will follow the path 1-2-4-5-6. When x is 0, it will follow the path 1-3-4-6. The path that leads to the problem we saw a moment ago is 1-3-4-5-6, but this path is infeasible. A path sensitive analysis will rule out such a path by qualifying each constraint with a path condition. So instead of just generating the constraint untainted is less than or equal to alpha, it will include the condition that says this constraint is only valid when x is not equal to 0. On the other hand, the constraint tainted is less than or equal to alpha is only valid when x equals 0. When determining whether the constraints have a solution, the path conditions will be considered.
We'll see that, on the one hand, only the first and third constraints should be considered together, and in this case alpha does have a solution: untainted. Or, on the other hand, only the second constraint should be considered, and again alpha has a solution, but a different one. Therefore, there is no alarm for this program. If flow and path sensitivity are so useful, is there any reason not to use them? It would seem that the added precision is important for avoiding false alarms, which would otherwise arise from conflating flows or paths. But both sensitivities make the flow analysis problem harder to solve. For example, flow sensitivity increases the number of variables the analysis considers, and thus increases the number of nodes in the constraint graph that we must solve. Path sensitivity requires the use of predicates to distinguish different paths, which means we must employ more general and expensive solving procedures. In short, the added precision decreases performance, which ultimately hurts scalability, limiting the size of the programs we can analyze. This trade-off is sometimes managed by employing a demand-driven analysis, which adds precision only as needed and where it's needed. We'll discuss this idea a bit more later on.
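The guarded constraints described above can be sketched in a few lines of Python. This is a toy illustration with invented names: path conditions are represented as opaque strings, whereas a real path-sensitive tool would reason about predicates with a logical solver.

```python
# Each constraint (lo, hi, guard) means lo <= hi, valid only under the
# given path condition on x (None would mean "always valid").
guarded = [('untainted', 'alpha', 'x != 0'),  # then-branch: y = "hello"
           ('tainted',   'alpha', 'x == 0'),  # else-branch: y = fgets(...)
           ('alpha', 'untainted', 'x != 0')]  # printf(y), guarded by x

def legal(lo, hi):
    # The two-point lattice: a flow is legal unless tainted <= untainted.
    return lo == 'untainted' or hi == 'tainted'

def solvable_under(condition):
    # Keep only the constraints whose guard is consistent with this path,
    # then search for a value of alpha satisfying all of them.
    active = [(lo, hi) for lo, hi, g in guarded if g is None or g == condition]
    for alpha in ('untainted', 'tainted'):
        look = lambda q: alpha if q == 'alpha' else q
        if all(legal(look(lo), look(hi)) for lo, hi in active):
            return alpha
    return None

print(solvable_under('x != 0'))  # the feasible path 1-2-4-5-6
print(solvable_under('x == 0'))  # the feasible path 1-3-4-6
```

On the path where x is not 0, only the first and third constraints are active and alpha can be untainted; on the other path, only the second is active and alpha can be tainted. Each feasible path is solvable, so no alarm is raised.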
Context Sensitive Analysis
Up until now, we've focused on tracking tainted flows through blocks of code, that is, through normal statements. Now we'll consider how to handle tainted flows through function calls. This example shows a function id that we want to analyze. The id function itself is rather simple: it takes the argument x and simply returns it back to the caller. On the left-hand side, we see a client program that takes the result of fgets, passes it to id, and stores the result in the variable b. We want to see whether or not there is a tainted flow in this program, so we need to track the flow into the id function and back out again. The way we're going to do this is in several steps. First, we need to give flow variable names to the argument and return value of the function; here we pick gamma and delta. Next, we need to create flow constraints between the actual argument at the call, in this case the variable a, and the callee, which receives it in the parameter x. In addition, after computing whatever result the function computes, we need to take what is returned, in this case x, and create a flow between it and wherever that result is used in the caller, in this case the variable b. So let's put it together. We see that fgets returns something into a. Remember that fgets has a tainted return value, and so tainted is less than or equal to alpha. Then we call the identity function, passing a in. Therefore, we need to create a flow between a and x, that is, between the actual argument a and the formal parameter x of the function id; so we create a flow constraint between alpha and gamma. In the body of the function, all that the body does is return the variable x. x has the name gamma, and because we're returning it, we need to create a flow to the name of the return value, which, as we said before, is delta.
Now, finally, what is actually returned from this particular function call needs to flow into the variable we're assigning it to, and so we create a flow between delta and b, that is, between delta and beta, b's name. Next, we've extended the example a little bit with two extra statements. In particular, we define a new variable c, to which we assign the untainted constant hi, and we then pass c as the format string to printf. These extra statements are independent of the first two as far as the actual semantics of the program is concerned, but they set up a more interesting example that we'll see in just a moment. We're going to have the same sort of constraints that we saw a minute ago for the first two lines. Then we'll create the constraint untainted flows to omega, and finally the constraint omega flows to untainted, because printf expects an untainted argument, and omega is the name of c, the actual argument we're passing in. So there's a good solution here, and there's no alarm. Now consider a small change to the example. Instead of assigning hi to the variable c directly, we pass it into id, which returns a result that eventually makes its way to c. If we were to run this program and the prior program, they would have exactly the same outcome. But, as we shall see, the analysis will treat them a bit differently. We generate the first set of constraints once again, and now we have to consider the function call into id. We're passing an untainted value into the argument of id, so untainted is less than or equal to gamma; likewise, we're taking the return value of id, delta, and assigning it to c, whose name is omega. Then we call printf, which causes omega to flow into untainted, the expectation of the format string for printf. Now let's look at solving these constraints. Unfortunately, we're going to get a false alarm.
There's no solution, and yet, as we said a moment ago, this program has exactly the same semantics as the one we saw before. So what happened? As the animation just showed, the analysis is being imprecise: it is conflating the call at which the tainted result of fgets is passed in with the call at which the untainted value hi is passed in. By conflating these two calls, the analysis pretends that the first call can return its result to the second call site. This is a problem of context insensitivity: the call sites are conflated in the graph. A context sensitive analysis solves this problem by distinguishing call sites in some way, so that we don't allow a call at one place to return its value to another call. We can do this by giving each call site a label, call it i, perhaps correlated with the line number in the program at which the call occurs, and then, in the graph, matching up calls with corresponding returns only when the labels on the flow edges match. We'll also add what we call polarities to these edges, to distinguish calls from returns: minus i for argument passing, and plus i for return values. So let's see the example again, this time in a context sensitive setting. Notice that we have labeled the two calls to id with one and two; these are the indexes for those calls, which we can think of as happening on lines one and two of the program. We have the same constraint that we had originally, but now, when we go to call the function id, we index that call with the label one, and we use a negative polarity, because that's an argument being passed down. We have the same constraint for the body of the function, and now the return at call site one is slightly different: it is indexed with the label one, and it has polarity plus, to indicate that it's a return. Okay, so here's where the new part happens.
This time, the constraints carry the label two, for both the argument passing and the return value, and then we have the final constraint. Now, these were the constraints that were part of the false alarm we saw before. But notice that the problematic flow is no longer possible, because the indexes don't match: the argument passed at call site one cannot flow out through the return at call site two. Because that flow is infeasible, the analysis, thanks to the different labels, will find a solution, and there will be no alarm. Just like flow and path sensitivity, context sensitivity is a tradeoff, favoring precision over scalability. In particular, the context insensitive algorithm takes roughly time O(n), where n is the size of the program, while the context sensitive algorithm takes time O(n cubed). On the other hand, sometimes the added precision actually helps performance: by eliminating infeasible paths, it can reduce the size of the constraint graph by a constant factor, and this is true for flow and path sensitivity too. But the general trend is that greater precision means lower scalability. One way to push the trade-off more toward performance is to be selective about where context sensitivity is used. Rather than using it at all call sites, we could use it only at some of them; the remaining call sites are conflated, as in an insensitive analysis. This means the analysis coarsely models calls from one site as if they could return to another. Rather than having one blob of insensitive sites, we could also imagine groups of sites that are sensitive with respect to the other groups but insensitive with respect to their own members; doing this amounts to giving all sites in the same group the same index. Finally, it's also possible to limit the depth of sensitivity, for example, to only match nested calls up to a certain limit.
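The matching of labeled call and return edges described above can be sketched as a small graph search. This is a simplified toy in Python with invented node and edge names: a depth-first search keeps a stack of open call sites and only follows a return edge when its index matches the most recent open call, which is exactly the parenthesis-matching intuition behind context sensitivity.

```python
# Flow edges are labeled with a polarity and call-site index: ('-', i)
# passes an argument at site i, ('+', i) returns a result at site i, and
# None is an ordinary flow.

def reaches(edges, src, dst):
    def dfs(node, stack, seen):
        if node == dst:
            return True
        for a, b, label in edges:
            if a != node:
                continue
            new_stack = stack
            if label is not None and label[0] == '-':
                new_stack = stack + [label[1]]      # open a call at site i
            elif label is not None and label[0] == '+':
                if stack and stack[-1] != label[1]:
                    continue                        # mismatched return: skip
                new_stack = stack[:-1]              # close the call
            key = (b, tuple(new_stack))
            if key not in seen:
                seen.add(key)
                if dfs(b, new_stack, seen):
                    return True
        return False
    return dfs(src, [], set())

# The two calls to id: site 1 passes the tainted a, site 2 the untainted hi.
edges = [('tainted', 'x', ('-', 1)),    # a flows into id's parameter (site 1)
         ('x', 'ret', None),            # id returns its parameter
         ('ret', 'b', ('+', 1)),        # site 1's result flows to b
         ('untainted', 'x', ('-', 2)),  # hi flows into id (site 2)
         ('ret', 'c', ('+', 2)),        # site 2's result flows to c
         ('c', 'sink', None)]           # printf(c) expects untainted

print(reaches(edges, 'tainted', 'sink'))    # site 1 cannot return at site 2
print(reaches(edges, 'untainted', 'sink'))  # the legal flow is still found
```

The tainted source can still reach b through the matched pair at site 1, but it can no longer escape through site 2's return, so the false alarm disappears while real flows are preserved.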
Flow Analysis- Scaling it up to a Complete Language and Problem Set
[SOUND] Okay, so let's finish up this discussion of our tainted flow analysis by seeing how to scale it up to a more complete language and problem set. So far we've considered scalar values, or pointers treated as blocks, but we haven't considered dereferences through pointers and how those affect the tainting. In this example, we have character pointers a and b, which are assigned strings as normal. But we also have p and q, which are pointers to pointers. That is, p is a pointer to a string, as opposed to a string itself. What the example is going to do is assign "hi" to a; p will point to the variable a; q will alias p, that is, it will also point to the variable a. Then, via q, we assign b, which is a tainted value we read from fgets, and then we print through p, which aliases with q. So what we can see here is that because of the aliasing of p and q, that is, because they both point to the same memory, when we assign the value in b through q, we effectively make what p points to tainted. But what we're going to see is that without care, our analysis won't realize this. So here on the first line, we have the assignment as usual. Here on the second line, we're going to take the address of a, and so now we're concerned with beta, which is the taintedness of the contents that p is pointing to, rather than of p itself. So we're going to create that flow constraint, and likewise that flow constraint, and here's another one that we generate as normal. And finally, we do the assignment through q. Now, star q refers to the character pointer that q points to, whose label is gamma, and b refers to what is returned by fgets, whose label is omega. So omega will flow into gamma through this assignment. And then we print via star p. So once again, what p points to, that character string, has beta as its label, and so beta is going to flow to untainted. Okay. Well, here's a problem: a solution exists, even though the assignment through q is going to affect the contents of what p points to. 
And therefore, this is actually an error that our analysis will fail to uncover. So what do we need to do to fix this? Well, we need a way of identifying when flow can happen through assignments via aliases. And we do this by, when we assign pointers to pointers, creating edges that go both ways between the flow labels of the things that they point to. So here we show the constraint gamma flows to beta when we assign p to q. That is, not only do we go beta to gamma, we also go gamma to beta. And actually, we should do the same thing with the assignment on the second line, that is, we should have an edge from beta to alpha, but I just haven't shown it here to keep one fewer constraint from cluttering the slides. Okay, with this extra constraint, a solution no longer exists. We can see now that tainted can flow to untainted, and so we've caught the error. So again, the idea is that an assignment via a pointer flows both ways. That is, it anticipates that, through an alias, we could assign a value that would create a flow later on, and the flow both ways allows that assignment to be captured. Unfortunately, this can lead to false alarms, so we have a couple of choices for reducing them. One is, if we know that an assignment through an alias can never happen, for example because the alias is labeled as const, then the backward flow is not needed and we do not need to include the edge. Or we could just drop the backward flow edge anyway, anticipating that an assignment will not happen, whether or not that's guaranteed by the type system. That's problematic: effectively, it trades a false alarm for a missed error. Different analyses make different choices about this question. Another sort of hidden flow that might happen is what's called an implicit flow, and that's illustrated by an example we'll develop now. So, here's a copy function that takes a tainted character pointer src, and copies it to a destination character pointer dst. 
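As a rough sketch of why the backward edge matters, here's a tiny Python model of this example. The label names beta, gamma, and omega follow the slides (beta and gamma label the strings that p and q point to; omega labels the tainted fgets result), and the "solver" is just naive taint propagation over the flow edges:

```python
def solve(edges, sources, sinks):
    # Propagate taint from tainted sources along flow edges, then report
    # which untainted sinks the taint reached (any hit is an error).
    tainted = set(sources)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in tainted and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted & set(sinks)

# Flow edges for the alias example.  Without the backward edges, the
# assignment *q = b (omega -> gamma) never reaches beta, the label of
# what p points to, so the print through p looks safe.
forward_only = [("alpha", "beta"),    # p = &a
                ("beta", "gamma"),    # q = p
                ("omega", "gamma")]   # *q = b  (b is tainted)
with_back = forward_only + [("beta", "alpha"), ("gamma", "beta")]
```

Running solve(forward_only, {"omega"}, {"beta"}) finds nothing, the missed error; adding the backward edges makes the taint reach beta, catching the bug.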
And we can see that this is illegal, because the tainted characters in src should not be able to flow to the untainted expectations of the characters in dst. So this flow is not allowed, and therefore the analysis should flag it. Okay, now let's look at this program, which, as it turns out, implements exactly the same function, though in a very inefficient manner. What it does is, for each character that's considered in the source array, it will iterate through all possible values that character could have. So j will consider all possible character values. If that value matches the one that's in the source, that's the if statement testing whether src sub i is equal to j, then we simply assign j to dst. This will be legal because j is untainted. But then we've missed a flow, because even though the character was not directly assigned from src, the same value of that character was. And so the information contained inside of src is certainly copied to dst, and therefore the information is leaked. So, we can perform what we call information flow analysis to discover when such implicit flows happen. Data did not flow, but the information did. We can discover this by maintaining a second label that we call the program counter label, and it represents the maximum, in the lattice, taint affecting the current position of the program counter in the program. Assignments will generate constraints involving the pc. So x equals y will produce two constraints: one between the labels of y and x as usual, and another saying that the pc label must be bounded above by the label of x, so that information does not leak through the assignment to x. And this will allow us to track information flows like the one we just saw. So here's an example with an implicit flow in it. It resembles the prior one, but it's simpler; it doesn't have the loop and so on. Here, we have a tainted source and an untainted destination. Actually, we'll infer what the taintedness of the destination is. 
So we branch on the source, which is tainted, and then based on whether the source is zero or not, we assign to dst. So if the source is zero, then dst will contain the same value as the source; otherwise it will contain one. So information is clearly flowing from src to dst, though data is not, because information about src can be recovered by looking at the value of dst after the program runs. So let's see what the constraints look like that we would generate. At each of the assignments, dst equals 0 or dst equals 1, we just have an untainted constant flowing into alpha as normal: for the 0, for the 1, and also here for the 0 in the final assignment dst plus equals 0. Okay, now let's consider the constraints according to the pc label. We've labeled each of the lines in the program with a relevant pc. The first line that can be executed is the guard, with pc1. Then come the then branch and the else branch, with pc2 and pc3. And then finally, the final assignment, with pc4. Now, the first thing that we will consider are the extra constraints that result from the assignment statements due to the pc. In particular, here, we have to create a constraint, pc2 is less than or equal to alpha, because alpha labels the variable to which we're assigning and the current program counter label at that assignment is pc2. Likewise, for the else case, we have pc3 is less than or equal to alpha, the program counter label there. And then finally, for the last assignment, pc4 is less than or equal to alpha. So what are the program counter labels at these different locations? Well, pc1 is for the first line that's executed in the program; it has not been influenced by any tainted data, so it starts as untainted. However, pc2 and pc3 are the result of branching on src. That is, those lines can only be reached because we have branched on the src variable, and src is tainted. Therefore, we need to bump the program counter label to tainted in both cases. 
Once we have exited the conditional and we're back on the line with pc4, the program counter label can be untainted again, because it doesn't make any difference which direction we took when we branched on src; we will always reach pc4, and so the value of the guard src has no influence on it. Looking at all of these constraints, we can see whether there is a solution to them, and we can see that any solution requires alpha to be tainted. So this is discovering the implicit flow, that is, that the tainted source is tainting the destination. So, should we always track information flow constraints? Well, tracking them with a pc label can sometimes lead to false alarms. In this program, the dst value is always going to be 0, no matter what the value of src is, and so there is no information leak. And yet, the assignments to dst will be tainted according to the use of the pc label, and therefore the analysis will produce a false alarm. Extra constraints due to the pc label can also hurt performance. As it turns out, the copying example we saw before is pathological. Developers typically don't write programs like this, and generally, implicit flows have little overall influence in well written programs. So, as an engineering question, oftentimes implicit flows are not tracked in industrial analyses. One caveat, though: if you're analyzing untrusted programs, for example, mobile code that's being downloaded by your browser or some other system that you run locally, there is a greater chance that an adversary will maliciously exploit this weakness in your analysis to try to taint information, or perhaps leak it. However, this situation aside, tainting analyses tend to ignore implicit flows.
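Here's a small sketch, in Python, of how the pc label changes the analysis of the two-branch example (the final dst plus-equals 0 line, which happens under an untainted pc, is omitted for brevity). It's deliberately specialized to this one program; a real analysis would generate and solve constraints. The lattice is just untainted below tainted, with join as logical or:

```python
TAINTED, UNTAINTED = True, False

def join(a, b):
    # Lattice join for the two-point taint lattice: untainted < tainted.
    return a or b

def analyze(src_taint):
    # Analyze:  if (src) dst = 0; else dst = 1;   inferring dst's label.
    # Each assignment constrains label(dst) to be at least
    # join(label(rhs), pc), so the pc label flows into dst.
    pc1 = UNTAINTED                     # entry: no tainted influence yet
    pc_branch = join(pc1, src_taint)    # both branches guarded by src
    dst_then = join(UNTAINTED, pc_branch)   # dst = 0 (untainted constant)
    dst_else = join(UNTAINTED, pc_branch)   # dst = 1 (untainted constant)
    return join(dst_then, dst_else)         # least solution for alpha
```

When src is tainted, the pc label is bumped inside both branches and dst comes out tainted, even though only untainted constants are assigned; when src is untainted, dst stays untainted.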
Challenges and Variations
[SOUND] We have considered how to analyze most of the key elements of a language, but not all of them. A robust tool obviously has to handle them all. First, while assignments obviously transfer the taint from the source to the target, what should happen if the source is an expression rather than a variable? We must define the taint of operators. Usually, if an argument to an operator is tainted, then the expression containing that operator is, too. So far, we've assumed we know which function a call will invoke. This is important for linking up the flow constraints at the caller with those in the callee. But what if we make a call through a function pointer? The pointer is not assigned until runtime, so a static, compile-time analysis cannot always be sure of its target. As it turns out, we can use a kind of flow analysis to determine the set of possible targets a function pointer could contain. The analysis follows the same principles as the flow analysis that we've already seen. This way, when analyzing a call site using a function pointer, we add constraints as if all possible targets were called, rather than a single target. An even coarser analysis of possible targets, for example, one based on a function's type, could also be used rather than one that more precisely determines the possible values of the function pointer. C structs are records that have several fields. The most precise analysis can track the taintedness of each field of each struct instance separately, as if they were separate variables. But such precision can be expensive. Alternatively, an analysis can track the taintedness of the same field of all instances of a struct, as if that field was a kind of global variable. Now, this hurts precision, but is often a big boost to performance. An in-between position is to analyze only the taintedness of the struct as a whole, or to do so for only some of its fields. Note that these decisions are relevant to object oriented languages, too. 
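The operator rule just described is easy to state as code. Here's a minimal sketch, assuming expressions are nested tuples (a made-up representation): the taint of an expression is the join of the taints of its parts, so any tainted operand taints the whole expression:

```python
def expr_taint(expr, env):
    # expr is ('const', v), ('var', name), or (op, e1, e2, ...).
    # env maps variable names to their taint (True = tainted).
    kind = expr[0]
    if kind == "const":
        return False            # literals are untainted
    if kind == "var":
        return env[expr[1]]     # look up the variable's label
    # Operator node: tainted iff any operand is tainted (the join).
    return any(expr_taint(sub, env) for sub in expr[1:])
```

For example, x + 1 is tainted exactly when x is: expr_taint(("+", ("var", "x"), ("const", 1)), {"x": True}) yields True.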
You can view objects as structs containing function pointers, and so the trade-offs we've just considered apply in the analysis of object oriented languages. Another way to aggregate data in a language is to store it in an array, and there are several similar trade-offs here. You can track the taintedness of each element, or you could track it on the array as a whole. The challenge here is that while struct fields are constants, array indexes are computed by arithmetic. As such, it is sometimes hard to track them precisely at compile time, because we won't know what the arithmetic produces until we run the program. And so a hybrid approach is typically employed. Up until now, we have focused on how the analysis works. A key part of using it to perform security reviews is to properly label the tainted sources and untainted sinks, and there are many options here. We could label array indexes, format strings, and SQL strings as untainted sinks. We can also label functions that act as sanitizers or validators. These are library routines that take in tainted data and return untainted data. For example, a function that escapes an HTML string would be a sanitizer. Labeling sanitizers is needed to avoid false alarms. Flow analysis need not only apply to finding improper use of tainted data. It can also be used to discover the improper disclosure of sensitive information. In particular, we can imagine labeling sensitive data as secret, and untrusted output channels as public, and our goal is to prevent secret sources from reaching public sinks. In essence, this is the dual of the tainting problem, where the only difference is that the lattice and flow relationships are inverted. Finally, we should point out that flow analysis is not the only kind of static analysis; in fact, there are many other kinds. One common analysis, often used to assist other analysis problems, is called pointer analysis, with a common flavor of it called points-to analysis. 
This analysis determines whether two pointers are possibly aliases, that is, whether they could both point to the same memory. Earlier we saw that knowing this is important for not missing bugs. In fact, our tainted flow analysis involved a very coarse-grained form of pointer analysis within it, but a more sophisticated analysis would improve precision. Significant advances have aided both the precision and scalability of pointer analysis in recent years. Another kind of analysis is called data flow analysis, and it was developed in the 1970s as part of research on optimizing compilers. Like the flow analysis we've already considered, data flow analysis focuses on the flow of values through variables in a program, but it maintains information about this flow a bit differently. One common data flow analysis is called liveness analysis, and it determines which variables at a given program point are still live, that is, whether their values could still be read in the future. Other variables are dead, meaning that their values will be overwritten. Liveness analysis is an important part of allocating variables to machine registers during compilation: registers storing dead variables can be reused. Data flow analysis techniques are also useful for security questions, and many industrial tools use them. For example, they can also answer the tainted flow question. Finally, another significant kind of analysis is called abstract interpretation. It was invented in part to provide a theoretical explanation of data flow analysis, but it has grown into a style of analysis in its own right. The key idea of abstract interpretation is that a static analysis is an abstraction of all possible concrete runs of a program. To be practical, the abstraction must discard information, for example, which calls correspond to which returns. The key is to discard as much information as possible for purposes of scalability, while still being able to prove the property of interest. 
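Since liveness is computed backwards, from the end of the program toward the beginning, a sketch is short. Here's a minimal version for straight-line code only (real liveness analysis iterates over a control flow graph to a fixed point); each instruction is a pair of the variable it defines and the list of variables it uses:

```python
def liveness(instrs):
    # Walk the instructions in reverse: a variable is live before an
    # instruction if it's used there, or live afterwards and not
    # redefined there.  Returns the live set before each instruction.
    live = set()                 # nothing is live after the last instruction
    live_before = []
    for defined, uses in reversed(instrs):
        live = (live - {defined}) | set(uses)
        live_before.append(live.copy())
    return list(reversed(live_before))

# A hypothetical straight-line program:
prog = [("a", []),          # a = input()
        ("b", ["a"]),       # b = a + 1
        ("c", ["a", "b"]),  # c = a * b
        ("r", ["c"])]       # r = c     (a and b are dead by here)
```

By the last instruction only c is live, so a register allocator could reuse the registers holding a and b at that point.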
Abstract interpretation is at the core of several industrial strength tools. Static analysis is gaining traction in practice, with a variety of commercial products from a variety of companies, as well as open-source tools. If you'd like to learn more about static analysis, I'd recommend the book Secure Programming with Static Analysis by Brian Chess and Jacob West. It goes into more depth about how static analysis tools work, elaborating on the description in this unit, and talks more about how static analysis fits into a secure development process. For more of the mathematical details about how program analysis works, I recommend Principles of Program Analysis by Nielson, Nielson, and Hankin. It's a bit dense for the casual reader, but it's good for introducing the academic field of static analysis.
Introducing Symbolic Execution
[SOUND]. Software has bugs. It seems unavoidable. The most common ways of finding bugs are testing and code reviews. But as all users know, software nevertheless ships with bugs still unfixed. In fact, many of these bugs are known to the developers, but they were not deemed important enough to fix prior to shipping. Many other bugs are not known. Why are they not found? It could be that the bug is in a rarely used feature, for example, in a device driver for a rare piece of hardware. Or the bug arises in rare circumstances, for example, when memory is nearly exhausted. Or the bug could arise nondeterministically; the circumstances that make the bug manifest could be unpredictable, and thus make the bug hard to reproduce and hard to fix. One way we can try to find those bugs that testing misses is to use static analysis. And this conceptually makes sense, because static analysis can consider all possible runs of the program, perhaps even those that are rare, nondeterministic, and so on. We can feel heartened by the explosion of interesting ideas and tools that are being considered in research, and by the fact that companies are now selling and using static analysis tools. So there's great potential here to improve software quality. The question, though, is: can static analysis find deep, difficult bugs? In principle, yes. But in practice, maybe not so often. Okay, why is this? Well, it's because commercial viability implies that you have to deal with developer confusion, false positives, and error management. That is, quite a lot of code that has nothing to do with the core idea of the analysis itself, but rather the way that humans use the analysis. One particular thing that companies do is keep the false positive rate down. They do this by purposefully missing bugs to keep the analysis simpler. This turns out to be a good idea, because studies show that developers get tired of dealing with alarms that don't turn out to be true bugs. 
And they're not interested in sifting through, say, nine false alarms, even if on the tenth try they find a really important bug. So companies respond to this, and they make tools that will miss bugs. One of the issues that makes false alarms hard to avoid is abstraction. Abstraction is critical for static analysis because it allows us to simplify our view of what the program does, so that we can efficiently, or efficiently enough, model all possible runs. But to do this, we have to introduce conservatism. Now, the various sensitivities, like flow, context, and path sensitivity, add precision, and that reduces conservatism. But these more precise abstractions are more expensive, which means that they won't be applicable to larger code bases, at least not at short time scales. And they still will not completely eliminate the false alarms problem. One particular problem that makes false alarms hard to deal with is that the static analysis technique may use an abstraction that is not familiar to the developer. That is, the developer is not able to make sense of the alarm that the tool produces, because it was produced in a way that the developer can't relate to, and therefore can't easily triage to determine whether the bug is true or not. So for the remainder of this unit, I'm going to talk about a technique called symbolic execution. It's a kind of middle ground between testing and static analysis, and it aims to retain the best features of both techniques. We start with the observation that testing works: reported bugs are real bugs. The problem is that each test only explores one possible execution of the program. If we write assert f of 3 is equal to 5, well, we've checked that f of 3 equals 5, but not that f applied to other input values also produces whatever it is it's supposed to produce. In short, tests are complete, but they're not sound. And while we hope they generalize, they don't provide any guarantees. 
Now, symbolic execution aims to generalize testing. In a sense, it's more sound than testing, so we can write tests like this: we can say that y is equal to alpha, which represents an unknown, or symbolic, value, and then we can assert that f of y is equal to 2 times y minus 1. This is basically saying, for all y, f of y is equal to 2y minus 1. Symbolic execution can execute this program and confirm that this is indeed the case, thus performing a more general sort of test. The way that this is possible is that the symbolic executor can run the program to the point that it depends on an unknown, and then it will conceptually fork, to consider all possible values of that unknown that would affect the control flow of the program. So here we see the function f, and we see that it depends on x, its argument, at the first branch. If x is symbolic, as in our assert, then the symbolic executor will consider both branches, that is, the case when x is greater than 0 and the case when it is not. And of course, in this example, when it does that, it will discover that the assertion does not hold for non-positive values of x. Let's look at another example, just to get a flavor of how symbolic execution works; of course, we'll go into much greater detail later on. So here on the left, we have a program, and in the program we've assigned symbolic values alpha, beta, and gamma to three of the variables: a, b, and c. The rest of the program is a series of conditionals and assignments. At the end, we see an assertion that we would like to hold for all possible runs of the program. All right, so the symbolic executor will start, and it will execute line 3, initializing x, y, and z to 0. Then it will reach the conditional on line 4. This conditional depends on a symbolic value, alpha, and so it will consider whether both branches are possible, and of course they are. At first, it will consider the true branch, that is, when alpha is nonzero. 
In that case, it will assign x to be negative 2. Then it carries on to executing line 7. Here, the guard depends on a symbolic value again, and so it checks whether beta can be less than five. Again, it could be, or it might not be, so the symbolic executor will pick one direction. Here, suppose that it picks true. Now we see that we'll skip line 8, and the reason is that along this path we have assumed that alpha is nonzero, and that means the conditional on line 8 cannot hold. After assigning z to 2, we reach the assertion, and we see that the assertion succeeds. That's because x is minus 2, z is 2, and y is 0: the sum is 0, and of course 0 is not equal to 3. Now, we can characterize this particular path of our program execution using what's called a path condition. That is, along this path we assumed that alpha was true, i.e., nonzero, and beta was less than 5. So we want to also consider the other paths, by considering those branches again and going the false direction where possible. For example, we could take the false side of the branch on line 7. In that case, we'd skip lines 8 and 9 and reach the assertion again. In this case, the assertion is still going to hold, because x is minus 2, the other two are zero, and therefore the resulting sum is not 3. All right, we'll consider the other branch of line 4 once again. On this branch, we'll now assume that alpha is false, and we'll check whether beta is less than 5. Again, we have no constraints on the symbolic value beta, so we could go both ways. Suppose we go the false direction this time, thereby skipping lines 8 and 9. Here the path condition is not alpha, and beta greater than or equal to 5; that is, we're assuming that alpha is false and beta is at least 5. Now we'll go the other direction, where beta is less than 5, and consider the branch on line 8: we're assuming that alpha is not true, and we ask whether gamma can be true. Can this condition hold? Sure; it depends on what c is, and we have not constrained that. That is, we have not constrained gamma. 
So suppose we go in the false direction; in this case, we'll just assign z to be 2, because we skip the assignment to y. If, however, we go in the true direction, then y equals 1 and z equals 2, and now the sum of y being 1, z being 2, and x being 0 is 3, and that violates our assertion. Importantly, the assertion violation comes with the path condition not alpha, and beta less than 5, and gamma, and this precisely states all of the inputs that will cause this test case to fail. That is, if we used a satisfiability solver to give us a solution to this path condition, we could produce a concrete test case that exhibits the failure. And of course, this is very useful, because the developer can then run the test case and use it to diagnose the reason for the failure in the program. Notice that each path through the tree we just constructed represents a set of possible executions. More precisely, it represents all of those executions whose concrete inputs are solutions to the path condition. This gives a very precise notion of code coverage, viewing symbolic execution as a kind of testing: complete coverage of the program would be all of its paths, and symbolic execution allows us to systematically consider many of these paths. Viewed as a kind of static analysis, symbolic execution is complete, in that whenever a symbolic executor claims to have found a bug, the claim is true. However, it is rarely sound, because it is unlikely to cover every path. The reason is that most programs do not have a finite number of paths, and so symbolic execution will not terminate except by imposing a time limit or other resource limit. Symbolic execution's high precision, the source of its completeness, can be characterized as path, flow, and context sensitivity in static analysis parlance.
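To make the path exploration concrete, here's a small sketch in Python of the example just discussed. The program and its five path conditions are written out by hand, and a brute-force search over a tiny input domain stands in for the satisfiability solver:

```python
from itertools import product

def run(a, b, c):
    # The example program: returns whether the assertion holds.
    x = y = z = 0
    if a:                       # line 4: branch on alpha
        x = -2
    if b < 5:                   # line 7: branch on beta
        if (not a) and c:       # line 8: branch on gamma
            y = 1
        z = 2
    return x + y + z != 3       # the assertion

# One path condition per path through the branch tree, as predicates.
paths = [
    lambda a, b, c: a and b < 5,
    lambda a, b, c: a and b >= 5,
    lambda a, b, c: (not a) and b >= 5,
    lambda a, b, c: (not a) and b < 5 and c,
    lambda a, b, c: (not a) and b < 5 and (not c),
]

def witness(pc):
    # Brute-force a concrete input satisfying the path condition,
    # standing in for an SMT solver query over a small domain.
    for a, b, c in product([0, 1], [0, 9], [0, 1]):
        if pc(a, b, c):
            return (a, b, c)
    return None

# Solve each path condition and keep the witnesses that fail the assertion.
failing = [witness(pc) for pc in paths if not run(*witness(pc))]
```

The only path whose witness violates the assertion is not alpha, beta less than 5, and gamma, solved here by the input (0, 0, 1), matching the walkthrough above: x is 0, y is 1, z is 2, and the sum is 3.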
Symbolic Execution- A Little History
[SOUND] Symbolic execution is an old idea. It was first proposed in the mid 1970s; the most cited early paper on the subject is James King's PhD thesis, a synopsis of which was presented in the Communications of the ACM in 1976. So if symbolic execution is such an old idea, and it seems like such a good one, why didn't it take off? Why haven't we been using it for a long time? Well, one reason is that symbolic execution can be compute-intensive. For big programs, we'll consider many, many possible paths, and along each of those paths we'll need to query an SMT solver to decide which paths are feasible and in which assertions could be false. And such queries could involve much of the program state, which could be very large. At the time symbolic execution was invented, computers were relatively slow, with not much processing power, and small, with not much memory. Today's computers are much faster and much bigger, with both more memory and more computing power; an Apple iPad is as fast as a Cray-2 from the 80s. And we also have better algorithms. SMT solvers are much more powerful today because of smarter optimizations that people have implemented over time. They can solve very large problems very quickly, and that lets us check assertions and prune infeasible paths. So let's take a minute to look at both the improved hardware and the improved algorithms. Here's a quick chart that shows trends of hardware improvement. Along the x axis is the year, and along the y axis is the number of operations that computers can perform. Because the y axis is on a logarithmic scale, the apparently linear growth is actually an exponential rise in computing power. On the algorithm side, we can see that the time to solve a particular instance has been improving even while keeping the hardware steady. Here, the y axis is the time that it takes to solve a particular problem, 
while on the x axis we list the year of the algorithm we're considering. All of these algorithms run on the same computer, that is, on 2011 hardware. So we can see that the time to run has been decreasing as the algorithms have improved over time. And though small instances have not improved so much, since 2008 big problems have been getting solved faster. Symbolic execution has experienced a resurgence of interest in the last ten years or so, starting in about 2005. The main motivation behind this resurgence was to show that symbolic execution could find very interesting bugs that were often missed by normal testing.
Basic Symbolic Execution
[SOUND]. Now we'll look in greater detail at how symbolic execution works. We'll start with a typical programming language. Here we show a grammar of expressions, denoted by the letter e. Expressions consist of things like integers n, variables x, expressions involving arithmetic operators like addition, and expressions involving comparisons, like disequality. To support symbolic execution, expressions also include symbolic variables, which we denote with Greek letters, like alpha. Symbolic variables represent inputs to the program. These inputs could come from many sources, including reads from files or network sockets, or memory mapped files or devices. Whereas normal tests must provide specific inputs, symbolic execution models inputs using symbolic variables, and thus can explore how different executions are induced by different values of these variables. These values are discovered by solving the constraints generated during execution, as we'll see in more detail shortly. >> Okay, now that we've defined our symbolic language, we need to define its semantics. One way to do that is to make or modify a language interpreter so that it can compute symbolically. Normally a program's variables contain values, but for our symbolic interpreter they will also be able to contain symbolic expressions, which are program expressions that mention symbolic variables. As example normal values, we have the integer 5 and the string "hello". As example symbolic expressions, we have alpha plus 5; the string "hello" concatenated with some symbolic string alpha; the expression alpha plus beta plus 2 used as an index into the array a; and so on. So let's see how we use these symbolic expressions in an example execution. Here we have a simple straight-line program. On the left-hand side we'll consider concrete memory; we'll look at how a concrete, that is, normal, execution of the program might go, and then we'll look at how it might go for our symbolic interpreter. 
So on the first line, we perform the read. Let's say, for argument's sake, it reads the value 5. Next, we assign to y. What do we assign it? Well, 5 plus the contents of x, which are 5, is 10. Next, we execute line three. Here we store 17, because 7 plus 10 is 17. And then finally, we use z, which is 17, to index the array a. Notice here that we're assuming a contains four values, so a sub 17 is off the end of the array; that is, it's an out of bounds access. Okay, now let's consider the symbolic execution of this program. Instead of returning a concrete value for x, the read will return a symbolic variable alpha, and we store that instead. For y, we can't store a plain value anymore, because 5 plus alpha can't be further reduced, so instead y is assigned the symbolic expression 5 plus alpha. Next, we assign to z. Here, 7 plus 5 plus alpha is 12 plus alpha, so we store that. And then, finally, we use the symbolic expression 12 plus alpha, the contents of z, to index the array a, and this could be an overrun. Of course, we just saw an example of one in the concrete execution. That is, there is a solution for the expression 12 plus alpha that creates an index outside of a's bounds. We'll explain how we can find that in just a moment. Now, as we saw with our overview example, program control can be affected by symbolic values. If you branch on a symbolic expression, it could be that, depending on the solution of the variables in that expression, you could go either true or false, or both. In this example, we see that we read into x and then branch on it, and so that's a branch on a symbolic expression. We represent the influence of symbolic values on the current path, that is, the list of the branches we have taken, true or false, as a path condition pi. So, as an example, line three is reached when the path condition is alpha greater than 5. Line five is reached when alpha is greater than 5 and, because it's nested, alpha is less than 10. 
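Here's a minimal sketch of the symbolic store for this straight-line program, in Python, with symbolic expressions represented as tuples like ('+', 12, 'alpha') (a made-up representation, of course). A brute-force search over small values of alpha stands in for the solver query that asks whether the index can fall outside the array's bounds:

```python
def add(n, v):
    # Add a concrete n to a (possibly symbolic) value, constant-folding
    # where we can: n + (m + alpha) becomes (n + m) + alpha.
    if isinstance(v, int):
        return n + v
    if v[0] == "sym":
        return ("+", n, v[1])
    if v[0] == "+":
        return ("+", n + v[1], v[2])
    raise ValueError(v)

def sym_exec():
    # Symbolically execute:  x = read(); y = 5 + x; z = 7 + y;
    store = {}
    store["x"] = ("sym", "alpha")    # read() yields a fresh symbolic alpha
    store["y"] = add(5, store["x"])  # y holds 5 + alpha
    store["z"] = add(7, store["y"])  # z folds to 12 + alpha
    return store

def may_overflow(term, lo, hi):
    # Can some alpha push the index n + alpha outside [lo, hi)?
    # Brute force over a small range, standing in for a solver query.
    _, n, _ = term
    return any(not (lo <= n + a < hi) for a in range(-64, 64))
```

Running sym_exec() leaves z holding ('+', 12, 'alpha'), and may_overflow confirms there's a solution putting 12 plus alpha outside the bounds of the four-element array a.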
And line six, when alpha is less than or equal to 5. That is, we've taken the false branch in the case that we checked whether alpha was greater than 5, alpha being the contents of x. Now, whether a path is feasible is tantamount to its path condition being satisfiable. So here we list three path conditions for this program. It's a slight modification of the program we just saw, with a branch on line four: is x less than 3? We can see here that the path with condition alpha greater than 5 and alpha less than 3 is not possible. And indeed, this path condition is not satisfiable: there is no alpha such that both constraints hold. Now, given a path condition, we can come up with a solution to the constraints in it, and that solution can be used as the input to a concrete test case that will execute exactly that path. So the solution to reach line three could be alpha equals 6, and a solution to reach line six could be alpha equals 2. Now, assertions, like array bounds checks, are conditionals. So if we look at our original program, we can think of, just prior to indexing the array a, performing two checks to see whether the index expression z is within the bounds of the array. That is, is z less than 0, and is it greater than or equal to 4? If the answer is yes, we've gone out of bounds and we should abort the program. So we can think of our symbolic executor as inserting these extra statements and then performing its normal process of generating a path condition. Along these first four lines, pi is true because there are no branches. Now, to reach line five, it would have to be the case that 12 plus alpha, that is, the symbolic expression we had stored in z, is less than 0. If we managed to get past this, that is, we took the else branch to get to line six, then we would have the negation of that condition in the path condition.
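The connection between feasibility and satisfiability can be sketched directly. A real executor would hand the path condition to an SMT solver; here, as an illustrative simplification, we just enumerate a small range of candidate values for alpha.

```python
# Brute-force satisfiability check for path conditions over one variable.
# Each constraint is a predicate on alpha; a returned witness doubles as a
# concrete test input that drives execution down exactly that path.

def satisfiable(constraints, lo=-100, hi=100):
    """Return a witness value for alpha satisfying all constraints, or None."""
    for alpha in range(lo, hi + 1):
        if all(c(alpha) for c in constraints):
            return alpha
    return None

# pi for line 3: alpha > 5 -- feasible, witness 6
print(satisfiable([lambda a: a > 5]))
# pi for line 6: alpha <= 5 -- feasible
print(satisfiable([lambda a: a <= 5]))
# alpha > 5 AND alpha < 3 -- infeasible, no witness
print(satisfiable([lambda a: a > 5, lambda a: a < 3]))  # None
```

The infeasible conjunction is exactly the lecture's third path condition: no alpha makes both constraints hold, so no test input can ever drive the program down that path.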
To get to line seven, we would also have to check that z is greater than or equal to 4, that is, 12 plus alpha is greater than or equal to 4, along with that negation. If that was not true, then the path condition would negate that as well, and to reach line eight, we would have this path condition. So, if either line five or line seven is reachable, that is, the path conditions reaching them are satisfiable, we have found an out-of-bounds array access. >> Whereas a normal execution must take either the true or false branch of a conditional, a symbolic execution could actually take both branches. This is possible because when a symbolic variable, or variables, appear in a conditional's guard expression, there could exist some solutions that would make the guard true, and some solutions that would make it false. As such, a symbolic executor could choose to go either, or both, ways. How should it choose which, and in what order? One point to make is that the executor doesn't immediately know whether the true or false branch, or both, is feasible when it reaches a conditional. To know for sure, it could invoke the solver to check the satisfiability of the path condition. If the path condition conjoined with the guard is feasible, then the true branch can be explored. Likewise, if the conjunction with the negation of the guard is feasible, then the false branch can be explored. However, making calls to the solver at each conditional can be expensive. So one idea is to just pick one branch and go that way, delaying the feasibility check, at the risk of doing extra, unnecessary work. A related approach is to use concolic execution to determine the path to take. The idea here is to run the program concretely on actual inputs, but treat some of those inputs as if they were symbolic on the side. The concrete inputs, which guarantee feasibility, determine which direction to go.
The symbolic executor maintains symbolic versions of these expressions and the path condition as usual, which it then uses to generate subsequent tests. We'll say more about concolic execution a bit later. Okay, so now we can consider a symbolic execution algorithm as pseudocode, and it's very simple. First, we create the initial task, which is a triple. It consists of the program counter location, which is 0, the start of the program; the path condition pi, which starts out as empty; and the symbolic state, which also starts as empty. And we add that task to the worklist. Now, while the worklist is not empty, we will pull some task from that worklist and we will execute it. Eventually we'll reach a point where that execution forks. Let's say the state at that point has the program counter at pc0, the path condition at pi0, and the symbolic state at sigma0. And let's say the reason for the fork is a branch: we're branching on p at line pc0, where the true branch is at pc1 and the false branch is at pc2. Well, in that case, what we'll do is add the task (pc1, pi0 and p, sigma0) to the worklist, as long as pi0 conjoined with p is feasible. That is, when the current path condition conjoined with the guard, which contains symbolic variables, is feasible, then we have a legitimate task that we can execute, and so we should add it to our list. We'll add the alternative, that is, going down the false branch, in the case that not p conjoined with the path condition is feasible. And notice here that we could add either or both tasks to the worklist. So we can keep on running until either the list becomes empty, in which case we've considered every path in the program, or eventually we get tired and terminate, say after some timeout. Now, in a practical implementation, we have to consider not just the main program we're executing, but also libraries and native code.
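The worklist pseudocode above can be sketched in a few lines of Python. The tiny hard-coded program and the brute-force feasibility check are illustrative stand-ins for a real executor and a real SMT solver.

```python
# Worklist symbolic execution: tasks are (pc, path_condition, state) triples.
# Branching on a symbolic guard forks two tasks, each added only if its
# extended path condition is feasible.

def feasible(pi):
    """Brute-force stand-in for an SMT solver call."""
    return any(all(c(a) for c in pi) for a in range(-50, 51))

# Program: 0: if alpha > 5 goto 1 else goto 2 ; 1 and 2 are leaf "lines".
program = {0: ("branch", lambda a: a > 5, 1, 2),
           1: ("leaf", "then"),
           2: ("leaf", "else")}

def explore():
    worklist = [(0, [], {})]          # initial task: (pc0, empty pi, empty sigma)
    reached = []
    while worklist:
        pc, pi, sigma = worklist.pop()
        instr = program[pc]
        if instr[0] == "leaf":
            reached.append(instr[1])
            continue
        _, guard, t, f = instr
        if feasible(pi + [guard]):                    # true branch feasible
            worklist.append((t, pi + [guard], dict(sigma)))
        neg = lambda a, g=guard: not g(a)
        if feasible(pi + [neg]):                      # false branch feasible
            worklist.append((f, pi + [neg], dict(sigma)))
    return reached

print(sorted(explore()))  # ['else', 'then'] -- both sides of the branch are feasible
```

Because the guard mentions a symbolic variable, both the true and false tasks get added, matching the lecture's point that either or both branches may be pushed onto the worklist.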
At some point, the symbolic executor will reach the edges of the application, and the interpreter won't be able to make further progress. This may happen with a library, or system calls, or assembly code, or something like that. So one thing we could do is pull that code in, too, and symbolically execute it. For example, if we're symbolically executing C code, we could symbolically execute the C code that implements the standard libraries. But real implementations of the standard libraries are often very complicated, and they might involve assembly code and the like. And so a symbolic executor will often get stuck running around loops and not making any progress getting out of that library. So you could use a simpler version of the library instead that's still semantically accurate. Or you could create a model of the code that you want to execute. For example, you could create a model of a RAM disk to model the kernel file system code, rather than trying to, say, symbolically execute the Linux kernel. So let's return to the idea of concolic execution I mentioned before; this is also sometimes called dynamic symbolic execution. The idea here is that, instead of running the program according to the algorithm we just saw a moment ago, we run the program concretely, but we instrument it to do symbolic execution on the side. That is, the instrumentation maintains, alongside the concrete program state, a shadow symbolic state involving symbolic variables. The initial concrete state is going to determine the path that we take, so it could be randomly generated. But we will keep track of the choices we made when guards involve our shadow symbolic variables, so we can build up a path condition along the way. A concolic executor will thus explore one complete path at a time, start to finish. And once it's done so, it can determine the next path to explore by simply negating some element of the last path condition.
And then solving that negated path condition will produce concrete inputs for the next test. While we're symbolically executing using concolic execution, we'll always have the concrete underlying values to drive the path. But these turn out to be useful in other ways too. That is, the process of concretization is very easy in a concolic execution. And what I mean by concretization is replacing symbolic variables with concrete values that satisfy the path condition. Basically, we're going to drop symbolic-ness, and potentially miss some paths, but we're going to simplify our symbolic constraints by doing so. This allows us, for example, to do system calls that might involve symbolic variables. We just pick reasonable concrete values for those variables and do the system calls. In so doing, we lose the symbolic-ness, but we're able to continue to make progress in trying to find bugs. Replacing symbolic values with concrete ones can also be useful when SMT solver calls would be too complicated if we left all of the symbolic values in.
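The concolic loop just described can be sketched on the earlier branching example: run the program on a concrete input while recording, on the side, the path condition over the symbolic input alpha; then negate the last recorded branch and solve for a new input. The brute-force solve function is an illustrative stand-in for an SMT solver.

```python
# Concolic execution sketch: concrete runs drive the path, a recorded trace
# of symbolic branch outcomes generates the next test input.

def program(x, trace):
    """Models: if x > 5: (if x < 10: line5 else: other) else: line6."""
    if x > 5:
        trace.append(("gt5", True))
        return "line5" if x < 10 else "line_other"
    trace.append(("gt5", False))
    return "line6"

def solve(pi):
    """Find a concrete alpha matching the requested branch outcomes."""
    for a in range(-50, 51):
        if all((a > 5) == want for name, want in pi if name == "gt5"):
            return a
    return None

# Generation 0: some concrete input (say 7) drives one complete path.
trace = []
print(program(7, trace))            # line5; recorded pi is [("gt5", True)]

# Negate the last recorded branch, solve, and run the next test.
name, taken = trace[-1]
new_input = solve([(name, not taken)])
print(program(new_input, []))       # line6 -- the previously unexplored side
```

Each iteration explores one complete path start to finish, and negating a trace element yields the input for the next path, exactly as in the lecture's description.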
Symbolic Execution as Search, and the Rise of Solvers
[SOUND] As is probably apparent, a good symbolic executor requires careful engineering. We've seen that symbolic execution employs an appealingly simple algorithm, but with high computational costs. In particular, the executor might traverse many different paths in the program, and it must make potentially expensive calls to an SMT solver to determine which paths are feasible. Considering the problem of traversing many paths, we will see that symbolic execution boils down to a kind of search problem. The goal is to search through a large space of possibilities to find events of interest. For us, these events are bugs. We'll also take a minute to look at how improvements in SMT solver performance make the path feasibility check possible even for larger programs. A well-known problem with symbolic execution is the path explosion problem. Essentially, for a symbolic executor to consider the entirety of a program's space of executions, it needs to consider every path. But because most programs have a huge number of paths, we can't usually run symbolic execution to exhaustion. In fact, the number of paths can be exponential in the branching structure. Here we see a program that has four lines of code, and yet, because it branches on three variables, it has eight program paths, that is, two to the third possible paths, despite using each variable only once. Loops on symbolic variables are even worse. In this program we're looping on variable a, which is symbolic. This is going to cause us to consider an arbitrary number of loop iterations, because every time we re-reach the head of the loop, we consider whether to stay in the loop or whether to exit. Now, if we compare symbolic execution to static analysis, we can see that there's a clear benefit of static analysis: it will actually terminate even when considering all possible runs. And it does this by approximation and abstraction, approximating multiple loop executions or branch conditions, and so on.
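The exponential blow-up in the four-line example can be checked directly: three independent symbolic branches mean every combination of true/false outcomes is a distinct path.

```python
# Each branch contributes an independent true/false choice, so three
# branches yield 2**3 = 8 distinct program paths.
from itertools import product

paths = list(product([True, False], repeat=3))  # one boolean per branch
print(len(paths))  # 8
```

With n independent branches the count is 2**n, which is why exhaustive path enumeration becomes hopeless so quickly.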
But as we discussed to motivate symbolic execution, static analysis's use of abstraction can lead to false alarms. So what can we do to improve the way that symbolic execution operates, to maximize its benefits? Well, to understand that, we have to look at how symbolic execution can be viewed as a search algorithm. The simplest way to perform symbolic execution is to use either depth-first search or breadth-first search. That is, recalling our algorithm for symbolic execution, for depth-first search we can use a worklist that is a stack, so it will consider the most recent node, that is, the most recent program state, when deciding what to do next. Or we can make it a queue, where we prioritize things that we put into the queue earlier. Now, the potential drawback of using either one of these two strategies is that neither of them is guided by any higher-level knowledge. That is, we aren't telling the search that we might be looking for particular bugs, or that we want to get to a particular part of the program. Instead, it will just blindly follow the program structure in considering what paths to execute. DFS in particular could easily get stuck in one part of the program by continuing to go deeper and deeper. Think about that loop example: we could go around the loop over and over and over again, whereas in breadth-first search, we'll consider both one further loop execution and the case that we get out of the loop. Therefore, of these two, breadth-first search is probably a better choice. On the other hand, it is more intrusive to implement; for example, it can't easily be made concolic. So what better search strategies might we employ? Well, one thing we can do is try to prioritize our search, to steer it towards paths more likely to contain assertion failures, since that's ultimately what we're trying to find. And recall that assertion failures could involve array bounds checks, null pointer checks, and so on.
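The stack-versus-queue distinction is literally the only difference between DFS and BFS in the worklist algorithm, as this tiny sketch shows.

```python
# The worklist discipline determines the search order:
# a stack (LIFO) gives depth-first search, a FIFO queue gives breadth-first.
from collections import deque

stack = [1, 2, 3]           # DFS: explore the most recently added task first
print(stack.pop())          # 3

queue = deque([1, 2, 3])    # BFS: explore the earliest-added task first
print(queue.popleft())      # 1
```

Swapping `pop()` for `popleft()` in the explore loop is all it takes to switch strategies, which is why neither embodies any knowledge about where the bugs are.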
We only want to run for a certain amount of time, and so the prioritization ensures that that time is used best. To consider these different search strategies, think of program execution as a kind of DAG. The nodes in the graph are program states, and an edge between two nodes means that one state can transition to the next. Therefore, we can think of symbolic search as a kind of graph exploration algorithm, trying to pick among the various possible paths in the overall search space. One useful technique to employ is randomness. We don't know a priori which paths to take, so adding some randomness is probably a good idea. This is something that modern SAT solvers do, and it works very effectively in that setting. So one idea is to pick the next path to explore uniformly at random, and we can call this the random path strategy. Another thing we can do is randomly restart the search if we haven't found anything interesting in a while. So imagine that we explore depth-first for a while, but as we go further and further down a particular path, we increase our chances of stopping and going back to the start to try a different path altogether. And we could choose among equal-priority paths at random. That is, if we had a prioritization strategy that, say, favored coverage, because we wanted to try paths that cover statements we haven't covered before, and several paths had equal priority, we could just flip a coin to pick one versus the other. Now, one drawback of randomness is reproducibility. That is, because we may have made random choices to find a particular bug, if we run the symbolic executor on the same program two times, it's not guaranteed to find the bugs that it found before, and in fact it might find new bugs. This doesn't necessarily sound so bad, except that it complicates software engineering: for example, once we fix a bug that we found, how can we run the symbolic executor again to confirm that it's not there anymore?
So it's probably good to use pseudo-randomness based on a seed, and then record which seed was picked when bugs are found. That way you can confirm that you don't hit the bug again, or if you re-run, you'll find the same bugs you found before. Now, a moment ago I hinted at the idea of using coverage to guide your prioritization strategy. Here we'll consider it in a bit more detail: we should try to visit statements we haven't seen before when choosing which paths to follow. So the approach is to score a statement based on the number of times it's been traversed, and then pick the next statement to explore that has the lowest current score. This might work because errors are often in hard-to-reach parts of the program, and the strategy will favor going to parts of the program it hasn't seen before. On the other hand, we may never be able to get to a statement if a proper precondition is not set up, and so just favoring going closer to that statement is not necessarily going to get us there. Another strategy is called generational search. You can think of it as a kind of hybrid between breadth-first search and a coverage-guided search. It works like this. At generation 0, we pick a program and a random input to it, and run it to completion. At the end, we take paths from generation 0 and negate one branch condition in a path to yield a new path prefix. We find a solution for that prefix and then take the resulting path, semi-randomly assigning to any variables not constrained by the prefix. For generation n, we similarly apply this approach, but branching off of generation n minus 1. And we use a coverage heuristic to pick the priority among the different paths that we choose. This generational search is often used with concolic execution. Now, it's probably obvious at this point that no one search strategy wins all the time. Instead, different search strategies may find some bugs that other search strategies will not.
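The coverage-guided scoring idea can be sketched with a priority queue keyed on traversal counts. The statement names and counts here are purely illustrative.

```python
# Coverage-guided prioritization: score each candidate statement by how
# often it has been traversed, and explore the least-covered one first.
import heapq

visit_count = {"stmt_a": 5, "stmt_b": 0, "stmt_c": 2}   # illustrative counts

pending = [(count, stmt) for stmt, count in visit_count.items()]
heapq.heapify(pending)

score, nxt = heapq.heappop(pending)  # lowest score wins
print(nxt)   # stmt_b -- the never-visited statement is chosen first
```

After each execution step the counts are updated, so the heap steadily steers the search toward code it has not yet seen.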
So one obvious idea is to combine searches by doing multiple searches at the same time: essentially, have parallel processes and switch from one to the next. Depending on the conditions needed to exhibit the bug, one search may find it where another does not. You could even imagine using different algorithms to reach different parts of the program, depending on the properties of the program that you're trying to find bugs in. SMT stands for satisfiability modulo theories, where by theories we mean mathematical theories beyond just boolean formulas. Such theories could include a theory of arithmetic over integers, for example. These theories can sometimes be encoded as SAT problems, and so some SMT solvers are basically just front ends that translate the SMT problem into a SAT problem. For example, the theory of bit vectors, that is, arrays of bits with operations on them like addition, subtraction, and so on, can be encoded as a SAT problem. It turns out that modern SMT solvers implement a variety of optimizations, either in the SAT engine or in the translation to it, and these optimizations can make a big difference in performance. One example is to make use of axioms that witness equivalences. Another example is to directly support, in the SMT solver, a theory of arrays, which models modifications to memory. The alternative would be to have the symbolic executor simulate memory directly, rather than rely on the solver, but then optimizations become less apparent. Another important optimization is to cache solver queries. It turns out that symbolic executors will repeatedly submit queries that contain identical subexpressions, and this makes sense because the path condition is built up incrementally over time. Finally, we can make the SAT solver's job easier by eliminating some symbolic variables from consideration. For example, if we are testing whether a guard is feasible, we only care about the existence of a solution for the variables in the guard.
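The query-caching optimization can be sketched with a simple memo table. Because the path condition grows incrementally, identical queries recur, and a cache avoids repeated solver calls; the brute-force solver and the call counter here are illustrative.

```python
# Caching solver queries: the second submission of an identical query is
# answered from the cache instead of re-invoking the (expensive) solver.

calls = {"count": 0}

def expensive_solve(constraints):
    """Stand-in for an SMT solver call; counts how often it really runs."""
    calls["count"] += 1
    return any(all(c(a) for c in constraints) for a in range(-50, 51))

cache = {}

def cached_solve(key, constraints):
    """key is a hashable rendering of the query, e.g. its textual form."""
    if key not in cache:
        cache[key] = expensive_solve(constraints)
    return cache[key]

query_key = ("alpha > 5",)
cached_solve(query_key, [lambda a: a > 5])   # miss: solver runs
cached_solve(query_key, [lambda a: a > 5])   # hit: answered from cache
print(calls["count"])  # 1
```

Real executors cache at the level of normalized subexpressions, so even partially overlapping path conditions can reuse earlier results.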
Such a solution may involve other variables, which can be syntactically related to the ones in the guard. That is, we keep under consideration constraints whose variables are transitively related to variables in the guard; the other constraints can be ignored. In total, these optimizations can make an enormous difference. SMT solvers have been a popular topic of research and development over the last several years, and there are several that you can get for free online. Z3 is quite sophisticated; it's developed at Microsoft Research. Yices, developed at SRI, has been around for quite a while and continues steady improvement. STP, which is the SMT solver used by EXE and KLEE, is available for free. And CVC is yet another. While smarter search strategies and SMT-based optimizations will help, ultimately path-based search is limited. To see why, consider this program. We can see that it has a loop that runs 100 times, and each time it goes around the loop, it may either enter the if statement or not. This means it has 2 to the 100 possible execution paths. Now, the program asserts that it's a bug if the counter is ever equal to 75. So this bug is going to be hard to find. There are about 2 to the 78 paths that reach the buggy line of code, that is, 100 choose 75: out of the 100 iterations, exactly 75 of them need to take the true branch. But then the probability of finding the bug is very low, because about 2 to the 78 paths find it out of 2 to the 100 total. So the chances are about 2 to the negative 22 that we will find the bug. In short, independent of search strategy, we'll have a very difficult time finding the bug using a path-based search algorithm. This is a fundamental limitation of path-based search in symbolic execution.
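The arithmetic behind the 2 to the negative 22 figure is easy to check: of the 2^100 paths, the buggy line is reached on exactly 100-choose-75 of them.

```python
# Check the path-explosion arithmetic from the loop example.
import math

total_paths = 2 ** 100
buggy_paths = math.comb(100, 75)   # paths where exactly 75 iterations take the branch

print(round(math.log2(buggy_paths)))                # about 78
print(round(math.log2(buggy_paths / total_paths)))  # about -22
```

So a uniform random choice among paths hits the bug with probability roughly one in four million, regardless of how cleverly the individual paths are solved.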
Symbolic Execution Systems
[SOUND] Symbolic execution was invented in the 1970s, but as I said before, it failed to take hold at that time as a commonly used tool, and this is because of limitations in both the algorithms and the machine power available at that time. In the mid-2000s, two key systems triggered a revival of symbolic execution. These systems were DART, developed by Godefroid and Sen and first published in 2005, and EXE, by Cadar, Ganesh, Pawlowski, Dill, and Engler, published in 2006. These systems demonstrated the promise of symbolic execution by showing it could find real bugs in interesting and complicated systems, and as such they spurred interest in the topic, and many new systems have been developed since. One important system that has made a big impact in practice is SAGE. It is a concolic executor, developed at Microsoft Research, that grew out of Godefroid's work on DART. It uses the technique of generational search that we talked about before. SAGE primarily targets bugs in file parsers, for example parsers for JPEG, Microsoft Word, Microsoft PowerPoint, and other document formats. This is a good fit for concolic execution because parsing files is likely to terminate, and it only needs to consider input/output behavior rather than complicated and interactive system calls. SAGE has transitioned from research prototype to production tool; it's been used on production software at Microsoft since 2007. Between 2007 and 2013, SAGE was used remarkably often at Microsoft. For example, it's been run for more than 500 machine-years as part of the largest fuzzing lab in the world. It's generated more than 3.4 billion constraints, making it the largest SMT solver usage ever. It's been applied to hundreds of applications, and found hundreds of bugs that were missed by other bug-finding techniques. For example, one third of all Win7 WEX bugs were found by SAGE. Bug fixes shipped quietly to one billion PCs around the world, saving millions of dollars both for Microsoft and the world.
And SAGE is now used daily in Windows, Office, and all of the big Microsoft products. KLEE is another mature tool; in this case, it grew out of the work on EXE. It symbolically executes LLVM bitcode. LLVM is a compiler framework that's now in regular use at Apple. It compiles source-language programs to an intermediate representation called LLVM bitcode, which is stored in a .bc file, and KLEE runs on that .bc file. It works in the style of the basic symbolic executor that we saw before, but it uses fork to manage multiple states. That is, rather than having an explicit worklist, each time it could go both directions at a branch, it forks another version of itself, which proceeds along that branch, and a separate tool manages all of the distinct copies of KLEE that are running at once. KLEE employs a variety of search strategies, primarily random path and coverage-guided. And it mocks up the environment to deal with system calls and file access and so on. You can get it as part of the LLVM distribution. Looking at how KLEE actually works: in the original KLEE paper, published in 2008, they applied KLEE to Coreutils, which is the suite of small programs that runs on Linux and Unix distributions. These are programs like mkdir and paste and sed and ls and things like that. The graph shows the amount of coverage from KLEE-generated tests compared to the manual test suite that comes with Coreutils. What this chart shows is that KLEE is better at testing than the manual test suite, and we make that determination by looking at the lines of code covered by tests. In particular, when we subtract the lines covered by manual tests from the lines covered by KLEE tests, we see that for all but nine programs, KLEE covers more lines than the manual tests do. Here are a bunch of bugs that were found by KLEE: these are the command lines that produced the bugs, that caused these Coreutils programs to crash.
Notably, the bugs involved very simple command lines and very simple input files, and yet they went undetected for quite a long time before KLEE found them. Since these second-generation tools, KLEE and SAGE, there have been further developments. One of them is the Mayhem system, developed at Carnegie Mellon by Dave Brumley and his collaborators, and it runs on actual executables, that is, binaries. Like KLEE, it uses a breadth-first-search-style search, but it also uses native execution, combining the best of symbolic and concolic strategies. And as an interesting twist, it will automatically generate exploits when bugs are found, to shine a light on the fact that these bugs could in fact be security-critical. In further developments from the Mayhem work, the same group produced MergePoint, which uses a technique called veritesting that combines symbolic execution with static analysis. What it does is use static analysis for complete code blocks: it will analyze a straight-line piece of code and in so doing basically convert it into a formula that can be used by an SMT solver to ask a question about the program's execution. It will use symbolic execution for hard-to-analyze parts of the program, in particular loops, for which it may be hard to determine how many times they will run, as well as complex pointer arithmetic, system calls, and so on. This produces a better balance of time between the solver and the executor, resulting in better bug finding: bugs are found faster, and more of the program is covered in the same period of time. Incredibly, MergePoint found more than 11,000 bugs in 4,300 distinct applications in a Linux distribution, including new bugs in highly tested code, like the Coreutils programs tested by KLEE. There are many other symbolic executors. Cloud9 is basically a parallel KLEE. jCUTE and Java PathFinder perform symbolic execution for Java.
Bitblaze is a binary analysis framework, so like Mayhem it works on binary executables. Otter is a directed symbolic execution framework for C: you can tell it which lines of the program you'd like to reach, and it will try hard to get to those lines when performing its search. Pex is a symbolic executor for .NET programs. >> This unit delved into the details of symbolic execution, a powerful technology that can be used to find security-critical bugs in real software. Symbolic execution can be viewed, on the one hand, as a generalization of testing: it uses static analysis to develop new tests that explore different program paths. We can also view it as a kind of static analysis, but one with very high precision and no guarantee of termination; if it does terminate, its results are both sound and complete. Symbolic execution is used in practice today to find important bugs in production code. The SAGE tool is used heavily at Microsoft, and the MergePoint tool regularly analyzes large Linux repositories. Symbolic execution tools of good quality are freely available, and we can expect that they, like static analysis tools, will penetrate more deeply into the mainstream in the near future.