Static Analysis- Introduction part 1
A key element of a security-conscious software development process is performing code reviews with the assistance of tools. Increasingly, such tools employ a technology called static analysis. In this unit we'll look at static analysis in detail. We begin with an introduction: we'll look at what static analysis is and why it is useful. Then we'll look at how static analysis works. To understand the basics, we will develop a flow analysis that tries to understand how tainted values flow around a program. Our goal is to find bugs. Then we will look at how to increase the precision of the basic analysis by making it context, flow, and path sensitive. We will finish up by filling in some missing details: we'll consider how an analysis handles more challenging programming language features, how it can be customized to different settings, and how it can be made more impactful in practice. Current practice for software assurance tends to involve testing. The idea here is to make sure that the program runs correctly on a set of inputs that seem relevant to the operation of the program. So we provide those inputs to the program, the program produces some outputs, and an oracle that the tester constructs confirms that the outputs are the ones that are expected. Testing is great because, if a test fails, you have a concrete failure that you can use to help find what the issue is, and also to establish whether or not the failure is a true one. On the other hand, coming up with test cases, actually running them, and doing so in a way that covers all of the relevant code paths is very difficult and provides no guarantees. The no-guarantee part is especially troubling for security, because it takes only one failure, one failed code path, to lead to a vulnerability that could take down an entire system.
Another approach that's complementary is manual code auditing: the idea here is to convince someone else on your software development team that the source code is correct. The benefit is that humans, when they review code, can generalize beyond single runs; they can do better than just single test cases, imagining executions in their head that maybe they haven't written test cases for. On the other hand, human time is very expensive, especially compared to computer time. And once again, this is a difficult task that provides no guarantees of coverage. Of course, we are worried about security, and so a malicious adversary is going to exploit those things that we miss. So: static analysis to the rescue. The idea here is to use a computer program to analyze a program's source code without running it. In a sense, we're asking a computer to do what a human might have done during a code review. The benefit is much higher code coverage. The static analyzer can reason about many possible runs of the program, and in some cases even reason about all possible runs, and thereby provide a guarantee that the property we've attempted to establish, hopefully a reasonable and important security property, is in fact true all of the time. We can also reason about incomplete programs, like libraries, that would otherwise be challenging to test. Now, as it turns out, static analysis, surprise surprise, has drawbacks too. For example, it tends to be able to analyze only limited properties, rather than complete functional correctness. It may miss some errors or raise false alarms for engineering reasons. And it may also be time consuming to run. Even so, static analysis can have significant impact on a security-oriented development process. Because static analysis can thoroughly check limited but useful properties and thereby eliminate entire categories of errors, it frees up developers to concentrate on deeper reasoning.
Static analysis tool usage can also encourage better development practices. Programmers who use tools start to develop programming models that avoid mistakes in the first place. Programmers also end up thinking about and making manifest their assumptions. For example, they will use annotations that tools understand in order to improve tool precision by specifying what they intend for a program to do. Static analysis tools are becoming a commercial reality. And so companies are starting to feel this impact today.
Static Analysis- Introduction part 2
What is static analysis? To answer that question, let's first ask a different one: what can static analysis do? A classic problem is the halting problem. It asks: can we write an analyzer that can prove, for any program P and any input to it, whether P will terminate? The problem is depicted visually as follows. We start with the program P. We feed this program and its input to our analyzer, and it determines whether or not P terminates. To solve the halting problem, we would have to build such an analyzer. The question is whether doing so is even possible. As it turns out, the answer is no: it has been proved that the halting problem is undecidable. That is to say, it is impossible to write a general analyzer that can answer the halting question for all programs and all inputs. While we cannot establish termination behavior, maybe we can establish other security-relevant properties. For example, perhaps we can use static analysis to prove that all array accesses are in bounds. If we could do this, we could eliminate a large source of memory safety violations, that is, buffer overruns. Unfortunately, this question is undecidable too, as is essentially every other interesting property. We can prove this claim by showing that an analyzer for array bounds checking could be used to decide the halting problem, which we know is undecidable; as such, array bounds checking must also be undecidable. Here are some more example questions that are undecidable; that all such properties are undecidable follows from Rice's theorem. Is an SQL query constructed from untrusted input? Is a pointer dereferenced after it is freed? Is accessing a variable a possible source of a data race? And of course, there are many other such questions. So let's look at how the halting problem can be viewed as equivalent to the question of whether an array index is in bounds.
This is a kind of proof by transformation, showing that interesting program analysis problems are equivalent to the halting problem. First, let's take all indexing expressions a[i] in the program and convert them to exit instead. We see here what that transformation is: we replace a[i] with a bounds check, asking whether i is greater than or equal to 0 and less than a's length. If so, we perform a[i] as normal; otherwise, we exit. Now an array bounds error is instead a termination. Next, we change all program exit points to instead be out-of-bounds accesses. So if the program would exit normally, by completing the main function, for example, or by calling the exit command, or by throwing an exception that's never caught, then at those exit points we would put a[a.length + 10], that is, an access with an index that's out of bounds. Now we take this transformed program and pass it to our analyzer, which we hypothesize can perfectly check whether or not an array access is in bounds. If the array bounds checker finds an error, then the original program must have halted. On the other hand, if it claims there are no such errors, the original program does not halt. But this is a contradiction, because we've now shown that an array bounds checker can decide the halting problem, which we know is undecidable. So does this mean that static analysis is impossible? Not exactly. What it means is that perfect static analysis is not possible. Useful static analysis is perfectly possible, on the other hand, despite the fact that the analyzer may fail to terminate itself, may emit false alarms, which are claimed errors that are not really errors, or may have missed errors, where the analyzer reports no problems but in fact the program is not error free. As a practical matter, because non-terminating analyses are confusing to users, tools that exhibit only false alarms or missed errors are the norm.
In particular, these tools fall somewhere between sound and complete analyses. Let's go into what I mean by that. Soundness is a property stating that if the analysis says X is true, then X is actually true. In other words, the things the analysis says are contained in the true things. This means that a trivially sound analysis is one that says nothing. Completeness states that if X is true, then the analysis says X is true: the true things are contained within the things it says. So a trivially complete analysis says everything. A sound and complete analysis is one where the things it says are exactly the true things. And of course, what we have shown is that, for problems like the halting problem, such sound and complete analyses do not exist; you have to sacrifice one or the other. You can go one of two ways. A sound analysis is one where, if the program is claimed to be error free, then it really is; alarms do not imply erroneousness. A complete analysis is one where, if the program is claimed to be erroneous, then it really is; silence does not imply error freedom. In practice, the most interesting analyses are neither sound nor complete, but they usually lean toward soundness or toward completeness. Because perfect static analysis is impossible in general, our goal is simply to make a tool that is useful. Doing so is as much art as it is science. In particular, there are many different elements of an analysis that trade off with one another, and a practical tool must decide which elements are most important. Some of these elements are the following. A precise analysis aims to model program behavior very closely, to make the most informed decision about whether it has found a bug, so as to avoid false alarms. A scalable analysis will successfully analyze large programs without unreasonable resource requirements, that is, space and time.
Oftentimes, scalability is a direct tradeoff with precision. An understandable analysis takes its human user into account; for example, the analysis is designed so that alarms are easy to understand and are actionable. An analysis that enjoys all three features is easier to produce if its focus is clean code. The intuition is that code that is hard for a human to understand is also hard for a tool to understand, and vice versa. As such, it may be reasonable to issue false alarms, or to run more slowly, when doing so is because of confusing or convoluted code patterns. In this way, we can view such poor performance of the static analyzer positively: if programmers clean up their code, they will reduce the total number of false alarms and perhaps improve the running time, while also making their code easier for humans to understand.
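Before moving on, the reduction from the halting problem described earlier in this section can be made concrete. Below is a toy sketch in Python over a hypothetical mini-language in which a program is just a list of instruction tuples; the instruction names and the array length A_LEN are invented for illustration, and real reductions operate on real program syntax.

```python
# A toy version of the transformation: ('index', i) accesses a[i],
# ('exit',) terminates normally, and anything else is ordinary work.
A_LEN = 100  # the (invented) length of the array a

def transform(prog):
    out = []
    for instr in prog:
        if instr[0] == 'index':
            # Guard every access: an out-of-bounds index now causes a
            # clean exit instead of a bounds error.
            out.append(('checked_index_or_exit', instr[1]))
        elif instr[0] == 'exit':
            # Every normal exit becomes a deliberately out-of-bounds
            # access, a[A_LEN + 10].
            out.append(('index', A_LEN + 10))
        else:
            out.append(instr)
    return out

# The transformed program commits a bounds error exactly when the original
# halts, so a perfect bounds checker would decide the halting problem.
print(transform([('work',), ('exit',)]))
```

The point of the sketch is only the shape of the argument: bounds errors and normal termination trade places, so deciding one property decides the other.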
Flow Analysis
We are now going to look in depth at one particular kind of static analysis, called flow analysis. Roughly speaking, a flow analysis tracks how values might flow between the different memory locations in a program. This kind of analysis is interesting because it can help us find bugs whose root cause is improperly trusting user input, and in particular, input that has not been validated. In a tainted flow analysis, untrusted input is considered tainted. Various operations in the program, on the other hand, expect to operate only on untainted data. If the flow analysis finds that tainted data could be used where untainted data is expected, there could be a potential security vulnerability. Examples include the source string given to strcpy, whose length is assumed to be no greater than that of the target buffer; format strings given to printf, which could otherwise be used in format string attacks; and form fields from web pages used in the construction of SQL strings, which could be used in SQL injection attacks. As a motivating application of tainted flow analysis, recall format string attacks. Here, the adversary controls the format string. fgets reads from the network; suppose the adversary is on the other side, sending data along, which is stored by the program in the name variable. Then this variable is passed as the format string to printf. The problem is that the attacker can set this format string in such a way as to exploit the program. For example, by setting it to %s%s%s, printf will attempt to read from arguments that it assumes are on the stack. But of course, in this case, they are not, which results in the program crashing. Even worse, the attacker can set the format string to include %n, which will write to assumed arguments on the stack, which again are not there. As a result, the adversary can actually perform code injection attacks.
And these bugs still do occur in the wild, because we cannot always simply specify the format string as a constant string. Another way to look at this problem, using types, is to imagine that the origin of a string, its level of trust, is specified as a type qualifier, either tainted or untainted. So here we specify that printf expects an untainted format string, and by untainted we mean that it is not controlled by a potential adversary. On the other hand, fgets could return data that is tainted, that an adversary could control. As such, the program that we saw before is illegal, because fgets returns a tainted string, but printf expects an untainted one. These qualifiers don't match: tainted does not equal untainted. And so we have found a possible bug, in fact a security-relevant bug, in the program. Let's carefully state the problem we want our analysis to solve; it is the no-tainted-flows problem. The goal is to establish that no tainted data is used where untainted data is expected. We identify trusted data using the untainted annotation and untrusted data using the tainted annotation. Un-annotated data can be either, and the analysis will figure out which. A solution to the problem considers all possible flows of data through the program: that is, what sources can reach what sinks, and whether a tainted data source can reach an untainted sink. We will aim to develop a sound analysis. For the flow problem, this means that if the analysis finds no illegal flows, then indeed there are none. So which flows are legal? Well, if f expects a tainted parameter, it's actually okay to pass it an untainted value instead. Here we see a function f that takes a tainted int, and we're passing it an untainted int instead. If f can handle something that's tainted, surely f can handle something that's untainted. On the other hand, if g expects an untainted argument, then we should not pass it a tainted argument.
That's the situation we've been considering with format strings: g should accept only untainted data. We can view the allowed flows as a kind of lattice, where untainted is less than tainted. That is, untainted is less than tainted, but tainted is not less than untainted. Visually, we can look at the lattice as a kind of graph, where higher values in the lattice are higher up in the graph. For our analysis, we can think of a flow analysis as a kind of type inference. If no qualifier is present, then we must infer it. To perform type inference, our algorithm takes the following steps. First, we create a name for each missing qualifier; that is, for each type that does not have a qualifier, we generate a fresh name for it, such as alpha or beta. Then, for each statement in the program, we generate constraints of the form qualifier one is less than or equal to qualifier two, where each qualifier may be a constant or a variable. For example, if we have the statement x = y in the program, then we would generate the constraint qy is less than or equal to qx, where qy is y's qualifier and qx is x's qualifier. These qualifiers could be constants like untainted or tainted, or variables like alpha or beta if they were missing. After we have generated the constraints for the entire program, we solve the constraints to produce solutions for alpha, beta, and so on. A solution is simply a substitution of qualifiers, like tainted or untainted, for names, like alpha or beta, such that all of the resulting substituted constraints are legal. If no solution exists, then we may have spotted an illegal flow. So here's an example analysis. Recall the declarations of our printf and fgets functions that we've seen already, and then consider this program. The first step is to generate qualifier names for those variables that do not have qualifiers attached to them, in this case name and x.
The next step is to generate constraints for each of the statements in the program. For the first statement, fgets returns something that is tainted and assigns it to the name variable; therefore, we generate the constraint tainted is less than or equal to alpha. Next, we assign name to x, and as a result we generate the constraint alpha is less than or equal to beta. Finally, we pass x to printf, which means that beta is less than or equal to untainted, the qualifier constant on the argument to printf. Now we can look at the constraints and see if a solution exists. The first constraint requires that alpha equal tainted: the only thing that tainted can be less than or equal to is something that's tainted. Therefore, to satisfy the second constraint, we must also have beta equal to tainted, because, again, tainted is only less than or equal to itself. But then the third constraint is illegal, because nothing allows tainted to be less than or equal to untainted. As a result, we have identified a potentially illegal flow, because there is no possible solution for alpha and beta.
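The constraint generation and solving just described can be sketched concretely. Below is a minimal brute-force solver over the two-point lattice, written in Python for illustration; the function and variable names are our own invention, and a real tool would solve constraints by graph propagation rather than enumeration.

```python
from itertools import product

def legal(lo, hi):
    # The two-point lattice untainted <= tainted: a flow lo <= hi is legal
    # unless it sends tainted data where untainted data is expected.
    return lo == 'untainted' or hi == 'tainted'

def solve(constraints, variables):
    """Brute-force search for an assignment of tainted/untainted to each
    qualifier variable that satisfies every constraint (lo, hi), read as
    lo <= hi. Returns a solution dict, or None (a potentially illegal flow)."""
    for values in product(('untainted', 'tainted'), repeat=len(variables)):
        env = dict(zip(variables, values))
        look = lambda q: env.get(q, q)  # constants stand for themselves
        if all(legal(look(lo), look(hi)) for lo, hi in constraints):
            return env
    return None

# The worked example: name = fgets(...); x = name; printf(x);
constraints = [('tainted', 'alpha'),    # fgets's result is tainted
               ('alpha', 'beta'),       # x = name
               ('beta', 'untainted')]   # printf expects untainted

print(solve(constraints, ['alpha', 'beta']))  # → None: an illegal flow
```

The enumeration mirrors the reasoning in the text: alpha is forced to tainted, which forces beta to tainted, which violates the final constraint, so no assignment works.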
Flow Analysis- Adding Sensitivity
Now let's consider improving our analysis by adding sensitivity, which increases the analysis's precision. Here's another example. We have the same printf and fgets prototypes as before, and the same start to the program: we read into the name variable from fgets. But we've now changed the program to branch, so that in the true branch we assign name to x, and in the false branch we assign x the constant value hello. Finally, after the branch concludes, we print out x. Let's analyze the program, one statement at a time, considering the constraints that each statement generates. The first statement generates tainted is less than or equal to alpha, just like the example we saw a moment ago. Next we consider the conditional. If we were to take the true branch, then assigning name to x causes us to generate the constraint alpha is less than or equal to beta. On the other hand, if we were to take the false branch, we would take an untainted value, the constant hello, and assign it to x, generating untainted is less than or equal to beta. Finally, we print out the result: because x's qualifier is beta, beta flows into the untainted argument of printf, generating beta is less than or equal to untainted. Taken together, these four constraints are unsolvable, and therefore we have a potentially illegal flow. This makes sense, because if the true branch were ever executed, we would in fact use something read from the network as a format string for printf. Now let's drop the conditional. It's a very similar program, except in this case, after we assign name to x, we immediately overwrite x with the constant hello. If we go through the program, it turns out we will generate exactly the same set of constraints. In this case, however, those constraints mean something very different, because we are sure that the second assignment to x will overwrite the value assigned by the first, rather than being an alternative to it.
As such, the analysis will complain that there is an error when in fact there is none; it has produced a false alarm. The problem here is that our analysis is flow insensitive: each variable has one qualifier, which abstracts the taintedness of all values it will ever contain. A flow sensitive analysis, on the other hand, would account for variables whose contents change: each assignment to a variable would effectively get a different qualifier. So, for example, x's qualifier at line one would be given the name alpha-1, while x's qualifier at line two would be given the name alpha-2. As such, the solutions for alpha-1 and alpha-2 can differ. We could implement such a flow sensitive analysis by transforming the program so that each variable is assigned at most once; this is converting the program to what is called static single assignment form, or SSA. Here is our example again, reworked into SSA form. Notice that we have replaced the single variable x with two variables, x1 and x2, where each one has a distinct qualifier name, in this case beta and gamma. If we look at the constraints the program generates, the first one is the same, but now the second one constrains the qualifier variable for x1, and the third constrains the qualifier variable for x2. These are now different qualifier names, and as such, there is no alarm: we can set gamma to be untainted, and alpha and beta to be tainted, and this is a valid solution for the program. So no problem exists. Now let's consider a variation of our example with multiple conditionals. First, if x is true, we assign to y the constant string hello. Hello is untainted and y's qualifier variable is alpha, so we produce the constraint untainted is less than or equal to alpha. In the else branch, we assign to y the tainted output of fgets, so we produce the constraint tainted is less than or equal to alpha.
On the next line, if x is true, we call printf with y. printf expects an untainted argument, and y, whose qualifier variable is alpha, flows into that untainted argument, so we produce the constraint alpha is less than or equal to untainted. These constraints have no solution; but this lack of a solution is, in fact, a false alarm. Why? Well, the two constraints that create the problem are impossible to produce in the same execution. That is, if x is false, then we will execute the else branch, which produces the tainted constraint; but if x is false, then we will not execute the second conditional's printf, which produces the other constraint. Conversely, if x were true, we would not have generated the constraint tainted is less than or equal to alpha, and so we would not have created the conflict that made a solution impossible. The issue here is that the constraints that produced the problem do not correspond to a feasible path. An analysis that is path sensitive considers path feasibility when determining whether or not there is a potential problem in the program. To illustrate this, we've labelled some of the statements in the little code snippet on the right. When x is not 0, the program will follow the path 1-2-4-5-6. When x is 0, it will follow the path 1-3-4-6. The path that leads to the problem we saw a moment ago is 1-3-4-5-6, but this path is infeasible. A path sensitive analysis will rule out such a path by qualifying each constraint with a path condition. So instead of just generating the constraint untainted is less than or equal to alpha, it will include the condition that says this constraint is only valid when x is not equal to 0. On the other hand, the constraint tainted is less than or equal to alpha is only valid when x equals 0. When determining whether the constraints have a solution, the path conditions will be considered.
We'll see that, on the one hand, only the first and third constraints should be considered together, and in this case alpha does have a solution: untainted. Or, on the other hand, only the second constraint should be considered, and again alpha has a solution, but a different one. Therefore, there is no alarm for this program. If flow and path sensitivity are so useful, is there any reason not to use them? It would seem that the added precision is important for avoiding false alarms, which would otherwise arise from conflating flows or paths. But both sensitivities make the flow analysis problem harder to solve. For example, flow sensitivity increases the number of variables the analysis considers, and thus increases the number of nodes in the constraint graph that we must solve. Path sensitivity requires the use of predicates to distinguish different paths, which means we must employ more general and expensive solving procedures. In short, the added precision decreases performance, which ultimately hurts scalability, limiting the size of the programs we can analyze. This trade-off is sometimes managed by employing a demand-driven analysis, which adds precision only as needed and where it's needed. We'll discuss this idea a bit more later on.
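The guarded constraints described above can be sketched in a few lines of Python. This is a toy illustration with invented names: path conditions are represented as opaque strings, whereas a real path-sensitive tool would reason about predicates with a logical solver.

```python
# Each constraint (lo, hi, guard) means lo <= hi, valid only under the
# given path condition on x (None would mean "always valid").
guarded = [('untainted', 'alpha', 'x != 0'),  # then-branch: y = "hello"
           ('tainted',   'alpha', 'x == 0'),  # else-branch: y = fgets(...)
           ('alpha', 'untainted', 'x != 0')]  # printf(y), guarded by x

def legal(lo, hi):
    # The two-point lattice: a flow is legal unless tainted <= untainted.
    return lo == 'untainted' or hi == 'tainted'

def solvable_under(condition):
    # Keep only the constraints whose guard is consistent with this path,
    # then search for a value of alpha satisfying all of them.
    active = [(lo, hi) for lo, hi, g in guarded if g is None or g == condition]
    for alpha in ('untainted', 'tainted'):
        look = lambda q: alpha if q == 'alpha' else q
        if all(legal(look(lo), look(hi)) for lo, hi in active):
            return alpha
    return None

print(solvable_under('x != 0'))  # the feasible path 1-2-4-5-6
print(solvable_under('x == 0'))  # the feasible path 1-3-4-6
```

On the path where x is not 0, only the first and third constraints are active and alpha can be untainted; on the other path, only the second is active and alpha can be tainted. Each feasible path is solvable, so no alarm is raised.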
Context Sensitive Analysis
Up until now, we've focused on tracking tainted flows through blocks of code, that is, through normal statements. Now we'll consider how to handle tainted flows through function calls. This example shows a function id that we want to analyze. The id function itself is rather simple: it takes the argument x and simply returns it back to the caller. On the left-hand side, we see a client program that takes the result of fgets, passes it to id, and stores the result in the variable b. We want to see whether or not there is a tainted flow in this program, so we need to track the flow into the id function and back out again. The way we're going to do this is in several steps. First, we need to give flow variable names to the argument and return value of the function; here we pick gamma and delta. Next, we need to create flow constraints between the actual argument at the call, in this case the variable a, and the callee, which receives it in the parameter x. In addition, after computing whatever result the function computes, we need to take what is returned, in this case x, and create a flow between it and wherever that result is used in the caller, in this case the variable b. So let's put it together. We see that fgets returns something into a. Remember that fgets has a tainted return value, and so tainted is less than or equal to alpha. Then we call the identity function, passing a in. Therefore, we need to create a flow between a and x, that is, between the actual argument a and the formal parameter x of the function id; so we create a flow constraint between alpha and gamma. In the body of the function, all that the body does is return the variable x. x has the name gamma, and because we're returning it, we need to create a flow to the name of the return value, which, as we said before, is delta.
Now, finally, what is actually returned from this particular function call needs to flow into the variable we're assigning it to, and so we create a flow between delta and b, that is, between delta and beta, b's name. Next, we've extended the example a little bit with two extra statements. In particular, we define a new variable c, to which we assign the untainted constant hi, and we then pass c as the format string to printf. These extra statements are independent of the first two as far as the actual semantics of the program is concerned, but they set up a more interesting example that we'll see in just a moment. We're going to have the same sort of constraints that we saw a minute ago for the first two lines. Then we'll create the constraint untainted flows to omega, and finally the constraint omega flows to untainted, because printf expects an untainted argument, and omega is the name of c, the actual argument we're passing in. So there's a good solution here, and there's no alarm. Now consider a small change to the example. Instead of assigning hi to the variable c directly, we pass it into id, which returns a result that eventually makes its way to c. If we were to run this program and the prior program, they would have exactly the same outcome. But, as we shall see, the analysis will treat them a bit differently. We generate the first set of constraints once again, and now we have to consider the function call into id. We're passing an untainted value into the argument of id, so untainted is less than or equal to gamma; likewise, we're taking the return value of id, delta, and assigning it to c, whose name is omega. Then we call printf, which causes omega to flow into untainted, the expectation of the format string for printf. Now let's look at solving these constraints. Unfortunately, we're going to get a false alarm.
There's no solution, and yet, as we said a moment ago, this program has exactly the same semantics as the one we saw before. So what happened? As the animation just showed, the analysis is being imprecise: it is conflating the call at which the tainted result of fgets is passed in with the call at which the untainted value hi is passed in. By conflating these two calls, the analysis pretends that the first call can return its result to the second call site. This is a problem of context insensitivity: the call sites are conflated in the graph. A context sensitive analysis solves this problem by distinguishing call sites in some way, so that we don't allow a call at one place to return its value to another call. We can do this by giving each call site a label, call it i, perhaps correlated with the line number in the program at which the call occurs, and then, in the graph, matching up calls with corresponding returns only when the labels on the flow edges match. We'll also add what we call polarities to these edges, to distinguish calls from returns: minus i for argument passing, and plus i for return values. So let's see the example again, this time in a context sensitive setting. Notice that we have labeled the two calls to id with one and two; these are the indexes for those calls, which we can think of as happening on lines one and two of the program. We have the same constraint that we had originally, but now, when we go to call the function id, we index that call with the label one, and we use a negative polarity, because that's an argument being passed down. We have the same constraint for the body of the function, and now the return at call site one is slightly different: it is indexed with the label one, and it has polarity plus, to indicate that it's a return. Okay, so here's where the new part happens.
This time, the constraints carry the label two, for both the argument passing and the return value, and then we have the final constraint. Now, these were the constraints that were part of the false alarm we saw before. But notice that the problematic flow is no longer possible, because the indexes don't match: the argument passed at call site one cannot flow out through the return at call site two. Because that flow is infeasible, the analysis, thanks to the different labels, will find a solution, and there will be no alarm. Just like flow and path sensitivity, context sensitivity is a tradeoff, favoring precision over scalability. In particular, the context insensitive algorithm takes roughly time O(n), where n is the size of the program, while the context sensitive algorithm takes time O(n cubed). On the other hand, sometimes the added precision actually helps performance: by eliminating infeasible paths, it can reduce the size of the constraint graph by a constant factor, and this is true for flow and path sensitivity too. But the general trend is that greater precision means lower scalability. One way to push the trade-off more toward performance is to be selective about where context sensitivity is used. Rather than using it at all call sites, we could use it only at some of them; the remaining call sites are conflated, as in an insensitive analysis. This means the analysis coarsely models calls from one site as if they could return to another. Rather than having one blob of insensitive sites, we could also imagine groups of sites that are sensitive with respect to the other groups but insensitive with respect to their own members; doing this amounts to giving all sites in the same group the same index. Finally, it's also possible to limit the depth of sensitivity, for example, to only match nested calls up to a certain limit.
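The matching of labeled call and return edges described above can be sketched as a small graph search. This is a simplified toy in Python with invented node and edge names: a depth-first search keeps a stack of open call sites and only follows a return edge when its index matches the most recent open call, which is exactly the parenthesis-matching intuition behind context sensitivity.

```python
# Flow edges are labeled with a polarity and call-site index: ('-', i)
# passes an argument at site i, ('+', i) returns a result at site i, and
# None is an ordinary flow.

def reaches(edges, src, dst):
    def dfs(node, stack, seen):
        if node == dst:
            return True
        for a, b, label in edges:
            if a != node:
                continue
            new_stack = stack
            if label is not None and label[0] == '-':
                new_stack = stack + [label[1]]      # open a call at site i
            elif label is not None and label[0] == '+':
                if stack and stack[-1] != label[1]:
                    continue                        # mismatched return: skip
                new_stack = stack[:-1]              # close the call
            key = (b, tuple(new_stack))
            if key not in seen:
                seen.add(key)
                if dfs(b, new_stack, seen):
                    return True
        return False
    return dfs(src, [], set())

# The two calls to id: site 1 passes the tainted a, site 2 the untainted hi.
edges = [('tainted', 'x', ('-', 1)),    # a flows into id's parameter (site 1)
         ('x', 'ret', None),            # id returns its parameter
         ('ret', 'b', ('+', 1)),        # site 1's result flows to b
         ('untainted', 'x', ('-', 2)),  # hi flows into id (site 2)
         ('ret', 'c', ('+', 2)),        # site 2's result flows to c
         ('c', 'sink', None)]           # printf(c) expects untainted

print(reaches(edges, 'tainted', 'sink'))    # site 1 cannot return at site 2
print(reaches(edges, 'untainted', 'sink'))  # the legal flow is still found
```

The tainted source can still reach b through the matched pair at site 1, but it can no longer escape through site 2's return, so the false alarm disappears while real flows are preserved.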
Flow Analysis- Scaling it up to a Complete Language and Problem Set
[SOUND] Okay, so let's finish up this discussion of our tainted flow analysis by seeing how to scale it up to a more complete language and problem set. So far we've considered scalar values, or pointers treated as blocks, but we haven't considered dereferences through pointers and how those affect the tainting. In this example, we have character pointers a and b, which are assigned strings as normal. But we also have p and q, which are pointers to pointers. That is, p is a pointer to a string, as opposed to a string itself. What the example is going to do is assign "hi" to a; p will point to the variable a; q will alias p, that is, it will also point to the variable a. Then, via q, we assign b, which is a tainted value we read from fgets, and then we print through p, which aliases with q. So what we can see here is that because of the aliasing of p and q, that is, because they both point to the same memory, when we assign the value in b through q, we effectively make what p points to tainted. But what we're going to see is that without care, our analysis won't realize this. So here on the first line, we have the assignment as usual. Here on the second line, we're going to take the address of a, and so now we're concerned with beta, which is the taintedness of the contents that p is pointing to, rather than of p itself. So we're going to create that flow constraint, and likewise that flow constraint, and here's another one that we generate as normal. And finally, we do the assignment through q. Now, star q refers to the character pointer that q points to, whose label is gamma, and b refers to what is returned by fgets, whose label is omega. So omega will flow into gamma through this assignment. And then we print via star p. So once again, what p points to, that character string, has beta as its label, and so beta is going to flow to untainted. Okay. Well, here's a problem: a solution exists, even though the assignment through q is going to affect the contents of what p points to. 
And therefore, this is actually an error that our analysis will fail to uncover. So what do we need to do to fix this? Well, we need a way of identifying when flow can happen through assignments via aliases. And we do this by, when we assign pointers to pointers, creating edges that go both ways between the flow labels of the things that they point to. So here we show the constraint gamma flows to beta when we assign p to q. That is, not only do we go beta to gamma, we also go gamma to beta. And actually, we should do the same thing with the assignment on the second line, that is, we should have an edge from beta to alpha, but I just haven't shown it here to keep one fewer constraint from cluttering the slides. Okay, with this extra constraint, a solution no longer exists. We can see now that tainted can flow to untainted, and so we've caught the error. So again, the idea is that an assignment via a pointer flows both ways. That is, it anticipates that, through an alias, we could assign a value that would create a flow later on, and the flow both ways allows that assignment to be captured. Unfortunately, this can lead to false alarms, so we have a couple of choices for reducing them. One is, if we know that an assignment through an alias can never happen, for example because the alias is labeled as const, then the backward flow is not needed and we do not need to include the edge. Or we could just drop the backward flow edge anyway, anticipating that an assignment will not happen, whether or not that's guaranteed by the type system. That's problematic: effectively, it trades a false alarm for a missed error. Different analyses make different choices about this question. Another sort of hidden flow that might happen is what's called an implicit flow, and that's illustrated by an example we'll develop now. So, here's a copy function that takes a tainted character pointer src, and copies it to a destination character pointer dst. 
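As a rough sketch of why the backward edge matters, here's a tiny Python model of this example. The label names beta, gamma, and omega follow the slides (beta and gamma label the strings that p and q point to; omega labels the tainted fgets result), and the "solver" is just naive taint propagation over the flow edges:

```python
def solve(edges, sources, sinks):
    # Propagate taint from tainted sources along flow edges, then report
    # which untainted sinks the taint reached (any hit is an error).
    tainted = set(sources)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in tainted and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted & set(sinks)

# Flow edges for the alias example.  Without the backward edges, the
# assignment *q = b (omega -> gamma) never reaches beta, the label of
# what p points to, so the print through p looks safe.
forward_only = [("alpha", "beta"),    # p = &a
                ("beta", "gamma"),    # q = p
                ("omega", "gamma")]   # *q = b  (b is tainted)
with_back = forward_only + [("beta", "alpha"), ("gamma", "beta")]
```

Running solve(forward_only, {"omega"}, {"beta"}) finds nothing, the missed error; adding the backward edges makes the taint reach beta, catching the bug.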
And we can see that this is illegal, because the tainted characters in src should not be able to flow to the untainted expectations of the characters in dst. So this flow is not allowed, and therefore the analysis should flag it. Okay, now let's look at this program, which, as it turns out, implements exactly the same function, though in a very inefficient manner. What it does is, for each character that's considered in the source array, it will iterate through all possible values that character could have. So j will consider all possible character values. If that value matches the one that's in the source, that's the if statement testing whether src sub i is equal to j, then we simply assign j to dst. This will be legal because j is untainted. But then we've missed a flow, because even though the character was not directly assigned from src, the same value of that character was. And so the information contained inside of src is certainly copied to dst, and therefore the information is leaked. So, we can perform what we call information flow analysis to discover when such implicit flows happen. Data did not flow, but the information did. We can discover this by maintaining a second label that we call the program counter label, and it represents the maximum, in the lattice, taint affecting the current position of the program counter in the program. Assignments will generate constraints involving the pc. So x equals y will produce two constraints: one between the labels of y and x as usual, and another saying that the pc label must be bounded above by the label of x, so that information does not leak through the assignment to x. And this will allow us to track information flows like the one we just saw. So here's an example with an implicit flow in it. It resembles the prior one, but it's simpler; it doesn't have the loop and so on. Here, we have a tainted source and an untainted destination. Actually, we'll infer what the taintedness of the destination is. 
So we branch on the source, which is tainted, and then based on whether the source is zero or not, we assign to dst. So if the source is zero, then dst will contain the same value as the source; otherwise it will contain one. So information is clearly flowing from src to dst, though data is not, because information about src can be recovered by looking at the value of dst after the program runs. So let's see what the constraints look like that we would generate. At each of the assignments, dst equals 0 or dst equals 1, we just have an untainted constant flowing into alpha as normal: for the 0, for the 1, and also here for the 0 in the final assignment dst plus equals 0. Okay, now let's consider the constraints according to the pc label. We've labeled each of the lines in the program with a relevant pc. The first line that can be executed is the guard, with pc1. Then come the then branch and the else branch, with pc2 and pc3. And then finally, the final assignment, with pc4. Now, the first thing that we will consider are the extra constraints that result from the assignment statements due to the pc. In particular, here, we have to create a constraint, pc2 is less than or equal to alpha, because alpha labels the variable to which we're assigning and the current program counter label at that assignment is pc2. Likewise, for the else case, we have pc3 is less than or equal to alpha, the program counter label there. And then finally, for the last assignment, pc4 is less than or equal to alpha. So what are the program counter labels at these different locations? Well, pc1 is for the first line that's executed in the program; it has not been influenced by any tainted data, so it starts as untainted. However, pc2 and pc3 are the result of branching on src. That is, those lines can only be reached because we have branched on the src variable, and src is tainted. Therefore, we need to bump the program counter label to tainted in both cases. 
Once we have exited the conditional and we're back on the line with pc4, the program counter label can be untainted again, because it doesn't make any difference which direction we took when we branched on src; we will always reach pc4, and so the value of the guard src has no influence on it. Looking at all of these constraints, we can see whether there is a solution to them, and we can see that any solution requires alpha to be tainted. So this is discovering the implicit flow, that is, that the tainted source is tainting the destination. So, should we always track information flow constraints? Well, tracking them with a pc label can sometimes lead to false alarms. In this program, the dst value is always going to be 0, no matter what the value of src is, and so there is no information leak. And yet, the assignments to dst will be tainted according to the use of the pc label, and therefore the analysis will produce a false alarm. Extra constraints due to the pc label can also hurt performance. As it turns out, the copying example we saw before is pathological. Developers typically don't write programs like this, and generally, implicit flows have little overall influence in well written programs. So, as an engineering question, oftentimes implicit flows are not tracked in industrial analyses. One caveat, though: if you're analyzing untrusted programs, for example, mobile code that's being downloaded by your browser or some other system that you run locally, there is a greater chance that an adversary will maliciously exploit this weakness in your analysis to try to taint information, or perhaps leak it. However, this situation aside, tainting analyses tend to ignore implicit flows.
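Here's a small sketch, in Python, of how the pc label changes the analysis of the two-branch example (the final dst plus-equals 0 line, which happens under an untainted pc, is omitted for brevity). It's deliberately specialized to this one program; a real analysis would generate and solve constraints. The lattice is just untainted below tainted, with join as logical or:

```python
TAINTED, UNTAINTED = True, False

def join(a, b):
    # Lattice join for the two-point taint lattice: untainted < tainted.
    return a or b

def analyze(src_taint):
    # Analyze:  if (src) dst = 0; else dst = 1;   inferring dst's label.
    # Each assignment constrains label(dst) to be at least
    # join(label(rhs), pc), so the pc label flows into dst.
    pc1 = UNTAINTED                     # entry: no tainted influence yet
    pc_branch = join(pc1, src_taint)    # both branches guarded by src
    dst_then = join(UNTAINTED, pc_branch)   # dst = 0 (untainted constant)
    dst_else = join(UNTAINTED, pc_branch)   # dst = 1 (untainted constant)
    return join(dst_then, dst_else)         # least solution for alpha
```

When src is tainted, the pc label is bumped inside both branches and dst comes out tainted, even though only untainted constants are assigned; when src is untainted, dst stays untainted.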
Challenges and Variations
[SOUND] We have considered how to analyze most of the key elements of a language, but not all of them. A robust tool obviously has to handle them all. First, while assignments obviously transfer the taint from the source to the target, what should happen if the source is an expression rather than a variable? We must define the taint of operators. Usually, if an argument to an operator is tainted, then the expression containing that operator is, too. So far, we've assumed we know which function a call will invoke. This is important for linking up the flow constraints at the caller with those in the callee. But what if we make a call through a function pointer? The pointer is not assigned until runtime, so a static, compile-time analysis cannot always be sure of its target. As it turns out, we can use a kind of flow analysis to determine the set of possible targets a function pointer could contain. The analysis follows the same principles as the flow analysis that we've already seen. This way, when analyzing a call site using a function pointer, we add constraints as if all possible targets were called, rather than a single target. An even coarser analysis of possible targets, for example, one based on a function's type, could also be used rather than one that more precisely determines the possible values of the function pointer. C structs are records that have several fields. The most precise analysis can track the taintedness of each field of each struct instance separately, as if they were separate variables. But such precision can be expensive. Alternatively, an analysis can track the taintedness of the same field of all instances of a struct, as if that field was a kind of global variable. Now, this hurts precision, but is often a big boost to performance. An in-between position is to analyze only the taintedness of the struct as a whole, or to do so for only some of its fields. Note that these decisions are relevant to object oriented languages, too. 
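The operator rule just described is easy to state as code. Here's a minimal sketch, assuming expressions are nested tuples (a made-up representation): the taint of an expression is the join of the taints of its parts, so any tainted operand taints the whole expression:

```python
def expr_taint(expr, env):
    # expr is ('const', v), ('var', name), or (op, e1, e2, ...).
    # env maps variable names to their taint (True = tainted).
    kind = expr[0]
    if kind == "const":
        return False            # literals are untainted
    if kind == "var":
        return env[expr[1]]     # look up the variable's label
    # Operator node: tainted iff any operand is tainted (the join).
    return any(expr_taint(sub, env) for sub in expr[1:])
```

For example, x + 1 is tainted exactly when x is: expr_taint(("+", ("var", "x"), ("const", 1)), {"x": True}) yields True.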
You can view objects as structs containing function pointers, and so the trade-offs we've just considered apply in the analysis of object oriented languages. Another way to aggregate data in a language is to store it in an array, and there are several similar trade-offs here. You can track the taintedness of each element, or you could track it on the array as a whole. The challenge here is that while struct fields are constants, array indexes are computed by arithmetic. As such, it is sometimes hard to track them precisely at compile time, because we won't know what the arithmetic produces until we run the program. And so a hybrid approach is typically employed. Up until now, we have focused on how the analysis works. A key part of using it to perform security reviews is to properly label the tainted sources and untainted sinks, and there are many options here. We could label array indexes, format strings, and SQL strings as untainted sinks. We can also label functions that act as sanitizers or validators. These are library routines that take in tainted data and return untainted data. For example, a function that escapes an HTML string would be a sanitizer. Labeling sanitizers is needed to avoid false alarms. Flow analysis need not only apply to finding improper use of tainted data. It can also be used to discover the improper disclosure of sensitive information. In particular, we can imagine labeling sensitive data as secret, and untrusted output channels as public, and our goal is to prevent secret sources from reaching public sinks. In essence, this is the dual of the tainting problem, where the only difference is that the lattice and flow relationships are inverted. Finally, we should point out that flow analysis is not the only kind of static analysis; in fact, there are many other kinds. One common analysis, often used to assist other analysis problems, is called pointer analysis, with a common flavor of it called points-to analysis. 
This analysis determines whether two pointers are possibly aliases, that is, whether they could both point to the same memory. Earlier we saw that knowing this is important for not missing bugs. In fact, our tainted flow analysis involved a very coarse-grained form of pointer analysis within it, but a more sophisticated analysis would improve precision. Significant advances have aided both the precision and scalability of pointer analysis in recent years. Another kind of analysis is called data flow analysis, and it was developed in the 1970s as part of research on optimizing compilers. Like the flow analysis we've already considered, data flow analysis focuses on the flow of values through variables in a program, but it maintains information about this flow a bit differently. One common data flow analysis is called liveness analysis, and it determines which variables at a given program point are still live, that is, whether their values could still be read in the future. Other variables are dead, meaning that their values will be overwritten. Liveness analysis is an important part of allocating variables to machine registers during compilation: registers storing dead variables can be reused. Data flow analysis techniques are also useful for security questions, and many industrial tools use them. For example, they can also answer the tainted flow question. Finally, another significant kind of analysis is called abstract interpretation. It was invented in part to provide a theoretical explanation of data flow analysis, but it has grown into a style of analysis in its own right. The key idea of abstract interpretation is that a static analysis is an abstraction of all possible concrete runs of a program. To be practical, the abstraction must discard information, for example, which calls correspond to which returns. The key is to discard as much information as possible for purposes of scalability, while still being able to prove the property of interest. 
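Since liveness is computed backwards, from the end of the program toward the beginning, a sketch is short. Here's a minimal version for straight-line code only (real liveness analysis iterates over a control flow graph to a fixed point); each instruction is a pair of the variable it defines and the list of variables it uses:

```python
def liveness(instrs):
    # Walk the instructions in reverse: a variable is live before an
    # instruction if it's used there, or live afterwards and not
    # redefined there.  Returns the live set before each instruction.
    live = set()                 # nothing is live after the last instruction
    live_before = []
    for defined, uses in reversed(instrs):
        live = (live - {defined}) | set(uses)
        live_before.append(live.copy())
    return list(reversed(live_before))

# A hypothetical straight-line program:
prog = [("a", []),          # a = input()
        ("b", ["a"]),       # b = a + 1
        ("c", ["a", "b"]),  # c = a * b
        ("r", ["c"])]       # r = c     (a and b are dead by here)
```

By the last instruction only c is live, so a register allocator could reuse the registers holding a and b at that point.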
Abstract interpretation is at the core of several industrial strength tools. Static analysis is gaining traction in practice, with a variety of commercial products from a variety of companies, as well as open-source tools. If you'd like to learn more about static analysis, I'd recommend the book Secure Programming with Static Analysis by Brian Chess and Jacob West. It goes into more depth about how static analysis tools work, elaborating on the description in this unit, and talks more about how static analysis fits into a secure development process. For more of the mathematical details about how program analysis works, I recommend Principles of Program Analysis by Nielson, Nielson, and Hankin. It's a bit dense for the casual reader, but it's good for introducing the academic field of static analysis.
Introducing Symbolic Execution
[SOUND]. Software has bugs. It seems unavoidable. The most common ways of finding bugs are testing and code reviews. But as all users know, software nevertheless ships with bugs still unfixed. In fact, many of these bugs are known to the developers, but they were not deemed important enough to fix prior to shipping. Many other bugs are not known. Why are they not found? It could be that the bug is in a rarely used feature, for example, in a device driver for a rare piece of hardware. Or the bug arises in rare circumstances, for example, when memory is nearly exhausted. Or the bug could arise nondeterministically; the circumstances that make the bug manifest could be unpredictable, and thus make the bug hard to reproduce and hard to fix. One way we can try to find those bugs that testing misses is to use static analysis. And this conceptually makes sense, because static analysis can consider all possible runs of the program, perhaps even those that are rare, nondeterministic, and so on. We can feel heartened by the explosion of interesting ideas and tools that are being considered in research, and by the fact that companies are now selling and using static analysis tools. So there's great potential here to improve software quality. The question, though, is: can static analysis find deep, difficult bugs? In principle, yes. But in practice, maybe not so often. Okay, why is this? Well, it's because commercial viability implies that you have to deal with developer confusion, false positives, and error management. That is, quite a lot of code that has nothing to do with the core idea of the analysis itself, but rather the way that humans use the analysis. One particular thing that companies do is keep the false positive rate down. They do this by purposefully missing bugs to keep the analysis simpler. This turns out to be a good idea, because studies show that developers get tired of dealing with alarms that don't turn out to be true bugs. 
And they're not interested in sifting through, say, nine false alarms, even if on the tenth try they find a really important bug. So companies respond to this, and they make tools that will miss bugs. One of the issues that makes false alarms hard to avoid is abstraction. Abstraction is critical for static analysis because it allows us to simplify our view of what the program does, so that we can efficiently, or efficiently enough, model all possible runs. But to do this, we have to introduce conservatism. Now, the various sensitivities, like flow, context, and path sensitivity, add precision, and that reduces conservatism. But these more precise abstractions are more expensive, which means that they won't be applicable to larger code bases, at least not at short time scales. And they still will not completely eliminate the false alarms problem. One particular problem that makes false alarms hard to deal with is that the static analysis technique may use an abstraction that is not familiar to the developer. That is, the developer is not able to make sense of the alarm that the tool produces, because it was produced in a way that the developer can't relate to, and therefore can't easily triage to determine whether the bug is true or not. So for the remainder of this unit, I'm going to talk about a technique called symbolic execution. It's a kind of middle ground between testing and static analysis, and it aims to retain the best features of both techniques. We start with the observation that testing works: reported bugs are real bugs. The problem is that each test only explores one possible execution of the program. If we write assert f of 3 is equal to 5, well, we've checked that f of 3 equals 5, but not that f applied to other input values also produces whatever it is it's supposed to produce. In short, tests are complete, but they're not sound. And while we hope they generalize, they don't provide any guarantees. 
Now, symbolic execution aims to generalize testing. In a sense, it's more sound than testing, so we can write tests like this: we can say that y is equal to alpha, which represents an unknown, or symbolic, value, and then we can assert that f of y is equal to 2 times y minus 1. This is basically saying, for all y, f of y is equal to 2y minus 1. Symbolic execution can execute this program and confirm that this is indeed the case, thus performing a more general sort of test. The way that this is possible is that the symbolic executor can run the program to the point that it depends on an unknown, and then it will conceptually fork, to consider all possible values of that unknown that would affect the control flow of the program. So here we see the function f, and we see that it depends on x, its argument, at the first branch. If x is symbolic, as in our assert, then the symbolic executor will consider both branches, that is, the case when x is greater than 0 and the case when it is not. And of course, in this example, when it does that, it will discover that the assertion does not hold for non-positive values of x. Let's look at another example, just to get a flavor of how symbolic execution works; of course, we'll go into much greater detail later on. So here on the left, we have a program, and in the program we've assigned symbolic values alpha, beta, and gamma to three of the variables: a, b, and c. The rest of the program is a series of conditionals and assignments. At the end, we see an assertion that we would like to hold for all possible runs of the program. All right, so the symbolic executor will start, and it will execute line 3, initializing x, y, and z to 0. Then it will reach the conditional on line 4. This conditional depends on a symbolic value, alpha, and so it will consider whether both branches are possible, and of course they are. At first, it will consider the true branch, that is, when alpha is nonzero. 
In that case, it will assign x to be negative 2. Then it carries on to executing line 7. Here, the guard depends on a symbolic value again, and so it checks whether beta can be less than five. Again, it could be, or it might not be, so the symbolic executor will pick one direction. Here, suppose that it picks true. Now we see that we'll skip line 8, and the reason is that along this path we have assumed that alpha is nonzero, and that means the conditional on line 8 cannot hold. After assigning z to 2, we reach the assertion, and we see that the assertion succeeds. That's because x is minus 2, z is 2, and y is 0: the sum is 0, and of course 0 is not equal to 3. Now, we can characterize this particular path of our program execution using what's called a path condition. That is, along this path we assumed that alpha was true, i.e., nonzero, and beta was less than 5. So we want to also consider the other paths, by considering those branches again and going the false direction where possible. For example, we could take the false side of the branch on line 7. In that case, we'd skip lines 8 and 9 and reach the assertion again. In this case, the assertion is still going to hold, because x is minus 2, the other two are zero, and therefore the resulting sum is not 3. All right, we'll consider the other branch of line 4 once again. On this branch, we'll now assume that alpha is false, and we'll check whether beta is less than 5. Again, we have no constraints on the symbolic value beta, so we could go both ways. Suppose we go the false direction this time, thereby skipping lines 8 and 9. Here the path condition is not alpha, and beta greater than or equal to 5; that is, we're assuming that alpha is false and beta is at least 5. Now we'll go the other direction, where beta is less than 5, and consider the branch on line 8: we're assuming that alpha is not true, and we ask whether gamma can be true. Can this condition hold? Sure; it depends on what c is, and we have not constrained that. That is, we have not constrained gamma. 
So suppose we go in the false direction; in this case, we'll just assign z to be 2, because we skip the assignment to y. If, however, we go in the true direction, then y equals 1 and z equals 2, and now the sum of y being 1, z being 2, and x being 0 is 3, and that violates our assertion. Importantly, the assertion violation comes with the path condition not alpha, and beta less than 5, and gamma, and this precisely states all of the inputs that will cause this test case to fail. That is, if we used a satisfiability solver to give us a solution to this path condition, we could produce a concrete test case that exhibits the failure. And of course, this is very useful, because the developer can then run the test case and use it to diagnose the reason for the failure in the program. Notice that each path through the tree we just constructed represents a set of possible executions. More precisely, it represents all of those executions whose concrete inputs are solutions to the path condition. This gives a very precise notion of code coverage, viewing symbolic execution as a kind of testing: complete coverage of the program would be all of its paths, and symbolic execution allows us to systematically consider many of these paths. Viewed as a kind of static analysis, symbolic execution is complete, in that whenever a symbolic executor claims to have found a bug, the claim is true. However, it is rarely sound, because it is unlikely to cover every path. The reason is that most programs do not have a finite number of paths, and so symbolic execution will not terminate except by imposing a time limit or other resource limit. Symbolic execution's high precision, the source of its completeness, can be characterized as path, flow, and context sensitivity in static analysis parlance.
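To make the path exploration concrete, here's a small sketch in Python of the example just discussed. The program and its five path conditions are written out by hand, and a brute-force search over a tiny input domain stands in for the satisfiability solver:

```python
from itertools import product

def run(a, b, c):
    # The example program: returns whether the assertion holds.
    x = y = z = 0
    if a:                       # line 4: branch on alpha
        x = -2
    if b < 5:                   # line 7: branch on beta
        if (not a) and c:       # line 8: branch on gamma
            y = 1
        z = 2
    return x + y + z != 3       # the assertion

# One path condition per path through the branch tree, as predicates.
paths = [
    lambda a, b, c: a and b < 5,
    lambda a, b, c: a and b >= 5,
    lambda a, b, c: (not a) and b >= 5,
    lambda a, b, c: (not a) and b < 5 and c,
    lambda a, b, c: (not a) and b < 5 and (not c),
]

def witness(pc):
    # Brute-force a concrete input satisfying the path condition,
    # standing in for an SMT solver query over a small domain.
    for a, b, c in product([0, 1], [0, 9], [0, 1]):
        if pc(a, b, c):
            return (a, b, c)
    return None

# Solve each path condition and keep the witnesses that fail the assertion.
failing = [witness(pc) for pc in paths if not run(*witness(pc))]
```

The only path whose witness violates the assertion is not alpha, beta less than 5, and gamma, solved here by the input (0, 0, 1), matching the walkthrough above: x is 0, y is 1, z is 2, and the sum is 3.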
Symbolic Execution- A Little History
[SOUND] Symbolic execution is an old idea. It was first proposed in the mid 1970s; the most cited early paper on the subject is James King's PhD thesis, a synopsis of which was presented in the Communications of the ACM in 1976. So if symbolic execution is such an old idea, and it seems like such a good one, why didn't it take off? Why haven't we been using it for a long time? Well, one reason is that symbolic execution can be compute-intensive. For big programs, we'll consider many, many possible paths, and along each of those paths we'll need to query an SMT solver to decide which paths are feasible and in which assertions could be false. And such queries could involve much of the program state, which could be very large. At the time symbolic execution was invented, computers were relatively slow, with not much processing power, and small, with not much memory. Today's computers are much faster and much bigger, with both more memory and more computing power; an Apple iPad is as fast as a Cray-2 from the 80s. And we also have better algorithms. SMT solvers are much more powerful today because of smarter optimizations that people have implemented over time. They can solve very large problems very quickly, and that lets us check assertions and prune infeasible paths. So let's take a minute to look at both the improved hardware and the improved algorithms. Here's a quick chart that shows trends of hardware improvement. Along the x axis is the year, and along the y axis is the number of operations that computers can perform. Because the y axis is on a logarithmic scale, the apparently linear growth is actually an exponential rise in computing power. On the algorithm side, we can see that the time to solve a particular instance has been improving even while keeping the hardware steady. Here, the y axis is the time that it takes to solve a particular problem, 
while on the x axis we list the year of the algorithm we're considering. All of these algorithms run on the same computer, that is, on 2011 hardware. So we can see that the time to run has been decreasing as the algorithms have improved over time. And though small instances have not improved so much, since 2008 big problems have been getting solved faster. Symbolic execution has experienced a resurgence of interest in the last ten years or so, starting in about 2005. The main motivation behind this resurgence was to show that symbolic execution could find very interesting bugs that were often missed by normal testing.
Basic Symbolic Execution
[SOUND]. Now we'll look in greater detail at how symbolic execution works. We'll start with a typical programming language. Here we show a grammar of expressions, denoted by the letter e. Expressions consist of things like integers n, variables x, expressions involving arithmetic operators like addition, and expressions involving comparisons, like disequality. To support symbolic execution, expressions also include symbolic variables, which we denote with Greek letters, like alpha. Symbolic variables represent inputs to the program. These inputs could come from many sources, including reads from files or network sockets, or memory mapped files or devices. Whereas normal tests must provide specific inputs, symbolic execution models inputs using symbolic variables, and thus can explore how different executions are induced by different values of these variables. These values are discovered by solving the constraints generated during execution, as we'll see in more detail shortly. >> Okay, now that we've defined our symbolic language, we need to define its semantics. One way to do that is to make or modify a language interpreter so that it can compute symbolically. Normally a program's variables contain values, but for our symbolic interpreter they will also be able to contain symbolic expressions, which are program expressions that mention symbolic variables. As example normal values, we have the integer 5 and the string "hello". As example symbolic expressions, we have alpha plus 5; the string "hello" concatenated with some symbolic string alpha; the expression alpha plus beta plus 2 used as an index into the array a; and so on. So let's see how we use these symbolic expressions in an example execution. Here we have a simple straight-line program. On the left-hand side we'll consider concrete memory; we'll look at how a concrete, that is, normal, execution of the program might go, and then we'll look at how it might go for our symbolic interpreter. 
So on the first line, we perform the read. Let's say, for argument's sake, it reads the value 5. Next, we assign to y. What do we assign it? Well, 5 plus the contents of x, which are 5, is 10. Next, we execute line three. Here we store 17, because 7 plus 10 is 17. And then finally, we use z, which is 17, to index the array a. Notice here that we're assuming a contains four values, so a sub 17 is off the end of the array; that is, it's an out of bounds access. Okay, now let's consider the symbolic execution of this program. Instead of returning a concrete value for x, the read will return a symbolic variable alpha, and we store that instead. For y, we can't store a plain value anymore, because 5 plus alpha can't be further reduced, so instead y is assigned the symbolic expression 5 plus alpha. Next, we assign to z. Here, 7 plus 5 plus alpha is 12 plus alpha, so we store that. And then, finally, we use the symbolic expression 12 plus alpha, the contents of z, to index the array a, and this could be an overrun. Of course, we just saw an example of one in the concrete execution. That is, there is a solution for the expression 12 plus alpha that creates an index outside of a's bounds. We'll explain how we can find that in just a moment. Now, as we saw with our overview example, program control can be affected by symbolic values. If you branch on a symbolic expression, it could be that, depending on the solution of the variables in that expression, you could go either true or false, or both. In this example, we see that we read into x and then branch on it, and so that's a branch on a symbolic expression. We represent the influence of symbolic values on the current path, that is, the list of the branches we have taken, true or false, as a path condition pi. So, as an example, line three is reached when the path condition is alpha greater than 5. Line five is reached when alpha is greater than 5 and, because it's nested, alpha is less than 10. 
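Here's a minimal sketch of the symbolic store for this straight-line program, in Python, with symbolic expressions represented as tuples like ('+', 12, 'alpha') (a made-up representation, of course). A brute-force search over small values of alpha stands in for the solver query that asks whether the index can fall outside the array's bounds:

```python
def add(n, v):
    # Add a concrete n to a (possibly symbolic) value, constant-folding
    # where we can: n + (m + alpha) becomes (n + m) + alpha.
    if isinstance(v, int):
        return n + v
    if v[0] == "sym":
        return ("+", n, v[1])
    if v[0] == "+":
        return ("+", n + v[1], v[2])
    raise ValueError(v)

def sym_exec():
    # Symbolically execute:  x = read(); y = 5 + x; z = 7 + y;
    store = {}
    store["x"] = ("sym", "alpha")    # read() yields a fresh symbolic alpha
    store["y"] = add(5, store["x"])  # y holds 5 + alpha
    store["z"] = add(7, store["y"])  # z folds to 12 + alpha
    return store

def may_overflow(term, lo, hi):
    # Can some alpha push the index n + alpha outside [lo, hi)?
    # Brute force over a small range, standing in for a solver query.
    _, n, _ = term
    return any(not (lo <= n + a < hi) for a in range(-64, 64))
```

Running sym_exec() leaves z holding ('+', 12, 'alpha'), and may_overflow confirms there's a solution putting 12 plus alpha outside the bounds of the four-element array a.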
And line six, when alpha is less than or equal to 5. That is, we've taken the false branch in the case that we checked whether alpha was greater than 5, alpha being the contents of x. Now, whether a path is feasible is tantamount to its path condition being satisfiable. So here we list three path conditions for this program. It's a slight modification of the program we just saw, with a branch on line four: is x less than 3? We can see here that the path with condition alpha greater than 5 and alpha less than 3 is not possible. And indeed, this path condition is not satisfiable: there is no alpha such that both constraints hold. Now, given a path condition, we can come up with a solution to the constraints in it, and that solution can be used as the input to a concrete test case that will execute exactly that path. So the solution to reach line three could be alpha equals 6, and a solution to reach line six could be alpha equals 2. Now, assertions, like array bounds checks, are conditionals. So if we look at our original program, we can think of, just prior to indexing the array a, performing two checks to see whether the index expression z is within the bounds of the array. That is, is z less than 0, and is it greater than or equal to 4? If the answer is yes, we've gone out of bounds and we should abort the program. So we can think of our symbolic executor as inserting these extra statements and then performing its normal process of generating a path condition. Along these first four lines, pi is true because there are no branches. Now, to reach line five, it would have to be the case that 12 plus alpha, that is, the symbolic expression we had stored in z, is less than 0. If we managed to get past this, that is, we took the else branch to get to line six, then we would have the negation of that condition in the path condition.
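The connection between feasibility and satisfiability can be sketched directly. A real executor would hand the path condition to an SMT solver; here, as an illustrative simplification, we just enumerate a small range of candidate values for alpha.

```python
# Brute-force satisfiability check for path conditions over one variable.
# Each constraint is a predicate on alpha; a returned witness doubles as a
# concrete test input that drives execution down exactly that path.

def satisfiable(constraints, lo=-100, hi=100):
    """Return a witness value for alpha satisfying all constraints, or None."""
    for alpha in range(lo, hi + 1):
        if all(c(alpha) for c in constraints):
            return alpha
    return None

# pi for line 3: alpha > 5 -- feasible, witness 6
print(satisfiable([lambda a: a > 5]))
# pi for line 6: alpha <= 5 -- feasible
print(satisfiable([lambda a: a <= 5]))
# alpha > 5 AND alpha < 3 -- infeasible, no witness
print(satisfiable([lambda a: a > 5, lambda a: a < 3]))  # None
```

The infeasible conjunction is exactly the lecture's third path condition: no alpha makes both constraints hold, so no test input can ever drive the program down that path.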
To get to line seven, we would also have to check that z is greater than or equal to 4, that is, 12 plus alpha is greater than or equal to 4, along with that negation. If that was not true, then the path condition would negate that as well, and to reach line eight, we would have this path condition. So, if either line five or line seven is reachable, that is, the path conditions reaching them are satisfiable, we have found an out-of-bounds array access. >> Whereas a normal execution must take either the true or false branch of a conditional, a symbolic execution could actually take both branches. This is possible because when a symbolic variable, or variables, appear in a conditional's guard expression, there could exist some solutions that would make the guard true, and some solutions that would make it false. As such, a symbolic executor could choose to go either, or both, ways. How should it choose which, and in what order? One point to make is that the executor doesn't immediately know whether the true or false branch, or both, is feasible when it reaches a conditional. To know for sure, it could invoke the solver to check the satisfiability of the path condition. If the path condition conjoined with the guard is feasible, then the true branch can be explored. Likewise, if the conjunction with the negation of the guard is feasible, then the false branch can be explored. However, making calls to the solver at each conditional can be expensive. So one idea is to just pick one branch and go that way, delaying the feasibility check, at the risk of doing extra, unnecessary work. A related approach is to use concolic execution to determine the path to take. The idea here is to run the program concretely on actual inputs, but treat some of those inputs as if they were symbolic on the side. The concrete inputs, which guarantee feasibility, determine which direction to go.
The symbolic executor maintains symbolic versions of these expressions and the path condition as usual, which it then uses to generate subsequent tests. We'll say more about concolic execution a bit later. Okay, so now we can consider a symbolic execution algorithm as pseudocode, and it's very simple. First, we create the initial task, which is a triple. It consists of the program counter location, which is 0, the start of the program; the path condition pi, which starts out as empty; and the symbolic state, which also starts as empty. And we add that task to the worklist. Now, while the worklist is not empty, we will pull some task from that worklist and we will execute it. Eventually we'll reach a point where that execution forks. Let's say the state at that point has the program counter at pc0, the path condition at pi0, and the symbolic state at sigma0. And let's say the reason for the fork is a branch: we're branching on p at line pc0, where the true branch is at pc1 and the false branch is at pc2. Well, in that case, what we'll do is add the task (pc1, pi0 and p, sigma0) to the worklist, as long as pi0 conjoined with p is feasible. That is, when the current path condition conjoined with the guard, which contains symbolic variables, is feasible, then we have a legitimate task that we can execute, and so we should add it to our list. We'll add the alternative, that is, going down the false branch, in the case that not p conjoined with the path condition is feasible. And notice here that we could add either or both tasks to the worklist. So we can keep on running until either the list becomes empty, in which case we've considered every path in the program, or eventually we get tired and terminate, say after some timeout. Now, in a practical implementation, we have to consider not just the main program we're executing, but also libraries and native code.
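The worklist pseudocode above can be sketched in a few lines of Python. The tiny hard-coded program and the brute-force feasibility check are illustrative stand-ins for a real executor and a real SMT solver.

```python
# Worklist symbolic execution: tasks are (pc, path_condition, state) triples.
# Branching on a symbolic guard forks two tasks, each added only if its
# extended path condition is feasible.

def feasible(pi):
    """Brute-force stand-in for an SMT solver call."""
    return any(all(c(a) for c in pi) for a in range(-50, 51))

# Program: 0: if alpha > 5 goto 1 else goto 2 ; 1 and 2 are leaf "lines".
program = {0: ("branch", lambda a: a > 5, 1, 2),
           1: ("leaf", "then"),
           2: ("leaf", "else")}

def explore():
    worklist = [(0, [], {})]          # initial task: (pc0, empty pi, empty sigma)
    reached = []
    while worklist:
        pc, pi, sigma = worklist.pop()
        instr = program[pc]
        if instr[0] == "leaf":
            reached.append(instr[1])
            continue
        _, guard, t, f = instr
        if feasible(pi + [guard]):                    # true branch feasible
            worklist.append((t, pi + [guard], dict(sigma)))
        neg = lambda a, g=guard: not g(a)
        if feasible(pi + [neg]):                      # false branch feasible
            worklist.append((f, pi + [neg], dict(sigma)))
    return reached

print(sorted(explore()))  # ['else', 'then'] -- both sides of the branch are feasible
```

Because the guard mentions a symbolic variable, both the true and false tasks get added, matching the lecture's point that either or both branches may be pushed onto the worklist.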
At some point, the symbolic executor will reach the edges of the application, and the interpreter won't be able to make further progress. This may happen with a library, or system calls, or assembly code, or something like that. So one thing we could do is pull that code in, too, and symbolically execute it. For example, if we're symbolically executing C code, we could symbolically execute the C code that implements the standard libraries. But real implementations of the standard libraries are often very complicated, and they might involve assembly code and the like. And so a symbolic executor will often get stuck running around loops and not making any progress getting out of that library. So you could use a simpler version of the library instead that's still semantically accurate. Or you could create a model of the code that you want to execute. For example, you could create a model of a RAM disk to model the kernel file system code, rather than trying to, say, symbolically execute the Linux kernel. So let's return to the idea of concolic execution I mentioned before; this is also sometimes called dynamic symbolic execution. The idea here is that, instead of running the program according to the algorithm we just saw a moment ago, we run the program concretely, but we instrument it to do symbolic execution on the side. That is, the instrumentation maintains, alongside the concrete program state, a shadow symbolic state involving symbolic variables. The initial concrete state is going to determine the path that we take, so it could be randomly generated. But we will keep track of the choices we made when guards involve our shadow symbolic variables, so we can build up a path condition along the way. A concolic executor will thus explore one complete path at a time, start to finish. And once it's done so, it can determine the next path to explore by simply negating some element of the last path condition.
And then solving that negated path condition will produce concrete inputs for the next test. While we're symbolically executing using concolic execution, we'll always have the concrete underlying values to drive the path. But these turn out to be useful in other ways too. That is, the process of concretization is very easy in a concolic execution. And what I mean by concretization is replacing symbolic variables with concrete values that satisfy the path condition. Basically, we're going to drop symbolic-ness, and potentially miss some paths, but we're going to simplify our symbolic constraints by doing so. This allows us, for example, to do system calls that might involve symbolic variables. We just pick reasonable concrete values for those variables and do the system calls. In so doing, we lose the symbolic-ness, but we're able to continue to make progress in trying to find bugs. Replacing symbolic values with concrete ones can also be useful when SMT solver calls would be too complicated if we left all of the symbolic values in.
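The concolic loop just described can be sketched on the earlier branching example: run the program on a concrete input while recording, on the side, the path condition over the symbolic input alpha; then negate the last recorded branch and solve for a new input. The brute-force solve function is an illustrative stand-in for an SMT solver.

```python
# Concolic execution sketch: concrete runs drive the path, a recorded trace
# of symbolic branch outcomes generates the next test input.

def program(x, trace):
    """Models: if x > 5: (if x < 10: line5 else: other) else: line6."""
    if x > 5:
        trace.append(("gt5", True))
        return "line5" if x < 10 else "line_other"
    trace.append(("gt5", False))
    return "line6"

def solve(pi):
    """Find a concrete alpha matching the requested branch outcomes."""
    for a in range(-50, 51):
        if all((a > 5) == want for name, want in pi if name == "gt5"):
            return a
    return None

# Generation 0: some concrete input (say 7) drives one complete path.
trace = []
print(program(7, trace))            # line5; recorded pi is [("gt5", True)]

# Negate the last recorded branch, solve, and run the next test.
name, taken = trace[-1]
new_input = solve([(name, not taken)])
print(program(new_input, []))       # line6 -- the previously unexplored side
```

Each iteration explores one complete path start to finish, and negating a trace element yields the input for the next path, exactly as in the lecture's description.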
Symbolic Execution as Search, and the Rise of Solvers
[SOUND] As is probably apparent, a good symbolic executor requires careful engineering. We've seen that symbolic execution employs an appealingly simple algorithm, but with high computational costs. In particular, the executor might traverse many different paths in the program, and it must make potentially expensive calls to an SMT solver to determine which paths are feasible. Considering the problem of traversing many paths, we will see that symbolic execution boils down to a kind of search problem. The goal is to search through a large space of possibilities to find events of interest. For us, these events are bugs. We'll also take a minute to look at how improvements in SMT solver performance make the path feasibility check possible even for larger programs. A well-known problem with symbolic execution is the path explosion problem. Essentially, for a symbolic executor to consider the entirety of a program's space of executions, it needs to consider every path. But because most programs have a huge number of paths, we can't usually run symbolic execution to exhaustion. In fact, the number of paths can be exponential in the branching structure. Here we see a program that has four lines of code, and yet, because it branches on three variables, it has eight program paths, that is, two to the third possible paths, despite using each variable only once. Loops on symbolic variables are even worse. In this program we're looping on variable a, which is symbolic. This is going to cause us to consider an arbitrary number of loop iterations, because every time we re-reach the head of the loop, we consider whether to stay in the loop or whether to exit. Now, if we compare symbolic execution to static analysis, we can see that there's a clear benefit of static analysis: it will actually terminate even when considering all possible runs. And it does this by approximation and abstraction, approximating multiple loop executions or branch conditions, and so on.
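The exponential blow-up in the four-line example can be checked directly: three independent symbolic branches mean every combination of true/false outcomes is a distinct path.

```python
# Each branch contributes an independent true/false choice, so three
# branches yield 2**3 = 8 distinct program paths.
from itertools import product

paths = list(product([True, False], repeat=3))  # one boolean per branch
print(len(paths))  # 8
```

With n independent branches the count is 2**n, which is why exhaustive path enumeration becomes hopeless so quickly.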
But as we discussed to motivate symbolic execution, static analysis's use of abstraction can lead to false alarms. So what can we do to improve the way that symbolic execution operates, to maximize its benefits? Well, to understand that, we have to look at how symbolic execution can be viewed as a search algorithm. The simplest way to perform symbolic execution is to use either depth-first search or breadth-first search. That is, recalling our algorithm for symbolic execution, for depth-first search we can use a worklist that is a stack, so it will consider the most recent node, that is, the most recent program state, when deciding what to do next. Or we can make it a queue, where we prioritize things that we put into the queue earlier. Now, the potential drawback of using either one of these two strategies is that neither of them is guided by any higher-level knowledge. That is, we aren't telling the search that we might be looking for particular bugs, or that we want to get to a particular part of the program. Instead, it will just blindly follow the program structure in considering what paths to execute. DFS in particular could easily get stuck in one part of the program by continuing to go deeper and deeper. Think about that loop example: we could go around the loop over and over and over again, whereas in breadth-first search, we'll consider both one further loop execution and the case that we get out of the loop. Therefore, of these two, breadth-first search is probably a better choice. On the other hand, it is more intrusive to implement; for example, it can't easily be made concolic. So what better search strategies might we employ? Well, one thing we can do is try to prioritize our search, to steer it towards paths more likely to contain assertion failures, since that's ultimately what we're trying to find. And recall that assertion failures could involve array bounds checks, null pointer checks, and so on.
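The stack-versus-queue distinction is literally the only difference between DFS and BFS in the worklist algorithm, as this tiny sketch shows.

```python
# The worklist discipline determines the search order:
# a stack (LIFO) gives depth-first search, a FIFO queue gives breadth-first.
from collections import deque

stack = [1, 2, 3]           # DFS: explore the most recently added task first
print(stack.pop())          # 3

queue = deque([1, 2, 3])    # BFS: explore the earliest-added task first
print(queue.popleft())      # 1
```

Swapping `pop()` for `popleft()` in the explore loop is all it takes to switch strategies, which is why neither embodies any knowledge about where the bugs are.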
We only want to run for a certain amount of time, and so the prioritization ensures that that time is used best. To consider these different search strategies, think of program execution as a kind of DAG. The nodes in the graph are program states, and an edge between two nodes means that one state can transition to the next. Therefore, we can think of symbolic search as a kind of graph exploration algorithm, trying to pick among the various possible paths in the overall search space. One useful technique to employ is randomness. We don't know a priori which paths to take, so adding some randomness is probably a good idea. This is something that modern SAT solvers do, and it works very effectively in that setting. So one idea is to pick the next path to explore uniformly at random, and we can call this the random path strategy. Another thing we can do is randomly restart the search if we haven't found anything interesting in a while. So imagine that we explore depth-first for a while, but as we go further and further down a particular path, we increase our chances of stopping and going back to the start to try a different path altogether. And we could choose among equal-priority paths at random. That is, if we had a prioritization strategy that, say, favored coverage, because we wanted to try paths that cover statements we haven't covered before, and several paths had equal priority, we could just flip a coin to pick one versus the other. Now, one drawback of randomness is reproducibility. That is, because we may have made random choices to find a particular bug, if we run the symbolic executor on the same program two times, it's not guaranteed to find the bugs that it found before, and in fact it might find new bugs. This doesn't necessarily sound so bad, except that it complicates software engineering: for example, once we fix a bug that we found, how can we run the symbolic executor again to confirm that it's not there anymore?
So it's probably good to use pseudo-randomness based on a seed, and then record which seed was picked when bugs are found. That way you can confirm that you don't hit the bug again, or if you re-run, you'll find the same bugs you found before. Now, a moment ago I hinted at the idea of using coverage to guide your prioritization strategy. Here we'll consider it in a bit more detail: we should try to visit statements we haven't seen before when choosing which paths to follow. So the approach is to score a statement based on the number of times it's been traversed, and then pick the next statement to explore that has the lowest current score. This might work because errors are often in hard-to-reach parts of the program, and the strategy will favor going to parts of the program it hasn't seen before. On the other hand, we may never be able to get to a statement if a proper precondition is not set up, and so just favoring going closer to that statement is not necessarily going to get us there. Another strategy is called generational search. You can think of it as a kind of hybrid between breadth-first search and a coverage-guided search. It works like this. At generation 0, we pick a program and a random input to it, and run it to completion. At the end, we take paths from generation 0 and negate one branch condition in a path to yield a new path prefix. We find a solution for that prefix and then take the resulting path, semi-randomly assigning to any variables not constrained by the prefix. For generation n, we similarly apply this approach, but branching off of generation n minus 1. And we use a coverage heuristic to pick the priority among the different paths that we choose. This generational search is often used with concolic execution. Now, it's probably obvious at this point that no one search strategy wins all the time. Instead, different search strategies may find some bugs that other search strategies will not.
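The coverage-guided scoring idea can be sketched with a priority queue keyed on traversal counts. The statement names and counts here are purely illustrative.

```python
# Coverage-guided prioritization: score each candidate statement by how
# often it has been traversed, and explore the least-covered one first.
import heapq

visit_count = {"stmt_a": 5, "stmt_b": 0, "stmt_c": 2}   # illustrative counts

pending = [(count, stmt) for stmt, count in visit_count.items()]
heapq.heapify(pending)

score, nxt = heapq.heappop(pending)  # lowest score wins
print(nxt)   # stmt_b -- the never-visited statement is chosen first
```

After each execution step the counts are updated, so the heap steadily steers the search toward code it has not yet seen.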
So one obvious idea is to combine searches by doing multiple searches at the same time: essentially, have parallel processes and switch from one to the next. Depending on the conditions needed to exhibit the bug, one search may find it where another does not. You could even imagine using different algorithms to reach different parts of the program, depending on the properties of the program that you're trying to find bugs in. SMT stands for satisfiability modulo theories, where by theories we mean mathematical theories beyond just boolean formulas. Such theories could include a theory of arithmetic over integers, for example. These theories can sometimes be encoded as SAT problems, and so some SMT solvers are basically just front ends that translate the SMT problem into a SAT problem. For example, the theory of bit vectors, that is, arrays of bits with operations on them like addition, subtraction, and so on, can be encoded as a SAT problem. It turns out that modern SMT solvers implement a variety of optimizations, either in the SAT engine or in the translation to it, and these optimizations can make a big difference in performance. One example is to make use of axioms that witness equivalences. Another example is to directly support, in the SMT solver, a theory of arrays, which models modifications to memory. The alternative would be to have the symbolic executor simulate memory directly, rather than rely on the solver, but then optimizations become less apparent. Another important optimization is to cache solver queries. It turns out that symbolic executors will repeatedly submit queries that contain identical subexpressions, and this makes sense because the path condition is built up incrementally over time. Finally, we can make the SAT solver's job easier by eliminating some symbolic variables from consideration. For example, if we are testing whether a guard is feasible, we only care about the existence of a solution for the variables in the guard.
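The query-caching optimization can be sketched with a simple memo table. Because the path condition grows incrementally, identical queries recur, and a cache avoids repeated solver calls; the brute-force solver and the call counter here are illustrative.

```python
# Caching solver queries: the second submission of an identical query is
# answered from the cache instead of re-invoking the (expensive) solver.

calls = {"count": 0}

def expensive_solve(constraints):
    """Stand-in for an SMT solver call; counts how often it really runs."""
    calls["count"] += 1
    return any(all(c(a) for c in constraints) for a in range(-50, 51))

cache = {}

def cached_solve(key, constraints):
    """key is a hashable rendering of the query, e.g. its textual form."""
    if key not in cache:
        cache[key] = expensive_solve(constraints)
    return cache[key]

query_key = ("alpha > 5",)
cached_solve(query_key, [lambda a: a > 5])   # miss: solver runs
cached_solve(query_key, [lambda a: a > 5])   # hit: answered from cache
print(calls["count"])  # 1
```

Real executors cache at the level of normalized subexpressions, so even partially overlapping path conditions can reuse earlier results.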
Such a solution may involve other variables, which can be syntactically related to the ones in the guard. That is, we keep under consideration constraints whose variables are transitively related to variables in the guard; the other constraints can be ignored. In total, these optimizations can make an enormous difference. SMT solvers have been a popular topic of research and development over the last several years, and there are several that you can get for free online. Z3 is quite sophisticated; it's developed at Microsoft Research. Yices, developed at SRI, has been around for quite a while and continues steady improvement. STP, which is the SMT solver used by EXE and KLEE, is available for free. And CVC is yet another. While smarter search strategies and SMT-based optimizations will help, ultimately path-based search is limited. To see why, consider this program. We can see that it has a loop that runs 100 times, and each time it goes around the loop, it may either enter the if statement or not. This means it has 2 to the 100 possible execution paths. Now, the program asserts that it's a bug if the counter is ever equal to 75. So this bug is going to be hard to find. There are about 2 to the 78 paths that reach the buggy line of code, that is, 100 choose 75: out of the 100 iterations, exactly 75 of them need to take the true branch. But then the probability of finding the bug is very low, because about 2 to the 78 paths find it out of 2 to the 100 total. So the chances are about 2 to the negative 22 that we will find the bug. In short, independent of search strategy, we'll have a very difficult time finding the bug using a path-based search algorithm. This is a fundamental limitation of path-based search in symbolic execution.
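The arithmetic behind the 2 to the negative 22 figure is easy to check: of the 2^100 paths, the buggy line is reached on exactly 100-choose-75 of them.

```python
# Check the path-explosion arithmetic from the loop example.
import math

total_paths = 2 ** 100
buggy_paths = math.comb(100, 75)   # paths where exactly 75 iterations take the branch

print(round(math.log2(buggy_paths)))                # about 78
print(round(math.log2(buggy_paths / total_paths)))  # about -22
```

So a uniform random choice among paths hits the bug with probability roughly one in four million, regardless of how cleverly the individual paths are solved.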
Symbolic Execution Systems
[SOUND] Symbolic execution was invented in the 1970s, but as I said before, it failed to take hold at that time as a commonly used tool, and this is because of limitations in both the algorithms and the machine power available at that time. In the mid-2000s, two key systems triggered a revival of symbolic execution. These systems were DART, developed by Godefroid and Sen and first published in 2005, and EXE, by Cadar, Ganesh, Pawlowski, Dill, and Engler, published in 2006. These systems demonstrated the promise of symbolic execution by showing it could find real bugs in interesting and complicated systems, and as such they spurred interest in the topic, and many new systems have been developed since. One important system that has made a big impact in practice is SAGE. It is a concolic executor, developed at Microsoft Research, that grew out of Godefroid's work on DART. It uses the technique of generational search that we talked about before. SAGE primarily targets bugs in file parsers, for example parsers for JPEG, Microsoft Word, Microsoft PowerPoint, and other document formats. This is a good fit for concolic execution because parsing files is likely to terminate, and it only needs to consider input/output behavior rather than complicated and interactive system calls. SAGE has transitioned from research prototype to production tool; it's been used on production software at Microsoft since 2007. Between 2007 and 2013, SAGE was used remarkably often at Microsoft. For example, it's been run for more than 500 machine-years as part of the largest fuzzing lab in the world. It's generated more than 3.4 billion constraints, making it the largest SMT solver usage ever. It's been applied to hundreds of applications, and found hundreds of bugs that were missed by other bug-finding techniques. For example, one third of all Win7 WEX bugs were found by SAGE. Bug fixes shipped quietly to one billion PCs around the world, saving millions of dollars both for Microsoft and the world.
And SAGE is now used daily in Windows, Office, and all of the big Microsoft products. KLEE is another mature tool; in this case, it grew out of the work on EXE. It symbolically executes LLVM bitcode. LLVM is a compiler framework that's now in regular use at Apple. It compiles source-language programs to an intermediate representation called LLVM bitcode, which is stored in a .bc file, and KLEE runs on that .bc file. It works in the style of the basic symbolic executor that we saw before, but it uses fork to manage multiple states. That is, rather than having an explicit worklist, each time it could go both directions at a branch, it forks another version of itself, which proceeds along that branch, and a separate tool manages all of the distinct copies of KLEE that are running at once. KLEE employs a variety of search strategies, primarily random path and coverage-guided. And it mocks up the environment to deal with system calls and file access and so on. You can get it as part of the LLVM distribution. Looking at how KLEE actually works: in the original KLEE paper, published in 2008, they applied KLEE to Coreutils, which is the suite of small programs that runs on Linux and Unix distributions. These are programs like mkdir and paste and sed and ls and things like that. The graph shows the amount of coverage from KLEE-generated tests compared to the manual test suite that comes with Coreutils. What this chart shows is that KLEE is better at testing than the manual test suite, and we make that determination by looking at the lines of code covered by tests. In particular, when we subtract the lines covered by manual tests from the lines covered by KLEE tests, we see that for all but nine programs, KLEE covers more lines than the manual tests do. Here are a bunch of bugs that were found by KLEE: these are the command lines that produced the bugs, that caused these Coreutils programs to crash.
Notably, the bugs involved very simple command lines and very simple input files, and yet they went undetected for quite a long time before KLEE found them. Since these second-generation tools, KLEE and SAGE, there have been further developments. One of them is the Mayhem system, developed at Carnegie Mellon by Dave Brumley and his collaborators, and it runs on actual executables, that is, binaries. Like KLEE, it uses a breadth-first-search-style search, but it also uses native execution, combining the best of symbolic and concolic strategies. And as an interesting twist, it will automatically generate exploits when bugs are found, to shine a light on the fact that these bugs could in fact be security-critical. In further developments from the Mayhem work, the same group produced MergePoint, which uses a technique called veritesting that combines symbolic execution with static analysis. What it does is use static analysis for complete code blocks: it will analyze a straight-line piece of code and in so doing basically convert it into a formula that can be used by an SMT solver to ask a question about the program's execution. It will use symbolic execution for hard-to-analyze parts of the program, in particular loops, for which it may be hard to determine how many times they will run, as well as complex pointer arithmetic, system calls, and so on. This produces a better balance of time between the solver and the executor, resulting in better bug finding: bugs are found faster, and more of the program is covered in the same period of time. Incredibly, MergePoint found more than 11,000 bugs in 4,300 distinct applications in a Linux distribution, including new bugs in highly tested code, like the Coreutils programs tested by KLEE. There are many other symbolic executors. Cloud9 is basically a parallel KLEE. jCUTE and Java PathFinder perform symbolic execution for Java.
Bitblaze is a binary analysis framework, so like Mayhem it works on binary executables. Otter is a directed symbolic execution framework for C: you can tell it which lines of the program you'd like to reach, and it will try hard to get to those lines when performing its search. Pex is a symbolic executor for .NET programs. >> This unit delved into the details of symbolic execution, a powerful technology that can be used to find security-critical bugs in real software. Symbolic execution can be viewed, on the one hand, as a generalization of testing: it uses static analysis to develop new tests that explore different program paths. We can also view it as a kind of static analysis, but one with very high precision and no guarantee of termination; if it does terminate, its results are both sound and complete. Symbolic execution is used in practice today to find important bugs in production code. The SAGE tool is used heavily at Microsoft, and the MergePoint tool regularly analyzes large Linux repositories. Symbolic execution tools of good quality are freely available, and we can expect that they, like static analysis tools, will penetrate more deeply into the mainstream in the near future.