Monday, August 31, 2015

Software Security - Week 5 - Notes

Static Analysis- Introduction part 1

A key element of a security-conscious software development process is performing code reviews with the assistance of tools. Increasingly, such tools employ a technology called static analysis, and in this unit we'll look at static analysis in detail.

We begin with an introduction: what static analysis is and why it is useful. Then we'll look at how static analysis works. To understand the basics, we will develop a flow analysis that tries to understand how tainted values flow around a program, with the goal of finding bugs. Then we will look at how to increase the precision of the basic analysis by making it context, flow, and path sensitive. We will finish up by filling in some missing details: how an analysis handles more challenging programming language features, how it can be customized to different settings, and how it can be made more impactful in practice.

Current practice for software assurance tends to involve testing. The idea is to make sure the program runs correctly on a set of inputs that seem relevant to its operation. We provide those inputs to the program, the program produces some outputs, and an oracle that the tester constructs confirms that the outputs are the ones expected. Testing is great because, if a test fails, you have a concrete failure that you can use to help find the issue, and to establish whether or not the failure is a true one. On the other hand, coming up with test cases, actually running them, and doing so in a way that covers all of the relevant code paths is very difficult and provides no guarantees. The no-guarantee part is especially troubling for security, because it only takes one failed code path to lead to a vulnerability that could take down an entire system.

A complementary approach is manual code auditing: convincing someone else on your software development team that the source code is correct. The benefit is that humans, when they review code, can generalize beyond single runs, so they can do better than single test cases; they can imagine executions in their head that nobody has written test cases for. On the other hand, human time is very expensive, especially compared to computer time, and once again this is a difficult task that provides no guarantees of coverage. And since we are worried about security, a malicious adversary will exploit exactly the things we miss.

So, static analysis to the rescue. The idea is to use a computer program to analyze a program's source code without running it; in a sense, we are asking a computer to do what a human might have done during a code review. The benefit is much higher code coverage. The static analyzer can reason about many possible runs of the program, and in some cases even all possible runs, thereby providing a guarantee that the property we have attempted to establish, hopefully a reasonable and important security property, is in fact true all of the time.
We can also reason about incomplete programs, like libraries, that would otherwise be challenging to test. Now, as it turns out, static analysis (surprise, surprise) has drawbacks too. It tends to analyze only limited properties rather than complete functional correctness. It may miss some errors, or raise false alarms, for engineering reasons. And it may be time consuming to run.

Even so, static analysis can have significant impact on a security-oriented development process. Because static analysis can thoroughly check limited but useful properties, and thereby eliminate entire categories of errors, it frees up developers to concentrate on deeper reasoning. Static analysis tool usage can also encourage better development practices: programmers who use tools start to develop programming models that avoid mistakes in the first place, and they end up thinking about, and making manifest, their assumptions. For example, they will use annotations that tools understand in order to improve tool precision by specifying what they intend for a program to do. Static analysis tools are becoming a commercial reality, and companies are starting to feel this impact today.

Static Analysis- Introduction part 2

What is static analysis? To answer that question, let's first ask a different one: what can static analysis do? A classic static analysis problem is the halting problem. It asks: can we write an analyzer that can prove, for any program P and any input to it, whether P will terminate? Visually, we start with the program P, feed it and its input to our analyzer, and the analyzer determines whether or not P terminates. To solve the halting problem, we would have to build such an analyzer; the question is whether doing so is even possible. As it turns out, the answer is no. It has been proved that the halting problem is undecidable. That is, it is impossible to write a general analyzer that can answer the halting question for all programs and all inputs.

While we cannot establish termination behavior, maybe we can establish other security-relevant properties. For example, perhaps we can use static analysis to prove that all array accesses are in bounds. If we could do this, we could eliminate a large source of memory safety violations, namely buffer overruns. Unfortunately, this question is undecidable too, as is essentially every other interesting property. We can prove this claim by showing that an analyzer for array bounds checking could be used to decide the halting problem, which we know is undecidable; as such, array bounds checking must also be undecidable. Here are some more example questions that are undecidable (that all such properties are undecidable follows from Rice's theorem): Is an SQL query constructed from untrusted input? Is a pointer dereferenced after it is freed? Is an access of a variable a possible source of a data race? There are many other such questions.

So let's look at how the halting problem can be viewed as equivalent to the question of whether an array index is in bounds. This is a kind of proof by transformation that interesting program analysis problems are equivalent to the halting problem. First, we take every indexing expression a[i] in the program and rewrite it so that an out-of-bounds access becomes an exit instead: we replace a[i] with a bounds check (is i greater than or equal to 0 and less than the array's length?); if so, we perform a[i] as normal, and otherwise we exit. Now an array bounds error is instead a termination. Next, we change all program exit points to instead be out-of-bounds accesses. If the program would exit normally by completing the main function, for example, or by calling exit, or by throwing an exception that is never caught, then at those exit points we insert an access such as a[length + 10], an index that is certainly out of bounds. Now we take this transformed program and pass it to the analyzer that we hypothesized can perfectly check whether or not an array access is in bounds. If the array bounds checker finds an error, then the original program must have halted. On the other hand, if it claims there are no such errors, the original program does not halt.
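To make the reduction concrete, here is a minimal sketch of the transformation in C. The toy program, the buffer size, and the helper read_index() are invented for illustration; the point is only that the transformed program performs an out-of-bounds access exactly when the original program reaches a normal exit.

    /* Original program: halts by returning from main. */
    int read_index(void);              /* hypothetical: returns some input-dependent value */

    int main(void) {
        int a[10];
        int i = read_index();
        a[i] = 1;                      /* the array access of interest */
        return 0;                      /* normal exit point */
    }

    /* Transformed program, fed to the hypothetical perfect bounds checker. */
    #include <stdlib.h>

    int read_index(void);

    int main(void) {
        int a[10];
        int i = read_index();
        if (i >= 0 && i < 10)
            a[i] = 1;                  /* rewrite 1: guarded access -- a bounds error in
                                          the original now just terminates instead       */
        else
            exit(0);
        a[10 + 10] = 1;                /* rewrite 2: the normal exit becomes an
                                          out-of-bounds access                            */
        return 0;
    }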
But this is a contradiction with the undecidability of the halting problem, because we have now shown that an array bounds checker can decide the halting problem, which we know is undecidable.

So does this mean that static analysis is impossible? Not exactly. What it means is that perfect static analysis is not possible. Useful static analysis is perfectly possible, on the other hand, despite the fact that the analyzer may fail to terminate, or may emit false alarms (claimed errors that are not really errors), or may have missed errors (the analyzer reports no problems, but in fact the program is not error free). As a practical matter, because non-terminating analyses are confusing to users, tools that exhibit only false alarms or missed errors are the norm. In particular, these tools fall somewhere between sound and complete analyses. Let's go into what I mean by that.

Soundness is a property that states: if the analysis says that X is true, then X is actually true. In other words, the things the analysis says are contained in the true things; a trivially sound analysis is one that says nothing. Completeness states: if X is true, then the analysis says that X is true; the true things are contained within the things the analysis says, so a trivially complete analysis says everything. A sound and complete analysis is one where the things it says are exactly the true things. And what we have shown is that, for problems like the halting problem, such sound and complete analyses do not exist; you have to sacrifice one or the other.

So you can go one of two ways. A sound analysis is one where, if the program is claimed to be error free, then it really is; alarms, however, do not necessarily indicate real errors. A complete analysis is one where, if the program is claimed to be erroneous, then it really is; but silence does not imply error freedom. In practice, the most interesting analyses are neither sound nor complete, but they usually lean toward either soundness or completeness.

Because perfect static analysis is impossible in general, our goal is simply to make a tool that is useful. Doing so is as much art as it is science. In particular, there are many different elements of an analysis that trade off with one another, and a practical tool must decide which elements are most important. Some of these elements are the following. A precise analysis aims to model program behavior very closely, to make the most informed decision about whether it has found a bug, so as to avoid false alarms. A scalable analysis will successfully analyze large programs without unreasonable resource requirements, that is, space and time; oftentimes scalability is a direct trade-off with precision. An understandable analysis takes its human user into account; for example, the analysis is designed so that alarms are easy to understand and are actionable. An analysis that enjoys all three features is easier to produce if its focus is clean code. The intuition is that code that is hard for a human to understand is also hard for a tool to understand, and vice versa.
As such, it may be reasonable for a tool to issue false alarms, or to run more slowly, when the cause is confusing or convoluted code patterns. In this way, we can view such poor performance of the static analyzer positively: if programmers clean up their code, they will reduce the total number of false alarms and perhaps improve the running time, while also making their code easier for humans to understand.

Flow Analysis

We are now going to look in depth at one particular kind of static analysis called flow analysis. Roughly speaking, a flow analysis tracks how values might flow between the different memory locations in a program. This kind of analysis is interesting because it can help us find bugs whose root cause is improperly trusting user input, in particular input that has not been validated. In a tainted flow analysis, untrusted input is considered tainted. Various operations in the program, on the other hand, expect to operate only on untainted data. If the flow analysis finds that tainted data could be used where untainted data is expected, there is a potential security vulnerability. Examples include the source string given to strcpy, whose length is assumed to be no greater than that of the target buffer; format strings given to printf, which could otherwise be used in format string attacks; and form fields from web pages used in the construction of SQL strings, which could be used in SQL injection attacks.

As a motivating application of tainted flow analysis, recall format string attacks. Here the adversary controls the format string: fgets reads from the network, and suppose the adversary is on the other side, sending data along, which the program stores into the name variable. Then this variable is passed as the format string to printf. The problem is that the attacker can set this format string in such a way as to exploit the program. For example, by setting it to %s%s%s, printf will attempt to read from arguments it assumes are on the stack; in this case they are not there, and the program crashes. Even worse, the attacker can set the format string to include %n, which will write to assumed arguments on the stack, which again are not there, and as a result the adversary can actually carry out code injection attacks. These bugs still occur in the wild because we cannot always simply specify the format string as a constant string.

Another way to look at this problem, in terms of types, is to imagine that the origin of the string, its level of trust, is specified as a type qualifier, either tainted or untainted. Here we specify that printf expects an untainted format string, where by untainted we mean that it is not controlled by a potential adversary. On the other hand, fgets could return data that is tainted, data that an adversary could control. As such, the program we saw before is illegal, because fgets returns a tainted string but printf expects an untainted one. These qualifiers don't match: tainted does not equal untainted. And so we have found a possible bug, in fact a security-relevant bug, in the program.

Let's carefully state the problem we want our analysis to solve: the no-tainted-flows problem. The goal is to establish that no tainted data is used where untainted data is expected. We identify trusted data using the untainted annotation and untrusted data using the tainted annotation. Unannotated data can be either, and the analysis will figure out which. A solution to the problem considers all possible flows of data through the program.
That is, it determines which sources can reach which sinks, and in particular whether a tainted data source can reach an untainted sink. We will aim to develop a sound analysis: for the flow problem, this means that if the analysis finds no illegal flows, then indeed there are none.

So which flows are legal? Well, if f expects a tainted parameter, it is actually okay to pass it an untainted value instead. Here we see a function f that takes a tainted int, and we pass it an untainted int: if f can handle something tainted, surely f can handle something untainted. On the other hand, if g expects an untainted argument, then we should not pass it a tainted argument; that's the situation we've been considering with format strings. g should accept only untainted data. So we can view the allowed flows as a kind of lattice, where untainted is less than tainted: untainted may flow to tainted, but tainted may not flow to untainted. Visually, we can draw the lattice as a kind of graph where higher values in the lattice are higher up in the graph.

For our analysis, we can think of flow analysis as a kind of type inference. If no qualifier is present, then we must infer it. To perform type inference, our algorithm takes the following steps. First, we create a name for each missing qualifier; that is, for each type that does not have a qualifier, we generate a fresh name for it, say alpha or beta. Then, for each statement in the program, we generate constraints of the form "qualifier 1 is less than or equal to qualifier 2", possibly involving qualifier variables. For example, for the statement x = y we would generate the constraint that qy is less than or equal to qx, where qy is y's qualifier and qx is x's qualifier. These qualifiers could either be constants, like untainted or tainted, or variables, like alpha or beta, if they were missing. After we have generated the constraints for the entire program, we solve them to produce solutions for alpha, beta, and so on. A solution is simply a substitution of qualifiers (like tainted or untainted) for names (like alpha or beta), such that all of the constraints, after substitution, are legal. If no solution exists, then we may have spotted an illegal flow.

Here is an example analysis. Recall the declarations of the printf and fgets functions we have already seen, and then consider this program. The first step is to generate qualifier names for those variables that do not have qualifiers attached to them, in this case name and x. The next step is to generate constraints for each of the statements in the program. For the first statement, fgets returns something tainted and assigns it to the name variable; therefore we generate the constraint that tainted is less than or equal to alpha. Next, we assign name to x, and as a result we generate the constraint that alpha is less than or equal to beta. Finally, we pass x to printf, which means that beta is less than or equal to untainted, the qualifier constant on printf's argument.
So now we can look at the constraints and see whether a solution exists. The first constraint requires that alpha equal tainted: the only thing that tainted can be less than or equal to is tainted itself. Therefore, to satisfy the second constraint we must also have beta equal to tainted, because, again, tainted is only less than or equal to itself. But then the third constraint cannot be satisfied, because tainted is not less than or equal to untainted. As a result, there is no possible solution for alpha and beta, and we have identified a potentially illegal flow.
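Written out as code, the example looks roughly like the following. The tainted and untainted qualifiers are annotations for the analysis rather than real C (the exact syntax varies by tool), so they are shown in comments and defined away; the comments also show the qualifier names and the constraint generated at each line.

    #include <stdio.h>

    /* Annotated view of the library prototypes used in this example:
         int   printf(untainted const char *fmt, ...);   -- requires an untainted format string
         tainted char *fgets(char *s, int n, FILE *f);   -- returns attacker-controlled data    */

    void vulnerable(FILE *net) {
        char buf[100];
        char *name = fgets(buf, sizeof(buf), net);  /* q(name) = alpha:  tainted <= alpha     */
        char *x    = name;                          /* q(x)    = beta:   alpha   <= beta      */
        printf(x);                                  /*                   beta    <= untainted */
    }
    /* tainted <= alpha <= beta <= untainted has no solution, so the analysis
       reports a potentially illegal flow from fgets to printf's format string. */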

Flow Analysis- Adding Sensitivity

Now let's consider improving our analysis by adding sensitivity, which increases the analysis's precision. Here's another example. We have the same printf and fgets prototypes as before, and the same start to the program: we read into the name variable from fgets. But we've now changed the program to branch, so that in the true branch we assign name to x, and in the false branch we assign x the value "hello". Finally, after the branch concludes, we print out x.

Let's consider one statement at a time and the constraints each statement generates. The first statement generates tainted is less than or equal to alpha, just like the example we saw a moment ago. Next we consider the conditional. If we take the true branch, then assigning name to x causes us to generate the constraint that alpha is less than or equal to beta. If instead we take the false branch, we take an untainted value, the constant "hello", and assign it to beta. Finally, we print out the result; because x's qualifier is beta, beta flows into printf's argument, which is assumed untainted. Even setting aside the constraint from the false branch, these constraints are unsolvable, and therefore we still have a potentially illegal flow. This makes sense, because if the true branch were ever executed, we would in fact use something read from the network as a format string for printf.

Now let's drop the conditional. It's a very similar program, except that after we assign name to x, we immediately overwrite x with the constant "hello". If we go through the program, it turns out we will generate exactly the same set of constraints. In this case, however, those constraints mean something very different, because we are sure that the second assignment to x will overwrite the value assigned by the first, rather than being an alternative to it. The constraints will still complain that there is an error when in fact there is none, and therefore the analysis has produced a false alarm.

The problem here is that our analysis is flow insensitive: each variable has one qualifier, which abstracts the taintedness of all values it will ever contain. A flow-sensitive analysis, by contrast, would account for variables whose contents change: each assignment to a variable would effectively introduce a different qualifier for it. For example, x's qualifier at line 1 would be given the name alpha-1, while x's qualifier at line 2 would be given the name alpha-2, so the solutions for alpha-1 and alpha-2 can differ. We can implement such a flow-sensitive analysis by transforming the program so that each variable is assigned at most once; this is converting the program to what is called static single assignment form, or SSA. Here is our example again, reworked into SSA form. Notice that we have replaced the single variable x with two variables, x1 and x2, where each one has a distinct qualifier name, in this case beta and gamma. If we look at the constraints the program generates, the first one is the same as before.
But now the second constraint involves the qualifier variable for x1, and the third involves the qualifier variable for x2; these are different qualifier names. As such, there is no alarm: we can set gamma to be untainted, and alpha and beta to be tainted, and these satisfy the constraints. So no problem is reported.

Now let's consider a variation of our example with multiple conditionals. First, if x is true, we assign to y the constant string "hello". "hello" is untainted and y's qualifier variable is alpha, so we produce the constraint that untainted is less than or equal to alpha. In the else branch, we assign to y the tainted output of fgets, so we produce the constraint that tainted is less than or equal to alpha. On the next line, if x is true, we call printf with y. printf expects an untainted argument, and y, whose qualifier variable is alpha, flows into it, so we produce the constraint that alpha is less than or equal to untainted. These constraints have no solution, but the lack of a solution is in fact a false alarm. Why? Because the two constraints that create the problem cannot arise in the same execution. If x is false, we execute the else branch, which produces the tainted constraint; but if x is false, we do not execute the second conditional's body, the one that calls printf and produces the other constraint. Conversely, if x is true, we never generate the constraint that tainted is less than or equal to alpha, and we would not have created the problem that made a solution impossible.

The issue is that the constraints producing the problem do not correspond to a feasible path. An analysis that is path sensitive considers path feasibility when determining whether or not there is a potential problem in the program. To illustrate this, we've labelled some of the statements in the little code snippet on the right. When x is not 0, the program follows the path 1-2-4-5-6. When x is 0, it follows the path 1-3-4-6. The path that leads to the problem we saw a moment ago is 1-3-4-5-6, but this path is infeasible. A path-sensitive analysis will rule out such a path by qualifying each constraint with a path condition. So instead of just generating the constraint that untainted is less than or equal to alpha, it will include a condition saying this constraint is only valid when x is not equal to 0. Likewise, the constraint that tainted is less than or equal to alpha is only valid when x equals 0. When determining whether the constraints have a solution, the path conditions are taken into account. Either the first and third constraints are considered together, in which case alpha has a solution, untainted; or only the second constraint is considered, and again alpha has a solution, though a different one. Either way, there is no alarm.

If flow and path sensitivity are so useful, is there any reason not to use them?
It would seem that the added precision is important for avoiding false alarms that would otherwise arise from conflating flows or paths. But both sensitivities make the flow analysis problem harder to solve. Flow sensitivity increases the number of variables the analysis considers, and thus the number of nodes in the constraint graph we must solve. Path sensitivity requires the use of predicates to distinguish different paths, which means we must employ more general and expensive solving procedures to make use of them. In short, the added precision decreases performance, which ultimately hurts scalability, limiting the size of programs we can analyze. This trade-off is sometimes managed by employing a demand-driven analysis, which adds precision only as needed and where it is needed. We'll discuss this idea a bit more later on.
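Before moving on, here is a concrete sketch of the flow-sensitivity discussion: the straight-line example before and after conversion to SSA form, with the constraint generated by each statement written as a comment. The qualifier names are the ones used above, and the qualifier annotations on printf and fgets are assumed as in the earlier example.

    #include <stdio.h>

    void flow_insensitive(FILE *net) {               /* one qualifier beta for x: false alarm */
        char buf[100];
        char *name = fgets(buf, sizeof(buf), net);   /* tainted   <= alpha                    */
        char *x    = name;                           /* alpha     <= beta                     */
        x          = "hello";                        /* untainted <= beta                     */
        printf(x);                                   /* beta <= untainted  -- unsolvable      */
    }

    void flow_sensitive_ssa(FILE *net) {             /* SSA: each assignment gets its own
                                                        variable and its own qualifier        */
        char buf[100];
        char *name = fgets(buf, sizeof(buf), net);   /* tainted   <= alpha                    */
        char *x1   = name;                           /* alpha     <= beta                     */
        char *x2   = "hello";                        /* untainted <= gamma                    */
        printf(x2);                                  /* gamma <= untainted -- solvable,
                                                        so no alarm                           */
    }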

Context Sensitive Analysis

Up until now, we have focused on tracking tainted flows through blocks of code, that is, ordinary statements. Now we'll consider how to handle tainted flows through function calls. This example shows a function id that we want to analyze. The id function itself is rather simple: it takes the argument x and simply returns it back to the caller. On the left-hand side we see a client program that takes the result of fgets, passes it to id, and stores the result into the variable b. We want to see whether or not there is a tainted flow in this program, so we need to track the flow into the id function and back out again.

We do this in several steps. First, we give flow-variable names to the argument and return value of the function; here we pick gamma and delta. Next, we create flow constraints between the actual argument at the call site, in this case the variable a, and the formal parameter x that receives it in the callee. In addition, after the function computes whatever result it computes, we take what is returned, in this case x, and create a flow between it and wherever that result is used in the caller, in this case the variable b.

So, let's put it together. fgets returns something into a; remember that fgets has a tainted return value, so tainted is less than or equal to alpha. Then we call the identity function, passing a in; therefore we create a flow between the actual argument a and the formal parameter x in the function id, that is, a flow constraint between alpha and gamma. The body of the function simply returns the variable x; x has the name gamma, and because we are returning it, we create a flow to the name of the return value, which, as we said before, is delta. Finally, what is actually returned from this particular function call needs to flow into the variable we assign it to, so we create a flow between delta and beta, b's name.

Now let's extend the example with two extra statements. We define a new variable c, to which we assign the untainted constant "hi", and we pass c as the format string to printf. These new statements don't interact with the first two as far as the program's actual behavior is concerned, but they set up a more interesting example that we'll see in just a moment. We get the same constraints we saw a minute ago for the first two lines. Then we create the constraint that untainted flows to omega, and finally the constraint that omega flows to untainted, because printf expects an untainted argument and omega is the name of c, the actual argument we are passing in. There is a good solution here, and no alarm.

Now consider a small change to the example. Instead of assigning "hi" to the variable c directly, we pass it into id, and the returned result eventually makes its way to c. If we were to run this program and the prior program, they would have exactly the same outcome. But as we shall see, the analysis will treat them a bit differently.
So, we generate the first set of constraints once again. Now we have to consider the function call into id. We are passing an untainted value into the argument of id, so untainted is less than or equal to gamma; likewise, we take the return value of id, delta, and assign it to c, whose name is omega, so delta is less than or equal to omega. Then we call printf, which causes omega to flow into untainted, the expectation of printf's format string.

Now let's look at solving these constraints. Unfortunately, we get a false alarm: there is no solution, and yet, as we said a moment ago, this program has exactly the same semantics as the one we saw before. So what happened? The analysis is being imprecise. There is a call into which fgets's result, which is tainted, is passed as an argument, and another call into which we pass the untainted value "hi", and the analysis conflates the two calls: in effect it pretends that the first call can return its result to the second call site. This is the problem of context insensitivity; these call sites are conflated in the constraint graph.

A context-sensitive analysis solves this problem by distinguishing call sites in some way, so that we don't allow a call at one place to return its value to another call. We can do this by giving each call site a label, call it i, perhaps correlated with the line number in the program at which the call occurs, and then, in the graph, matching up calls with corresponding returns only when the labels on flow edges match. We also add what we call polarities to these edges, to distinguish calls from returns: minus i for argument passing, and plus i for return values.

Let's see the example again, this time in a context-sensitive setting. Notice that we have labeled the two calls to id with 1 and 2; these are the indexes for those calls, which we can think of as happening on lines 1 and 2 of the program. We have the same first constraint that we had originally. But now, when we call the function id, we index that constraint edge with the label 1 and give it negative polarity, because it is the argument being passed down. The constraint for the body of the function is the same as before, and the return from call site 1 is indexed with the same label but has polarity plus, to indicate that it is a return. Here is where the new part happens: the constraints for the second call and its return value carry the label 2. And then we have the final constraint, omega is less than or equal to untainted. Given these constraints, the ones that were part of the false alarm before no longer admit the bad flow, because the indexes don't match: the flow would have to enter the function at call site 1 but return at call site 2. Because that flow is infeasible, the analysis, thanks to the different labels, raises no alarm.

Just like flow and path sensitivity, context sensitivity is a trade-off, favoring precision over scalability.
In particular, the context-insensitive algorithm takes roughly time O(n), where n is the size of the program, while the context-sensitive algorithm takes time O(n^3). On the other hand, sometimes the added precision actually helps performance: by eliminating infeasible paths, it can reduce the size of the constraint graph by a constant factor, and this is true for flow and path sensitivity too. But the general trend is that greater precision means lower scalability.

One way to push the trade-off more toward performance is to be selective about where context sensitivity is used. Rather than using it at all call sites, we could use it only at some of them. The remaining call sites are conflated, as in an insensitive analysis; that is, the analysis coarsely models calls from one site as if they could return to another. Rather than having one blob of insensitive sites, we could also imagine groups of sites that are sensitive with respect to the other groups, but insensitive with respect to their own members. Doing this amounts to giving all sites in the same group the same index. Finally, it is also possible to limit the depth of sensitivity, for example to only match nested calls up to a certain limit.
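Here is a sketch of the context-sensitivity example in code, with the qualifier names, call-site labels, and edge polarities from the discussion written as comments. The function and variable names follow the description above; the tainted/untainted annotations on printf and fgets are assumed as in the earlier examples.

    #include <stdio.h>

    char *id(char *x) {                          /* q(x) = gamma, q(return) = delta: gamma <= delta */
        return x;
    }

    void client(FILE *net) {
        char buf[100];
        char *a = fgets(buf, sizeof(buf), net);  /* tainted <= alpha                                */
        char *b = id(a);                         /* call site 1: alpha --(-1)--> gamma,
                                                                 delta --(+1)--> beta               */
        char *c = id("hi");                      /* call site 2: untainted --(-2)--> gamma,
                                                                 delta --(+2)--> omega              */
        printf(c);                               /* omega <= untainted                              */
    }
    /* Context-insensitively, taint can flow alpha -> gamma -> delta -> omega: a false
       alarm.  Matching the labels rules that path out, because it would have to enter
       id at call site 1 but leave at call site 2. */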

Flow Analysis- Scaling it up to a Complete Language and Problem Set

Okay, so let's finish up this discussion of our tainted flow analysis by seeing how to scale it up to a more complete language and problem set. So far we have considered scalar values, or pointers treated as opaque blocks, but we haven't considered dereferences through pointers and how those affect tainting. In this example, we have character pointers a and b, which are assigned strings as normal, but we also have p and q, which are pointers to pointers; that is, p is a pointer to a string, as opposed to a string itself. The example assigns "hi" to a; p is made to point to the variable a; q aliases p, that is, it also points to a; then via q we assign b, which is a tainted value read from fgets; and then we print through p, which aliases q. Because p and q both point to the same memory, assigning b through q effectively makes what p points to tainted. But as we are about to see, without care our analysis will not realize this.

On the first line, we have the assignment as usual. On the second line, we take the address of a, so now we are concerned with beta, the taintedness of the contents that p points to, rather than of p itself; we create that flow constraint. Likewise we create the next flow constraint, and another one that we generate as normal. Finally, we do the assignment through q. Now, *q refers to the character pointer that q points to, whose label is gamma, and b refers to what is returned by fgets, whose label is omega; so omega flows into gamma through this assignment. Then we print via *p. What p points to, that character string, has beta as its label, and so beta flows to untainted.

Here's the problem: a solution exists, even though the assignment through q affects the contents that p points to. This is an error that our analysis fails to uncover. What do we need to do to fix this? We need a way of identifying when a flow can happen through an assignment via an alias. We do this by creating, whenever we assign one pointer to another, edges that go both ways between the flow labels of the things they point to. So here, when we assign p to q, not only does beta flow to gamma, we also add the constraint that gamma flows to beta. (We should really do the same thing for the assignment on the second line, an edge from beta to alpha, but it is omitted here to keep the slide from becoming cluttered.) With this extra constraint, a solution no longer exists: tainted can now flow to untainted, and so we have caught the error.

So again, the idea is that an assignment via a pointer flows both ways: it anticipates that, through an alias, we could later assign a value that would create a flow, and the backward edge allows that assignment to be captured. Unfortunately, this can lead to false alarms, so we have a couple of choices for reducing them.
One is, if we know that an assignment through an alias can never happen, for example because the alias is labeled const, then the backward flow is not needed and we do not include the edge. Or we could just drop the backward flow edge anyway, anticipating that an assignment will not happen, whether or not that is guaranteed by the type system. That is more problematic: effectively, it trades a false alarm for a possible missed error. Different analyses make different choices on this question.

Another sort of hidden flow is what's called an implicit flow, illustrated by an example we'll develop now. Here is a copy function that takes a tainted character pointer src and copies it to an untainted destination character pointer dst. We can see that this is illegal, because the tainted characters in src should not be able to flow to the untainted expectations of the characters in dst; this flow is not allowed, and the analysis should flag it. Now look at this program, which, as it turns out, implements exactly the same function, though in a very inefficient manner. For each character position i considered in the source array, it iterates j through all possible values that character could have. If the value matches the one in the source, that is, if src[i] equals j, then we simply assign j to dst. This is legal according to the analysis, because j is untainted. But we have missed a flow: even though the character was not directly assigned from src, the same value of that character was, so the information contained in src is certainly copied to dst, and therefore the information is leaked.

We can perform what we call information flow analysis to discover when such implicit flows happen: the data did not flow, but the information did. We do this by maintaining a second label that we call the program counter (pc) label. It represents the maximum taint, in the lattice, affecting the current position of the program counter in the program. Assignments generate constraints involving the pc. So x = y produces two constraints: one between the labels of y and x as usual, and another saying that the pc label must be bounded above by the label of x, so that information does not leak through the assignment to x. This allows us to track information flows like the one we just saw.

Here is an example with an implicit flow in it. It resembles the prior one, but it is simpler; it doesn't have the loop and so on. We have a tainted source and a destination whose taintedness we will infer. We branch on the source, which is tainted, and then, based on whether the source is zero or not, we assign to dst: if the source is zero, then dst will contain the same value as the source, otherwise it will contain one. Information is clearly flowing from src to dst, though data is not, because information about src can be recovered by looking at the value of dst after the program runs. So let's see what constraints we would generate.
At each of the assignments dst = 0 and dst = 1, we just have the untainted constant flowing into alpha as normal, and likewise for the 0 in the final dst += 0. Now let's consider the constraints due to the pc label. We have labeled each line in the program with its relevant pc: the first line that can be executed is the guard, pc1; then either the then branch or the else branch, pc2 and pc3; and finally the last assignment, pc4.

First consider the extra constraints that the assignment statements generate because of the pc. In the then branch, we create the constraint that pc2 is less than or equal to alpha, because alpha is the variable being assigned and the program counter label at that assignment is pc2. Likewise, for the else case we have pc3 is less than or equal to alpha, and for the final assignment, pc4 is less than or equal to alpha.

So what are the program counter labels at these different locations? pc1 is the first line executed in the program and has not been influenced by any tainted data, so it starts as untainted. However, pc2 and pc3 are the result of branching on src: those lines can only be reached because we have branched on the src variable, and src is tainted, so we must bump the program counter label to tainted in both cases. Once we have exited the conditional and are back on the line with pc4, the program counter label can be untainted again, because no matter which direction we took when we branched on src, we will always reach pc4; the value of the guard src has no influence on it. Looking at all of these constraints, we can see that any solution requires alpha to be tainted. So the analysis discovers the implicit flow: the tainted source taints the destination.

So, should we always track information flow constraints? Tracking them with a pc label can sometimes lead to false alarms. Consider a program in which the dst value is always going to be 0, no matter what the value of src is, so there is no information leak; yet the assignments to dst will be considered tainted according to the pc label, and a false alarm results. Extra constraints due to the pc label can also hurt performance. As it turns out, the copying example we saw before is pathological: developers typically don't write programs like this, and generally implicit flows have little overall influence in well-written programs. So, as an engineering matter, implicit flows are often not tracked in industrial analyses. One caveat: if you are analyzing untrusted programs, for example mobile code downloaded by your browser or some other system you run locally, there is a greater chance that an adversary will maliciously exploit this weakness in your analysis to taint information, or perhaps leak it. That situation aside, however, tainting analyses tend to ignore implicit flows.
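Here is a minimal sketch of the implicit-flow example just discussed, with the program-counter labels and generated constraints written as comments. The tainted marker is the pseudo-annotation used throughout this unit (shown as a comment so the snippet stays plain C), and the shape of the final statement follows the description above.

    void branch_copy(int src /* tainted */) {
        int dst;                        /* q(dst) = alpha, to be inferred                 */
        /* pc1 */ if (src == 0)         /* branching on tainted data raises the pc label  */
        /* pc2 */     dst = 0;          /* untainted <= alpha,  pc2 <= alpha              */
        /* pc3 */ else dst = 1;         /* untainted <= alpha,  pc3 <= alpha              */
        /* pc4 */ dst += 0;             /* untainted <= alpha,  pc4 <= alpha              */
    }
    /* pc1 and pc4 are untainted; pc2 and pc3 are tainted because those lines are
       reached only by branching on src.  The constraint pc2 <= alpha therefore forces
       alpha = tainted: the analysis reports that information about src reaches dst,
       even though src's data is never assigned to dst directly. */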

Challenges and Variations

We have considered how to analyze most of the key elements of a language, but not all of them; a robust tool obviously has to handle them all.

First, while assignments obviously transfer taint from the source to the target, what should happen if the source is an expression rather than a variable? We must define the taint of operators. Usually, if an argument to an operator is tainted, then the expression containing that operator is, too.

So far we have assumed we know which function a call will invoke. This is important for linking up the flow constraints at the caller with those in the callee. But what if we make a call through a function pointer? The pointer is not assigned until runtime, so a static, compile-time analysis cannot always be sure of its target. As it turns out, we can use a kind of flow analysis to determine the set of possible targets a function pointer could contain; the analysis follows the same principles as the flow analysis we've already seen. Then, when analyzing a call site that uses a function pointer, we add constraints as if all possible targets were called, rather than a single target. An even coarser analysis of possible targets, for example one based on a function's type, could also be used, rather than one that more precisely determines the possible values of the function pointer.

C structs are records with several fields. The most precise analysis tracks the taintedness of each field of each struct instance separately, as if they were separate variables, but such precision can be expensive. Alternatively, an analysis can track the taintedness of the same field across all instances of a struct, as if that field were a kind of global variable; this hurts precision but is often a big boost to performance. An in-between position is to analyze only the taintedness of the struct as a whole, or to do so for only some of its fields. Note that these decisions are also relevant to object-oriented languages: you can view objects as structs containing function pointers, and so the trade-offs we've just considered apply to the analysis of object-oriented languages as well.

Another way to aggregate data in a language is to store it in an array, and there are similar trade-offs here: you can track the taintedness of each element, or you can track it for the array as a whole. The challenge is that while struct field names are constants, array indexes are computed by arithmetic; as such, it is sometimes hard to track them precisely at compile time, because we won't know what the arithmetic produces until we run the program. And so a hybrid approach is inevitably employed.

Up until now we have focused on how the analysis works. A key part of using it to perform security reviews is to properly label the tainted sources and untainted sinks, and there are many options here. We could label array indexes, format strings, and SQL strings as untainted sinks. We can also label functions that act as sanitizers or validators: library routines that take in tainted data and return untainted data. For example, a function that escapes an HTML string would be a sanitizer. Labeling sanitizers is needed to avoid false alarms.
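For instance, a sanitizer can be modeled simply by giving it a prototype whose parameter is tainted and whose result is untainted. The function names below (html_escape, emit_html) are made up for illustration; only the annotated prototypes, shown in comments, matter to the analysis.

    #include <stdio.h>

    /* Hypothetical sanitizer: takes attacker-controlled text and returns text that is
       safe to embed in HTML.  Annotated view:
           untainted char *html_escape(tainted const char *s);                          */
    char *html_escape(const char *s);

    /* Hypothetical sink whose argument must be untainted:
           void emit_html(untainted const char *s);                                     */
    void emit_html(const char *s);

    void handler(FILE *net) {
        char buf[512];
        char *input = fgets(buf, sizeof(buf), net);  /* tainted                               */
        emit_html(html_escape(input));               /* ok: the sanitizer breaks the flow     */
        emit_html(input);                            /* flagged: tainted data reaches an
                                                        untainted sink                        */
    }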
Finally, flow analysis need not apply only to finding improper use of tainted data; it can also be used to discover the improper disclosure of sensitive information. In particular, we can imagine labeling sensitive data as secret and untrusted output channels as public, and our goal is to prevent secret sources from reaching public sinks. In essence this is the dual of the tainting problem, where the only difference is that the lattice and the flow relationships are inverted.

We should also point out that flow analysis is not the only kind of static analysis; in fact there are many other kinds. One common analysis, often used to assist other analysis problems, is called pointer analysis, with a common flavor of it called points-to analysis. This analysis determines whether two pointers are possibly aliases, that is, whether they could both point to the same memory. Earlier we saw that knowing this is important for not missing bugs. In fact, our tainted flow analysis involved a very coarse-grained form of pointer analysis within it, but a more sophisticated analysis would improve precision. Significant advances have aided both the precision and scalability of pointer analysis in recent years.

Another kind of analysis is called data flow analysis, developed in the 1970s as part of research on optimizing compilers. Like the flow analysis we've already considered, data flow analysis focuses on the flow of values through variables in a program, but it maintains information about this flow a bit differently. One common data flow analysis is called liveness analysis. It determines which variables at a given program point are still live, that is, whether their values could still be read in the future; other variables are dead, meaning that their current values will not be read again. Liveness analysis is an important part of allocating variables to machine registers during compilation: registers storing dead variables can be reused. Data flow analysis techniques are also useful for security questions, and many industrial tools use them; for example, they can also answer the tainted flow question.

Finally, another significant kind of analysis is called abstract interpretation. It was invented in part to provide a theoretical explanation of data flow analysis, but it has grown into a style of analysis in its own right. The key idea of abstract interpretation is that a static analysis is an abstraction of all possible concrete runs of a program. To be practical, the abstraction must discard information, for example, which calls correspond to which returns. The key is to discard as much information as possible, for purposes of scalability, while still being able to prove the property of interest. Abstract interpretation is at the core of several industrial-strength tools.

Static analysis is gaining traction in practice, with a variety of commercial products from a variety of companies as well as open-source tools. If you'd like to learn more about static analysis, I'd recommend the book Secure Programming with Static Analysis by Brian Chess. It goes into more depth about how static analysis tools work, elaborating on the description in this unit.
It also talks more about how static analysis fits into a secure development process. For more of the mathematical details about how program analysis works, I recommend the book by Nielson, Nielson, and Hankin (Principles of Program Analysis). It's a bit dense for the casual reader, but it's a good introduction to the academic field of static analysis.

Introducing Symbolic Execution

Software has bugs; this seems unavoidable. The most common ways of finding bugs are testing and code reviews, but as all users know, software nevertheless ships with bugs still unfixed. In fact, many of these bugs are known to the developers but were not deemed important enough to fix prior to shipping. Many other bugs are simply not known. Why are they not found? It could be that the bug is in a rarely used feature, for example in a device driver for a rare piece of hardware. Or the bug arises in rare circumstances, for example when memory is nearly exhausted. Or the bug arises nondeterministically: the circumstances that make it manifest are unpredictable, which makes the bug hard to reproduce and hard to fix.

One way to try to find the bugs that testing misses is to use static analysis. This makes sense conceptually, because static analysis can consider all possible runs of the program, perhaps even those that are rare, nondeterministic, and so on. We can feel heartened by the explosion of interesting ideas and tools being explored in research, and by the fact that companies are now selling and using static analysis tools, so there is great potential here to improve software quality. The question, though, is: can static analysis find deep, difficult bugs? In principle, yes; in practice, perhaps not so often. Why is this? Because commercial viability means dealing with developer confusion, false positives, and error management: quite a lot of work that has nothing to do with the core idea of the analysis itself, but rather with the way humans use the analysis. One particular thing companies do is keep the false positive rate down, and they do this by purposefully missing bugs to keep the analysis simpler. This turns out to be a good idea, because studies show that developers get tired of dealing with alarms that don't turn out to be true bugs; they are not interested in sifting through, say, nine false alarms, even if on the tenth they find a really important bug. So companies respond to this and make tools that will miss bugs.

One of the issues that makes false alarms hard to deal with is abstraction. Abstraction is critical for static analysis because it allows us to simplify our view of what the program does so that we can, efficiently enough, model all possible runs. But to do this we have to introduce conservatism. The various sensitivities, flow, context, and path sensitivity, add precision and so reduce conservatism, but these more precise abstractions are more expensive, which means they won't be applicable to larger code bases, at least not at short time scales, and they still will not completely eliminate the false alarm problem. A related problem is that the static analysis technique may use an abstraction that is not familiar to the developer: the developer cannot make sense of an alarm produced in a way they can't relate to, and therefore can't easily triage it to determine whether the bug is real or not.
So for the remainder of this unit, I'm going to talk about a technique called symbolic execution, which is a kind of middle ground between testing and static analysis, and which aims to retain the best features of both. We start with the observation that testing works: reported bugs are real bugs. The problem is that each test only explores one possible execution of the program. If we write assert(f(3) == 5), we have checked that f(3) equals 5, but not that f applied to other input values also equals whatever it is supposed to equal. In short, tests are complete, but they are not sound; and while we hope they generalize, they provide no guarantees. Symbolic execution aims to generalize testing; in a sense it is more sound than testing. We can write tests like this: we can say that y is equal to alpha, which represents an unknown, or symbolic, value, and then assert that f(y) is equal to 2*y - 1. This is basically saying: for all y, f(y) equals 2y - 1. Symbolic execution can execute this program and confirm that this is indeed the case, thus performing a more general sort of test. The way this is possible is that the symbolic executor runs the program up to the point that it depends on an unknown, and then conceptually forks to consider all possible values of that unknown that would affect the control flow of the program. So here we see the function f, and we see that it depends on x, its argument, at the first branch. If x is symbolic, as in our assert, then the symbolic executor will consider both branches: the case when x is greater than 0 and the case when it is not. And in this example, when it does that, it will discover that the assertion does not hold for non-positive values of x. Let's look at another example to get a flavor of how symbolic execution works; we'll go into much greater detail later on. Here on the left we have a program. In the program we have assigned to three of the variables, a, b, and c, the symbolic values alpha, beta, and gamma. The rest of the program is a series of conditionals and assignments. At the end, we see an assertion that we would like to hold for all possible runs of the program. The symbolic executor starts by executing line 3, initializing x, y, and z to 0. Then it reaches the conditional on line 4. This conditional depends on a symbolic value, alpha, so the executor considers whether both branches are possible, and of course they are. First it considers the true branch, that is, when alpha is non-zero. In that case, it assigns x the value negative 2. Then it carries on to line 7. Here the guard depends on a symbolic value again, so it checks whether beta can be less than 5; again, it could be or it might not be, so the symbolic executor picks one direction. Suppose it picks true. Now we see that we will skip line 8, and the reason is that along this path we have assumed that a is non-zero, which means the conditional on line 8 cannot hold. After assigning z the value 2, we reach the assertion, and we see that the assertion succeeds.
And that's because x is -2, z is 2, and y is 0; the sum is 0, and of course 0 is not equal to 3. Now, we can characterize this particular path of the program's execution using what's called a path condition: along this path we assumed that alpha was true, i.e. non-zero, and that beta was less than 5. We also want to consider the other paths, by revisiting those branches and going the false direction where possible. For example, we could take the false side of the branch on line 7. In that case we'd skip lines 8 and 9 and reach the assertion again. The assertion still holds, because x is -2, the other two variables are 0, and therefore the resulting sum is not 3. Next we consider the other branch of the first conditional; on this branch we now assume that alpha is false. Then we check whether beta can be less than 5. Again, we have no constraints on the symbolic value beta, so we could go both ways. Suppose we go the false direction this time, thereby skipping lines 8 and 9. Here the path condition is not-alpha (that is, we're assuming alpha is false) and beta is greater than or equal to 5. Now we go the other direction, assuming alpha is false and beta is less than 5, and we reach the conditional on line 8, whose outcome now depends on gamma. Can this condition hold? Sure; it depends on what c is, and we have not constrained gamma. If we go in the false direction, we assign z the value 2 but skip the assignment to y. If instead we go in the true direction, then y equals 1 and z equals 2, and now the sum of y being 1, z being 2, and x being 0 is 3, and that violates our assertion. Importantly, the assertion violation comes with the path condition not-alpha and beta less than 5 and gamma, and this precisely states all of the inputs that will cause this test case to fail. If we use a satisfiability solver to give us a solution to this path condition, we can produce a test case that demonstrates the failure. This is very useful, because the developer can then run the test case and use it to diagnose the reason for the failure in the program. Notice that each path through the tree we just constructed represents a set of possible executions; more precisely, it represents all of those executions whose concrete inputs are solutions to the path condition. This set is a very precise notion of code coverage. Viewing symbolic execution as a kind of testing, complete coverage of the program would be all of its paths, and symbolic execution allows us to systematically consider many of these paths. Viewed as a kind of static analysis, symbolic execution is complete, in that whenever a symbolic executor claims to have found a bug, the claim is true. However, it is rarely sound, because it is unlikely to cover every path. The reason is that most programs have an enormous (often unbounded) number of paths, and so symbolic execution will not terminate except by imposing a time limit or other resource limit. Symbolic execution's high precision, the source of its completeness, can be characterized as path, flow, and context sensitivity in static analysis parlance.
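To make the walk-through concrete, here is a minimal sketch that reconstructs the example program from the narration (the exact statements and line numbers are my reconstruction, not the course's source) and uses the z3-solver Python package, an assumption of this sketch, to solve the violating path condition:

    from z3 import Int, Solver, And, sat

    a, b, c = Int('alpha'), Int('beta'), Int('gamma')   # symbolic inputs

    # Reconstructed program (line numbers follow the narration):
    #  3:  x = y = z = 0
    #  4:  if (a != 0):              x = -2
    #  7:  if (b < 5):
    #  8:      if (a == 0 and c != 0):  y = 1
    #  9:      z = 2
    #      assert x + y + z != 3

    # Path condition of the violating path: not alpha, beta < 5, gamma
    pc = And(a == 0, b < 5, c != 0)

    s = Solver()
    s.add(pc)
    if s.check() == sat:
        m = s.model()
        print('failing inputs:', m[a], m[b], m[c])   # one concrete assignment, e.g. a = 0

Feeding such a solution back into the program drives execution down the violating path, where x = 0, y = 1, z = 2, and the assertion x + y + z != 3 fails, exactly as described above.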

Symbolic Execution- A Little History

Symbolic execution is an old idea. It was first proposed in the mid 1970s; the most cited early work on the subject is James King's PhD thesis, a synopsis of which was presented in the Communications of the ACM in 1976. So if symbolic execution is such an old idea, and it seems like such a good one, why didn't it take off? Why haven't we been using it for a long time? One reason is that symbolic execution can be compute-intensive. For big programs we may consider many, many possible paths, and along each of those paths we need to query an SMT solver to decide which paths are feasible and in which assertions could be false. Such queries can involve much of the program state, which could be very large. At the time symbolic execution was invented, computers were relatively slow (not much processing power) and small (not much memory); today's Apple iPads are as fast as Cray-2s from the 1980s. Today computers are much faster and much bigger, with both more memory and more computing power. We also have better algorithms: SMT solvers are much more powerful today because of smarter optimizations that people have implemented over time. They can solve very large queries very quickly, and that lets us check assertions and prune infeasible paths. Let's take a minute to look at both the improved hardware and the improved algorithms. Here is a quick chart showing trends of hardware improvement. Along the x axis is the year, and along the y axis is the number of operations the computers can perform. Because the y axis is a logarithmic scale, the apparently linear growth is actually an exponential rise in computing power. On the algorithm side, we can see that the time to solve a particular instance has been improving even while keeping the hardware steady. Here the y axis is the time it takes to solve a particular problem, while the x axis lists the year of the algorithm being considered; all of these algorithms run on the same computer, that is, on 2011 hardware. We can see that the running time has been decreasing as the algorithms have improved over time. And though small instances have not improved so much, since around 2008 big problems have started to be solved faster. Symbolic execution has experienced a resurgence of interest in the last ten years or so, starting in about 2005. The main motivation was to show that symbolic execution could find very interesting bugs that were often missed by normal testing.

Basic Symbolic Execution

Now we'll look in greater detail at how symbolic execution works. We'll start with a typical programming language. Here we show a grammar of expressions, denoted by the letter e. Expressions consist of things like integers n, variables x, expressions involving arithmetic operators like addition, and expressions involving comparisons, like disequality. To support symbolic execution, expressions also include symbolic variables, which we denote with Greek letters like alpha. Symbolic variables represent inputs to the program. These inputs could come from many sources, including reads from files or network sockets, or memory-mapped files or devices. Whereas normal tests must provide specific inputs, symbolic execution models inputs using symbolic variables, and thus can explore how different executions are induced by different values of these variables. These values are discovered by solving the constraints generated during execution, as we'll see in more detail shortly. >> Okay, now that we've defined our symbolic language, we need to define its semantics. One way to do that is to make or modify a language interpreter that can compute symbolically. Normally a program's variables contain values, but for our symbolic interpreter they can also contain symbolic expressions, which are program expressions that mention symbolic variables. Example normal values are the integer 5 and the string "hello"; example symbolic expressions are alpha + 5, the string "hello" concatenated with some symbolic string alpha, the array index expression a[alpha + beta + 2], and so on. Let's see how we use these symbolic expressions in an example execution. Here we have a simple straight-line program. On the left-hand side we consider concrete memory, looking at how a concrete, that is, normal, execution of the program might go; then we look at how it might go for our symbolic interpreter. In the first line we perform the read; let's say, for argument's sake, it reads the value 5. Next we assign to y. What do we assign it? Well, 5 plus the contents of x, which is 5, so 10. Next we execute line three; here we store 17, because 7 plus 10 is 17. Finally we use z, which is 17, to index the array a. Notice that we're assuming a contains four values, so a[17] is off the end of the array; that is, it's an out-of-bounds access. Now let's consider the symbolic execution of this program. Instead of returning a concrete value for x, the read returns a symbolic variable alpha, which we store there instead. For y we can't store a plain value anymore, because 5 plus alpha can't be further reduced, so instead y is assigned the symbolic expression 5 + alpha. Next we assign to z: 7 plus (5 + alpha) is 12 + alpha, so we store that. And then, finally, we use the symbolic expression 12 + alpha to index the array a, and this could be an overrun; we just saw an example of one in the concrete execution. That is, there is a solution for the expression 12 + alpha that creates an index outside of a's bounds. We'll explain how we can find that in just a moment; the sketch below previews the idea.
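As a preview, here is a minimal sketch of my own (assuming the z3-solver Python package; the lecture does not prescribe a particular solver) that asks a solver whether the symbolic index 12 + alpha can fall outside the four-element array a:

    from z3 import Int, Solver, Or, sat

    alpha = Int('alpha')        # the symbolic value read into x
    z = 7 + (5 + alpha)         # the symbolic expression stored in z, i.e. 12 + alpha

    s = Solver()
    s.add(Or(z < 0, z >= 4))    # out-of-bounds condition for the 4-element array a
    if s.check() == sat:
        print('possible overrun, e.g. alpha =', s.model()[alpha])

Any solution (for instance alpha = 5, which makes the index 17, as in the concrete run) is a concrete input that triggers the out-of-bounds access.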
Now, as we saw with our overview example, program control can be affected by symbolic values. If you branch on a symbolic expression, then depending on the solutions for the variables in that expression, you could go the true direction, the false direction, or both. In this example we read into x and then branch on it, so that's a branch on a symbolic expression. We represent the influence of symbolic values on the current path, that is, the list of the branches we have taken, true or false, as a path condition pi. As an example, line three is reached when the path condition is alpha > 5. Line five is reached when alpha > 5 and, because it's nested, alpha < 10. And line six is reached when alpha <= 5, that is, when we've taken the false branch of the check whether alpha (the contents of x) was greater than 5. Whether a path is feasible is tantamount to its path condition being satisfiable. Here we list three path conditions for this program; it's a slight modification of the program we just saw, with a branch on line four testing whether x is less than 3. We can see that the path condition alpha > 5 and alpha < 3 is not possible, and indeed this path condition is not satisfiable: there is no alpha such that both constraints hold. Given a path condition, we can compute a solution to the constraints in it, and that solution can be used as the input to a concrete test case that will execute exactly that path. So a solution to reach line three could be alpha = 6, and a solution to reach line six could be alpha = 2. Now, assertions, like array bounds checks, are just conditionals. If we look at our original program, we can think of inserting, just prior to the indexing of the array a, two checks to see whether the index expression z is within the bounds of the array: is z less than 0, and is z greater than or equal to 4? If the answer to either is yes, we've gone out of bounds and we should abort the program. So we can think of our symbolic executor as inserting these extra statements and then performing its normal process of generating a path condition. Along the first four lines, pi is true, because there are no branches. To reach line five, it would have to be the case that 12 + alpha, the symbolic expression stored in z, is less than 0. If we get past this check, that is, we take the else branch to reach line six, then the path condition contains the negation of that constraint. To reach line seven, we would also need z greater than or equal to 4, that is, 12 + alpha >= 4, along with that negation; and if that check is also false, we negate it as well, giving the path condition to reach line eight. So if either line five or line seven is reachable, that is, if the path conditions reaching them are satisfiable, we have found an out-of-bounds array access. >> Whereas a normal execution must take either the true or the false branch of a conditional, a symbolic execution can actually take both branches.
This is possible because, when a symbolic variable or variables appear in a conditional's guard expression, there may be some solutions that make the guard true and some that make it false. As such, a symbolic executor could choose to go either way, or both ways. How should it choose which, and in what order? One point to make is that the executor doesn't immediately know whether the true branch, the false branch, or both are feasible when it reaches a conditional. To know for sure, it can invoke the solver to check the satisfiability of the path condition: if the path condition conjoined with the guard is feasible, the true branch can be explored; likewise, if the path condition conjoined with the negation of the guard is feasible, the false branch can be explored. However, making calls to the solver at each conditional can be expensive. So one idea is to just pick one branch and go that way, delaying the feasibility check, at the risk of doing extra, unnecessary work. A related approach is to use concolic execution to determine the path to take. The idea here is to run the program concretely on actual inputs, but to treat some of those inputs as if they were symbolic on the side. The concrete inputs, which guarantee feasibility, determine which direction to go; the symbolic executor maintains symbolic versions of the corresponding expressions and the path condition as usual, which it then uses to generate subsequent tests. We'll say more about concolic execution a bit later. Okay, so now we can state a symbolic execution algorithm as pseudocode, and it's very simple (a small code sketch of this loop appears at the end of this section). First, we create the initial task, which is a triple: the program counter location, which is 0, the start of the program; the path condition pi, which starts out empty; and the symbolic state, which also starts empty. We add that task to a worklist. While the worklist is not empty, we pull some task from it and execute it. Eventually we reach a point where that execution forks. Say the state at that point has program counter pc0, path condition pi0, and symbolic state sigma0, and the reason for the fork is a branch on guard p at line pc0, where the true branch is at pc1 and the false branch is at pc2. In that case, we add the task (pc1, pi0 AND p, sigma0) to the worklist, as long as pi0 AND p is feasible; that is, when the current path condition conjoined with the guard (which contains symbolic variables) is feasible, we have a legitimate task that we can execute, so we add it to the worklist. We add the alternative, going down the false branch, in the case that pi0 AND NOT p is feasible. Notice that we could add either or both tasks to the worklist. We keep running until either the list becomes empty, in which case we have considered every path in the program, or eventually we get tired and terminate the search, say after some timeout. Now, in a practical implementation, we have to consider not just the main program we're executing, but also libraries and native code.
At some point, the symbolic executor will reach the edges of the application, and the interpreter won't be able to make further progress. This may happen with a library, or with system calls, or assembly code, or something like that. One thing we could do is pull that code in, too, and symbolically execute it. For example, if we're symbolically executing C code, we could symbolically execute the C code that implements the standard libraries. But real implementations of the standard libraries are often very complicated, and they may involve assembly code and the like, so a symbolic executor will often get stuck running around loops and not making any progress getting out of the library. Instead, you could use a simpler version of the library that is still semantically accurate, or you could create a model of the code you want to execute. For example, you could create a model of a RAM disk to stand in for the kernel's file system code, rather than trying to, say, symbolically execute the Linux kernel. Let's return to the idea of concolic execution I mentioned before; this is also sometimes called dynamic symbolic execution. The idea is that instead of running the program according to the algorithm we just saw, we run the program concretely, but we instrument it to do symbolic execution on the side. That is, the instrumentation maintains a shadow symbolic program state alongside the concrete one. The initial concrete state determines the path we take, so it could be randomly generated, but we keep track of the choices we made whenever a guard involves our shadow symbolic variables, and so we build up a path condition along the way. A concolic executor thus explores one complete path at a time, start to finish. Once it has done so, it can determine the next path to explore by simply negating some element of the last path condition and then solving that new path condition to produce concrete inputs for the next test. While we're symbolically executing using concolic execution, we always have the concrete underlying values to drive the path, and these turn out to be useful in other ways, too. In particular, concretization is very easy in a concolic execution. By concretization I mean replacing symbolic variables with concrete values that satisfy the path condition. Basically, we drop symbolic-ness, and potentially miss some paths, but we simplify our symbolic constraints by doing so. This allows us, for example, to perform system calls that would otherwise involve symbolic variables: we just pick reasonable concrete values for those variables and do the system calls. In doing so we lose symbolic-ness, but we're able to continue making progress in finding bugs. Replacing symbolic values with concrete ones can also be useful when SMT solver queries would be too complicated if we left all of the symbolic values in.
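Here is the promised sketch of the worklist algorithm. It is my own illustration rather than the course's implementation, it assumes the z3-solver Python package for feasibility checks, and the toy program mirrors the earlier example: read x, branch on x > 5, then on x < 3.

    from z3 import Int, Solver, And, Not, BoolVal, sat

    alpha = Int('alpha')

    # Instructions: ('assign', var, expr), ('branch', guard, pc_true, pc_false), ('halt', label).
    # Guards and expressions are functions of the symbolic state sigma.
    program = {
        0: ('assign', 'x', lambda sigma: alpha),                 # x = read()
        1: ('branch', lambda sigma: sigma['x'] > 5, 2, 4),
        2: ('branch', lambda sigma: sigma['x'] < 3, 3, 5),
        3: ('halt', 'x > 5 and x < 3'),                          # infeasible for every alpha
        4: ('halt', 'x <= 5'),
        5: ('halt', 'x > 5 and x >= 3'),
    }

    worklist = [(0, BoolVal(True), {})]   # tasks: (pc, path condition pi, symbolic state sigma)
    while worklist:
        pc, pi, sigma = worklist.pop()    # pop() makes this a stack (DFS); pop(0) would give BFS
        kind = program[pc][0]
        if kind == 'assign':
            _, var, expr = program[pc]
            sigma = dict(sigma, **{var: expr(sigma)})
            worklist.append((pc + 1, pi, sigma))
        elif kind == 'branch':
            _, guard, pc_true, pc_false = program[pc]
            g = guard(sigma)
            for cond, target in ((g, pc_true), (Not(g), pc_false)):
                s = Solver()
                s.add(And(pi, cond))
                if s.check() == sat:                  # feasible: add the new task
                    worklist.append((target, And(pi, cond), sigma))
                else:
                    print('pruned infeasible path to pc', target)
        else:
            _, label = program[pc]
            s = Solver()
            s.add(pi)
            s.check()
            print('reached "%s" with e.g. alpha =' % label, s.model()[alpha])

Note that making the worklist a stack versus a queue is exactly the depth-first versus breadth-first choice discussed in the next section.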

Symbolic Execution as Search, and the Rise of Solvers

As is probably apparent, a good symbolic executor requires careful engineering. We've seen that symbolic execution employs an appealingly simple algorithm, but with high computational costs. In particular, the executor might traverse many different paths in the program, and it must make potentially expensive calls to an SMT solver to determine which paths are feasible. Considering the problem of traversing many paths, we will see that symbolic execution boils down to a kind of search problem: the goal is to search through a large space of possibilities to find events of interest, which for us are bugs. We'll also take a minute to look at how improvements in SMT solver performance make the path feasibility check possible even for larger programs. A well-known problem with symbolic execution is the path explosion problem. Essentially, for a symbolic executor to consider a program's entire space of executions, it needs to consider every path, but because most programs have a huge number of paths, we can't usually run symbolic execution to exhaustion. In fact, programs can be exponential in their branching structure. Here we see a program that has only four lines of code, and yet, because it branches on three variables, it has eight program paths, that is, two to the third possible paths, despite each variable being used only once. Loops on symbolic variables are even worse. In this program we're looping on variable a, which is symbolic. This causes us to consider an arbitrary number of loop iterations, because every time we re-reach the head of the loop, we consider whether to stay in the loop or to exit. Now, if we compare symbolic execution to static analysis, we can see that there is a clear benefit of static analysis: it will actually terminate even when considering all possible runs, and it does this by approximation and abstraction, approximating multiple loop executions, branch conditions, and so on. But, as we discussed when motivating symbolic execution, static analysis's use of abstraction can lead to false alarms. So what can we do to improve the way symbolic execution operates, to maximize its benefits? To understand that, we have to look at how symbolic execution can be viewed as a search algorithm. The simplest way to perform symbolic execution is to use either depth-first search or breadth-first search. Recalling our algorithm for symbolic execution, for depth-first search we use a worklist that is a stack, so we consider the most recent node, that is, the most recent program state, when deciding what to do next. Or we can make it a queue, where we prioritize things that were put into the queue earlier. The potential drawback of either of these two strategies is that neither is guided by any higher-level knowledge: we aren't telling the search that we might be looking for particular bugs, or that we want to get to a particular part of the program. Instead, it blindly follows the program structure in deciding which paths to execute. DFS in particular could easily get stuck in one part of the program by continuing to go deeper and deeper. Think about that loop example.
We could go around the loop over and over and over again, whereas in breadth-first search we consider both one further loop iteration and the case that we exit the loop. Therefore, of these two, breadth-first search is probably a better choice. On the other hand, it is more intrusive to implement; for example, it can't easily be made concolic. So what better search strategies might we use? One thing we can do is to prioritize our search, steering it towards paths more likely to contain assertion failures, since that's ultimately what we're trying to find. Recall that assertion failures can encode array bounds checks, null pointer checks, and so on. We only want to run for a certain amount of time, so prioritization ensures that that time is used best. To consider these different search strategies, think of program execution as a kind of DAG: the nodes in the graph are program states, and an edge between nodes means that one state can transition to the next. Therefore, we can think of symbolic search as a kind of graph exploration algorithm, trying to pick among the various possible paths in the overall search space. One useful technique to employ is randomness. We don't know a priori which paths to take, so adding some randomness is probably a good idea; this is something modern SAT solvers do, and it works very effectively in that setting. So one idea is to pick the next path to explore uniformly at random, which we can call the random path strategy. Another thing we can do is randomly restart the search if we haven't found anything interesting in a while. Imagine that we explore depth first for a while, but as we go further and further down a particular path, we increase our chances of stopping and going back to the start to try a different path altogether. And we can choose among equal-priority paths at random: if we had a prioritization strategy that, say, favored coverage, because we wanted to try paths that reach statements we haven't covered before, and several paths had equal priority, we could flip a coin to pick one over the other. One drawback of randomness is reproducibility. Because we may have made random choices on the way to finding a particular bug, if we run the symbolic executor on the same program twice, it's not guaranteed to find the bugs it found before, and in fact it might find new bugs. That doesn't necessarily sound so bad, except that it complicates software engineering: for example, once we fix a bug that was found, how can we run the symbolic executor again to confirm that it's no longer there? So it's probably good to use pseudo-randomness based on a seed, and to record which seed was used when bugs are found. That way you can confirm that you no longer hit the bug, or, if you re-run, you'll find the same bugs you found before. A moment ago I hinted at the idea of using coverage to guide the prioritization strategy. Here we'll consider it in a bit more detail: we should try to visit statements we haven't seen before when choosing which paths to follow.
So the approach is to score a statement based on the number of times it has been traversed, and then pick as the next statement to explore the one with the lowest current score. This can work because errors are often in hard-to-reach parts of the program, and the strategy favors going to parts of the program it hasn't seen before. On the other hand, we may never be able to get to a statement if a proper precondition has not been set up, so just favoring getting closer to that statement is not necessarily going to get us there. Another strategy is called generational search. You can think of it as a kind of hybrid between breadth-first search and a coverage-guided search. It works like this. At generation 0, we pick a program and a random input to it and run it to completion. At the end, we take paths from generation 0 and negate one branch condition in a path to yield a new path prefix; we find a solution for that prefix and then take the resulting path, semi-randomly assigning to any variables not constrained by the prefix. For generation n we apply the same approach, but branching off of generation n minus 1, and we use a coverage heuristic to prioritize among the different paths we could choose. Generational search is often used with concolic execution. Now, it's probably obvious at this point that no one search strategy wins all the time; different search strategies may find bugs that other strategies will not. So one obvious idea is to combine searches by doing multiple searches at the same time, essentially as parallel processes that we switch between. Depending on the conditions needed to exhibit a bug, one search may find it where another does not. You could even imagine using different algorithms to reach different parts of the program, depending on the properties of the program you're trying to find bugs in. SMT stands for satisfiability modulo theories, where by theories we mean mathematical theories beyond just boolean formulas. Such theories could include a theory of arithmetic over integers, for example. These theories can sometimes be encoded as SAT problems, and so some SMT solvers are basically front ends that translate the SMT problem into a SAT problem. For example, the theory of bit vectors, that is, arrays of bits with operations on them like addition, subtraction, and so on, can be encoded as a SAT problem. It turns out that modern SMT solvers implement a variety of optimizations, either in the SAT engine or in the translation to it, and these optimizations can make a big difference in performance. One example is to make use of axioms that witness equivalences. Another is to directly support, in the SMT solver, a theory of arrays that models modifications to memory. The alternative would be to have the symbolic executor simulate memory directly, rather than relying on the solver, but then such optimizations become less apparent. Another important optimization is to cache solver queries. It turns out that symbolic executors will repeatedly submit queries that contain identical subexpressions, and this makes sense because the path condition is built up incrementally over time.
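To illustrate the caching idea, here is a minimal sketch of my own (assuming the z3-solver Python package) that memoizes satisfiability results keyed on the constraints' textual form:

    from z3 import Int, Solver, sat

    _cache = {}

    def feasible(constraints):
        # Check satisfiability of a list of constraints, reusing cached answers
        # for queries we have seen before.
        key = tuple(sorted(str(c) for c in constraints))   # crude structural key
        if key not in _cache:
            s = Solver()
            s.add(*constraints)
            _cache[key] = (s.check() == sat)
        return _cache[key]

    alpha = Int('alpha')
    print(feasible([alpha > 5]))              # solver call
    print(feasible([alpha > 5]))              # answered from the cache
    print(feasible([alpha > 5, alpha < 3]))   # new query; prints False (infeasible)

Real tools use more sophisticated caches, for example also exploiting subset and superset relations between constraint sets, but the principle is the same.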
Finally, we can make the solver's job easier by eliminating some symbolic variables from consideration. For example, if we are testing whether a guard is feasible, we only care about the existence of a solution for the variables in the guard. Such a solution may also involve other variables that are syntactically related to the ones in the guard; that is, we keep under consideration the constraints whose variables are transitively related to the variables in the guard, and constraints over unrelated variables can be ignored. In total, these optimizations can make an enormous difference. SMT solvers have been a popular topic of research and development over the last several years, and there are several you can get for free online. Z3, developed at Microsoft Research, is quite sophisticated. Yices, developed at SRI, has been around for quite a while and continues to improve steadily. STP, the SMT solver used by EXE and KLEE, is available for free, and CVC is yet another. While smarter search strategies and SMT-based optimizations help, ultimately path-based search is limited. To see why, consider this program. It has a loop that runs 100 times, and each time around the loop it may either enter the if statement or not. This means it has 2 to the 100 possible execution paths. The program asserts that it's a bug if the counter is ever equal to 75, so out of the 100 iterations, exactly 75 of them need to take the true branch. This bug is going to be hard to find: there are about 2 to the 78 paths that reach the buggy line of code, that is, 100 choose 75 of them. If paths are explored more or less blindly, the probability of finding the bug is very low, because roughly 2 to the 78 paths exhibit it out of 2 to the 100 total, so the chance of hitting one is about 2 to the negative 22. In short, independent of search strategy, we'll have a very difficult time finding this bug using a path-based search algorithm. This is a fundamental limitation of path-based search in symbolic execution.
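A quick back-of-the-envelope check of those numbers, as a small Python sketch (the counts come from the lecture's example):

    import math

    total_paths = 2 ** 100                 # each of 100 iterations branches two ways
    buggy_paths = math.comb(100, 75)       # choose which 75 iterations take the true branch

    print(math.log2(buggy_paths))                  # roughly 77.7, i.e. about 2**78
    print(math.log2(buggy_paths / total_paths))    # roughly -22.3, i.e. about 2**-22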

Symbolic Execution Systems

Symbolic execution was invented in the 1970s but, as I said before, it failed to take hold at that time as a commonly used tool, because of limitations in both the algorithms and the machine power available. In the mid 2000s, two key systems triggered a revival of symbolic execution: DART, developed by Godefroid and Sen and first published in 2005, and EXE, by Cadar, Ganesh, Pawlowski, Dill, and Engler, published in 2006. These systems demonstrated the promise of symbolic execution by showing it could find real bugs in interesting and complicated systems, and as such they spurred interest in the topic; many new systems have been developed since. One important system that has made a big impact in practice is SAGE. It is a concolic executor, developed at Microsoft Research, that grew out of Godefroid's work on DART. It uses the generational search technique we talked about before. SAGE primarily targets bugs in file parsers, for example parsers for JPEG files, Microsoft Word and PowerPoint documents, and the like. This is a good fit for concolic execution, because parsing files is likely to terminate, and it only needs to consider input/output behavior rather than complicated, interactive system calls. SAGE has transitioned from research prototype to production tool; it has been used on production software at Microsoft since 2007. Between 2007 and 2013, SAGE was used remarkably often at Microsoft. For example, it has been run for more than 500 machine-years as part of the largest fuzzing lab in the world. It has generated 3.4 billion-plus constraints, making it the largest SMT solver usage ever. It has been applied to hundreds of applications and found hundreds of bugs that were missed by other bug-finding techniques; for example, one third of all Win7 WEX bugs were found by SAGE. Bug fixes shipped quietly to one billion PCs around the world, saving millions of dollars both for Microsoft and for the world. SAGE is now used daily on Windows, Office, and all of the big Microsoft products. KLEE is another mature tool; it grew out of the work on EXE. It symbolically executes LLVM bitcode. LLVM is a compiler framework that's now in regular use at Apple; it compiles source-language programs to an intermediate representation called LLVM bitcode, which is stored in a .bc file, and KLEE runs on that .bc file. It works in the style of the basic symbolic executor we saw before, but it uses fork to manage multiple states. That is, rather than keeping an explicit worklist, each time execution could go both directions at a branch, KLEE forks another version of itself which proceeds along that branch, and a separate tool manages all of the distinct copies of KLEE running at once. KLEE employs a variety of search strategies, primarily random-path and coverage-guided. It mocks up the environment to deal with system calls, file access, and so on, and it is freely available and built on top of LLVM. Looking at how KLEE actually works: in the original KLEE paper, published in 2008, the authors applied KLEE to Coreutils, the suite of small programs that ships with Linux and Unix distributions.
So these are programs like mkdir and paste and sed and ls and the like. The graph shows the coverage of KLEE-generated tests compared to the manual test suite that comes with Coreutils, and it shows that KLEE is better at testing than the manual suite, judging by the lines of code covered by tests. In particular, if we take the lines covered by KLEE's tests and subtract the lines covered by the manual tests, we see that for all but nine programs KLEE covers more lines than the manual suite. Here are a bunch of bugs that were found by KLEE; these are the command lines that produced the bugs, that caused these Coreutils programs to crash. Notably, the bugs involve very simple command lines and very simple input files, and yet they went undetected for quite a long time before KLEE found them. Since these second-generation tools, KLEE and SAGE, there have been further developments. One of them is the Mayhem system, developed at Carnegie Mellon by Dave Brumley and his collaborators. It runs on actual executables, that is, binaries. Like KLEE it uses a breadth-first-style search, but it also uses native execution, combining the best of symbolic and concolic strategies. As an interesting twist, it will automatically generate exploits when bugs are found, to shine a light on the fact that these bugs could in fact be security critical. In further developments from the Mayhem work, the same group produced Mergepoint, which uses a technique called veritesting that combines symbolic execution with static analysis. It uses static analysis for complete code blocks: it analyzes a straight-line piece of code and, in so doing, converts it into a formula that an SMT solver can use to answer questions about the program's execution. It uses symbolic execution for hard-to-analyze parts of the program, in particular loops, for which it may be hard to determine how many times they will run, as well as complex pointer arithmetic, system calls, and so on. This produces a better balance of time between the solver and the executor, resulting in better bug finding: bugs are found faster, and more of each program is covered in the same period of time. Incredibly, Mergepoint found more than 11,000 bugs in 4,300 distinct applications in a Linux distribution, including new bugs in highly tested code like the Coreutils programs KLEE had already analyzed. There are many other symbolic executors. Cloud9 is basically a parallel KLEE. jCUTE and Java PathFinder are symbolic executors for Java. Bitblaze is a binary analysis framework, so like Mayhem it works on binary executables. Otter is a symbolic execution framework for C that is directed: you can tell it which lines of the program you'd like to reach, and it will try hard to get to those lines when performing its search. And Pex is a symbolic executor for .NET programs. >> This unit delved into the details of symbolic execution, a powerful technology that can be used to find security-critical bugs in real software. Symbolic execution can be viewed, on the one hand, as a generalization of testing.
It uses static analysis machinery to develop new tests that explore different program paths. We can also view it as a kind of static analysis, but one with very high precision and no guarantee of termination; if it does terminate, its results are both sound and complete. Symbolic execution is used in practice today to find important bugs in production code: the SAGE tool is used heavily at Microsoft, and the Mergepoint tool regularly analyzes large Linux repositories. Symbolic execution tools of good quality are freely available, and we can expect that they, like static analysis tools, will penetrate more deeply into the mainstream in the near future.
