This is the crucial step: abstracting the concrete semantics away so that we can do control flow analysis (CFA). Fortunately, in small-step abstract interpretation, the concrete and abstract semantics are very similar.
In this article, I introduce the concept and need for control flow analysis and show how to convert concrete semantics from the definition of my Concrete CESK machine into abstract semantics in preparation for creating an Abstract CESK machine.
Small-step Abstract Interpretation
At the heart of analysis of higher-order languages (like Dalvik), is a recursive relationship between control-flow and value-flow: any attempt to tackle one necessarily means you must tackle the other. The class of algorithms that moves to solve this are Control-flow analyses - which specialize in solving the value-flow problem.
Small-step abstract interpretation is one way to write a CFA.
Higher-order Control-flow problem
The problem is simply stated as: the precise target of a function call may not be obvious. In languages Object-oriented like Java, Ruby, and Python, this problem is presented in the form of dynamically dispatched methods.
In the above code, the target of the method call
.method() depends on the value
that flows to the statement
object. In other words, the data-flow
affecting the value of
object creates a dependency with the control-flow
of the method call.
object were a parameter to a method, the
data-flow of the parameters of the method would be determined by the
control-flow of the function.
With a java-like pseudo-code example:
1 2 3 4 5 6
In the above example, the target of
func depends on where the method flows.
Further, it is not clear which method
func may refer. So in control-flow
analysis where the expression could be invoked must be considered along with
what types of argument is might receive.
Higher-order Value-flow problem
The consideration of which values might be produced by the expressions
x is the value-flow problem. This is generally undecidable. Instead we
must think of all possible values that a procedure could evaluate to - without
knowing exactly which values those might be.
This means that there is non-determinism in the evaluation of procedures and consequently leads to the inability of Concrete semantics to correctly model the flow of a program without having infinite amounts of memory at its disposal.
Once we add non-determinism into the mix, we need to redefine the CESK machine from Concrete semantics to Abstract semantics.
Abstract vs Concrete
There are two main changes that occur in the transition from concrete to abstract semantics.
- Infinite state-space must become finite state space.
- The addresses in the finite state-space become pointers to a set of all possible values for that address.
In the Concrete CESK machine that I built, it would be trivial to construct a recursive function that allocated a new address on every call. This would eventually fill up the available memory of the machine running it - unless, of course, you have a computer with an infinite amount of memory. Since it is safe to say that an infinite-memory machine doesn’t exist, we must abstract away the idea of an address.
It is both mathematically and computationally possible to define an abstract machine with a single address. This is due to the nature of abstract interpretation. Since an abstract address is a pointer to a set of possible values, then through non-determinism, you can give all possible (and impossible) execution paths with a single address. I will take a more practical approach.
The solution to how to represent a finite number of addresses comes from a simple observation: A statement is a finite thing and in any Dalvik program, there are a finite number of statements. So an abstract address could be the statement itself.
With that solution, the following definition of the conversion from an infinite store and address space to a finite store and address space follows:
In a way, one can think of the transition from the infinite state-space to the finite-state space as a simple “Throw hats on everything” approach. That thought would be wrong, however.
The problem with that approach is that since addresses contain frame pointers and frame pointers contain addresses, you analysis could run into an infinite frame pointer where its set of values contains itself.
k-CFA shows us how to tame recursion by removing it from the state-space through a process called “snipping the knots” - thereby removing these cyclical dependencies from the dependency graph. I’ll be following the outline in Abstract interpreters for free.
The first part of of making a snip is handled already by the Concrete CESK machine: add a store. The store is still defined in terms of an infinite number of addresses at this point:
We need to take this store and then thread it through the transition relation so that a dependency graph edge going from set would result in:
Thereby making the store an integral component of each state. This is known as a store-passing style transform. Fortunately, I already did this in the Concrete CESK machine.
Trickling up Abstraction
We need two components to calculate an “optimal” small-step abstract transition relation:
- An abstract state-space
- A Galois connection between the abstract and concrete state-spaces
Galois connections are seen mostly in order theory and is a particular correspondence between partially ordered sets. It would be way beyond the scope of this to delve too deeply into it. However, it is important to explain the role the correspondence plays in static program analysis.
In my article about static analysis, I touch a little on lattices and partial ordering where I introduce the concept of a supremum and infimum in terms of a lattice and their role in showing partial ordering of variable types.
I would like to take this a bit further and define a simple Galois connection resulting from the mapping functions and from the abstract transition relation above.
Take the following code:
1 2 3 4 5 6 7
We can clearly see that if a or b is 0, that we have caused a problem. We can define a single abstract function and a single aconcretization function as follows:
You can clearly see that the functions map from one semantic domain to the other. The goal is to ensure we have those and mappings for the entire transition function:
This happens next.
Abstracting Leaves of the dependency graph
After snipping and store-passing style transforms, what is left are sets of addresses that I will refer to as infinite leaf nodes. One of the two goals in the transition from concrete semantics to abstract semantics is to make addresses finite.
It is up to the analysis designer to choose a finite set for every leaf node which will become the leaves of the abstract state-space. It is equally important that an extraction function be created that maps a concrete element to an abstract element: . With a fixed extraction function, we can use inference rules to automate the construction of the abstract state-space with structural Galois connections.
Given this surjective map, , the structure yields the Galois connection:
Furthermore, the extraction function fixes polyvariance and the context-sensitivity of the analysis.
Percolating automatic abstraction up
With the use of inference rules, we can build structural Galois connections and recursively apply them all the way up through the top-level of sets of states. The inference rules are beyond the scope of this article, but can be found in Principles of Program Analysis.
After this recursive automatic synthesis of the abstract state-space, we have now built up Galois connections all the way up. Combining these connections, the extraction function on addresses and the snipping of , we can finally define the top-level state-space and the Galois connection.
The differences between concrete and abstract semantics look small on the surface. For the most part, it really is as simple as “throw a hat on everything”. However, there are fundamental differences between the two domains that require care.
In my next article, I will be applying these mathematical transformations to the implementation of an Abstract CESK machine.