$blog→errstr

do not commit

Concrete CESK for Android's Dalvik

| Comments

Introduction

At the heart of all Android applications is Dalvik byte code. It’s what everything gets compiled to and then run on the Dalvik VM. In order to do static program analysis for Android, you need to somehow interpret the byte code. That’s where the CESK machine shines.

The CESK machine, developed by Matthias Felleisen, provides a simple and powerful architecture used to model the semantics of functional, object-oriented and imperative languages and features like mutation, recursion, exceptions, continuations, garbage collection and multi-threading. It is a state-machine that takes its name from the four components of each state: Control, Environment, Store, and Kontinuation.

In this article, I implement a concrete CESK machine to interpret a dynamically typed object-oriented language abstracted from Dalvik byte code. Every byte code and its semantics have been transformed into this language.

CESK

Being a state machine, CESK has a notion of jumping or stepping from one state to another. In terms of sets, we can think of a program () as a set of these machine states () that exist within the set of all machine states () with a partial transition function (step) from state to state ().

Defining the state-space, as $$\sigma\in\Sigma = Stmt^{*}\times FP\times Store\times Kont$$, which allows us to encode a state as a simple struct:

(struct state {stmts fp stor kont})

If you are familiar with pushdown automata, then a CESK machine has many striking similarities. You can think of each state as the CES portion and the would be what would be pushed/popped from the stack between state transitions. (For a project in my computational theory course some classmates and I implemented a Non-Deterministic Pushdown Automata in Python that outputs to a commandline and DOT format, feel free to play around with it: PyDA)

C.ontrol: A sequence of statements

The Control component of a CESK machine is a control string. In the lambda calculus, the control string would be an expression. For Dalvik, we should think of this component as a sequence of statements. This sequence of statements gives an indication of which part of the program this state is.

E.nvironment: Frame pointers

The Environment component of a CESK machine is a datastructure that maps variables with an address. In the Dalvik CESK machine, we use simple frame pointers as the addresses.

Addresses

Registers are offsets from frame pointers and map to local variables, we will need to compute the location (frame offset) by pairing the frame pointer with the name of a Dalvik register

Objects and their fields are structurally equivalent to frame addresses.

So the set of all addresses include both Object and Frame addresses:

S.tore: The Heap

The Store component of a CESK machine is a data structure that maps addresses to values. In the Dalvik case, we map frame pointers to values.

K.ontinuations: The program stack (continuations)

The Kontinuation component of a CESK machine is essentially a program stack. Within Dalvik, you find exception handlers and procedure calls - and with all continuation based machines, the halt continuation to signify program termination.

Each continuation is placed on a stack, where the top-most matching continuation is found and executed: for exceptions this is the matching handler and for assignment it is the next assignment continuation.

Halt is handled as a termination continuation without context encoded into the component.

In the case of exceptions, Dalvik defines a type of exception (class name), the branch label where execution should go and the next continuation. $$ handle(className, label, \kappa) $$

Any other type of invocation affecting the program stack is an assignment continuation, where the return context for a procedure call is encoded with the register waiting for the result (name), the statement after the call, the frame pointer from before the call and the next continuation. $$ assign(name, \vec{s}, fp, \kappa)$$

Running the CESK machine

Since the CESK machine is a state-machine, we have a single partial transition function (from state to state) that is run until it is told to terminate (when we encounter the halt continuation) called step. We only need to have an initial state () and then iterate until we hit halt. So, we need four things:

  • inject to create the initial state, with an empty environment and store
  • step the transition function for each type of state transition
  • lookup a way to lookup frame pointers in the store
  • run run the CESK state-machine
dalvik_cesk.rkt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
(struct state {stmts fp store kont})

; create the initial machine state ς0
(define empty-fp (gensym))
(define empty-store (make-immutable-hash))
(define (inject stmt)
  (state stmt empty-fp empty-store 'halt))

; the partial state to state transition function
(define (next cur-state)
  ; extract the state struct's contents
  (let* ([stmts (state-stmts cur-state)]
         [fp (state-fp cur-state)]
         [σ (state-stor cur-state)]
         [κ (state-kont cur-state)]
         [current-stmt (first stmts)]
         [next-stmt (rest stmts)])
      (match current-stmt
        ...)))

; lookup the value of the framepointer, variable
(define (lookup σ fp var)
  (hash-ref σ `(,fp ,var)))

; extend store with one value
(define (extend σ fp var val)
  (hash-set σ `(,fp ,var) val))

; the algorithm for running the interpreter
(define (run state)
  (if (state? state)
    (let ([step (next state)])
      (run step))
    (printf "~s\n" state)))

Dalvik Byte-code Grammar

For the purposes of this article, I will be using the core grammar defined by Matt Might’s Java CESK article. He defined this by looking at all of Dalvik’s byte code and ensuring that its semantics are represented in a straightforward way. He divided the language into two classes of terms: statements and expressions.

program ::= class-def ...

class-def ::= class class-name extends class-name
              { field-def ... method-def ... }

field-def ::= var field-name ;

method-def ::= def method-name($name, ..., $name) { body }

body ::= stmt ...

stmt ::= label label:
      |  skip ;
      |  goto label ;
      |  if aexp goto label ;
      |  $name := aexp | cexp ;
      |  return aexp ;
      |  aexp.field-name := aexp ;
      |  push-handler class-name label ;
      |  pop-handler ;
      |  throw aexp ;
      |  move-exception $name ;

cexp ::= new class-name
      |  invoke aexp.method-name(aexp,...,aexp)
      |  invoke super.method-name(aexp,...,aexp)

aexp ::= this
      |  true | false
      |  null
      |  void
      |  $name
      |  int
      |  atomic-op(aexp, ..., aexp)
      |  instanceof(aexp, class-name)
      |  aexp.field-name

Transitions: Evaluation and stepping

Continuations

Remembering from earilier in the article, there are three types of continutations: assignment, handler, and halt. Each of the components of a Dalvik program will use those generalized definitions.

We can define an applyKont function to aid in the overall machine design, which will be utilized when we encounter returns and exceptions.

Then using applyKont with the assign and handle continuations is defined by:

With this definition, we can translate this to code with apply/κ:

apply_kont.rkt
1
2
3
4
5
6
7
8
9
10
(define (apply/κ κ val σ)
  (match κ
    ; assignment continuation
    [`(,name ,next ,fp ,κ_)
        (let ([σ_ (extend σ fp name val)])
          (next fp σ_ κ_))]
    ; handle continuation
    [`(,classname ,label ,κ_) (apply/κ κ_ val σ)]
    ; the termination continuation
    ['halt `(answer ,val)]))

Assignment: Atomic statements/expressions

Atomic expressions are expressions which evaluation must terminate, never cause and exception or side effect.

Atomic statements assign an atomic value to a variable, this involves evaluation of the statement/expression, calculating the frame address and updating the store.

The Atomic Expression Evaluator

To evaluate an atomic expression, we use the atomic expression evaluator:

We have some key types of atomic expressions and how they are evaluated.

Atomic values that can be immediately returned such as integers, booleans, void, null.

Register lookups simply involve knowing what the frame pointer offset is to do a lookup of the atomic value. Since we have encoded this with the name of the expression along with the frame pointer we have the frame address encoded into the store with (fp, name). There are two special registers "$this" and $ex, but use the same semantics as other register lookups.

Accessing an object field is similar to register lookups, you get the field offset from the object pointer with (op, field) from the store.

aexp_eval.rkt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
; AExp X FP X Store -> Value
(define (eval/atomic aexp ptr σ)
  (match aexp
    [(? atom?) aexp]
    [`(,v ,op ,n)
     (opify (eval/atomic v ptr σ) op (eval/atomic n ptr σ))]
    [else (lookup σ ptr aexp)]))

; A helper to determine if these are plain values
(define (atom? aexp)
  (match aexp
    [(? void?) #t]
    [(? null?) #t]
    [(? boolean?) #t]
    [(? integer?) #t]
    [else #f]))

; evaluate aexpr operations
(define (opify lh op rh)
  (match op
    ...
    ['!= (not (= lh rh))]
    ['* (* lh rh)]
    ['- (- lh rh)]))

The Atomic Assignment Statement

Like I said earlier, we need to evaluate the atomic expression and assign it a variable-value pair in the store. We can define this operation as:

The is the store updated with the new atomic assignment variable and value mapping:

I have a couple of articles regarding variable substitution and implementation if you are interested in getting a larger view of what is going on here.

atomic_assignment.rkt
1
2
3
4
5
6
7
8
(define (next state)
  ...
    (match current-stmt
      [`(,varname = ,aexp)
            (let* ([val (atomic-eval aexp fp σ)]
                   [σ_ (extend σ fp varname val)])
                (state next-stmt fp σ_ κ))]
      ...

This creates a new state, where the store is now updated with a mapping of the variable var to the value val.

Object Assignment and Creation

There is another type of assignment similar to the atomic assignment statement, a new object. This is when a variable is being assigned to a brand new object, e.g. Object o = new Object(). Consequently, the definition of object creation and assignment is similar to atomic assignments:

The is the store updated with the new object assignment variable and value mapping (corresponding to a never-before used object pointer op’):

object_assignment.rkt
1
2
3
4
5
6
7
8
(define (next state)
  ...
    (match current-stmt
      [`(,varname = new ,classname)
            (let* ([op_ `(object ,classname ,(gensym))]
                   [σ_ (extend σ fp varname op_)])
                (state next-stmt fp σ_ κ))]
      ...

This creates a new state, where the store is now updated with a mapping of the variable varname to the value (object classname (gensym)). The (gensym) is used to generate a guaranteed-to-be-globally unique value.

nop, label, line

There are thre types of statements that cause no change in state: nop, label, and line. We can define nop as: $$step(\mathbf{nop}: (\vec{s}, fp, \sigma, \kappa) \to (\vec{s}, fp, \sigma, \kappa)$$ This says that when we se a nop, get the next statement in the list of statements and run it.

The only difference between nop and label is that label has an identifier (the label) with it $$step(\mathbf{label}\;\mathit{l}: (\vec{s}, fp, \sigma, \kappa)) \to (\vec{s}, fp, \sigma, \kappa)$$

line is defined the same as label and is a side-effect of the s-expression generation, not part of the actual grammar.

We can add a few new items to our transition function’s match statement:

next_noplabel.rkt
1
2
3
4
5
6
(define (next state)
  ...
    (match current-stmt
      ['(nop) (state next-stmt fp σ κ)]
      [`(line ,n) (state next-stmt fp σ κ)]
      [`(label ,l) (state next-stmt fp σ κ)])))

goto is much like nop, except that we must do a lookup using the label to find the next statement sequence. $$next(\mathbf{goto}\;\mathit{label} : \vec{s}, fp, \sigma, \kappa) = (S(label), fp, \sigma, \kappa)$$

next_noplabel.rkt
1
2
3
4
5
(define (next state)
  ...
    (match current-stmt
      [`(goto ,l) (state (lookup-label l) fp σ κ)]
      ...

S function for Label Lookups

Labels are identifiers for statements and used in jumping from one statement to another. We will need to store these labels for lookup later. So, let’s define a label map and then a mechanism to lookup labels. We define this mapping function as

In code, what we are trying to do is to find the label and execute the next statment, so we will need a label store, a way to update the store, and a way to lookup the next statement by the label

label_stor.rkt
1
2
3
4
5
6
7
8
9
10
; global label store
(define label-stor (make-hash))

; update the label store
(define (extend-label-stor label stmt)
  (hash-set! label-stor label stmt))

; lookup the statement from the label in the store
(define (lookup-label label)
  (hash-ref label-stor label))

But, this means we also need to update the store when we see a label, so an update to the earlier match construct is in order:

label.rkt
1
2
3
4
5
6
7
(define (next state)
  ...
    (match current-stmt
      [`(label ,l)
            (extend-label-stor l next-stmt)
            (state next-stmt fp σ κ)]
      ...

if-goto Statement

The if-goto statement is similar to a jump, the only difference being that the conditional statement must be evaluated before you can determine which branch to execute. We will use the atomic-eval that was constructed earlier to determine the truthiness of the expression, then either issue a goto or just move to the next statement.

ifgoto.rkt
1
2
3
4
5
6
7
8
9
10
11
12
(define (next state)
  ...
    (match current-stmt
      ; next(if e goto label) =
      ;  if true then label
      ;  else next-stmt
      [`(if ,e goto ,l)
        ; if true, goto label, otherwise next statment
        (if (atomic-eval e fp σ)
              (state (lookup-label l) fp σ κ)
              (state next-stmt fp σ κ))]
      ...

Invoking Methods

Dalvik supports inheritance, due to a possible traversal of super classes, method invocation is necessarily the most complicated to model. Methods involve using all four components of the CESK machine: Control, Environment, Store, Continuation.

It is useful to abstract away a simplified situation: assume that the method has already been looked up. Thus, we can define an applyMethod helper function to aid in applying the method to its arguments: $$applyMethod : Method \times Name \times Value \times AExp^{*} \times FP \times Store \times Kont \rightharpoonup \Sigma$$

Further assume a method is defined as m = def methodName($v_1,...,$v_n) {body}

applyMethod Helper

applyMethod needs to do the following:

  • Lookup the values of the arguments
  • Bind those values to the formal parameters of the method
  • Create a new frame pointer
  • Create a new continuation
  • Ensure the next sequence of statements is included in the new continuation
apply_method.rkt
1
2
3
4
5
6
7
8
(define (apply/method m name val exps fp σ κ s)
  (let* ([fp_ (gensym fp)]
         [σ_ (extend σ fp_ "$this" val)]
         [κ_ `(,name ,s ,fp ,κ)])
    (let ([σ* (car (map (lambda (v e)
                     (extend σ_ fp_ v (eval/atomic e fp σ)))
                   (method-formals m) exps))])
      (state (method-body m) fp_ σ* κ_))))

Invoking a method

With apply/method now doing much of the heavy lifting, invoking a method is reduced to a simple method lookup. lookup is a partial function that traverses through the inheritance chain until it finds the matching method. First, let’s define invoke, we’ll need the methodName for our lookup function.

In order to run the applyMethod function, we need a few variables defined: val and m. We can get val by a store lookup:

Getting m is where we finally need to define lookup since it is what finds the correct method. We need two values to process our lookup: className and methodName. We already have methodName from our invoke function, and we can get our className by extracting it from val:

Now that we have both className and methodName we can define our partial function lookupMethod that will traverse the class hierarchy to find the correct method to invoke:

In code, this sequence of functions becomes:

match_invoke.rkt
1
2
3
4
5
6
7
8
9
10
(define (next state)
  ...
    (match current-stmt
      [`(,name = invoke ,e ,mname ,vars)
       (let* ([$val (lookup σ fp e)]
              [cname (match $val
                       [`(object ,cname ,fp) cname])]
              [m (lookup/method cname mname)])
         (apply/method m name $val vars fp σ κ next-stmt))]
      ...

We can’t implement lookup/method quite yet, however, since I still haven’t defined and implemented classes. But, for now, let’s move on.

Return Statement

A return is an application of the current continuation to the return value. We already have the apply/κ function for application, so we just need to define what to pass in the val parameter. In the case of a return value, we need to evaluate the value to ensure we get the atomic instance:

In code, this is simply:

match_return.rkt
1
2
3
4
5
(define (next state)
  ...
    (match current-stmt
      [`(return ,e) (apply/κ κ (atomic-eval e fp σ) σ)]
      ...

Exceptions

To handle exceptions we have several cases to implement: when a continuation is an exception handler, pushing and popping exception handlers, throwing exception handlers, and capturing exceptions.

Exception Handler Continuation

This is the simplest continuation to handle, we simply skip over the current continuation and we have already implemented it in the apply/κ function, so there is no need to do anything more. This would happen if the current continuation was an exception, but no exception was needed.

Pushing and Popping Exception Handlers

We will have two ways to put and get exception handlers, but pushing and popping from the program stack with a pushhandler and a pophandler:

In code:

match_pushpop.rkt
1
2
3
4
5
6
7
8
(define (next state)
  ...
    (match current-stmt
      [`(pushhandler ,name ,l)
        `(,next-stmt ,fp ,σ ,(cons `(,name ,l ,κ) κ))]
      [`(pophandler)
        `(,next-stmt ,fp ,σ ,(car κ))]
      ...

Throwing Exception Handlers

In order to throw an exception, we must search the stack for a matching exception handler. Implementing a helper function handle to do this for us will help. First, let’s define helper as:

Here is how we will traverse the stack, putting last thrown exceptions into the register “$ex” as is protocol:

If className is a subclass of className’:

If not:

A throw skips over non-handler continuations:

handle in code looks like this:

handle.rkt
1
2
3
4
5
6
7
8
9
(define (handle o fp σ ex)
  (match `(,o ,ex)
    ; handler continuation
    [`((,op ,name) (,name_ ,l ,κ))
        (if (isinstanceof name_ name)
            `(,(lookup-label l) ,fp ,(extend σ fp "$ex" o) ,κ)
            (handle o fp σ κ))]
    ; non-handler continuation
    [`(,val (,name ,s_ ,fp_ ,κ)) (handle val fp_ σ κ)]))

You might have noticed a call to isinstanceof. I will define this function later when defining classes.

With the handle helper function, we can now define the throw statement:

Then we can add this to the transition function:

match_throw.rkt
1
2
3
4
5
(define (next state)
  ...
    (match current-stmt
      [`(throw ,e) (handle (atomic-eval e fp σ) fp σ κ)]
      ...

Capturing Exceptions

Since we store the last thrown exception into the $ex register, it can be examined to determine which exception was caught. We then use this to go to the label that handles this execution branch.

We can define this capturing with a moveException statement:

The store is updated by simply moving the exception e into the register $ex:

Putting this in our transition function, this is:

match_exception.rkt
1
2
3
4
5
6
7
(define (next state)
  ...
    (match current-stmt
      [`(move-exception ,e)
          (let ([$ex (lookup σ fp "$ex")])
            `(,next-stmt ,fp ,(extend σ fp $ex e)))]
      ...

Class Definitions

With the calls to the unimplemented lookup/method and isinstanceof functions, it’s time to implement them! But, first we need to define and implement classes.

Classes are defined by their name, a potential super class, 0 or more fields (instance variables), and 0 or more methods.

What we need is to be able to represent classes in a sane way that will allow us to keep track of super classes, fields, and methods. A classDef then, in code is: (struct class {super fields methods}) where super is the name of the super class, fields is a list of field names, and methods is a mapping defined by:

Since methods also have multiple properties, we will also define a method as: (struct method {formals body}).

We will also need a way to store and lookup classes. Since class names are unique, we can define a simple table of classes and a lookupClass method defined as:

Now in code, classes are:

classes.rkt
1
2
3
4
5
6
7
8
9
10
(struct class {super fields methods})
(struct method {formals body})

(define class-table (make-hash))

(define (lookup/class name)
  (hash-ref class-table name))

(define (extend-class-table name class)
  (hash-set! class-table name class))

Method Lookup

Now that we have defined classes, method lookup is a recursive function that traverses the classes hierarchy until it reaches a matching method, defined by

In code:

lookup_method.rkt
1
2
3
4
5
6
7
8
9
(define (lookup/method classname name)
  (let* ([class (lookup/class classname)]
         [methods (class-methods class)]
         [super (class-super class)])
    (if (hash-has-key? methods name)
        (hash-ref methods name)
        (if (void? (class-super class))
            (lookup/method (class-super class) name)
            (void)))))

Since super is just a string, all we need to do is check for string equality. By definition, a class without a super class is void.

Is Instance Of

Finding out if a class is an instance of another class is simple: find out if the class name is anywhere in its direct class hierarchy, returning true if it is and false otherwise:

In code:

is_instance_of.rkt
1
2
3
4
5
6
7
8
9
10
(define (isinstanceof super base)
    (if (void? super)
        #f  ; reached the top class
        (if (equal? super base)
            #t
            (let* ([class (lookup/class base)]
                   [upper (lookup/class super)])
              (if (equal? (class-super class) super)
                  #t
                  (isinstanceof (class-super upper) super))))))

Since super is just a string, we need to only check string equality. By definition, super is void if there is no higher class.

Comments