View on GitHub

Doop Framework 101

Tutorial site for PLDI 2015

Download this project as a .zip file Download this project as a tar.gz file

Background Presentation Slides

About

Doop is a declarative framework for static analysis of Java programs, centered on pointer analysis algorithms. Doop provides a large variety of analyses and also the surrounding scaffolding to run an analysis end-to-end (fact generation, processing, statistics, etc.).

The declarative nature of Doop stems from its use of Datalog (more specifically, LogiQL, a Datalog dialect developed by LogicBlox) to specify an analysis.

The building blocks of Datalog programs come in the form of predicates. Our input facts (a.k.a. EDB logic) are represented as predicate values, e.g., Person("John") or Parent("John", "Johnny jr").

Then we have rules (a.k.a. IDB logic) to infer new information from facts already known to be true. This continues until no new information can be extracted. E.g.,

Ancestor(x, y) <- Parent(x, y).

To familiarize yourself with Datalog evaluation, you may want to try the LogicBlox REPL tutorial.

Doop + Datalog Setup

Doop is available at bitbucket as a public repo.

Please use LogicBlox engine version 3.10.X for this tutorial and download+unpack the main tutorial file from the links at the top of the page. Put the engine (or a symbolic link to it) under the directory code. Also, clone Doop under the same directory. If you are following this tutorial at its PLDI slot, the relevant files should exist in the archive you copied.

Issue the following command to set up the environment.

$ source code/environment.sh

Toy Example

We want to compute the ancestors for a set of people. Our rules (left-arrow notation) along with their type declarations (right-arrow notation) are in ancestors.logic. Our input facts (a.k.a. delta logic) are in facts.logic. The + operator in front of the delta predicates denotes that we want to add facts to our database (we could also remove or update facts).

We invoke actions on our Datalog engine using bloxbatch.

$ cd code/examples/ancestors
$ bloxbatch -db DB -create -overwrite              # create our database
$ bloxbatch -db DB -addBlock -file ancestors.logic # load our rules
$ bloxbatch -db DB -execute -file facts.logic      # load our facts
$ bloxbatch -db DB -print Ancestor                 # print computed results
predicate: Ancestor(Person, Person)
...
/--- start of Ancestor facts ---\
  [1]john, [0]harry
  [2]dave, [0]harry
  [2]dave, [1]john
\---- end of Ancestor facts ----/

The line Person(x), Name(x:n) -> string(n). states that a person (represented by an internal ID) can be constructed by providing a string in the Name predicate. This is known as refmode. This is also shown in the print output. For example, [1]john means that the string "john" is used to create a Person with the ID 1.

Doop Examples

Now let's focus on more meaningful examples using Doop directly. We will use a running example, found in Example.java.

$ cd ..
$ mkjar Example.java   # create Example.jar
$ cd ../doop

We will run a simple naive analysis (-a naive option) on the generated jar file (-j ../examples/Example.jar option). This analysis has only a few basic rules but it's a good representative skeleton of actual analyses. Since Doop performs a whole program analysis, the library will be analyzed along with application code. With the option --jre 1.7 we specify the desired version of the library.

./doop -a naive -j ../examples/Example.jar --jre 1.7 --no-stats

After the analysis has run, we can gather results by issuing queries directly to the database.

Analysis structure

Input facts are auto-generated by the framework (using Soot) and then imported to the database so our rules can process them. The input schema can be found

The rules for our naive analysis can be found in logic/analyses/naive/analysis.logic.

For example, the following rule states that when we have a heap allocation of a ?heap object that is assigned to variable ?var inside a method deemed reachable by our analysis, then we can infer that the variable may point to this heap object.

VarPointsTo(?var, ?heap) <-
  AssignHeapAllocation(?heap, ?var, ?inMethod),
  Reachable(?inMethod).

Furthermore, when we have some kind of assignment (direct or indirect) from one variable to another, and we know that the source variable may point to some heap object, then the target variable may point to the same heap object.

VarPointsTo(?to, ?heap) <-
  Assign(?to, ?from),
  VarPointsTo(?from, ?heap).

Note here that this rule is recursive; previously known facts about the VarPointTo relation may lead to the inference of additional facts. Doop analysis rules are mutually recursive in complex ways.

Accessing the database

After the end of an analysis, a symbolic link for the resulting database can be found under the results directory. Also, for convenience, a second symbolic link is created at top level called last-analysis, each time pointing to the latest successful analysis.

Get a predicate's entries

As we already saw, the easiest way to interact with the database is to simply print all the entries of a certain predicate.

$  bloxbatch -db last-analysis -print FieldPointsTo
predicate: FieldPointsTo(HeapAllocation, FieldSignature, HeapAllocation)
...
/--- start of FieldPointsTo facts ---\
  [113914]Example.test/new Cat/1, [11417]<Cat: Cat parent>, [113915]Example.test/new Cat/2
\---- end of FieldPointsTo facts ----/

Doop represents (and abstracts away) objects by using their allocation point in the program. Example.test/new Cat/1 refers to the second (zero-based indexing) Cat object allocated inside method Example.test().

As expected, the parent field of the second Cat object may point to the third Cat object.

Query #1

Our first query is to ask for VarPointTo entries of variables declared in Example.morePlay().

$ bloxbatch -db last-analysis -query \
'_(?var, ?heap) <- VarPointsTo(?var, ?heap), Var:DeclaringMethod(?var, "<Example: void morePlay(Cat)>").'
  Example.morePlay/@this, Example.main/new Example/0
  Example.morePlay/r0, Example.main/new Example/0
  Example.morePlay/r3, Example.test/new Cat/1
  Example.morePlay/r4, Example.test/new Cat/1
  Example.morePlay/r1, Example.test/new Cat/1
  Example.morePlay/@param0, Example.test/new Cat/1
  Example.morePlay/r2, Example.test/new Cat/2
  Example.morePlay/r3, Example.test/new Cat/2
  Example.morePlay/r4, Example.test/new Cat/2

The string provided to the -query flag can be a set of left and right arrow Datalog rules. Newly created predicates have to start with _ since they will only exist for the duration of the query evaluation. Refmode values can be used directly, and the engine will automatically substitute them with their internal IDs. E.g., the following part

Var:DeclaringMethod(?var, "<Example: void morePlay(Cat)>")

is equivalent to

Var:DeclaringMethod(?var, ?method), MethodSignature:Value(?method:"<Example: void morePlay(Cat)>")

Note here that Doop analyzes Java bytecode. Input facts are generated using Soot, which transforms Java bytecode to Jimple, a language based on three address code. As a result new temporary variables are introduced and also original variable names might be lost (they can be retained through specific flags in javac and Soot).


Query #2

A more advanced query can be found in query2.logic. Essentially, we compute a transitive closure on the CallGraphEdge relation. The logic used in a query can be as complicated as in any "normal" Datalog program.

$ bloxbatch -db last-analysis -query -file ../examples/query2.logic
  <Example: void main(java.lang.String[])>, <Example: void test(int)>
  <Example: void test(int)>, <Example: void morePlay(Cat)>
  <Example: void main(java.lang.String[])>, <Example: void morePlay(Cat)>
  <Example: void test(int)>, <Cat: void setParent(Cat)>
  <Example: void main(java.lang.String[])>, <Cat: void setParent(Cat)>
  <Example: void test(int)>, <Cat: Cat getParent()>
  <Example: void morePlay(Cat)>, <Cat: Cat getParent()>
  <Example: void main(java.lang.String[])>, <Cat: Cat getParent()>
  <Example: void test(int)>, <Cat: void play()>
  <Example: void morePlay(Cat)>, <Cat: void play()>
  <Animal: Animal playWith(Animal)>, <Cat: void play()>
  <Example: void main(java.lang.String[])>, <Cat: void play()>
  <Example: void test(int)>, <Dog: void play()>
  <Example: void main(java.lang.String[])>, <Dog: void play()>
  <Example: void test(int)>, <Animal: Animal playWith(Animal)>
  <Example: void morePlay(Cat)>, <Animal: Animal playWith(Animal)>
  <Example: void main(java.lang.String[])>, <Animal: Animal playWith(Animal)

The line Instruction:Method[?invocation] = ?fromMethod found in the previous query uses a special form of predicate known as functional predicate. Those are similar to normal ones, but they act like a map. Values found between the square brackets are mapped to only on value on the right.

Doop + DaCapo Benchmarks

We frequently analyze various programs from the DaCapo Benchmarks suite using a variety of advanced analyses. E.g., let's analyze the antlr benchmark using a 2 type-sensitive analysis.

./doop -a 2-type-sensitive+heap -j benchmarks/dacapo-2006/antlr.jar benchmarks/dacapo-2006/antlr-deps.jar --jre 1.7 --dacapo

Towards the end of execution, Doop will report a set of metrics gathered from the analyzed program. Those metrics are computed through the use of various queries on the resulting database. Those are found under logic/addons/statistics.

Query #3

For example, one metric is the computation of casts that potentially may fail. It joins input facts as well as facts computed during execution to infer casts where the related variable may point to an object that is incompatible with the type of the cast.

_Stats:Simple:PotentiallyFailingCast(?type, ?from, ?to) <-
    _Stats:Simple:ReachableCast(_, ?type, ?to, ?from),
    Stats:Simple:InsensVarPointsTo(?heap, ?from),
    HeapAllocation:Type[?heap] = ?heaptype,
    ! AssignCompatible(?type, ?heaptype).

The use of _ as a predicate parameter denotes that we don't care for a specific value. It represent a parameter that is not bound.

The above query can be found isolated in query3.logic.

$ bloxbatch -db last-analysis -query -file ../examples/query3.logic

Aggregate Functions

Datalog supports the use of aggregation functions. One such function is count. E.g., if we would like to compute the total number of VarPointsTo entries we would use the following.

$ bloxbatch -db last-analysis -query \
'_[] = ?n <- agg<<?n = count()>> VarPointsTo(_, _, _, _).'
  4569312