Trace Surfing Presentation

Trace Surfing
A tale of data structure recovering and other yerbas

By Agustin Gianni, Immunity Inc.
Problem Statement
Given a memory trace, what information does the trace give us about the underlying data structures?
Road map

Investigation of previous approaches
Realization that they kind of suck
Enlightenment phase: how can we improve
Introduction
What is a memory trace?
A memory trace is a collection of all the memory accesses performed by an application.
Both reads and writes
How can I obtain a memory trace?
Binary instrumentation: pintool, DynamoRIO
Full system emulation: QEMU, BOCHS

Example Memory Trace

# White listed image `calc.exe`
# Loading hooks from file hooks.hks
# Loaded hook alloc:test_custom_alloc:00000774:0:my_alloc_
# Loaded hook free:test_custom_alloc:000007b6:0:my_free_
L:calc.exe:0x003a0000:0x0045ffff
# Thread 0x0 started
# Instrumented malloc at 0x75619cee
# Instrumented free at 0x75619894
# Instrumented realloc at 0x7561b10d
# Instrumented calloc at 0x7561c456
W:0x003a76c6:0x01d125e0:0x01d125e0:0x00000004:0x0000000f
W:0x003a76cc:0x01d125e0:0x01d125e4:0x00000004:0x0000000f
F:0x003b8f9a
I:0x003b8f9a:0x00000031:0x00000000
F:0x003b8fdc
I:0x003b8fdc:0x00000031:0x00000000
# Thread 0x1 did not finish but application exited.
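A trace like the one above can be loaded with a few lines of Python. This is only a minimal sketch: the slides do not document what each colon-separated field means, so the parser keeps the record kind (the first field) and the remaining fields as they are, and assigns no meaning to them. The names `Record`, `parse_line` and `parse_trace` are made up for this example.

```python
from collections import Counter
from typing import NamedTuple


class Record(NamedTuple):
    kind: str      # first field of the line, e.g. "W", "F", "I" or "L"
    fields: tuple  # remaining fields; "0x..." values are converted to int


def parse_line(line):
    """Parse one trace line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    kind, *rest = line.split(":")
    fields = tuple(int(f, 16) if f.startswith("0x") else f for f in rest)
    return Record(kind, fields)


def parse_trace(text):
    """Return every non-comment record found in a trace."""
    return [r for r in map(parse_line, text.splitlines()) if r is not None]


sample = """\
# Thread 0x0 started
L:calc.exe:0x003a0000:0x0045ffff
W:0x003a76c6:0x01d125e0:0x01d125e0:0x00000004:0x0000000f
W:0x003a76cc:0x01d125e0:0x01d125e4:0x00000004:0x0000000f
F:0x003b8f9a
I:0x003b8f9a:0x00000031:0x00000000
"""
records = parse_trace(sample)
print(Counter(r.kind for r in records))
```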
Introduction
Why do we care about recovering data structures?
Large binaries are a pain to reverse
Especially object-oriented code
Virtual Function Tables and friends
Makes reverse engineering happier
  Saves time
  HexRays with data types
Why not?
  Computers got fast enough to trace every single memory access
Introduction
Has anyone approached the problem?
Dynamic analysis
  Howard: Dynamic Excavator for Reverse Engineering Data Structures
  Rewards: DDE, Dynamic Data Structure Excavation
Static analysis
  WYSINWYX: What You See Is Not What You eXecute
  Based on abstract interpretation, blah, blah, blah!
The Rewards / Howard approach
Trace every single memory access
  Heap
  Stack
Define type sinks
  System calls
  Library calls
  Special purpose instructions
    For instance, string manipulation instructions on Intel architecture.
Propagate recovered types
Analyze the memory trace
Type Sinks
A type sink is a function, syscall or instruction whose argument types we already know
System calls and standard libraries are the most verbose
For instance:
  ssize_t read(int fd, void *buf, size_t count);
  Leaks four types: ssize_t, int, void *, size_t
Also we can extract semantics
  We know that 'fd' is a descriptor
  'buf' is a buffer
  Etc.
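The idea can be sketched as a lookup table that is consulted whenever a known function is called. Everything here is hypothetical naming (`TYPE_SINKS`, `apply_sink`), and the "length" meaning attached to `count` is an assumption added for illustration; only the prototype of `read` comes from the slide.

```python
# Hypothetical sink table: name -> (return type, [(argument, type, meaning)]).
TYPE_SINKS = {
    "read": ("ssize_t", [
        ("fd", "int", "descriptor"),
        ("buf", "void *", "buffer"),
        ("count", "size_t", "length"),  # meaning assumed for illustration
    ]),
}


def apply_sink(name, arg_locations, types):
    """Tag the locations that held the arguments of one observed call.

    arg_locations lists where each argument was stored (for example stack
    slot addresses), in declaration order. Returns the sink's return type
    so the caller can tag wherever the result ends up.
    """
    ret_type, params = TYPE_SINKS[name]
    for (_arg_name, arg_type, meaning), location in zip(params, arg_locations):
        types[location] = (arg_type, meaning)
    return ret_type


types = {}
ret = apply_sink("read", [0x1000, 0x1004, 0x1008], types)
```

After the call, `types` maps each argument location to the type and meaning leaked by the sink, and those tags are what gets propagated through the rest of the trace.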
Type Sinks
Instructions can also leak types
Intel String Operations
  CMPS, INS, LODS, MOVS, OUTS, SCAS, STOS
Intel Floating Point Instructions
  FADD, FDIV, FMUL, and so on.
Jumps
  JG / JL: signed integers
  JA / JB: unsigned integers
Memory dereferences
  Data dereferences leak half a type
  We just know the dereferenced address is a pointer
Indirect calls leak function pointer types
  We know that the dereferenced address contains a pointer to a function.
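A minimal sketch of instruction-level sinks, using only the mnemonics listed on the slide. The hint strings and the `type_hint` helper are made up for this example, and real disassembly (operand-size suffixes such as MOVSB, REP prefixes) would need extra normalization that is not attempted here.

```python
# Mnemonic -> type hint, limited to the instructions named on the slide.
INSTRUCTION_SINKS = {
    **dict.fromkeys(
        ["cmps", "ins", "lods", "movs", "outs", "scas", "stos"], "string"),
    **dict.fromkeys(["fadd", "fdiv", "fmul"], "floating point"),
    **dict.fromkeys(["jg", "jl"], "signed integer"),
    **dict.fromkeys(["ja", "jb"], "unsigned integer"),
}


def type_hint(mnemonic):
    """Return the type leaked by an instruction, or None if it is no sink."""
    return INSTRUCTION_SINKS.get(mnemonic.lower())


print(type_hint("JG"))    # signed integer
print(type_hint("fmul"))  # floating point
print(type_hint("mov"))   # None
```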
What do we want to recognize?
Things to recognize:
  Structures / Classes
  Arrays
  Pointers
How?
  Study how the memory is accessed
Identifying Pointers
Pointers are 'easy' to detect
  Just see what instructions dereference memory
  The dereferenced argument must be a valid pointer
    Otherwise the program would crash
Problem
  We cannot yet know the type of the pointer
  If we are lucky enough, and by lucky I mean that we have sufficient code coverage, we will identify the type of the pointer.

Warning: we are entering the terrain of incomplete and unsound assumptions.
Absolute correctness
Do we really care about absolute correctness?
  Hint: I don't
  Even if we could automatically identify a fraction of the types correctly, that saves us work.
  Inconsistent typing is detected by humans
    Eventually decisions/corrections must be done
We are not aiming to solve unsolvable problems
  We cannot get back what is not there
  Compilation is not bidirectional
    Although Rolf may argue this, I've been told ;)
Identifying Structures
Typically structure fields are accessed in an indirect way
  This depends heavily on the compiler and the optimization level.
  Often, access patterns will be similar.
Example
  Let A be a base pointer
  *(A + 0) is the first field
  *(A + 8) is the second field
  And so on
What we want to do is to detect indirect memory accesses.
  We can obtain this from a memory trace
But
  What if A was not a structure
  Let A be an array
  *(A + 0) is the first element
  *(A + 8) is the second element
  And we are screwed
Also, sometimes structure fields are accessed directly
  There is no base pointer
There is no way we can decide, with certainty, whether a pointer points to a structure or an array
  We have to make unsound assumptions
  Rely on compiler specific constructs
  Heuristics
  And why not a bit of magic
In the end, manual work needs to be done
  Still, less work than reversing manually
To distinguish between arrays and structures we use some heuristics
Memory accesses are generally scattered
Example:
  Access field at offset 0x00
  Then offset 0x10
  And so on
Size of the access is generally heterogeneous
Example:
  Access field 2 which is an integer
  Then access field 3 which is a short integer
  Etc.
Identifying Structures - Example

Memory accesses
  1 DWORD
  2 DWORD
  3 WORD
  4 WORD
  5 DWORD
  6 BYTE
  7 WORD
[Figure: structure layout showing which field each numbered access touches]
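The two heuristics (scattered offsets, heterogeneous sizes) can be sketched as a small check over the accesses made through one base pointer. The function name is made up, and the offsets in the example are invented for illustration; only the access sizes (DWORD, DWORD, WORD, WORD, DWORD, BYTE, WORD) come from the slide.

```python
def looks_like_structure(accesses):
    """accesses: (offset, size) pairs, all relative to the same base pointer.

    Returns True when the sizes are heterogeneous or the accessed offsets
    are not evenly spaced, which are the two hints described in the slides.
    """
    sizes = {size for _, size in accesses}
    offsets = sorted({offset for offset, _ in accesses})
    strides = {b - a for a, b in zip(offsets, offsets[1:])}
    return len(sizes) > 1 or len(strides) > 1


# Sizes follow the slide; the offsets are hypothetical.
struct_like = [(0, 4), (4, 4), (8, 2), (10, 2), (4, 4), (12, 1), (10, 2)]
array_like = [(0, 4), (4, 4), (8, 4), (12, 4)]

print(looks_like_structure(struct_like))  # True
print(looks_like_structure(array_like))   # False
```

A uniform pattern does not prove the target is an array; it only means this heuristic has nothing to say about it.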
There is a considerable number of cases where this will fail
The most trivial cases
  Initializing a structure with memset
  Copying a structure with memcpy
How do we solve this
  If we have more than one access pattern, favor the more irregular
Identifying Arrays
We can identify arrays by watching memory accesses on loops
There are two cases
  Sequential memory accesses
  Random memory accesses
Identifying Arrays
Sequential memory accesses
  Let A be a pointer
  We are on a loop
  A is dereferenced at loop cycle one.
  B is generated also at loop cycle one.
  Next iteration B is dereferenced.
  A is likely an array pointer
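One simple way to approximate the sequential case is to look at the addresses that a single instruction dereferences across loop iterations and check for a constant, non-zero stride. This is a sketch under that assumption, not the exact A/B tracking described above; `constant_stride` and `min_iterations` are names invented for the example.

```python
def constant_stride(addresses, min_iterations=3):
    """Return the stride if consecutive dereferences are evenly spaced.

    addresses: addresses dereferenced by one instruction, in trace order.
    Returns None when there are too few samples or the spacing varies.
    """
    if len(addresses) < min_iterations:
        return None
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(deltas) != 1:
        return None
    stride = deltas.pop()
    return stride if stride != 0 else None


print(constant_stride([0x1000, 0x1004, 0x1008, 0x100C]))  # 4
print(constant_stride([0x1000, 0x1010, 0x1004]))          # None
```

The stride is also a reasonable guess for the element size of the array.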
Identifying Arrays
Random memory accesses
  If all the accesses are of the same size we have a hint that we are dealing with an array.
  But it could just as well be a structure. This is getting hairy.
So, where are we?
Where are we?
Detecting whether a pointer points to an array or a structure is essentially an educated guess.

  We need to further educate ourselves
  We need to have stronger assumptions that we can rely on.
Tracing stack memory accesses is tricky
  What about address reuse
  We need to tag every address with a TAG to differentiate two identical addresses accessed at different times
Tracing all memory accesses is painfully slow
  We are interested in large binaries
Are we screwed then?
  Not really
  We need to make our analysis a little bit more specific
    Hence less complete
    But more accurate
It is all about giving up a bit of generality for a bit more accuracy
Looking for better waves
Focus on Heap Objects
Why?
Heap objects are shared. We like data that is shared
  It leads to good things from a vulnerability research point of view
We have more information
  malloc-like functions give us the size of the chunks
It is easier to track heap memory
  Hook allocation routines and tag the returned memory with a unique id
  Also hook deallocation routines to keep track of valid memory chunks

Object Oriented Code
Objects are basically structures with methods
  Each object method needs to somehow reference its underlying object.
  Objects of a given class share a set of common characteristics
    Or at least those objects with shared state information
  Most of them come from the heap
So if we focus on objects, the problem is a bit less complicated
  We are dealing with structures of known size
  Now the whole address space is reduced to a fraction of its size
    Just analyze the heap
Keeping track of the life of a heap memory region is simple
  Hook the allocation routine
    The block is alive
  Hook the free routine
    The block is dead
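The bookkeeping can be sketched as a small tracker fed by the allocation and free hooks. Each allocation gets a fresh id, so an address that is freed and later handed out again is not confused with its previous owner. The class and method names are hypothetical, and the hooks themselves (the instrumentation side) are not shown.

```python
import bisect
import itertools


class HeapTracker:
    """Tracks live heap chunks and maps addresses to (chunk id, offset)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._bases = []  # sorted base addresses of live chunks
        self._live = {}   # base address -> (chunk id, size)

    def on_alloc(self, base, size):
        """Called from the allocation hook; the block is alive."""
        if base in self._live:  # a free was missed, drop the stale entry
            self.on_free(base)
        chunk_id = next(self._ids)
        bisect.insort(self._bases, base)
        self._live[base] = (chunk_id, size)
        return chunk_id

    def on_free(self, base):
        """Called from the free hook; the block is dead."""
        if base in self._live:
            del self._live[base]
            self._bases.remove(base)

    def lookup(self, address):
        """Return (chunk id, offset) if address is inside a live chunk."""
        i = bisect.bisect_right(self._bases, address) - 1
        if i < 0:
            return None
        base = self._bases[i]
        chunk_id, size = self._live[base]
        if address < base + size:
            return chunk_id, address - base
        return None


tracker = HeapTracker()
first = tracker.on_alloc(0x1000, 0x20)
print(tracker.lookup(0x1008))  # (1, 8)
tracker.on_free(0x1000)
second = tracker.on_alloc(0x1000, 0x10)
print(tracker.lookup(0x1008))  # (2, 8)
```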

How to detect objects?
Not every single heap chunk is an object
  Heuristics!
Take advantage of calling conventions
  Visual Studio: will set the 'ecx' register to the 'this' pointer
  GCC 32 bits: pushes the object as the first method argument
  GCC 64 bits: 'rdi' is set to the 'this' pointer
So we mark every tracked heap chunk that is in ecx, rdi or the first argument of a function as a possible object
  The object must be used inside the potential method
How to detect methods
There is no sound way
We have to trust our heuristics
  Which are better than most Anti-Virus heuristics :P
The dynamic nature of a trace makes us rely on code coverage.
  We are going to miss some methods
Sometimes the 'this' pointer remains spuriously in 'ecx'
  We are going to wrongly mark some functions as methods
So, how are we now?
We are doing better!
We can detect interesting objects
  We know their size
  We know where they are being used
What else do we need to do?
  Detect fields
  Detect relationships with other types
    Inheritance
    Composition
Detecting Object Fields

We already have all heap memory accesses in our trace
If the memory access is to one of our interesting objects we save the access offset and size
Since we only track interesting objects the analysis is much quicker
We can implement the algorithms used by Howard/Rewards
If we have information from type sinks, we can propagate it
Detecting methods
On each function call check if ECX points to a heap object.
If true

  Mark the chunk as interesting
  Save the access offset for future usage
  Mark the function as interesting
Does this function get called again with the same conditions?
That is, the same function gets called with a chunk of the same size as the 'this' parameter
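The "called again with the same conditions" check can be sketched as a counter keyed by function and chunk size. The input format is an assumption made for this example: each call is reduced to the function address plus the size of the live heap chunk that ECX pointed into at call time, or None when ECX did not point into a tracked chunk. `min_hits` is a hypothetical tuning knob.

```python
from collections import defaultdict


def find_candidate_methods(calls, min_hits=2):
    """Return the functions repeatedly called with a same-sized 'this' chunk.

    calls: iterable of (function_address, this_chunk_size or None).
    """
    hits = defaultdict(lambda: defaultdict(int))
    for function, chunk_size in calls:
        if chunk_size is not None:
            hits[function][chunk_size] += 1
    return {f for f, sizes in hits.items() if max(sizes.values()) >= min_hits}


calls = [
    (0x401000, 0x20),  # ECX pointed into a 0x20 byte chunk
    (0x401000, 0x20),  # same function, same chunk size: confirmed
    (0x402000, None),  # ECX was not a tracked heap pointer
    (0x403000, 0x10),  # seen only once, not confirmed yet
]
print(sorted(hex(f) for f in find_candidate_methods(calls)))  # ['0x401000']
```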
How far can we go?
How far can we go?
With all the collected traces we can obtain quite a lot of information

  Class Hierarchy
  Virtual Function Tables
  Types!
Bonus (not really related with type inference)
  Code coverage information
  Indirect branch resolution
How can we achieve this?
Virtual Function Tables
Useful to help IDA Pro to discover more functions
For each write to an interesting chunk
Is the value written referring to .text ?
Is [value] also in .text?
This is for sure a Virtual Function Table
If not, it is just a field update
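The two checks translate almost literally into code. This sketch assumes the bounds of .text are known and that some `read_pointer` callback can return the pointer-sized value stored at an address (or None if it is unreadable); both names are hypothetical, and the memory contents in the example are invented.

```python
def is_vtable_write(value, text_start, text_end, read_pointer):
    """Apply the two checks from the slide to one value written to a chunk."""
    def in_text(address):
        return address is not None and text_start <= address < text_end

    if not in_text(value):
        return False                    # just a field update
    return in_text(read_pointer(value))  # [value] must also be in .text


# Toy memory image: the table at 0x401000 starts with a code pointer.
memory = {0x401000: 0x402340}
print(is_vtable_write(0x401000, 0x400000, 0x410000, memory.get))  # True
print(is_vtable_write(0x0000000F, 0x400000, 0x410000, memory.get))  # False
```

Once a table is found, walking consecutive entries until one no longer points into code gives a list of functions to hand to the disassembler.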
Types
Type reconstruction algorithm is divided in three phases
First Analysis Pass (FAP)
Pun intended
Second Analysis Pass (SAP)
Third Analysis Pass (TAP)
First Analysis Pass
For each function
Get all its interesting chunks
  That is, chunks that were passed as the 'this' argument
Mark the whole chunk as a composite type
  Set the composite type size to the size of the chunk
If 'this' does not point to the first byte of the chunk, get the offset
Divide the composite chunk in two types at the calculated offset
Repeat the process with all the methods that used the chunk and subdivide the composite type
First Analysis Pass

chunk_address = A
ecx_address = A + 0
[Figure: the chunk is a composite type containing TypeA at offset 0]
In this case, TypeA fills the whole composite type
First Analysis Pass

chunk_address = A
ecx_address = A + C
[Figure: the chunk is a composite type containing TypeA at offset 0 and TypeB at offset C]
In this case there are two types. We recognize this because two methods were called with 'this' pointing into the same memory chunk but at different offsets.
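The subdivision step can be sketched as cutting the chunk at every distinct 'this' offset observed. The function name is hypothetical, and treating offset 0 as always starting a type is an assumption of this sketch.

```python
def split_composite(chunk_size, this_offsets):
    """Split a chunk into (offset, size) sub-types at each 'this' offset."""
    cuts = sorted({0, *this_offsets})  # a type always starts at offset 0
    cuts = [c for c in cuts if 0 <= c < chunk_size]
    ends = cuts[1:] + [chunk_size]
    return [(start, end - start) for start, end in zip(cuts, ends)]


print(split_composite(0x20, [0]))        # [(0, 32)]
print(split_composite(0x20, [0, 0x0C]))  # [(0, 12), (12, 20)]
```

The first call corresponds to the case where TypeA fills the whole composite type; the second to the case with TypeA at offset 0 and TypeB at offset C (here C = 0x0C, chosen arbitrarily).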
First Analysis Pass continued

Add the current function to a list of methods
For each write to the interesting chunk
  Add a field at the offset of the write
  Mark the field with the corresponding basic type according to the write size
For instance, a write of four bytes is marked as uint32_t
First Analysis Pass continued
Collect a set of constraints
For each chunk that was received as the 'this' argument, build a map from the method address to a list of all the types created.
This will later be used to build relationships between types and to merge identical types
[Figure: method_at_0xcafecafe maps to Type_A (Size = X_1), Type_B (Size = X_1) and Type_C (Size = X_2)]
Second Analysis Pass
Merge similar types
  Cheat by first using the type constraints collected on the FAP phase
How do we define similar?
  They have the same size
    Equal types with differing sizes will be addressed in the third pass
  They have equivalent fields
    That is, at offset O there is a type T of size S in both types
  They share a set of methods
    How many?
    Let N be the number of methods in Type1
    Let M be the number of methods in Type2
    Let S be the number of shared methods
    SimilarityIndex(N,M,S) = (S / (N+M)) * 100
    If SimilarityIndex > SimilarityThreshold then they are similar
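The similarity index is easy to compute from two sets of method addresses. The sketch below follows the formula as written on the slide; the threshold value is not given there, so it is left as a required parameter. Note that with this formula two types with identical method sets score 50, so any useful threshold has to sit below that.

```python
def similarity_index(methods1, methods2):
    """SimilarityIndex(N, M, S) = (S / (N + M)) * 100, as on the slide."""
    n, m = len(methods1), len(methods2)
    if n + m == 0:
        return 0.0
    s = len(set(methods1) & set(methods2))
    return 100 * s / (n + m)


def are_similar(methods1, methods2, threshold):
    """threshold is SimilarityThreshold; its value is up to the analyst."""
    return similarity_index(methods1, methods2) > threshold


type1 = {0x401000, 0x401100, 0x401200}
type2 = {0x401200, 0x401300}
print(similarity_index(type1, type2))  # 20.0
print(similarity_index(type1, type1))  # 50.0
```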
Third Analysis Pass
There are types that share methods and fields but they differ in size
  What is going on?
There are two possible scenarios
  Type2 inherits from Type1
    len(Type2) > len(Type1) most of the time
  The type has an internal buffer
    This is the case of, for example, strings in some browsers
Inheritance / Composition
A simple inheritance relationship is translated into a composition of structures
[Figure: ClassA holds Field1 to Field4; ClassB embeds ClassA (Field1 to Field4) followed by its own Field1 to Field3]
Two classes of different size use the same method
  The bigger one is likely the child class
  The smallest one is likely the parent class
This heuristic can fail
  Say that we have a dynamically allocated buffer inside a class
  Rare, weird, but it can and will happen
Failure will generate an extra type, but the relationships between the types will still be interesting and can be detected by a human once the information is imported into IDA Pro
Hard example :)
Example string class that will contain metadata and contents on the same chunk of memory
Other recurring complex examples are hash tables
[Figure: a StringClass chunk holding StringMetadata and a uint32_t len, followed by the string contents (AAAA...) and unknown trailing bytes (????)]
Increasing accuracy
Increasing accuracy
Accuracy of our approach is directly related with code coverage
  The more code coverage, the more accuracy
Increasing code coverage
The smart way
  We can tweak Klee (requires source code)
  We can code our version of SAGE
  ???
  Profit
The other way
  Fuzz the application like a 15 year old
  Gather a set of input files (if possible) and calculate the set of files that gets the maximum coverage
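Picking the set of files can be sketched as a greedy selection: repeatedly take the file that adds the most blocks not covered yet. This is a common approximation chosen for this example, not necessarily what the author used, and it assumes per-file coverage has already been collected as sets of basic block addresses. The function name and the file names are made up.

```python
def select_inputs(coverage):
    """Greedily pick input files until no file adds new coverage.

    coverage: dict mapping file name -> set of covered basic blocks.
    Returns (chosen file names in pick order, union of covered blocks).
    """
    remaining = dict(coverage)
    covered, chosen = set(), []
    while remaining:
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # nothing left adds new blocks
        chosen.append(best)
        covered |= gain
    return chosen, covered


coverage = {
    "a.bin": {1, 2, 3},
    "b.bin": {3, 4},
    "c.bin": {1, 2},
}
chosen, covered = select_inputs(coverage)
print(chosen)  # ['a.bin', 'b.bin']
```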
Static Analysis
How can we further validate our results?
Detecting calling convention
We have collected a fair amount of information, how can we propagate this information?
  Propagating the type information into basic blocks not executed on the trace
  Or we can be lazy and let the HexRays decompiler do it for us :)
Calling convention detection
A spurious function call can happen when a non-method function is called from inside a method
  The function call can receive the 'this' pointer of the previous method call
We avoid this case by ruling out all the function calls that do not behave as thiscall
Given a function get its CFG
Obtain a DAG (directed acyclic graph)
Do a topological sort
Assume ECX is a 'this' pointer
  Add it to a list of 'this' aliases
For each basic block
  If the instruction kills any of the 'this' aliases
    If the alias list is empty return not thiscall
  If the instruction aliases one of the 'this' pointers
    Add the new alias to the list
  If the instruction accesses memory using one of the aliases of 'this' then the function is likely 'thiscall'
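The alias tracking can be sketched over a deliberately simplified instruction model. `Insn` is a made-up stand-in for a real disassembler's output, the blocks are assumed to arrive already topologically sorted, and a single alias set is carried across all blocks (so the walk is not path-sensitive). The memory-use check is done first so that an instruction that reads through 'this' and overwrites it at once still counts as a use.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    dst: Optional[str] = None       # register overwritten by the instruction
    src: Optional[str] = None       # register copied into dst (plain mov)
    mem_base: Optional[str] = None  # base register of a memory operand


def looks_like_thiscall(blocks, this_register="ecx"):
    """Return True if 'this' (or an alias) is dereferenced before it dies."""
    aliases = {this_register}
    for block in blocks:  # blocks in topological order
        for insn in block:
            if insn.mem_base in aliases:
                return True             # memory access through 'this'
            if insn.src in aliases and insn.dst is not None:
                aliases.add(insn.dst)   # new alias, e.g. mov esi, ecx
            elif insn.dst in aliases:
                aliases.discard(insn.dst)  # alias killed
                if not aliases:
                    return False
    return False


# mov esi, ecx ; mov eax, [esi]
method = [[Insn(dst="esi", src="ecx")], [Insn(dst="eax", mem_base="esi")]]
# mov ecx, eax ; mov edx, [ecx]
plain = [[Insn(dst="ecx", src="eax")], [Insn(dst="edx", mem_base="ecx")]]

print(looks_like_thiscall(method))  # True
print(looks_like_thiscall(plain))   # False
```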

This can fail too
Generally it gives a correct answer in 90% of the analyzed functions
These results were validated by analyzing binaries with symbols available
In practice this information allows us to detect spurious functions detected as methods of a class
Example: calc.exe types
References
http://www.pintool.org/
http://www.dynamorio.org/
http://wiki.qemu.org/Main_Page
http://bochs.sourceforge.net/
http://www.few.vu.nl/~asia/publications
http://www.cs.purdue.edu/homes/xyzhang/reverse.html
http://pages.cs.wisc.edu/~reps/
Thanks to

Juliano Rizzo
Nicolas Waisman
Pablo Sole
Sean Heelan
Topo Muñiz
