Program Synthesis for Deobfuscation

I wrote several times about code obfuscation on my personal blog over the past 10 years, but this time I’d like to underline a different aspect of it, and a novel (at least for the best of my knowledge) approach to deal with deobfuscation. First of all let me remind that code obfuscation is not a malicious activity, it is a technique widley used from attackers as well as from software houses in order to protect their own code. The goal of obfusctating technologies is to protect software from whom want to disassembly and to alterate it. In our case we are considering the attacker as the guy who made the original software and not the one is trying to deobfuscate it. This is our posture since the software we are looking for is definitely Malware, and the guy who is trying to disassembly it is the good guy who want to understand how the Malware works in order to better put in place detection and blocking mechanisms.

Today I’d like to fix in my personal blog a great reading I had this weekend about code synthesis applied to deobfuscation techniques. A nice and fascinating novel approach for future tools which I believe could play an importat role in defending the digital space by helping out cybersecurity analysts in their daily tasks.

Obfuscating Throught Opaque Predicates

The following image shows how the opaque predicates work. Mainly this technique is used just before the software compilation or, in some casese, even during the compiling time (but this is getting a very specific case). In other words the developer is inserting meaningless blocks (functions, procedures or even operations) to the code. The code without such a meaningless blocks would work as well without any alteration. In such a way the analyst, who needs to disassembly all the functions and who doesn’t know whate are opaque predicates at priori, needs to check everything before figuring out what code is “junk” and what code is “real”.

Code Obfuscation throught Opaque Predicates

For esample if we consider the following blocks it would be obvious what is an opaque predicate:

call GetCurrentProcess
cmp eax, 0xffffffff
je code_a
... dead code ...
... dead code ...
... dead code ...
code_a:
... live code ...
... live code ...
... live code ...

Since GetCurrentProcess returns 0xffffffff address (are you wondering why ? here the answer ) it’s getting obvious the comparing predicate (cmp) would return always TRUE and the jump would be always on code_a function. In this scenario we might define what is “dead code” (just after the jmp) and what is “live code” (the code_a function) and we can define the predicate call GetCurrentProcess as the opaque value.The following code shows one more opaque predicate in which every random choice the process would take it will definitely land on the same code (whell, kind of obvius, right?)

call rand
cmp eax, 0x01010101
je code_a
... live code ...
... live code ...
... live code ...
code_a:
... live code ...
... live code ...
... live code ...

The live code is the same if the jump takes place or not.

In the real life, the analyst who sees such a code obfuscated technique (using Opaque predicates) could easily adopt input tracing techniques to solve it (also known as symbolic execution). In other words, by keeping trace of variables (for example by using angr or unicorn engine) you are able to understand if the variable get change or not, but this process is np complete, you need to run everything in order to get an idea of what is happening and later on decide what piece of code to avoid, plus you would meet artificial complexity introduced by attackers to spot and to remove manually from time to time. Attackers knows you are (ab)using this technique and will keep on to insert dead or infinitive loop and a tons of junking code which will eventually heat-up your analysis machine and made you loose a lot of time.

So the bright question made by Tim Blazytko, Moritz Contag, Cornelius Aschermann and Thorsten Holz (Reference Here) was: what if we could reason about semantics only instead of syntax ? This would definitely change flavors (YAY) ! Following some of the examples made by Tim Blazytko in his PhD Thesis (available HERE). Let’s assume to have the following function as a black box (we do not know priori inputs):

f(x, y, z) := (((x ⊕ y) + ((x ∧ y) · 2)) ∨ z) + (((x ⊕ y) + ((x ∧ y) · 2)) ∧ z)

We might try to aproximate such a function to a well-known function which behave in the same way with a subset of input. For example let’s assume to give the following inputs:

(1, 1, 1) 
(2, 3, 1) 
(0, 7, 2)

We’ll get the following outputs:

(1, 1, 1) -> 3
(2, 3, 1) -> 6
(0, 7, 2) -> 9

So we learned that such a complex function behaves (with these specific input values) like a simple: h(x, y, z) := x + y + z, how cool is that !

So, if we are able to synthetize a function, why we cannot synthetize an entire program ? This is the question that researchers from Ruhr-Universitat Bochum are answering by devoloping Syntia: “Program synthesis based deobfuscation framework for the USENIX 2017 paper “. In the USENIX paper the researchers use the well-known (whidly used in many game theory as well as in the self driving car) Monte Carlo Tree Search heuristc (MCTS) as a decision making algorithm in order to perform code generation to syntethize the original program. The results are very promising and definitely worthy of investing more research and professional time in order to getting out a new deobfuscation solution.

The following image shows the Syntia result of a deobfuscated function in a JSON format, while the last image on that post is graphically representing what is going on reax variable.

{
 "output": {
     "name": "EAX", 
     "number": 0, 
     "size": 32
 }, 
 "top_non_terminal": {
     "expression": {
         "infix": "((u32 * u32) + (u32 * 1))"
     }, 
     "reward": 1.0
 }, 
 "top_terminal": {
     "expression": {
         "infix": "((mem_0x2 * mem_0x0) + (mem_0x4 * 1))"
     }, 
     "reward": 1.0
 }, 
 "successful": "yes", 
 "result": {
     "final_expression": {
         "infix": "((mem_0x2 * mem_0x0) + (mem_0x4 * 1))", 
         "simplified": "((mem_0x2 * mem_0x0) + (mem_0x4 * 1))"
     }
 }
}

Taken from Syntia: Synthesizing the Semantics of Obfuscated Code in Usenix proceedings

Conclusions

Today I have brefly described the high vevel concepts of a nice research paper I’ve read during the past weekend. Using code synthesis techniques to deobfuscate malicious behaviors is a brillant way to go and definitly worthy of investing more time. Giving to it a try in a professional world would be a great idea on my personal point of view.