Reverse-engineering of protobuf-based applications

During a recent application assessment, we noticed that the target was using a very particular serialization system. We later discovered that it had been designed and developed by Google and is commonly known as ‘Protobuf’. Because we did not know how the data was structured, we could not analyze what the application sent and received, and therefore could not decode or alter the stream. We needed a way to recover the message formats, and nobody seemed to have published a tool or an analysis for this.

June 6, 2012

Protobuf sample app

Before going deeper, we had a look at protobuf and wrote a sample application to see how things work. From this, we expected to understand the serialization algorithm and then develop a proxy (since the serialized data was sent and received over a network connection). Protobuf revolves around one or more .proto files, each describing a set of message formats. protoc, the compiler provided with the library, turns these files into source code for several languages (C++, Java, Python) that can be used as-is in a project; the generated code implements the serialization and deserialization routines. First, we installed the required packages on our Debian system:

# apt-get install libprotobuf-dev python-protobuf

Our sample application needs to serialize some data in a specific format. To achieve this with protobuf, the first step is to create a .proto file (reverseme.proto) that declares our message formats. The whole language specification is documented on the official protobuf website [1].

package com.example.reverseme;

message SomeData {
        message OtherData {
                required int32 test = 1 [default = 1];
        }

        required OtherData data = 1;
        optional string comment = 2;
}

This file is then used by protoc to create two other files:

$ protoc --cpp_out=project reverseme.proto
$ ls -Al project/
-rw-rw-r-- 1 virtualabs virtualabs 19089 5 juin 00:06 reverseme.pb.cc
-rw-rw-r-- 1 virtualabs virtualabs 10231 5 juin 00:06 reverseme.pb.h

Once done, the developer only needs to include reverseme.pb.h and use it to serialize and deserialize data as follows:

#include <iostream> 
#include <fstream> 
#include <string> 
#include "reverseme.pb.h" 

using namespace std; 

int main(int argc, char **argv) { 
        com::example::reverseme::SomeData some_data; 
        com::example::reverseme::SomeData some_data_bis; 
        string serialized; 

        GOOGLE_PROTOBUF_VERIFY_VERSION; 

        /* Building message */ 
        cout << "Building message:" << endl; 
        com::example::reverseme::SomeData_OtherData* other_data = some_data.mutable_data(); 
        other_data->set_test(0x1337); 
        cout << "SomeData.OtherData.test = " << hex << other_data->test() << endl; 
        some_data.set_comment("Yay !"); 
        cout << "SomeData.comment = " << some_data.comment() << endl; 

        /* Serializing */ 
        cout << "Serializing data ..." << endl; 
        some_data.SerializeToString(&serialized); 

        /* Deserializing into the second object, then printing that object */
        some_data_bis.ParseFromString(serialized);
        cout << "Deserialized message:" << endl;
        cout << "SomeData.OtherData.test = " << hex << some_data_bis.data().test() << endl;
        cout << "SomeData.comment = " << some_data_bis.comment() << endl;

        /* Cleaning */ 
        google::protobuf::ShutdownProtobufLibrary(); 

        return 0; 
}

It is very convenient and easy. So where is the problem? The original .proto file is not used directly by the application; it only serves to generate code (as in our sample application), so it is not trivial to deserialize a piece of data without knowing its format. No format, no proxy.
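To make the problem concrete, here is a minimal sketch (Python 3, standard library only) of what can be read from a serialized message without its .proto file. The byte string is our own hand-computed encoding of the sample message above, and the helper names `read_varint` and `raw_fields` are ours, not part of protobuf. It only follows the publicly documented wire format.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def raw_fields(buf):
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:           # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:         # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:         # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:         # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        yield number, wire_type, value


# Hand-computed encoding of SomeData { data { test: 0x1337 } comment: "Yay !" }
message = bytes.fromhex("0a0308b726") + b"\x12\x05Yay !"
fields = list(raw_fields(message))
for number, wire_type, value in fields:
    print(number, wire_type, value)
```

All we get is field numbers, wire types and raw values. Nothing says that field 1 is a nested message rather than a string or a byte blob, nor what any field is called, which is exactly the information stored in the .proto file.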

We need to go deeper …

When having a look at each generated file, we noticed a strange piece of code in reverseme.pb.cc:

::google::protobuf::DescriptorPool::InternalAddGeneratedFile(
    "\n\017reverseme.proto\022\025com.example.reverseme"
    "\"r\n\010SomeData\0227\n\004data\030\001 \002(\0132).com.example"
    ".reverseme.SomeData.OtherData\022\017\n\007comment"
    "\030\002 \001(\t\032\034\n\tOtherData\022\017\n\004test\030\001 \002(\005:\0011", 156);
}

This piece of code registers the generated .proto file, but stores its content in serialized form. It exists because of protobuf’s introspection feature, which allows an application to build a message dynamically without explicit knowledge of its format.

Fortunately, Google published the format used for this serialization along with the library source code (see [2]). The message representing the compiled .proto file is known as FileDescriptorProto, and the one representing each field as FieldDescriptorProto. Both are themselves described by a .proto file shipped with protobuf. Not obvious, but that could work.
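As an illustration, the embedded string of our sample application can be decoded by hand. The sketch below (Python 3, standard library only) is not our tool: it is a minimal reader that assumes the field numbers published in protobuf's descriptor.proto (file: name = 1, package = 2, message_type = 4; message: name = 1, field = 2, nested_type = 3; field: name = 1, number = 3, label = 4, type = 5, type_name = 6, default_value = 7). All function names are ours.

```python
LABELS = {1: "optional", 2: "required", 3: "repeated"}

def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def fields(buf):
    """Yield (field_number, value); handles only varint and length-delimited
    fields, which is all this particular descriptor contains."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        if key & 7 == 0:
            value, pos = read_varint(buf, pos)
        elif key & 7 == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("unexpected wire type %d" % (key & 7))
        yield key >> 3, value

def parse_field(buf):
    names = {1: "name", 3: "number", 4: "label", 5: "type",
             6: "type_name", 7: "default"}
    out = {}
    for number, value in fields(buf):
        if number in names:
            out[names[number]] = value.decode() if isinstance(value, bytes) else value
    out["label"] = LABELS.get(out.get("label"), out.get("label"))
    return out

def parse_message(buf):
    msg = {"name": None, "fields": [], "nested": []}
    for number, value in fields(buf):
        if number == 1:
            msg["name"] = value.decode()
        elif number == 2:
            msg["fields"].append(parse_field(value))
        elif number == 3:
            msg["nested"].append(parse_message(value))
    return msg

def parse_file(buf):
    out = {"name": None, "package": None, "messages": []}
    for number, value in fields(buf):
        if number == 1:
            out["name"] = value.decode()
        elif number == 2:
            out["package"] = value.decode()
        elif number == 4:
            out["messages"].append(parse_message(value))
    return out

# The serialized descriptor of reverseme.proto (156 bytes).
blob = (
    b"\n\017reverseme.proto\022\025com.example.reverseme"
    b"\"r\n\010SomeData\0227\n\004data\030\001 \002(\0132).com.example"
    b".reverseme.SomeData.OtherData\022\017\n\007comment"
    b"\030\002 \001(\t\032\034\n\tOtherData\022\017\n\004test\030\001 \002(\005:\0011"
)
descriptor = parse_file(blob)
print(descriptor["name"], descriptor["package"])
```

With python-protobuf installed, `descriptor_pb2.FileDescriptorProto().ParseFromString(blob)` should expose the same information through generated accessors. The hand-written reader is only here to show that nothing in the blob is hidden.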

This metadata was the key to our success: if we could deserialize it and recover the original .proto file, we would be able to serialize and deserialize the application's data.

The hard work had just started. Our first idea was to create a disassembler able to regenerate every .proto file from its serialized version, for ELF applications. We decided to develop this tool in Python with its protobuf bindings (python-protobuf, previously installed) for efficiency.

Recovering .proto files (like a boss)

First of all, we developed a quick class to extract every serialized .proto file from a given binary application (ELF file format). Then, we developed a set of classes handling the whole process, from indexing namespaces and messages to generating the .proto files.
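One possible way to do the extraction step is sketched below (Python 3, standard library only). This is a heuristic of our own for illustration and not necessarily how protod.py proceeds: it looks for a file name ending in ".proto" that is preceded by the tag byte 0x0A and a matching one-byte length, then walks the top-level fields until the data stops looking like a FileDescriptorProto. The accepted field range (1 to 12) and the function names are assumptions.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]              # IndexError if the data ends mid-varint
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def descriptor_end(buf, start):
    """Walk top-level fields from start and return the offset where the data
    no longer looks like a FileDescriptorProto."""
    pos = start
    while pos < len(buf):
        try:
            key, nxt = read_varint(buf, pos)
            number, wire_type = key >> 3, key & 7
            if not 1 <= number <= 12:
                break
            if wire_type == 2:       # length-delimited: skip the payload
                length, nxt = read_varint(buf, nxt)
                nxt += length
            elif wire_type == 0:     # varint: skip the value
                _, nxt = read_varint(buf, nxt)
            else:
                break
        except IndexError:
            break
        if nxt > len(buf):
            break
        pos = nxt
    return pos


def find_descriptors(data):
    """Return every candidate serialized .proto file found in data."""
    found, pos = [], 0
    while True:
        hit = data.find(b".proto", pos)
        if hit < 0:
            return found
        pos = hit + 1
        name_end = hit + len(b".proto")
        # Look backwards for tag 0x0A followed by a length equal to the name's.
        for start in range(hit - 3, max(hit - 124, -1), -1):
            if data[start] == 0x0A and data[start + 1] == name_end - start - 2:
                end = descriptor_end(data, start)
                found.append(data[start:end])
                pos = max(pos, end)
                break


# Stand-in for a real binary: junk, the sample descriptor, a NUL, more junk.
blob = (
    b"\n\017reverseme.proto\022\025com.example.reverseme"
    b"\"r\n\010SomeData\0227\n\004data\030\001 \002(\0132).com.example"
    b".reverseme.SomeData.OtherData\022\017\n\007comment"
    b"\030\002 \001(\t\032\034\n\tOtherData\022\017\n\004test\030\001 \002(\005:\0011"
)
fake_binary = b"\x7fELF" + b"\x00" * 16 + blob + b"\x00junk\x00"
candidates = find_descriptors(fake_binary)
print(len(candidates), len(candidates[0]))
```

The walk stops cleanly here because the descriptor is followed by a NUL byte, as a C string literal would be. On a real binary the candidates should still be validated, for instance by parsing them as FileDescriptorProto messages.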

We tried it on our previous sample application, and this is what we got from our brand new tool:

$ python protod.py reverseme
[i] Extracting from reverseme ...
[+] Indexing com.example.reverseme
[+] Processing reverseme.proto
[i] Done

The tool parsed our binary application file (reverseme), grabbed the serialized reverseme.proto file and recovered it:

package com.example.reverseme; 

message SomeData { 
 message OtherData { 
  required int32 test = 1 [default = 1]; 
 } 

 required OtherData data = 1; 
 optional string comment = 2; 
}

Its content is exactly the same as the original file: we did it. With this .proto file in hand, it was very easy to implement a tool able to deserialize, alter and re-serialize data, as a relaying proxy usually does.
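The alteration step of such a proxy can be sketched as follows (Python 3, standard library only). This is an illustration under assumptions rather than our actual proxy: it relies on the recovered .proto telling us that field 2 of SomeData is the string comment, reuses a hand-computed encoding of the sample message, and handles only varint and length-delimited fields. A real tool would rather use the classes protoc generates from the recovered file.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def raw_fields(buf):
    """Yield (field_number, wire_type, value) for varint and
    length-delimited fields."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        yield number, wire_type, value


def encode_field(number, wire_type, value):
    """Re-encode one field; the length prefix is recomputed from the value."""
    key = encode_varint((number << 3) | wire_type)
    if wire_type == 0:
        return key + encode_varint(value)
    return key + encode_varint(len(value)) + value


def rewrite_comment(message, new_comment):
    """Replace SomeData.comment (field 2, string) and re-serialize."""
    out = b""
    for number, wire_type, value in raw_fields(message):
        if number == 2 and wire_type == 2:
            value = new_comment.encode("utf-8")
        out += encode_field(number, wire_type, value)
    return out


# Hand-computed encoding of SomeData { data { test: 0x1337 } comment: "Yay !" }
original = bytes.fromhex("0a0308b726") + b"\x12\x05Yay !"
patched = rewrite_comment(original, "Pwned")
print(patched.hex())
```

Recomputing the length prefix is the important part: a replacement of a different size would otherwise corrupt everything that follows in the stream.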

Conclusion

At the very beginning of our research, we found a lot of messages on various forums and websites asking how to deserialize a stream created with protobuf, but no tool to do the job. So we wrote a tool able to extract the metadata from an executable and recover the .proto files from it. In fact, having the executable file is enough to recover every protobuf message format you need. Metadata is always valuable, especially when the system itself relies on it.

The source code of our tool and of the sample application is attached. It does not recover every piece of information at the moment, but feel free to improve it (and send us your improvements).

Hope this helps!

Source code of protod.py