
Malicious PDF trick: XFA

Another trick that is becoming more and more common in malicious PDF files consists of storing the actual malicious content (for example, JavaScript code that exploits some vulnerability) in XFA forms. If you remember the getPageNthWord and info tricks that have been documented earlier, you will recognize the technique being used here.

So, what is an XFA form? XFA stands for XML Forms Architecture. It is a specification used to create form templates (forms that can be filled in by a user) and to process them (for example, to validate their contents). Support for XFA forms in PDF files was introduced by Adobe with PDF 1.5. If you want to know all the gory details, you can refer to the original XFA proposal or to Adobe's XFA specification, which, at 1123 pages, may be a hard read.

Let's see how it is abused in practice (the MD5 of the sample I'm analyzing is 1f26dcd4520a6965a42cefa4c7641334). The PDF first defines an XFA template, which is used to describe the appearance and interactive characteristics of the form.

obj 10 0
    /Type /EmbeddedFile    
    /Length 618    
    /Filter /FlateDecode 
<template xmlns="">
    <subform layout="tb" locale="en_US" name="artsLei">
            <pageArea id="leiArts" name="leiArts">
                <contentArea h="756pt" w="576pt" x="0.25in" y="0.25in"/>
                <medium long="792pt" short="612pt" stock="default"/>
        <subform h="756pt" w="576pt" name="docTaut">
            <field h="65mm" name="docArts" w="85mm" x="53.6501mm" y="88.6499mm">
                <event activity="initialize" name="tautDoc">
                    <script contentType="application/x-javascript">
                    var nil = (function(){return this;}).call(null);
                    eval_ref(decode(docArts[\'ra\'+ue+\'wVa\'+ue+\' lue\'].substring(50),eval_ref));

A couple of interesting parts: the template defines a field named docArts. Note that a reference to this field will be available through an object named docArts in the global scope of JavaScript (i.e., this.docArts is a Field object that represents this field). The field also has an event handler that runs when the field is initialized. The handler is written in JavaScript and has the familiar look of obfuscated code.

Let's see what this code does:

var nil = (function(){return this;}).call(null);
var eval_ref = nil['eval'];
function decode(str, ev){
    var ret = '';
    var cvc = [];
    var fcc = String.fromCharCode;
    var k = docArts['rawValue'].substring(0, 50);
    // ... decoding loop elided ...
    return ret;
}
eval_ref(decode(docArts['rawValue'].substring(50), eval_ref));

The interesting bits here are the references to the docArts object. Notice that its rawValue property is retrieved. So, where is the value of the field stored? In an XFA dataset:

obj 12 0
    /Filter /FlateDecode    
    /Length 3388    
    /Type /EmbeddedFile 
<xfa:datasets xmlns:xfa="">

Therefore, the obfuscated JavaScript extracts the data stored for the docArts field (precisely, all the content after the initial 50 characters) and passes it to the decoding routine. The decoding routine also uses the docArts data (the first 50 characters) to recover the malicious code in the clear, which is then ready to be evaluated. The execution finally results in the exploitation of the CVE-2010-0188 vulnerability (libTiff overflow).
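The key/payload split described above can be reproduced outside a PDF reader. The following is only a sketch: the body of the sample's decode function is not shown in this post, so a repeating-key XOR is assumed purely for illustration, and splitRawValue, xorWithKey, and the test data are hypothetical names and values of mine.

```javascript
// From the post: the first 50 characters of rawValue act as the key,
// everything after them is the encoded payload.
const KEY_LENGTH = 50;

function splitRawValue(rawValue) {
  return {
    key: rawValue.substring(0, KEY_LENGTH),
    payload: rawValue.substring(KEY_LENGTH),
  };
}

// ASSUMPTION: repeating-key XOR. The real sample's decoding loop is not shown,
// so this only stands in for "decode the payload using the key".
function xorWithKey(text, key) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    out += String.fromCharCode(text.charCodeAt(i) ^ key.charCodeAt(i % key.length));
  }
  return out;
}

// Hypothetical field content: a 50-character key followed by the encoded script.
const demoKey = Array.from({ length: KEY_LENGTH }, (_, i) =>
  String.fromCharCode(65 + (i % 26))).join('');
const demoPlain = 'app.alert("decoded");';
const demoRawValue = demoKey + xorWithKey(demoPlain, demoKey);

const parts = splitRawValue(demoRawValue);
const recovered = xorWithKey(parts.payload, parts.key);
```

For analysis purposes, the useful point is that both halves live in the dataset object, not in the template, so the script alone is not enough to recover the payload.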

Malicious PDF trick: zoomType

Here is another small trick that malicious PDFs use. The PDF contains JavaScript code similar to the following:

var part1="pe";
var part2="Ty";
var part3="o";
var part4="get";
var part5="xOf";
var fun1= event["tar"+part4]["z"+part3+part3+"m"+part2+part1];
fun1 = varka_tipo[1]+"nde"+part5;
var fun2 = "fromCharCode";
    "abcdefghijklmnopqrstuvwxyz" +

function decode(input) {
    enc1 = keyStr[fun1](input.charAt(i++));

var code = decode("Q2!#$%^&5a...#$%^&o=!#$%^&");

This script sets up some variables that are used in a decoding routine. As usual, the routine decodes a long string and the result is then interpreted via eval().

The interesting part is how fun1 is computed. Undoing the simple obfuscation shows that it is initialized to event.target.zoomType[1] + "ndexOf". Now, event.target is a reference to the Doc object. The Doc object's property zoomType contains the current zoom type of the document. The documentation lists 7 possible values: NoVary, FitPage, FitWidth, FitHeight, FitVisibleWidth, Preferred, and ReflowWidth.

Adobe Reader seems to return FitWidth by default. The next step in the script extracts the second character from the zoom type string (the letter i) and concatenates it with other strings to obtain indexOf.

A long way to get an i...
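To see the string arithmetic at work, here is a small sketch that runs outside Adobe Reader. The event object is a mock of mine, the zoom type is assumed to be FitWidth (the default mentioned above), and keyStr is a hypothetical short stand-in for the sample's key string.

```javascript
// Mock of the Acrobat event object; in the reader, event.target is the Doc object.
// ASSUMPTION: zoomType is "FitWidth".
const event = { target: { zoomType: 'FitWidth' } };

const part1 = 'pe', part2 = 'Ty', part3 = 'o', part4 = 'get', part5 = 'xOf';

// "tar" + "get" -> "target"; "z" + "o" + "o" + "m" + "Ty" + "pe" -> "zoomType"
const zoom = event['tar' + part4]['z' + part3 + part3 + 'm' + part2 + part1];

// The second character of "FitWidth" is "i"; "i" + "nde" + "xOf" -> "indexOf"
const fun1 = zoom[1] + 'nde' + part5;

// Hypothetical key string, only to show that the computed name is callable.
const keyStr = 'abcdefghijklmnopqrstuvwxyz';
const position = keyStr[fun1]('c');
```

An analysis environment that does not emulate zoomType would fail at zoom[1], which is presumably the point of the trick.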

Malicious PDF trick: multiple filters

Another simple trick that is often used by malicious PDF files consists of embedding the malicious JavaScript code in a PDF stream hidden below several stream filters.

Here is an example:

4 0 obj
    /Length 2839
    /Filter [ /ASCIIHexDecode
        /LZWDecode
        /ASCII85Decode
        /RunLengthDecode
        /FlateDecode ]

The stream's contents are decoded by applying the 5 specified filters in order (ASCIIHexDecode, LZWDecode, ASCII85Decode, RunLengthDecode, and FlateDecode).
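Applying a filter chain is just a left-to-right fold over the filter list. The sketch below assumes Node.js (for Buffer and zlib) and covers only three of the five filters (ASCIIHexDecode, RunLengthDecode, FlateDecode); LZWDecode and ASCII85Decode are left out for brevity. The function names are mine.

```javascript
const zlib = require('zlib');

// ASCIIHexDecode: pairs of hex digits, whitespace ignored, '>' marks end of data.
function asciiHexDecode(buf) {
  let hex = buf.toString('latin1').replace(/\s+/g, '');
  const end = hex.indexOf('>');
  if (end !== -1) hex = hex.slice(0, end);
  if (hex.length % 2 === 1) hex += '0'; // an odd trailing digit is padded with 0
  return Buffer.from(hex, 'hex');
}

// RunLengthDecode: a length byte n in 0..127 copies the next n+1 bytes literally,
// n in 129..255 repeats the next byte 257-n times, and 128 marks end of data.
function runLengthDecode(buf) {
  const out = [];
  let i = 0;
  while (i < buf.length) {
    const n = buf[i++];
    if (n === 128) break;
    if (n < 128) {
      for (let j = 0; j <= n && i < buf.length; j++) out.push(buf[i++]);
    } else {
      const b = buf[i++];
      for (let j = 0; j < 257 - n; j++) out.push(b);
    }
  }
  return Buffer.from(out);
}

// FlateDecode: zlib/deflate data.
function flateDecode(buf) {
  return zlib.inflateSync(buf);
}

const FILTERS = {
  ASCIIHexDecode: asciiHexDecode,
  RunLengthDecode: runLengthDecode,
  FlateDecode: flateDecode,
};

// Apply the filters in the order in which they are listed in /Filter.
function decodeStream(data, filterNames) {
  return filterNames.reduce((buf, name) => {
    const filter = FILTERS[name];
    if (!filter) throw new Error('unsupported filter: ' + name);
    return filter(buf);
  }, data);
}
```

A call such as decodeStream(rawBytes, ['ASCIIHexDecode', 'RunLengthDecode', 'FlateDecode']) would then return the decoded bytes, where rawBytes is a Buffer holding the raw stream contents.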

See this Wepawet report to find out what happens after the decoding is done. These malicious PDFs seem to also have decent detection on VirusTotal (6/41, at the time of writing).

Malicious PDF trick: getPageNthWord

PDF exploits are becoming more and more sophisticated. In particular, they often rely on creative techniques to avoid detection and slow down analysis. For a couple of examples, see Julia Wolf's and Daniel Wesemann's nice analyses of malicious documents that use the getAnnots and info tricks, where the actual malicious content is stored as annotations or as part of the document metadata (e.g., the author name).

Here is another trick that showed up recently. I'll call it the getPageNthWord trick, from the key API function it uses.

The PDF contains a JavaScript section with the following code (simplified a little):

var s = '';

new Function(decode(2, 35))();

function decode(page, xor){
    var l = this.getPageNumWords(2);
    for(var i = 0; i < l; i++){
        word = this.getPageNthWord(page, i);
        var c = word.substr(word.length- 2, 2);
        var p = unescape("%"+ c).charCodeAt(0);
        s += String.fromCharCode(p ^ xor);
    }
    return s;
}

This code creates an anonymous function, sets its body to the return value of the decode function, and then executes it.

The interesting part is in the decode function. This function gets the number of words contained in the third page of the document via the getPageNumWords function (recall that pages are 0-based in the PDF API). It then loops through all the words in that page (via the getPageNthWord function) and manipulates them. Let's see what the third page looks like:

11 0 obj 
/Length 23892
2 J
0.57 w
BT /F2 1.00 Tf ET
0.196 G
BT 31.19 806.15 Td ( kh29 kh2a kh55 
kh4e kh46 kh0a kh03 kh58 kh2e kh29) Tj ET

The page is stored as a stream. Its contents comprise a number of directives and the actual textual content. For example, BT indicates the beginning of the text and, conversely, ET marks the end of the text; 31.19 806.15 Td specifies the position of the text on the page; and Tj is the display text operator. The actual textual content is the string starting with kh29.

We can now go back to our decode routine. It is clear that it extracts the last 2 characters from each word (e.g., “29” from “kh29”), interprets them as hex numbers (e.g., 0x29), xors them with 35 (e.g., 0x29 ^ 35 = 10), and finally obtains the corresponding character (e.g., “\n”).
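The same decoding can be replayed offline on the words visible in the stream above. This is a sketch: the word list stands in for what getPageNthWord would return, decodeWords is a name of mine, and parseInt replaces the sample's unescape("%" + c).charCodeAt(0), which yields the same number for a two-digit hex string.

```javascript
// The words visible in the page stream above; the page contains many more.
const words = 'kh29 kh2a kh55 kh4e kh46 kh0a kh03 kh58 kh2e kh29'.split(' ');

function decodeWords(wordList, xorKey) {
  let s = '';
  for (const word of wordList) {
    const hex = word.slice(-2);       // last two characters, e.g. "29"
    const code = parseInt(hex, 16);   // interpreted as a hex number, e.g. 0x29
    s += String.fromCharCode(code ^ xorKey);
  }
  return s;
}

const decoded = decodeWords(words, 35);
```

Since the two visible lines of words are only a fragment of the page, the result is just a fragment of the decoded script.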

The result of this deobfuscation is the actual exploit code, which targets 4 different vulnerabilities. However, the exploit code has one last trick, which it uses to hide the URL from where the malware is to be downloaded:

var src_table = "abcd...&=%";
var dest_table= "eAFS...=iZR-";
function get_url(){
    var str =;
    var ret = encode_str(str, dest_table, src_table);
    return ret;
}

Notice that str is read from a document property. The get_url function essentially performs a simple substitution decryption of the author metadata. Let's see what is contained there:

17 0 obj 

Ugly, indeed. After decoding, one finally gets the malware URL.
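A substitution of this kind boils down to a table lookup per character. Here is a minimal sketch; the tables and the URL are hypothetical (the real ones are elided above), and substitute is my own stand-in for the sample's encode_str, whose body is not shown.

```javascript
// HYPOTHETICAL tables: the sample's src_table and dest_table are elided in the post.
const srcTable = 'abcdefghijklmnopqrstuvwxyz:/.';
const destTable = 'zyxwvutsrqponmlkjihgfedcba./:';

// Replace every character found in `from` with the character at the same
// index in `to`; characters that are not in `from` are copied unchanged.
function substitute(str, from, to) {
  let out = '';
  for (const ch of str) {
    const idx = from.indexOf(ch);
    out += idx === -1 ? ch : to[idx];
  }
  return out;
}

// Hypothetical "author" metadata: a URL run through the table in one direction...
const plainUrl = 'http://example.invalid/x';
const authorField = substitute(plainUrl, srcTable, destTable);

// ...and recovered by applying the table in the other direction, which is
// the argument order used in get_url: (str, dest_table, src_table).
const recoveredUrl = substitute(authorField, destTable, srcTable);
```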

Wepawet now handles this type of malicious PDF file. See this report for an example.

CVE-2009-3459, CVE-2009-4324, and one PDF trick

PDF exploits (mostly targeting the Adobe Reader and Acrobat programs) are very commonly used on drive-by web sites. This situation is probably the result of the widespread use of the Adobe plugin, the rather large number of vulnerabilities found in it, and the availability of reliable exploitation techniques.

Two recent vulnerabilities for which I have added detection in Wepawet are CVE-2009-3459 and CVE-2009-4324 (click on the links to see analysis reports of two malicious samples). The former is an integer overflow in the PDF parser, the latter is a bug in the JavaScript interpreter.

The analysis of malicious PDF files is often complicated by the use of various obfuscation (or better, “confusion”) techniques. In particular, malicious PDF files are often malformed: expected sections are missing entirely, others are truncated. The attacks are still successful because Adobe Reader does a good job of automatically repairing the damaged file. Of course, analysis tools are not necessarily as good at that.

I recently found an interesting, small trick that was used in the wild. A little background first. A stream is a basic object (technically, a dictionary) used in PDF files to contain arbitrary content. In particular, malicious PDFs use streams to contain the JavaScript code used to launch an exploit. The Length entry in the stream dictionary is used to specify, you guessed it, the length of the encoded content. According to the PDF specification (Section for the curious), the length is to be specified as an integer. The sample I found, however, used an expression (a sum) to declare the stream length.

<</ / / / /Filter/ASCIIHexDecode/Length 100000+12488>>
... stream contents ...

Lessons learned: do not trust specs and be a little lenient in the parsing of PDF files...

Update 1/7/2010: Richard B. pointed out that Acrobat seems to detect that the length specification is malformed, discards it, and falls back to a simple parsing strategy to extract the stream contents. Thanks!
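A tolerant parser can mimic the fallback described in the update: trust Length only when it is a plain integer, and otherwise scan for the endstream keyword. The sketch below is a simplification under several assumptions of mine: the object is handled as a string and not as raw bytes, the dictionary contains no nested dictionaries, indirect Length references are not resolved, and streamBody is a hypothetical name.

```javascript
// Return the raw body of a stream object, tolerating a malformed /Length.
function streamBody(objText) {
  const dictEnd = objText.indexOf('>>');
  if (dictEnd === -1) return null;
  const dict = objText.slice(0, dictEnd + 2);

  const kw = objText.indexOf('stream', dictEnd);
  if (kw === -1) return null;
  let begin = kw + 'stream'.length;
  if (objText[begin] === '\r') begin++; // skip the EOL after the keyword
  if (objText[begin] === '\n') begin++;

  // Accept Length only if it is a plain integer directly followed by the
  // next key or by the end of the dictionary.
  const m = /\/Length\s+(\d+)\s*(?=\/|>>)/.exec(dict);
  if (m) {
    const len = parseInt(m[1], 10);
    if (begin + len <= objText.length) return objText.slice(begin, begin + len);
  }

  // Malformed or missing Length: fall back to scanning for "endstream".
  const end = objText.indexOf('endstream', begin);
  if (end === -1) return null;
  return objText.slice(begin, end).replace(/\r?\n$/, '');
}

// The malformed declaration from the sample, with a hypothetical short body.
const malformed =
  '<</Filter/ASCIIHexDecode/Length 100000+12488>>\nstream\n414243>\nendstream';
const body = streamBody(malformed);
```

Scanning for endstream can cut a binary stream short if the keyword happens to occur inside the data, so a real tool would probably combine both strategies.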