By Evan Sultanik, Principal Security Engineer

A couple of years ago we released PolyFile: a utility to identify and map the semantic structure of files, including polyglots, chimeras, and schizophrenic files. It’s a bit like file, binwalk, and Kaitai Struct all rolled into one. PolyFile initially used the TRiD definition database for file identification. However, this database was both too slow and prone to misclassification, so we decided to switch to libmagic, the ubiquitous library behind the file command.

What follows is a compendium of the oddities that we uncovered while developing our pure Python cleanroom implementation of libmagic.

Magical Mysteries

The libmagic library is older than over half of the human population of Earth, yet it is still in active development and is in the 99.9th percentile of most frequently installed Ubuntu packages. The library’s ongoing development is not strictly limited to bug fixes and support for matching new file formats; the library frequently receives breaking changes that add new core features to its matching engine.

libmagic has a custom domain-specific language (DSL) for specifying file format patterns. Run `man 5 magic` to read its documentation. The library compiles its DSL database of file format patterns into a single definition file that is typically installed to /usr/share/file/magic.mgc. libmagic is written in C and includes several manually written parsers to identify file types that would otherwise be difficult to represent in its DSL (for example, JSON and CSV). Unsurprisingly, these parsers have led to a number of memory safety bugs and numerous CVEs.

PolyFile is written in Python. While libmagic does have both official and independent Python wrappers, we chose to create a cleanroom implementation. Aside from the native library’s security issues, there are several additional reasons why we decided to create something new:

  1. PolyFile is already written in pure Python, and we did not want to introduce a native dependency if we could avoid it.
  2. PolyFile is intended to detect polyglots and other funky file formats that libmagic would otherwise miss, so we would have had to extend libmagic anyway.
  3. PolyFile preserves lexical information like input byte offsets throughout its parsing, in order to map semantics back to the original file locations. There was no straightforward way to do this with libmagic.

The idea of reimplementing libmagic in a language with more memory safety than C is not novel. An effort to do so in Ruby, called Arcana, occurred concurrently with PolyFile’s implementation, but it is still incomplete. PolyFile, on the other hand, correctly parses libmagic’s entire pattern database, passes all but two of libmagic’s unit tests, and correctly identifies at least as many MIME types as libmagic on Ange Albertini’s 900+ file Corkami corpus.

The Magical DSL

In order to appreciate the eldritch horrors we unearthed when reimplementing libmagic, we need to offer a brief overview of its esoteric DSL. Each DSL file contains a series of tests—one per line—that match the file’s subregions. These tests can be as simple as matching against magic byte sequences, or as complex as seemingly Turing-complete expressions. (Proving Turing-completeness is left as an exercise to the reader.)

The file command executes the DSL tests to classify the input file. The tests are organized in the DSL as a tree-like hierarchy. First, each top-level test is executed. If a test passes, then its children are each tested, in order. Tests at any level can optionally print out a message or associate the input file with a MIME type classification.

Each line in the DSL file is a test, which includes an offset, type, expected value, and message, delimited by whitespace. For example:

10    lelong    0x00000100    this is a test

This line will do the following:

  1. Start at byte offset 10 in the input file
  2. Read a signed little-endian long (4 bytes)
  3. If those bytes equal 0x100, then print “this is a test”

Now let’s add a child test, and associate it with a MIME type:

10        lelong    0x00000100    this is a test
>20       ubyte     0xFF          test two
!:mime    application/x-foo

The “>” before the “20” offset in the second test means that it is a child of the previously defined test at the higher level.
This new version will do the following:

  1. If, and only if, the first test matches, then attempt the second test.
  2. If the byte at file offset 20 equals 0xFF, then print out “test two” and also associate the entire file with the MIME type application/x-foo.

Note that the message for a parent test will be printed even if its children do not match. A child test will only be executed if its parent is matched. Children can be arbitrarily nested with additional “>” prefixes:

10        lelong    0x00000100    this is a test
>20       ubyte     0xFF          test two
!:mime    application/x-foo
>>30      ubyte     0x01          this is a child of test 2
>20       ubyte     0x0F          this is a child of the first test that will be tested if the first test passes, regardless of whether the second test passes
!:mime    application/x-bar

If a test passes, then all of its children will be tested, regardless of whether their siblings match.
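
The evaluation order described above can be sketched as a small recursive walk. This is a simplified model, not PolyFile's implementation: the Test class and run function are hypothetical names, and every test here compares a single unsigned byte to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Test:
    offset: int
    expected: int                     # expected unsigned byte value (simplification)
    message: str
    mime: Optional[str] = None
    children: List["Test"] = field(default_factory=list)

def run(test: Test, data: bytes, messages: List[str], mimes: List[str]) -> bool:
    """Evaluate `test`; recurse into its children only when it matches."""
    if test.offset >= len(data) or data[test.offset] != test.expected:
        return False
    messages.append(test.message)     # kept even if no child matches
    if test.mime is not None:
        mimes.append(test.mime)
    for child in test.children:       # every child is tried, in order
        run(child, data, messages, mimes)
    return True

tree = Test(10, 0x01, "this is a test", children=[
    Test(20, 0xFF, "test two", "application/x-foo",
         children=[Test(30, 0x01, "child of test two")]),
    Test(20, 0x0F, "second child of the first test", "application/x-bar"),
])

data = bytearray(31)
data[10], data[20] = 0x01, 0x0F
messages, mimes = [], []
run(tree, bytes(data), messages, mimes)
print(messages)  # ['this is a test', 'second child of the first test']
print(mimes)     # ['application/x-bar']
```

The first child fails, so its own child is never attempted, but its sibling is still tested.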

So far, all of the offsets in these examples have been absolute, but the libmagic DSL also allows relative offsets:

10      lelong    0x00000100    this is a test
>&20    lelong    0x00000200    this will test 20 bytes after the end of its parent match: the parent's 4-byte lelong ends at offset 14, so this is absolute offset 14 + 20 = 34

as well as indirect offsets:

(20.s)      lelong    0x00000100    indirect offset!

The (20.s) here means: read a little-endian short at absolute byte offset 20 in the file and use that value as the offset to read the signed little-endian long (lelong) that will be tested. Indirect offsets can also include arithmetic modifiers:

(20.s+10)   read the offset from the little-endian short at absolute byte offset 20 and add 10
(0.L*0x20)   read the offset from the big-endian long at absolute byte offset zero and multiply by 0x20
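
As a rough sketch, resolving an indirect offset amounts to reading a pointer from the file and then applying the optional arithmetic. The helper below is hypothetical and covers only the four pointer types and three operators used in this article; it also assumes the pointer is read as an unsigned value.

```python
import struct

# Lowercase suffix = little-endian, uppercase = big-endian.
# Assumption: pointer values are treated as unsigned.
POINTER_FORMATS = {"s": "<H", "S": ">H", "l": "<I", "L": ">I"}

def resolve_indirect(data, pointer_offset, kind, op="+", operand=0):
    """Resolve (pointer_offset.kind op operand) to an absolute offset."""
    (value,) = struct.unpack_from(POINTER_FORMATS[kind], data, pointer_offset)
    if op == "+":
        return value + operand
    if op == "-":
        return value - operand
    if op == "*":
        return value * operand
    raise ValueError(f"unsupported operator: {op!r}")

data = bytearray(64)
data[0:4] = (2).to_bytes(4, "big")           # big-endian long at offset 0
data[20:22] = (0x30).to_bytes(2, "little")   # little-endian short at offset 20

print(resolve_indirect(data, 20, "s"))           # (20.s)     -> 48
print(resolve_indirect(data, 20, "s", "+", 10))  # (20.s+10)  -> 58
print(resolve_indirect(data, 0, "L", "*", 0x20)) # (0.L*0x20) -> 64
```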

Relative and indirect offsets can also be combined:

(&0x10.S)    read the offset from the big-endian short 0x10 bytes past the parent match
(&-4.l)      read the offset from the little-endian long four bytes before the parent
&(0.S-2)     read the first two bytes of the file, interpret them as a big-endian short, subtract two, and use that value as an offset relative to the parent match
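
The difference between the two combinations is where the "&" sits: inside the parentheses it moves the location of the pointer, outside it moves the result. A minimal sketch, using hypothetical helper names, where parent_pos stands for whatever position relative offsets are measured from for the parent match:

```python
import struct

def read_pointer(data, offset, fmt):
    (value,) = struct.unpack_from(fmt, data, offset)
    return value

def indirect_from_relative(data, parent_pos, delta, fmt):
    """(&delta.X): the pointer lives at parent_pos + delta; its value is the offset."""
    return read_pointer(data, parent_pos + delta, fmt)

def relative_from_indirect(data, parent_pos, pointer_offset, fmt, adjust=0):
    """&(pointer_offset.X+adjust): absolute pointer; the result is relative to the parent."""
    return parent_pos + read_pointer(data, pointer_offset, fmt) + adjust

data = bytearray(64)
data[0:2] = (10).to_bytes(2, "big")      # big-endian short at offset 0
data[28:32] = (7).to_bytes(4, "little")  # little-endian long just before parent_pos
data[48:50] = (5).to_bytes(2, "big")     # big-endian short 0x10 past parent_pos
parent_pos = 32

print(indirect_from_relative(data, parent_pos, 0x10, ">H"))   # (&0x10.S) -> 5
print(indirect_from_relative(data, parent_pos, -4, "<I"))     # (&-4.l)   -> 7
print(relative_from_indirect(data, parent_pos, 0, ">H", -2))  # &(0.S-2)  -> 40
```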

Offsets are very complex!

Despite having existed for decades, the libmagic pattern DSL is still in active development.

Mischief, Unmanaged

In developing our independent implementation of libmagic (to the point where it can parse the file command’s entire collection of magic definitions and pass all but two of the official unit tests), we discovered many undocumented DSL features and apparent upstream bugs.

Poorly Documented Syntax

For example, the DSL patterns for matching MS-DOS files contain a poorly documented use of parentheses within indirect offsets:

(&0x10.l+(-4))

The semantics are ambiguous; this could mean, “Read the offset from the little-endian long 0x10 bytes past the parent match decremented by four,” or it could mean, “Read the offset from the little-endian long 0x10 bytes past the parent match and add the value read from the last four bytes in the file.” It turns out that it is the latter.
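
To make the two readings concrete, here is a small sketch that computes both on the same input. The function names are hypothetical, and the code assumes that the nested value is read with the same type as the pointer (an unsigned little-endian long); the article only states that it comes from the last four bytes of the file.

```python
import struct

def read_le_long(data, offset):
    (value,) = struct.unpack_from("<I", data, offset)
    return value

def literal_reading(data, parent_pos):
    """First reading: the pointer value decremented by four."""
    return read_le_long(data, parent_pos + 0x10) - 4

def nested_reading(data, parent_pos):
    """Second reading (the one reported above): add the value stored in the
    last four bytes of the file. Assumption: it is read as a little-endian long."""
    pointer = read_le_long(data, parent_pos + 0x10)
    addend = read_le_long(data, len(data) - 4)
    return pointer + addend

data = bytearray(64)
data[0x10:0x14] = (100).to_bytes(4, "little")  # pointer 0x10 past a parent at 0
data[60:64] = (8).to_bytes(4, "little")        # last four bytes of the file

print(literal_reading(data, 0))  # 96
print(nested_reading(data, 0))   # 108
```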

Undocumented Syntax

The elf pattern uses an undocumented ${x?true:false} ternary operator syntax. This syntax can also occur inside a !:mime directive!

Some specifications, like the CAD file format, use the undocumented regex /b modifier. It is unclear from the libmagic source code whether this modifier is simply ignored or if it has a purpose. PolyFile currently ignores it and allows regexes to be applied to both ASCII and binary data.

According to the documentation, the search keyword—which performs a literal string search from a given offset—is supposed to be followed by an integer search range. But this search range is apparently optional.

Some specifications, like BER, use “search/b64”, which is undocumented syntax. PolyFile treats this as equivalent to the compliant search/b/64.
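
One way to tolerate both quirks is to parse the keyword leniently, so that a missing range and a range fused onto a flag are both accepted. The sketch below is not PolyFile's parser; parse_search and search are hypothetical names, and the interpretation of the range as a count of candidate start positions is an assumption.

```python
import re

def parse_search(keyword):
    """Split a `search` type into (flags, range); range is None when omitted."""
    rest = keyword[len("search"):]
    if not keyword.startswith("search") or (rest and not rest.startswith("/")):
        raise ValueError(f"not a search keyword: {keyword!r}")
    flags, search_range = set(), None
    for part in rest.split("/"):
        m = re.fullmatch(r"([A-Za-z]*)(\d*)", part)
        if m is None:
            raise ValueError(f"bad modifier: {part!r}")
        flags.update(m.group(1))          # each letter is one flag
        if m.group(2):
            search_range = int(m.group(2))
    return flags, search_range

def search(data, needle, offset, search_range=None):
    """Find `needle` starting at `offset`; returns -1 if it is not found.
    Assumption: the range is the number of start positions to try."""
    if search_range is None:
        end = len(data)
    else:
        end = offset + search_range - 1 + len(needle)
    return data.find(needle, offset, end)

print(parse_search("search/b64") == parse_search("search/b/64"))  # True
print(parse_search("search"))                                     # (set(), None)
print(search(b"xxabc", b"abc", 0, 3))                             # 2
```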

The regex keyword has an undocumented T modifier. What is a T modifier? Judging from libmagic’s code, it appears to trim whitespace from the resulting match.

Bugs

The libmagic DSL has a type specifically for matching globally unique identifiers (GUIDs), which follow a standardized structure defined by RFC 4122. One of the definitions in the DSL for Microsoft’s Advanced Systems Format (ASF) multimedia container does not conform to RFC 4122: it is two bytes short. Presumably libmagic silently ignores invalid GUIDs. We caught it because PolyFile validates all GUIDs against RFC 4122. This bug was present in libmagic from December 2019 until we reported it to the libmagic maintainers in April 2022. In the meantime, PolyFile has a workaround for the bug and has always used the correct GUID.
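
A check of this kind can be as simple as verifying the textual 8-4-4-4-12 layout of hexadecimal digits. The sketch below is a minimal illustration rather than PolyFile's validation code; the function name is hypothetical, and both GUIDs are made-up values, not the ASF identifier in question.

```python
import re

# Textual GUID layout: 8-4-4-4-12 hexadecimal digits (16 bytes in total).
GUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)

def is_well_formed_guid(text):
    """Check only the layout; version and variant bits are not inspected."""
    return GUID_RE.fullmatch(text) is not None

print(is_well_formed_guid("00112233-4455-6677-8899-AABBCCDDEEFF"))  # True
print(is_well_formed_guid("00112233-4455-6677-8899-AABBCCDD"))      # False (two bytes short)
```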

Metagame

PolyFile is a safer alternative to libmagic that is nearly feature-compatible.

$ polyfile -I suss.png
image/png………………………………………………………..PNG image data
application/pdf…………………………………………………..Malformed PDF
application/zip…………………………………………………..ZIP end of central directory record Java JAR archive
application/java-archive…………………………………………..ZIP end of central directory record Java JAR archive
application/x-brainfuck……………………………………………Brainf*** Program

PolyFile even has an interactive debugger, modeled after gdb, to debug DSL patterns during matching. (See the -db option.) This is useful for DSL developers working on either libmagic or PolyFile. But PolyFile can do so much more! For example, it can optionally output an interactive HTML hex viewer that maps out the structure of a file. It’s free and open source. You can install it right now by running pip3 install polyfile or by cloning its GitHub repository.