Parsing Adobe Illustrator

March 8, 2022

The next big thing for me after Opera was Open Design 2.0 (https://opendesign.dev/blog/open-design-2-0), and my first job was to make the Adobe Illustrator file parser work within the framework set for 2.0. This primarily meant code that can run fully client-side and be shipped as such. A server-side approach has the benefit of a more diverse ecosystem of libraries, whereas the browser currently constrains you to JavaScript… and, thankfully, WebAssembly.

Finding the right solution

Adobe Illustrator files are usually saved in "PDF-compatible" mode - this means there are two parts to successfully parsing them:

the PDF part,
additional data encoded in a custom format: so-called "private data", named after the keys it is stored under.

So naturally, potential libraries had to fit one additional criterion: being able to extract the raw private data so that we could parse it. In the end, I investigated a couple of libraries in JavaScript, Go and Rust. For each of them, I tested how many files from our test corpus it could open without error, and whether I could get to the private data. Sadly, I couldn’t get the best-performing one (https://mozilla.github.io/pdf.js/) to yield it.

The second best (https://github.com/pdfcpu/pdfcpu) worked, though. Go advertises its ability to compile to WebAssembly (WASM), so I had high hopes. The only restriction came from the team maintaining the next piece in our stack: they wanted as much as possible done in TypeScript for ease of maintenance. So the plan of action became:

verify we can indeed compile pdfcpu to WASM,
parse the file using pdfcpu and dump resulting Go structures to JSON,
extract and parse the private data in TypeScript,
transform the output to be identical to the one generated with the previous library,
verify on our internal corpus of data that both outputs are identical.
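
Step three of the plan can be sketched roughly as follows. This is a minimal illustration, not the actual parser: it assumes the private data chunks sit under keys named AIPrivateData1, AIPrivateData2, and so on (the naming commonly seen in PDF-compatible .ai files), and the DumpedDict shape and the collectPrivateData name are hypothetical stand-ins for whatever the JSON dump really looks like.

```typescript
// Hypothetical shape of one dictionary from the JSON dump: key -> decoded stream text.
type DumpedDict = { [key: string]: string };

// Collect the AIPrivateData<N> chunks and join them in numeric order.
// A plain string sort would put "AIPrivateData10" before "AIPrivateData2",
// so the numeric suffix is parsed and compared as a number.
function collectPrivateData(dict: DumpedDict): string {
  const parts: { index: number; data: string }[] = [];
  for (const key of Object.keys(dict)) {
    const match = /^AIPrivateData(\d+)$/.exec(key);
    if (match !== null) {
      parts.push({ index: parseInt(match[1], 10), data: dict[key] });
    }
  }
  parts.sort((x, y) => x.index - y.index);
  return parts.map((p) => p.data).join("");
}
```

The joined result is what a custom-format parser would then consume; any decompression of the chunks is left out of the sketch.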

Making it work

Right out of the gate, I hit the first hitch: it turns out the Go compiler does indeed support WASM output, but the output is… not quite the best. It suffered from two major problems:

The resulting file was >10MB - quite a lot for something shipped to a browser,
To run, it needed some "glue" - a special JS file, shipped alongside the compiler, which defined certain global values that the WASM file depended upon.

I spent some time trying to tackle both problems: the former by using another Go compiler (https://tinygo.org) to shrink the WASM file, the latter by trying to rewrite the glue file to use modules and thus avoid global variables.

Both attempts failed in a sense: I did get a somewhat smaller binary (3MB), but the resulting code would have been much harder to maintain. Rewriting the glue file proved too complex, as there were multiple references between pieces of generated code, often by name or through some more complex lookup mechanism.

After discussing it with the team, we agreed to proceed despite the shortcomings, mostly because:

on the first front, we could load the WASM file asynchronously on demand, avoiding the need to bundle it alongside the main JS file,
on the second, future Go versions promised better support for ES modules, meaning the glue file might get a lot simpler.
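
The on-demand loading idea from the first point can be sketched as a small memoizing wrapper. This is an assumption about how such loading could be structured, not the code that shipped; loadOnce is a hypothetical helper name, and the wiring shown in the trailing comment relies on the Go class that Go's wasm_exec.js glue defines globally, with a made-up file path.

```typescript
// Wrap an expensive async loader so it runs at most once, on first use.
// Concurrent callers share the same in-flight promise; a failed load is
// forgotten so that a later call can retry.
function loadOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = load().catch((err) => {
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}

// Hypothetical wiring in a browser, with wasm_exec.js already loaded:
//   const getParser = loadOnce(async () => {
//     const go = new Go();
//     const { instance } = await WebAssembly.instantiateStreaming(
//       fetch("/parser.wasm"), go.importObject);
//     void go.run(instance); // starts the Go runtime
//     return instance;
//   });
```

Nothing is fetched until the first call, so the multi-megabyte binary stays out of the main bundle and out of the initial page load.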

Unforeseen problems

As usual, not every problem can be foreseen and this project was no exception. I stumbled into shortcomings in the pdfcpu library:

pdfcpu issue #122 (https://github.com/pdfcpu/pdfcpu/issues/122), because of which I had to write my own lexer, parser and interpreter for the PS (PostScript) subset that PDF uses for text encoding. Sadly, this was done in TypeScript, so it couldn’t be upstreamed,
handling for certain exotic image formats (https://github.com/pdfcpu/pdfcpu/pull/483), which was successfully merged.
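
To give a feel for the first item, here is a minimal sketch of lexing and interpreting a tiny PostScript-like subset. It assumes the "text encoding" programs are CMap-style (as used for ToUnicode mappings) and handles only hex strings, names, numbers, bare operators and the beginbfchar/endbfchar pair; dictionaries, arrays, literal strings and ranges are deliberately out of scope. All names here are hypothetical and unrelated to the real implementation.

```typescript
type Token =
  | { kind: "hex"; value: string }
  | { kind: "name"; value: string }
  | { kind: "number"; value: number }
  | { kind: "operator"; value: string };

// Lexer: one regex alternative per token kind; "%" comments match no group and are dropped.
function tokenize(src: string): Token[] {
  const re = /%[^\n]*|<([0-9A-Fa-f\s]*)>|\/([^\s<>\[\]()\/%]*)|([^\s<>\[\]()\/%]+)/g;
  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    if (m[1] !== undefined) {
      tokens.push({ kind: "hex", value: m[1].replace(/\s+/g, "").toUpperCase() });
    } else if (m[2] !== undefined) {
      tokens.push({ kind: "name", value: m[2] });
    } else if (m[3] !== undefined) {
      if (/^[+-]?(\d+\.?\d*|\.\d+)$/.test(m[3])) {
        tokens.push({ kind: "number", value: Number(m[3]) });
      } else {
        tokens.push({ kind: "operator", value: m[3] });
      }
    }
  }
  return tokens;
}

// Destination strings in bfchar entries are UTF-16BE: four hex digits per code unit.
function decodeUtf16BE(hex: string): string {
  let out = "";
  for (let i = 0; i + 4 <= hex.length; i += 4) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
  }
  return out;
}

// Interpreter for bfchar sections only: pairs of <source code> <destination text>.
function parseBfChars(tokens: Token[]): { [code: number]: string } {
  const map: { [code: number]: string } = {};
  let inside = false;
  let source: string | undefined;
  for (const t of tokens) {
    if (t.kind === "operator" && t.value === "beginbfchar") {
      inside = true;
      source = undefined;
    } else if (t.kind === "operator" && t.value === "endbfchar") {
      inside = false;
    } else if (inside && t.kind === "hex") {
      if (source === undefined) {
        source = t.value;
      } else {
        map[parseInt(source, 16)] = decodeUtf16BE(t.value);
        source = undefined;
      }
    }
  }
  return map;
}
```

A full interpreter would also need bfrange, codespace ranges and usecmap, which is a large part of why this ended up as its own lexer, parser and interpreter.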

Some minor details also required careful workarounds: the previous implementation had weird edge cases that made diffing the two outputs much more difficult. To make it work, I added feature flags that enabled compatibility with the previous version. After successfully migrating to the new library, these could then be turned off.
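
Checking that two outputs are identical is easier when the comparison reports where they differ. The following is a minimal sketch of such a check over JSON-like values; diffPaths and the path notation are hypothetical and not taken from the project.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Return the paths at which two JSON values differ; an empty list means "identical".
function diffPaths(a: Json, b: Json, path: string = "$"): string[] {
  if (a === b) return [];
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return [path];
  }
  const out: string[] = [];
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b)) return [path];
    const n = Math.max(a.length, b.length);
    for (let i = 0; i < n; i++) {
      if (i >= a.length || i >= b.length) out.push(`${path}[${i}]`);
      else out.push(...diffPaths(a[i], b[i], `${path}[${i}]`));
    }
    return out;
  }
  // Both values are plain objects here: compare the union of their own keys.
  const has = (o: object, k: string) => Object.prototype.hasOwnProperty.call(o, k);
  const keys = Object.keys(a).concat(Object.keys(b).filter((k) => !has(a, k)));
  for (const k of keys.sort()) {
    if (!has(a, k) || !has(b, k)) out.push(`${path}.${k}`);
    else out.push(...diffPaths(a[k], b[k], `${path}.${k}`));
  }
  return out;
}
```

Run over a corpus, the reported paths make it easy to see which differences cluster around one legacy quirk, which is where a compatibility flag would come in.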

The "dual build" was an interesting solution: due to WASM limitations, the WASM version turned out to be slower and less capable (it could only load files under 2GB). To solve this problem, the published package contains four binaries:

linux/amd64
darwin/amd64
darwin/arm64
WASM

and exports two "contexts":

filesystem-based, which runs executables and uses the filesystem for results,
in-memory, which uses WASM primitives to run parsing and communicate results.

The former is faster, but usable only on Node.js (or any other JS platform that has access to the filesystem and exec), the latter is slower but can be used in the browser.
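
The choice between the two contexts can be illustrated with a small decision function. This is only a sketch of the trade-off described above, not the package's API: pickContext, the Env shape, and the exact byte limit (2 GiB is assumed for "2GB") are hypothetical. The platform and arch values follow Node.js conventions, where "x64" corresponds to Go's "amd64".

```typescript
type ContextKind = "filesystem" | "in-memory";

// Native targets listed above, in Go's os/arch notation.
const NATIVE_TARGETS = ["linux/amd64", "darwin/amd64", "darwin/arm64"];

// Assumed limit for the WASM build: 2 GiB.
const WASM_MAX_BYTES = 2 * 1024 * 1024 * 1024;

interface Env {
  platform: string; // e.g. "linux", "darwin", or anything else in a browser
  arch: string; // e.g. "x64", "arm64"
  canExec: boolean; // whether the runtime can spawn processes and use the filesystem
}

// Prefer the faster native binary when one exists for this platform;
// otherwise fall back to WASM if the file is small enough for it.
function pickContext(env: Env, fileBytes: number): ContextKind {
  const goArch = env.arch === "x64" ? "amd64" : env.arch;
  const target = `${env.platform}/${goArch}`;
  if (env.canExec && NATIVE_TARGETS.indexOf(target) !== -1) {
    return "filesystem";
  }
  if (fileBytes < WASM_MAX_BYTES) {
    return "in-memory";
  }
  throw new Error(`no context can handle ${fileBytes} bytes on ${target}`);
}
```

On Node.js the Env values would come from process.platform and process.arch; in a browser canExec is simply false.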

Integration

Having a working library is one thing; successfully integrating it into your product is another. In this case, we succeeded on both fronts: my parser is now used by our main release candidate for OpenDesign v2: rev 783276.

The parser is publicly available on npm, and the code is open-sourced on GitHub.

Tags: ceros backend typescript wasm golang
