Introducing DPack

Kris Zyp
Doctor Evidence Development
9 min read · Jun 6, 2019


DPack is a structured data format that can serialize and encode large data structures extremely efficiently, often in a half to a third of the space, and can parse faster than JSON and other common alternatives. In addition, dpack supports a broad and extensible range of data types, object and structure referencing, and blocks that allow partial or deferred parsing. DPack is self-describing: no out-of-band information is necessary. While no single format is perfect for every application, we believe dpack is a powerful and broadly useful format for serializing and storing complex, diverse data structures in a compact form, designed for efficient and scalable parsing and handling across a variety of needs, including database storage and web data transfer.

DPack was born from work on improving the performance and scalability of our applications. At Doctor Evidence, our interfaces give users access to large, complex data structures that can be filtered and analyzed in a variety of ways, which involves transferring large amounts of data between our servers and web clients. Like most teams, we had relied on JSON for storing and transferring data, but we started investigating other possibilities, since a better format could yield significant benefits with such heavy data usage. We tried MessagePack, but the performance of even the fastest libraries was inferior to JSON. And while it produces smaller serializations, once the data was gzipped (and we always gzip the data we transfer to the browser), gzipped MessagePack was actually larger than gzipped JSON in most cases, because its unusual byte frequencies defeat Huffman coding. MessagePack also does not support any kind of deferred parsing, which can be extremely valuable for faster indexing and transformation operations on the backend. BSON does feature deferred parsing, but suffers from similar performance and compactness issues as MessagePack.

Consequently, we built dpack, with a number of key objectives in mind. First, most data structures in real applications have a large degree of structural homogeneity: the same structures are reused throughout the data. By internally reusing these structures, a large amount of unnecessary redundancy in serialization can be eliminated. Likewise, many string-based fields frequently reuse the same strings. The dpack format provides the ability to reference structural properties, strings, and objects so they can be reused without wasted space or parsing/serialization time. DPack was also built from the ground up to support deferred parsing of embedded blocks of data, and to support extensible types. All of this is available in the JavaScript dpack library that we built.
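To make the structural reuse concrete, here is a minimal sketch, assuming the serialize and parse exports of the dpack package (see the dpackjs README for the current API):

const { serialize, parse } = require('dpack')

// Every element repeats the same three property names; dpack can encode
// that structure once and reference it for each subsequent object, whereas
// JSON re-writes every key in every element.
const studies = [
  { name: 'Study A', category: 'oncology', id: 1 },
  { name: 'Study B', category: 'oncology', id: 2 },
  { name: 'Study C', category: 'cardiology', id: 3 }
]

const serialized = serialize(studies)
console.log(serialized.length, JSON.stringify(studies).length)
const restored = parse(serialized) // round-trips to the original array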

The bottom line: dpack is faster, more compact, and more expressive than JSON for many large data structures in web applications.

Performance

For benchmark comparisons, I measured the performance and sizes of some of our data structures. Here at Doctor Evidence we deal heavily with data structures that describe the data from medical research studies. These structures can be quite complex and large, but also have a decent amount of structural similarity within the data. While this is probably more complex than typical application data, I think it is a very reasonable benchmark. Below are the results for serializing a typical study data structure (smaller is better for both size and time). These were all performed on Node 12.3 (V8 7.4), with a thousand iterations:

                    dpack     JSON      Avro      MessagePack
Size (bytes)        20344     68816     49259     58484
Gzip size (bytes)   7672      8822      8711      9378
Serialize time      0.57ms    0.34ms    0.93ms    1.1ms
Parse time          0.39ms    1.2ms     0.57ms    2.5ms
Serialize w/ gzip   1.1ms     1.4ms     1.5ms     2.1ms
Parse w/ gzip       0.52ms    1.5ms     0.71ms    2.6ms

As you can see, dpack dramatically outperforms JSON and MessagePack in both speed and size overall, with the exception of isolated serialization speed. Different data structures can produce very different outcomes. I have tested with various data structures from our application and from typical web sites and applications (like Jira, eBay, etc.). Some data structures (those with highly varied keys) perform poorly, but most show similar improvements in performance and compactness. It should also be noted that the improvements tend to disappear with very small structures (like a single object). V8 also has its own serialization format (used for structured cloning) with very fast serialization, but it is more verbose than MessagePack.

Gzipping naturally reduces some of the difference between formats, since it eliminates much of the redundancy. However, DPack works at the structural level and was carefully designed to use characters that align well with Huffman coding, so DPack complements gzipping and file size reductions are still achieved. And if you look closely at the table, you will notice that because DPack produces a smaller file to compress, it actually greatly improves the performance of gzipping itself. DPack generally doesn't serialize faster than JSON in isolation, but considered together with gzipping, the overall performance is generally superior for large objects; here the DPack output was gzipped in half the time.
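For readers who want to reproduce this kind of comparison on their own data, here is a rough sketch of such a harness (not the exact one used for the table above), assuming dpack's serialize export and using Node's built-in zlib:

const zlib = require('zlib')
const { serialize } = require('dpack')

function bench(label, encode, data, iterations = 1000) {
  const start = process.hrtime.bigint()
  let output
  for (let i = 0; i < iterations; i++) output = encode(data)
  const ms = Number(process.hrtime.bigint() - start) / 1e6 / iterations
  // For strings, length approximates bytes when the data is mostly ASCII.
  console.log(label, output.length, 'bytes,', ms.toFixed(3), 'ms/iteration')
}

const study = require('./sample-study.json') // any representative structure

bench('JSON', d => JSON.stringify(d), study)
bench('dpack', d => serialize(d), study)
// A smaller pre-compression output also means less work for gzip:
bench('JSON + gzip', d => zlib.gzipSync(JSON.stringify(d)), study)
bench('dpack + gzip', d => zlib.gzipSync(serialize(d)), study)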

Progressive Parsing

Our JavaScript library also includes support for progressive parsing: dpack can be parsed while it is downloading, even before the download is complete. DPack provides access to the data structure during this progressive download and parsing, so the partially constructed structure can be accessed and even used while a download is in progress. Furthermore, for large data structures this means that when a download finally completes, most of the data will already be parsed; only the last chunk remains to be processed. This can provide much faster response times when receiving data structures from a server, since parsing doesn't have to wait until the download is complete before beginning (as is typical with JSON).
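Here is a sketch of what this looks like over HTTP in Node; the stream factory name used here (createParseStream) is an assumption, so consult the dpackjs README for the actual streaming API:

const https = require('https')
const { createParseStream } = require('dpack') // assumed export, see README

https.get('https://example.com/studies.dpack', response => {
  const parseStream = createParseStream()
  // Each network chunk is parsed as it arrives, so most of the structure
  // is already built by the time the download completes.
  parseStream.on('data', value => {
    console.log('received value', value)
  })
  response.pipe(parseStream)
})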

Deferred/Lazy Parsing

The dpack library includes support for deferred parsing with nested "blocks" within dpack documents. This can be a powerful feature for database applications that store data in dpack. By defining blocks that do not need to be parsed unless they are accessed, other parts of a data structure can be read without requiring a full parse. This is particularly useful when querying or indexing data in a database. We store large amounts of complex research data in our databases that require frequent querying and indexing. For example, we often have data structures like this that we may put in our database:

const study = {
  name: 'name of study',
  category: 'oncology',
  id: 5245,
  moreData: { /* ...very large data structure in here... */ }
}
database.put(study)

Now if we have a database that holds a large number of these data structures, imagine we want to query or index by the category:

const cancerStudies = databaseCursorOfStudies.filter(study =>
  study.category === 'oncology')

In this case, we could be iterating through a very large number of database entries. This can be extremely slow if we have to parse the entire data structure just to read its category. However, with dpack, we can define specific objects as “blocks” that will be serialized in such a way that the parser can defer the parsing until the data is needed:

const { asBlock } = require('dpack')

const study = {
  name: 'name of study',
  category: 'oncology',
  id: 5245,
  moreData: asBlock({ /* ...very large data structure in here... */ })
}

Now this large nested data structure is written as a block, and the parser can parse the name, category, and id properties (and values) without having to parse the contents of moreData. Furthermore, the value of moreData is a Proxy object that can be referenced, included, or copied into other objects, and it won't be parsed until one of its properties is accessed. For indices, this means data could be indexed by category while the moreData contents are transferred to another entity/row in a table without ever being parsed or serialized (we actually rarely do full table scans; we create indices, which benefit just as much from this feature). Likewise, these study objects could be serialized to an HTTP response, and the existing binary contents of moreData can be transferred directly to the network stream without any need to parse and re-serialize the data, as long as moreData is not modified. This has potentially large performance benefits, minimizing parsing and serialization costs in both databases and web application servers. With nested blocks, objects can be modified, and any nested serialization that is unmodified can be reused, so that only the modified sections of an object graph actually need to be re-serialized.
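The deferred behavior can be sketched like this, assuming serialize and parse exports alongside asBlock (whether deferral requires a separate lazy parsing mode is library-specific; this sketch assumes the default parser defers blocks as described above):

const { serialize, parse, asBlock } = require('dpack')

const buffer = serialize({
  category: 'oncology',
  moreData: asBlock({ outcomes: [ /* large contents */ ] })
})

const study = parse(buffer)
study.category              // available immediately; the block was skipped
const lazy = study.moreData // a Proxy over the still-unparsed block
lazy.outcomes               // first property access triggers parsing the block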

Shared Structures

DPack achieves compact, efficient serialization by reusing properties and structures so they don't have to be rewritten for every object that shares them. Consequently, small objects serialized on their own may not have the opportunity for property reuse. However, our dpack library includes the ability to define a shared structure, so that many small objects with common properties (in a database store) can reference one shared set of property definitions and still benefit from DPack property reuse. The serialization of this shared structure is simply prepended to the object serializations when they leave the system and are parsed in the browser (the shared structure is only needed once, even when multiple objects are sent).
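The shared-structure idea looks roughly like the following; the names used here (createSharedStructure and the sharedStructure option) are hypothetical illustrations, not the library's confirmed API, so check the dpackjs README for the real interface:

const { serialize, createSharedStructure } = require('dpack')

// Describe the properties common to all study objects once (hypothetical API).
const shared = createSharedStructure(['name', 'category', 'id'])

// Each small object references the shared property definitions instead of
// re-encoding them, keeping per-object serializations tiny.
const a = serialize({ name: 'Study A', category: 'oncology', id: 1 },
  { sharedStructure: shared })
const b = serialize({ name: 'Study B', category: 'oncology', id: 2 },
  { sharedStructure: shared })

// When sending multiple objects to a browser, the serialized shared
// structure is prepended once, followed by the per-object serializations.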

By combining shared structures with blocks, dpack can serialize data from databases using extremely efficient buffer copying, directly from pre-serialized shared structures and from source buffer data in the database. Leveraging this combination, we have been able to efficiently store massive amounts of clinical study data and send it to our browser-based web clients with remarkable speed and minimal resource/CPU usage.

Extensible Data Types

DPack was built to support serializing and parsing extensible data types. This can include user provided classes and types. It also includes built-in support for handling Date, Set, and Map instances. These can be used or included in any data structure and they will be round-tripped back to their original form through serialization and parsing.
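For example, assuming the serialize and parse exports used earlier:

const { serialize, parse } = require('dpack')

const original = {
  created: new Date('2019-06-06'),
  tags: new Set(['oncology', 'phase-3']),
  scores: new Map([['efficacy', 0.82]])
}

const restored = parse(serialize(original))
restored.created instanceof Date // true: built-in types survive the round trip
restored.tags instanceof Set     // true
restored.scores instanceof Map   // true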

In addition, user-defined classes can be registered with the JavaScript library, allowing these objects to be tagged and deserialized back to their original type. Supporting this at the data format level improves performance, since properties can be read and assigned directly to a properly typed instance without the additional property-copying steps typical of schemes that deserialize data into user classes not natively integrated into the format.
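A sketch of what registration might look like; registerClass is an assumed name for illustration only (the actual registration API is documented in the dpackjs README):

const { serialize, parse, registerClass } = require('dpack')

class Study {
  constructor(name, category) {
    this.name = name
    this.category = category
  }
}

registerClass(Study) // hypothetical: tags Study instances in the output

const restored = parse(serialize(new Study('Study A', 'oncology')))
restored instanceof Study // true: properties land directly on a Study instance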

Encoding Compatibility

While DPack is a "binary" format in the sense that it is designed and optimized for byte-level machine reading, all of its tokens are standard ASCII characters. This provides a performance advantage: DPack can be decoded from bytes to characters in a single pass, making it quick and easy for browsers to use the native decoding mechanism and continue parsing the decoded text.

This also means DPack can be encoded in any ASCII-compatible character encoding. While the default is UTF-8, any other ASCII-compatible character set works as well.
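In a browser, the decode-then-parse path can be sketched like this (parse is assumed to accept a decoded string, per this section):

import { parse } from 'dpack'

async function load(url) {
  const response = await fetch(url)
  const bytes = await response.arrayBuffer()
  // All dpack tokens are standard ASCII, so one UTF-8 text decode covers
  // the whole message, after which parsing proceeds on the decoded string.
  const text = new TextDecoder('utf-8').decode(bytes)
  return parse(text)
}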

Emergence of Benefits

While dpack is certainly a more sophisticated data format to describe (see the specification linked below) than some alternatives, it is designed for minimal implementation complexity, with a structure conducive to a simple state-machine implementation, while providing built-in support for a broad set of data types, custom data types, referencing, deferred parsing of blocks, and progressive parsing. Ultimately this translates into easier and simpler application development on top of dpack: you don't have to invent custom techniques for translating data into more compact forms, describing different types, or deferring parsing.

And by using a single format that can consistently represent the structures and needs of real-world, complex data across the various demands of databases and web servers, we can use the same format for data storage, network transfers, inter-process communication, and intermediate caches. Because we retain a single format and can defer parsing at any level, data moves between layers with tremendous efficiency. The combination of these techniques creates advantages beyond any one feature on its own.

You can find the JavaScript package here:
https://github.com/DoctorEvidence/dpackjs

We have created a web-based conversion tool so you can try your own data structures in dpack here:
https://doctorevidence.github.io/dpackjs/

And the specification is available here:
https://github.com/DoctorEvidence/dpack

If you are interested in working with the development team at Doctor Evidence, see the rest of this site for more information.
