TL;DR: If you are considering using an alternative binary format in order to reduce the size of your persisted JSON, consider this: the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method. In our testing, Brotli proved to be very effective for long-term persistence.

There are a lot of data serialization formats out there. Perhaps none is more pervasive than JSON, the de facto serialization method for web applications. And while it certainly isn't perfect, its convenience and simplicity have made it our format of choice at Lucid. However, we recently undertook a project that made us question whether or not we should be using JSON at our persistence layer.

In order to improve the performance and fidelity of our revision history feature, we decided that we should start persisting 'key-frames' (snapshots) of our document-state data, rather than just the deltas. Our plan was initially to just gzip the document-state JSON when persisting the snapshots. However, as we started sampling some data and crunching the numbers, we realized that within a year or two we would have hundreds of terabytes of data, costing thousands of dollars per month in infrastructure costs. So even if we could only reduce the size of our persisted data by a few percentage points, it would translate to real-world savings. Thus, we decided to investigate alternative serialization and compression methods to find the pair that would minimize the cost of persisting the new data.

JSON is human readable, relatively concise, simple to understand, and universally supported. But its simplicity and human readability mean it isn't the most space-efficient format out there. For example, representing the number 1234.567890123457 will take 18 bytes in UTF-8 stringified JSON, while a binary format could represent the same number as an 8-byte floating-point double. Similarly, false will be 5 bytes in JSON, but a single byte (or conceivably less) in a binary format. Because our document state includes plenty of booleans and numbers, it seemed like a no-brainer that a binary serialization technique would beat out JSON. We decided to test out the following serialization methods:

Historically, we have just used gzip to compress our document state because it is fast, gets effective results, and works natively in the JVM. However, a couple of years ago we started using Brotli to compress our static front-end JavaScript assets, and saw very good results. We thought it might be a good fit for our document-state JSON as well. We also decided to try XZ, Zstandard, and bzip2.

As a test bed of documents, we decided to use our system templates (Lucidchart and Lucidpress 'blueprint' documents that we provide to our users) as our sample data. We have about 1,500 templates totaling 133.8 MB of document-state JSON. For this set of documents, we tried every combination of binary format and compression algorithm (at their various compression levels). From the tests, we wanted to record three primary metrics:
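The byte counts above are easy to check. A minimal Python sketch (the literal document values here are just illustrations, not our real document state):

```python
import json
import struct

value = 1234.567890123457

# JSON stores numbers and booleans as text.
as_json = json.dumps(value).encode("utf-8")
# A binary format can pack the same number as an IEEE 754 double.
as_double = struct.pack(">d", value)

print(len(as_json))    # the decimal text form, well over 8 bytes
print(len(as_double))  # 8 bytes, always

print(len(json.dumps(False).encode("utf-8")))  # 5 bytes: the text "false"
print(len(struct.pack("?", False)))            # 1 byte in a naive binary packing
```

Note that a well-designed binary format can do even better than one byte per boolean (for example, by bit-packing flags), which is why the comparison looked so promising on paper.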
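Three of the algorithms we tried ship with the Python standard library, which makes a quick size comparison easy to sketch. This is a toy payload, not our template corpus; Brotli and Zstandard would need the third-party `brotli` and `zstandard` packages, so they're omitted here:

```python
import bz2
import gzip
import lzma  # the xz container format

# A toy, repetitive JSON payload standing in for real document state.
payload = b'{"shapes": [{"x": 10, "y": 20, "locked": false}]}' * 200

results = {
    "gzip":  len(gzip.compress(payload, compresslevel=9)),
    "bzip2": len(bz2.compress(payload, compresslevel=9)),
    "xz":    len(lzma.compress(payload, preset=9)),
}
for name, size in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {len(payload)} -> {size} bytes")
```

Repetitive JSON like this compresses extremely well under every algorithm, which foreshadows the TL;DR: once a strong compressor has squeezed out the redundancy, the choice of serialization format matters far less than the choice of compressor.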
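The shape of the test harness — every serializer crossed with every compressor at several levels — can be sketched as follows. Everything here is hypothetical scaffolding: the serializer and compressor tables, the level choices, and the metrics (compressed-size ratio and compression time) are stand-ins for whatever the real benchmark recorded:

```python
import bz2
import gzip
import json
import lzma
import time

# Hypothetical stand-ins: each serializer maps a document to bytes,
# each compressor maps bytes to bytes at a given level.
serializers = {"json": lambda doc: json.dumps(doc).encode("utf-8")}
compressors = {
    "gzip":  lambda data, level: gzip.compress(data, compresslevel=level),
    "bzip2": lambda data, level: bz2.compress(data, compresslevel=level),
    "xz":    lambda data, level: lzma.compress(data, preset=level),
}

def benchmark(documents):
    """Try every (serializer, compressor, level) combination."""
    rows = []
    for s_name, serialize in serializers.items():
        blobs = [serialize(doc) for doc in documents]
        raw = sum(len(b) for b in blobs)
        for c_name, compress in compressors.items():
            for level in (1, 5, 9):
                start = time.perf_counter()
                total = sum(len(compress(b, level)) for b in blobs)
                elapsed = time.perf_counter() - start
                rows.append((s_name, c_name, level, total / raw, elapsed))
    return rows

docs = [{"id": i, "locked": False, "x": i * 1.5} for i in range(500)]
for row in benchmark(docs):
    print(row)
```

Compressing each document independently (as a persistence layer must, to fetch one snapshot at a time) also means small documents pay a fixed per-blob overhead, which is worth keeping in mind when reading aggregate ratios.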