3/25/2017, 12:06:35 PM
A Comparison of Database Compression Technologies

1. Background

From a database's point of view, when hardware resources are limited ( CPU, memory, SSD capacity and speed, etc. ), it always wants to:

  • Store more data by compressing it ( there are plenty of compression algorithms to choose from )
  • Access data faster, which generally means faster compression / decompression and a larger cache

Almost all popular compression algorithms depend heavily on the data context:

  • Data that is closer together is more correlated and tends to contain more redundancy
  • Larger compression blocks generally mean a better compression ratio ( up to a point, of course )

2. Block Compression

For most scenarios that use blocks or files as the compression unit, traditional ( streaming ) compression algorithms work pretty well and people are used to them. Many compression algorithms based on this approach have been implemented, including gzip, bzip2, snappy ( from Google ) and the rising star zstd:

  • gzip is supported by almost all platforms and has become a de facto industry standard. Its compression ratio, compression speed and decompression speed are well balanced.
  • bzip2 is based on the Burrows–Wheeler Transform; it splits the input into blocks and compresses them separately. Its compression ratio is quite good, but compression and decompression are slow.
  • snappy comes from Google. It compresses and decompresses quickly, but its compression ratio is not great. It suits scenarios that do not need a high compression ratio but must compress very fast ( e.g. real-time compression ).
  • lz4 is a strong competitor of snappy. It is faster than snappy, especially at decompression.
  • zstd is a rising star: its compression ratio is much better than lz4's and snappy's, at the cost of some compression/decompression speed. Compared with gzip, the compression ratios are about the same, but zstd compresses and decompresses much faster ( a minimal usage sketch follows this list ).
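
As a concrete reference point, here is a minimal sketch of one-shot compression and decompression with zstd's C API ( the input data, buffer sizes and compression level are only illustrative ):

    #include <zstd.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::string input(64 * 1024, 'a');           // highly redundant sample data
        // Allocate the worst-case output size reported by zstd.
        std::vector<char> compressed(ZSTD_compressBound(input.size()));
        size_t csize = ZSTD_compress(compressed.data(), compressed.size(),
                                     input.data(), input.size(),
                                     3 /* compression level */);
        if (ZSTD_isError(csize)) return 1;

        std::vector<char> restored(input.size());
        size_t dsize = ZSTD_decompress(restored.data(), restored.size(),
                                       compressed.data(), csize);
        if (ZSTD_isError(dsize)) return 1;

        std::printf("original: %zu bytes, compressed: %zu bytes\n",
                    input.size(), csize);
        return 0;
    }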

In the database world, the IO-optimized B-Tree built an unshakeable position in the good old days. Spinning disks are good at reading/writing large chunks of data, and B-Trees work well with larger block/page sizes, which give traditional compression algorithms a larger data context and therefore a better compression ratio.

In those days memory was far faster than hard disks, and most applications did not hit a database performance bottleneck.

Nowadays we have SSDs, PCIe SSDs, 3D XPoint and so on, and memory keeps getting larger, so the weaknesses of block compression are becoming conspicuous:

  • A larger block size leads to better compression but worse performance, and vice versa.
  • Block compression saves hard disk and SSD space, but that space keeps getting cheaper.
  • Memory is still expensive, yet block compression wastes it through its double-caching problem.

Because of these problems, many real-time applications have no choice but to disable compression in production.

2.1. Principle of Block Compression

Streaming compression algorithms such as snappy, lz4, gzip, bzip2, zstd etc. compress data in blocks or pages. These blocks are generally 4KB ~ 32KB ( TokuDB, which is known for strong compression, uses 2MB ~ 4MB ). The block here is not a physical block such as a memory page or a block-device sector; it is only a logical block.

When block compression is enabled, access performance drops accordingly. The reasons are:

  • During writes, many records are compressed together into a single block. A larger block size gives better compression but worse performance, and vice versa.
  • During reads, the whole block must be decompressed before the target record can be located in the decompressed data, so many records that are never used get decompressed anyway ( see the sketch after this list ). Again, a smaller block size means better performance but worse compression.
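
A minimal sketch of the read path under block compression ( the block layout, record format and helper names are hypothetical, for illustration only ): to return one record, the whole block containing it has to be decompressed and scanned.

    #include <zstd.h>
    #include <optional>
    #include <string>
    #include <vector>

    // Hypothetical in-memory layout: a compressed block plus its original size.
    struct CompressedBlock {
        std::vector<char> bytes;      // compressed payload
        size_t            raw_size;   // size after decompression
    };

    // Decompress the WHOLE block, then scan it for a single record.
    // Records are assumed to be '\n'-separated "key=value" lines (illustrative).
    std::optional<std::string> Get(const CompressedBlock& blk, const std::string& key) {
        std::string raw(blk.raw_size, '\0');
        size_t n = ZSTD_decompress(raw.data(), raw.size(),
                                   blk.bytes.data(), blk.bytes.size());
        if (ZSTD_isError(n)) return std::nullopt;
        raw.resize(n);

        size_t pos = 0;
        while (pos < raw.size()) {                       // every record is touched,
            size_t end = raw.find('\n', pos);            // even though only one is needed
            if (end == std::string::npos) end = raw.size();
            std::string line = raw.substr(pos, end - pos);
            size_t eq = line.find('=');
            if (eq != std::string::npos && line.compare(0, eq, key) == 0)
                return line.substr(eq + 1);
            pos = end + 1;
        }
        return std::nullopt;
    }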

Traditional databases generally use a larger DB cache to relieve this problem. Although the DB cache can greatly improve access performance for hot data, it introduces a double-caching problem ( the same data sits in both the OS file system cache and the DB cache ). And when the DB cache misses, the database still has to decompress a whole block read from disk; this is the main cause of slow queries ( another cause is a file system cache miss ).

These problems prevent traditional databases from improving access speed and space efficiency at the same time.

2.2. Block Compression in RocksDB

BlockBasedTable in RocksDB is an SSTable that uses block compression. Its index can only locate the target block, not the target record. The block size is configured in the table options ( BlockBasedTableOptions ).
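
For example, with RocksDB's C++ API the block size, the per-block compression algorithm and the block cache are configured roughly as follows ( a sketch; the values are only illustrative ):

    #include <rocksdb/cache.h>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>
    #include <string>

    rocksdb::DB* OpenDb(const std::string& path) {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_size  = 16 * 1024;                        // logical block size
        table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // DB cache for uncompressed blocks

        rocksdb::Options options;
        options.create_if_missing = true;
        options.compression = rocksdb::kSnappyCompression;            // per-block compression
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
        return s.ok() ? db : nullptr;
    }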

A block contains multiple key-value pairs ( one key-value pair is one record ). For example, if every block holds M records, the index only needs one entry per M records, i.e. roughly 1/M of a per-record index:

  • A larger M leads to a smaller index.
  • A larger M also means a larger block, so the compression algorithm ( gzip, snappy, etc. ) gets a larger data context, which leads to a better compression ratio.

When a BlockBasedTable is created, key-value pairs are first written into a buffer. When the buffer reaches its limit ( the block size ), it is compressed and appended to the BlockBasedTable file, and the file offset together with the block's first key are recorded ( the first key is later used to build the index ).

If a single record is so large that it exceeds the block size, that record becomes a block by itself ( a record is never split across blocks, no matter how large it is ).

After all key-value pairs have been written into blocks, the database builds an index from every block's first key and file offset. So in a BlockBasedTable file the data comes first, and the index and metadata are placed at the end of the file ( similar to a normal file header, except that it sits at the end ). The sketch below illustrates the process.
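
The build process can be sketched as follows ( a simplified illustration with a hypothetical on-disk format, not RocksDB's actual code ): records go into a buffer, each full buffer is compressed and flushed as one block, and the first key plus file offset of every block are kept for the index written at the end.

    #include <zstd.h>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    struct IndexEntry { std::string first_key; uint64_t offset; };

    // Hypothetical, heavily simplified builder.
    class BlockTableBuilder {
    public:
        BlockTableBuilder(std::ofstream& out, size_t block_size)
            : out_(out), block_size_(block_size) {}

        void Add(const std::string& key, const std::string& value) {
            if (buffer_.empty()) first_key_ = key;       // remember the block's first key
            buffer_ += key + '=' + value + '\n';
            if (buffer_.size() >= block_size_) Flush();  // an oversized record still
        }                                                // forms one single block

        void Finish() {
            Flush();
            // Index (first key + offset per block) goes after the data blocks.
            for (const IndexEntry& e : index_)
                out_ << e.first_key << '\t' << e.offset << '\n';
        }

    private:
        void Flush() {
            if (buffer_.empty()) return;
            std::vector<char> compressed(ZSTD_compressBound(buffer_.size()));
            size_t n = ZSTD_compress(compressed.data(), compressed.size(),
                                     buffer_.data(), buffer_.size(), 3);
            if (ZSTD_isError(n)) return;                 // sketch: error handling omitted
            index_.push_back({first_key_, offset_});
            out_.write(compressed.data(), static_cast<std::streamsize>(n));
            offset_ += n;
            buffer_.clear();
        }

        std::ofstream& out_;
        size_t block_size_;
        uint64_t offset_ = 0;
        std::string buffer_, first_key_;
        std::vector<IndexEntry> index_;
    };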

When a user executes a search, the database first determines the target block from the search key, then looks for that block in the DB cache. If it is not there, the database loads the block from hard disk / SSD, decompresses it and puts it into the DB cache; it then searches for the target record inside the decompressed block. RocksDB has several DB cache implementations, such as LRU Cache, Clock Cache and a counting cache ( used to collect cache-hit statistics ), plus other special-purpose caches.

In most situations the OS also has its file system cache, so the same block may live both in the DB cache ( decompressed data ) and in the file system cache ( compressed data ). This wastes memory, so RocksDB offers a tradeoff: with the direct IO options, the file system cache is bypassed and only the DB cache is used. This saves some memory, but performance suffers whenever the DB cache misses, because there is no OS cache to fall back on.
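
A sketch of that tradeoff using RocksDB's options ( the flags shown enable direct IO; everything else is left at its default ):

    #include <rocksdb/options.h>

    rocksdb::Options MakeDirectIoOptions() {
        rocksdb::Options options;
        options.use_direct_reads = true;                        // bypass the OS page cache for reads
        options.use_direct_io_for_flush_and_compaction = true;  // and for flush / compaction writes
        // Only the block cache (DB cache) now holds data, so a cache miss
        // always pays the full read + decompress cost.
        return options;
    }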

3. FM-Index

FM-Index, short for "Full-text index in Minute space" ( the Ferragina–Manzini index ), belongs to the Succinct Data Structure family. It can compress data and access it directly in compressed form.

FM-Index has a rich operation set and a long history. In the last few years it has seen renewed activity: a complete set of Succinct Data Structures has been implemented on GitHub, and Berkeley has started using FM-Index.

FM-Index is an offline algorithm: it compresses the whole dataset at once and never modifies it afterwards. Most implementations are based on the Burrows–Wheeler Transform ( BWT, which is built on the suffix array ). The most important operations of FM-Index are:

  • data = extract(offset, length)
  • {offset} = search(string), which returns the positions/offsets of all matches

Although FM-Index offers more operations than these, it has several serious problems:

  • Too complicated to implement
  • Compression ratio is poor ( much worse than streaming compression such as gzip )
  • search and extract are very slow ( less than 7 MB/s on a single thread of an i7-6700K, one of the fastest desktop CPUs of 2016 )
  • Compression is slow and memory-hungry ( the memory cost can be 50 times the data size )
  • Data model is flat text, not key-value
    • There is a simple way to adapt the flat-text model to a key-value model: pick a separator character that appears in neither keys nor values, such as #, insert it before and after every key, and place each value right after its key.
    • Then searching for #key# returns the position of #key#, and the value can be extracted right after it ( a sketch of this adaptation follows this list ).
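
A minimal sketch of that adaptation ( the FMIndex interface below is a hypothetical stand-in for any FM-Index library offering search and extract; the separator and size limits are only illustrative ):

    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for an FM-Index library with the two core operations.
    struct FMIndex {
        explicit FMIndex(const std::string& text);
        std::vector<size_t> search(const std::string& pattern) const;   // all match offsets
        std::string extract(size_t offset, size_t length) const;        // read back a slice
    };

    // Flatten key-value pairs into "#key#value#key#value..." (assumes '#'
    // appears in neither keys nor values).
    std::string Flatten(const std::vector<std::pair<std::string, std::string>>& kvs) {
        std::string text;
        for (const auto& [k, v] : kvs) text += "#" + k + "#" + v;
        return text;
    }

    // Look up one key: find "#key#", then extract what follows until the next '#'.
    std::string Get(const FMIndex& fm, const std::string& key, size_t max_value_len) {
        std::vector<size_t> hits = fm.search("#" + key + "#");
        if (hits.empty()) return "";
        size_t value_begin = hits.front() + key.size() + 2;      // skip "#key#"
        std::string chunk = fm.extract(value_begin, max_value_len);
        return chunk.substr(0, chunk.find('#'));                 // value ends at next '#'
    }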

4. Terark Searchable Compression

Terark proposed the concept of Searchable Compression: like FM-Index, it can search and extract data directly on the compressed data, but its data model is key-value from the ground up. According to its benchmark reports, it is much faster than FM-Index. The most important highlights are:

  • Global compression is used instead of traditional block compression
  • Keys and values are compressed with different global compression algorithms
  • Keys are compressed with CO-Index, which provides search functionality
  • Values are compressed with PA-Zip, which provides seekable ( point-accessible ) extraction

4.1. Key Compression: CO-Index

General indexing technologies need a lot of extra space to store the key index. Some use prefix compression, which partially relieves the index-bloat problem but cannot solve it completely.

Terark’s CO-Index ( Compressed Ordered Index ) uses a Nested Succinct Trie data structure to solve this problem.

Compared with traditional data structures, the Nested Succinct Trie has many advantages: it can reduce memory usage dramatically ( 10 ~ 50 times less ) while still supporting a rich set of operations:

  • Point search
  • Range search
  • Sequential traverse
  • Prefix search
  • Regex search ( NOT a one-by-one sequential scan )

Compared with FM-Index, CO-Index has significant advantages. For example ( assuming all the data stored in the FM-Index is keys ):

  • Function: 1) Using CO-Index to implement FM-Index's search operation, the search returns an internal ID, which is then used to access the value. 2) Conversely, given an ID, the corresponding key can be extracted.
  • Performance: CO-Index search is generally about 3X faster than FM-Index.
  • Compression: CO-Index's compression ratio is generally about 4X better than FM-Index's.

4.1.1. CO-Index implementation

Terark has in fact implemented several kinds of CO-Index; among them, the Nested Succinct Trie is the most widely used.

A Succinct Data Structure is a space-efficient representation of a traditional data structure. It dramatically reduces memory usage by storing the structure as bitmaps and navigating them with rank-select operations.

Although it greatly reduces memory usage, it is complicated to implement and its performance is usually mediocre. There is an open-source implementation, SDSL-Lite; Terark uses its own rank-select implementation, which performs much better.

Succinct Tree Example

A pointer-based binary tree node looks like struct Node { Node *left, *right; };, i.e. two pointers per node. With 2^16 nodes this takes at least 512 KB ( assuming 4-byte pointers ). Even if we shrink each pointer to 16 bits, it still costs 256 KB.

If we switch to a succinct representation instead, each node needs only about 2.5 bits ( of which ~0.5 bit is the rank-select index ), for a total of about 20 KB.
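
A minimal sketch of the idea for binary trees ( the classic level-order bitmap representation; rank1 here is the naive O(n) version, while a real succinct structure adds a small precomputed rank-select index, which is where the extra ~0.5 bit per node comes from ):

    #include <cstddef>
    #include <vector>

    // Level-order bitmap encoding of a binary tree: slot 1 is the root, and every
    // real node at slot i owns two child slots. bits[i] is true iff slot i holds a
    // real node, so a tree with n nodes needs about 2n + 1 bits in total.
    struct SuccinctBinaryTree {
        std::vector<bool> bits;   // 1-based: bits[0] is unused

        // Naive O(n) rank; production code uses a precomputed index instead.
        size_t rank1(size_t i) const {
            size_t r = 0;
            for (size_t j = 1; j <= i; ++j) r += bits[j];
            return r;
        }

        // Child navigation without any pointers.
        size_t left_child (size_t i) const { return 2 * rank1(i);     }
        size_t right_child(size_t i) const { return 2 * rank1(i) + 1; }
        bool   exists     (size_t i) const { return i < bits.size() && bits[i]; }
    };

With 2^16 nodes the bitmap itself is about 2 * 2^16 bits ≈ 16 KB, and the rank-select index brings the total to roughly the 20 KB mentioned above.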

Succinct Trie

A trie can serve as an index. The trie below is a Succinct Trie in the LOUDS representation; it holds four keys: hat, is, it, a.

[Figure: succinct_trie.png]

Nested Patricia Trie

A Succinct Trie alone is not enough for great compression, so path compression and nesting are introduced and combined.

A Patricia Trie can merge a chain of single-child nodes into one node labeled with the merged path ( e.g. ulus and undus below ). Terark stores these labels in another trie, which is nested inside the parent trie.

[Figure: nested_patricia_trie.png]

4.2. Value Compression: PA-Zip

Terark invented a compression technology named PA-Zip ( Point Accessible Zip ): each value has an ID, and that ID alone is enough to extract the value from the compressed data. The ID acts as an access point, hence the name Point Accessible Zip.

PA-Zip compresses all the values of a key-value database with global compression instead of block/page compression.

PA-Zip achieves a better compression ratio, has no double-caching problem, needs no DB cache, and can extract a value given a single ID. If we regard an extract operation as decompression:

  • Sequential read speed is generally around 500 MB/s ( up to 7 GB/s ), suitable for offline analysis; traditional databases can do this too.
  • Random read speed is generally around 300 MB/s on a single thread ( up to 3 GB/s ), suitable for online services; traditional databases cannot do this.
  • In some situations a warm-up may be required. Since TerarkDB needs no DB cache, its warm-up is extremely fast: once the database file is loaded, the warm-up is done, and loading runs at the SSD's sequential read speed.

Compared with FM-Index, PA-Zip's goal is to implement the extract operation with much better performance and compression:

  • Function: equivalent to FM-Index's extract
  • Performance: about 100X faster than FM-Index's extract
  • Compression: about 3X better than FM-Index's compression

4.3. Combine Key and Value

Keys are globally compressed by CO-Index and values are globally compressed by PA-Zip. Each key search returns an internal ID, and with that ID the corresponding value can be extracted. Throughout the whole process only the necessary data is touched; there is no wasted decompression and no redundant caching.

For this reason no DB cache is required. With mmap, the whole database needs only a single layer of cache, the file system cache; and with its much better compression, TerarkDB greatly reduces memory cost and system complexity. Together these give TerarkDB tremendous performance ( especially for random reads ), as the sketch below illustrates.
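
To make the combined read path concrete, here is a minimal sketch ( the CoIndex and PaZipStore interfaces are hypothetical placeholders, not Terark's actual API ): a key lookup in the index yields an internal ID, and that ID alone is enough to extract the value.

    #include <cstdint>
    #include <optional>
    #include <string>

    // Hypothetical placeholders for the two compressed structures.
    struct CoIndex {
        // Returns the internal record ID for an exact key match, if any.
        std::optional<uint64_t> search(const std::string& key) const;
    };
    struct PaZipStore {
        // Extracts exactly one value from the globally compressed store.
        std::string extract(uint64_t id) const;
    };

    // The whole read path: no block decompression, no DB cache --
    // only the data that is actually needed is touched.
    std::optional<std::string> Get(const CoIndex& keys, const PaZipStore& values,
                                   const std::string& key) {
        std::optional<uint64_t> id = keys.search(key);
        if (!id) return std::nullopt;
        return values.extract(*id);
    }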