ZS: a file format for compressed sets¶
ZS is a simple, read-only, binary file format designed for distributing, querying, and archiving arbitrarily large data sets (up to tens of terabytes and beyond) – so long as those data sets can be represented as a set of arbitrary binary records. Of course it works on small data sets too. You can think of it as an alternative to storing data in tab- or comma-separated files – each line in such a file becomes a record in a ZS file. But ZS has a number of advantages over these traditional formats:
- ZS files are small: ZS files (optionally) store data in compressed form. The 3-gram counts from the 2012 US English release of the Google N-grams are distributed as a set of gzipped text files in tab-separated format, and take 1.3 terabytes of space. Uncompressed, this data set comes to more than 10 terabytes (and would be even more if loaded into a database). The same data in a ZS file with the default settings (LZMA compression) takes just 0.75 terabytes – this is more than 41% smaller than the current distribution format, and 13.5x smaller than the raw data.
- Nonetheless, ZS files are fast: Decompression is an inherently slow and serial operation, which means that reading compressed files can easily become the bottleneck in an analysis. Google distributes the 3-gram counts in many separate .gz files; one of these, for example, contains just the n-grams that begin with the letters “th”. Using a single core on a handy compute server, we find that we can get decompressed data out of this .gz file at ~190 MB/s. At this rate, reading this one file takes more than 47 minutes – and that’s before we even begin analyzing the data inside it. The LZMA compression used in our ZS file is, on its own, slower than gzip. If we restrict ourselves to a single core, then we can only read our ZS file at ~50 MB/s. However, ZS files allow for multithreaded decompression. Using 8 cores, gunzip runs at... still ~190 MB/s, because gzip decompression cannot be parallelized. On those same 8 cores, our ZS file decompresses at ~390 MB/s – a nearly linear speedup. This is also ~3x faster than our test server can read an uncompressed file from disk.
- In fact, ZS files are really, REALLY fast: Suppose we want to know how many different Google-scanned books published in the USA in 1955 used the phrase “this is fun”. ZS files have a limited indexing ability that lets you quickly locate any arbitrary span of records that fall within a given sorted range, or share a certain textual prefix. This isn’t as nice as a full-fledged database system that can query on any column, but it can be extremely useful for data sets where the first column (or first several columns) are usually used for lookup. Using our example file, finding the “this is fun” entry takes 5 disk seeks and ~25 milliseconds of CPU time – something like 85 ms all told. (And hot cache performance – e.g., when performing repeated queries in the same file – is even better.) The answer, by the way, is 27 books:
$ zs dump --prefix='this is fun\t1955\t' google-books-eng-us-all-20120701-3gram.zs
this is fun	1955	27	27
When this data is stored as gzipped text, the only way to locate an individual record, or span of similar records, is to start decompressing the file from the beginning and wait until the records we want happen to scroll by, which in this case – as noted above – could take more than 45 minutes. Using ZS makes this query ~33,000x faster.
- ZS files contain rich metadata: In addition to the raw data records, every ZS file contains a set of structured metadata in the form of an arbitrary JSON document. You can use this to store information about this file’s record format (e.g., column names), notes on data collection or preprocessing steps, recommended citation information, or whatever you like, and be confident that it will follow your data wherever it goes.
- ZS files are network friendly: Suppose you know you just want to look up a few individual records that are buried inside that 0.75 terabyte file, or want a large span of records that are still much smaller than the full file (e.g., all 3-grams that begin “this is”). With ZS, you don’t have to actually download the full 0.75 terabytes of data. Given a URL to the file, the ZS tools can find and fetch just the parts of the file you need, using nothing but standard HTTP. Of course going back and forth to the server does add overhead; if you need to make a large number of queries then it might be faster (and kinder to whoever’s hosting the file!) to just download it. But there’s no point in throwing around gigabytes of data to answer a kilobyte question.
If you have the ZS tools installed, you can try it right now. Here’s a live trace of the readthedocs.org servers searching the 3-gram database stored at UC San Diego. Note that the computer in San Diego has no special software installed at all – this is just a static file that’s available for download over HTTP:
$ time zs dump --prefix='this is fun\t' http://cpl-data.ucsd.edu/zs/google-books-20120701/eng-us-all/google-books-eng-us-all-20120701-3gram.zs
...
- ZS files are splittable: If you’re using a big distributed data processing system (e.g. Hadoop), then it’s useful to split up your file into pieces that approximately match the underlying storage chunks, so each CPU can work on locally stored data. This is only possible, though, if your file format makes it possible to efficiently start reading near arbitrary positions in a file. With ZS files, this is possible (though because this requires multiple index lookups, it’s not as convenient as in file formats designed with this as a primary consideration).
- ZS files are ever-vigilant: Computer hardware is simply not reliable, especially on scales of years and terabytes. I’ve dealt with RAID cards that would occasionally flip a single bit in the data that was being read from disk. How confident are you that this won’t be a key bit that totally changes your results? Standard text files provide no mechanism for detecting data corruption. Gzip and other traditional compression formats provide some protection, but it’s only guaranteed to work if you read the entire file from start to finish and then remember to check the error code at the end, every time. But ZS is different: it protects every bit of data with 64-bit CRC checksums, and the software we distribute will never show you any data that hasn’t first been double-checked for correctness. (Fortunately, the cost of this checking is negligible; all the times quoted above include these checks). If it matters to you whether your analysis gets the right answer, then ZS is a good choice.
- Relying on the ZS format creates minimal risk: The ZS file format is simple and fully documented; it’s not hard to write an implementation for your favorite language. In an emergency, an average programmer with access to standard libraries could write a minimal but working decompressor in just an hour or two. The reference implementation is BSD-licensed, undergoes exhaustive automated testing (>98% coverage) after every checkin, and just in case there are any ambiguities in the English spec, we also have a complete file format validator, so you can confirm that your files match the spec and be confident that they will be readable by any compliant implementation.
- ZS files have a name composed entirely of sibilants: How many file formats can say that?
This manual documents the reference implementation of the ZS file
format, which includes both a command-line zs
tool for
manipulating ZS files and a fast and featureful Python API, and also
provides a complete specification of the ZS file format in enough
detail to allow independent implementations.
Contents:
Project overview¶
- Documentation:
- http://zs.readthedocs.org/
- Installation:
You need either Python 2.7, or else Python 3.3 or greater.
Because
zs
includes a C extension, you’ll also need a C compiler and Python
headers. On Ubuntu or Debian, for example, you get these with:

sudo apt-get install build-essential python-dev
Once you have the ability to build C extensions, then on Python 3 you should be able to just run:
pip install zs
On Python 2.7, things are slightly more complicated: here,
zs
requires the backports.lzma
package, which in turn requires the liblzma library. On Ubuntu or
Debian, for example, something like this should work:

sudo apt-get install liblzma-dev
pip install backports.lzma
pip install zs
zs
also requires the following packages: six, docopt, requests.
However, these are all pure-Python packages which pip will install
for you automatically when you run pip install zs.
- Downloads:
- http://pypi.python.org/pypi/zs/
- Code and bug tracker:
- https://github.com/njsmith/zs
- Contact:
- Nathaniel J. Smith <nathaniel.smith@ed.ac.uk>
- Citation:
If you use this software in work that leads to a scientific publication, and feel that a citation would be appropriate, then here is a possible citation:
Smith, N. J. (submitted). ZS: A file format for efficiently distributing, using, and archiving record-oriented data sets of any size. Retrieved from http://vorpus.org/papers/draft/zs-paper.pdf
- Developer dependencies (only needed for hacking on source):
- Cython: needed to build from checkout
- nose: needed to run tests
- nose-cov: because we use multiprocessing, we need this package to get useful test coverage information
- nginx: needed to run HTTP tests
- License:
- 2-clause BSD, see LICENSE.txt for details.
The command-line zs tool¶
The zs
tool can be used from the command-line to create, view,
and check ZS files.
The main zs
command on its own isn’t very useful. It can tell
you what version you have – these docs were built with:
$ zs --version
0.10.0+dev
And it can tell you what subcommands are available:
$ zs --help
ZS: a space-efficient file format for distributing, archiving,
and querying large data sets.
Usage:
zs <subcommand> [<args>...]
zs --version
zs --help
Available subcommands:
zs dump Get contents of a .zs file.
zs info Get general metadata about a .zs file.
zs validate Check a .zs file for validity.
zs make Create a new .zs file with specified contents.
For details, use 'zs <subcommand> --help'.
These subcommands are documented further below.
Note
In case you have the Python zs
package installed,
but somehow do not have the zs
executable available on your
path, then it can also be invoked as python -m zs
. E.g., these
two commands do the same thing:
$ zs dump myfile.zs
$ python -m zs dump myfile.zs
zs make¶
zs make
allows you to create ZS files. In its simplest form, it
just reads in a text file, and writes out a ZS file, treating each
line as a separate record.
For example, if we have this data file (a tiny excerpt from the Web 1T dataset released by Google; note that the last whitespace in each line is a tab character):
$ cat tiny-4grams.txt
not done explicitly . 42
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
not done fast , 52
not done fast enough 71
Then we can compress it into a ZS file by running:
$ zs make '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams.zs
zs: Opening new ZS file: tiny-4grams.zs
zs: Reading input file: tiny-4grams.txt
zs: Blocks written: 1
zs: Blocks written: 2
zs: Updating header...
zs: Done.
The first argument specifies some arbitrary metadata that will be saved into the ZS file, in the form of a JSON string; the second argument names the file we want to convert; and the third argument names the file we want to create.
Note
You must ensure that your file is sorted before running
zs make
. (If you don’t, then it will error out and scold you.)
GNU sort is very useful for this task – but don’t forget to set
LC_ALL=C
in your environment before calling sort, to make sure
that it uses ASCIIbetical ordering instead of something
locale-specific.
When your file is too large to fit into RAM, GNU sort will spill the data onto disk in temporary files. When your file is too large to fit onto disk, then a useful incantation is:
gunzip -c myfile.gz | env LC_ALL=C sort --compress-program=lzop \
| zs make "{...}" - myfile.zs
The --compress-program
option tells sort to automatically
compress and decompress the temporary files using the lzop
utility, so that you never end up with uncompressed data on
disk. (gzip
also works, but will be slower.)
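Since GNU sort under LC_ALL=C is doing the heavy lifting here, it can be worth sanity-checking that a file really is in bytewise order before handing it to zs make. A minimal sketch of such a check in Python (the helper name here is ours, not part of the zs package):

```python
def is_asciibetically_sorted(records):
    """Return True if an iterable of byte-string records is in bytewise
    (ASCIIbetical) order -- the order produced by sort under LC_ALL=C."""
    previous = None
    for record in records:
        if previous is not None and record < previous:
            return False
        previous = record
    return True

# Bytewise order is not locale order: all uppercase letters sort before
# all lowercase ones, because e.g. b"Z" is byte 0x5a while b"a" is 0x61.
assert is_asciibetically_sorted([b"Zebra\n", b"apple\n"])
assert not is_asciibetically_sorted([b"apple\n", b"Zebra\n"])
```

Streaming a check like this over a file is much cheaper than letting zs make get partway through a multi-terabyte input before erroring out.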
Many other options are also available:
$ zs make --help
Create a new .zs file.
Usage:
zs make <metadata> <input_file> <new_zs_file>
zs make [--terminator TERMINATOR | --length-prefixed=TYPE]
[-j PARALLELISM]
[--no-spinner]
[--branching-factor=FACTOR]
[--approx-block-size=SIZE]
[--codec=CODEC] [-z COMPRESS-LEVEL]
[--no-default-metadata]
[--]
<metadata> <input_file> <new_zs_file>
zs make --help
Arguments:
<metadata> Arbitrary JSON-encoded metadata that will be stored in your
new ZS file. This must be a JSON "object", i.e., the
outermost characters have to be {}. If you're just messing
about, then you can just use "{}" here and be done, but for
any file that will live for long, we strongly recommend
adding more details about what this file is. See the
"Metadata conventions" section of the ZS manual for more
information.
<input_file> A file containing the records to be packed into the
new .zs file. Use "-" for stdin. Records must already be
sorted in ASCIIbetical order. You may want to do something
like:
cat myfile.txt | env LC_ALL=C sort | zs make "{}" - myfile.zs
<new_zs_file> The file to create. Conventionally uses the file extension
".zs".
Input file options:
--terminator=TERMINATOR Treat the input file as containing a series of
records separated by TERMINATOR. Standard Python
string escapes are supported (e.g., "\x00" for
NUL-terminated records). The default is
appropriate for standard Unix/OS X text files. If
you have a text file with Windows-style line
endings, then you'll want to use "\r\n"
instead. [default: \n]
--length-prefixed=TYPE Treat the input file as containing a series of
records containing arbitrary binary data, each
prefixed by its length in bytes, with this length
encoded according to TYPE. (Valid options:
uleb128, u64le.)
Processing options:
-j PARALLELISM The number of CPUs to use for compression.
[default: guess]
--no-spinner Disable the progress meter.
Output file options:
--branching-factor=FACTOR Number of keys in each *index* block.
[default: 1024]
--approx-block-size=SIZE Approximate *uncompressed* size of the records in
each *data* block, in bytes. [default: 393216]
--codec=CODEC Compression algorithm. (Valid options: none,
deflate, lzma.) [default: lzma]
-z COMPRESS-LEVEL, --compress-level=COMPRESS-LEVEL
Degree of compression to use. Interpretation
depends on the codec in use:
deflate: An integer between 1 and 9.
(Default: 6)
lzma: One of the strings 0, 0e, 1, or 1e.
The number (0 versus 1) indicates the history
size used in the compression -- there's no
point in using 1 or 1e unless you also
increase --approx-block-size. The presence of
the "e" turns on "extreme" mode, which is
several times slower, but may produce
substantially smaller files. (Default: 0e)
--no-default-metadata By default, 'zs make' adds an extra "build-info"
key to the metadata, recording the time, host,
user who created the file, and zs library
version. This option disables this behaviour.
zs info¶
zs info
displays some general information about a ZS file. For example:
$ zs info tiny-4grams.zs
{
"root_index_offset": 380,
"root_index_length": 41,
"total_file_length": 421,
"codec": "deflate",
"data_sha256": "403b706aa1f8f5d1d2ffd2765507239bd5a5025bde3f89df8035f8a5b9348b11",
"metadata": {
"corpus": "doc-example",
"build-info": {
"host": "branna.vorpus.org",
"time": "2014-04-29T12:41:59.660529Z",
"version": "zs 0.9.0",
"user": "njs"
}
},
"statistics": {
"root_index_level": 1
}
}
The most interesting part of this output might be the "metadata"
field, which contains arbitrary metadata describing the file. Here we
see that our custom key was indeed added, and that zs make
also
added some default metadata. (If we wanted to suppress this we could
have used the --no-default-metadata
option.) The "data_sha256"
field is, as you might expect, a SHA-256 hash of the data contained
in this file – two ZS files will have the same value here if and
only if they contain exactly the same logical records, regardless of
compression and other details of physical file layout. The "codec"
field tells us which kind of compression was used. The other fields
have to do with more obscure technical
aspects of the ZS file format; see the documentation for the
ZS
class and the file format specification
for details.
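That layout-independence claim is worth unpacking: "data_sha256" is computed over the logical records rather than the compressed bytes, so re-encoding a file with a different codec or block size leaves it unchanged. The exact encoding that ZS hashes is defined in the file format specification; purely as an illustration of the idea (this sketch is ours, not the actual ZS scheme), a layout-independent record hash might look like:

```python
import hashlib
import struct

def records_digest(records):
    """Hash a sequence of byte-string records in a way that depends only
    on the records themselves, not on how they happen to be grouped into
    compressed blocks on disk."""
    h = hashlib.sha256()
    for record in records:
        # Length-prefix each record so that record boundaries are
        # unambiguous: [b"ab", b"c"] must hash differently from
        # [b"a", b"bc"], even though the concatenated bytes are equal.
        h.update(struct.pack("<Q", len(record)))
        h.update(record)
    return h.hexdigest()

assert records_digest([b"ab", b"c"]) != records_digest([b"a", b"bc"])
```

Because the digest never sees block boundaries or compression, any two encodings of the same record sequence agree on it.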
zs info
is fast, even on arbitrarily large files, because it
looks at only the header and the root index; it doesn’t have to
uncompress the actual data. If you find a large ZS file on the web
and want to see its metadata before downloading it, you can pass an
HTTP URL to zs info
directly on the command line, and it will
download only as much of the file as it needs to.
zs info
doesn’t take many options:
$ zs info --help
Display general information from a .zs file's header.
Usage:
zs info [--metadata-only] [--] <zs_file>
zs info --help
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Options:
-m, --metadata-only Output only the file's metadata, not any general
information about it.
Output will be valid JSON.
zs dump¶
So zs info
tells us about the contents of a ZS file, but how
do we get our data back out? That’s the job of zs dump
. In the
simplest case, it simply dumps the whole file to standard output, with
one record per line – the inverse of zs make
. For example, this
lets us “uncompress” our ZS file to recover the original file:
$ zs dump tiny-4grams.zs
not done explicitly . 42
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
not done fast , 52
not done fast enough 71
But we can also extract just a subset of the data. For example, we can
pull out a single line (notice the use of \t
to specify a tab
character – Python-style backslash character sequences are fully
supported):
$ zs dump tiny-4grams.zs --prefix="not done extensive testing\t"
not done extensive testing 749
Or a set of related ngrams:
$ zs dump tiny-4grams.zs --prefix="not done extensive "
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
Or any arbitrary range:
$ zs dump tiny-4grams.zs --start="not done ext" --stop="not done fast"
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
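Under the hood, these three query styles are all the same operation: a binary search for a contiguous span of sorted records. Purely as an illustration of the semantics (not of how zs implements it), here is the equivalent over an in-memory list using the stdlib bisect module:

```python
import bisect

# The tiny-4grams records, already in ASCIIbetical order:
RECORDS = [
    b"not done explicitly .\t42",
    b"not done extensive research\t225",
    b"not done extensive testing\t749",
    b"not done extensive tests\t87",
    b"not done extremely well\t41",
    b"not done fairly .\t61",
    b"not done fast ,\t52",
    b"not done fast enough\t71",
]

def range_query(records, start, stop):
    """All records r with start <= r < stop, via two binary searches."""
    lo = bisect.bisect_left(records, start)
    hi = bisect.bisect_left(records, stop)
    return records[lo:hi]

def prefix_query(records, prefix):
    """A prefix search is just the range [prefix, prefix-with-its-last-
    byte-incremented). (This simple trick breaks if the last byte is
    0xff, which can't happen for ordinary text prefixes.)"""
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return range_query(records, prefix, stop)

# Matches the three zs dump examples above:
assert prefix_query(RECORDS, b"not done extensive ") == RECORDS[1:4]
assert range_query(RECORDS, b"not done ext", b"not done fast") == RECORDS[1:6]
```

The difference in a real ZS file is only that the binary search runs over the on-disk index blocks instead of an in-memory list.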
Just like zs info
, zs dump
is fast – it reads only the data
it needs to satisfy your query. (Of course, if you request the
whole file, then it will read the whole file – but it does this in an
optimized way; see the -j
option if you want to tune how many CPUs
it uses for decompression.) And just like zs info
, zs dump
can directly take an HTTP URL on the command line, and will download
only as much data as it has to.
We also have several options to let us control the output format. ZS files allow records to contain arbitrary data, which means that it’s possible to have a record that contains a newline embedded in it. So we might prefer to use some other character to mark the ends of records, like NUL:
$ zs dump tiny-4grams.zs --terminator="\x00"
...but putting the output from that into these docs would be hard to read. Instead we’ll demonstrate with something sillier:
$ zs dump tiny-4grams.zs --terminator="XYZZY" --prefix="not done extensive "
not done extensive research 225XYZZYnot done extensive testing 749XYZZYnot done extensive tests 87XYZZY
Of course, this will still have a problem if any of our records
contained the string “XYZZY” – in fact, our records could in theory
contain anything we might choose to use as a terminator, so if we
have an arbitrary ZS file whose contents we know nothing about, then
none of the options we’ve seen so far is guaranteed to work. The
safest approach is to instead use a format in which each record is
explicitly prefixed by its length. zs dump
can produce
length-prefixed output with lengths encoded in either u64le or uleb128
format (see Integer representations for details about what
these are).
$ zs dump tiny-4grams.zs --prefix="not done extensive " --length-prefixed=u64le | hd
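The framing itself is easy to consume from another program. As an illustration (these helper functions are ours, not part of the zs package), a Python consumer might parse both framings like this:

```python
import io
import struct

def read_u64le_records(stream):
    """Yield records from a stream where each record is prefixed by its
    length as an unsigned little-endian 64-bit integer."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        yield stream.read(length)

def read_uleb128_records(stream):
    """Yield records from a stream where each record is prefixed by its
    length in unsigned LEB128: 7 bits per byte, low-order group first,
    with the high bit set on every byte except the last."""
    while True:
        byte = stream.read(1)
        if not byte:
            return
        length, shift = 0, 0
        while byte[0] & 0x80:
            length |= (byte[0] & 0x7F) << shift
            shift += 7
            byte = stream.read(1)
        length |= byte[0] << shift
        yield stream.read(length)

# A two-record stream in u64le framing:
data = struct.pack("<Q", 5) + b"hello" + struct.pack("<Q", 2) + b"hi"
assert list(read_u64le_records(io.BytesIO(data))) == [b"hello", b"hi"]
```

Feeding these functions from a pipe attached to zs dump --length-prefixed=... recovers the original records byte-for-byte, newlines and all.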
Obviously this is mostly intended for when you want to read the data into another program. For example, if you had a ZS file that was compressed using the lzma codec and you wanted to convert it to the deflate codec, the easiest and safest way to do that is with a command like:
$ zs dump --length-prefixed=uleb128 myfile-lzma.zs | \
zs make --length-prefixed=uleb128 --codec=deflate \
"$(zs info -m myfile-lzma.zs)" - myfile-deflate.zs
If you’re using Python, of course, the most convenient way to read a
ZS file into your program is not to use zs dump
at all, but to use
the zs
library API directly.
Full options:
$ zs dump --help
Unpack some or all of the contents of a .zs file.
Usage:
zs dump <zs_file>
zs dump [--start=START] [--stop=STOP] [--prefix=PREFIX]
[--terminator=TERMINATOR | --length-prefixed=TYPE]
[-j PARALLELISM]
[-o FILE]
[--] <zs_file>
zs dump --help
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Selection options:
--start=START Output only records which are >= START.
--stop=STOP Do not output any records which are >= STOP.
--prefix=PREFIX Output only records which begin with PREFIX.
Python string escapes (e.g., "\n", "\x00") are allowed. All comparisons
are performed using ASCIIbetical ordering.
Processing options:
-j PARALLELISM The number of CPUs to use for decompression. Note
that if you know that you are only reading a small
number of records, then -j0 may be the fastest
option, since it reduces startup overhead.
[default: guess]
Output options:
-o FILE, --output=FILE Output to the given file, or "-" for stdout.
[default: -]
Record framing options:
--terminator=TERMINATOR String used to terminate records in output. Python
string escapes are allowed (e.g., "\n", "\x00").
[default: \n]
--length-prefixed=TYPE Instead of terminating records with a marker,
prefix each record with its length, encoded as
TYPE. (Options: uleb128, u64le)
ZS files are organized as a collection of records, which may contain
arbitrary data. By default, these are output as individual lines. However,
this may not be a great idea if you have records which themselves contain
newline characters. As an alternative, you can request that they instead be
terminated by some arbitrary string, or else request that each record be
prefixed by its length, encoded in either unsigned little-endian base-128
(uleb128) format or unsigned little-endian 64-bit (u64le) format.
Warning
Due to limitations in the multiprocessing module in
Python 2, zs dump
can be poorly behaved if you hit control-C
(e.g., refusing to exit).
On a Unix-like platform, if you have a zs dump
that is ignoring
control-C, then try hitting control-Z and then running
kill %zs.
The easy workaround to this problem is to use Python 3 to run
zs
. The not so easy workaround is to implement a custom process
pool manager for Python 2 – patches accepted!
zs validate¶
This command can be used to fully validate a ZS file for self-consistency and compliance with the specification (see On-disk layout of ZS files); this makes it rather useful to anyone trying to write new software to generate ZS files.
It is also useful because it verifies the SHA-256 checksum and all of
the per-block checksums, providing extremely strong protection against
errors caused by disk failures, cosmic rays, and other such
annoyances. However, this is not usually necessary, since the zs
commands and the zs
library interface never return any data
unless it passes a 64-bit checksum. With ZS you can be sure that your
results have not been corrupted by hardware errors, even if you never
run zs validate
at all.
Full options:
$ zs validate --help
Check a .zs file for errors or data corruption.
Usage:
zs validate [-j PARALLELISM] [--] <zs_file>
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Options:
-j PARALLELISM The number of CPUs to use for decompression.
[default: guess]
The zs library for Python¶
Quickstart¶
Using the example file we created when demonstrating zs make, we can write:
In [1]: from zs import ZS
In [2]: z = ZS("example/tiny-4grams.zs")
In [3]: for record in z:
...: print(record.decode("utf-8"))
...:
not done explicitly . 42
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
not done fast , 52
not done fast enough 71
# Notice that on Python 3.x, we search using byte strings, and we get
# byte strings back.
# (On Python 2.x, byte strings are the same as regular strings.)
In [4]: for record in z.search(prefix=b"not done extensive testing\t"):
...: print(record.decode("utf-8"))
...: