marisa-trie
¶
Static memory-efficient Trie-like structures for Python (2.7 and 3.4+) based on marisa-trie C++ library.
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Note
There are official SWIG-based Python bindings included in C++ library distribution; this package provides alternative Cython-based pip-installable Python bindings.
Installation¶
pip install marisa-trie
Current limitations¶
- The library is not tested with mingw32 compiler;
.prefixes()
method ofBytesTrie
andRecordTrie
is quite slow and doesn’t have iterator counterpart;read()
andwrite()
methods don’t work with file-like objects (they work only with real files; pickling works fine for file-like objects);- there are
keys()
anditems()
methods but novalues()
method.
License¶
Wrapper code is licensed under MIT License.
Bundled marisa-trie C++ library is dual-licensed under LGPL and BSD 2-clause license.
Tutorial¶
Tries¶
There are several trie classes in this package:
marisa_trie.BinaryTrie |
|
marisa_trie.Trie |
|
marisa_trie.RecordTrie |
|
marisa_trie.BytesTrie |
marisa_trie.Trie¶
Create a new trie from a list of keys:
>>> import marisa_trie
>>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
Check if a key is present:
>>> u'key1' in trie
True
>>> u'key20' in trie
False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys in a trie:
>>> trie[u'key2']
1
Note that you can’t assign a value to a marisa_trie.Trie
key,
but can use the returned ID to store values in a separate data structure
(e.g. in a Python list or NumPy array).
An ID can be mapped back to the corresponding key:
>>> trie.restore_key(1)
u'key2'
Query a trie
Find all trie keys which are prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
Find all trie keys which start with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
The latter is complemented by
items()
which returns all matching(key, ID)
pairs.
All query methods have generator-based versions prefixed with iter
.
Note
If you’re looking for a trie with bytes keys, check out
BinaryTrie
.
marisa_trie.RecordTrie¶
Create a new trie from a list of (key, data)
pairs:
>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [(1, 2), (2, 1), (3, 3), (2, 1)]
>>> fmt = "<HH" # two short integers.
>>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
Each data tuple would be converted to bytes using struct.pack()
. Take a
look at available format strings
here.
Check if a key is present:
>>> u'foo' in trie
True
>>> u'spam' in trie
False
marisa_trie.RecordTrie
allows duplicate keys. Therefore __getitem__
and
get
return a list of values.
>>> trie[u'bar']
[(2, 1)]
>>> trie[u'foo']
[(1, 2), (2, 1)]
>>> trie.get(u'bar', 123)
[(2, 1)]
>>> trie.get(u'BAAR', 123) # default value.
123
Similarly, keys()
and
items()
take into account key multiplicities:
>> trie.keys(u'fo')
[u'foo', u'foo', u'foobar']
>> trie.items(u'fo')
[(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]
marisa_trie.BytesTrie¶
BytesTrie
is similar to RecordTrie
, but the values are raw bytes,
not tuples:
>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2']
>>> trie = marisa_trie.BytesTrie(zip(keys, values))
>>> trie[u'bar']
[b'bar-value']
Persistence¶
Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.
Save trie to a file:
>>> trie.save('my_trie.marisa')
Load trie from a file:
>>> trie2 = marisa_trie.Trie()
>>> trie2.load('my_trie.marisa')
Note
You may also build a trie using marisa-build
command-line
utility (provided by underlying C++ library; it should be
downloaded and compiled separately) and then load the trie
from the resulting file using load
.
Trie objects are picklable:
>>> import pickle
>>> data = pickle.dumps(trie)
>>> trie3 = pickle.loads(data)
Memory mapped I/O¶
It is possible to use memory mapped file as data source:
>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
This way the whole dictionary won’t be loaded fully to memory; memory mapped I/O is an easy way to share dictionary data among processes.
Warning
Memory mapped trie might cause lots of random disk accesses which considerably increases the search time.
Storage options¶
marisa-trie C++ library provides some configuration options for trie storage; See “Enumeration Constants” section in the library docs.
These options are exposed as order
, num_tries
, cache_size
and binary
keyword arguments for trie constructors.
For example, set order
to marisa_trie.LABEL_ORDER
in order to
make trie functions return results in alphabetical oder:
>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
Note that two tries constructed from identical data but with different order
arguments will compare unequal:
>>> t1 = marisa_trie.Trie(order=marisa_trie.LABEL_ORDER)
>>> t2 = marisa_trie.Trie(order=marisa_trie.WEIGHT_ORDER)
>>> t1 == t2
False
Benchmarks¶
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
- dict(unicode words -> word lenghts): about 600M
- list(unicode words) : about 300M
- BaseTrie from datrie library: about 70M
marisa_trie.RecordTrie
: 11Mmarisa_trie.Trie
: 7M
Note
Lengths of words were stored as values in datrie.BaseTrie
and marisa_trie.RecordTrie
. RecordTrie
compresses
similar values and the key compression is better so it uses
much less memory than datrie.BaseTrie
.
marisa_trie.Trie
provides auto-assigned IDs. It is not possible
to store arbitrary values in marisa_trie.Trie
so it uses less
memory than RecordTrie
.
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict building 2.919M words/sec
Trie building 0.394M words/sec
BytesTrie building 0.355M words/sec
RecordTrie building 0.354M words/sec
dict __getitem__ (hits) 8.239M ops/sec
Trie __getitem__ (hits) not supported
BytesTrie __getitem__ (hits) 0.498M ops/sec
RecordTrie __getitem__ (hits) 0.404M ops/sec
dict get() (hits) 4.410M ops/sec
Trie get() (hits) not supported
BytesTrie get() (hits) 0.458M ops/sec
RecordTrie get() (hits) 0.364M ops/sec
dict get() (misses) 4.869M ops/sec
Trie get() (misses) not supported
BytesTrie get() (misses) 0.849M ops/sec
RecordTrie get() (misses) 0.816M ops/sec
dict __contains__ (hits) 8.053M ops/sec
Trie __contains__ (hits) 1.018M ops/sec
BytesTrie __contains__ (hits) 0.605M ops/sec
RecordTrie __contains__ (hits) 0.618M ops/sec
dict __contains__ (misses) 6.489M ops/sec
Trie __contains__ (misses) 2.047M ops/sec
BytesTrie __contains__ (misses) 1.079M ops/sec
RecordTrie __contains__ (misses) 1.123M ops/sec
dict items() 57.248 ops/sec
Trie items() not supported
BytesTrie items() 11.691 ops/sec
RecordTrie items() 8.369 ops/sec
dict keys() 217.920 ops/sec
Trie keys() 19.589 ops/sec
BytesTrie keys() 14.849 ops/sec
RecordTrie keys() 15.369 ops/sec
Trie.prefixes (hits) 0.594M ops/sec
Trie.prefixes (mixed) 1.874M ops/sec
Trie.prefixes (misses) 1.447M ops/sec
RecordTrie.prefixes (hits) 0.103M ops/sec
RecordTrie.prefixes (mixed) 0.458M ops/sec
RecordTrie.prefixes (misses) 0.164M ops/sec
Trie.iter_prefixes (hits) 0.588M ops/sec
Trie.iter_prefixes (mixed) 1.470M ops/sec
Trie.iter_prefixes (misses) 1.170M ops/sec
Trie.keys(prefix="xxx"), avg_len(res)==415 5.044K ops/sec
Trie.keys(prefix="xxxxx"), avg_len(res)==17 89.363K ops/sec
Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3 258.732K ops/sec
Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 293.199K ops/sec
Trie.keys(prefix="xxx"), NON_EXISTING 1169.524K ops/sec
RecordTrie.keys(prefix="xxx"), avg_len(res)==415 3.836K ops/sec
RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17 73.591K ops/sec
RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3 229.515K ops/sec
RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 269.228K ops/sec
RecordTrie.keys(prefix="xxx"), NON_EXISTING 1071.433K ops/sec
Tries from marisa_trie
are static and uses less memory, tries from
datrie are faster and can be updated.
You may also give DAWG a try - it is usually faster than
marisa-trie
and sometimes can use less memory (depending on data).
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.
Contributing¶
Contributions are welcome! Development happens at GitHub. Feel free to submit ideas, bug reports and pull requests.
If you found a bug in a C++ part please report it to the original bug tracker.
CHANGES¶
0.7.5 (2018-04-10)¶
- Removed redundant
DeprecationWarning
messages inTrie.save
andTrie.load
. - Dropped support for Python 2.6.
- Rebuild Cython wrapper with Cython 0.28.1.
0.7.4 (2017-03-27)¶
- Fixed packaging issue,
MANIFEST.in
was not updated afterlibmarisa-trie
became a submodule.
0.7.3 (2017-02-14)¶
- Added
BinaryTrie
for storing arbitrary sequences of bytes, e.g. IP addresses (thanks Tomasz Melcer); - Deprecated
Trie.has_keys_with_prefix
which can be trivially implemented in terms ofTrie.iterkeys
; - Deprecated
Trie.read
andTrie.write
which onlywork for “real” files and duplicate the functionality ofload
andsave
. See issue #31 on GitHub; - Updated
libmarisa-trie
to the latest version. Yay, 64-bit Windows support. - Rebuilt Cython wrapper with Cython 0.25.2.
0.7.2 (2015-04-21)¶
- packaging issue is fixed.
0.7.1 (2015-04-21)¶
- setup.py is switched to setuptools;
- a tiny speedup;
- wrapper is rebuilt with Cython 0.22.
0.7 (2014-12-15)¶
trie1 == trie2
andtrie1 != trie2
now work (thanks Sergei Lebedev);for key in trie:
is fixed (thanks Sergei Lebedev);- wrapper is rebuilt with Cython 0.21.1 (thanks Sergei Lebedev);
- https://bitbucket.org/kmike/marisa-trie repo is no longer supported.
0.6 (2014-02-22)¶
- New
Trie
methods:__getitem__
,get
,items
,iteritems
.trie[u'key']
is now the same astrie.key_id(u'key')
. - small optimization for
BytesTrie.get
. - wrapper is rebuilt with Cython 0.20.1.
0.5.3 (2014-02-08)¶
- small
Trie.restore_key
optimization (it should work 5-15% faster)
0.5.2 (2014-02-08)¶
- fix
Trie.restore_key
method - it was reading past declared string length; - rebuild wrapper with Cython 0.20.
0.5.1 (2013-10-03)¶
has_keys_with_prefix(prefix)
method (thanks Matt Hickford)
0.5 (2013-05-07)¶
BytesTrie.iterkeys
,BytesTrie.iteritems
,RecordTrie.iterkeys
andRecordTrie.iteritems
methods;- wrapper is rebuilt with Cython 0.19;
value_separator
parameter forBytesTrie
andRecordTrie
.
0.4 (2013-02-28)¶
- improved trie building:
weights
optional parameter; - improved trie building: unnecessary input sorting is removed;
- wrapper is rebuilt with Cython 0.18;
- bundled marisa-trie C++ library is updated to svn r133.
0.3.8 (2013-01-03)¶
- Rebuild wrapper with Cython pre-0.18;
- update benchmarks.
0.3.7 (2012-09-21)¶
- Update bundled marisa-trie C++ library (this may fix more mingw issues);
- Python 3.3 support is back.
0.3.6 (2012-09-05)¶
- much faster (3x-7x)
.items()
and.keys()
methods for all tries; faster (up to 3x).prefixes()
method forTrie
.
0.3.5 (2012-08-30)¶
- Pickling of RecordTrie is fixed (thanks lazarou for the report);
- error messages should become more useful.
0.3.4 (2012-08-29)¶
- Issues with mingw32 should be resolved (thanks Susumu Yata).
0.3.3 (2012-08-27)¶
.get(key, default=None)
method forBytesTrie
andRecordTrie
;- small README improvements.
0.3.2 (2012-08-26)¶
- Small code cleanup;
load
,read
andmmap
methods returns ‘self’;- I can’t run tests (via tox) under Python 3.3 so it is removed from supported versions for now.
0.3.1 (2012-08-23)¶
.prefixes()
support for RecordTrie and BytesTrie.
0.3 (2012-08-23)¶
- RecordTrie and BytesTrie are introduced;
- IntTrie class is removed (probably temporary?);
- dumps/loads methods are renamed to tobytes/frombytes;
- benchmark & tests improvements;
- support for MARISA-trie config options is added.
0.2 (2012-08-19)¶
- Pickling/unpickling support;
- dumps/loads methods;
- python 3.3 workaround;
- improved tests;
- benchmarks.
0.1 (2012-08-17)¶
Initial release.