pyelasticsearch¶
pyelasticsearch is a clean, future-proof, high-scale API to elasticsearch. It provides…
- Transparent conversion of Python data types to and from JSON, including datetimes and the arbitrary-precision Decimal type
- Translation of HTTP failure status codes into exceptions
- Connection pooling
- HTTP basic auth and HTTPS support
- Load balancing across nodes in a cluster
- Failed-node marking to avoid downed nodes for a period
- Optional automatic retrying of failed requests
- Thread safety
- Loosely coupled design, letting you customize things like JSON encoding and bulk indexing
For more on our philosophy and history, see Comparison with elasticsearch-py, the “Official Client”.
A Taste of the API¶
Make a pooling, balancing, all-singing, all-dancing connection object:
>>> from pyelasticsearch import ElasticSearch
>>> es = ElasticSearch('http://localhost:9200/')
Index a document:
>>> es.index('contacts',
... 'person',
... {'name': 'Joe Tester', 'age': 25, 'title': 'QA Master'},
... id=1)
{u'_type': u'person', u'_id': u'1', u'ok': True, u'_version': 1, u'_index': u'contacts'}
Index a couple more documents, this time in a single request using the bulk-indexing API:
>>> docs = [{'id': 2, 'name': 'Jessica Coder', 'age': 32, 'title': 'Programmer'},
... {'id': 3, 'name': 'Freddy Tester', 'age': 29, 'title': 'Office Assistant'}]
>>> es.bulk((es.index_op(doc, id=doc.pop('id')) for doc in docs),
... index='contacts',
... doc_type='person')
If we had many documents and wanted to chunk them for performance, bulk_chunks() would easily rise to the task, dividing either at a certain number of documents per batch or, for curated platforms like Google App Engine, at a certain number of bytes. Thanks to the decoupled design, you can even substitute your own batching function if you have unusual needs. Bulk indexing is the most demanding ES task in most applications, so we provide very thorough tools for representing operations, optimizing wire traffic, and dealing with errors. See bulk() for more.
Refresh the index to pick up the latest:
>>> es.refresh('contacts')
{u'ok': True, u'_shards': {u'successful': 5, u'failed': 0, u'total': 10}}
Get just Jessica’s document:
>>> es.get('contacts', 'person', 2)
{u'_id': u'2',
u'_index': u'contacts',
u'_source': {u'age': 32, u'name': u'Jessica Coder', u'title': u'Programmer'},
u'_type': u'person',
u'_version': 1,
u'exists': True}
Perform a simple search:
>>> es.search('name:joe OR name:freddy', index='contacts')
{u'_shards': {u'failed': 0, u'successful': 42, u'total': 42},
u'hits': {u'hits': [{u'_id': u'1',
u'_index': u'contacts',
u'_score': 0.028130024999999999,
u'_source': {u'age': 25,
u'name': u'Joe Tester',
u'title': u'QA Master'},
u'_type': u'person'},
{u'_id': u'3',
u'_index': u'contacts',
u'_score': 0.028130024999999999,
u'_source': {u'age': 29,
u'name': u'Freddy Tester',
u'title': u'Office Assistant'},
u'_type': u'person'}],
u'max_score': 0.028130024999999999,
u'total': 2},
u'timed_out': False,
u'took': 4}
Perform a search using the elasticsearch query DSL:
>>> query = {
... 'query': {
... 'filtered': {
... 'query': {
... 'query_string': {'query': 'name:tester'}
... },
... 'filter': {
... 'range': {
... 'age': {
... 'from': 27,
... 'to': 37,
... },
... },
... },
... },
... },
... }
>>> es.search(query, index='contacts')
{u'_shards': {u'failed': 0, u'successful': 42, u'total': 42},
u'hits': {u'hits': [{u'_id': u'3',
u'_index': u'contacts',
u'_score': 0.19178301,
u'_source': {u'age': 29,
u'name': u'Freddy Tester',
u'title': u'Office Assistant'},
u'_type': u'person'}],
u'max_score': 0.19178301,
u'total': 1},
u'timed_out': False,
u'took': 2}
Delete the index:
>>> es.delete_index('contacts')
{u'acknowledged': True, u'ok': True}
For more, see the full API Documentation.
Features¶
JSON Conversion¶
pyelasticsearch converts transparently between Python data structures and JSON.
- In request bodies, all the standard conversions are made: strings, numeric types, nulls, etc.
- We convert datetime and date instances to the format ES understands: 2012-02-23T14:26:01. date objects are taken to represent midnight on their day.
- Python sets are converted to ES lists.
You can customize JSON conversion by setting the json_encoder attribute on an ElasticSearch object.
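The conversions listed above can be sketched with the standard library alone. This is a minimal illustration built on json.JSONEncoder, not the library's own JsonEncoder; the class name SketchEncoder is hypothetical, and dropping microseconds is an assumption made to match the format shown above.

```python
import json
from datetime import date, datetime


class SketchEncoder(json.JSONEncoder):
    """Hypothetical stand-in illustrating the conversions described above."""

    def default(self, value):
        # datetime must be checked before date: datetime is a subclass of date.
        if isinstance(value, datetime):
            # Assumption: drop microseconds to match 2012-02-23T14:26:01.
            return value.replace(microsecond=0).isoformat()
        if isinstance(value, date):
            # A bare date is taken to mean midnight on that day.
            return datetime(value.year, value.month, value.day).isoformat()
        if isinstance(value, (set, frozenset)):
            # Sets become lists; sorting only makes the output deterministic.
            return sorted(value)
        return super().default(value)


doc = {
    'seen': datetime(2012, 2, 23, 14, 26, 1),
    'born': date(2012, 2, 23),
    'tags': {'b', 'a'},
}
encoded = json.dumps(doc, cls=SketchEncoder, sort_keys=True)
decoded = json.loads(encoded)
```

A subclass along these lines is the kind of object one would assign to the json_encoder attribute, though the exact hooks the stock encoder expects are not shown here.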
Connection Pooling¶
Connection pooling saves setting up a whole new TCP connection for each ES request, dropping latency by an order of magnitude. The ElasticSearch object is thread-safe; to take best advantage of connection pooling, create one instance, and share it among all threads. At most, the object will hold a number of connections to each node equal to the number of threads.
Load-balancing and Failover¶
An ElasticSearch object can take a list of node URLs on construction. This lets us balance load and maintain availability when nodes go down: pyelasticsearch will randomly choose a server URL for each request. If a node fails to respond before a timeout period elapses, it is assumed down and not tried again for a while. Meanwhile, pyelasticsearch will retry the request on a different node if max_retries was set to something greater than zero at construction. If all nodes are marked as down, pyelasticsearch will loosen its standards and try sending requests to them, marking them alive if they respond.
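The behavior described above can be sketched roughly as follows. This is an illustration of the idea, not pyelasticsearch's actual implementation: NodePool, send_with_retries, flaky_send, and the dead_seconds cool-off are all hypothetical names and values.

```python
import random
import time


class NodePool:
    """Hypothetical sketch of random node choice with failed-node marking."""

    def __init__(self, urls, dead_seconds=60, clock=time.monotonic):
        self.urls = list(urls)
        self.dead_seconds = dead_seconds  # assumed cool-off period
        self.clock = clock
        self.dead_until = {}  # url -> time at which it may be tried again

    def live_urls(self):
        now = self.clock()
        return [u for u in self.urls if self.dead_until.get(u, 0) <= now]

    def choose(self):
        # Prefer live nodes; if every node is marked down, try them all anyway.
        candidates = self.live_urls() or self.urls
        return random.choice(candidates)

    def mark_dead(self, url):
        self.dead_until[url] = self.clock() + self.dead_seconds

    def mark_live(self, url):
        self.dead_until.pop(url, None)


def send_with_retries(pool, send, max_retries=0):
    """Call send(url), moving on to another node after each failure."""
    last_error = None
    for _ in range(max_retries + 1):
        url = pool.choose()
        try:
            result = send(url)
        except ConnectionError as exc:
            pool.mark_dead(url)
            last_error = exc
        else:
            pool.mark_live(url)
            return result
    raise last_error


def flaky_send(url):
    # Stand-in for a real HTTP request: node "a" is always down.
    if url == 'http://a:9200':
        raise ConnectionError('node is down')
    return 'ok from ' + url


pool = NodePool(['http://a:9200', 'http://b:9200'], clock=lambda: 100.0)
result = send_with_retries(pool, flaky_send, max_retries=1)
```

With one retry allowed, the request succeeds whichever node is picked first, because a failed node is skipped on the second attempt.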
Forward-Compatibility Kwargs¶
All methods that correspond to ES calls take an arbitrary set of kwargs that can be used to pass query string parameters directly to ES. Certain kwargs (called out by the @es_kwargs decorator) are explicitly recognized as being claimed by ES and will never be trod upon by future versions of pyelasticsearch. To avoid conflicts, kwargs not yet so recognized should have “es_” prepended by the caller. pyelasticsearch will strip off the “es_” and pass the rest along to ES unscathed. Ideally, we’ll then add explicit recognition of those args in a future release.
These “pass-through” kwargs are converted to text as follows:
- Bools: True becomes “true”, and False becomes “false”.
- Strings: Passed through unchanged.
- Ints, longs, and floats: Converted to strings via str().
- Lists and tuples: Joined with commas, e.g. ['one-index', 'two-index'] becomes one-index,two-index.
- Datetimes and dates: Datetimes are converted to ISO strings, like 2001-12-25T13:04:56; dates convert to midnight: 2001-12-25T00:00:00.
Anything else raises a TypeError.
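The conversion rules above, together with the “es_” stripping, can be sketched like this. The function names to_query_text and strip_es_prefix are hypothetical, the code targets Python 3 (so there is no separate long type), and converting list elements one by one before joining is an assumption.

```python
from datetime import date, datetime


def to_query_text(value):
    """Convert one pass-through kwarg value to text, per the rules above."""
    # bool must be tested before int: True is an instance of int in Python.
    if isinstance(value, bool):
        return 'true' if value else 'false'
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, (list, tuple)):
        # Assumption: each element is converted before joining with commas.
        return ','.join(to_query_text(v) for v in value)
    # datetime must be tested before date: datetime is a subclass of date.
    if isinstance(value, datetime):
        return value.replace(microsecond=0).isoformat()
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day).isoformat()
    raise TypeError('cannot convert %r to a query string value' % (value,))


def strip_es_prefix(kwargs):
    """Drop a leading "es_" from each key and convert each value to text."""
    params = {}
    for key, value in kwargs.items():
        name = key[3:] if key.startswith('es_') else key
        params[name] = to_query_text(value)
    return params


params = strip_es_prefix({
    'es_from': 20,
    'refresh': True,
    'index': ['one-index', 'two-index'],
})
```

Here the caller-supplied es_from ends up as the from query string parameter, which is how a name not yet recognized by the library reaches ES.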
API Documentation¶
A Word About Forward-Compatibility Kwargs¶
In the following documentation, the phrase “other kwargs listed below” refers to the kwargs documented in a subsequent Parameters section. However, it also implicitly includes any kwargs the caller might care to make up and have passed to ES as query string parameters. These kwargs must start with es_ for forward compatibility and will be unprefixed and converted to strings as discussed in Features.
ElasticSearch Class¶
Unless otherwise indicated, methods return the JSON-decoded response sent by elasticsearch. This way, you don’t lose any part of the return value, no matter how esoteric. But fear not: if there was an error, an exception will be raised, so it’ll be hard to miss.
-
class
pyelasticsearch.
ElasticSearch
(urls='http://localhost', timeout=60, max_retries=0, port=9200, username=None, password=None, ca_certs='/home/docs/checkouts/readthedocs.org/user_builds/pyelasticsearch/envs/latest/local/lib/python2.7/site-packages/certifi/cacert.pem', client_cert=None)[source]¶ An object which manages connections to elasticsearch and acts as a go-between for API calls to it
This object is thread-safe. You can create one instance and share it among all threads.
Parameters: - urls – A URL or iterable of URLs of ES nodes. These can be full URLs with port numbers, like http://elasticsearch.example.com:9200, or you can pass the port separately using the port kwarg. To do HTTP basic authentication, you can use RFC-2617-style URLs like http://someuser:somepassword@example.com:9200 or the separate username and password kwargs below.
- timeout – Number of seconds to wait for each request before raising Timeout
- max_retries – How many other servers to try, in series, after a request times out or a connection fails
- username – Authentication username to send via HTTP basic auth
- password – Password to use in HTTP basic auth. If a username and password are embedded in a URL, those are favored.
- port – The default port to connect on, for URLs that don’t include an explicit port
- ca_certs – A path to a bundle of CA certificates to trust. The default is to use Mozilla’s bundle, the same one used by Firefox.
- client_cert – A certificate to authenticate the client to the server
-
json_encoder
= <class 'pyelasticsearch.client.JsonEncoder'>¶ You can set this attribute on an instance to customize JSON encoding. The stock JsonEncoder class maps Python datetimes to ES-style datetimes and Python sets to ES lists. You can subclass it to add more.
Bulk Indexing Methods¶
-
class
pyelasticsearch.
ElasticSearch
[source] -
bulk
(actions, index=None, doc_type=None[, other kwargs listed below])[source]¶ Perform multiple index, delete, create, or update actions per request.
Used with helper routines index_op(), delete_op(), and update_op(), this provides an efficient, readable way to do large-scale changes. This contrived example illustrates the structure:
    es.bulk([es.index_op({'title': 'All About Cats', 'pages': 20}),
             es.index_op({'title': 'And Rats', 'pages': 47}),
             es.index_op({'title': 'And Bats', 'pages': 23})],
            doc_type='book',
            index='library')
More often, you’ll want to index (or delete or update) a larger number of documents. In those cases, yield your documents from a generator, and use bulk_chunks() to divide them into multiple requests:
    from pyelasticsearch import bulk_chunks

    def documents():
        for book in books:
            yield es.index_op({'title': book.title, 'pages': book.pages})
            # index_op() also takes kwargs like index= and id= in case
            # you want more control.
            #
            # You could also yield some delete_ops or update_ops here.

    # bulk_chunks() breaks your documents into smaller requests for speed:
    for chunk in bulk_chunks(documents(),
                             docs_per_chunk=500,
                             bytes_per_chunk=10000):
        # We specify a default index and doc type here so we don't
        # have to repeat them in every operation:
        es.bulk(chunk, doc_type='book', index='library')
Parameters: - actions – An iterable of bulk actions, generally the output of bulk_chunks() but sometimes a list of calls to index_op(), delete_op(), and update_op() directly. Specifically, an iterable of JSON-encoded bytestrings that can be joined with newlines and sent to ES.
- index – Default index to operate on
- doc_type – Default type of document to operate on. Cannot be specified without index.
- consistency – See the ES docs.
- refresh – See the ES docs.
- replication – See the ES docs.
- routing – See the ES docs.
- timeout – See the ES docs.
Return the decoded JSON response on success.
Raise BulkError if any of the individual actions fail. The exception provides enough about the failed actions to identify them for retrying.
Sometimes there is an error with the request in general, not with any individual actions. If there is a connection error, timeout, or other transport error, a more general exception will be raised, as with other methods; see Error Handling.
See ES’s bulk API for more detail.
-
index_op
(doc, doc_type=None, overwrite_existing=True, **meta)[source]¶ Return a document-indexing operation that can be passed to bulk(). (See there for examples.)
Specifically, return a 2-line, JSON-encoded bytestring.
Parameters: - doc – A mapping of property names to values.
- doc_type – The type of the document to index, if different from the one you pass to bulk()
- overwrite_existing – Whether we should overwrite existing documents of the same ID and doc type. (If False, this does a create operation.)
- meta – Other args controlling how the document is indexed, like id (most common), index (next most common), version, and routing. See ES’s bulk API for details on these.
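To make the “2-line, JSON-encoded bytestring” concrete, here is a rough sketch of what such an operation might look like on the wire. The function sketch_index_op is hypothetical and is not the library's index_op(); prefixing each meta key with an underscore is an assumption based on the action-line format of the ES bulk API of that era.

```python
import json


def sketch_index_op(doc, doc_type=None, overwrite_existing=True, **meta):
    """Build a two-line bulk action: an action line, then the document."""
    # A create operation refuses to overwrite an existing document.
    operation = 'index' if overwrite_existing else 'create'
    # Assumption: meta args such as id and index become _id and _index.
    action = {'_' + key: value for key, value in meta.items()}
    if doc_type is not None:
        action['_type'] = doc_type
    lines = [
        json.dumps({operation: action}, sort_keys=True),
        json.dumps(doc, sort_keys=True),
    ]
    return '\n'.join(lines).encode('utf-8')


op = sketch_index_op({'title': 'All About Cats'}, id=7, index='library')
action_line, doc_line = op.split(b'\n')
```

The first line tells ES what to do and where; the second carries the document itself, which is why an indexing operation occupies two lines while a delete needs only one.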
-
delete_op
(doc_type=None, **meta)[source]¶ Return a document-deleting operation that can be passed to bulk().
    def actions():
        ...
        yield es.delete_op(id=7)
        yield es.delete_op(id=9,
                           index='some-non-default-index',
                           doc_type='some-non-default-type')
        ...

    es.bulk(actions(), ...)
Specifically, return a JSON-encoded bytestring.
Parameters: - doc_type – The type of the document to delete, if different from the one passed to bulk()
- meta – A description of what document to delete and how to do it. Example: {"index": "library", "id": 2, "version": 4}. See ES’s bulk API for a list of all the options.
-
update_op
(doc=None, doc_type=None, upsert=None, doc_as_upsert=None, script=None, params=None, lang=None, **meta)[source]¶ Return a document-updating operation that can be passed to bulk().
    def actions():
        ...
        yield es.update_op(doc={'pages': 4},
                           id=7,
                           version=21)
        ...

    es.bulk(actions(), ...)
Specifically, return a JSON-encoded bytestring.
Parameters: - doc – A partial document to be merged into the existing document
- doc_type – The type of the document to update, if different from the one passed to bulk()
- upsert – The content for the new document created if the document does not exist
- script – The script to be used to update the document
- params – A dict of the params to be put in scope of the script
- lang – The language of the script. Omit to use the default, specified by script.default_lang.
- meta – Other args controlling what document to update and how to do it, like id, index, and retry_on_conflict, destined for the action line itself rather than the payload. See ES’s bulk API for details on these.
-
bulk_index
(index, doc_type, docs, id_field='id', parent_field='_parent'[, other kwargs listed below])[source]¶ Index a list of documents as efficiently as possible.
Note
This is deprecated in favor of bulk(), which supports all types of bulk actions, not just indexing, is compatible with bulk_chunks() for batching, and has a simpler, more flexible design.
Parameters: - index – The name of the index to which to add the document. Pass None if you will specify indices individually in each doc.
- doc_type – The type of the document
- docs – An iterable of Python mapping objects, convertible to JSON, representing documents to index
- id_field – The field of each document that holds its ID. Removed from document before indexing.
- parent_field – The field of each document that holds its parent ID, if any. Removed from document before indexing.
- index_field – The field of each document that holds the index to put it into, if different from the index arg. Removed from document before indexing.
- type_field – The field of each document that holds the doc type it should become, if different from the doc_type arg. Removed from the document before indexing.
- consistency – See the ES docs.
- refresh – See the ES docs.
- replication – See the ES docs.
- routing – See the ES docs.
- timeout – See the ES docs.
Raise BulkError if the request as a whole succeeded but some of the individual actions failed. You can pull enough about the failed actions out of the exception to identify them for retrying.
See ES’s bulk API for more detail.
-
There’s also a helper function, outside the ElasticSearch class:
-
pyelasticsearch.
bulk_chunks
(actions, docs_per_chunk=300, bytes_per_chunk=None)[source]¶ Return groups of bulk-indexing operations to send to bulk().
Return an iterable of chunks, each of which is a JSON-encoded line or pair of lines in the format understood by ES’s bulk API.
Parameters: - actions – An iterable of bulk actions, JSON-encoded. The best idea is to pass me a list of the outputs from index_op(), delete_op(), and update_op().
- docs_per_chunk – The number of documents (or, more technically, actions) to put in each chunk. Set to None to use only bytes_per_chunk.
- bytes_per_chunk – The maximum number of bytes of HTTP body payload to put in each chunk. Leave at None to use only docs_per_chunk. This option helps prevent timeouts when you have occasional very large documents. Without it, you may get unlucky: several large docs might land in one chunk, and ES might time out.
Chunks are capped by docs_per_chunk or bytes_per_chunk, whichever is reached first. Obviously, we cannot make a chunk smaller than its smallest doc, but we do the best we can. If both docs_per_chunk and bytes_per_chunk are None, all docs end up in one big chunk (and you might as well not use this at all).
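The capping rule described above can be sketched as a small generator. This is not the library's bulk_chunks(); sketch_bulk_chunks is a hypothetical name, and counting one extra byte per action for the trailing newline is an assumption.

```python
def sketch_bulk_chunks(actions, docs_per_chunk=300, bytes_per_chunk=None):
    """Yield lists of actions, capped by count or by total byte size."""
    chunk = []
    chunk_bytes = 0
    for action in actions:
        # Assumption: +1 for the newline that will follow the action.
        size = len(action) + 1
        too_many = docs_per_chunk is not None and len(chunk) >= docs_per_chunk
        too_big = (bytes_per_chunk is not None and
                   chunk_bytes + size > bytes_per_chunk)
        # Never emit an empty chunk: an oversized action still goes out alone.
        if chunk and (too_many or too_big):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(action)
        chunk_bytes += size
    if chunk:
        yield chunk


actions = [b'a' * 10, b'b' * 10, b'c' * 50, b'd' * 10]
by_count = list(sketch_bulk_chunks(actions, docs_per_chunk=3))
by_bytes = list(sketch_bulk_chunks(actions, docs_per_chunk=None,
                                   bytes_per_chunk=30))
```

In the byte-capped run, the 50-byte action exceeds the cap on its own, so it is sent in a chunk by itself rather than dropped, which matches the “we do the best we can” behavior described above.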
Other Methods¶
-
class
pyelasticsearch.
ElasticSearch
[source] -
-
close_index
(index)[source]¶ Close an index.
Parameters: index – The index to close See ES’s close-index API for more detail.
-
cluster_state
(metric='_all', index='_all'[, other kwargs listed below])[source]¶ Return state information about the cluster.
Parameters: - metric – Which metric to return: one of “version”, “master_node”, “nodes”, “routing_table”, “metadata”, or “blocks”, an iterable of them, or a comma-delimited string of them. Defaults to all metrics.
- index – An index or iterable of indexes to return info about
- local – See the ES docs.
See ES’s cluster-state API for more detail.
-
count
(query[, other kwargs listed below])[source]¶ Execute a query against one or more indices and get hit count.
Parameters: - query – A dictionary that will convert to ES’s query DSL or a string that will serve as a textual query to be passed as the q query string parameter
- index – An index or iterable of indexes to search. Omit to search all.
- doc_type – A document type or iterable thereof to search. Omit to search all.
- df – See the ES docs.
- analyzer – See the ES docs.
- default_operator – See the ES docs.
- source – See the ES docs.
- routing – See the ES docs.
See ES’s count API for more detail.
-
create_index
(index, settings=None)[source]¶ Create an index with optional settings.
Parameters: - index – The name of the index to create
- settings – A dictionary of settings
If the index already exists, raise IndexAlreadyExistsError.
See ES’s create-index API for more detail.
-
delete
(index, doc_type, id[, other kwargs listed below])[source]¶ Delete a typed JSON document from a specific index based on its ID.
Parameters: - index – The name of the index from which to delete
- doc_type – The type of the document to delete
- id – The (string or int) ID of the document to delete
- routing – See the ES docs.
- parent – See the ES docs.
- replication – See the ES docs.
- consistency – See the ES docs.
- refresh – See the ES docs.
See ES’s delete API for more detail.
-
delete_all
(index, doc_type[, other kwargs listed below])[source]¶ Delete all documents of the given doc type from an index.
Parameters: - index – The name of the index from which to delete. ES does not support this being empty or “_all” or a comma-delimited list of index names (in 0.19.9).
- doc_type – The name of a document type
- routing – See the ES docs.
- parent – See the ES docs.
- replication – See the ES docs.
- consistency – See the ES docs.
- refresh – See the ES docs.
See ES’s delete API for more detail.
-
delete_by_query
(index, doc_type, query[, other kwargs listed below])[source]¶ Delete typed JSON documents from a specific index based on query.
Parameters: - index – An index or iterable thereof from which to delete
- doc_type – The type of document or iterable thereof to delete
- query – A dictionary that will convert to ES’s query DSL or a string that will serve as a textual query to be passed as the q query string parameter. (Passing the q kwarg yourself is deprecated.)
- q – See the ES docs.
- df – See the ES docs.
- analyzer – See the ES docs.
- default_operator – See the ES docs.
- source – See the ES docs.
- routing – See the ES docs.
- replication – See the ES docs.
- consistency – See the ES docs.
See ES’s delete-by-query API for more detail.
-
delete_index
(index)[source]¶ Delete an index.
Parameters: index – An index or iterable thereof to delete
If the index is not found, raise ElasticHttpNotFoundError.
See ES’s delete-index API for more detail.
-
flush
(index=None[, other kwargs listed below])[source]¶ Flush one or more indices (clear memory).
Parameters: - index – An index or iterable of indexes
- refresh – See the ES docs.
See ES’s flush API for more detail.
-
gateway_snapshot
(index=None)[source]¶ Gateway snapshot one or more indices.
Parameters: index – An index or iterable of indexes See ES’s gateway-snapshot API for more detail.
-
get
(index, doc_type, id[, other kwargs listed below])[source]¶ Get a typed JSON document from an index by ID.
Parameters: - index – The name of the index from which to retrieve
- doc_type – The type of document to get
- id – The ID of the document to retrieve
- realtime – See the ES docs.
- fields – See the ES docs.
- routing – See the ES docs.
- preference – See the ES docs.
- refresh – See the ES docs.
See ES’s get API for more detail.
-
get_mapping
(index=None, doc_type=None)[source]¶ Fetch the mapping definition for a specific index and type.
Parameters: - index – An index or iterable thereof
- doc_type – A document type or iterable thereof
Omit both arguments to get mappings for all types and indexes.
See ES’s get-mapping API for more detail.
-
get_settings
(index[, other kwargs listed below])[source]¶ Get the settings of one or more indexes.
Parameters: index – An index or iterable of indexes See ES’s get-settings API for more detail.
-
health
(index=None[, other kwargs listed below])[source]¶ Report on the health of the cluster or certain indices.
Parameters: - index – The index or iterable of indexes to examine
- level – See the ES docs.
- wait_for_status – See the ES docs.
- wait_for_relocating_shards – See the ES docs.
- wait_for_nodes – See the ES docs.
- timeout – See the ES docs.
See ES’s cluster-health API for more detail.
-
index
(index, doc_type, doc, id=None, overwrite_existing=True[, other kwargs listed below])[source]¶ Put a typed JSON document into a specific index to make it searchable.
Parameters: - index – The name of the index to which to add the document
- doc_type – The type of the document
- doc – A Python mapping object, convertible to JSON, representing the document
- id – The ID to give the document. Leave blank to make one up.
- overwrite_existing – Whether we should overwrite existing documents of the same ID and doc type
- routing – A value hashed to determine which shard this indexing request is routed to
- parent – The ID of a parent document, which leads this document to be routed to the same shard as the parent, unless routing overrides it.
- timestamp – An explicit value for the (typically automatic) timestamp associated with a document, for use with ttl and such
- ttl – The time until this document is automatically removed from the index. Can be an integral number of milliseconds or a duration like ‘1d’.
- percolate – An indication of which percolator queries, registered against this index, should be checked against the new document: ‘*’ or a query string like ‘color:green’
- consistency – An indication of how many active shards the contact node should demand to see in order to let the index operation succeed: ‘one’, ‘quorum’, or ‘all’
- replication – Set to ‘async’ to return from ES before finishing replication.
- refresh – Pass True to refresh the index after adding the document.
- timeout – A duration to wait for the relevant primary shard to become available, in the event that it isn’t: for example, “5m”
- fields – See the ES docs.
See ES’s index API for more detail.
-
more_like_this
(index, doc_type, id, fields, body=''[, other kwargs listed below])[source]¶ Execute a “more like this” search query against one or more fields and get back search hits.
Parameters: - index – The index to search and where the document for comparison lives
- doc_type – The type of document to find others like
- id – The ID of the document to find others like
- mlt_fields – The list of fields to compare on
- body – A dictionary that will convert to ES’s query DSL and be passed as the request body
- search_type – See the ES docs.
- search_indices – See the ES docs.
- search_types – See the ES docs.
- search_scroll – See the ES docs.
- search_size – See the ES docs.
- search_from – See the ES docs.
- like_text – See the ES docs.
- percent_terms_to_match – See the ES docs.
- min_term_freq – See the ES docs.
- max_query_terms – See the ES docs.
- stop_words – See the ES docs.
- min_doc_freq – See the ES docs.
- max_doc_freq – See the ES docs.
- min_word_len – See the ES docs.
- max_word_len – See the ES docs.
- boost_terms – See the ES docs.
- boost – See the ES docs.
- analyzer – See the ES docs.
See ES’s more-like-this API for more detail.
-
multi_get
(ids, index=None, doc_type=None, fields=None[, other kwargs listed below])[source]¶ Get multiple typed JSON documents from ES.
Parameters: - ids – An iterable, each element of which can be either a dict or an id (int or string). IDs are taken to be document IDs. Dicts are passed through the Multi Get API essentially verbatim, except that any missing _type, _index, or fields keys are filled in from the defaults given in the doc_type, index, and fields args.
- index – Default index name from which to retrieve
- doc_type – Default type of document to get
- fields – Default fields to return
See ES’s Multi Get API for more detail.
-
open_index
(index)[source]¶ Open an index.
Parameters: index – The index to open See ES’s open-index API for more detail.
-
optimize
(index=None[, other kwargs listed below])[source]¶ Optimize one or more indices.
Parameters: - index – An index or iterable of indexes
- max_num_segments – See the ES docs.
- only_expunge_deletes – See the ES docs.
- refresh – See the ES docs.
- flush – See the ES docs.
- wait_for_merge – See the ES docs.
See ES’s optimize API for more detail.
-
percolate
(index, doc_type, doc[, other kwargs listed below])[source]¶ Run a JSON document through the registered percolator queries, and return which ones match.
Parameters: - index – The name of the index to which the document pretends to belong
- doc_type – The type the document should be treated as if it has
- doc – A Python mapping object, convertible to JSON, representing the document
- routing – See the ES docs.
- preference – See the ES docs.
- ignore_unavailable – See the ES docs.
- percolate_format – See the ES docs.
Use index() to register percolators. See ES’s percolate API for more detail.
-
put_mapping
(index, doc_type, mapping[, other kwargs listed below])[source]¶ Register specific mapping definition for a specific type against one or more indices.
Parameters: - index – An index or iterable thereof
- doc_type – The document type to set the mapping of
- mapping – A dict representing the mapping to install. For example, this dict can have top-level keys that are the names of doc types.
- ignore_conflicts – See the ES docs.
See ES’s put-mapping API for more detail.
-
refresh
(index=None)[source]¶ Refresh one or more indices.
Parameters: index – An index or iterable of indexes See ES’s refresh API for more detail.
-
search
(query[, other kwargs listed below])[source]¶ Execute a search query against one or more indices and get back search hits.
Parameters: - query – A dictionary that will convert to ES’s query DSL or a string that will serve as a textual query to be passed as the q query string parameter
- index – An index or iterable of indexes to search. Omit to search all.
- doc_type – A document type or iterable thereof to search. Omit to search all.
- size – Limit the number of results to size. Use with es_from to implement paginated searching.
- routing – See the ES docs.
See ES’s search API for more detail.
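As a small illustration of the paginated searching mentioned for size and es_from, the kwargs for a given page could be computed like this. The helper page_kwargs is hypothetical, and zero-based page numbering is a choice made for this sketch.

```python
def page_kwargs(page, per_page=10):
    """Return size and es_from kwargs for a zero-based page number."""
    if page < 0 or per_page < 1:
        raise ValueError('page must be >= 0 and per_page >= 1')
    # es_from is the offset of the first hit on the requested page.
    return {'size': per_page, 'es_from': page * per_page}


third_page = page_kwargs(2, per_page=25)
# Hypothetical use with a connection object named es:
#     es.search('name:tester', index='contacts', **third_page)
```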
-
send_request
(method, path_components, body='', query_params=None)[source]¶ Send an HTTP request to ES, and return the JSON-decoded response.
This is mostly an internal method, but it also comes in handy if you need to use a brand new ES API that isn’t yet explicitly supported by pyelasticsearch, while still taking advantage of our connection pooling and retrying.
Retry the request on different servers if the first one is down and the max_retries constructor arg was > 0.
On failure, raise an ElasticHttpError, a ConnectionError, or a Timeout.
Parameters: - method – An HTTP method, like “GET”
- path_components – An iterable of path components, to be joined by “/”
- body – A map of key/value pairs to be sent as the JSON request body. Alternatively, a string to be sent verbatim, without further JSON encoding.
- query_params – A map of querystring param names to values or None
-
status
(index=None[, other kwargs listed below])[source]¶ Retrieve the status of one or more indices
Parameters: - index – An index or iterable thereof
- recovery – See the ES docs.
- snapshot – See the ES docs.
See ES’s index-status API for more detail.
-
update
(index, doc_type, id, script[, other kwargs listed below])[source]¶ Update an existing document. Raise TypeError if script, doc, and upsert are all unspecified.
Parameters: - index – The name of the index containing the document
- doc_type – The type of the document
- id – The ID of the document
- script – The script to be used to update the document
- params – A dict of the params to be put in scope of the script
- lang – The language of the script. Omit to use the default, specified by script.default_lang.
- doc – A partial document to be merged into the existing document
- upsert – The content for the new document created if the document does not exist
- doc_as_upsert – The provided document will be inserted if the document does not already exist
- routing – See the ES docs.
- parent – See the ES docs.
- timeout – See the ES docs.
- replication – See the ES docs.
- consistency – See the ES docs.
- percolate – See the ES docs.
- refresh – See the ES docs.
- retry_on_conflict – See the ES docs.
- fields – See the ES docs.
See ES’s Update API for more detail.
-
update_aliases
(settings[, other kwargs listed below])[source]¶ Atomically add, remove, or update aliases in bulk.
Parameters: actions – A list of the actions to perform
-
update_all_settings
(settings)[source]¶ Update the settings of all indexes.
Parameters: settings – A dictionary of settings See ES’s update-settings API for more detail.
-
update_settings
(index, settings)[source]¶ Change the settings of one or more indexes.
Parameters: - index – An index or iterable of indexes
- settings – A dictionary of settings
See ES’s update-settings API for more detail.
-
Error Handling¶
Any method representing an ES API call can raise one of the following exceptions:
-
exception
pyelasticsearch.exceptions.
ConnectionError
¶ Exception raised when there is a connection error and we are out of retries. (See the max_retries argument to ElasticSearch.)
-
exception
pyelasticsearch.exceptions.
Timeout
¶ Exception raised when an HTTP request times out and we are out of retries. (See the max_retries argument to ElasticSearch.)
-
exception
pyelasticsearch.exceptions.
BulkError
[source]¶ Exception raised when one or more bulk actions fail
You can extract document IDs from these to retry them.
-
errors
¶ Return a list of actions that failed, in the format emitted by ES:
    {"index" : {
        "_index" : "test",
        "_type" : "type1",
        "_id" : "1",
        "status" : 409,
        "error" : "VersionConflictEngineException[[test][2] [type1][1]: version conflict, current [3], provided [2]]"
      }
    },
    {"update" : {
        "_index" : "index1",
        "_type" : "type1",
        "_id" : "1",
        "status" : 404,
        "error" : "DocumentMissingException[[index1][-1] [type1][1]: document missing]"
      }
    },
    ...
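Given a list in the format above, pulling out the document IDs worth retrying might look like the following sketch. The helper failed_ids is hypothetical, the error strings are abbreviated sample data, and treating only 409 (version conflict) as retryable is an assumption made for illustration.

```python
def failed_ids(errors, retryable_statuses=(409,)):
    """Collect the _id of each failed action whose status is worth retrying."""
    ids = []
    for error in errors:
        # Each entry has a single key naming the operation, e.g. "index".
        for operation, details in error.items():
            if details.get('status') in retryable_statuses:
                ids.append(details['_id'])
    return ids


# Sample data shaped like BulkError.errors, with shortened error messages.
errors = [
    {'index': {'_index': 'test', '_type': 'type1', '_id': '1',
               'status': 409, 'error': 'VersionConflictEngineException'}},
    {'update': {'_index': 'index1', '_type': 'type1', '_id': '1',
                'status': 404, 'error': 'DocumentMissingException'}},
]
retry = failed_ids(errors)
```

In real use, the list would come from the errors attribute of a caught BulkError rather than from a literal.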
-
-
exception
pyelasticsearch.exceptions.
ElasticHttpError
[source]¶ Exception raised when ES returns a non-OK (>=400) HTTP status code
-
error
¶ A string error message
-
status_code
¶ The HTTP status code of the response that precipitated the error
-
-
exception
pyelasticsearch.exceptions.
ElasticHttpNotFoundError
[source]¶ Exception raised when a request to ES returns a 404
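As a rough illustration of extracting document IDs from BulkError.errors, the sketch below walks a list in the format shown above and collects the operation type, document ID, and status of each failure. The helper name failed_ids and the sample data are assumptions made for illustration; they are not part of pyelasticsearch.

```python
def failed_ids(errors):
    """Return (operation, doc ID, status) for each failed bulk action."""
    failures = []
    for action in errors:
        # Each entry is a one-key dict such as {"index": {...}} or {"update": {...}}.
        for op_type, detail in action.items():
            failures.append((op_type, detail.get('_id'), detail.get('status')))
    return failures


# Sample data in the format emitted by ES (error messages abbreviated).
sample_errors = [
    {"index": {"_index": "test", "_type": "type1", "_id": "1",
               "status": 409, "error": "VersionConflictEngineException[...]"}},
    {"update": {"_index": "index1", "_type": "type1", "_id": "1",
                "status": 404, "error": "DocumentMissingException[...]"}},
]

for op_type, doc_id, status in failed_ids(sample_errors):
    print(op_type, doc_id, status)
```

In practice, you would pass exc.errors from an `except BulkError as exc:` clause and resubmit only the affected documents.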
Debugging¶
pyelasticsearch logs to the elasticsearch.trace logger using the Python logging module. If you configure that to show INFO-level messages, then it’ll show the requests in curl form and their responses. To see when a server is marked as dead, follow the elasticsearch logger.
import logging
logging.getLogger('elasticsearch.trace').setLevel(logging.INFO)
logging.getLogger('elasticsearch').setLevel(logging.INFO)
Note
This assumes that logging is already set up with something like this:
import logging
logging.basicConfig()
pyelasticsearch will log lines like:
INFO:elasticsearch.trace: curl -XGET 'http://localhost:9200/fooindex/testdoc/_search' -d '{"facets": {"topics": {"terms": {"field": "topics"}}}}'
You can copy and paste the curl line, and it’ll work on the command line.
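If you would rather keep the curl-style traces apart from the dead-server notices, one possible setup, using only the standard logging module, gives each logger its own handler. The function name configure_es_logging and the chosen formats are illustrative assumptions, not part of pyelasticsearch.

```python
import logging
import sys


def configure_es_logging(trace_stream=sys.stdout):
    """Send request traces and dead-server notices to separate handlers."""
    trace_logger = logging.getLogger('elasticsearch.trace')
    trace_logger.setLevel(logging.INFO)
    # Stop traces from also reaching handlers on 'elasticsearch' or the root logger.
    trace_logger.propagate = False
    trace_handler = logging.StreamHandler(trace_stream)
    # Bare messages keep the curl lines easy to copy and paste.
    trace_handler.setFormatter(logging.Formatter('%(message)s'))
    trace_logger.addHandler(trace_handler)

    es_logger = logging.getLogger('elasticsearch')
    es_logger.setLevel(logging.INFO)
    es_handler = logging.StreamHandler()  # StreamHandler defaults to stderr
    es_handler.setFormatter(
        logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
    es_logger.addHandler(es_handler)
    return trace_logger, es_logger
```

Calling configure_es_logging() once at startup would print traces to stdout and node-status messages to stderr; passing an open file as trace_stream would collect the curl lines in that file instead.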
Comparison with elasticsearch-py, the “Official Client”¶
pyelasticsearch was created before Elasticsearch-the-company provided its own
client libraries for anything other than Java. There was no reliable,
large-scale ES client for Python: pyes was closest, but it suffered from
unreliability and pervasive weirdness, like closing sockets in __del__ and doing things which were obvious no-ops. We adapted pyelasticsearch from an
older, very simple client library and gave it a complete API overhaul in
version 0.2, inspired by the principles of poetic API design.
Elasticsearch-the-company later created its own clients, with a strong leaning toward keeping them similar across languages for ease of support and maintenance. The upside is that their libraries always support the latest ES features, down to every last nook and cranny, because the relevant parts are autogenerated from a generic API description language. The downside is that they feel autogenerated: some things end up less than Pythonic.
Which Should You Use?¶
The official Python client borrows much design—and code—from pyelasticsearch. Starting in 1.0, we return the favor, using elasticsearch-py’s transport layer rather than maintaining our own. The important differences remain at the API level.
In general, pyelasticsearch focuses on…
Pythonic-ness
pyelasticsearch is designed to feel elegant to the caller. For example, we strive for symmetry: creating an index is es.create(), and searching one is es.search(). In elasticsearch-py, creating an index is nested inside es.indices.create(<index name>), an artifact of code organization. The tradeoff for added design thought is that the project moves slower.
Good defaults and simple interfaces
For example, there is only a single transport, HTTP(S), but it is almost always the right one. Thrift, the leading alternative, yields a 15% speed boost but only when using many small requests. It doesn’t help at all for bulk indexing, where speed is most often a concern, and it complicates troubleshooting, proxying, and setup. In fact, it’s deprecated in ES 1.5 and will be removed in 2.0.
For another example, if you use an HTTPS URL, the authenticity of the server certificate will be automatically verified using Mozilla’s certificate authority store. You neither have to manually enable verification nor provide your own store.
The tradeoff here is that we don’t expose as many knobs to twiddle as the official client. If you have unusual needs, like using self-signed SSL certificates, we might not be for you. Otherwise, you can enjoy less verbose code.
Safety
If something fails, it always raises an exception, making it hard to accidentally ignore. elasticsearch-py doesn’t always do this: you need to check for errors explicitly when using its bulk indexing helper, for example.
In addition, data loss is hard to stumble into; we put up guardrails. For example, calling the update-settings API with no indices would, if we simply followed the ES REST API, update all indices, a far-reaching destructive action caused by an omission. We require the explicit use of an update_all_settings() method if you want to do this.
Better documentation
You should never need to read the source code to figure out what to do. In order to twiddle many of the aforementioned knobs in elasticsearch-py, you must squirrel kwargs down through multiple undocumented layers, from constructor to constructor, until something finally understands them. On the way, it’s often unclear what’s public and what’s private.
Our top-level docs are comprehensive with regard to our API, we link to the ES docs for details about their system, and we try to respect the Law of Demeter in our layering.
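To make the safety point about update_settings() and update_all_settings() concrete, here is a minimal sketch of that kind of guardrail. SettingsClient is a hypothetical stand-in that records requests instead of contacting ES; its behavior, including the choice of ValueError, is an assumption and not pyelasticsearch's actual implementation.

```python
class SettingsClient(object):
    """Hypothetical client that records requests instead of sending them."""

    def __init__(self):
        self.requests = []

    def update_settings(self, index, settings):
        # Accept a single index name or an iterable of names.
        if isinstance(index, str):
            index = [index]
        index = list(index)
        # Refuse an empty list rather than silently widening to every index.
        if not index:
            raise ValueError(
                'No index given; call update_all_settings() to change all indexes.')
        self.requests.append(('PUT', [','.join(index), '_settings'], settings))

    def update_all_settings(self, settings):
        # The far-reaching operation gets its own explicitly named method.
        self.requests.append(('PUT', ['_settings'], settings))


client = SettingsClient()
client.update_settings('contacts', {'number_of_replicas': 2})
client.update_all_settings({'refresh_interval': '5s'})
```

The design choice is that the destructive case can only be reached by typing its name, never by leaving an argument out.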
Conversely, elasticsearch-py focuses on…
Exhaustive functionality
It provides explicit hooks into every corner of ES and keeps up to date with ES releases.
Our strategy is to provide Forward-Compatibility Kwargs (which elasticsearch-py adopted as well) and send_request() for the period between an ES release and when we design APIs for its new features.
Cross-language homogeneity
If you’re using ES from multiple languages every day, you might enjoy an API that looks similar across them.
Conversely, we aim for idiomatic Python.
Migrating From pyes¶
Moving your project from pyes to pyelasticsearch is easy, especially for simple use cases. Here are some code changes that will aid your porting.
- pyelasticsearch requires requests 1.x. Breaking changes were introduced in requests 1.0, so if your project was using a previous version, you may need to update your code. Most likely, you just need to change response.json to response.json().
- Instantiating the client should be as simple as changing the invocation…
    pyes.ES(host, **kwargs)
  …to…
    pyelasticsearch.ElasticSearch(host, **kwargs)
- pyelasticsearch has no method create_index_if_missing. Instead, you’ll need to catch the exception manually:
    try:
        connection.create_index(index='already_existing_index')
    except pyelasticsearch.IndexAlreadyExistsError as ex:
        print('Index already exists, moving on...')
- Instead of using pyes’s _send_request, use send_request(). This also requires the path to be passed as an iterable instead of a string. For example…
    es._send_request('POST', 'my_index/my_doc_type', body)
  …becomes…
    connection.send_request('POST', ['my_index', 'my_doc_type'], body)
- The indices keyword argument in pyes turns to index in pyelasticsearch, whether the method takes multiple indices or not.
- The doc_types keyword argument in pyes turns to doc_type in pyelasticsearch.
- get() will raise ElasticHttpNotFoundError if the requested documents are not found.
- pyes expects arguments to index to be in a different order than our index(). The document to be indexed needs to be moved from the first positional argument to the third.
- send_request() will raise an error if the response can’t be converted to JSON. If you expect that a response will not be JSON, catch the exception and inspect the status code. For example…
    connection = ElasticSearch(host)
    try:
        # Check for the existence of the "pycon" index:
        connection.send_request('HEAD', ['pycon'])
    except InvalidJsonResponseError as exc:
        if exc.response.status_code == 200:
            print('The index exists!')
- If using search_raw from pyes, you can use search() and, if necessary, rename the keyword arguments.
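One plausible benefit of passing the path to send_request() as an iterable is that each component can be escaped on its own, so a document ID containing a slash cannot be mistaken for an extra path level. The sketch below shows the idea with the standard library only; join_path is a hypothetical helper, not the routine pyelasticsearch uses internally.

```python
from urllib.parse import quote


def join_path(path_components):
    """Join path components into a URL path, escaping each one separately."""
    # safe='' makes quote() escape '/' as well, so a component stays one level.
    return '/' + '/'.join(quote(str(part), safe='') for part in path_components)


print(join_path(['my_index', 'my_doc_type']))
# -> /my_index/my_doc_type
print(join_path(['my_index', 'my_doc_type', 'a/b c']))
# -> /my_index/my_doc_type/a%2Fb%20c
```

With a single pre-joined string such as 'my_index/my_doc_type/a/b c', the library could not tell which slashes separate components and which belong to an ID.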
Changelog¶
v1.4.1 (2018-04-02)¶
- Recognize new “index already exists” spelling so we raise the right exceptions. Close #195.
- Fix CI setup.
- Drop Python 2.6 support.
- Drop nose for testing.
v1.4¶
- Add support for custom certificate authorities via the ca_certs arg to the ElasticSearch constructor.
- Add support for client certificates via the client_cert arg.
v1.3¶
- Add support for HTTPS.
- Add username, password, and port kwargs to the constructor so you don’t have to repeat their values if they’re the same across many servers.
v1.2.4 (2015-05-21)¶
- Don’t crash when the query_params kwarg is omitted from calls to send_request().
v1.2.3 (2015-04-17)¶
- Make delete_all_indexes() work.
- Fix a bug in which specifying _all as an index name sometimes caused doctype names to be treated as index names.
v1.2.2 (2015-04-10)¶
- Correct a typo in the bulk() docs.
v1.2.1 (2015-04-09)¶
- Update ES doc links, now that Elastic has changed domains and reorganized its docs.
- Require elasticsearch lib 1.3 or greater, as that’s when it started exposing ConnectionTimeout.
v1.2 (2015-03-06)¶
- Make sure the Content-Length header gets set when calling create_index() with no explicit settings arg. This solves 411s when using nginx as a proxy.
- Add doc_as_upsert arg to update().
- Make bulk_chunks() compute perfectly optimal results, no longer ever exceeding the byte limit unless a single document is over the limit on its own.
v1.1 (2015-02-12)¶
- Introduce new bulk API, supporting all types of bulk operations (index, update, create, and delete), providing chunking via bulk_chunks(), and introducing per-action error-handling. All errors raise exceptions, even individual failed operations, and the exceptions expose enough data to identify operations for retrying or reporting. The design is decoupled in case you want to create your own chunkers or operation builders.
- Deprecate bulk_index() in favor of the more capable bulk().
- Make one last update to bulk_index(). It now catches individual operation failures, raising BulkError. Also add the index_field and type_field args, allowing you to index across different indices and doc types within one request.
- ElasticSearch object now defaults to http://localhost:9200/ if you don’t provide any node URLs.
- Improve docs: give a better overview on the front page, and document how to customize JSON encoding.
v1.0 (2015-01-23)¶
- Switch to elasticsearch-py’s transport and downtime-pooling machinery, much of which was borrowed from us anyway.
- Make bulk indexing (and likely other network things) 15 times faster.
- Add a comparison with the official client to the docs.
- Fix delete_by_query() to work with ES 1.0 and later.
- Bring percolate() es_kwargs up to date.
- Fix all tests that were failing on modern versions of ES.
- Tolerate errors that are non-strings and create exceptions for them properly.
Note
Backward incompatible:
- Drop compatibility with elasticsearch < 1.0.
- Redo cluster_state() to work with ES 1.0 and later. Arguments have changed.
- InvalidJsonResponseError no longer provides access to the HTTP response (in the response property): just the bad data (the input property).
- Change from the logger “pyelasticsearch” to “elasticsearch.trace”.
- Remove revival_delay param from ElasticSearch object.
- Remove encode_body param from send_request(). Now all dicts are JSON-encoded, and all strings are left alone.
v0.7.1 (2014-08-12)¶
- Brings tests up to date with update_aliases() API change.
v0.7 (2014-08-12)¶
- When an id_field is specified for bulk_index(), don’t index it under its original name as well; use it only as the _id.
- Rename aliases() to get_aliases() for consistency with other methods. Original name still works but is deprecated. Add an alias kwarg to the method so you can fetch specific aliases.
Note
Backward incompatible:
update_aliases() no longer requires a dict with an actions key; that much is implied. Just pass the value of that key.
v0.6.1 (2013-11-01)¶
- Update package requirements to allow requests 2.0, which is in fact compatible. (Natim)
- Properly raise IndexAlreadyExistsException even if the error is reported by a node other than the one to which the client is directly connected. (Jannis Leidel)
v0.6 (2013-07-23)¶
Note
Note the change in behavior of bulk_index() in this release. This change probably brings it more in line with your expectations. But double check, since it now overwrites existing docs in situations where it didn’t before.
Also, we made a backward-incompatible spelling change to a little-used index() kwarg.
- bulk_index() now overwrites any existing doc of the same ID and doctype. Before, in certain versions of ES (like 0.90RC2), it did nothing at all if a document already existed, probably much to your surprise. (We removed the 'op_type': 'create' pair, whose intentions were always mysterious.) (Gavin Carothers)
- Rename the force_insert kwarg of index() to overwrite_existing. The old name implied the opposite of what it actually did. (Gavin Carothers)
v0.5 (2013-04-20)¶
- Support multiple indices and doctypes in delete_by_query(). Accept both string and JSON queries in the query arg, just as search() does. Passing the q arg explicitly is now deprecated.
- Add multi_get.
- Add percolate. Thanks, Adam Georgiou and Joseph Rose!
- Add ability to specify the parent document in bulk_index(). Thanks, Gavin Carothers!
- Remove the internal, undocumented from_python method. django-haystack users will need to upgrade to a newer version that avoids using it.
- Refactor JSON encoding machinery. Now it’s clearer how to customize it: just plug your custom JSON encoder class into ElasticSearch.json_encoder.
- Don’t crash under python -OO.
- Support non-ASCII URL path components (like Unicode document IDs) and query string param values.
- Switch to the nose testrunner.
v0.4.1 (2013-03-25)¶
- Fix a bug introduced in 0.4 wherein “None” was accidentally sent to ES when an ID wasn’t passed to index().
v0.4 (2013-03-19)¶
- Support Python 3.
- Support more APIs:
  - cluster_state
  - get_settings
  - update_aliases and aliases
  - update (existed but didn’t work before)
- Support the size param of the search method. (You can now change es_size to size in your code if you like.)
- Support the fields param on index and update methods, new since ES 0.20.
- Maintain better precision of floats when passed to ES.
- Change endpoint of bulk indexing so it works on ES < 0.18.
- Support documents whose ID is 0.
- URL-escape path components, so doc IDs containing funny chars work.
- Add a dedicated IndexAlreadyExistsError exception for when you try to create an index that already exists. This helps you trap this situation unambiguously.
- Add docs about upgrading from pyes.
- Remove the undocumented and unused to_python method.
v0.3 (2013-01-10)¶
- Correct the requests requirement to require a version that has everything we need. In fact, require requests 1.x, which has a stable API.
- Add update() method.
- Make send_request method public so you can use ES APIs we don’t yet explicitly support.
- Handle JSON translation of Decimal class and sets.
- Make more_like_this() take an arbitrary request body so you can filter the returned docs.
- Replace the fields arg of more_like_this with mlt_fields. This makes it actually work, as it’s the param name ES expects.
- Make explicit our undeclared dependency on simplejson.
v0.2 (2012-10-06)¶
Many thanks to Erik Rose for almost completely rewriting the API to follow best practices, improve the API user experience, and make pyelasticsearch future-proof.
Note
This release is backward-incompatible in numerous ways; please read the following section carefully. If in doubt, you can easily stick with pyelasticsearch 0.1.
Backward-incompatible changes:
- Simplify search() and count() calling conventions. Each now supports either a textual or a dict-based query as its first argument. There’s no longer a need to, for example, pass an empty string as the first arg in order to use a JSON query (a common case).
- Standardize on the singular for the names of the index and doc_type kwargs. It’s not always obvious whether an ES API allows for multiple indexes. This was leading me to have to look aside to the docs to determine whether the kwarg was called index or indexes. Using the singular everywhere will result in fewer doc lookups, especially for the common case of a single index.
- Rename morelikethis to more_like_this for consistency with other methods.
- index() now takes (index, doc_type, doc) rather than (doc, index, doc_type), for consistency with bulk_index() and other methods.
- Similarly, put_mapping() now takes (index, doc_type, mapping) rather than (doc_type, mapping, index).
- To prevent callers from accidentally destroying large amounts of data…
  - delete() no longer deletes all documents of a doctype when no ID is specified; use delete_all() instead.
  - delete_index() no longer deletes all indexes when none are given; use delete_all_indexes() instead.
  - update_settings() no longer updates the settings of all indexes when none are specified; use update_all_settings() instead.
- setup_logging() is gone. If you want to configure logging, use the logging module’s usual facilities. We still log to the “pyelasticsearch” named logger.
- Rethink error handling:
  - Raise a more specific exception for HTTP error codes so callers can catch it without examining a string.
  - Catch non-JSON responses properly, and raise the more specific NonJsonResponseError instead of the generic ElasticSearchError.
  - Remove mentions of nonexistent exception types that would cause crashes in their except clauses.
  - Crash harder if JSON encoding fails: that always indicates a bug in pyelasticsearch.
  - Remove the ill-defined ElasticSearchError.
  - Raise ConnectionError rather than ElasticSearchError if we can’t connect to a node (and we’re out of auto-retries).
  - Raise ValueError rather than ElasticSearchError if no documents are passed to bulk_index.
  - All exceptions are now more introspectable, because they don’t immediately mash all the context down into a string. For example, you can recover the unmolested response object from ElasticHttpError.
  - Removed quiet kwarg, meaning we always expose errors.
Other changes:
- Add Sphinx documentation.
- Add load-balancing across multiple nodes.
- Add failover in the case where a node doesn’t respond.
- Add close_index, open_index, update_settings, health.
- Support passing arbitrary kwargs through to the ES query string. Known ones are taken verbatim; unanticipated ones need an “es_” prefix to guarantee forward compatibility.
- Automatically convert datetime objects when encoding JSON.
- Recognize and convert datetimes and dates in pass-through kwargs. This is useful for timeout.
- In routines that can take either one or many indexes, don’t require the caller to wrap a single index name in a list.
- Many other internal improvements
v0.1 (2012-08-30)¶
Initial release based on the work of Robert Eanes and other authors
Development Notes¶
Testing¶
To run the tests:
% python setup.py test
This should automatically install the additional dependencies required for testing if you don’t have them.
Documentation¶
Documentation is located in docs/ and requires Sphinx to build.
To get the requirements:
% pip install Sphinx
To build the docs:
% cd docs/
% make html
Documentation committed and pushed to the main repository is available on ReadTheDocs at http://pyelasticsearch.readthedocs.org/.
Philosophy¶
pyelasticsearch is intended as a low-level, lossless API to elasticsearch. That is, it generally refrains from adding abstractions that limit flexibility or power. For example, it handles JSON conversion because there is a strict one-to-one mapping between JSON and Python dictionaries: nothing is lost. It converts bad HTTP status codes to exceptions, but you can still access the raw codes and responses by drilling into the exceptions.
Therefore, pyelasticsearch is a good choice for building higher-level APIs upon—ones which make common cases easier but where certain edge cases feel like “coloring outside the lines”. One such library is elasticutils. However, pyelasticsearch is also meant to be directly usable by humans: a great deal of care has been taken to keep calls brief, understandable, consistent, and error-resistant and to deal in data structures which are easy to manipulate with Python’s built-in routines.
Patches along these lines are always welcome. Thank you for trying pyelasticsearch!