3.1.4¶
pystorm lets you run Python code against real-time streams of data. Integrates with Apache Storm.
Quickstart¶
Dependencies¶
Spouts and Bolts¶
The general flow for creating new spouts and bolts using pystorm is to add
them to your src
folder and update the corresponding topology definition.
Let’s create a spout that emits sentences until the end of time:
import itertools
from pystorm.spout import Spout
class SentenceSpout(Spout):
def initialize(self, stormconf, context):
self.sentences = [
"She advised him to take a long holiday, so he immediately quit work and took a trip around the world",
"I was very glad to get a present from her",
"He will be here in half an hour",
"She saw him eating a sandwich",
]
self.sentences = itertools.cycle(self.sentences)
def next_tuple(self):
sentence = next(self.sentences)
self.emit([sentence])
def ack(self, tup_id):
pass # if a tuple is processed properly, do nothing
def fail(self, tup_id):
pass # if a tuple fails to process, do nothing
The magic in the code above happens in the initialize()
and
next_tuple()
functions. Once the spout enters the main run loop,
pystorm will call your spout’s initialize()
method.
After initialization is complete, pystorm will continually call the spout’s
next_tuple()
method where you’re expected to emit tuples that match
whatever you’ve defined in your topology definition.
Now let’s create a bolt that takes in sentences, and spits out words:
import re
from pystorm.bolt import Bolt
class SentenceSplitterBolt(Bolt):
def process(self, tup):
sentence = tup.values[0] # extract the sentence
sentence = re.sub(r"[,.;!\?]", "", sentence) # get rid of punctuation
words = [[word.strip()] for word in sentence.split(" ") if word.strip()]
if not words:
# no words to process in the sentence, fail the tuple
self.fail(tup)
return
for word in words:
self.emit([word])
# tuple acknowledgement is handled automatically
The bolt implementation is even simpler. We simply override the default
process()
method which pystorm calls when a tuple has been emitted by
an incoming spout or bolt. You are welcome to do whatever processing you would
like in this method and can further emit tuples or not depending on the purpose
of your bolt.
If your process()
method completes without raising an Exception, pystorm
will automatically ensure any emits you have are anchored to the current tuple
being processed and acknowledged after process()
completes.
If an Exception is raised while process()
is called, pystorm
automatically fails the current tuple prior to killing the Python process.
Failed Tuples¶
In the example above, we added the ability to fail a sentence tuple if it did
not provide any words. What happens when we fail a tuple? Storm will send a
“fail” message back to the spout where the tuple originated from (in this case
SentenceSpout
) and pystorm calls the spout’s
fail()
method. It’s then up to your spout
implementation to decide what to do. A spout could retry a failed tuple, send
an error message, or kill the topology.
Bolt Configuration Options¶
You can disable the automatic acknowleding, anchoring or failing of tuples by
adding class variables set to false for: auto_ack
, auto_anchor
or
auto_fail
. All three options are documented in
pystorm.bolt.Bolt
.
Example:
from pystorm.bolt import Bolt
class MyBolt(Bolt):
auto_ack = False
auto_fail = False
def process(self, tup):
# do stuff...
if error:
self.fail(tup) # perform failure manually
self.ack(tup) # perform acknowledgement manually
Handling Tick Tuples¶
Ticks tuples are built into Storm to provide some simple forms of cron-like behaviour without actually having to use cron. You can receive and react to tick tuples as timer events with your python bolts using pystorm too.
The first step is to override process_tick()
in your custom
Bolt class. Once this is overridden, you can set the storm option
topology.tick.tuple.freq.secs=<frequency>
to cause a tick tuple
to be emitted every <frequency>
seconds.
You can see the full docs for process_tick()
in
pystorm.bolt.Bolt
.
Example:
from pystorm.bolt import Bolt
class MyBolt(Bolt):
def process_tick(self, freq):
# An action we want to perform at some regular interval...
self.flush_old_state()
Then, for example, to cause process_tick()
to be called every
2 seconds on all of your bolts that override it, you can launch
your topology under sparse run
by setting the appropriate -o
option and value as in the following example:
$ sparse run -o "topology.tick.tuple.freq.secs=2" ...
API¶
Tuples¶
-
class
pystorm.component.
Tuple
(id, component, stream, task, values)¶ Storm’s primitive data type passed around via streams.
Variables: - id – the ID of the Tuple.
- component – component that the Tuple was generated from.
- stream – the stream that the Tuple was emitted into.
- task – the task the Tuple was generated from.
- values – the payload of the Tuple where data is stored.
You should never have to instantiate an instance of a
pystorm.component.Tuple
yourself as pystorm handles this for you
prior to, for example, a pystorm.bolt.Bolt
’s process()
method
being called.
None of the emit methods for bolts or spouts require that you pass a
pystorm.component.Tuple
instance.
Components¶
Both pystorm.bolt.Bolt
and
pystorm.spout.Spout
inherit from a common base-class,
pystorm.component.Component
. It handles the basic
Multi-Lang IPC between Storm and Python.
-
class
pystorm.component.
Component
(input_stream=<open file '<stdin>', mode 'r'>, output_stream=<open file '<stdout>', mode 'w'>, rdb_signal=u'SIGUSR1', serializer=u'json')[source]¶ Base class for spouts and bolts which contains class methods for logging messages back to the Storm worker process.
Variables: - input_stream – The
file
-like object to use to retrieve commands from Storm. Defaults tosys.stdin
. - output_stream – The
file
-like object to send messages to Storm with. Defaults tosys.stdout
. - topology_name – The name of the topology sent by Storm in the initial handshake.
- task_id – The numerical task ID for this component, as sent by Storm in the initial handshake.
- component_name – The name of this component, as sent by Storm in the initial handshake.
- debug – A
bool
indicating whether or not Storm is running in debug mode. Specified by the topology.debug Storm setting. - storm_conf – A
dict
containing the configuration values sent by Storm in the initial handshake with this component. - context – The context of where this component is in the topology. See the Storm Multi-Lang protocol documentation for details.
- pid – An
int
indicating the process ID of this component as retrieved byos.getpid()
. - logger –
A logger to use with this component.
Note
Using
Component.logger
combined with thepystorm.component.StormHandler
handler is the recommended way for logging messages from your component. If you useComponent.log
instead, the logging messages will always be sent to Storm, even if they aredebug
level messages and you are running in production. Usingpystorm.component.StormHandler
ensures that you will instead have your logging messages filtered on the Python side and only have the messages you actually want logged serialized and sent to Storm. - serializer – The
Serializer
that is used to serialize messages between Storm and Python. - exit_on_exception – A
bool
indicating whether or not the process should exit when an exception other thanStormWentAwayError
is raised. Defaults toTrue
.
-
emit
(tup, tup_id=None, stream=None, anchors=None, direct_task=None, need_task_ids=False)[source]¶ Emit a new Tuple to a stream.
Parameters: - tup (
list
orpystorm.component.Tuple
) – the Tuple payload to send to Storm, should contain only JSON-serializable data. - tup_id (str) – the ID for the Tuple. If omitted by a
pystorm.spout.Spout
, this emit will be unreliable. - stream (str) – the ID of the stream to emit this Tuple to. Specify
None
to emit to default stream. - anchors (list) – IDs the Tuples (or
pystorm.component.Tuple
instances) which the emitted Tuples should be anchored to. This is only passed bypystorm.bolt.Bolt
. - direct_task (int) – the task to send the Tuple to.
- need_task_ids (bool) – indicate whether or not you’d like the task IDs
the Tuple was emitted (default:
False
).
Returns: None
, unlessneed_task_ids=True
, in which case it will be alist
of task IDs that the Tuple was sent to if. Note that when specifying direct_task, this will be equal to[direct_task]
.- tup (
-
initialize
(storm_conf, context)[source]¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
log
(message, level=None)[source]¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
raise_exception
(exception, tup=None)[source]¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
report_metric
(name, value)[source]¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
- input_stream – The
Spouts¶
Spouts are data sources for topologies, they can read from any data source and emit tuples into streams.
-
class
pystorm.spout.
Spout
(input_stream=<open file '<stdin>', mode 'r'>, output_stream=<open file '<stdout>', mode 'w'>, rdb_signal=u'SIGUSR1', serializer=u'json')[source]¶ Bases:
pystorm.component.Component
Base class for all pystorm spouts.
For more information on spouts, consult Storm’s Concepts documentation.
-
ack
(tup_id)[source]¶ Called when a bolt acknowledges a Tuple in the topology.
Parameters: tup_id (str) – the ID of the Tuple that has been fully acknowledged in the topology.
-
activate
()[source]¶ Called when the Spout has been activated after being deactivated.
Note
This requires at least Storm 1.1.0.
-
deactivate
()[source]¶ Called when the Spout has been deactivated.
Note
This requires at least Storm 1.1.0.
-
emit
(tup, tup_id=None, stream=None, direct_task=None, need_task_ids=False)[source]¶ Emit a spout Tuple message.
Parameters: - tup (list or tuple) – the Tuple to send to Storm, should contain only JSON-serializable data.
- tup_id (str) – the ID for the Tuple. Leave this blank for an unreliable emit.
- stream (str) – ID of the stream this Tuple should be emitted to. Leave empty to emit to the default stream.
- direct_task (int) – the task to send the Tuple to if performing a direct emit.
- need_task_ids (bool) – indicate whether or not you’d like the task IDs
the Tuple was emitted (default:
False
).
Returns: None
, unlessneed_task_ids=True
, in which case it will be alist
of task IDs that the Tuple was sent to if. Note that when specifying direct_task, this will be equal to[direct_task]
.
-
fail
(tup_id)[source]¶ Called when a Tuple fails in the topology
A spout can choose to emit the Tuple again or ignore the fail. The default is to ignore.
Parameters: tup_id (str) – the ID of the Tuple that has failed in the topology either due to a bolt calling fail()
or a Tuple timing out.
-
initialize
(storm_conf, context)¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
static
is_heartbeat
(tup)¶ Returns: Whether or not the given Tuple is a heartbeat
-
log
(message, level=None)¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
next_tuple
()[source]¶ Implement this function to emit Tuples as necessary.
This function should not block, or Storm will think the spout is dead. Instead, let it return and pystorm will send a noop to storm, which lets it know the spout is functioning.
-
raise_exception
(exception, tup=None)¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
read_handshake
()¶ Read and process an initial handshake message from Storm.
-
read_message
()¶ Read a message from Storm via serializer.
-
report_metric
(name, value)¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
-
run
()¶ Main run loop for all components.
Performs initial handshake with Storm and reads Tuples handing them off to subclasses. Any exceptions are caught and logged back to Storm prior to the Python process exiting.
Warning
Subclasses should not override this method.
-
send_message
(message)¶ Send a message to Storm via stdout.
-
-
class
pystorm.spout.
ReliableSpout
(*args, **kwargs)[source]¶ Bases:
pystorm.spout.Spout
Reliable spout that will automatically replay failed tuples.
Failed tuples will be replayed up to
max_fails
times.For more information on spouts, consult Storm’s Concepts documentation.
-
ack
(tup_id)[source]¶ Called when a bolt acknowledges a Tuple in the topology.
Parameters: tup_id (str) – the ID of the Tuple that has been fully acknowledged in the topology.
-
activate
()¶ Called when the Spout has been activated after being deactivated.
Note
This requires at least Storm 1.1.0.
-
deactivate
()¶ Called when the Spout has been deactivated.
Note
This requires at least Storm 1.1.0.
-
emit
(tup, tup_id=None, stream=None, direct_task=None, need_task_ids=False)[source]¶ Emit a spout Tuple & add metadata about it to unacked_tuples.
In order for this to work, tup_id is a required parameter.
See
Bolt.emit()
.
-
fail
(tup_id)[source]¶ Called when a Tuple fails in the topology
A reliable spout will replay a failed tuple up to
max_fails
times.Parameters: tup_id (str) – the ID of the Tuple that has failed in the topology either due to a bolt calling fail()
or a Tuple timing out.
-
initialize
(storm_conf, context)¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
static
is_heartbeat
(tup)¶ Returns: Whether or not the given Tuple is a heartbeat
-
log
(message, level=None)¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
next_tuple
()¶ Implement this function to emit Tuples as necessary.
This function should not block, or Storm will think the spout is dead. Instead, let it return and pystorm will send a noop to storm, which lets it know the spout is functioning.
-
raise_exception
(exception, tup=None)¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
read_handshake
()¶ Read and process an initial handshake message from Storm.
-
read_message
()¶ Read a message from Storm via serializer.
-
report_metric
(name, value)¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
-
run
()¶ Main run loop for all components.
Performs initial handshake with Storm and reads Tuples handing them off to subclasses. Any exceptions are caught and logged back to Storm prior to the Python process exiting.
Warning
Subclasses should not override this method.
-
send_message
(message)¶ Send a message to Storm via stdout.
-
Bolts¶
-
class
pystorm.bolt.
Bolt
(*args, **kwargs)[source]¶ Bases:
pystorm.component.Component
The base class for all pystorm bolts.
For more information on bolts, consult Storm’s Concepts documentation.
Variables: - auto_anchor – A
bool
indicating whether or not the bolt should automatically anchor emits to the incoming Tuple ID. Tuple anchoring is how Storm provides reliability, you can read more about Tuple anchoring in Storm’s docs. Default isTrue
. - auto_ack – A
bool
indicating whether or not the bolt should automatically acknowledge Tuples afterprocess()
is called. Default isTrue
. - auto_fail – A
bool
indicating whether or not the bolt should automatically fail Tuples when an exception occurs when theprocess()
method is called. Default isTrue
.
Example:
from pystorm.bolt import Bolt class SentenceSplitterBolt(Bolt): def process(self, tup): sentence = tup.values[0] for word in sentence.split(" "): self.emit([word])
-
ack
(tup)[source]¶ Indicate that processing of a Tuple has succeeded.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to acknowledge.
-
emit
(tup, stream=None, anchors=None, direct_task=None, need_task_ids=False)[source]¶ Emit a new Tuple to a stream.
Parameters: - tup (
list
orpystorm.component.Tuple
) – the Tuple payload to send to Storm, should contain only JSON-serializable data. - stream (str) – the ID of the stream to emit this Tuple to. Specify
None
to emit to default stream. - anchors (list) – IDs the Tuples (or
pystorm.component.Tuple
instances) which the emitted Tuples should be anchored to. Ifauto_anchor
is set toTrue
and you have not specifiedanchors
,anchors
will be set to the incoming/most recent Tuple ID(s). - direct_task (int) – the task to send the Tuple to.
- need_task_ids (bool) – indicate whether or not you’d like the task IDs
the Tuple was emitted (default:
False
).
Returns: None
, unlessneed_task_ids=True
, in which case it will be alist
of task IDs that the Tuple was sent to if. Note that when specifying direct_task, this will be equal to[direct_task]
.- tup (
-
fail
(tup)[source]¶ Indicate that processing of a Tuple has failed.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to fail (itsid
ifstr
).
-
initialize
(storm_conf, context)¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
static
is_heartbeat
(tup)¶ Returns: Whether or not the given Tuple is a heartbeat
-
log
(message, level=None)¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
process
(tup)[source]¶ Process a single Tuple
pystorm.component.Tuple
of inputThis should be overridden by subclasses.
pystorm.component.Tuple
objects contain metadata about which component, stream and task it came from. The actual values of the Tuple can be accessed by callingtup.values
.Parameters: tup ( pystorm.component.Tuple
) – the Tuple to be processed.
-
process_tick
(tup)[source]¶ Process special ‘tick Tuples’ which allow time-based behaviour to be included in bolts.
Default behaviour is to ignore time ticks. This should be overridden by subclasses who wish to react to timer events via tick Tuples.
Tick Tuples will be sent to all bolts in a toplogy when the storm configuration option ‘topology.tick.tuple.freq.secs’ is set to an integer value, the number of seconds.
Parameters: tup ( pystorm.component.Tuple
) – the Tuple to be processed.
-
raise_exception
(exception, tup=None)¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
read_handshake
()¶ Read and process an initial handshake message from Storm.
-
read_message
()¶ Read a message from Storm via serializer.
-
report_metric
(name, value)¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
-
run
()¶ Main run loop for all components.
Performs initial handshake with Storm and reads Tuples handing them off to subclasses. Any exceptions are caught and logged back to Storm prior to the Python process exiting.
Warning
Subclasses should not override this method.
-
send_message
(message)¶ Send a message to Storm via stdout.
- auto_anchor – A
-
class
pystorm.bolt.
BatchingBolt
(*args, **kwargs)[source]¶ Bases:
pystorm.bolt.Bolt
A bolt which batches Tuples for processing.
Batching Tuples is unexpectedly complex to do correctly. The main problem is that all bolts are single-threaded. The difficult comes when the topology is shutting down because Storm stops feeding the bolt Tuples. If the bolt is blocked waiting on stdin, then it can’t process any waiting Tuples, or even ack ones that were asynchronously written to a data store.
This bolt helps with that by grouping Tuples received between tick Tuples into batches.
To use this class, you must implement
process_batch
.group_key
can be optionally implemented so that Tuples are grouped beforeprocess_batch
is even called.Variables: - auto_anchor –
A
bool
indicating whether or not the bolt should automatically anchor emits to the incoming Tuple ID. Tuple anchoring is how Storm provides reliability, you can read more about Tuple anchoring in Storm’s docs. Default isTrue
. - auto_ack – A
bool
indicating whether or not the bolt should automatically acknowledge Tuples afterprocess_batch()
is called. Default isTrue
. - auto_fail – A
bool
indicating whether or not the bolt should automatically fail Tuples when an exception occurs when theprocess_batch()
method is called. Default isTrue
. - ticks_between_batches – The number of tick Tuples to wait before processing a batch.
Example:
from pystorm.bolt import BatchingBolt class WordCounterBolt(BatchingBolt): ticks_between_batches = 5 def group_key(self, tup): word = tup.values[0] return word # collect batches of words def process_batch(self, key, tups): # emit the count of words we had per 5s batch self.emit([key, len(tups)])
-
ack
(tup)¶ Indicate that processing of a Tuple has succeeded.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to acknowledge.
-
emit
(tup, **kwargs)[source]¶ Modified emit that will not return task IDs after emitting.
See
pystorm.component.Bolt
for more information.Returns: None
.
-
fail
(tup)¶ Indicate that processing of a Tuple has failed.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to fail (itsid
ifstr
).
-
group_key
(tup)[source]¶ Return the group key used to group Tuples within a batch.
By default, returns None, which put all Tuples in a single batch, effectively just time-based batching. Override this to create multiple batches based on a key.
Parameters: tup ( pystorm.component.Tuple
) – the Tuple used to extract a group keyReturns: Any hashable
value.
-
initialize
(storm_conf, context)¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
static
is_heartbeat
(tup)¶ Returns: Whether or not the given Tuple is a heartbeat
-
static
is_tick
(tup)¶ Returns: Whether or not the given Tuple is a tick Tuple
-
log
(message, level=None)¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
process
(tup)[source]¶ Group non-tick Tuples into batches by
group_key
.Warning
This method should not be overriden. If you want to tweak how Tuples are grouped into batches, override
group_key
.
-
process_batch
(key, tups)[source]¶ Process a batch of Tuples. Should be overridden by subclasses.
Parameters: - key (hashable) – the group key for the list of batches.
- tups (list) – a list of
pystorm.component.Tuple
s for the group.
-
process_batches
()[source]¶ Iterate through all batches, call process_batch on them, and ack.
Separated out for the rare instances when we want to subclass BatchingBolt and customize what mechanism causes batches to be processed.
-
process_tick
(tick_tup)[source]¶ Increment tick counter, and call
process_batch
for all current batches if tick counter exceedsticks_between_batches
.See
pystorm.component.Bolt
for more information.Warning
This method should not be overriden. If you want to tweak how Tuples are grouped into batches, override
group_key
.
-
raise_exception
(exception, tup=None)¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
read_handshake
()¶ Read and process an initial handshake message from Storm.
-
read_message
()¶ Read a message from Storm via serializer.
-
read_tuple
()¶ Read a tuple from the pipe to Storm.
-
report_metric
(name, value)¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
-
run
()¶ Main run loop for all components.
Performs initial handshake with Storm and reads Tuples handing them off to subclasses. Any exceptions are caught and logged back to Storm prior to the Python process exiting.
Warning
Subclasses should not override this method.
-
send_message
(message)¶ Send a message to Storm via stdout.
- auto_anchor –
-
class
pystorm.bolt.
TicklessBatchingBolt
(*args, **kwargs)[source]¶ Bases:
pystorm.bolt.BatchingBolt
A BatchingBolt which uses a timer thread instead of tick tuples.
Batching tuples is unexpectedly complex to do correctly. The main problem is that all bolts are single-threaded. The difficult comes when the topology is shutting down because Storm stops feeding the bolt tuples. If the bolt is blocked waiting on stdin, then it can’t process any waiting tuples, or even ack ones that were asynchronously written to a data store.
This bolt helps with that grouping tuples based on a time interval and then processing them on a worker thread.
To use this class, you must implement
process_batch
.group_key
can be optionally implemented so that tuples are grouped beforeprocess_batch
is even called.Variables: - auto_anchor –
A
bool
indicating whether or not the bolt should automatically anchor emits to the incoming tuple ID. Tuple anchoring is how Storm provides reliability, you can read more about tuple anchoring in Storm’s docs. Default isTrue
. - auto_ack – A
bool
indicating whether or not the bolt should automatically acknowledge tuples afterprocess_batch()
is called. Default isTrue
. - auto_fail – A
bool
indicating whether or not the bolt should automatically fail tuples when an exception occurs when theprocess_batch()
method is called. Default isTrue
. - secs_between_batches –
The time (in seconds) between calls to
process_batch()
. Note that if there are no tuples in any batch, the TicklessBatchingBolt will continue to sleep.Note
Can be fractional to specify greater precision (e.g. 2.5).
Example:
from pystorm.bolt import TicklessBatchingBolt class WordCounterBolt(TicklessBatchingBolt): secs_between_batches = 5 def group_key(self, tup): word = tup.values[0] return word # collect batches of words def process_batch(self, key, tups): # emit the count of words we had per 5s batch self.emit([key, len(tups)])
-
ack
(tup)¶ Indicate that processing of a Tuple has succeeded.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to acknowledge.
-
emit
(tup, **kwargs)¶ Modified emit that will not return task IDs after emitting.
See
pystorm.component.Bolt
for more information.Returns: None
.
-
fail
(tup)¶ Indicate that processing of a Tuple has failed.
Parameters: tup ( str
orpystorm.component.Tuple
) – the Tuple to fail (itsid
ifstr
).
-
group_key
(tup)¶ Return the group key used to group Tuples within a batch.
By default, returns None, which put all Tuples in a single batch, effectively just time-based batching. Override this to create multiple batches based on a key.
Parameters: tup ( pystorm.component.Tuple
) – the Tuple used to extract a group keyReturns: Any hashable
value.
-
initialize
(storm_conf, context)¶ Called immediately after the initial handshake with Storm and before the main run loop. A good place to initialize connections to data sources.
Parameters:
-
static
is_heartbeat
(tup)¶ Returns: Whether or not the given Tuple is a heartbeat
-
static
is_tick
(tup)¶ Returns: Whether or not the given Tuple is a tick Tuple
-
log
(message, level=None)¶ Log a message to Storm optionally providing a logging level.
Parameters: Warning
This will send your message to Storm regardless of what level you specify. In almost all cases, you are better of using
Component.logger
and not settingpystorm.log.path
, because that will use apystorm.component.StormHandler
to do the filtering on the Python side (instead of on the Java side after taking the time to serialize your message and send it to Storm).
-
process
(tup)¶ Group non-tick Tuples into batches by
group_key
.Warning
This method should not be overriden. If you want to tweak how Tuples are grouped into batches, override
group_key
.
-
process_batch
(key, tups)¶ Process a batch of Tuples. Should be overridden by subclasses.
Parameters: - key (hashable) – the group key for the list of batches.
- tups (list) – a list of
pystorm.component.Tuple
s for the group.
-
process_batches
()¶ Iterate through all batches, call process_batch on them, and ack.
Separated out for the rare instances when we want to subclass BatchingBolt and customize what mechanism causes batches to be processed.
-
raise_exception
(exception, tup=None)¶ Report an exception back to Storm via logging.
Parameters: - exception – a Python exception.
- tup – a
Tuple
object.
-
read_handshake
()¶ Read and process an initial handshake message from Storm.
-
read_message
()¶ Read a message from Storm via serializer.
-
read_tuple
()¶ Read a tuple from the pipe to Storm.
-
report_metric
(name, value)¶ Report a custom metric back to Storm.
Parameters: - name – Name of the metric. This can be anything.
- value – Value of the metric. This is usually a number.
Only supported in Storm 0.9.3+.
Note
In order for this to work, the metric must be registered on the Storm side. See example code here.
-
run
()¶ Main run loop for all components.
Performs initial handshake with Storm and reads Tuples handing them off to subclasses. Any exceptions are caught and logged back to Storm prior to the Python process exiting.
Warning
Subclasses should not override this method.
-
send_message
(message)¶ Send a message to Storm via stdout.
- auto_anchor –