Dask¶
Dask is a flexible parallel computing library for analytic computing.
Dask is composed of two components:
 Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
 “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to largerthanmemory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
Dask emphasizes the following virtues:
 Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
 Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
 Native: Enables distributed computing in Pure Python with access to the PyData stack.
 Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
 Scales up: Runs resiliently on clusters with 1000s of cores
 Scales down: Trivial to set up and run on a laptop in a single process
 Responsive: Designed with interactive computing in mind it provides rapid feedback and diagnostics to aid humans
See the dask.distributed documentation (separate website) for more technical information on Dask’s distributed scheduler,
Familiar user interface¶
Dask DataFrame mimics Pandas  documentation
import pandas as pd import dask.dataframe as dd
df = pd.read_csv('20150101.csv') df = dd.read_csv('2015**.csv')
df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()
Dask Array mimics NumPy  documentation
import numpy as np import dask.array as da
f = h5py.File('myfile.hdf5') f = h5py.File('myfile.hdf5')
x = np.array(f['/smalldata']) x = da.from_array(f['/bigdata'],
chunks=(1000, 1000))
x  x.mean(axis=1) x  x.mean(axis=1).compute()
Dask Bag mimics iterators, Toolz, and PySpark  documentation
import dask.bag as db
b = db.read_text('2015**.json.gz').map(json.loads)
b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute()
Dask Delayed mimics for loops and wraps custom code  documentation
from dask import delayed
L = []
for fn in filenames: # Use for loops to build up computation
data = delayed(load)(fn) # Delay execution of function
L.append(delayed(process)(data)) # Build connections between variables
result = delayed(summarize)(L)
result.compute()
The concurrent.futures interface provides general submission of custom tasks:  documentation
from dask.distributed import Client
client = Client('scheduler:port')
futures = []
for fn in filenames:
future = client.submit(load, fn)
futures.append(future)
summary = client.submit(summarize, futures)
summary.result()
Scales from laptops to clusters¶
Dask is convenient on a laptop. It installs trivially with
conda
or pip
and extends the size of convenient datasets from “fits in
memory” to “fits on disk”.
Dask can scale to a cluster of 100s of machines. It is resilient, elastic, data local, and low latency. For more information see documentation on the distributed scheduler.
This ease of transition between singlemachine to moderate cluster enables users both to start simple and to grow when necessary.
Complex Algorithms¶
Dask represents parallel computations with task graphs. These
directed acyclic graphs may have arbitrary structure, which enables both
developers and users the freedom to build sophisticated algorithms and to
handle messy situations not easily managed by the map/filter/groupby
paradigm common in most data engineering frameworks.
We originally needed this complexity to build complex algorithms for ndimensional arrays but have found it to be equally valuable when dealing with messy situations in everyday problems.
Index¶
Getting Started
Install Dask¶
You can install dask with conda
, with pip
, or by installing from source.
Conda¶
Dask is installed by default in Anaconda.
You can update Dask using the conda command:
conda install dask
This installs Dask and all common dependencies, including Pandas and NumPy.
Dask packages are maintained both on the default channel and on condaforge.
Optionally, you can obtain a minimal dask installation using the following command:
conda install daskcore
This will install a minimal set of dependencies required to run dask, similar to (but not exactly the same as) pip install dask
below.
Pip¶
You can install everything required for most common uses of dask (arrays, dataframes, …) This installs both Dask and dependencies like NumPy, Pandas, and so on that are necessary for different workloads. This is often the right choice for Dask users:
pip install "dask[complete]" # Install everything
You can also install only the Dask library. Modules like dask.array, dask.dataframe, or dask.distributed won’t work until you also install NumPy, Pandas, or Tornado respectively. This is common for downstream library maintainers:
pip install dask # Install only core parts of dask
We also maintain other dependency sets for different subsets of functionality:
pip install "dask[array]" # Install requirements for dask array
pip install "dask[bag]" # Install requirements for dask bag
pip install "dask[dataframe]" # Install requirements for dask dataframe
pip install "dask[distributed]" # Install requirements for distributed dask
We have these options so that users of the lightweight core dask scheduler aren’t required to download the more exotic dependencies of the collections (Numpy, Pandas, Tornado, etc..)
Install from Source¶
To install dask from source, clone the repository from github:
git clone https://github.com/dask/dask.git
cd dask
python setup.py install
or use pip
locally if you want to install all dependencies as well:
pip install e .[complete]
You can view the list of all dependencies within the extras_require
field
of setup.py
.
Anaconda¶
Dask is included by default in the Anaconda distribution
Test¶
Test dask with py.test
:
cd dask
py.test dask
Although please aware that installing dask naively may not install all
requirements by default. Please read the pip
section above that discusses
requirements. You may choose to install the dask[complete]
which includes
all dependencies for all collections. Alternatively you may choose to test
only certain submodules depending on the libraries within your environment.
For example to test only dask core and dask array we would run tests as
follows:
py.test dask/tests dask/array/tests
Setup¶
This page describes various ways to set up Dask on different hardware, either locally on your own machine or on a distributed cluster. If you are just getting started then this page is unnecessary. Dask does not require any setup if you only want to use it on a single computer.
Dask has two families of task schedulers:
 Single machine scheduler: This scheduler provides basic features on a local process or thread pool. This scheduler was made first and is the default. It is simple and cheap to use. It can only be used on a single machine and does not scale.
 Distributed scheduler: This scheduler is more sophisticated, offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.
If you import Dask, set up a computation, and then call compute
then you
will use the singlemachine scheduler by default. To use the dask.distributed
scheduler you must set up a Client
import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute() # This uses the singlemachine scheduler by default
from dask.distributed import Client
client = Client(...) # Connect to distributed cluster and override default
df.x.sum().compute() # This now runs on the distributed system
Note that the newer dask.distributed
scheduler is often preferable even on
single workstations. It contains many diagnostics and features not found in
the older singlemachine scheduler. The following pages explain in more detail
how to set up Dask on a variety of local and distributed hardware.
 Single Machine:
 Default Scheduler: The nosetup default. Uses local threads or processes for largerthanmemory processing
 Dask.distributed: The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
 Distributed computing:
 Manual Setup: The command line interface to set up
daskscheduler
anddaskworker
processes. Useful for IT or anyone building a deployment solution.  SSH: Use SSH to set up Dask across an unmanaged cluster
 High Performance Computers: How to run Dask on traditional HPC environments using tools like MPI, or job schedulers like SLURM, SGE, TORQUE, LSF, and so on
 Kubernetes: Deploy Dask with the popular Kubernetes resource manager using either Helm or a native deployment.
 Python API (advanced): Create
Scheduler
andWorker
objects from Python as part of a distributed Tornado TCP application. This page is useful for those building custom frameworks.  Docker containers are available and may be useful in some of the solutions above.
 Cloud for current recommendations on how to deploy Dask and Jupyter on common cloud providers like Amazon, Google, or Microsoft Azure.
 Manual Setup: The command line interface to set up
SingleMachine Scheduler¶
The default Dask scheduler provides parallelism on a single machine by using either threads or processes. It is the default choice used by Dask because it requires no setup. You don’t need to make any choices or set anything up to use this scheduler, however you do have a choice between threads and processes:
Threads: Use multiple threads in the same process. This option is good for numeric code that releases the GIL (like NumPy, Pandas, ScikitLearn, Numba, …) because data is free to share. This is the default scheduler for
dask.array
,dask.dataframe
, anddask.delayed
Processes: Send data to separate processes for processing. This option is good when operating on pure Python objects like strings or JSONlike dictionary data that holds onto the GIL but not very good when operating on numeric data like Pandas dataframes or NumPy arrays. Using processes avoids GIL issues but can also result in a lot of interprocess communication, which can be slow. This is the default scheduler for
dask.bag
and is sometimes useful withdask.dataframe
.Note that the dask.distributed scheduler is often a better choice when working with GILbound code. See Dask.distributed on a single machine.
Singlethreaded: Execute computations in a single thread. This option provides no parallelism, but is useful when debugging or profiling. Turning your parallel execution into a sequential one can be a convenient option in many situations where you want to better understand what is going on.
Selecting Threads, Processes, or Single Threaded¶
Currently these options are available by selecting different get
functions:
dask.threaded.get
: The threaded schedulerdask.multiprocessing.get
: The multiprocessing schedulerdask.local.get_sync
: The singlethreaded scheduler
You can specify these functions in any of the following ways:
When calling
.compute()
x.compute(scheduler='threads')
With a context manager
with dask.config.set(scheduler='threads'): x.compute() y.compute()
As a global setting
dask.config.set(scheduler='threads')
Single Machine: Dask.distributed¶
The dask.distributed scheduler works well on a single machine. It is sometimes preferred over the default scheduler for the following reasons:
 It provides access to asynchronous API, notably Futures
 It provides a diagnostic dashboard that can provide valuable insight on performance and progress
 It handles data locality with more sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
You can create a dask.distributed scheduler by importing and creating a
Client
with no arguments. This overrides whatever default was previously
set.
from dask.distributed import Client
client = Client()
You can navigate to http://localhost:8787/status to see the diagnostic dashboard if you have Bokeh installed.
Client¶
You can trivially set up a local cluster on your machine by instantiating a Dask Client with no arguments
from dask.distributed import Client
client = Client()
This sets up a scheduler in your local process and several processes running singlethreaded Workers.
If you want to run workers in your same process you can pass the
processes=False
keyword argument.
client = Client(processes=False)
This is sometimes preferable if you want to avoid interworker communication and your computations release the GIL. This is common when primarily using NumPy or Dask.array.
LocalCluster¶
The Client()
call described above is shorthand for creating a LocalCluster
and then passing that to your client.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
This is equivalent, but somewhat more explicit. You may want to look at the keyword arguments available on LocalCluster to understand the options available to you on handling the mixture of threads and processes, specifying explicit ports, and so on.

class
distributed.deploy.local.
LocalCluster
(n_workers=None, threads_per_worker=None, processes=True, loop=None, start=None, ip=None, scheduler_port=0, silence_logs=30, diagnostics_port=8787, services={}, worker_services={}, service_kwargs=None, asynchronous=False, **worker_kwargs)¶ Create local Scheduler and Workers
This creates a “cluster” of a scheduler and workers running on the local machine.
Parameters:  n_workers: int
Number of workers to start
 processes: bool
Whether to use processes (True) or threads (False). Defaults to True
 threads_per_worker: int
Number of threads per each worker
 scheduler_port: int
Port of the scheduler. 8786 by default, use 0 to choose a random port
 silence_logs: logging level
Level of logs to print out to stdout.
logging.WARN
by default. Use a falsey value like False or None for no change. ip: string
IP address on which the scheduler will listen, defaults to only localhost
 diagnostics_port: int
Port on which the web will be provided. 8787 by default, use 0 to choose a random port,
None
to disable it, or an(ip:port)
tuple to listen on a different IP address than the scheduler. kwargs: dict
Extra worker arguments, will be passed to the Worker constructor.
 service_kwargs: Dict[str, Dict]
Extra keywords to hand to the running services
Examples
>>> c = LocalCluster() # Create a local cluster with as many workers as cores >>> c LocalCluster("127.0.0.1:8786", workers=8, ncores=8)
>>> c = Client(c) # connect to local cluster
Add a new worker to the cluster
>>> w = c.start_worker(ncores=2)
Shut down the extra worker
>>> c.remove_worker(w)
Pass extra keyword arguments to Bokeh
>>> LocalCluster(service_kwargs={'bokeh': {'prefix': '/foo'}})

close
(timeout=20)¶ Close the cluster

scale_down
(workers)¶ Remove
workers
from the clusterGiven a list of worker addresses this function should remove those workers from the cluster. This may require tracking which jobs are associated to which worker address.
This can be implemented either as a function or as a Tornado coroutine.

scale_up
(n, **kwargs)¶ Bring the total count of workers up to
n
This function/coroutine should bring the total number of workers up to the number
n
.This can be implemented either as a function or as a Tornado coroutine.

start_worker
(**kwargs)¶ Add a new worker to the running cluster
Parameters:  port: int (optional)
Port on which to serve the worker, defaults to 0 or random
 ncores: int (optional)
Number of threads to use. Defaults to number of logical cores
Returns:  The created Worker or Nanny object. Can be discarded.
Examples
>>> c = LocalCluster() >>> c.start_worker(ncores=2)

stop_worker
(w)¶ Stop a running worker
Examples
>>> c = LocalCluster() >>> w = c.start_worker(ncores=2) >>> c.stop_worker(w)
Command Line¶
This is the most fundamental way to deploy Dask on multiple machines. In production environments this process is often automated by some other resource manager and so it is rare that people need to follow these instructions explicitly. Instead, these instructions are useful for IT professionals who may want to set up automated services to deploy Dask within their institution.
A dask.distributed
network consists of one daskscheduler
process and
several daskworker
processes that connect to that scheduler. These are
normal Python processes that can be executed from the command line. We launch
the daskscheduler
executable in one process and the daskworker
executable in several processes, possibly on different machines.
Launch daskscheduler
on one node:
$ daskscheduler
Scheduler at: tcp://192.0.0.100:8786
Then launch daskworker
on the rest of the nodes, providing the address to
the node that hosts daskscheduler
:
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.1:12345
Registered to: tcp://192.0.0.100:8786
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.2:40483
Registered to: tcp://192.0.0.100:8786
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.3:27372
Registered to: tcp://192.0.0.100:8786
The workers connect to the scheduler, which then sets up a longrunning network connection back to the worker. The workers will learn the location of other workers from the scheduler.
Handling Ports¶
The scheduler and workers both need to accept TCP connections on an open port.
By default the scheduler binds to port 8786
and the worker binds to a
random open port. If you are behind a firewall then you may have to open
particular ports or tell Dask to listen on particular ports with the port
and workerport
keywords.:
daskscheduler port 8000
daskworker bokehport 8000 nannyport 8001
Nanny Processes¶
Dask workers are run within a Nanny process that monitors the worker process and restarts it if necessary.
Diagnostic Web Servers¶
Additionally Dask schedulers and workers host interactive diagnostic web
servers using Bokeh. These are optional, but
generally useful to users. The diagnostic server on the scheduler is
particularly valuable, and is served on port 8787
by default (configurable
with the bokehport
keyword).
 More information about relevant ports is available by looking at the help pages with
daskscheduler help
anddaskworker help
Automated Tools¶
There are various mechanisms to deploy these executables on a cluster, ranging from manually SSHing into all of the machines to more automated systems like SGE/SLURM/Torque or Yarn/Mesos. Additionally, cluster SSH tools exist to send the same commands to many machines. We recommend searching online for “cluster ssh” or “cssh”..
API¶
These may be outdated. We recommend referring to the help
text of your
installed version.
daskscheduler¶
$ daskscheduler help
Usage: daskscheduler [OPTIONS]
Options:
host TEXT URI, IP or hostname of this server
port INTEGER Serving port
interface TEXT Preferred network interface like 'eth0' or 'ib0'
tlscafile PATH CA cert(s) file for TLS (in PEM format)
tlscert PATH certificate file for TLS (in PEM format)
tlskey PATH private key file for TLS (in PEM format)
bokehport INTEGER Bokeh port for visual diagnostics
bokeh / nobokeh Launch Bokeh Web UI [default: True]
show / noshow Show web UI
bokehwhitelist TEXT IP addresses to whitelist for bokeh.
bokehprefix TEXT Prefix for the bokeh app
usexheaders BOOLEAN User xheaders in bokeh app for ssl termination in
header [default: False]
pidfile TEXT File to write the process PID
schedulerfile TEXT File to write connection information. This may be a
good way to share connection information if your
cluster is on a shared network file system.
localdirectory TEXT Directory to place scheduler files
preload TEXT Module that should be loaded by each worker process
like "foo.bar" or "/path/to/foo.py"
help Show this message and exit.
daskworker¶
$ daskworker help
Usage: daskworker [OPTIONS] [SCHEDULER]
Options:
tlscafile PATH CA cert(s) file for TLS (in PEM format)
tlscert PATH certificate file for TLS (in PEM format)
tlskey PATH private key file for TLS (in PEM format)
workerport INTEGER Serving computation port, defaults to random
nannyport INTEGER Serving nanny port, defaults to random
bokehport INTEGER Bokeh port, defaults to 8789
bokeh / nobokeh Launch Bokeh Web UI [default: True]
listenaddress TEXT The address to which the worker binds.
Example: tcp://0.0.0.0:9000
contactaddress TEXT The address the worker advertises to the
scheduler for communication with it and other
workers. Example: tcp://127.0.0.1:9000
host TEXT Serving host. Should be an ip address that is
visible to the scheduler and other workers.
See listenaddress and contactaddress if
you need different listen and contact
addresses. See interface.
interface TEXT Network interface like 'eth0' or 'ib0'
nthreads INTEGER Number of threads per process.
nprocs INTEGER Number of worker processes. Defaults to one.
name TEXT A unique name for this worker like 'worker1'
memorylimit TEXT Bytes of memory that the worker can use. This
can be an integer (bytes), float (fraction of
total system memory), string (like 5GB or
5000M), 'auto', or zero for no memory
management
reconnect / noreconnect Reconnect to scheduler if disconnected
nanny / nonanny Start workers in nanny process for management
pidfile TEXT File to write the process PID
localdirectory TEXT Directory to place worker files
resources TEXT Resources for task constraints like "GPU=2
MEM=10e9"
schedulerfile TEXT Filename to JSON encoded scheduler
information. Use with daskscheduler
schedulerfile
deathtimeout FLOAT Seconds to wait for a scheduler before closing
bokehprefix TEXT Prefix for the bokeh app
preload TEXT Module that should be loaded by each worker
process like "foo.bar" or "/path/to/foo.py"
help Show this message and exit.
SSH¶
The convenience script daskssh
opens several SSH connections to your
target computers and initializes the network accordingly. You can
give it a list of hostnames or IP addresses:
$ daskssh 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4
Or you can use normal UNIX grouping:
$ daskssh 192.168.0.{1,2,3,4}
Or you can specify a hostfile that includes a list of hosts:
$ cat hostfile.txt
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
$ daskssh hostfile hostfile.txt
The daskssh
utility depends on the paramiko
:
pip install paramiko
CLI Options¶
Launch a distributed cluster over SSH. A daskscheduler
process will run
on the first host specified in [HOSTNAMES] or in the hostfile (unless
scheduler
is specified explicitly). One or more daskworker
processes
will be run each host in [HOSTNAMES] or in the hostfile. Use command line
flags to adjust how many daskworker process are run on each host
(nprocs
) and how many cpus are used by each daskworker process
(nthreads
).
Options:
Note: This table may grow out of date, you should check daskssh help
to get an uptodate listing of all options.
Option  TYPE  Description 

–scheduler  TEXT  Specify scheduler node. Defaults to first address. 
–schedulerport  INTEGER  Specify scheduler port number. Defaults to port 8786. 
–nthreads  INTEGER  Number of threads per worker process. Defaults to number of cores divided by the number of processes per host. 
–nprocs  INTEGER  Number of worker processes per host. Defaults to one. 
–hostfile  PATH  Textfile with hostnames/IP addresses 
–sshusername  TEXT  Username to use when establishing SSH connections. 
–sshport  INTEGER  Port to use for SSH connections. 
–sshprivatekey  TEXT  Private key file to use for SSH connections. 
–logdirectory  PATH  Directory to use on all cluster nodes for the output of daskscheduler and daskworker commands. 
–remotepython  TEXT  Path to Python on remote nodes. 
–memorylimit  TEXT  Bytes of memory that the worker can use. This can be an integer (bytes), float(fraction of total system memory) string (like 5GB or 5000M), ‘auto’, or zero for no memory management. 
–workerport  INTEGER  Serving computation port, defaults to random 
–nannyport  INTEGER  Serving nanny port, defaults to random 
–nohost  Do not pass the hostname to the worker. 
High Performance Computers¶
Relevant Machines¶
This page includes instructions and guidelines when deploying Dask on high performance supercomputers commonly found in scientific and industry research labs. These systems commonly have the following attributes:
 Some mechanism to launch MPI applications or use job schedulers like SLURM, SGE, TORQUE, LSF, DRMAA, PBS, or others
 A shared network file system visible to all machines in the cluster
 A high performance network interconnect, such as Infiniband
 Little or no nodelocal storage
Where to start¶
Most of this page documents best practices to use Dask on an HPC cluster. This is technical and aimed both at users with some experience deploying Dask and also system administrators.
New users may instead prefer to start with one of the following projects, which provide easy highlevel access to Dask using resource managers that are commonly deployed on HPC systems:
 daskjobqueue for use with PBS, SLURM, and SGE resource managers
 daskdrmaa for use with any DRMAA compliant resource manager
They provide interfaces that look like the following:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(processes=18,
threads=4, memory="6GB",
project='P48500028',
queue='premium',
resource_spec='select=1:ncpus=36:mem=109G',
walltime='02:00:00')
cluster.start_workers(100) # Start 100 jobs that match the description above
from dask.distributed import Client
client = Client(cluster) # Connect to that cluster
Using MPI¶
You can launch a Dask network using mpirun
or mpiexec
and the
daskmpi
command line executable.
mpirun np 4 daskmpi schedulerfile /home/$USER/scheduler.json
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This depends on the mpi4py library. It only
uses MPI to start the Dask cluster, and not for internode communication. MPI
implementations differ. The use of mpirun np 4
is specific to the
mpich
MPI implementation installed through conda and linked to mpi4py
conda install mpi4py
It is not necessary to use exactly this implementation, but you may want to
verify that your mpi4py
Python library is linked against the proper
mpirun/mpiexec
executable and that the flags used (like np 4
) are
correct for your system. The system administrator of your cluster should be
very familiar with these concerns and able to help.
Run daskmpi help
to see more options for the daskmpi
command.
High Performance Network¶
Many HPC systems have both standard Ethernet networks as well as
highperformance networks capable of increased bandwidth. You can instruct
Dask to use the highperformance network interface by using the interface
keyword to the daskworker
, daskscheduler
, or daskmpi
commands
mpirun np 4 daskmpi schedulerfile /home/$USER/scheduler.json interface ib0
In the code example above we have assumed that your cluster has an Infiniband
network interface called ib0
. You can check this by asking your system
administrator or by inspecting the output of ifconfig
$ ifconfig
lo Link encap:Local Loopback # Localhost
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
eth0 Link encap:Ethernet HWaddr XX:XX:XX:XX:XX:XX # Ethernet
inet addr:192.168.0.101
...
ib0 Link encap:Infiniband # Fast InfiniBand
inet addr:172.42.0.101
https://stackoverflow.com/questions/43881157/howdoiuseaninfinibandnetworkwithdask
No Local Storage¶
Users often exceed memory limits available to a specific Dask deployment. In normal operation Dask spills excess data to disk. However, in HPC systems the individual compute nodes often lack locally attached storage, preferring instead to store data in a robust high performance network storage solution. As a result when a Dask cluster starts to exceed memory limits its workers can start making many small writes to the remote network file system. This is both inefficient (small writes to a network file system are much slower than local storage for this use case) and potentially dangerous to the file system itself.
See this page
for more information on Dask’s memory policies. Consider changing the
following values to your ~/.dask/config.yaml
file
# Fractions of worker memory at which we take action to avoid memory blowup
# Set any of the lower three values to False to turn off the behavior entirely
workermemorytarget: false # don't spill to disk
workermemoryspill: false # don't spill to disk
workermemorypause: 0.80 # fraction at which we pause worker threads
workermemoryterminate: 0.95 # fraction at which we terminate the worker
This stops Dask workers from spilling to disk, and instead relies entirely on mechanisms to stop them from processing when they reach memory limits.
As a reminder, you can set the memory limit for a worker using the
memorylimit
keyword:
daskmpi ... memorylimit 10GB
Alternatively if you do have local storage mounted on your compute nodes you
can point Dask workers to use a particular location in your filesystem using
the localdirectory
keyword:
daskmpi ... localdirectory /scratch
Launch Many Small Jobs¶
HPC job schedulers are optimized for large monolithic jobs with many nodes that all need to run as a group at the same time. Dask jobs can be quite a bit more flexible, workers can come and go without strongly affecting the job. So if we separate our job into many smaller jobs we can often get through the job scheduling queue much more quickly than a typical job. This is particularly valuable when we want to get started right away and interact with a Jupyter notebook session rather than waiting for hours for a suitable allocation block to become free.
So, to get a large cluster quickly we recommend allocating a daskscheduler process on one node with a modest wall time (the intended time of your session) and then allocating many small singlenode daskworker jobs with shorter wall times (perhaps 30 minutes) that can easily squeeze into extra space in the job scheduler. As you need more computation you can add more of these singlenode jobs or let them expire.
Use Dask to colaunch a Jupyter server¶
Dask can help you by launching other services alongside it. For example you
can run a Jupyter notebook server on the machine running the daskscheduler
process with the following commands
from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')
import socket
host = client.run_on_scheduler(socket.gethostname)
def start_jlab(dask_scheduler):
import subprocess
proc = subprocess.Popen(['/path/to/jupyter', 'lab', 'ip', host, 'nobrowser'])
dask_scheduler.jlab_proc = proc
client.run_on_scheduler(start_jlab)
Concrete Example with PBS¶
The Pangeo project maintains instructions on how to deploy Dask on various HPC systems maintained by NCAR using the PBS job scheduler. Their more concrete instructions may not apply to your situation in particular, but it may be helpful to see a full solution.
Kubernetes¶
Kubernetes and Helm¶
It is easy to launch a Dask cluster and Jupyter notebook server on cloud resources using Kubernetes and Helm.
This is particularly useful when you want to deploy a fresh Python environment on Cloud services, like Amazon Web Services, Google Compute Engine, or Microsoft Azure.
If you already have Python environments running in a preexisting Kubernetes cluster then you may prefer the Kubernetes native documentation, which is a bit lighter weight.
Launch Kubernetes Cluster¶
This document assumes that you have a Kubernetes cluster and Helm installed.
If this is not the case then you might consider setting up a Kubernetes cluster either on one of the common cloud providers like Google, Amazon, or Microsoft’s. We recommend the first part of the documentation in the guide Zero to JupyterHub that focuses on Kubernetes and Helm. You do not need to follow all of these instructions. JupyterHub is not necessary to deploy Dask:
Alternatively you may want to experiment with Kubernetes locally using Minikube.
Helm Install Dask¶
Dask maintains a Helm chart in the default stable channel at https://kubernetescharts.storage.googleapis.com . This should be added to your helm installation by default. You can update the known channels to make sure you have uptodate charts as follows:
helm repo update
Now you can launch Dask on your Kubernetes cluster using the Dask Helm chart:
helm install stable/dask
This deploys a daskscheduler
, several daskworker
processes, and
also an optional Jupyter server.
Verify Deployment¶
This might make a minute to deploy. You can check on the status with
kubectl
:
kubectl get pods
kubectl get services
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
baldeeljupyter924045334twtxd 0/1 ContainerCreating 0 1m
baldeelscheduler3074430035cn1dt 1/1 Running 0 1m
baldeelworker3032746726202jt 1/1 Running 0 1m
baldeelworker3032746726b8nqq 1/1 Running 0 1m
baldeelworker3032746726d0chx 0/1 ContainerCreating 0 1m
$ kubectl get services
NAME TYPE CLUSTERIP EXTERNALIP PORT(S) AGE
baldeeljupyter LoadBalancer 10.11.247.201 35.226.183.149 80:30173/TCP 2m
baldeelscheduler LoadBalancer 10.11.245.241 35.202.201.129 8786:31166/TCP,80:31626/TCP 2m
kubernetes ClusterIP 10.11.240.1 <none> 443/TCP
48m
You can use the addresses under EXTERNALIP
to connect to your nowrunning
Jupyter and Dask systems.
Notice the name baldeel
. This is the name that Helm has given to your
particular deployment of Dask. You could, for example, have multiple
DaskandJupyter clusters running at once and each would be given a different
name. You will use this name to refer to your deployment in the future. You
can list all active helm deployments with:
helm list
NAME REVISION UPDATED STATUS CHART NAMESPACE
baldeel 1 Wed Dec 6 11:19:54 2017 DEPLOYED dask0.1.0 default
Connect to Dask and Jupyter¶
When we ran kubectl get services
we saw some externally visible IPs
mrocklin@pangeo181919:~$ kubectl get services
NAME TYPE CLUSTERIP EXTERNALIP PORT(S) AGE
baldeeljupyter LoadBalancer 10.11.247.201 35.226.183.149 80:30173/TCP 2m
baldeelscheduler LoadBalancer 10.11.245.241 35.202.201.129 8786:31166/TCP,80:31626/TCP 2m
kubernetes ClusterIP 10.11.240.1 <none> 443/TCP 48m
We can navigate to these from any web browser. One is the Dask diagnostic
dashboard. The other is the Jupyter server. You can log into the Jupyter
notebook server with the password, dask
.
You can create a notebook and create a Dask client from there. The
DASK_SCHEDULER_ADDRESS
environment variable has been populated with the
address of the Dask scheduler. This is available in Python in the config
dictionary.
>>> from dask.distributed import Client, config
>>> config['scheduleraddress']
'baldeelscheduler:8786'
Although you don’t need to use this address, the Dask client will find this variable automatically.
from dask.distributed import Client, config
client = Client()
Configure Environment¶
By default the Helm deployment launches three workers using two cores each and a standard conda environment. We can customize this environment by creating a small yaml file that implements a subset of the values in the dask helm chart values.yaml file
For example we can increase the number of workers, and include extra conda and pip packages to install on the both the workers and Jupyter server (these two environments should be matched).
# config.yaml
worker:
replicas: 8
resources:
limits:
cpu: 2
memory: 7.5G
requests:
cpu: 2
memory: 7.5G
env:
 name: EXTRA_CONDA_PACKAGES
value: numba xarray c condaforge
 name: EXTRA_PIP_PACKAGES
value: s3fs daskml upgrade
# We want to keep the same packages on the worker and jupyter environments
jupyter:
enabled: true
env:
 name: EXTRA_CONDA_PACKAGES
value: numba xarray matplotlib c condaforge
 name: EXTRA_PIP_PACKAGES
value: s3fs daskml upgrade
This config file overrides configuration for number and size of workers and the conda and pip packages installed on the worker and Jupyter containers. In general we will want to make sure that these two software environments match.
Update your deployment to use this configuration file. Note that you will not use helm install for this stage. That would create a new deployment on the same Kubernetes cluster. Instead you will upgrade your existing deployment by using the current name:
helm upgrade baldeel stable/dask f config.yaml
This will update those containers that need to be updated. It may take a minute or so.
As a reminder, you can list the names of deployments you have using helm
list
Check status and logs¶
For standard issues you should be able to see worker status and logs using the
Dask dashboard (in particular see the worker links from the info/
page).
However if your workers aren’t starting you can check on the status of pods and
their logs with the following commands
kubectl get pods
kubectl logs <PODNAME>
mrocklin@pangeo181919:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
baldeeljupyter3805078281n1qk2 1/1 Running 0 18m
baldeelscheduler3074430035cn1dt 1/1 Running 0 58m
baldeelworker19318819141q09p 1/1 Running 0 18m
baldeelworker1931881914856mm 1/1 Running 0 18m
baldeelworker19318819149lgzb 1/1 Running 0 18m
baldeelworker1931881914bdn2c 1/1 Running 0 16m
baldeelworker1931881914jq70m 1/1 Running 0 17m
baldeelworker1931881914qsgj7 1/1 Running 0 18m
baldeelworker1931881914s2phd 1/1 Running 0 17m
baldeelworker1931881914srmmg 1/1 Running 0 17m
mrocklin@pangeo181919:~$ kubectl logs baldeelworker1931881914856mm
EXTRA_CONDA_PACKAGES environment variable found. Installing.
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /opt/conda/envs/dask:
The following NEW packages will be INSTALLED:
fasteners: 0.14.1py36_2 condaforge
monotonic: 1.3py36_0 condaforge
zarr: 2.1.4py36_0 condaforge
Proceed ([y]/n)?
monotonic1.3 100% ############################### Time: 0:00:00 11.16 MB/s
fasteners0.14 100% ############################### Time: 0:00:00 576.56 kB/s
...
Delete Helm deployment¶
You can always delete a helm deployment using its name:
helm delete baldeel purge
Note that this does not destroy any clusters that you may have allocated on a Cloud service, you will need to delete those explicitly.
Avoid the Jupyter Server¶
Sometimes you do not need to run a Jupyter server alongside your Dask cluster.
jupyter:
enabled: false
Kubernetes Native¶
See external documentation on DaskKubernetes for more information.
Kubernetes is a popular system for deploying distributed applications on clusters, particularly in the cloud. You can use Kubernetes to launch Dask workers in the following two ways:
Helm: You can launch a Dask scheduler, several workers, and an optional Jupyter Notebook server on a Kubernetes easily using Helm.
helm repo update # get latest helm charts helm install stable/dask # deploy standard dask chart
This is a good choice if you want to do the following:
 Run a managed Dask cluster for a long period of time
 Also deploy a Jupyter server from which to run code,
 Share the same Dask cluster between many automated services
 Try out Dask for the first time on a cloudbased system like Amazon, Google, or Microsoft Azure (see also our Cloud documentation)
See Dask and Helm documentation for more information.
Native: You can quickly deploy Dask workers on Kubernetes from within a Python script or interactive session using DaskKubernetes.
from dask_kubernetes import KubeCluster cluster = KubeCluster.from_yaml('workertemplate.yaml') cluster.scale(20) # add 20 workers cluster.adapt() # or create and destroy workers dynamically based on workload from dask.distributed import Client client = Client(cluster)
This is a good choice if you want to do the following:
 Dynamically create a personal and ephemeral deployment for interactive use
 Allow many individuals the ability to launch their own custom dask deployments, rather than depend on a centralized system
 Quickly adapt Dask cluster size to the current workload
See DaskKubernetes documentation for more information.
You may also want to see the documentation on using Dask with Docker containers to help you manage your software environments on Kubernetes.
Python API (advanced)¶
In some rare cases experts may want to create Scheduler
and Worker
objects explicitly in Python manually. This is often necessary when making
tools to automatically deploy Dask in custom settings.
However, often it is sufficient to rely on the Dask command line interface.
Scheduler¶
Start the Scheduler, provide the listening port (defaults to 8786) and Tornado
IOLoop (defaults to IOLoop.current()
)
from distributed import Scheduler
from tornado.ioloop import IOLoop
from threading import Thread
s = Scheduler()
s.start('tcp://:8786') # Listen on TCP port 8786
loop = IOLoop.current()
loop.start()
Alternatively, you may want the IOLoop and scheduler to run in a separate
thread. In that case you would replace the loop.start()
call with the
following:
t = Thread(target=loop.start, daemon=True)
t.start()
Worker¶
On other nodes start worker processes that point to the URL of the scheduler.
from distributed import Worker
from tornado.ioloop import IOLoop
from threading import Thread
w = Worker('tcp://127.0.0.1:8786')
w.start() # choose randomly assigned port
loop = IOLoop.current()
loop.start()
Alternatively, replace Worker
with Nanny
if you want your workers to be
managed in a separate process by a local nanny process. This allows workers to
restart themselves in case of failure, provides some additional monitoring, and
is useful when coordinating many workers that should live in different
processes to avoid the GIL.
Cloud Deployments¶
To get started running Dask on common Cloud providers like Amazon, Google, or Microsoft we currently recommend deploying Dask with Kubernetes and Helm.
All three major cloud vendors now provide managed Kubernetes services. This allows us to reliably provide the same experience across all clouds, and ensures that solutions for any one provider remain uptodate.
Data Access¶
You may want to install additional libraries in your Jupyter and worker images to access the object stores of each cloud
Historical Libraries¶
Dask previously maintained libraries for deploying Dask on Amazon’s EC2. Due to sporadic interest, and churn both within the Dask library and EC2 itself, these were not well maintained. They have since been deprecated in favor of the Kubernetes and Helm solution.
Adaptive Deployments¶
Motivation¶
Most Dask deployments are static, with a single scheduler and a fixed number of workers. This results in predictable behavior, but is wasteful of resources in two situations:
 The user may not be using the cluster, perhaps they are busy interpreting a recent result or plot, and so the workers sit idly, taking up valuable shared resources from other potential users
 The user may be very active, and is limited by their original allocation.
Particularly efficient users may learn to manually add and remove workers during their session, but this is rare. Instead, we would like the size of a Dask cluster to match the computational needs at any given time. This is the goal of the adaptive deployments discussed in this document. These are particularly helpful for interactive workloads, which are characterized by long periods of inactivity interrupted with short bursts of heavy activity. Adaptive deployments can result in both faster analyses that give users much more power but with much less pressure on computational resources.
Adaptive¶
To make setting up adaptive deployments easy, some Dask deployment solutions
offer an .adapt()
method. Here is an example with
dask_kubernetes.KubeCluster.
from dask_kubernetes import KubeCluster
cluster = KubeCluster()
cluster.adapt(minimum=0, maximum=100) # scale between 0 and 100 workers
For more keyword options, see the Adaptive class below:
Adaptive (scheduler, cluster[, interval, …]) 
Adaptively allocate workers based on scheduler load. 
Dependence on a Resource Manager¶
The Dask scheduler does not know how to launch workers on its own, instead it relies on an external resource scheduler like Kubernetes above, or Yarn, SGE, SLURM, Mesos, or some other inhouse system (see setup documentation for options). In order to use adaptive deployments you must provide some mechanism for the scheduler to launch new workers. Typically this is done by using one of the solutions listed in the setup documentation, or by subclassing from the Cluster superclass, and implementing that API
Cluster 
Superclass for cluster objects 
Scaling Heuristics¶
The Dask scheduler tracks a variety of information that is useful to correctly allocate the number of workers:
 The historical runtime of every function and task that it has seen, and all of the functions that it is currently able to run for users
 The amount of memory used and available on each worker
 Which workers are idle or saturated for various reasons, like the presence of specialized hardware
From these it is able to determine a target number of workers by dividing the
cumulative expected runtime of all pending tasks by the target_duration
parameter (defaults to five seconds). This number of workers serves as a
baseline request for the resource manager. This number can be altered for a
variety of reasons:
 If the cluster needs more memory then it will choose either the target number of workers, or twice the current number of workers, whichever is larger.
 If the target is outside of the range of the minimum and maximum values then it is clipped to fit within that range.
Additionally, when scaling down Dask preferentially chooses those workers that
are idle and have the least data in memory. It moves that data to other
machines before retiring the worker. To avoid rapid cycling of the cluster up
and down in size, we only retire a worker after a few cycles have gone by where
it has consistently been a good idea to retire it (controlled by the
wait_count
and interval
parameters.)
API¶

class
distributed.deploy.
Adaptive
(scheduler, cluster, interval='1s', startup_cost='1s', scale_factor=2, minimum=0, maximum=None, wait_count=3, target_duration='5s', **kwargs)¶ Adaptively allocate workers based on scheduler load. A superclass.
Contains logic to dynamically resize a Dask cluster based on current use. This class needs to be paired with a system that can create and destroy Dask workers using a cluster resource manager. Typically it is built into already existing solutions, rather than used directly by users. It is most commonly used from the
.adapt(...)
method of various Dask cluster classes.Parameters:  scheduler: distributed.Scheduler
 cluster: object
Must have scale_up and scale_down methods/coroutines
 startup_cost : timedelta or str, default “1s”
Estimate of the number of seconds for nnFactor representing how costly it is to start an additional worker. Affects quickly to adapt to high tasks per worker loads
 interval : timedelta or str, default “1000 ms”
Milliseconds between checks
 wait_count: int, default 3
Number of consecutive times that a worker should be suggested for removal before we remove it.
 scale_factor : int, default 2
Factor to scale by when it’s determined additional workers are needed
 target_duration: timedelta or str, default “5s”
Amount of time we want a computation to take. This affects how aggressively we scale up.
 minimum: int
Minimum number of workers to keep around
 maximum: int
Maximum number of workers to keep around
 **kwargs:
Extra parameters to pass to Scheduler.workers_to_close
Notes
Subclasses can override
Adaptive.should_scale_up()
andAdaptive.workers_to_close()
to control when the cluster should be resized. The default implementation checks if there are too many tasks per worker or too little memory available (seeAdaptive.needs_cpu()
andAdaptive.needs_memory()
).Adaptive.get_scale_up_kwargs()
method controls the arguments passed to the cluster’sscale_up
method.Examples
This is commonly used from existing Dask classes, like KubeCluster
>>> from dask_kubernetes import KubeCluster >>> cluster = KubeCluster() >>> cluster.adapt(minimum=10, maximum=100)
Alternatively you can use it from your own Cluster class by subclassing from Dask’s Cluster superclass
>>> from distributed.deploy import Cluster >>> class MyCluster(Cluster): ... def scale_up(self, n): ... """ Bring worker count up to n """ ... def scale_down(self, workers): ... """ Remove worker addresses from cluster """
>>> cluster = MyCluster() >>> cluster.adapt(minimum=10, maximum=100)

class
distributed.deploy.
Cluster
¶ Superclass for cluster objects
This expects a local Scheduler defined on the object. It provides common methods and an IPython widget display.
Clusters inheriting from this class should provide the following:
A local
Scheduler
object at.scheduler
scale_up and scale_down methods as defined below:
 def scale_up(self, n: int):
‘’’ Brings total worker count up to
n
‘’‘ def scale_down(self, workers: List[str]):
‘’’ Close the workers with the given addresses ‘’‘
This will provide a general
scale
method as well as an IPython widget for display.See also
LocalCluster
 a simple implementation with local workers
Examples
>>> from distributed.deploy import Cluster >>> class MyCluster(cluster): ... def scale_up(self, n): ... ''' Bring the total worker count up to n ''' ... pass ... def scale_down(self, workers): ... ''' Close the workers with the given addresses ''' ... pass
>>> cluster = MyCluster() >>> cluster.scale(5) # scale manually >>> cluster.adapt(minimum=1, maximum=100) # scale automatically
Docker Images¶
Example docker images are maintained at https://github.com/dask/daskdocker and https://hub.docker.com/r/daskdev/ .
Each image installs the full Dask conda package (including the distributed scheduler), Numpy, and Pandas on top of a Miniconda installation on top of a Debian image.
These images are large, around 1GB.
daskdev/dask
: This a normal debian + miniconda image with the full Dask conda package (including the distributed scheduler), Numpy, and Pandas. This image is about 1GB in size.
daskdev/dasknotebook
: This is based on the Jupyter basenotebook image and so is appropriate for use both normally as a Jupyter server, and also as part of a JupyterHub deployment. It also includes a matching Dask software environment described above. This image is about 2GB in size.
Extensibility¶
Users can mildly customize the software environment by populating the
environment variables EXTRA_APT_PACKAGES
, EXTRA_CONDA_PACKAGES
, and
EXTRA_PIP_PACKAGES
. If these environmet variables are set they will
trigger calls to the following respectively:
aptget install $EXTRA_APT_PACKAGES
conda install $EXTRA_CONDA_PACKAGES
pip install $EXTRA_PIP_PACKAGES
Note that using these can significantly delay the container from starting,
especially when using apt
, or conda
(pip
is relatively fast).
Remember that it is important for software versions to match between Dask workers and Dask clients. As a result it is often useful to include the same extra packages in both Jupyter and Worker images.
Source¶
Docker files are maintained at https://github.com/dask/daskdocker . This repository also includes a dockercompose configuration.
Use Cases¶
Dask is a versatile tool that supports a variety of workloads. This page contains brief and illustrative examples for how people use Dask in practice. This page emphasizes breadth and hopefully inspires readers to find new ways that Dask can serve them beyond their original intent.
Overview¶
Dask use cases can be roughly divided in the following two categories:
 Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to analyze large datasets with familiar techniques. This is similar to Databases, Spark, or big array libraries.
 Custom task scheduling. You submit a graph of functions that depend on each other for custom workloads. This is similar to Luigi, Airflow, Celery, or Makefiles.
Most people today approach Dask assuming it is a framework like Spark, designed for the first use case around large collections of uniformly shaped data. However, many of the more productive and novel use cases fall into the second category, using Dask to parallelize custom workflows.
Dask compute environments can be divided into the following two categories:
 Single machine parallelism with threads or processes: The Dask singlemachine scheduler leverages the full CPU power of a laptop or a large workstation and changes the space limitation from “fits in memory” to “fits on disk”. This scheduler is simple to use and doesn’t have the computational or conceptual overhead of most “big data” systems.
 Distributed cluster parallelism on multiple nodes: The Dask distributed scheduler coordinates the actions of multiple machines on a cluster. It scales anywhere from a single machine to a thousand machines, but not significantly beyond.
The single machine scheduler is useful to more individuals (more people have personal laptops than have access to clusters) and probably accounts for 80+% of the use of Dask today. The distributed machine scheduler is useful to larger organizations like universities, research labs, or private companies.
Below we give specific examples of how people use Dask. We start with large NumPy/Pandas/List examples because they’re somewhat more familiar to people looking at “big data” frameworks. We then follow with custom scheduling examples, which tend to be applicable more often, and are arguably a bit more interesting.
Collection Examples¶
Dask contains large parallel collections for ndimensional arrays (similar to NumPy), dataframes (similar to Pandas), and lists (similar to PyToolz or PySpark).
On disk arrays¶
Scientists studying the earth have 10GB to 100GB of regularly gridded weather data on their laptop’s hard drive stored as many individual HDF5 or NetCDF files. They use dask.array to treat this stack of HDF5 or NetCDF files as a single NumPy array (or a collection of NumPy arrays with the XArray project). They slice, perform reductions, perform seasonal averaging etc. all with straight Numpy syntax. These computations take a few minutes (reading 100GB from disk is somewhat slow) but previously infeasible computations become convenient from the comfort of a personal laptop.
It’s not so much parallel computing that is valuable here but rather the ability to comfortably compute on largerthanmemory data without special hardware.
import h5py
dataset = h5py.File('myfile.hdf5')['/x']
import dask.array as da
x = da.from_array(dataset, chunks=dataset.chunks)
y = x[::10]  x.mean(axis=0)
y.compute()
Directory of CSV or tabular HDF files¶
Analysts studying time series data have a large directory of CSV, HDF, or otherwise formatted tabular files. They usually use Pandas for this kind of data but either the volume is too large or dealing with a large number of files is confusing. They use dask.dataframe to logically wrap all of these different files into one logical dataframe that is built on demand to save space. Most of their Pandas workflow is the same (Dask.dataframe is a subset of Pandas) so they switch from Pandas to Dask.dataframe and back easily without significantly changing their code.
import dask.dataframe as dd
df = dd.read_csv('data/2016*.*.csv', parse_dates=['timestamp'])
df.groupby(df.timestamp.dt.hour).value.mean().compute()
Directory of CSV files on HDFS¶
The same analyst as above uses dask.dataframe with the dask.distributed scheduler to analyze terabytes of data on their institution’s Hadoop cluster straight from Python. This uses either the hdfs3 or pyarrow Python libraries for HDFS management
This solution is particularly attractive because it stays within the Python ecosystem, and uses the speed and algorithm set of Pandas, a tool with which the analyst is already very comfortable.
from dask.distributed import Client
client = Client('clusteraddress:8786')
import dask.dataframe as dd
df = dd.read_csv('hdfs://data/2016*.*.csv', parse_dates=['timestamp'])
df.groupby(df.timestamp.dt.hour).value.mean().compute()
Directories of custom format files¶
The same analyst has a bunch of files of a custom format not supported by Dask.dataframe, or perhaps these files are in a directory structure that encodes important information about his data (such as the date or other metadata.) They use dask.delayed to teach Dask.dataframe how to load the data and then pass into dask.dataframe for tabular algorithms.
 Example Notebook: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca
JSON data¶
Data Engineers with click stream data from a website or mechanical engineers with telemetry data from mechanical instruments have large volumes of data in JSON or some other semistructured format. They use dask.bag to manipulate many Python objects in parallel either on their personal machine, where they stream the data through memory or across a cluster.
import dask.bag as db
import json
records = db.read_text('data/2015**.json').map(json.loads)
records.filter(lambda d: d['name'] == 'Alice').pluck('id').frequencies()
Custom Examples¶
The large collections (array, dataframe, bag) are wonderful when they fit the application, for example if you want to perform a groupby on a directory of CSV data. However several parallel computing applications don’t fit neatly into one of these higher level abstractions. Fortunately, Dask provides a wide variety of ways to parallelize more custom applications. These use the same machinery as the arrays and dataframes, but allow the user to develop custom algorithms specific to their problem.
Embarrassingly parallel computation¶
A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.
They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an inhouse cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
def process(data):
...
return ...
Normal Sequential Processing:
results = [process(x) for x in inputs]
Build Dask Computation:
from dask import compute, delayed
values = [delayed(process)(x) for x in inputs]
Multiple Threads:
import dask.threaded
results = compute(*values, scheduler='threads')
Multiple Processes:
import dask.multiprocessing
results = compute(*values, scheduler='processes')
Distributed Cluster:
from dask.distributed import Client
client = Client("clusteraddress:8786")
results = compute(*values, scheduler='distributed')
Complex dependencies¶
A financial analyst has many models that depend on each other in a complex web of computations.
data = [load(fn) for fn in filenames]
reference = load_from_database(query)
A = [model_a(x, reference) for x in data]
B = [model_b(x, reference) for x in data]
roll_A = [roll(A[i], A[i + 1]) for i in range(len(A)  1)]
roll_B = [roll(B[i], B[i + 1]) for i in range(len(B)  1)]
compare = [compare_ab(a, b) for a, b in zip(A, B)]
results = summarize(compare, roll_A, roll_B)
These models are time consuming and need to be run on a variety of inputs and situations. The analyst has his code now as a collection of Python functions and is trying to figure out how to parallelize such a codebase. They use dask.delayed to wrap their function calls and capture the implicit parallelism.
from dask import compute, delayed
data = [delayed(load)(fn) for fn in filenames]
reference = delayed(load_from_database)(query)
A = [delayed(model_a)(x, reference) for x in data]
B = [delayed(model_b)(x, reference) for x in data]
roll_A = [delayed(roll)(A[i], A[i + 1]) for i in range(len(A)  1)]
roll_B = [delayed(roll)(B[i], B[i + 1]) for i in range(len(B)  1)]
compare = [delayed(compare_ab)(a, b) for a, b in zip(A, B)]
lazy_results = delayed(summarize)(compare, roll_A, roll_B)
They then depend on the dask schedulers to run this complex web of computations in parallel.
results = compute(lazy_results)
They appreciate how easy it was to transition from the experimental code to a scalable parallel version. This code is also easy enough for their teammates to understand easily and extend in the future.
Algorithm developer¶
A graduate student in machine learning is prototyping novel parallel algorithms. They are in a situation much like the financial analyst above except that they need to benchmark and profile their computation heavily under a variety of situations and scales. The dask profiling tools provide the feedback they need to understand their parallel performance, including how long each task takes, how intense communication is, and their scheduling overhead. They scale their algorithm between 1 and 50 cores on single workstations and then scale out to a cluster running their computation at thousands of cores. They don’t have access to an institutional cluster, so instead they use dask on the cloud to easily provision clusters of varying sizes.
Their algorithm is written the same in all cases, drastically reducing the cognitive load, and letting the readers of their work experiment with their system on their own machines, aiding reproducibility.
ScikitLearn or Joblib User¶
A data scientist wants to scale their machine learning pipeline to run on their
cluster to accelerate parameter searches. They already use the sklearn
njobs=
parameter to accelerate their computation on their local computer
with Joblib. Now they wrap their sklearn
code with a context manager to
parallelize the exact same code across a cluster (also available with
IPyParallel)
import distributed.joblib
with joblib.parallel_backend('distributed',
scheduler_host=('192.168.1.100', 8786)):
result = GridSearchCV( ... ) # normal sklearn code
Academic Cluster Administrator¶
A system administrator for a university compute cluster wants to enable many researchers to use the available cluster resources, which are currently lying idle. The research faculty and graduate students lack experience with job schedulers and MPI, but are comfortable interacting with Python code through a Jupyter notebook.
Teaching the faculty and graduate students to parallelize software has proven time consuming. Instead the administrator sets up dask.distributed on a sandbox allocation of the cluster and broadly publishes the address of the scheduler, pointing researchers to the dask.distributed quickstart. Utilization of the cluster climbs steadily over the next week as researchers are more easily able to parallelize their computations without having to learn foreign interfaces. The administrator is happy because resources are being used without significant handholding.
As utilization increases the administrator has a new problem; the shared dask.distributed cluster is being overused. The administrator tracks use through Dask diagnostics to identify which users are taking most of the resources. They contact these users and teach them how to launch their own dask.distributed clusters using the traditional job scheduler on their cluster, making space for more new users in the sandbox allocation.
Financial Modeling Team¶
Similar to the case above, a team of modelers working at a financial institution run a complex network of computational models on top of each other. They started using dask.delayed individually, as suggested above, but realized that they often perform highly overlapping computations, such as always reading the same data.
Now they decide to use the same Dask cluster collaboratively to save on these costs. Because Dask intelligently hashes computations in a way similar to how Git works, they find that when two people submit similar computations the overlapping part of the computation runs only once.
Ever since working collaboratively on the same cluster they find that their frequently running jobs run much faster, because most of the work is already done by previous users. When they share scripts with colleagues they find that those repeated scripts complete immediately rather than taking several hours.
They are now able to iterate and share data as a team more effectively, decreasing their time to result and increasing their competitive edge.
As this becomes more heavily used on the company cluster they decide to set up
an autoscaling system. They use their dynamic job scheduler (perhaps SGE,
LSF, Mesos, or Marathon) to run a single daskscheduler
24/7 and then scale
up and down the number of daskworkers
running on the cluster based on
computational load. This solution ends up being more responsive (and thus more
heavily used) than their previous attempts to provide institutionwide access
to parallel computing but because it responds to load it still acts as a good
citizen in the cluster.
Streaming data engineering¶
A data engineer responsible for watching a data feed needs to scale out a continuous process. They combine dask.distributed with normal Python Queues to produce a rudimentary but effective stream processing system.
Because dask.distributed is elastic, they can scale up or scale down their cluster resources in response to demand.
Examples¶
Array¶
Creating Dask arrays from NumPy arrays¶
We can create Dask arrays from any object that implements NumPy slicing, like a
numpy.ndarray
or ondisk formats like h5py or netCDF Dataset objects. This
is particularly useful with on disk arrays that don’t fit in memory but, for
simplicity’s sake, we show how this works on a NumPy array.
The following example uses da.from_array
to create a Dask array from a NumPy
array, which isn’t particularly valuable (the NumPy array already works in
memory just fine) but is easy to play with.
>>> import numpy as np
>>> import dask.array as da
>>> x = np.arange(1000)
>>> y = da.from_array(x, chunks=(100))
>>> y.mean().compute()
499.5
Creating Dask arrays from HDF5 Datasets¶
We can construct dask array objects from other array objects that support
numpystyle slicing. In this example, we wrap a dask array around an HDF5 dataset,
chunking that dataset into blocks of size (1000, 1000)
:
>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> dset = f['/data/path']
>>> import dask.array as da
>>> x = da.from_array(dset, chunks=(1000, 1000))
Often we have many such datasets. We can use the stack
or concatenate
functions to bind many dask arrays into one:
>>> dsets = [h5py.File(fn)['/data'] for fn in sorted(glob('myfiles.*.hdf5')]
>>> arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
>>> x = da.stack(arrays, axis=0) # Stack along a new first axis
Note that none of the data is loaded into memory yet, the dask array just
contains a graph of tasks showing how to load the data. This allows
dask.array
to do work on datasets that don’t fit into RAM.
Creating random arrays¶
In a simple case, we can create arrays with random data using the da.random
module.
>>> import dask.array as da
>>> x = da.random.normal(0, 1, size=(100000,100000), chunks=(1000, 1000))
>>> x.mean().compute()
0.0002280808453825202
Build Custom Dask.Array Function¶
As discussed in the array design document to create a
dask Array
object we need the following:
 A dask graph
 A name specifying a set of keys within that graph
 A
chunks
tuple giving chunk shape information  A NumPy dtype
Often dask.array
functions take other Array
objects as inputs along
with parameters, add tasks to a new dask dictionary, create a new chunks
tuple, and then construct and return a new Array
object. The hard parts
are invariably creating the right tasks and creating a new chunks
tuple.
Careful review of the array design document is suggested.
Example eye¶
Consider this simple example with the eye
function.
from dask.base import tokenize
def eye(n, blocksize):
chunks = ((blocksize,) * n // blocksize,
(blocksize,) * n // blocksize)
name = 'eye' + tokenize(n, blocksize) # unique identifier
dsk = {(name, i, j): (np.eye, blocksize)
if i == j else
(np.zeros, (blocksize, blocksize))
for i in range(n // blocksize)
for j in range(n // blocksize)}
dtype = np.eye(0).dtype # take dtype default from numpy
return Array(dsk, name, chunks, dtype)
This example is particularly simple because it doesn’t take any Array
objects as input.
Example diag¶
Consider the function diag
that takes a 1d vector and produces a 2d matrix
with the values of the vector along the diagonal. Consider the case where the
input is a 1d array with chunk sizes (2, 3, 4)
in the first dimension like
this:
[x_0, x_1], [x_2, x_3, x_4], [x_5, x_6, x_7, x_8]
We need to create a 2d matrix with chunks equal to ((2, 3, 4), (2, 3, 4))
where the ith block along the diagonal of the output is the result of calling
np.diag
on the ith
block of the input and all other blocks are zero.
from dask.base import tokenize
def diag(v):
"""Construct a diagonal array, with ``v`` on the diagonal."""
assert v.ndim == 1
chunks = (v.chunks[0], v.chunks[0]) # repeat chunks twice
name = 'diag' + tokenize(v) # unique identifier
dsk = {(name, i, j): (np.diag, (v.name, i))
if i == j else
(np.zeros, (v.chunks[0][i], v.chunks[0][j]))
for i in range(len(v.chunks[0]))
for j in range(len(v.chunks[0]))}
dsk.update(v.dask) # include dask graph of the input
dtype = v.dtype # output has the same dtype as the input
return Array(dsk, name, chunks, dtype)
>>> x = da.arange(9, chunks=((2, 3, 4),))
>>> x
dask.array<arange1, shape=(9,), chunks=((2, 3, 4)), dtype=int64>
>>> M = diag(x)
>>> M
dask.array<diag2, shape=(9, 9), chunks=((2, 3, 4), (2, 3, 4)), dtype=int64>
>>> M.compute()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 6, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 7, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 8]])
Example Lazy Reader¶
Dask may also be used as a lazy loader. Consider the following function which takes filenames and a reader:
from dask.array import Array
from dask.base import tokenize
def read_custom(reader, filenames):
'''
This creates a dask array based on numpy files of the same length.
Parameters

reader: callable
The function that reads the files
The reader should take a filename as an argument and return a numpy
array (np.ndarray instance).
filenames : List[str]
the names of the files of the same length.
These must be numpy files of same shape and dtype
This will concatenate them together as the same dask array.
Examples

>>> read_custom(np.load, ['foo1.npy', 'foo1.npy'])
>>> read_custom(skimage.io.imread, ['1.jpg', '2.jpg', '3.jpg'])
'''
# Read one file to get example shape and dtype
example = reader(filenames[0])
chunks = ((1,) * len(filenames),) + tuple((d,) for d in example.shape)
name = 'read_custom' + tokenize(reader, *filenames) # unique identifier
dsk = {(name, i, 0, 0): (operator.getitem,
(reader, fn), # read array from file
(None, Ellipsis)) # add extra dimension like x[None, ...]
for i, fn in enumerate(filenames)}
return Array(dsk, name, chunks, example.dtype)
This may be useful when processing time series of images, for instance. Alternatively, people often do this in practice by just using dask.delayed as in the following example:
import skimage.io
import dask.array as da
from dask import delayed
imread = delayed(skimage.io.imread, pure=True) # Lazy version of imread
filenames = sorted(glob.glob('*.jpg'))
lazy_images = [imread(url) for url in urls] # Lazily evaluate imread on each url
arrays = [da.from_delayed(lazy_image, # Construct a small Dask array
dtype=sample.dtype, # for every lazy value
shape=sample.shape)
for lazy_value in lazy_values]
stack = da.stack(arrays, axis=0) # Stack all small Dask arrays into one
Bag¶
Read JSON records from disk¶
We commonly use dask.bag
to process unstructured or semistructured data:
>>> import dask.bag as db
>>> import json
>>> js = db.read_text('logs/2015*.json.gz').map(json.loads)
>>> js.take(2)
({'name': 'Alice', 'location': {'city': 'LA', 'state': 'CA'}},
{'name': 'Bob', 'location': {'city': 'NYC', 'state': 'NY'})
>>> result = js.pluck('name').frequencies() # just another Bag
>>> dict(result) # Evaluate Result
{'Alice': 10000, 'Bob': 5555, 'Charlie': ...}
Word count¶
In this example, we’ll use dask
to count the number of words in text files
(Enron email dataset, 6.4 GB) both locally and on a cluster (along with the
distributed and hdfs3 libraries).
Local computation¶
Download the first text file (76 MB) in the dataset to your local machine:
$ wget https://s3.amazonaws.com/blazedata/enronemail/edrmenronv2_allenp_xml.zip/merged.txt
Import dask.bag
and create a bag
from the single text file:
>>> import dask.bag as db
>>> b = db.read_text('merged.txt', blocksize=10000000)
View the first ten lines of the text file with .take()
:
>>> b.take(10)
('Date: Tue, 26 Sep 2000 09:26:00 0700 (PDT)\r\n',
'From: Phillip K Allen\r\n',
'To: pallen70@hotmail.com\r\n',
'Subject: Investment Structure\r\n',
'XSDOC: 948896\r\n',
'XZLID: zledrmenronv2allenp1713.eml\r\n',
'\r\n',
' Forwarded by Phillip K Allen/HOU/ECT on 09/26/2000 \r\n',
'04:26 PM \r\n',
'\r\n')
We can write a word count expression using the bag
methods to split the
lines into words, concatenate the nested lists of words into a single list,
count the frequencies of each word, then list the top 10 words by their count:
>>> wordcount = b.str.split().flatten().frequencies().topk(10, lambda x: x[1])
Note that the combined operations in the previous expression are lazy. We can
trigger the word count computation using .compute()
:
>>> wordcount.compute()
[('P', 288093),
('1999', 280917),
('2000', 277093),
('FO', 255844),
('AC', 254962),
('1', 240458),
('0', 233198),
('2', 224739),
('O', 223927),
('3', 221407)]
This computation required about 7 seconds to run on a laptop with 8 cores and 16 GB RAM.
Cluster computation with HDFS¶
Next, we’ll use dask
along with the distributed and hdfs3 libraries
to count the number of words in all of the text files stored in a Hadoop
Distributed File System (HDFS).
Copy the text data from Amazon S3 into HDFS on the cluster:
$ hadoop distcp s3n://AWS_SECRET_ID:AWS_SECRET_KEY@blazedata/enronemail hdfs:///tmp/enron
where AWS_SECRET_ID
and AWS_SECRET_KEY
are valid AWS credentials.
We can now start a distributed
scheduler and workers on the cluster,
replacing SCHEDULER_IP
and SCHEDULER_PORT
with the IP address and port of
the distributed
scheduler:
$ daskscheduler # On the head node
$ daskworker SCHEDULER_IP:SCHEDULER_PORT nprocs 4 nthreads 1 # On the compute nodes
Because our computations use pure Python rather than numeric libraries (e.g., NumPy, pandas), we started the workers with multiple processes rather than with multiple threads. This helps us avoid issues with the Python Global Interpreter Lock (GIL) and increases efficiency.
In Python, import the hdfs3
and the distributed
methods used in this
example:
>>> from dask.distributed import Client, progress
Initialize a connection to the distributed
executor:
>>> client = Client('SCHEDULER_IP:SCHEDULER_PORT')
Create a bag
from the text files stored in HDFS. This expression will not
read data from HDFS until the computation is triggered:
>>> import dask.bag as db
>>> b = db.read_text('hdfs:///tmp/enron/*/*')
We can write a word count expression using the same bag
methods as the
local dask
example:
>>> wordcount = b.str.split().flatten().frequencies().topk(10, lambda x: x[1])
We are ready to count the number of words in all of the text files using
distributed
workers. We can map the wordcount
expression to a future
that triggers the computation on the cluster.
>>> future = client.compute(wordcount)
Note that the compute
operation is nonblocking, and you can continue to
work in the Python shell/notebook while the computations are running.
We can check the status of the future
while all of the text files are being
processed:
>>> print(future)
<Future: status: pending, key: finalize0f2f51e2350a886223f11e5a1a7bc948>
>>> progress(future)
[########################################]  100% Completed  8min 15.2s
This computation required about 8 minutes to run on a cluster with three worker
machines, each with 4 cores and 16 GB RAM. For comparison, running the same
computation locally with dask
required about 20 minutes on a single machine
with the same specs.
When the future
finishes reading in all of the text files and counting
words, the results will exist on each worker. To sum the word counts for all of
the text files, we need to gather the results from the dask.distributed
workers:
>>> results = client.gather(future)
Finally, we print the top 10 words from all of the text files:
>>> print(results)
[('0', 67218227),
('the', 19588747),
('', 14126955),
('to', 11893912),
('N/A', 11814994),
('of', 11725144),
('and', 10254267),
('in', 6685245),
('a', 5470711),
('or', 5227787)]
The complete Python script for this example is shown below:
# wordcount.py
# Local computation
import dask.bag as db
b = db.read_text('merged.txt')
b.take(10)
wordcount = b.str.split().flatten().frequencies().topk(10, lambda x: x[1])
wordcount.compute()
# Cluster computation with HDFS
from dask.distributed import Client, progress
client = Client('SCHEDULER_IP:SCHEDULER_PORT')
b = db.read_text('hdfs:///tmp/enron/*/*')
wordcount = b.str.split().flatten().frequencies().topk(10, lambda x: x[1])
future = client.compute(wordcount)
print(future)
progress(future)
results = client.gather(future)
print(results)
DataFrame¶
Dataframes from CSV files¶
Suppose we have a collection of CSV files with data:
data1.csv:
time,temperature,humidity
0,22,58
1,21,57
2,25,57
3,26,55
4,22,53
5,23,59
data2.csv:
time,temperature,humidity
0,24,85
1,26,83
2,27,85
3,25,92
4,25,83
5,23,81
data3.csv:
time,temperature,humidity
0,18,51
1,15,57
2,18,55
3,19,51
4,19,52
5,19,57
and so on.
We can create Dask dataframes from CSV files using dd.read_csv
.
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
We can work with the Dask dataframe as usual, which is composed of Pandas dataframes. We can list the first few rows.
>>> df.head()
time temperature humidity
0 0 22 58
1 1 21 57
2 2 25 57
3 3 26 55
4 4 22 53
Or we can compute values over the entire dataframe.
>>> df.temperature.mean().compute()
22.055555555555557
>>> df.humidity.std().compute()
14.710829233324224
Dataframes from HDF5 files¶
This section provides working examples of dask.dataframe
methods to read HDF5 files. HDF5 is a unique technology suite that makes possible the management of large and complex data collections. To learn more about HDF5, visit the HDF Group Tutorial page. For an overview of dask.dataframe
, its limitations, scope, and use, see the DataFrame overview section.
Important Note – dask.dataframe.read_hdf
uses pandas.read_hdf
, thereby inheriting its abilities and limitations. See pandas HDF5 documentation for more information.
Examples Covered¶
 Use
dask.dataframe
to:
 Create dask DataFrame by loading a specific dataset (key) from a single HDF5 file
 Create dask DataFrame from a single HDF5 file with multiple datasets (keys)
 Create dask DataFrame by loading multiple HDF5 files with different datasets (keys)
Generate Example Data¶
Here is some code to generate sample HDF5 files.
import string, json, random
import pandas as pd
import numpy as np
# dict to keep track of hdf5 filename and each key
fileKeys = {}
for i in range(10):
# randomly pick letter as dataset key
groupkey = random.choice(list(string.ascii_lowercase))
# randomly pick a number as hdf5 filename
filename = 'my' + str(np.random.randint(100)) + '.h5'
# Make a dataframe; 26 rows, 2 columns
df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
'y': np.random.randint(1, 1000, 26)},
index=list(string.ascii_lowercase))
# Write hdf5 to current directory
df.to_hdf(filename, key='/' + groupkey, format='table')
fileKeys[filename] = groupkey
print(fileKeys) # prints hdf5 filenames and keys for each
Read single dataset from HDF5¶
The first order of dask.dataframe
business is creating a dask DataFrame using a single HDF5 file’s dataset. The code to accomplish this task is:
import dask.dataframe as dd
df = dd.read_hdf('my86.h5', key='/c')
Load multiple datasets from single HDF5 file¶
Loading multiple datasets from a single file requires a small tweak and use of the wildcard character:
import dask.dataframe as dd
df = dd.read_hdf('my86.h5', key='/*')
Learn more about dask.dataframe
methods by visiting the API documentation.
Create dask DataFrame from multiple HDF5 files¶
The next example is a natural progression from the previous example (e.g. using a wildcard). Add a wildcard for the key and path parameters to read multiple files and multiple keys:
import dask.dataframe as dd
df = dd.read_hdf('./*.h5', key='/*')
These exercises cover the basics of using dask.dataframe
to work with HDF5 data. For more information on the user functions to manipulate and explore dataframes (visualize, describe, compute, etc.) see API documentation. To explore the other data formats supported by dask.dataframe
, visit the section on creating dataframes .
Delayed¶
Build Custom Arrays¶
Here we have a serial blocked computation for computing the mean of all positive elements in a large, on disk array:
x = h5py.File('myfile.hdf5')['/x'] # Trillion element array on disk
sums = []
counts = []
for i in range(1000000): # One million times
chunk = x[1000000*i:1000000*(i + 1)] # Pull out chunk
positive = chunk[chunk > 0] # Filter out negative elements
sums.append(positive.sum()) # Sum chunk
counts.append(positive.size) # Count chunk
result = sum(sums) / sum(counts) # Aggregate results
Below is the same code, parallelized using dask.delayed
:
x = delayed(h5py.File('myfile.hdf5')['/x']) # Trillion element array on disk
sums = []
counts = []
for i in range(1000000): # One million times
chunk = x[1000000*i:1000000*(i + 1)] # Pull out chunk
positive = chunk[chunk > 0] # Filter out negative elements
sums.append(positive.sum()) # Sum chunk
counts.append(positive.size) # Count chunk
result = delayed(sum)(sums) / delayed(sum)(counts) # Aggregate results
result.compute() # Perform the computation
Only 3 lines had to change to make this computation parallel instead of serial.
 Wrap the original array in
delayed
. This makes all the slices on it returnDelayed
objects.  Wrap both calls to
sum
withdelayed
.  Call the
compute
method on the result.
While the for loop above still iterates fully, it’s just building up a graph of the computation that needs to happen, without actually doing any computing.
Data Processing Pipelines¶
Now, rebuilding the example from custom graphs:
from dask import delayed, value
@delayed
def load(filename):
...
@delayed
def clean(data):
...
@delayed
def analyze(sequence_of_data):
...
@delayed
def store(result):
with open(..., 'w') as f:
f.write(result)
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)
stored.compute()
This builds the same graph as seen before, but using normal Python syntax. In
fact, the only difference between Python code that would do this in serial, and
the parallel version with dask is the delayed
decorators on the functions, and
the call to compute
at the end.
Distributed Concurrent.futures¶
Tutorial¶
A Dask tutorial from July 2015 (fairly old) is available here: https://github.com/dask/dasktutorial
Community¶
Dask is used and developed by individuals at a variety of institutions. It sits within the broader Python numeric ecosystem commonly referred to as PyData or SciPy.
Discussion¶
Conversation happens in the following places:
 Usage questions are directed to Stack Overflow with the #dask tag. Dask developers monitor this tag and get emails whenever a question is asked.
 Bug reports and feature requests are managed on the GitHub issue tracker
 Chat occurs on at gitter.im/dask/dask for general conversation and gitter.im/dask/dev for developer conversation. Note that because gitter chat is not searchable by future users we discourage usage questions and bug reports on gitter and instead ask people to use Stack Overflow or GitHub.
 Weekly developer meeting usually happens at Thursday 4pm UTC (11am in New York, 8am in Los Angeles, 12am in Beijing) at https://appear.in/daskdev with meeting notes on a publicly viewable document Subscribe to this public calendar to receive updates if the meeting is rescheduled or canceled.
Asking for help¶
We welcome usage questions and bug reports from all users, even those who are new to using the project. There are a few things you can do to improve the likelihood of quickly getting a good answer.
 Ask questions in the right place. In particular we strongly prefer the use of StackOverflow and Github issues over Gitter chat. Github and StackOverflow are more easily searchable by future users and so is more efficient for everyone’s time. Gitter chat is strictly reserved for developer and community discussion.
 Create a minimal example. It is ideal to create minimal, complete, verifiable examples. This significantly reduces the time that answerers spend understanding your situation and so results in higher quality answers more quickly.
Paid support¶
Dask is an open source project that originated at Anaconda Inc.. In addition to the previous options, Anaconda offers paid training and support: https://www.anaconda.com/support.
Why Dask?¶
This document gives highlevel motivation on why people choose to adopt Dask.
Python’s role in Data Science¶
Python has grown to become the dominant language both in data analytics, and general programming:
This is fueled both by computational libraries like Numpy, Pandas, and ScikitLearn and by a wealth of libraries for visualization, interactive notebooks, collaboration, and so forth.
However these packages were not designed to scale beyond a single machine. Dask was developed to scale these packages and the surrounding ecosystem. Dask works with the existing Python ecosystem to scale it to multicore machines and distributed clusters.
Familiar API¶
Analysts often use tools like Pandas, ScikitLearn, Numpy, and the rest of the Python ecosystem to analyze data on their personal computer. They like these tools because they are efficient, intuitive, and widely trusted. However when they choose to apply their analyses to larger datasets they find that these tools were not designed to scale beyond a single machine, and so the analyst is forced to rewrite their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.
Dask provides ways to scale Pandas, ScikitLearn, and Numpy workflows with minimal rewriting. Dask integrates well with these tools so that it copies most of their API and uses their data structures internally. Dask is codeveloped with these libraries to ensure that they evolve consistently, minimizing friction caused from transitioning from workloads on a local laptop, to a multicore workstation, to a distributed cluster. Analysts familiar with Pandas/ScikitLearn/Numpy will be immediately familiar with their Dask equivalents, and have much of their intuition carry over to a scalable context.
Scales out to clusters¶
As datasets and computations scale faster than CPUs and RAM we need to find ways to scale our computations across multiple machines. This introduces many new concerns:
 How to have computers talk to each other over the network?
 How and when to move data between machines?
 How to recover from machine failures?
 How to deploy on an inhouse cluster?
 How to deploy on the cloud?
 How to deploy on an HPC supercomputer?
 How to provide an API to this system that users find intuitive?
 …
While it is possible to build these systems inhouse (and indeed, many exist) many organizations are increasingly depending on solutions developed within the open source community. These tend to be more robust, secure, and fully featured without being tended by inhouse staff.
Dask solves these problems. It is routinely run on thousandmachine clusters to process hundreds of terabytes of data efficiently. It has utilities and documentation on how to deploy inhouse, on the cloud, or on HPC supercomputers. It supports encryption and authentication using TLS/SSL certificates. It is resilient and can handle the failure of worker nodes gracefully and is elastic and so can take advantage of new nodes added onthefly. Dask includes several user APIs that are used and smoothed over by thousands of researchers across the globe working in different domains.
Scales down to single computers¶
But a massive cluster is not always the right choice
Today’s laptops and workstations are surprisingly powerful and, if used correctly, can often handle datasets and computations for which we previously depended on clusters. A modern laptop has a multicore CPU, 32GB of RAM, and flashbased hard drives that can stream through data several times faster than HDDs or SSDs of even a year or two ago.
As a result analysts can often manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with the cluster at all. They sometimes prefer this for the following reasons:
 They can use their local software environment, rather than being constrained by what is available on the cluster
 They can more easily work while in transit, at a coffee shop, or at home away from the VPN
 Debugging errors and analyzing performance are generally much easier on a single machine without having to pore through logs
 Generally their iteration cycles are faster
 Their computations may be more efficient because all of the data is local and doesn’t need to flow through the network or between separate processes
Dask can enable efficient parallel computations on single machines by leveraging their multicore CPUs and streaming data efficiently from disk. Dask can run on a distributed cluster, but it doesn’t have to. Dask allows you to swap out the cluster for singlemachine schedulers are surprisingly lightweight, require no setup, and can run entirely within the same process as the user’s session.
To avoid excess memory use Dask is good at finding ways to evaluate computations in a lowmemory footprint when possible by pulling in chunks of data from disk, doing the necessary processing, and throwing away intermediate values as quickly as possible. This lets analysts perform computations on moderately large datasets (100GB+) even on relatively lowpower laptops. This requires no configuration and no setup, meaning that adding Dask to a singlemachine computation adds very little cognitive overhead.
Integrates with the Python ecosystem¶
Python includes computational libraries like Numpy, Pandas, and ScikitLearn, along with thousands of others in data access, plotting, statistics, image and signals processing, and more. These libraries work together seamlessly to produce a cohesive ecosystem of packages that coevolve to meet the needs of analysts in many domains.
This ecosystem is tied together by common standards and protocols to which everyone adheres, which allows these packages to benefit each other in surprising and delightful ways.
Dask evolved from within this ecosystem. It abides by these standards and protocols and actively engages in community efforts to push forward new ones. This enables the rest of the ecosystem to benefit from parallel and distributed computing with minimal coordination. Dask does not seek to disrupt or displace the existing ecosystem, but rather to complement and benefit it from within.
As a result, Dask development is pushed forward by developer communities from Pandas, Numpy, ScikitLearn, ScikitImage, Jupyter, and others. This engagement from the broader community growth helps users to trust the project and helps to ensure that the Python ecosystem will continue to evolve in a smooth and sustainable manner.
Supports complex applications¶
Some parallel computations are simple and just apply the same routine onto many inputs without any kind of coordination. These are simple to parallelize with any system.
Somewhat more complex computations can be expressed with the mapshufflereduce pattern popularized by Hadoop and Spark. This is often sufficient to do most data cleaning tasks, databasestyle queries, and some lightweight machine learning algorithms.
However, more complex parallel computations exist which do not fit into these paradigms and so are difficult to perform with traditional bigdata technologies. These include more advanced algorithms for statistics or machine learning, time series or local operations, or bespoke parallelism often found within the systems of large enterprises.
Many companies and institutions today have problems which are clearly parallelizable, but not clearly transformable into a big dataframe computation. Today these companies tend to solve their problems either by writing custom code with lowlevel systems like MPI, ZeroMQ, or sockets and complex queuing systems, or by shoving their problem into a standard bigdata technology like MapReduce or Spark, and hoping for the best.
Dask helps to resolve these situations by exposing lowlevel APIs to its internal task scheduler which is capable of executing very advanced computations. This gives engineers within the institution the ability to build their own parallel computing system using the same engine that powers Dask’s arrays, dataframes, and machine learning algorithms, but now with the institution’s own custom logic. This allows inhouse engineers to keep complex business logic inhouse while still relying on Dask to handle network communication, load balancing, resilience, diagnostics, etc..
Responsive feedback¶
Because everything happens remotely interactive parallel computing can be frustrating for users. They don’t have a good sense how computations are progressing, what might be going wrong, or what parts of their code they should focus on for performance. The added distance between a user and their computation can drastically affect how quickly they are able to identify and resolve bugs and performance problems, which can drastically increase their time to solution.
Dask keeps users informed and content with a suite of helpful diagnostic and investigative tools including the following:
 A realtime and responsive dashboard that shows current progress, communication costs, memory use, and more, updated every 100ms
 A statistical profiler installed on every worker that polls each thread every 10ms to determine which lines in your code are taking up the most time across your entire computation.
 An embedded IPython kernel in every worker and the scheduler, allowing users to directly investigate the state of their computation with a popup terminal
 The ability to reraise errors locally, so that they can use the traditional debugging tools to which they are accustomed, even when the error happens remotely
Collections
Dask collections are the main interaction point for users. They look like NumPy and pandas but generate dask graphs internally. If you are a dask user then you should start here.
User Interfaces¶
Dask supports several user interfaces:
 High Level
 Arrays: parallel Numpy
 Bags: parallel lists
 Dataframes: parallel Pandas
 Machine Learning : parallel ScikitLearn
 Others from external projects, like XArray
Each of these user interfaces employs the same underlying parallel computing machinery, and so has the same scaling, diagnostics, resilience, and so on, but each provides a different set of parallel algorithms and programming style.
This document helps you to decide which user interface best suits your needs, and gives some general information that applies to all of the interfaces. The pages linked to above give more information about each interface in greater depth.
High Level Collections¶
Many people who start using Dask are explicitly looking for a scalable version of Numpy, Pandas, or ScikitLearn. For these situations the starting point within Dask is usually fairly clear. If you want scalable Numpy, they start with Dask array; if you want scalable Pandas, they start with Dask dataframe, and so on.
These highlevel interfaces copy the standard interface with slight variation. These interfaces automatically parallelize over larger datasets for you for a large subset of the API from the original project.
# Arrays
import dask.array as da
x = da.random.uniform(low=0, high=10, size=(10000, 10000), # normal Numpy code
chunks=(1000, 1000)) # break into chunks of size 1000x1000
y = x + x.T  x.mean(axis=0) # Use normal syntax for high level algorithms
# Dataframes
import dask.dataframe as dd
df = dd.read_csv('2018**.csv', parse_dates='timestamp', # normal Pandas code
blocksize=64000000) # break text into 64MB chunks
s = df.groupby('name').balance.mean() # Use normal syntax for high level algorithms
# Bags / lists
import dask.bag as db
b = db.read_text('*.json').map(json.loads)
total = (b.filter(lambda d: d['name'] == 'Alice')
.map(lambda d: d['balance'])
.sum())
It is important to remember that while APIs may be similar some differences do exist. Additionally, the performance of some algorithms may differ from their inmemory counterparts due to the advantages and disadvantages of parallel programming. Some thought and attention is still required when using Dask.
Low Level Interfaces¶
Often when parallelizing existing code bases or building custom algorithms you run into code that is parallelizable, but isn’t just a big dataframe or array. Consider the forloopy code below:
results = []
for a in A:
for b in B:
if a < b:
c = f(a, b)
else:
c = g(a, b)
results.append(c)
There is potential parallelism in this code (the many calls to f
and g
can be done in parallel), but it’s not clear how to rewrite it into a big
array or dataframe so that it can use a higherlevel API. Even if you could
rewrite it into one of these paradigms, it’s not clear that this would be a
good idea. Much of the meaning would likely be lost in translation, and this
process would become much more difficult for more complex systems.
Instead, Dask’s lowerlevel APIs let you write parallel code one function call at a time within the context of your existing for loops. A common solution here is to use Dask delayed to wrap individual function calls into a lazily constructed task graph:
import dask
lazy_results = []
for a in A:
for b in B:
if a < b:
c = dask.delayed(f)(a, b) # add lazy task
else:
c = dask.delayed(g)(a, b) # add lazy task
lazy_results.append(c)
results = dask.compute(*lazy_results) # compute all in parallel
Combining High and Low Level¶
It is common to combine high and low level interfaces. For example you might use Dask array/bag/dataframe to load in data and do initial preprocessing, then switch to Dask delayed for a custom algorithm that is specific to your domain, then switch back to Dask array/dataframe to clean up and store results. Understanding both sets of user interfaces and how to switch between them can be a productive combination.
# Convert to a list of delayed Pandas dataframes
delayed_values = df.to_delayed()
# Manipulate delayed values arbitrarily as you like
# Convert many delayed Pandas dataframes back to a single Dask dataframe
df = dd.from_delayed(delayed_values)
Laziness and Computing¶
Most Dask user interfaces are lazy meaning that they do not evaluate until
you explicitly ask for a result using the compute
method:
# This array syntax doesn't cause computation
y = x + x.T  x.mean(axis=0)
# Trigger computation by explicitly calling the compute method
y = y.compute()
If you have multiple results that you want to compute at the same time, use the
dask.compute
function. This can share intermediate results and so be more
efficient:
# compute multiple results at the same time with the compute function
min, max = dask.compute(y.min(), y.max())
Note that the compute()
function returns inmemory results. It converts
Dask dataframes to Pandas dataframes, Dask arrays to Numpy arrays, and Dask
bags to lists. You should only call compute on results that will fit
comfortably in memory. If your result does not fit in memory then you might
consider writing it to disk instead.
# Write larger results out to disk rather than store them in memory
my_dask_dataframe.to_parquet('myfile.parquet')
my_dask_array.to_hdf5('myfile.hdf5')
my_dask_bag.to_textfiles('myfile.*.txt')
Persist into Distributed Memory¶
Alternatively, if you are on a cluster then you may want to trigger a
computation and store the results in distributed memory. In this case you do
not want to call compute
, which would create a single Pandas, Numpy, or
List result, but instead you want to call persist
, which returns a new Dask
object that points to actively computing, or already computed results spread
around your cluster’s memory.
# Compute returns an inmemory nonDask object
y = y.compute()
# Persist returns an inmemory Dask object that uses distributed storage if available
y = y.persist()
This is common to see after data loading an preprocessing steps, but before rapid iteration, exploration, or complex algorithms. For example we might read in a lot of data, filter down to a more manageable subset, and then persist data into memory so that we can iterate quickly.
import dask.dataframe as dd
df = dd.read_parquet('...')
df = df[df.name == 'Alice'] # select important subset of data
df = df.persist() # trigger computation in the background
# These are all relatively fast now that the relevant data is in memory
df.groupby(df.id).balance.sum().compute() # explore data quickly
df.groupby(df.id).balance.mean().compute() # explore data quickly
df.id.nunique() # explore data quickly
Lazy vs Immediate¶
As mentioned above, most Dask workloads are lazy, that is they don’t start any
work, until you explicitly trigger them with a call to compute()
.
However sometimes you do want to submit work as quickly as possible, track it
over time, submit new work or cancel work depending on partial results, and so
on. This can be useful when tracking or responding to realtime events,
handling streaming data, or when building complex and adaptive algorithms.
For these situations people typically turn to the futures interface which is a lowlevel interface like Dask delayed, but operates immediately rather than lazily.
Here is the same example with Dask delayed and Dask futures to illustrate the difference.
Delayed: Lazy¶
@dask.delayed
def inc(x):
return x + 1
@dask.delayed
def add(x, y):
return x + y
a = inc(1) # no work has happened yet
b = inc(2) # no work has happened yet
c = add(a, b) # no work has happened yet
c = c.compute() # This triggers all of the above computations
Futures: Immediate¶
from dask.distributed import Client
client = Client()
def inc(x):
return x + 1
def add(x, y):
return x + y
a = client.submit(inc, 1) # work starts immediately
b = client.submit(inc, 2) # work starts immediately
c = client.submit(add, a, b) # work starts immediately
c = c.result() # block until work finishes, then gather result
You can also trigger work with the highlevel collections using the
persist
function. This will cause work to happen in the background when
using the distributed scheduler.
Combining Interfaces¶
There are established ways to combine the interfaces above:
The highlevel interfaces (array, bag, dataframe) have a
to_delayed
method that can convert to a sequence (or grid) of Dask delayed objectsdelayeds = df.to_delayed()
The highlevel interfaces (array, bag, dataframe) have a
from_delayed
method that can convert from either Delayed or Future objectsdf = dd.from_delayed(delayeds) df = dd.from_delayed(futures)
The
Client.compute
method converts Delayed objects into Futures.futures = client.compute(delayeds)
The
dask.distributed.futures_of
function gathers futures from persisted collectionsfrom dask.distributed import futures_of df = df.persist() # start computation in the background futures = futures_of(df)
The Dask.delayed object converts Futures into delayed objects.
delayed_value = dask.delayed(future)
The approaches above should suffice to convert any interface into any other. We often see some antipatterns that do not work as well:
 Calling lowlevel APIs (delayed or futures) on highlevel objects (like
Dask arrays or dataframes) This downgrades those objects to their Numpy or
Pandas equivalents, which may not be desired.
Often people are looking for APIs like
dask.array.map_blocks
ordask.dataframe.map_partitions
instead.  Calling
compute()
on Future objects. Often people want the.result()
method instead.  Calling Numpy/Pandas functions on highlevel Dask objects or highlevel Dask functions on Numpy/Pandas objects
Conclusion¶
Most people who use Dask start with only one of the interfaces above but eventually learn how to use a few interfaces together. This helps them leverage the sophisticated algorithms in the highlevel interfaces while also working around tricky problems with the lowlevel interfaces.
For more information, see the documentation for the particular user interfaces below:
 High Level
 Arrays: parallel Numpy
 Bags: parallel lists
 Dataframes: parallel Pandas
 Machine Learning : parallel ScikitLearn
 Others from external projects, like XArray
Array¶
API¶
Top level user functions:
all (a[, axis, out, keepdims]) 
Test whether all array elements along a given axis evaluate to True. 
allclose (a, b[, rtol, atol, equal_nan]) 
Returns True if two arrays are elementwise equal within a tolerance. 
angle (x[, deg]) 
Return the angle of the complex argument. 
any (a[, axis, out, keepdims]) 
Test whether any array element along a given axis evaluates to True. 
apply_along_axis (func1d, axis, arr, *args, …) 
Apply a function to 1D slices along the given axis. 
apply_over_axes (func, a, axes) 
Apply a function repeatedly over multiple axes. 
arange (*args, **kwargs) 
Return evenly spaced values from start to stop with step size step. 
arccos (x[, out]) 
Trigonometric inverse cosine, elementwise. 
arccosh (x[, out]) 
Inverse hyperbolic cosine, elementwise. 
arcsin (x[, out]) 
Inverse sine, elementwise. 
arcsinh (x[, out]) 
Inverse hyperbolic sine elementwise. 
arctan (x[, out]) 
Trigonometric inverse tangent, elementwise. 
arctan2 (x1, x2[, out]) 
Elementwise arc tangent of x1/x2 choosing the quadrant correctly. 
arctanh (x[, out]) 
Inverse hyperbolic tangent elementwise. 
argmax (a[, axis, out]) 
Returns the indices of the maximum values along an axis. 
argmin (a[, axis, out]) 
Returns the indices of the minimum values along an axis. 
argtopk (a, k[, axis, split_every]) 
Extract the indices of the k largest elements from a on the given axis, and return them sorted from largest to smallest. 
argwhere (a) 
Find the indices of array elements that are nonzero, grouped by element. 
around (a[, decimals, out]) 
Evenly round to the given number of decimals. 
array (object[, dtype, copy, order, subok, ndmin]) 
Create an array. 
asanyarray (a) 
Convert the input to a dask array. 
asarray (a) 
Convert the input to a dask array. 
atleast_1d (*arys) 
Convert inputs to arrays with at least one dimension. 
atleast_2d (*arys) 
View inputs as arrays with at least two dimensions. 
atleast_3d (*arys) 
View inputs as arrays with at least three dimensions. 
bincount (x[, weights, minlength]) 
Count number of occurrences of each value in array of nonnegative ints. 
bitwise_and (x1, x2[, out]) 
Compute the bitwise AND of two arrays elementwise. 
bitwise_not (x[, out]) 
Compute bitwise inversion, or bitwise NOT, elementwise. 
bitwise_or (x1, x2[, out]) 
Compute the bitwise OR of two arrays elementwise. 
bitwise_xor (x1, x2[, out]) 
Compute the bitwise XOR of two arrays elementwise. 
block (arrays[, allow_unknown_chunksizes]) 
Assemble an ndarray from nested lists of blocks. 
broadcast_arrays (*args, **kwargs) 
Broadcast any number of arrays against each other. 
broadcast_to (x, shape[, chunks]) 
Broadcast an array to a new shape. 
coarsen (reduction, x, axes[, trim_excess]) 
Coarsen array by applying reduction to fixed size neighborhoods 
ceil (x[, out]) 
Return the ceiling of the input, elementwise. 
choose (a, choices[, out, mode]) 
Construct an array from an index array and a set of arrays to choose from. 
clip (*args, **kwargs) 
Clip (limit) the values in an array. 
compress (condition, a[, axis, out]) 
Return selected slices of an array along given axis. 
concatenate (seq[, axis, …]) 
Concatenate arrays along an existing axis 
conj (x[, out]) 
Return the complex conjugate, elementwise. 
copysign (x1, x2[, out]) 
Change the sign of x1 to that of x2, elementwise. 
corrcoef (x[, y, rowvar, bias, ddof]) 
Return Pearson productmoment correlation coefficients. 
cos (x[, out]) 
Cosine elementwise. 
cosh (x[, out]) 
Hyperbolic cosine, elementwise. 
count_nonzero (a) 
Counts the number of nonzero values in the array a . 
cov (m[, y, rowvar, bias, ddof, fweights, …]) 
Estimate a covariance matrix, given data and weights. 
cumprod (a[, axis, dtype, out]) 
Return the cumulative product of elements along a given axis. 
cumsum (a[, axis, dtype, out]) 
Return the cumulative sum of the elements along a given axis. 
deg2rad (x[, out]) 
Convert angles from degrees to radians. 
degrees (x[, out]) 
Convert angles from radians to degrees. 
diag (v[, k]) 
Extract a diagonal or construct a diagonal array. 
diff (a[, n, axis]) 
Calculate the nth discrete difference along given axis. 
digitize (x, bins[, right]) 
Return the indices of the bins to which each value in input array belongs. 
dot (a, b[, out]) 
Dot product of two arrays. 
dstack (tup) 
Stack arrays in sequence depth wise (along third axis). 
ediff1d (ary[, to_end, to_begin]) 
The differences between consecutive elements of an array. 
einsum (subscripts, *operands[, out, dtype, …]) 
Evaluates the Einstein summation convention on the operands. 
empty 
Blocked variant of empty 
empty_like (a[, dtype, chunks]) 
Return a new array with the same shape and type as a given array. 
exp (x[, out]) 
Calculate the exponential of all elements in the input array. 
expm1 (x[, out]) 
Calculate exp(x)  1 for all elements in the array. 
eye (N, chunks[, M, k, dtype]) 
Return a 2D Array with ones on the diagonal and zeros elsewhere. 
fabs (x[, out]) 
Compute the absolute values elementwise. 
fix (*args, **kwargs) 
Round to nearest integer towards zero. 
flatnonzero (a) 
Return indices that are nonzero in the flattened version of a. 
flip (m, axis) 
Reverse element order along axis. 
flipud (m) 
Flip array in the up/down direction. 
fliplr (m) 
Flip array in the left/right direction. 
floor (x[, out]) 
Return the floor of the input, elementwise. 
fmax (x1, x2[, out]) 
Elementwise maximum of array elements. 
fmin (x1, x2[, out]) 
Elementwise minimum of array elements. 
fmod (x1, x2[, out]) 
Return the elementwise remainder of division. 
frexp (x[, out1, out2]) 
Decompose the elements of x into mantissa and twos exponent. 
fromfunction (function, shape, **kwargs) 
Construct an array by executing a function over each coordinate. 
frompyfunc (func, nin, nout) 
Takes an arbitrary Python function and returns a Numpy ufunc. 
full 
Blocked variant of full 
full_like (a, fill_value[, dtype, chunks]) 
Return a full array with the same shape and type as a given array. 
gradient (f, *varargs, **kwargs) 
Return the gradient of an Ndimensional array. 
histogram (a[, bins, range, normed, weights, …]) 
Blocked variant of numpy.histogram() . 
hstack (tup) 
Stack arrays in sequence horizontally (column wise). 
hypot (x1, x2[, out]) 
Given the “legs” of a right triangle, return its hypotenuse. 
imag (*args, **kwargs) 
Return the imaginary part of the elements of the array. 
indices (dimensions[, dtype, chunks]) 
Implements NumPy’s indices for Dask Arrays. 
insert (arr, obj, values[, axis]) 
Insert values along the given axis before the given indices. 
isclose (a, b[, rtol, atol, equal_nan]) 
Returns a boolean array where two arrays are elementwise equal within a tolerance. 
iscomplex (*args, **kwargs) 
Returns a bool array, where True if input element is complex. 
isfinite (x[, out]) 
Test elementwise for finiteness (not infinity or not Not a Number). 
isin (element, test_elements[, …]) 

isinf (x[, out]) 
Test elementwise for positive or negative infinity. 
isnan (x[, out]) 
Test elementwise for NaN and return result as a boolean array. 
isnull (values) 
pandas.isnull for dask arrays 
isreal (*args, **kwargs) 
Returns a bool array, where True if input element is real. 
ldexp (x1, x2[, out]) 
Returns x1 * 2**x2, elementwise. 
linspace (start, stop[, num, chunks, dtype]) 
Return num evenly spaced values over the closed interval [start, stop]. 
log (x[, out]) 
Natural logarithm, elementwise. 
log10 (x[, out]) 
Return the base 10 logarithm of the input array, elementwise. 
log1p (x[, out]) 
Return the natural logarithm of one plus the input array, elementwise. 
log2 (x[, out]) 
Base2 logarithm of x. 
logaddexp (x1, x2[, out]) 
Logarithm of the sum of exponentiations of the inputs. 
logaddexp2 (x1, x2[, out]) 
Logarithm of the sum of exponentiations of the inputs in base2. 
logical_and (x1, x2[, out]) 
Compute the truth value of x1 AND x2 elementwise. 
logical_not (x[, out]) 
Compute the truth value of NOT x elementwise. 
logical_or (x1, x2[, out]) 
Compute the truth value of x1 OR x2 elementwise. 
logical_xor (x1, x2[, out]) 
Compute the truth value of x1 XOR x2, elementwise. 
map_blocks (func, *args, **kwargs) 
Map a function across all blocks of a dask array. 
map_overlap (x, func, depth[, boundary, trim]) 
Map a function over blocks of the array with some overlap 
matmul (a, b[, out]) 
Matrix product of two arrays. 
max (a[, axis, out, keepdims]) 
Return the maximum of an array or maximum along an axis. 
maximum (x1, x2[, out]) 
Elementwise maximum of array elements. 
mean (a[, axis, dtype, out, keepdims]) 
Compute the arithmetic mean along the specified axis. 
meshgrid (*xi, **kwargs) 
Return coordinate matrices from coordinate vectors. 
min (a[, axis, out, keepdims]) 
Return the minimum of an array or minimum along an axis. 
minimum (x1, x2[, out]) 
Elementwise minimum of array elements. 
modf (x[, out1, out2]) 
Return the fractional and integral parts of an array, elementwise. 
moment (a, order[, axis, dtype, keepdims, …]) 

nanargmax (x, axis, **kwargs) 

nanargmin (x, axis, **kwargs) 

nancumprod (a[, axis, dtype, out]) 
Return the cumulative product of array elements over a given axis treating Not a Numbers (NaNs) as one. 
nancumsum (a[, axis, dtype, out]) 
Return the cumulative sum of array elements over a given axis treating Not a Numbers (NaNs) as zero. 
nanmax (a[, axis, out, keepdims]) 
Return the maximum of an array or maximum along an axis, ignoring any NaNs. 
nanmean (a[, axis, dtype, out, keepdims]) 
Compute the arithmetic mean along the specified axis, ignoring NaNs. 
nanmin (a[, axis, out, keepdims]) 
Return minimum of an array or minimum along an axis, ignoring any NaNs. 
nanprod (a[, axis, dtype, out, keepdims]) 
Return the product of array elements over a given axis treating Not a Numbers (NaNs) as zero. 
nanstd (a[, axis, dtype, out, ddof, keepdims]) 
Compute the standard deviation along the specified axis, while ignoring NaNs. 
nansum (a[, axis, dtype, out, keepdims]) 
Return the sum of array elements over a given axis treating Not a Numbers (NaNs) as zero. 
nanvar (a[, axis, dtype, out, ddof, keepdims]) 
Compute the variance along the specified axis, while ignoring NaNs. 
nextafter (x1, x2[, out]) 
Return the next floatingpoint value after x1 towards x2, elementwise. 
nonzero (a) 
Return the indices of the elements that are nonzero. 
notnull (values) 
pandas.notnull for dask arrays 
ones 
Blocked variant of ones 
ones_like (a[, dtype, chunks]) 
Return an array of ones with the same shape and type as a given array. 
percentile (a, q[, interpolation]) 
Approximate percentile of 1D array 
piecewise (x, condlist, funclist, *args, **kw) 
Evaluate a piecewisedefined function. 
prod (a[, axis, dtype, out, keepdims]) 
Return the product of array elements over a given axis. 
ptp (a[, axis, out]) 
Range of values (maximum  minimum) along an axis. 
rad2deg (x[, out]) 
Convert angles from radians to degrees. 
radians (x[, out]) 
Convert angles from degrees to radians. 
ravel (a[, order]) 
Return a contiguous flattened array. 
real (*args, **kwargs) 
Return the real part of the elements of the array. 
rechunk (x, chunks[, threshold, block_size_limit]) 
Convert blocks in dask array x for new chunks. 
repeat (a, repeats[, axis]) 
Repeat elements of an array. 
reshape (x, shape) 
Reshape array to new shape 
result_type (*arrays_and_dtypes) 
Returns the type that results from applying the NumPy type promotion rules to the arguments. 
rint (x[, out]) 
Round elements of the array to the nearest integer. 
roll (a, shift[, axis]) 
Roll array elements along a given axis. 
round (a[, decimals, out]) 
Round an array to the given number of decimals. 
sign (x[, out]) 
Returns an elementwise indication of the sign of a number. 
signbit (x[, out]) 
Returns elementwise True where signbit is set (less than zero). 
sin (x[, out]) 
Trigonometric sine, elementwise. 
sinh (x[, out]) 
Hyperbolic sine, elementwise. 
sqrt (x[, out]) 
Return the positive squareroot of an array, elementwise. 
square (x[, out]) 
Return the elementwise square of the input. 
squeeze (a[, axis]) 
Remove singledimensional entries from the shape of an array. 
stack (seq[, axis]) 
Stack arrays along a new axis 
std (a[, axis, dtype, out, ddof, keepdims]) 
Compute the standard deviation along the specified axis. 
sum (a[, axis, dtype, out, keepdims]) 
Sum of array elements over a given axis. 
take (a, indices[, axis, out, mode]) 
Take elements from an array along an axis. 
tan (x[, out]) 
Compute tangent elementwise. 
tanh (x[, out]) 
Compute hyperbolic tangent elementwise. 
tensordot (a, b[, axes]) 
Compute tensor dot product along specified axes for arrays >= 1D. 
tile (A, reps) 
Construct an array by repeating A the number of times given by reps. 
topk (a, k[, axis, split_every]) 
Extract the k largest elements from a on the given axis, and return them sorted from largest to smallest. 
transpose (a[, axes]) 
Permute the dimensions of an array. 
tril (m[, k]) 
Lower triangle of an array with elements above the kth diagonal zeroed. 
triu (m[, k]) 
Upper triangle of an array with elements above the kth diagonal zeroed. 
trunc (x[, out]) 
Return the truncated value of the input, elementwise. 
unique (ar[, return_index, return_inverse, …]) 
Find the unique elements of an array. 
var (a[, axis, dtype, out, ddof, keepdims]) 
Compute the variance along the specified axis. 
vdot (a, b) 
Return the dot product of two vectors. 
vnorm (a[, ord, axis, dtype, keepdims, …]) 
Vector norm 
vstack (tup) 
Stack arrays in sequence vertically (row wise). 
where (condition, [x, y]) 
Return elements, either from x or y, depending on condition. 
zeros 
Blocked variant of zeros 
zeros_like (a[, dtype, chunks]) 
Return an array of zeros with the same shape and type as a given array. 
Fast Fourier Transforms¶
fft.fft_wrap (fft_func[, kind, dtype]) 
Wrap 1D, 2D, and ND real and complex FFT functions 
fft.fft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.fft 
fft.fft2 (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.fft2 
fft.fftn (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.fftn 
fft.ifft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.ifft 
fft.ifft2 (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.ifft2 
fft.ifftn (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.ifftn 
fft.rfft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.rfft 
fft.rfft2 (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.rfft2 
fft.rfftn (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.rfftn 
fft.irfft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.irfft 
fft.irfft2 (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.irfft2 
fft.irfftn (a[, s, axes]) 
Wrapping of numpy.fft.fftpack.irfftn 
fft.hfft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.hfft 
fft.ihfft (a[, n, axis]) 
Wrapping of numpy.fft.fftpack.ihfft 
fft.fftfreq (n[, d]) 
Return the Discrete Fourier Transform sample frequencies. 
fft.rfftfreq (n[, d]) 
Return the Discrete Fourier Transform sample frequencies (for usage with rfft, irfft). 
fft.fftshift (x[, axes]) 
Shift the zerofrequency component to the center of the spectrum. 
fft.ifftshift (x[, axes]) 
The inverse of fftshift. 
Linear Algebra¶
linalg.cholesky (a[, lower]) 
Returns the Cholesky decomposition, \(A = L L^*\) or \(A = U^* U\) of a Hermitian positivedefinite matrix A. 
linalg.inv (a) 
Compute the inverse of a matrix with LU decomposition and forward / backward substitutions. 
linalg.lstsq (a, b) 
Return the leastsquares solution to a linear matrix equation using QR decomposition. 
linalg.lu (a) 
Compute the lu decomposition of a matrix. 
linalg.norm (x[, ord, axis, keepdims]) 
Matrix or vector norm. 
linalg.qr (a[, name]) 
Compute the qr factorization of a matrix. 
linalg.solve (a, b[, sym_pos]) 
Solve the equation a x = b for x . 
linalg.solve_triangular (a, b[, lower]) 
Solve the equation a x = b for x, assuming a is a triangular matrix. 
linalg.svd (a[, name]) 
Compute the singular value decomposition of a matrix. 
linalg.svd_compressed (a, k[, n_power_iter, …]) 
Randomly compressed rankk thin Singular Value Decomposition. 
linalg.tsqr (data[, name, compute_svd]) 
Direct TallandSkinny QR algorithm 
Masked Arrays¶
ma.filled 

ma.fix_invalid 

ma.getdata 

ma.getmaskarray 

ma.masked_array 

ma.masked_equal 

ma.masked_greater 

ma.masked_greater_equal 

ma.masked_inside 

ma.masked_invalid 

ma.masked_less 

ma.masked_less_equal 

ma.masked_not_equal 

ma.masked_outside 

ma.masked_values 

ma.masked_where 

ma.set_fill_value 
Random¶
random.beta (a, b[, size]) 
Draw samples from a Beta distribution. 
random.binomial (n, p[, size]) 
Draw samples from a binomial distribution. 
random.chisquare (df[, size]) 
Draw samples from a chisquare distribution. 
random.choice (a[, size, replace, p]) 
Generates a random sample from a given 1D array 
random.exponential ([scale, size]) 
Draw samples from an exponential distribution. 
random.f (dfnum, dfden[, size]) 
Draw samples from an F distribution. 
random.gamma (shape[, scale, size]) 
Draw samples from a Gamma distribution. 
random.geometric (p[, size]) 
Draw samples from the geometric distribution. 
random.gumbel ([loc, scale, size]) 
Draw samples from a Gumbel distribution. 
random.hypergeometric (ngood, nbad, nsample) 
Draw samples from a Hypergeometric distribution. 
random.laplace ([loc, scale, size]) 
Draw samples from the Laplace or double exponential distribution with specified location (or mean) and scale (decay). 
random.logistic ([loc, scale, size]) 
Draw samples from a logistic distribution. 
random.lognormal ([mean, sigma, size]) 
Draw samples from a lognormal distribution. 
random.logseries (p[, size]) 
Draw samples from a logarithmic series distribution. 
random.negative_binomial (n, p[, size]) 
Draw samples from a negative binomial distribution. 
random.noncentral_chisquare (df, nonc[, size]) 
Draw samples from a noncentral chisquare distribution. 
random.noncentral_f (dfnum, dfden, nonc[, size]) 
Draw samples from the noncentral F distribution. 
random.normal ([loc, scale, size]) 
Draw random samples from a normal (Gaussian) distribution. 
random.pareto (a[, size]) 
Draw samples from a Pareto II or Lomax distribution with specified shape. 
random.poisson ([lam, size]) 
Draw samples from a Poisson distribution. 
random.power (a[, size]) 
Draws samples in [0, 1] from a power distribution with positive exponent a  1. 
random.random ([size]) 
Return random floats in the halfopen interval [0.0, 1.0). 
random.random_sample ([size]) 
Return random floats in the halfopen interval [0.0, 1.0). 
random.rayleigh ([scale, size]) 
Draw samples from a Rayleigh distribution. 
random.standard_cauchy ([size]) 
Draw samples from a standard Cauchy distribution with mode = 0. 
random.standard_exponential ([size]) 
Draw samples from the standard exponential distribution. 
random.standard_gamma (shape[, size]) 
Draw samples from a standard Gamma distribution. 
random.standard_normal ([size]) 
Draw samples from a standard Normal distribution (mean=0, stdev=1). 
random.standard_t (df[, size]) 
Draw samples from a standard Student’s t distribution with df degrees of freedom. 
random.triangular (left, mode, right[, size]) 
Draw samples from the triangular distribution. 
random.uniform ([low, high, size]) 
Draw samples from a uniform distribution. 
random.vonmises (mu, kappa[, size]) 
Draw samples from a von Mises distribution. 
random.wald (mean, scale[, size]) 
Draw samples from a Wald, or inverse Gaussian, distribution. 
random.weibull (a[, size]) 
Draw samples from a Weibull distribution. 
random.zipf (a[, size]) 
Standard distributions 
Stats¶
stats.ttest_ind (a, b[, axis, equal_var]) 
Calculates the Ttest for the means of TWO INDEPENDENT samples of scores. 
stats.ttest_1samp (a, popmean[, axis, nan_policy]) 
Calculates the Ttest for the mean of ONE group of scores. 
stats.ttest_rel (a, b[, axis, nan_policy]) 
Calculates the Ttest on TWO RELATED samples of scores, a and b. 
stats.chisquare (f_obs[, f_exp, ddof, axis]) 
Calculates a oneway chi square test. 
stats.power_divergence (f_obs[, f_exp, ddof, …]) 
CressieRead power divergence statistic and goodness of fit test. 
stats.skew (a[, axis, bias, nan_policy]) 
Computes the skewness of a data set. 
stats.skewtest (a[, axis, nan_policy]) 
Tests whether the skew is different from the normal distribution. 
stats.kurtosis (a[, axis, fisher, bias, …]) 
Computes the kurtosis (Fisher or Pearson) of a dataset. 
stats.kurtosistest (a[, axis, nan_policy]) 
Tests whether a dataset has normal kurtosis 
stats.normaltest (a[, axis, nan_policy]) 
Tests whether a sample differs from a normal distribution. 
stats.f_oneway (*args) 
Performs a 1way ANOVA. 
stats.moment (a[, moment, axis, nan_policy]) 
Calculates the nth moment about the mean for a sample. 
Image Support¶
image.imread (filename[, imread, preprocess]) 
Read a stack of images into a dask array 
Slightly Overlapping Ghost Computations¶
ghost.ghost (x, depth, boundary) 
Share boundaries between neighboring blocks 
ghost.map_overlap (x, func, depth[, …]) 
Map a function over blocks of the array with some overlap 
Create and Store Arrays¶
from_array (x, chunks[, name, lock, asarray, …]) 
Create dask array from something that looks like an array 
from_delayed (value, shape, dtype[, name]) 
Create a dask array from a dask delayed value 
from_npy_stack (dirname[, mmap_mode]) 
Load dask array from stack of npy files 
store (sources, targets[, lock, regions, …]) 
Store dask arrays in arraylike objects, overwrite data in target 
to_hdf5 (filename, *args, **kwargs) 
Store arrays in HDF5 file 
to_npy_stack (dirname, x[, axis]) 
Write dask array to a stack of .npy files 
Generalized Ufuncs¶
apply_gufunc 

as_gufunc 

gufunc 
Internal functions¶
atop (func, out_ind, *args, **kwargs) 
Tensor operation: Generalized inner and outer products 
top (func, output, out_indices, …) 
Tensor operation 
Other functions¶

dask.array.
from_array
(x, chunks, name=None, lock=False, asarray=True, fancy=True, getitem=None)¶ Create dask array from something that looks like an array
Input must have a
.shape
and support numpystyle slicing.Parameters:  x : array_like
 chunks : int, tuple
How to chunk the array. Must be one of the following forms:  A blocksize like 1000.  A blockshape like (1000, 1000).  Explicit sizes of all blocks along all dimensions like
((1000, 1000, 500), (400, 400)).
1 as a blocksize indicates the size of the corresponding dimension.
 name : str, optional
The key name to use for the array. Defaults to a hash of
x
. Usename=False
to generate a random name instead of hashing (fast) lock : bool or Lock, optional
If
x
doesn’t support concurrent reads then provide a lock here, or pass in True to have dask.array create one for you. asarray : bool, optional
If True (default), then chunks will be converted to instances of
ndarray
. Set to False to pass passed chunks through unchanged. fancy : bool, optional
If
x
doesn’t support fancy indexing (e.g. indexing with lists or arrays) then set to False. Default is True.
Examples
>>> x = h5py.File('...')['/data/path'] >>> a = da.from_array(x, chunks=(1000, 1000))
If your underlying datastore does not support concurrent reads then include the
lock=True
keyword argument orlock=mylock
if you want multiple arrays to coordinate around the same lock.>>> a = da.from_array(x, chunks=(1000, 1000), lock=True)

dask.array.
from_delayed
(value, shape, dtype, name=None)¶ Create a dask array from a dask delayed value
This routine is useful for constructing dask arrays in an adhoc fashion using dask delayed, particularly when combined with stack and concatenate.
The dask array will consist of a single chunk.
Examples
>>> from dask import delayed >>> value = delayed(np.ones)(5) >>> array = from_delayed(value, (5,), float) >>> array dask.array<fromvalue, shape=(5,), dtype=float64, chunksize=(5,)> >>> array.compute() array([1., 1., 1., 1., 1.])

dask.array.
store
(sources, targets, lock=True, regions=None, compute=True, return_stored=False, **kwargs)¶ Store dask arrays in arraylike objects, overwrite data in target
This stores dask arrays into object that supports numpystyle setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.
If your data fits in memory then you may prefer calling
np.array(myarray)
instead.Parameters:  sources: Array or iterable of Arrays
 targets: arraylike or Delayed or iterable of arraylikes and/or Delayeds
These should support setitem syntax
target[10:20] = ...
 lock: boolean or threading.Lock, optional
Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular
threading.Lock
object to be shared among all writes. regions: tuple of slices or iterable of tuple of slices
Each
region
tuple inregions
should be such thattarget[region].shape = source.shape
for the corresponding source and target in sources and targets, respectively. compute: boolean, optional
If true compute immediately, return
dask.delayed.Delayed
otherwise return_stored: boolean, optional
Optionally return the stored result (default False).
Examples
>>> x = ...
>>> import h5py >>> f = h5py.File('myfile.hdf5') >>> dset = f.create_dataset('/data', shape=x.shape, ... chunks=x.chunks, ... dtype='f8')
>>> store(x, dset)
Alternatively store many arrays at the same time
>>> store([x, y, z], [dset1, dset2, dset3])

dask.array.
coarsen
(reduction, x, axes, trim_excess=False)¶ Coarsen array by applying reduction to fixed size neighborhoods
Parameters:  reduction: function
Function like np.sum, np.mean, etc…
 x: np.ndarray
Array to be coarsened
 axes: dict
Mapping of axis to coarsening factor
Examples
>>> x = np.array([1, 2, 3, 4, 5, 6]) >>> coarsen(np.sum, x, {0: 2}) array([ 3, 7, 11]) >>> coarsen(np.max, x, {0: 3}) array([3, 6])
Provide dictionary of scale per dimension
>>> x = np.arange(24).reshape((4, 6)) >>> x array([[ 0, 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]])
>>> coarsen(np.min, x, {0: 2, 1: 3}) array([[ 0, 3], [12, 15]])
You must avoid excess elements explicitly
>>> x = np.array([1, 2, 3, 4, 5, 6, 7, 8]) >>> coarsen(np.min, x, {0: 3}, trim_excess=True) array([1, 4])

dask.array.
stack
(seq, axis=0)¶ Stack arrays along a new axis
Given a sequence of dask Arrays form a new dask Array by stacking them along a new dimension (axis=0 by default)
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.stack(data, axis=0) >>> x.shape (3, 4, 4)
>>> da.stack(data, axis=1).shape (4, 3, 4)
>>> da.stack(data, axis=1).shape (4, 4, 3)
Result is a new dask Array

dask.array.
concatenate
(seq, axis=0, allow_unknown_chunksizes=False)¶ Concatenate arrays along an existing axis
Given a sequence of dask Arrays form a new dask Array by stacking them along an existing dimension (axis=0 by default)
Parameters:  seq: list of dask.arrays
 axis: int
Dimension along which to align all of the arrays
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.concatenate(data, axis=0) >>> x.shape (12, 4)
>>> da.concatenate(data, axis=1).shape (4, 12)
Result is a new dask Array

dask.array.
all
(a, axis=None, out=None, keepdims=False)¶ Test whether all array elements along a given axis evaluate to True.
Parameters:  a : array_like
Input array or object that can be converted to an array.
 axis : None or int or tuple of ints, optional
Axis or axes along which a logical AND reduction is performed. The default (axis = None) is to perform a logical AND over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.7.0.
If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternate output array in which to place the result. It must have the same shape as the expected output and its type is preserved (e.g., if
dtype(out)
is float, the result will consist of 0.0’s and 1.0’s). See doc.ufuncs (Section “Output arguments”) for more details. keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  all : ndarray, bool
A new boolean or array is returned unless out is specified, in which case a reference to out is returned.
See also
ndarray.all
 equivalent method
any
 Test whether any element along a given axis evaluates to True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> np.all([[True,False],[True,True]]) False
>>> np.all([[True,False],[True,True]], axis=0) array([ True, False], dtype=bool)
>>> np.all([1, 4, 5]) True
>>> np.all([1.0, np.nan]) True
>>> o=np.array([False]) >>> z=np.all([1, 4, 5], out=o) >>> id(z), id(o), z (28293632, 28293632, array([ True], dtype=bool))

dask.array.
allclose
(a, b, rtol=1e05, atol=1e08, equal_nan=False)¶ Returns True if two arrays are elementwise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.
If either array contains one or more NaNs, False is returned. Infs are treated as equal if they are in the same place and of the same sign in both arrays.
Parameters:  a, b : array_like
Input arrays to compare.
 rtol : float
The relative tolerance parameter (see Notes).
 atol : float
The absolute tolerance parameter (see Notes).
 equal_nan : bool
Whether to compare NaN’s as equal. If True, NaN’s in a will be considered equal to NaN’s in b in the output array.
New in version 1.10.0.
Returns:  allclose : bool
Returns True if the two arrays are equal within the given tolerance; False otherwise.
Notes
If the following equation is elementwise True, then allclose returns True.
absolute(a  b) <= (atol + rtol * absolute(b))The above equation is not symmetric in a and b, so that allclose(a, b) might be different from allclose(b, a) in some rare cases.
Examples
>>> np.allclose([1e10,1e7], [1.00001e10,1e8]) False >>> np.allclose([1e10,1e8], [1.00001e10,1e9]) True >>> np.allclose([1e10,1e8], [1.0001e10,1e9]) False >>> np.allclose([1.0, np.nan], [1.0, np.nan]) False >>> np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True) True

dask.array.
angle
(x, deg=0)¶ Return the angle of the complex argument.
Parameters:  z : array_like
A complex number or sequence of complex numbers.
 deg : bool, optional
Return angle in degrees if True, radians if False (default).
Returns:  angle : ndarray or scalar
The counterclockwise angle from the positive real axis on the complex plane, with dtype as numpy.float64.
See also
arctan2
,absolute
Examples
>>> np.angle([1.0, 1.0j, 1+1j]) # in radians array([ 0. , 1.57079633, 0.78539816]) >>> np.angle(1+1j, deg=True) # in degrees 45.0

dask.array.
any
(a, axis=None, out=None, keepdims=False)¶ Test whether any array element along a given axis evaluates to True.
Returns single boolean unless axis is not
None
Parameters:  a : array_like
Input array or object that can be converted to an array.
 axis : None or int or tuple of ints, optional
Axis or axes along which a logical OR reduction is performed. The default (axis = None) is to perform a logical OR over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.7.0.
If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternate output array in which to place the result. It must have the same shape as the expected output and its type is preserved (e.g., if it is of type float, then it will remain so, returning 1.0 for True and 0.0 for False, regardless of the type of a). See doc.ufuncs (Section “Output arguments”) for details.
 keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  any : bool or ndarray
A new boolean or ndarray is returned unless out is specified, in which case a reference to out is returned.
See also
ndarray.any
 equivalent method
all
 Test whether all elements along a given axis evaluate to True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> np.any([[True, False], [True, True]]) True
>>> np.any([[True, False], [False, False]], axis=0) array([ True, False], dtype=bool)
>>> np.any([1, 0, 5]) True
>>> np.any(np.nan) True
>>> o=np.array([False]) >>> z=np.any([1, 4, 5], out=o) >>> z, o (array([ True], dtype=bool), array([ True], dtype=bool)) >>> # Check now that z is a reference to o >>> z is o True >>> id(z), id(o) # identity of z and o (191614240, 191614240)

dask.array.
apply_along_axis
(func1d, axis, arr, *args, **kwargs)¶ Apply a function to 1D slices along the given axis.
Execute func1d(a, *args) where func1d operates on 1D arrays and a is a 1D slice of arr along axis.
Parameters:  func1d : function
This function should accept 1D arrays. It is applied to 1D slices of arr along the specified axis.
 axis : integer
Axis along which arr is sliced.
 arr : ndarray
Input array.
 args : any
Additional arguments to func1d.
 kwargs: any
Additional named arguments to func1d.
New in version 1.9.0.
Returns:  apply_along_axis : ndarray
The output array. The shape of outarr is identical to the shape of arr, except along the axis dimension, where the length of outarr is equal to the size of the return value of func1d. If func1d returns a scalar outarr will have one fewer dimensions than arr.
See also
apply_over_axes
 Apply a function repeatedly over multiple axes.
Examples
>>> def my_func(a): ... """Average first and last element of a 1D array""" ... return (a[0] + a[1]) * 0.5 >>> b = np.array([[1,2,3], [4,5,6], [7,8,9]]) >>> np.apply_along_axis(my_func, 0, b) array([ 4., 5., 6.]) >>> np.apply_along_axis(my_func, 1, b) array([ 2., 5., 8.])
For a function that doesn’t return a scalar, the number of dimensions in outarr is the same as arr.
>>> b = np.array([[8,1,7], [4,3,9], [5,2,6]]) >>> np.apply_along_axis(sorted, 1, b) array([[1, 7, 8], [3, 4, 9], [2, 5, 6]])

dask.array.
apply_over_axes
(func, a, axes)¶ Apply a function repeatedly over multiple axes.
func is called as res = func(a, axis), where axis is the first element of axes. The result res of the function call must have either the same dimensions as a or one less dimension. If res has one less dimension than a, a dimension is inserted before axis. The call to func is then repeated for each axis in axes, with res as the first argument.
Parameters:  func : function
This function must take two arguments, func(a, axis).
 a : array_like
Input array.
 axes : array_like
Axes over which func is applied; the elements must be integers.
Returns:  apply_over_axis : ndarray
The output array. The number of dimensions is the same as a, but the shape can be different. This depends on whether func changes the shape of its output with respect to its input.
See also
apply_along_axis
 Apply a function to 1D slices of an array along the given axis.
Notes
This function is equivalent to tuple axis arguments to reorderable ufuncs with keepdims=True. Tuple axis arguments to ufuncs have been availabe since version 1.7.0.
Examples
>>> a = np.arange(24).reshape(2,3,4) >>> a array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
Sum over axes 0 and 2. The result has same number of dimensions as the original array:
>>> np.apply_over_axes(np.sum, a, [0,2]) array([[[ 60], [ 92], [124]]])
Tuple axis arguments to ufuncs are equivalent:
>>> np.sum(a, axis=(0,2), keepdims=True) array([[[ 60], [ 92], [124]]])

dask.array.
arange
(*args, **kwargs)¶ Return evenly spaced values from start to stop with step size step.
The values are halfopen [start, stop), so including start and excluding stop. This is basically the same as python’s range function but for dask arrays.
When using a noninteger step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.
Parameters:  start : int, optional
The starting value of the sequence. The default is 0.
 stop : int
The end of the interval, this value is excluded from the interval.
 step : int, optional
The spacing between the values. The default is 1 when not specified. The last value of the sequence.
 chunks : int
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  samples : dask array
See also

dask.array.
arccos
(x[, out])¶ Trigonometric inverse cosine, elementwise.
The inverse of cos so that, if
y = cos(x)
, thenx = arccos(y)
.Parameters:  x : array_like
xcoordinate on the unit circle. For real arguments, the domain is [1, 1].
 out : ndarray, optional
Array of the same shape as a, to store results in. See doc.ufuncs (Section “Output arguments”) for more details.
Returns:  angle : ndarray
The angle of the ray intersecting the unit circle at the given xcoordinate in radians [0, pi]. If x is a scalar then a scalar is returned, otherwise an array of the same shape as x is returned.
Notes
arccos is a multivalued function: for each x there are infinitely many numbers z such that cos(z) = x. The convention is to return the angle z whose real part lies in [0, pi].
For realvalued input data types, arccos always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arccos is a complex analytic function that has branch cuts [inf, 1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse cos is also known as acos or cos^1.
References
M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
Examples
We expect the arccos of 1 to be 0, and of 1 to be pi:
>>> np.arccos([1, 1]) array([ 0. , 3.14159265])
Plot arccos:
>>> import matplotlib.pyplot as plt >>> x = np.linspace(1, 1, num=100) >>> plt.plot(x, np.arccos(x)) >>> plt.axis('tight') >>> plt.show()

dask.array.
arccosh
(x[, out])¶ Inverse hyperbolic cosine, elementwise.
Parameters:  x : array_like
Input array.
 out : ndarray, optional
Array of the same shape as x, to store results in. See doc.ufuncs (Section “Output arguments”) for details.
Returns:  arccosh : ndarray
Array of the same shape as x.
Notes
arccosh is a multivalued function: for each x there are infinitely many numbers z such that cosh(z) = x. The convention is to return the z whose imaginary part lies in [pi, pi] and the real part in
[0, inf]
.For realvalued input data types, arccosh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arccosh is a complex analytical function that has a branch cut [inf, 1] and is continuous from above on it.
References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, http://en.wikipedia.org/wiki/Arccosh Examples
>>> np.arccosh([np.e, 10.0]) array([ 1.65745445, 2.99322285]) >>> np.arccosh(1) 0.0

dask.array.
arcsin
(x[, out])¶ Inverse sine, elementwise.
Parameters:  x : array_like
ycoordinate on the unit circle.
 out : ndarray, optional
Array of the same shape as x, in which to store the results. See doc.ufuncs (Section “Output arguments”) for more details.
Returns:  angle : ndarray
The inverse sine of each element in x, in radians and in the closed interval
[pi/2, pi/2]
. If x is a scalar, a scalar is returned, otherwise an array.
Notes
arcsin is a multivalued function: for each x there are infinitely many numbers z such that \(sin(z) = x\). The convention is to return the angle z whose real part lies in [pi/2, pi/2].
For realvalued input data types, arcsin always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arcsin is a complex analytic function that has, by convention, the branch cuts [inf, 1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse sine is also known as asin or sin^{1}.
References
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79ff. http://www.math.sfu.ca/~cbm/aands/
Examples
>>> np.arcsin(1) # pi/2 1.5707963267948966 >>> np.arcsin(1) # pi/2 1.5707963267948966 >>> np.arcsin(0) 0.0

dask.array.
arcsinh
(x[, out])¶ Inverse hyperbolic sine elementwise.
Parameters:  x : array_like
Input array.
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  out : ndarray
Array of of the same shape as x.
Notes
arcsinh is a multivalued function: for each x there are infinitely many numbers z such that sinh(z) = x. The convention is to return the z whose imaginary part lies in [pi/2, pi/2].
For realvalued input data types, arcsinh always returns real output. For each value that cannot be expressed as a real number or infinity, it returns
nan
and sets the invalid floating point error flag.For complexvalued input, arccos is a complex analytical function that has branch cuts [1j, infj] and [1j, infj] and is continuous from the right on the former and from the left on the latter.
The inverse hyperbolic sine is also known as asinh or
sinh^1
.References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, http://en.wikipedia.org/wiki/Arcsinh Examples
>>> np.arcsinh(np.array([np.e, 10.0])) array([ 1.72538256, 2.99822295])

dask.array.
arctan
(x[, out])¶ Trigonometric inverse tangent, elementwise.
The inverse of tan, so that if
y = tan(x)
thenx = arctan(y)
.Parameters:  x : array_like
Input values. arctan is applied to each element of x.
Returns:  out : ndarray
Out has the same shape as x. Its real part is in
[pi/2, pi/2]
(arctan(+/inf)
returns+/pi/2
). It is a scalar if x is a scalar.
See also
Notes
arctan is a multivalued function: for each x there are infinitely many numbers z such that tan(z) = x. The convention is to return the angle z whose real part lies in [pi/2, pi/2].
For realvalued input data types, arctan always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arctan is a complex analytic function that has [1j, infj] and [1j, infj] as branch cuts, and is continuous from the left on the former and from the right on the latter.
The inverse tangent is also known as atan or tan^{1}.
References
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
Examples
We expect the arctan of 0 to be 0, and of 1 to be pi/4:
>>> np.arctan([0, 1]) array([ 0. , 0.78539816])
>>> np.pi/4 0.78539816339744828
Plot arctan:
>>> import matplotlib.pyplot as plt >>> x = np.linspace(10, 10) >>> plt.plot(x, np.arctan(x)) >>> plt.axis('tight') >>> plt.show()

dask.array.
arctan2
(x1, x2[, out])¶ Elementwise arc tangent of
x1/x2
choosing the quadrant correctly.The quadrant (i.e., branch) is chosen so that
arctan2(x1, x2)
is the signed angle in radians between the ray ending at the origin and passing through the point (1,0), and the ray ending at the origin and passing through the point (x2, x1). (Note the role reversal: the “ycoordinate” is the first function parameter, the “xcoordinate” is the second.) By IEEE convention, this function is defined for x2 = +/0 and for either or both of x1 and x2 = +/inf (see Notes for specific values).This function is not defined for complexvalued arguments; for the socalled argument of complex values, use angle.
Parameters:  x1 : array_like, realvalued
ycoordinates.
 x2 : array_like, realvalued
xcoordinates. x2 must be broadcastable to match the shape of x1 or vice versa.
Returns:  angle : ndarray
Array of angles in radians, in the range
[pi, pi]
.
Notes
arctan2 is identical to the atan2 function of the underlying C library. The following special values are defined in the C standard: [1]
x1 x2 arctan2(x1,x2) +/ 0 +0 +/ 0 +/ 0 0 +/ pi > 0 +/inf +0 / +pi < 0 +/inf 0 / pi +/inf +inf +/ (pi/4) +/inf inf +/ (3*pi/4) Note that +0 and 0 are distinct floating point numbers, as are +inf and inf.
References
[1] (1, 2) ISO/IEC standard 9899:1999, “Programming language C.” Examples
Consider four points in different quadrants:
>>> x = np.array([1, +1, +1, 1]) >>> y = np.array([1, 1, +1, +1]) >>> np.arctan2(y, x) * 180 / np.pi array([135., 45., 45., 135.])
Note the order of the parameters. arctan2 is defined also when x2 = 0 and at several other special points, obtaining values in the range
[pi, pi]
:>>> np.arctan2([1., 1.], [0., 0.]) array([ 1.57079633, 1.57079633]) >>> np.arctan2([0., 0., np.inf], [+0., 0., np.inf]) array([ 0. , 3.14159265, 0.78539816])

dask.array.
arctanh
(x[, out])¶ Inverse hyperbolic tangent elementwise.
Parameters:  x : array_like
Input array.
Returns:  out : ndarray
Array of the same shape as x.
See also
emath.arctanh
Notes
arctanh is a multivalued function: for each x there are infinitely many numbers z such that tanh(z) = x. The convention is to return the z whose imaginary part lies in [pi/2, pi/2].
For realvalued input data types, arctanh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arctanh is a complex analytical function that has branch cuts [1, inf] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse hyperbolic tangent is also known as atanh or
tanh^1
.References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, http://en.wikipedia.org/wiki/Arctanh Examples
>>> np.arctanh([0, 0.5]) array([ 0. , 0.54930614])

dask.array.
argmax
(a, axis=None, out=None)¶ Returns the indices of the maximum values along an axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
 out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
Returns:  index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
See also
ndarray.argmax
,argmin
amax
 The maximum value along a given axis.
unravel_index
 Convert a flat index into an index tuple.
Notes
In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
Examples
>>> a = np.arange(6).reshape(2,3) >>> a array([[0, 1, 2], [3, 4, 5]]) >>> np.argmax(a) 5 >>> np.argmax(a, axis=0) array([1, 1, 1]) >>> np.argmax(a, axis=1) array([2, 2])
>>> b = np.arange(6) >>> b[1] = 5 >>> b array([0, 5, 2, 3, 4, 5]) >>> np.argmax(b) # Only the first occurrence is returned. 1

dask.array.
argmin
(a, axis=None, out=None)¶ Returns the indices of the minimum values along an axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
 out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
Returns:  index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
See also
ndarray.argmin
,argmax
amin
 The minimum value along a given axis.
unravel_index
 Convert a flat index into an index tuple.
Notes
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
Examples
>>> a = np.arange(6).reshape(2,3) >>> a array([[0, 1, 2], [3, 4, 5]]) >>> np.argmin(a) 0 >>> np.argmin(a, axis=0) array([0, 0, 0]) >>> np.argmin(a, axis=1) array([0, 0])
>>> b = np.arange(6) >>> b[4] = 0 >>> b array([0, 1, 2, 3, 0, 5]) >>> np.argmin(b) # Only the first occurrence is returned. 0

dask.array.
argtopk
(a, k, axis=1, split_every=None)¶ Extract the indices of the k largest elements from a on the given axis, and return them sorted from largest to smallest. If k is negative, extract the indices of the k smallest elements instead, and return them sorted from smallest to largest.
This assumes that
k
is small. All results will be returned in a single chunk along the given axis.Examples
>>> import dask.array as da >>> x = np.array([5, 1, 3, 6]) >>> d = da.from_array(x, chunks=2) >>> d.argtopk(2).compute() array([3, 0]) >>> d.argtopk(2).compute() array([1, 2])

dask.array.
argwhere
(a)¶ Find the indices of array elements that are nonzero, grouped by element.
Parameters:  a : array_like
Input data.
Returns:  index_array : ndarray
Indices of elements that are nonzero. Indices are grouped by element.
Notes
np.argwhere(a)
is the same asnp.transpose(np.nonzero(a))
.The output of
argwhere
is not suitable for indexing arrays. For this purpose usewhere(a)
instead.Examples
>>> x = np.arange(6).reshape(2,3) >>> x array([[0, 1, 2], [3, 4, 5]]) >>> np.argwhere(x>1) array([[0, 2], [1, 0], [1, 1], [1, 2]])

dask.array.
around
(a, decimals=0, out=None)¶ Evenly round to the given number of decimals.
Parameters:  a : array_like
Input data.
 decimals : int, optional
Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary. See doc.ufuncs (Section “Output arguments”) for details.
Returns:  rounded_array : ndarray
An array of the same type as a, containing the rounded values. Unless out was specified, a new array is created. A reference to the result is returned.
The real and imaginary parts of complex numbers are rounded separately. The result of rounding a float is a float.
Notes
For values exactly halfway between rounded decimal values, Numpy rounds to the nearest even value. Thus 1.5 and 2.5 round to 2.0, 0.5 and 0.5 round to 0.0, etc. Results may also be surprising due to the inexact representation of decimal fractions in the IEEE floating point standard [1] and errors introduced when scaling by powers of ten.
References
[1] (1, 2) “Lecture Notes on the Status of IEEE 754”, William Kahan, http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF [2] “How Futile are Mindless Assessments of Roundoff in FloatingPoint Computation?”, William Kahan, http://www.cs.berkeley.edu/~wkahan/Mindless.pdf Examples
>>> np.around([0.37, 1.64]) array([ 0., 2.]) >>> np.around([0.37, 1.64], decimals=1) array([ 0.4, 1.6]) >>> np.around([.5, 1.5, 2.5, 3.5, 4.5]) # rounds to nearest even value array([ 0., 2., 2., 4., 4.]) >>> np.around([1,2,3,11], decimals=1) # ndarray of ints is returned array([ 1, 2, 3, 11]) >>> np.around([1,2,3,11], decimals=1) array([ 0, 0, 0, 10])

dask.array.
array
(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)¶ Create an array.
Parameters:  object : array_like
An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.
 dtype : datatype, optional
The desired datatype for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to ‘upcast’ the array. For downcasting, use the .astype(t) method.
 copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will only be made if __array__ returns a copy, if obj is a nested sequence, or if a copy is needed to satisfy any of the other requirements (dtype, order, etc.).
 order : {‘C’, ‘F’, ‘A’}, optional
Specify the order of the array. If order is ‘C’, then the array will be in Ccontiguous order (lastindex varies the fastest). If order is ‘F’, then the returned array will be in Fortrancontiguous order (firstindex varies the fastest). If order is ‘A’ (default), then the returned array may be in any order (either C, Fortrancontiguous, or even discontiguous), unless a copy is required, in which case it will be Ccontiguous.
 subok : bool, optional
If True, then subclasses will be passedthrough, otherwise the returned array will be forced to be a baseclass array (default).
 ndmin : int, optional
Specifies the minimum number of dimensions that the resulting array should have. Ones will be prepended to the shape as needed to meet this requirement.
Returns:  out : ndarray
An array object satisfying the specified requirements.
See also
empty
,empty_like
,zeros
,zeros_like
,ones
,ones_like
,fill
Examples
>>> np.array([1, 2, 3]) array([1, 2, 3])
Upcasting:
>>> np.array([1, 2, 3.0]) array([ 1., 2., 3.])
More than one dimension:
>>> np.array([[1, 2], [3, 4]]) array([[1, 2], [3, 4]])
Minimum dimensions 2:
>>> np.array([1, 2, 3], ndmin=2) array([[1, 2, 3]])
Type provided:
>>> np.array([1, 2, 3], dtype=complex) array([ 1.+0.j, 2.+0.j, 3.+0.j])
Datatype consisting of more than one element:
>>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')]) >>> x['a'] array([1, 3])
Creating an array from subclasses:
>>> np.array(np.mat('1 2; 3 4')) array([[1, 2], [3, 4]])
>>> np.array(np.mat('1 2; 3 4'), subok=True) matrix([[1, 2], [3, 4]])

dask.array.
asanyarray
(a)¶ Convert the input to a dask array.
Subclasses of
np.ndarray
will be passed through as chunks unchanged.Parameters:  a : arraylike
Input data, in any form that can be converted to a dask array.
Returns:  out : dask array
Dask array interpretation of a.
Examples
>>> import dask.array as da >>> import numpy as np >>> x = np.arange(3) >>> da.asanyarray(x) dask.array<array, shape=(3,), dtype=int64, chunksize=(3,)>
>>> y = [[1, 2, 3], [4, 5, 6]] >>> da.asanyarray(y) dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3)>

dask.array.
asarray
(a)¶ Convert the input to a dask array.
Parameters:  a : arraylike
Input data, in any form that can be converted to a dask array.
Returns:  out : dask array
Dask array interpretation of a.
Examples
>>> import dask.array as da >>> import numpy as np >>> x = np.arange(3) >>> da.asarray(x) dask.array<array, shape=(3,), dtype=int64, chunksize=(3,)>
>>> y = [[1, 2, 3], [4, 5, 6]] >>> da.asarray(y) dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3)>

dask.array.
atleast_1d
(*arys)¶ Convert inputs to arrays with at least one dimension.
Scalar inputs are converted to 1dimensional arrays, whilst higherdimensional inputs are preserved.
Parameters:  arys1, arys2, … : array_like
One or more input arrays.
Returns:  ret : ndarray
An array, or sequence of arrays, each with
a.ndim >= 1
. Copies are made only if necessary.
See also
Examples
>>> np.atleast_1d(1.0) array([ 1.])
>>> x = np.arange(9.0).reshape(3,3) >>> np.atleast_1d(x) array([[ 0., 1., 2.], [ 3., 4., 5.], [ 6., 7., 8.]]) >>> np.atleast_1d(x) is x True
>>> np.atleast_1d(1, [3, 4]) [array([1]), array([3, 4])]

dask.array.
atleast_2d
(*arys)¶ View inputs as arrays with at least two dimensions.
Parameters:  arys1, arys2, … : array_like
One or more arraylike sequences. Nonarray inputs are converted to arrays. Arrays that already have two or more dimensions are preserved.
Returns:  res, res2, … : ndarray
An array, or tuple of arrays, each with
a.ndim >= 2
. Copies are avoided where possible, and views with two or more dimensions are returned.
See also
Examples
>>> np.atleast_2d(3.0) array([[ 3.]])
>>> x = np.arange(3.0) >>> np.atleast_2d(x) array([[ 0., 1., 2.]]) >>> np.atleast_2d(x).base is x True
>>> np.atleast_2d(1, [1, 2], [[1, 2]]) [array([[1]]), array([[1, 2]]), array([[1, 2]])]

dask.array.
atleast_3d
(*arys)¶ View inputs as arrays with at least three dimensions.
Parameters:  arys1, arys2, … : array_like
One or more arraylike sequences. Nonarray inputs are converted to arrays. Arrays that already have three or more dimensions are preserved.
Returns:  res1, res2, … : ndarray
An array, or tuple of arrays, each with
a.ndim >= 3
. Copies are avoided where possible, and views with three or more dimensions are returned. For example, a 1D array of shape(N,)
becomes a view of shape(1, N, 1)
, and a 2D array of shape(M, N)
becomes a view of shape(M, N, 1)
.
See also
Examples
>>> np.atleast_3d(3.0) array([[[ 3.]]])
>>> x = np.arange(3.0) >>> np.atleast_3d(x).shape (1, 3, 1)
>>> x = np.arange(12.0).reshape(4,3) >>> np.atleast_3d(x).shape (4, 3, 1) >>> np.atleast_3d(x).base is x True
>>> for arr in np.atleast_3d([1, 2], [[1, 2]], [[[1, 2]]]): ... print(arr, arr.shape) ... [[[1] [2]]] (1, 2, 1) [[[1] [2]]] (1, 2, 1) [[[1 2]]] (1, 1, 2)

dask.array.
bincount
(x, weights=None, minlength=None)¶ Count number of occurrences of each value in array of nonnegative ints.
The number of bins (of size 1) is one larger than the largest value in x. If minlength is specified, there will be at least this number of bins in the output array (though it will be longer if necessary, depending on the contents of x). Each bin gives the number of occurrences of its index value in x. If weights is specified the input array is weighted by it, i.e. if a value
n
is found at positioni
,out[n] += weight[i]
instead ofout[n] += 1
.Parameters:  x : array_like, 1 dimension, nonnegative ints
Input array.
 weights : array_like, optional
Weights, array of the same shape as x.
 minlength : int, optional
A minimum number of bins for the output array.
New in version 1.6.0.
Returns:  out : ndarray of ints
The result of binning the input array. The length of out is equal to
np.amax(x)+1
.
Raises:  ValueError
If the input is not 1dimensional, or contains elements with negative values, or if minlength is nonpositive.
 TypeError
If the type of the input is float or complex.
Examples
>>> np.bincount(np.arange(5)) array([1, 1, 1, 1, 1]) >>> np.bincount(np.array([0, 1, 1, 3, 2, 1, 7])) array([1, 3, 1, 1, 0, 0, 0, 1])
>>> x = np.array([0, 1, 1, 3, 2, 1, 7, 23]) >>> np.bincount(x).size == np.amax(x)+1 True
The input array needs to be of integer dtype, otherwise a TypeError is raised:
>>> np.bincount(np.arange(5, dtype=np.float)) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: array cannot be safely cast to required type
A possible use of
bincount
is to perform sums over variablesize chunks of an array, using theweights
keyword.>>> w = np.array([0.3, 0.5, 0.2, 0.7, 1., 0.6]) # weights >>> x = np.array([0, 1, 1, 2, 2, 2]) >>> np.bincount(x, weights=w) array([ 0.3, 0.7, 1.1])

dask.array.
bitwise_and
(x1, x2[, out])¶ Compute the bitwise AND of two arrays elementwise.
Computes the bitwise AND of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
&
.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
Returns:  out : array_like
Result.
See also
logical_and
,bitwise_or
,bitwise_xor
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 is represented by
00001101
. Likewise, 17 is represented by00010001
. The bitwise AND of 13 and 17 is therefore000000001
, or 1:>>> np.bitwise_and(13, 17) 1
>>> np.bitwise_and(14, 13) 12 >>> np.binary_repr(12) '1100' >>> np.bitwise_and([14,3], 13) array([12, 1])
>>> np.bitwise_and([11,7], [4,25]) array([0, 1]) >>> np.bitwise_and(np.array([2,5,255]), np.array([3,14,16])) array([ 2, 4, 16]) >>> np.bitwise_and([True, True], [False, True]) array([False, True], dtype=bool)

dask.array.
bitwise_not
(x[, out])¶ Compute bitwise inversion, or bitwise NOT, elementwise.
Computes the bitwise NOT of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
~
.For signed integer inputs, the two’s complement is returned. In a two’scomplement system negative numbers are represented by the two’s complement of the absolute value. This is the most common method of representing signed integers on computers [1]. A Nbit two’scomplement system can represent every integer in the range \(2^{N1}\) to \(+2^{N1}1\).
Parameters:  x1 : array_like
Only integer and boolean types are handled.
Returns:  out : array_like
Result.
See also
bitwise_and
,bitwise_or
,bitwise_xor
,logical_not
binary_repr
 Return the binary representation of the input number as a string.
Notes
bitwise_not is an alias for invert:
>>> np.bitwise_not is np.invert True
References
[1] (1, 2) Wikipedia, “Two’s complement”, http://en.wikipedia.org/wiki/Two’s_complement Examples
We’ve seen that 13 is represented by
00001101
. The invert or bitwise NOT of 13 is then:>>> np.invert(np.array([13], dtype=uint8)) array([242], dtype=uint8) >>> np.binary_repr(x, width=8) '00001101' >>> np.binary_repr(242, width=8) '11110010'
The result depends on the bitwidth:
>>> np.invert(np.array([13], dtype=uint16)) array([65522], dtype=uint16) >>> np.binary_repr(x, width=16) '0000000000001101' >>> np.binary_repr(65522, width=16) '1111111111110010'
When using signed integer types the result is the two’s complement of the result for the unsigned type:
>>> np.invert(np.array([13], dtype=int8)) array([14], dtype=int8) >>> np.binary_repr(14, width=8) '11110010'
Booleans are accepted as well:
>>> np.invert(array([True, False])) array([False, True], dtype=bool)

dask.array.
bitwise_or
(x1, x2[, out])¶ Compute the bitwise OR of two arrays elementwise.
Computes the bitwise OR of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator

.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  out : array_like
Result.
See also
logical_or
,bitwise_and
,bitwise_xor
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 has the binaray representation
00001101
. Likewise, 16 is represented by00010000
. The bitwise OR of 13 and 16 is then000111011
, or 29:>>> np.bitwise_or(13, 16) 29 >>> np.binary_repr(29) '11101'
>>> np.bitwise_or(32, 2) 34 >>> np.bitwise_or([33, 4], 1) array([33, 5]) >>> np.bitwise_or([33, 4], [1, 2]) array([33, 6])
>>> np.bitwise_or(np.array([2, 5, 255]), np.array([4, 4, 4])) array([ 6, 5, 255]) >>> np.array([2, 5, 255])  np.array([4, 4, 4]) array([ 6, 5, 255]) >>> np.bitwise_or(np.array([2, 5, 255, 2147483647L], dtype=np.int32), ... np.array([4, 4, 4, 2147483647L], dtype=np.int32)) array([ 6, 5, 255, 2147483647]) >>> np.bitwise_or([True, True], [False, True]) array([ True, True], dtype=bool)

dask.array.
bitwise_xor
(x1, x2[, out])¶ Compute the bitwise XOR of two arrays elementwise.
Computes the bitwise XOR of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
^
.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
Returns:  out : array_like
Result.
See also
logical_xor
,bitwise_and
,bitwise_or
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 is represented by
00001101
. Likewise, 17 is represented by00010001
. The bitwise XOR of 13 and 17 is therefore00011100
, or 28:>>> np.bitwise_xor(13, 17) 28 >>> np.binary_repr(28) '11100'
>>> np.bitwise_xor(31, 5) 26 >>> np.bitwise_xor([31,3], 5) array([26, 6])
>>> np.bitwise_xor([31,3], [5,6]) array([26, 5]) >>> np.bitwise_xor([True, True], [False, True]) array([ True, False], dtype=bool)

dask.array.
block
(arrays, allow_unknown_chunksizes=False)¶ Assemble an ndarray from nested lists of blocks.
Blocks in the innermost lists are concatenated along the last dimension (1), then these are concatenated along the secondlast dimension (2), and so on until the outermost list is reached
Blocks can be of any dimension, but will not be broadcasted using the normal rules. Instead, leading axes of size 1 are inserted, to make
block.ndim
the same for all blocks. This is primarily useful for working with scalars, and means that code likeblock([v, 1])
is valid, wherev.ndim == 1
.When the nested list is two levels deep, this allows block matrices to be constructed from their components.
Parameters:  arrays : nested list of array_like or scalars (but not tuples)
If passed a single ndarray or scalar (a nested list of depth 0), this is returned unmodified (and not copied).
Elements shapes must match along the appropriate axes (without broadcasting), but leading 1s will be prepended to the shape as necessary to make the dimensions match.
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
Returns:  block_array : ndarray
The array assembled from the given blocks.
The dimensionality of the output is equal to the greatest of: * the dimensionality of all the inputs * the depth to which the input list is nested
Raises:  ValueError
 If list depths are mismatched  for instance,
[[a, b], c]
is illegal, and should be spelt[[a, b], [c]]
 If lists are empty  for instance,
[[a, b], []]
 If list depths are mismatched  for instance,
See also
concatenate
 Join a sequence of arrays together.
stack
 Stack arrays in sequence along a new dimension.
hstack
 Stack arrays in sequence horizontally (column wise).
vstack
 Stack arrays in sequence vertically (row wise).
dstack
 Stack arrays in sequence depth wise (along third dimension).
vsplit
 Split array into a list of multiple subarrays vertically.
Notes
When called with only scalars,
block
is equivalent to an ndarray call. Soblock([[1, 2], [3, 4]])
is equivalent toarray([[1, 2], [3, 4]])
.This function does not enforce that the blocks lie on a fixed grid.
block([[a, b], [c, d]])
is not restricted to arrays of the form:AAAbb AAAbb cccDD
But is also allowed to produce, for some
a, b, c, d
:AAAbb AAAbb cDDDD
Since concatenation happens along the last axis first, block is _not_ capable of producing the following directly:
AAAbb cccbb cccDD
Matlab’s “square bracket stacking”,
[A, B, ...; p, q, ...]
, is equivalent toblock([[A, B, ...], [p, q, ...]])
.

dask.array.
broadcast_arrays
(*args, **kwargs)¶ Broadcast any number of arrays against each other.
Parameters:  `*args` : array_likes
The arrays to broadcast.
 subok : bool, optional
If True, then subclasses will be passedthrough, otherwise the returned arrays will be forced to be a baseclass array (default).
Returns:  broadcasted : list of arrays
These arrays are views on the original arrays. They are typically not contiguous. Furthermore, more than one element of a broadcasted array may refer to a single memory location. If you need to write to the arrays, make copies first.
Examples
>>> x = np.array([[1,2,3]]) >>> y = np.array([[1],[2],[3]]) >>> np.broadcast_arrays(x, y) [array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]), array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])]
Here is a useful idiom for getting contiguous copies instead of noncontiguous views.
>>> [np.array(a) for a in np.broadcast_arrays(x, y)] [array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]), array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])]

dask.array.
broadcast_to
(x, shape, chunks=None)¶ Broadcast an array to a new shape.
Parameters:  x : array_like
The array to broadcast.
 shape : tuple
The shape of the desired array.
 chunks : tuple, optional
If provided, then the result will use these chunks instead of the same chunks as the source array. Setting chunks explicitly as part of broadcast_to is more efficient than rechunking afterwards. Chunks are only allowed to differ from the original shape along dimensions that are new on the result or have size 1 the input array.
Returns:  broadcast : dask array
See also

dask.array.
coarsen
(reduction, x, axes, trim_excess=False) Coarsen array by applying reduction to fixed size neighborhoods
Parameters:  reduction: function
Function like np.sum, np.mean, etc…
 x: np.ndarray
Array to be coarsened
 axes: dict
Mapping of axis to coarsening factor
Examples
>>> x = np.array([1, 2, 3, 4, 5, 6]) >>> coarsen(np.sum, x, {0: 2}) array([ 3, 7, 11]) >>> coarsen(np.max, x, {0: 3}) array([3, 6])
Provide dictionary of scale per dimension
>>> x = np.arange(24).reshape((4, 6)) >>> x array([[ 0, 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]])
>>> coarsen(np.min, x, {0: 2, 1: 3}) array([[ 0, 3], [12, 15]])
You must avoid excess elements explicitly
>>> x = np.array([1, 2, 3, 4, 5, 6, 7, 8]) >>> coarsen(np.min, x, {0: 3}, trim_excess=True) array([1, 4])

dask.array.
ceil
(x[, out])¶ Return the ceiling of the input, elementwise.
The ceil of the scalar x is the smallest integer i, such that i >= x. It is often denoted as \(\lceil x \rceil\).
Parameters:  x : array_like
Input data.
Returns:  y : ndarray or scalar
The ceiling of each element in x, with float dtype.
Examples
>>> a = np.array([1.7, 1.5, 0.2, 0.2, 1.5, 1.7, 2.0]) >>> np.ceil(a) array([1., 1., 0., 1., 2., 2., 2.])

dask.array.
choose
(a, choices, out=None, mode='raise')¶ Construct an array from an index array and a set of arrays to choose from.
First of all, if confused or uncertain, definitely look at the Examples  in its full generality, this function is less simple than it might seem from the following code description (below ndi = numpy.lib.index_tricks):
np.choose(a,c) == np.array([c[a[I]][I] for I in ndi.ndindex(a.shape)])
.But this omits some subtleties. Here is a fully general summary:
Given an “index” array (a) of integers and a sequence of n arrays (choices), a and each choice array are first broadcast, as necessary, to arrays of a common shape; calling these Ba and Bchoices[i], i = 0,…,n1 we have that, necessarily,
Ba.shape == Bchoices[i].shape
for each i. Then, a new array with shapeBa.shape
is created as follows: if
mode=raise
(the default), then, first of all, each element of a (and thus Ba) must be in the range [0, n1]; now, suppose that i (in that range) is the value at the (j0, j1, …, jm) position in Ba  then the value at the same position in the new array is the value in Bchoices[i] at that same position;  if
mode=wrap
, values in a (and thus Ba) may be any (signed) integer; modular arithmetic is used to map integers outside the range [0, n1] back into that range; and then the new array is constructed as above;  if
mode=clip
, values in a (and thus Ba) may be any (signed) integer; negative integers are mapped to 0; values greater than n1 are mapped to n1; and then the new array is constructed as above.
Parameters:  a : int array
This array must contain integers in [0, n1], where n is the number of choices, unless
mode=wrap
ormode=clip
, in which cases any integers are permissible. choices : sequence of arrays
Choice arrays. a and all of the choices must be broadcastable to the same shape. If choices is itself an array (not recommended), then its outermost dimension (i.e., the one corresponding to
choices.shape[0]
) is taken as defining the “sequence”. out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
 mode : {‘raise’ (default), ‘wrap’, ‘clip’}, optional
Specifies how indices outside [0, n1] will be treated:
 ‘raise’ : an exception is raised
 ‘wrap’ : value becomes value mod n
 ‘clip’ : values < 0 are mapped to 0, values > n1 are mapped to n1
Returns:  merged_array : array
The merged result.
Raises:  ValueError: shape mismatch
If a and each choice array are not all broadcastable to the same shape.
See also
ndarray.choose
 equivalent method
Notes
To reduce the chance of misinterpretation, even though the following “abuse” is nominally supported, choices should neither be, nor be thought of as, a single array, i.e., the outermost sequencelike container should be either a list or a tuple.
Examples
>>> choices = [[0, 1, 2, 3], [10, 11, 12, 13], ... [20, 21, 22, 23], [30, 31, 32, 33]] >>> np.choose([2, 3, 1, 0], choices ... # the first element of the result will be the first element of the ... # third (2+1) "array" in choices, namely, 20; the second element ... # will be the second element of the fourth (3+1) choice array, i.e., ... # 31, etc. ... ) array([20, 31, 12, 3]) >>> np.choose([2, 4, 1, 0], choices, mode='clip') # 4 goes to 3 (41) array([20, 31, 12, 3]) >>> # because there are 4 choice arrays >>> np.choose([2, 4, 1, 0], choices, mode='wrap') # 4 goes to (4 mod 4) array([20, 1, 12, 3]) >>> # i.e., 0
A couple examples illustrating how choose broadcasts:
>>> a = [[1, 0, 1], [0, 1, 0], [1, 0, 1]] >>> choices = [10, 10] >>> np.choose(a, choices) array([[ 10, 10, 10], [10, 10, 10], [ 10, 10, 10]])
>>> # With thanks to Anne Archibald >>> a = np.array([0, 1]).reshape((2,1,1)) >>> c1 = np.array([1, 2, 3]).reshape((1,3,1)) >>> c2 = np.array([1, 2, 3, 4, 5]).reshape((1,1,5)) >>> np.choose(a, (c1, c2)) # result is 2x3x5, res[0,:,:]=c1, res[1,:,:]=c2 array([[[ 1, 1, 1, 1, 1], [ 2, 2, 2, 2, 2], [ 3, 3, 3, 3, 3]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]])
 if

dask.array.
clip
(*args, **kwargs)¶ Clip (limit) the values in an array.
Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of
[0, 1]
is specified, values smaller than 0 become 0, and values larger than 1 become 1.Parameters:  a : array_like
Array containing elements to clip.
 a_min : scalar or array_like
Minimum value.
 a_max : scalar or array_like
Maximum value. If a_min or a_max are array_like, then they will be broadcasted to the shape of a.
 out : ndarray, optional
The results will be placed in this array. It may be the input array for inplace clipping. out must be of the right shape to hold the output. Its type is preserved.
Returns:  clipped_array : ndarray
An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.
See also
numpy.doc.ufuncs
 Section “Output arguments”
Examples
>>> a = np.arange(10) >>> np.clip(a, 1, 8) array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8]) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> np.clip(a, 3, 6, out=a) array([3, 3, 3, 3, 4, 5, 6, 6, 6, 6]) >>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> np.clip(a, [3,4,1,1,1,4,4,4,4,4], 8) array([3, 4, 2, 3, 4, 5, 6, 7, 8, 8])

dask.array.
compress
(condition, a, axis=None, out=None)¶ Return selected slices of an array along given axis.
When working along a given axis, a slice along that axis is returned in output for each index where condition evaluates to True. When working on a 1D array, compress is equivalent to extract.
Parameters:  condition : 1D array of bools
Array that selects which entries to return. If len(condition) is less than the size of a along the given axis, then output is truncated to the length of the condition array.
 a : array_like
Array from which to extract a part.
 axis : int, optional
Axis along which to take slices. If None (default), work on the flattened array.
 out : ndarray, optional
Output array. Its type is preserved and it must be of the right shape to hold the output.
Returns:  compressed_array : ndarray
A copy of a without the slices along axis for which condition is false.
See also
take
,choose
,diag
,diagonal
,select
ndarray.compress
 Equivalent method in ndarray
np.extract
 Equivalent method when working on 1D arrays
numpy.doc.ufuncs
 Section “Output arguments”
Examples
>>> a = np.array([[1, 2], [3, 4], [5, 6]]) >>> a array([[1, 2], [3, 4], [5, 6]]) >>> np.compress([0, 1], a, axis=0) array([[3, 4]]) >>> np.compress([False, True, True], a, axis=0) array([[3, 4], [5, 6]]) >>> np.compress([False, True], a, axis=1) array([[2], [4], [6]])
Working on the flattened array does not return slices along an axis but selects elements.
>>> np.compress([False, True], a) array([2])

dask.array.
concatenate
(seq, axis=0, allow_unknown_chunksizes=False) Concatenate arrays along an existing axis
Given a sequence of dask Arrays form a new dask Array by stacking them along an existing dimension (axis=0 by default)
Parameters:  seq: list of dask.arrays
 axis: int
Dimension along which to align all of the arrays
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.concatenate(data, axis=0) >>> x.shape (12, 4)
>>> da.concatenate(data, axis=1).shape (4, 12)
Result is a new dask Array

dask.array.
conj
(x[, out])¶ Return the complex conjugate, elementwise.
The complex conjugate of a complex number is obtained by changing the sign of its imaginary part.
Parameters:  x : array_like
Input value.
Returns:  y : ndarray
The complex conjugate of x, with same dtype as y.
Examples
>>> np.conjugate(1+2j) (12j)
>>> x = np.eye(2) + 1j * np.eye(2) >>> np.conjugate(x) array([[ 1.1.j, 0.0.j], [ 0.0.j, 1.1.j]])

dask.array.
copysign
(x1, x2[, out])¶ Change the sign of x1 to that of x2, elementwise.
If both arguments are arrays or sequences, they have to be of the same length. If x2 is a scalar, its sign will be copied to all elements of x1.
Parameters:  x1 : array_like
Values to change the sign of.
 x2 : array_like
The sign of x2 is copied to x1.
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  out : array_like
The values of x1 with the sign of x2.
Examples
>>> np.copysign(1.3, 1) 1.3 >>> 1/np.copysign(0, 1) inf >>> 1/np.copysign(0, 1) inf
>>> np.copysign([1, 0, 1], 1.1) array([1., 0., 1.]) >>> np.copysign([1, 0, 1], np.arange(3)1) array([1., 0., 1.])

dask.array.
corrcoef
(x, y=None, rowvar=1, bias=<class 'numpy._NoValue'>, ddof=<class 'numpy._NoValue'>)¶ Return Pearson productmoment correlation coefficients.
Please refer to the documentation for cov for more detail. The relationship between the correlation coefficient matrix, R, and the covariance matrix, C, is
\[R_{ij} = \frac{ C_{ij} } { \sqrt{ C_{ii} * C_{jj} } }\]The values of R are between 1 and 1, inclusive.
Parameters:  x : array_like
A 1D or 2D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables. Also see rowvar below.
 y : array_like, optional
An additional set of variables and observations. y has the same shape as x.
 rowvar : int, optional
If rowvar is nonzero (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
 bias : _NoValue, optional
Has no effect, do not use.
Deprecated since version 1.10.0.
 ddof : _NoValue, optional
Has no effect, do not use.
Deprecated since version 1.10.0.
Returns:  R : ndarray
The correlation coefficient matrix of the variables.
See also
cov
 Covariance matrix
Notes
Due to floating point rounding the resulting array may not be Hermitian, the diagonal elements may not be 1, and the elements may not satisfy the inequality abs(a) <= 1. The real and imaginary parts are clipped to the interval [1, 1] in an attempt to improve on that situation but is not much help in the complex case.
This function accepts but discards arguments bias and ddof. This is for backwards compatibility with previous versions of this function. These arguments had no effect on the return values of the function and can be safely ignored in this and previous versions of numpy.

dask.array.
cos
(x[, out])¶ Cosine elementwise.
Parameters:  x : array_like
Input array in radians.
 out : ndarray, optional
Output array of same shape as x.
Returns:  y : ndarray
The corresponding cosine values.
Raises:  ValueError: invalid return array shape
if out is provided and out.shape != x.shape (See Examples)
Notes
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
References
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972.
Examples
>>> np.cos(np.array([0, np.pi/2, np.pi])) array([ 1.00000000e+00, 6.12303177e17, 1.00000000e+00]) >>> >>> # Example of providing the optional output parameter >>> out2 = np.cos([0.1], out1) >>> out2 is out1 True >>> >>> # Example of ValueError due to provision of shape mismatched `out` >>> np.cos(np.zeros((3,3)),np.zeros((2,2))) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: invalid return array shape

dask.array.
cosh
(x[, out])¶ Hyperbolic cosine, elementwise.
Equivalent to
1/2 * (np.exp(x) + np.exp(x))
andnp.cos(1j*x)
.Parameters:  x : array_like
Input array.
Returns:  out : ndarray
Output array of same shape as x.
Examples
>>> np.cosh(0) 1.0
The hyperbolic cosine describes the shape of a hanging cable:
>>> import matplotlib.pyplot as plt >>> x = np.linspace(4, 4, 1000) >>> plt.plot(x, np.cosh(x)) >>> plt.show()

dask.array.
count_nonzero
(a)¶ Counts the number of nonzero values in the array
a
.Parameters:  a : array_like
The array for which to count nonzeros.
Returns:  count : int or array of int
Number of nonzero values in the array.
See also
nonzero
 Return the coordinates of all the nonzero values.
Examples
>>> np.count_nonzero(np.eye(4)) 4 >>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]]) 5

dask.array.
cov
(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)¶ Estimate a covariance matrix, given data and weights.
Covariance indicates the level to which two variables vary together. If we examine Ndimensional samples, \(X = [x_1, x_2, ... x_N]^T\), then the covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\).
See the notes for an outline of the algorithm.
Parameters:  m : array_like
A 1D or 2D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.
 y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
 rowvar : bool, optional
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
 bias : bool, optional
Default normalization (False) is by
(N  1)
, whereN
is the number of observations given (unbiased estimate). If bias is True, then normalization is byN
. These values can be overridden by using the keywordddof
in numpy versions >= 1.5. ddof : int, optional
If not
None
the default value implied by bias is overridden. Note thatddof=1
will return the unbiased estimate, even if both fweights and aweights are specified, andddof=0
will return the simple average. See the notes for the details. The default value isNone
.New in version 1.5.
 fweights : array_like, int, optional
1D array of integer freguency weights; the number of times each observation vector should be repeated.
New in version 1.10.
 aweights : array_like, optional
1D array of observation vector weights. These relative weights are typically large for observations considered “important” and smaller for observations considered less “important”. If
ddof=0
the array of weights can be used to assign probabilities to observation vectors.New in version 1.10.
Returns:  out : ndarray
The covariance matrix of the variables.
See also
corrcoef
 Normalized covariance matrix
Notes
Assume that the observations are in the columns of the observation array m and let
f = fweights
anda = aweights
for brevity. The steps to compute the weighted covariance are as follows:>>> w = f * a >>> v1 = np.sum(w) >>> v2 = np.sum(w * a) >>> m = np.sum(m * w, axis=1, keepdims=True) / v1 >>> cov = np.dot(m * w, m.T) * v1 / (v1**2  ddof * v2)
Note that when
a == 1
, the normalization factorv1 / (v1**2  ddof * v2)
goes over to1 / (np.sum(f)  ddof)
as it should.Examples
Consider two variables, \(x_0\) and \(x_1\), which correlate perfectly, but in opposite directions:
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T >>> x array([[0, 1, 2], [2, 1, 0]])
Note how \(x_0\) increases while \(x_1\) decreases. The covariance matrix shows this clearly:
>>> np.cov(x) array([[ 1., 1.], [1., 1.]])
Note that element \(C_{0,1}\), which shows the correlation between \(x_0\) and \(x_1\), is negative.
Further, note how x and y are combined:
>>> x = [2.1, 1, 4.3] >>> y = [3, 1.1, 0.12] >>> X = np.vstack((x,y)) >>> print(np.cov(X)) [[ 11.71 4.286 ] [ 4.286 2.14413333]] >>> print(np.cov(x, y)) [[ 11.71 4.286 ] [ 4.286 2.14413333]] >>> print(np.cov(x)) 11.71

dask.array.
cumprod
(a, axis=None, dtype=None, out=None)¶ Return the cumulative product of elements along a given axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative product is computed. By default the input is flattened.
 dtype : dtype, optional
Type of the returned array, as well as of the accumulator in which the elements are multiplied. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used instead.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type of the resulting values will be cast if necessary.
Returns:  cumprod : ndarray
A new array holding the result is returned unless out is specified, in which case a reference to out is returned.
See also
numpy.doc.ufuncs
 Section “Output arguments”
Notes
Arithmetic is modular when using integer types, and no error is raised on overflow.
Examples
>>> a = np.array([1,2,3]) >>> np.cumprod(a) # intermediate results 1, 1*2 ... # total product 1*2*3 = 6 array([1, 2, 6]) >>> a = np.array([[1, 2, 3], [4, 5, 6]]) >>> np.cumprod(a, dtype=float) # specify type of output array([ 1., 2., 6., 24., 120., 720.])
The cumulative product for each column (i.e., over the rows) of a:
>>> np.cumprod(a, axis=0) array([[ 1, 2, 3], [ 4, 10, 18]])
The cumulative product for each row (i.e. over the columns) of a:
>>> np.cumprod(a,axis=1) array([[ 1, 2, 6], [ 4, 20, 120]])

dask.array.
cumsum
(a, axis=None, dtype=None, out=None)¶ Return the cumulative sum of the elements along a given axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative sum is computed. The default (None) is to compute the cumsum over the flattened array.
 dtype : dtype, optional
Type of the returned array and of the accumulator in which the elements are summed. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type will be cast if necessary. See doc.ufuncs (Section “Output arguments”) for more details.
Returns:  cumsum_along_axis : ndarray.
A new array holding the result is returned unless out is specified, in which case a reference to out is returned. The result has the same size as a, and the same shape as a if axis is not None or a is a 1d array.
See also
Notes
Arithmetic is modular when using integer types, and no error is raised on overflow.
Examples
>>> a = np.array([[1,2,3], [4,5,6]]) >>> a array([[1, 2, 3], [4, 5, 6]]) >>> np.cumsum(a) array([ 1, 3, 6, 10, 15, 21]) >>> np.cumsum(a, dtype=float) # specifies type of output value(s) array([ 1., 3., 6., 10., 15., 21.])
>>> np.cumsum(a,axis=0) # sum over rows for each of the 3 columns array([[1, 2, 3], [5, 7, 9]]) >>> np.cumsum(a,axis=1) # sum over columns for each of the 2 rows array([[ 1, 3, 6], [ 4, 9, 15]])

dask.array.
deg2rad
(x[, out])¶ Convert angles from degrees to radians.
Parameters:  x : array_like
Angles in degrees.
Returns:  y : ndarray
The corresponding angle in radians.
See also
rad2deg
 Convert angles from radians to degrees.
unwrap
 Remove large jumps in angle by wrapping.
Notes
New in version 1.3.0.
deg2rad(x)
isx * pi / 180
.Examples
>>> np.deg2rad(180) 3.1415926535897931

dask.array.
degrees
(x[, out])¶ Convert angles from radians to degrees.
Parameters:  x : array_like
Input array in radians.
 out : ndarray, optional
Output array of same shape as x.
Returns:  y : ndarray of floats
The corresponding degree values; if out was supplied this is a reference to it.
See also
rad2deg
 equivalent function
Examples
Convert a radian array to degrees
>>> rad = np.arange(12.)*np.pi/6 >>> np.degrees(rad) array([ 0., 30., 60., 90., 120., 150., 180., 210., 240., 270., 300., 330.])
>>> out = np.zeros((rad.shape)) >>> r = degrees(rad, out) >>> np.all(r == out) True

dask.array.
diag
(v, k=0)¶ Extract a diagonal or construct a diagonal array.
See the more detailed documentation for
numpy.diagonal
if you use this function to extract a diagonal and wish to write to the resulting array; whether it returns a copy or a view depends on what version of numpy you are using.Parameters:  v : array_like
If v is a 2D array, return a copy of its kth diagonal. If v is a 1D array, return a 2D array with v on the kth diagonal.
 k : int, optional
Diagonal in question. The default is 0. Use k>0 for diagonals above the main diagonal, and k<0 for diagonals below the main diagonal.
Returns:  out : ndarray
The extracted diagonal or constructed diagonal array.
See also
Examples
>>> x = np.arange(9).reshape((3,3)) >>> x array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
>>> np.diag(x) array([0, 4, 8]) >>> np.diag(x, k=1) array([1, 5]) >>> np.diag(x, k=1) array([3, 7])
>>> np.diag(np.diag(x)) array([[0, 0, 0], [0, 4, 0], [0, 0, 8]])

dask.array.
diff
(a, n=1, axis=1)¶ Calculate the nth discrete difference along given axis.
The first difference is given by
out[n] = a[n+1]  a[n]
along the given axis, higher differences are calculated by using diff recursively.Parameters:  a : array_like
Input array
 n : int, optional
The number of times values are differenced.
 axis : int, optional
The axis along which the difference is taken, default is the last axis.
Returns:  diff : ndarray
The nth differences. The shape of the output is the same as a except along axis where the dimension is smaller by n.
 .
Examples
>>> x = np.array([1, 2, 4, 7, 0]) >>> np.diff(x) array([ 1, 2, 3, 7]) >>> np.diff(x, n=2) array([ 1, 1, 10])
>>> x = np.array([[1, 3, 6, 10], [0, 5, 6, 8]]) >>> np.diff(x) array([[2, 3, 4], [5, 1, 2]]) >>> np.diff(x, axis=0) array([[1, 2, 0, 2]])

dask.array.
digitize
(x, bins, right=False)¶ Return the indices of the bins to which each value in input array belongs.
Each index
i
returned is such thatbins[i1] <= x < bins[i]
if bins is monotonically increasing, orbins[i1] > x >= bins[i]
if bins is monotonically decreasing. If values in x are beyond the bounds of bins, 0 orlen(bins)
is returned as appropriate. If right is True, then the right bin is closed so that the indexi
is such thatbins[i1] < x <= bins[i]
or bins[i1] >= x > bins[i]`` if bins is monotonically increasing or decreasing, respectively.Parameters:  x : array_like
Input array to be binned. Prior to Numpy 1.10.0, this array had to be 1dimensional, but can now have any shape.
 bins : array_like
Array of bins. It has to be 1dimensional and monotonic.
 right : bool, optional
Indicating whether the intervals include the right or the left bin edge. Default behavior is (right==False) indicating that the interval does not include the right edge. The left bin end is open in this case, i.e., bins[i1] <= x < bins[i] is the default behavior for monotonically increasing bins.
Returns:  out : ndarray of ints
Output array of indices, of same shape as x.
Raises:  ValueError
If bins is not monotonic.
 TypeError
If the type of the input is complex.
Notes
If values in x are such that they fall outside the bin range, attempting to index bins with the indices that digitize returns will result in an IndexError.
New in version 1.10.0.
np.digitize is implemented in terms of np.searchsorted. This means that a binary search is used to bin the values, which scales much better for larger number of bins than the previous linear search. It also removes the requirement for the input array to be 1dimensional.
Examples
>>> x = np.array([0.2, 6.4, 3.0, 1.6]) >>> bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0]) >>> inds = np.digitize(x, bins) >>> inds array([1, 4, 3, 2]) >>> for n in range(x.size): ... print(bins[inds[n]1], "<=", x[n], "<", bins[inds[n]]) ... 0.0 <= 0.2 < 1.0 4.0 <= 6.4 < 10.0 2.5 <= 3.0 < 4.0 1.0 <= 1.6 < 2.5
>>> x = np.array([1.2, 10.0, 12.4, 15.5, 20.]) >>> bins = np.array([0, 5, 10, 15, 20]) >>> np.digitize(x,bins,right=True) array([1, 2, 3, 4, 4]) >>> np.digitize(x,bins,right=False) array([1, 3, 3, 4, 5])

dask.array.
dot
(a, b, out=None)¶ Dot product of two arrays.
For 2D arrays it is equivalent to matrix multiplication, and for 1D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the secondtolast of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
Parameters:  a : array_like
First argument.
 b : array_like
Second argument.
 out : ndarray, optional
Output argument. This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be Ccontiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
Returns:  output : ndarray
Returns the dot product of a and b. If a and b are both scalars or both 1D arrays then a scalar is returned; otherwise an array is returned. If out is given, then it is returned.
Raises:  ValueError
If the last dimension of a is not the same size as the secondtolast dimension of b.
See also
Examples
>>> np.dot(3, 4) 12
Neither argument is complexconjugated:
>>> np.dot([2j, 3j], [2j, 3j]) (13+0j)
For 2D arrays it is the matrix product:
>>> a = [[1, 0], [0, 1]] >>> b = [[4, 1], [2, 2]] >>> np.dot(a, b) array([[4, 1], [2, 2]])
>>> a = np.arange(3*4*5*6).reshape((3,4,5,6)) >>> b = np.arange(3*4*5*6)[::1].reshape((5,4,6,3)) >>> np.dot(a, b)[2,3,2,1,2,2] 499128 >>> sum(a[2,3,2,:] * b[1,2,:,2]) 499128

dask.array.
dstack
(tup)¶ Stack arrays in sequence depth wise (along third axis).
Takes a sequence of arrays and stack them along the third axis to make a single array. Rebuilds arrays divided by dsplit. This is a simple way to stack 2D arrays (images) into a single 3D array for processing.
Parameters:  tup : sequence of arrays
Arrays to stack. All of them must have the same shape along all but the third axis.
Returns:  stacked : ndarray
The array formed by stacking the given arrays.
See also
stack
 Join a sequence of arrays along a new axis.
vstack
 Stack along first axis.
hstack
 Stack along second axis.
concatenate
 Join a sequence of arrays along an existing axis.
dsplit
 Split array along third axis.
Notes
Equivalent to
np.concatenate(tup, axis=2)
.Examples
>>> a = np.array((1,2,3)) >>> b = np.array((2,3,4)) >>> np.dstack((a,b)) array([[[1, 2], [2, 3], [3, 4]]])
>>> a = np.array([[1],[2],[3]]) >>> b = np.array([[2],[3],[4]]) >>> np.dstack((a,b)) array([[[1, 2]], [[2, 3]], [[3, 4]]])

dask.array.
ediff1d
(ary, to_end=None, to_begin=None)¶ The differences between consecutive elements of an array.
Parameters:  ary : array_like
If necessary, will be flattened before the differences are taken.
 to_end : array_like, optional
Number(s) to append at the end of the returned differences.
 to_begin : array_like, optional
Number(s) to prepend at the beginning of the returned differences.
Returns:  ediff1d : ndarray
The differences. Loosely, this is
ary.flat[1:]  ary.flat[:1]
.
Notes
When applied to masked arrays, this function drops the mask information if the to_begin and/or to_end parameters are used.
Examples
>>> x = np.array([1, 2, 4, 7, 0]) >>> np.ediff1d(x) array([ 1, 2, 3, 7])
>>> np.ediff1d(x, to_begin=99, to_end=np.array([88, 99])) array([99, 1, 2, 3, 7, 88, 99])
The returned array is always 1D.
>>> y = [[1, 2, 4], [1, 6, 24]] >>> np.ediff1d(y) array([ 1, 2, 3, 5, 18])

dask.array.
empty
(*args, **kwargs)¶ Blocked variant of empty
Follows the signature of empty exactly except that it also requires a keyword argument chunks=(…)
Original signature follows below. empty(shape, dtype=float, order=’C’)
Return a new array of given shape and type, without initializing entries.
Parameters:  shape : int or tuple of int
Shape of the empty array
 dtype : datatype, optional
Desired output datatype.
 order : {‘C’, ‘F’}, optional
Whether to store multidimensional data in rowmajor (Cstyle) or columnmajor (Fortranstyle) order in memory.
Returns:  out : ndarray
Array of uninitialized (arbitrary) data of the given shape, dtype, and order. Object arrays will be initialized to None.
See also
Notes
empty, unlike zeros, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution.
Examples
>>> np.empty([2, 2]) array([[ 9.74499359e+001, 6.69583040e309], [ 2.13182611e314, 3.06959433e309]]) #random
>>> np.empty([2, 2], dtype=int) array([[1073741821, 1067949133], [ 496041986, 19249760]]) #random

dask.array.
empty_like
(a, dtype=None, chunks=None)¶ Return a new array with the same shape and type as a given array.
Parameters:  a : array_like
The shape and datatype of a define these same attributes of the returned array.
 dtype : datatype, optional
Overrides the data type of the result.
 chunks : sequence of ints
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  out : ndarray
Array of uninitialized (arbitrary) data with the same shape and type as a.
See also
ones_like
 Return an array of ones with shape and type of input.
zeros_like
 Return an array of zeros with shape and type of input.
empty
 Return a new uninitialized array.
ones
 Return a new array setting values to one.
zeros
 Return a new array setting values to zero.
Notes
This function does not initialize the returned array; to do that use zeros_like or ones_like instead. It may be marginally faster than the functions that do set the array values.

dask.array.
einsum
(subscripts, *operands, out=None, dtype=None, order='K', casting='safe')¶ Evaluates the Einstein summation convention on the operands.
Using the Einstein summation convention, many common multidimensional array operations can be represented in a simple fashion. This function provides a way to compute such summations. The best way to understand this function is to try the examples below, which show how many common NumPy functions can be implemented as calls to einsum.
Parameters:  subscripts : str
Specifies the subscripts for summation.
 operands : list of array_like
These are the arrays for the operation.
 out : ndarray, optional
If provided, the calculation is done into this array.
 dtype : datatype, optional
If provided, forces the calculation to use the data type specified. Note that you may have to also give a more liberal casting parameter to allow the conversions.
 order : {‘C’, ‘F’, ‘A’, ‘K’}, optional
Controls the memory layout of the output. ‘C’ means it should be C contiguous. ‘F’ means it should be Fortran contiguous, ‘A’ means it should be ‘F’ if the inputs are all ‘F’, ‘C’ otherwise. ‘K’ means it should be as close to the layout as the inputs as is possible, including arbitrarily permuted axes. Default is ‘K’.
 casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional
Controls what kind of data casting may occur. Setting this to ‘unsafe’ is not recommended, as it can adversely affect accumulations.
 ‘no’ means the data types should not be cast at all.
 ‘equiv’ means only byteorder changes are allowed.
 ‘safe’ means only casts which can preserve values are allowed.
 ‘same_kind’ means only safe casts or casts within a kind, like float64 to float32, are allowed.
 ‘unsafe’ means any data conversions may be done.
Returns:  output : ndarray
The calculation based on the Einstein summation convention.
Notes
New in version 1.6.0.
The subscripts string is a commaseparated list of subscript labels, where each label refers to a dimension of the corresponding operand. Repeated subscripts labels in one operand take the diagonal. For example,
np.einsum('ii', a)
is equivalent tonp.trace(a)
.Whenever a label is repeated, it is summed, so
np.einsum('i,i', a, b)
is equivalent tonp.inner(a,b)
. If a label appears only once, it is not summed, sonp.einsum('i', a)
produces a view ofa
with no changes.The order of labels in the output is by default alphabetical. This means that
np.einsum('ij', a)
doesn’t affect a 2D array, whilenp.einsum('ji', a)
takes its transpose.The output can be controlled by specifying output subscript labels as well. This specifies the label order, and allows summing to be disallowed or forced when desired. The call
np.einsum('i>', a)
is likenp.sum(a, axis=1)
, andnp.einsum('ii>i', a)
is likenp.diag(a)
. The difference is that einsum does not allow broadcasting by default.To enable and control broadcasting, use an ellipsis. Default NumPystyle broadcasting is done by adding an ellipsis to the left of each term, like
np.einsum('...ii>...i', a)
. To take the trace along the first and last axes, you can donp.einsum('i...i', a)
, or to do a matrixmatrix product with the leftmost indices instead of rightmost, you can donp.einsum('ij...,jk...>ik...', a, b)
.When there is only one operand, no axes are summed, and no output parameter is provided, a view into the operand is returned instead of a new array. Thus, taking the diagonal as
np.einsum('ii>i', a)
produces a view.An alternative way to provide the subscripts and operands is as
einsum(op0, sublist0, op1, sublist1, ..., [sublistout])
. The examples below have corresponding einsum calls with the two parameter methods.New in version 1.10.0.
Views returned from einsum are now writeable whenever the input array is writeable. For example,
np.einsum('ijk...>kji...', a)
will now have the same effect asnp.swapaxes(a, 0, 2)
andnp.einsum('ii>i', a)
will return a writeable view of the diagonal of a 2D array.Examples
>>> a = np.arange(25).reshape(5,5) >>> b = np.arange(5) >>> c = np.arange(6).reshape(2,3)
>>> np.einsum('ii', a) 60 >>> np.einsum(a, [0,0]) 60 >>> np.trace(a) 60
>>> np.einsum('ii>i', a) array([ 0, 6, 12, 18, 24]) >>> np.einsum(a, [0,0], [0]) array([ 0, 6, 12, 18, 24]) >>> np.diag(a) array([ 0, 6, 12, 18, 24])
>>> np.einsum('ij,j', a, b) array([ 30, 80, 130, 180, 230]) >>> np.einsum(a, [0,1], b, [1]) array([ 30, 80, 130, 180, 230]) >>> np.dot(a, b) array([ 30, 80, 130, 180, 230]) >>> np.einsum('...j,j', a, b) array([ 30, 80, 130, 180, 230])
>>> np.einsum('ji', c) array([[0, 3], [1, 4], [2, 5]]) >>> np.einsum(c, [1,0]) array([[0, 3], [1, 4], [2, 5]]) >>> c.T array([[0, 3], [1, 4], [2, 5]])
>>> np.einsum('..., ...', 3, c) array([[ 0, 3, 6], [ 9, 12, 15]]) >>> np.einsum(3, [Ellipsis], c, [Ellipsis]) array([[ 0, 3, 6], [ 9, 12, 15]]) >>> np.multiply(3, c) array([[ 0, 3, 6], [ 9, 12, 15]])
>>> np.einsum('i,i', b, b) 30 >>> np.einsum(b, [0], b, [0]) 30 >>> np.inner(b,b) 30
>>> np.einsum('i,j', np.arange(2)+1, b) array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]) >>> np.einsum(np.arange(2)+1, [0], b, [1]) array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]) >>> np.outer(np.arange(2)+1, b) array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]])
>>> np.einsum('i...>...', a) array([50, 55, 60, 65, 70]) >>> np.einsum(a, [0,Ellipsis], [Ellipsis]) array([50, 55, 60, 65, 70]) >>> np.sum(a, axis=0) array([50, 55, 60, 65, 70])
>>> a = np.arange(60.).reshape(3,4,5) >>> b = np.arange(24.).reshape(4,3,2) >>> np.einsum('ijk,jil>kl', a, b) array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]]) >>> np.einsum(a, [0,1,2], b, [1,0,3], [2,3]) array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]]) >>> np.tensordot(a,b, axes=([1,0],[0,1])) array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]])
>>> a = np.arange(6).reshape((3,2)) >>> b = np.arange(12).reshape((4,3)) >>> np.einsum('ki,jk>ij', a, b) array([[10, 28, 46, 64], [13, 40, 67, 94]]) >>> np.einsum('ki,...k>i...', a, b) array([[10, 28, 46, 64], [13, 40, 67, 94]]) >>> np.einsum('k...,jk', a, b) array([[10, 28, 46, 64], [13, 40, 67, 94]])
>>> # since version 1.10.0 >>> a = np.zeros((3, 3)) >>> np.einsum('ii>i', a)[:] = 1 >>> a array([[ 1., 0., 0.], [ 0., 1., 0.], [ 0., 0., 1.]])

dask.array.
exp
(x[, out])¶ Calculate the exponential of all elements in the input array.
Parameters:  x : array_like
Input values.
Returns:  out : ndarray
Output array, elementwise exponential of x.
See also
expm1
 Calculate
exp(x)  1
for all elements in the array. exp2
 Calculate
2**x
for all elements in the array.
Notes
The irrational number
e
is also known as Euler’s number. It is approximately 2.718281, and is the base of the natural logarithm,ln
(this means that, if \(x = \ln y = \log_e y\), then \(e^x = y\). For real input,exp(x)
is always positive.For complex arguments,
x = a + ib
, we can write \(e^x = e^a e^{ib}\). The first term, \(e^a\), is already known (it is the real argument, described above). The second term, \(e^{ib}\), is \(\cos b + i \sin b\), a function with magnitude 1 and a periodic phase.References
[1] Wikipedia, “Exponential function”, http://en.wikipedia.org/wiki/Exponential_function [2] M. Abramovitz and I. A. Stegun, “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables,” Dover, 1964, p. 69, http://www.math.sfu.ca/~cbm/aands/page_69.htm Examples
Plot the magnitude and phase of
exp(x)
in the complex plane:>>> import matplotlib.pyplot as plt
>>> x = np.linspace(2*np.pi, 2*np.pi, 100) >>> xx = x + 1j * x[:, np.newaxis] # a + ib over complex plane >>> out = np.exp(xx)
>>> plt.subplot(121) >>> plt.imshow(np.abs(out), ... extent=[2*np.pi, 2*np.pi, 2*np.pi, 2*np.pi]) >>> plt.title('Magnitude of exp(x)')
>>> plt.subplot(122) >>> plt.imshow(np.angle(out), ... extent=[2*np.pi, 2*np.pi, 2*np.pi, 2*np.pi]) >>> plt.title('Phase (angle) of exp(x)') >>> plt.show()

dask.array.
expm1
(x[, out])¶ Calculate
exp(x)  1
for all elements in the array.Parameters:  x : array_like
Input values.
Returns:  out : ndarray
Elementwise exponential minus one:
out = exp(x)  1
.
See also
log1p
log(1 + x)
, the inverse of expm1.
Notes
This function provides greater precision than
exp(x)  1
for small values ofx
.Examples
The true value of
exp(1e10)  1
is1.00000000005e10
to about 32 significant digits. This example shows the superiority of expm1 in this case.>>> np.expm1(1e10) 1.00000000005e10 >>> np.exp(1e10)  1 1.000000082740371e10

dask.array.
eye
(N, chunks, M=None, k=0, dtype=<class 'float'>)¶ Return a 2D Array with ones on the diagonal and zeros elsewhere.
Parameters:  N : int
Number of rows in the output.
 chunks: int
chunk size of resulting blocks
 M : int, optional
Number of columns in the output. If None, defaults to N.
 k : int, optional
Index of the diagonal: 0 (the default) refers to the main diagonal, a positive value refers to an upper diagonal, and a negative value to a lower diagonal.
 dtype : datatype, optional
Datatype of the returned array.
Returns:  I : Array of shape (N,M)
An array where all elements are equal to zero, except for the kth diagonal, whose values are equal to one.

dask.array.
fabs
(x[, out])¶ Compute the absolute values elementwise.
This function returns the absolute values (positive magnitude) of the data in x. Complex values are not handled, use absolute to find the absolute values of complex data.
Parameters:  x : array_like
The array of numbers for which the absolute values are required. If x is a scalar, the result y will also be a scalar.
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  y : ndarray or scalar
The absolute values of x, the returned values are always floats.
See also
absolute
 Absolute values including complex types.
Examples
>>> np.fabs(1) 1.0 >>> np.fabs([1.2, 1.2]) array([ 1.2, 1.2])

dask.array.
fix
(*args, **kwargs)¶ Round to nearest integer towards zero.
Round an array of floats elementwise to nearest integer towards zero. The rounded values are returned as floats.
Parameters:  x : array_like
An array of floats to be rounded
 y : ndarray, optional
Output array
Returns:  out : ndarray of floats
The array of rounded numbers
Examples
>>> np.fix(3.14) 3.0 >>> np.fix(3) 3.0 >>> np.fix([2.1, 2.9, 2.1, 2.9]) array([ 2., 2., 2., 2.])

dask.array.
flatnonzero
(a)¶ Return indices that are nonzero in the flattened version of a.
This is equivalent to a.ravel().nonzero()[0].
Parameters:  a : ndarray
Input array.
Returns:  res : ndarray
Output array, containing the indices of the elements of a.ravel() that are nonzero.
See also
Examples
>>> x = np.arange(2, 3) >>> x array([2, 1, 0, 1, 2]) >>> np.flatnonzero(x) array([0, 1, 3, 4])
Use the indices of the nonzero elements as an index array to extract these elements:
>>> x.ravel()[np.flatnonzero(x)] array([2, 1, 1, 2])

dask.array.
flip
(m, axis)¶ Reverse element order along axis.
Parameters:  axis : int
Axis to reverse element order of.
Returns:  reversed array : ndarray

dask.array.
flipud
(m)¶ Flip array in the up/down direction.
Flip the entries in each column in the up/down direction. Rows are preserved, but appear in a different order than before.
Parameters:  m : array_like
Input array.
Returns:  out : array_like
A view of m with the rows reversed. Since a view is returned, this operation is \(\mathcal O(1)\).
See also
fliplr
 Flip array in the left/right direction.
rot90
 Rotate array counterclockwise.
Notes
Equivalent to
A[::1,...]
. Does not require the array to be twodimensional.Examples
>>> A = np.diag([1.0, 2, 3]) >>> A array([[ 1., 0., 0.], [ 0., 2., 0.], [ 0., 0., 3.]]) >>> np.flipud(A) array([[ 0., 0., 3.], [ 0., 2., 0.], [ 1., 0., 0.]])
>>> A = np.random.randn(2,3,5) >>> np.all(np.flipud(A)==A[::1,...]) True
>>> np.flipud([1,2]) array([2, 1])

dask.array.
fliplr
(m)¶ Flip array in the left/right direction.
Flip the entries in each row in the left/right direction. Columns are preserved, but appear in a different order than before.
Parameters:  m : array_like
Input array, must be at least 2D.
Returns:  f : ndarray
A view of m with the columns reversed. Since a view is returned, this operation is \(\mathcal O(1)\).
See also
flipud
 Flip array in the up/down direction.
rot90
 Rotate array counterclockwise.
Notes
Equivalent to A[:,::1]. Requires the array to be at least 2D.
Examples
>>> A = np.diag([1.,2.,3.]) >>> A array([[ 1., 0., 0.], [ 0., 2., 0.], [ 0., 0., 3.]]) >>> np.fliplr(A) array([[ 0., 0., 1.], [ 0., 2., 0.], [ 3., 0., 0.]])
>>> A = np.random.randn(2,3,5) >>> np.all(np.fliplr(A)==A[:,::1,...]) True

dask.array.
floor
(x[, out])¶ Return the floor of the input, elementwise.
The floor of the scalar x is the largest integer i, such that i <= x. It is often denoted as \(\lfloor x \rfloor\).
Parameters:  x : array_like
Input data.
Returns:  y : ndarray or scalar
The floor of each element in x.
Notes
Some spreadsheet programs calculate the “floortowardszero”, in other words
floor(2.5) == 2
. NumPy instead uses the definition of floor where floor(2.5) == 3.Examples
>>> a = np.array([1.7, 1.5, 0.2, 0.2, 1.5, 1.7, 2.0]) >>> np.floor(a) array([2., 2., 1., 0., 1., 1., 2.])

dask.array.
fmax
(x1, x2[, out])¶ Elementwise maximum of array elements.
Compare two arrays and returns a new array containing the elementwise maxima. If one of the elements being compared is a NaN, then the nonnan element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are ignored when possible.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape.
Returns:  y : ndarray or scalar
The maximum of x1 and x2, elementwise. Returns scalar if both x1 and x2 are scalars.
See also
Notes
New in version 1.3.0.
The fmax is equivalent to
np.where(x1 >= x2, x1, x2)
when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.Examples
>>> np.fmax([2, 3, 4], [1, 5, 2]) array([ 2., 5., 4.])
>>> np.fmax(np.eye(2), [0.5, 2]) array([[ 1. , 2. ], [ 0.5, 2. ]])
>>> np.fmax([np.nan, 0, np.nan],[0, np.nan, np.nan]) array([ 0., 0., NaN])

dask.array.
fmin
(x1, x2[, out])¶ Elementwise minimum of array elements.
Compare two arrays and returns a new array containing the elementwise minima. If one of the elements being compared is a NaN, then the nonnan element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are ignored when possible.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape.
Returns:  y : ndarray or scalar
The minimum of x1 and x2, elementwise. Returns scalar if both x1 and x2 are scalars.
See also
Notes
New in version 1.3.0.
The fmin is equivalent to
np.where(x1 <= x2, x1, x2)
when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.Examples
>>> np.fmin([2, 3, 4], [1, 5, 2]) array([2, 5, 4])
>>> np.fmin(np.eye(2), [0.5, 2]) array([[ 1. , 2. ], [ 0.5, 2. ]])
>>> np.fmin([np.nan, 0, np.nan],[0, np.nan, np.nan]) array([ 0., 0., NaN])

dask.array.
fmod
(x1, x2[, out])¶ Return the elementwise remainder of division.
This is the NumPy implementation of the C library function fmod, the remainder has the same sign as the dividend x1. It is equivalent to the Matlab(TM)
rem
function and should not be confused with the Python modulus operatorx1 % x2
.Parameters:  x1 : array_like
Dividend.
 x2 : array_like
Divisor.
Returns:  y : array_like
The remainder of the division of x1 by x2.
See also
remainder
 Equivalent to the Python
%
operator.
divide
Notes
The result of the modulo operation for negative dividend and divisors is bound by conventions. For fmod, the sign of result is the sign of the dividend, while for remainder the sign of the result is the sign of the divisor. The fmod function is equivalent to the Matlab(TM)
rem
function.Examples
>>> np.fmod([3, 2, 1, 1, 2, 3], 2) array([1, 0, 1, 1, 0, 1]) >>> np.remainder([3, 2, 1, 1, 2, 3], 2) array([1, 0, 1, 1, 0, 1])
>>> np.fmod([5, 3], [2, 2.]) array([ 1., 1.]) >>> a = np.arange(3, 3).reshape(3, 2) >>> a array([[3, 2], [1, 0], [ 1, 2]]) >>> np.fmod(a, [2,2]) array([[1, 0], [1, 0], [ 1, 0]])

dask.array.
frexp
(x[, out1, out2])¶ Decompose the elements of x into mantissa and twos exponent.
Returns (mantissa, exponent), where x = mantissa * 2**exponent`. The mantissa is lies in the open interval(1, 1), while the twos exponent is a signed integer.
Parameters:  x : array_like
Array of numbers to be decomposed.
 out1 : ndarray, optional
Output array for the mantissa. Must have the same shape as x.
 out2 : ndarray, optional
Output array for the exponent. Must have the same shape as x.
Returns:  (mantissa, exponent) : tuple of ndarrays, (float, int)
mantissa is a float array with values between 1 and 1. exponent is an int array which represents the exponent of 2.
See also
ldexp
 Compute
y = x1 * 2**x2
, the inverse of frexp.
Notes
Complex dtypes are not supported, they will raise a TypeError.
Examples
>>> x = np.arange(9) >>> y1, y2 = np.frexp(x) >>> y1 array([ 0. , 0.5 , 0.5 , 0.75 , 0.5 , 0.625, 0.75 , 0.875, 0.5 ]) >>> y2 array([0, 1, 2, 2, 3, 3, 3, 3, 4]) >>> y1 * 2**y2 array([ 0., 1., 2., 3., 4., 5., 6., 7., 8.])

dask.array.
fromfunction
(function, shape, **kwargs)¶ Construct an array by executing a function over each coordinate.
The resulting array therefore has a value
fn(x, y, z)
at coordinate(x, y, z)
.Parameters:  function : callable
The function is called with N parameters, where N is the rank of shape. Each parameter represents the coordinates of the array varying along a specific axis. For example, if shape were
(2, 2)
, then the parameters in turn be (0, 0), (0, 1), (1, 0), (1, 1). shape : (N,) tuple of ints
Shape of the output array, which also determines the shape of the coordinate arrays passed to function.
 dtype : datatype, optional
Datatype of the coordinate arrays passed to function. By default, dtype is float.
Returns:  fromfunction : any
The result of the call to function is passed back directly. Therefore the shape of fromfunction is completely determined by function. If function returns a scalar value, the shape of fromfunction would match the shape parameter.
Notes
Keywords other than dtype are passed to function.
Examples
>>> np.fromfunction(lambda i, j: i == j, (3, 3), dtype=int) array([[ True, False, False], [False, True, False], [False, False, True]], dtype=bool)
>>> np.fromfunction(lambda i, j: i + j, (3, 3), dtype=int) array([[0, 1, 2], [1, 2, 3], [2, 3, 4]])

dask.array.
frompyfunc
(func, nin, nout)¶ Takes an arbitrary Python function and returns a Numpy ufunc.
Can be used, for example, to add broadcasting to a builtin Python function (see Examples section).
Parameters:  func : Python function object
An arbitrary Python function.
 nin : int
The number of input arguments.
 nout : int
The number of objects returned by func.
Returns:  out : ufunc
Returns a Numpy universal function (
ufunc
) object.
Notes
The returned ufunc always returns PyObject arrays.
Examples
Use frompyfunc to add broadcasting to the Python function
oct
:>>> oct_array = np.frompyfunc(oct, 1, 1) >>> oct_array(np.array((10, 30, 100))) array([012, 036, 0144], dtype=object) >>> np.array((oct(10), oct(30), oct(100))) # for comparison array(['012', '036', '0144'], dtype='S4')

dask.array.
full
(*args, **kwargs)¶ Blocked variant of full
Follows the signature of full exactly except that it also requires a keyword argument chunks=(…)
Original signature follows below.
Return a new array of given shape and type, filled with fill_value.
Parameters:  shape : int or sequence of ints
Shape of the new array, e.g.,
(2, 3)
or2
. fill_value : scalar
Fill value.
 dtype : datatype, optional
The desired datatype for the array, e.g., np.int8. Default is float, but will change to np.array(fill_value).dtype in a future release.
 order : {‘C’, ‘F’}, optional
Whether to store multidimensional data in C or Fortrancontiguous (row or columnwise) order in memory.
Returns:  out : ndarray
Array of fill_value with the given shape, dtype, and order.
See also
zeros_like
 Return an array of zeros with shape and type of input.
ones_like
 Return an array of ones with shape and type of input.
empty_like
 Return an empty array with shape and type of input.
full_like
 Fill an array with shape and type of input.
zeros
 Return a new array setting values to zero.
ones
 Return a new array setting values to one.
empty
 Return a new uninitialized array.
Examples
>>> np.full((2, 2), np.inf) array([[ inf, inf], [ inf, inf]]) >>> np.full((2, 2), 10, dtype=np.int) array([[10, 10], [10, 10]])

dask.array.
full_like
(a, fill_value, dtype=None, chunks=None)¶ Return a full array with the same shape and type as a given array.
Parameters:  a : array_like
The shape and datatype of a define these same attributes of the returned array.
 fill_value : scalar
Fill value.
 dtype : datatype, optional
Overrides the data type of the result.
 chunks : sequence of ints
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  out : ndarray
Array of fill_value with the same shape and type as a.
See also
zeros_like
 Return an array of zeros with shape and type of input.
ones_like
 Return an array of ones with shape and type of input.
empty_like
 Return an empty array with shape and type of input.
zeros
 Return a new array setting values to zero.
ones
 Return a new array setting values to one.
empty
 Return a new uninitialized array.
full
 Fill a new array.

dask.array.
gradient
(f, *varargs, **kwargs)¶ Return the gradient of an Ndimensional array.
The gradient is computed using second order accurate central differences in the interior and either first differences or second order accurate onesides (forward or backwards) differences at the boundaries. The returned gradient hence has the same shape as the input array.
Parameters:  f : array_like
An Ndimensional array containing samples of a scalar function.
 varargs : scalar or list of scalar, optional
N scalars specifying the sample distances for each dimension, i.e. dx, dy, dz, … Default distance: 1. single scalar specifies sample distance for all dimensions. if axis is given, the number of varargs must equal the number of axes.
 edge_order : {1, 2}, optional
Gradient is calculated using N^{th} order accurate differences at the boundaries. Default: 1.
New in version 1.9.1.
 axis : None or int or tuple of ints, optional
Gradient is calculated only along the given axis or axes The default (axis = None) is to calculate the gradient for all the axes of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.11.0.
Returns:  gradient : list of ndarray
Each element of list has the same shape as f giving the derivative of f with respect to each dimension.
Examples
>>> x = np.array([1, 2, 4, 7, 11, 16], dtype=np.float) >>> np.gradient(x) array([ 1. , 1.5, 2.5, 3.5, 4.5, 5. ]) >>> np.gradient(x, 2) array([ 0.5 , 0.75, 1.25, 1.75, 2.25, 2.5 ])
For two dimensional arrays, the return will be two arrays ordered by axis. In this example the first array stands for the gradient in rows and the second one in columns direction:
>>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=np.float)) [array([[ 2., 2., 1.], [ 2., 2., 1.]]), array([[ 1. , 2.5, 4. ], [ 1. , 1. , 1. ]])]
>>> x = np.array([0, 1, 2, 3, 4]) >>> dx = np.gradient(x) >>> y = x**2 >>> np.gradient(y, dx, edge_order=2) array([0., 2., 4., 6., 8.])
The axis keyword can be used to specify a subset of axes of which the gradient is calculated >>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=np.float), axis=0) array([[ 2., 2., 1.],
[ 2., 2., 1.]])

dask.array.
histogram
(a, bins=None, range=None, normed=False, weights=None, density=None)¶ Blocked variant of
numpy.histogram()
.Follows the signature of
numpy.histogram()
exactly with the following exceptions: Either an iterable specifying the
bins
or the number ofbins
and arange
argument is required as computingmin
andmax
over blocked arrays is an expensive operation that must be performed explicitly. weights
must be a dask.array.Array with the same block structure asa
.
Examples
Using number of bins and range:
>>> import dask.array as da >>> import numpy as np >>> x = da.from_array(np.arange(10000), chunks=10) >>> h, bins = da.histogram(x, bins=10, range=[0, 10000]) >>> bins array([ 0., 1000., 2000., 3000., 4000., 5000., 6000., 7000., 8000., 9000., 10000.]) >>> h.compute() array([1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
Explicitly specifying the bins:
>>> h, bins = da.histogram(x, bins=np.array([0, 5000, 10000])) >>> bins array([ 0, 5000, 10000]) >>> h.compute() array([5000, 5000])
 Either an iterable specifying the

dask.array.
hstack
(tup)¶ Stack arrays in sequence horizontally (column wise).
Take a sequence of arrays and stack them horizontally to make a single array. Rebuild arrays divided by hsplit.
Parameters:  tup : sequence of ndarrays
All arrays must have the same shape along all but the second axis.
Returns:  stacked : ndarray
The array formed by stacking the given arrays.
See also
stack
 Join a sequence of arrays along a new axis.
vstack
 Stack arrays in sequence vertically (row wise).
dstack
 Stack arrays in sequence depth wise (along third axis).
concatenate
 Join a sequence of arrays along an existing axis.
hsplit
 Split array along second axis.
Notes
Equivalent to
np.concatenate(tup, axis=1)
Examples
>>> a = np.array((1,2,3)) >>> b = np.array((2,3,4)) >>> np.hstack((a,b)) array([1, 2, 3, 2, 3, 4]) >>> a = np.array([[1],[2],[3]]) >>> b = np.array([[2],[3],[4]]) >>> np.hstack((a,b)) array([[1, 2], [2, 3], [3, 4]])

dask.array.
hypot
(x1, x2[, out])¶ Given the “legs” of a right triangle, return its hypotenuse.
Equivalent to
sqrt(x1**2 + x2**2)
, elementwise. If x1 or x2 is scalar_like (i.e., unambiguously castable to a scalar type), it is broadcast for use with each element of the other argument. (See Examples)Parameters:  x1, x2 : array_like
Leg of the triangle(s).
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  z : ndarray
The hypotenuse of the triangle(s).
Examples
>>> np.hypot(3*np.ones((3, 3)), 4*np.ones((3, 3))) array([[ 5., 5., 5.], [ 5., 5., 5.], [ 5., 5., 5.]])
Example showing broadcast of scalar_like argument:
>>> np.hypot(3*np.ones((3, 3)), [4]) array([[ 5., 5., 5.], [ 5., 5., 5.], [ 5., 5., 5.]])

dask.array.
imag
(*args, **kwargs)¶ Return the imaginary part of the elements of the array.
Parameters:  val : array_like
Input array.
Returns:  out : ndarray
Output array. If val is real, the type of val is used for the output. If val has complex elements, the returned type is float.
Examples
>>> a = np.array([1+2j, 3+4j, 5+6j]) >>> a.imag array([ 2., 4., 6.]) >>> a.imag = np.array([8, 10, 12]) >>> a array([ 1. +8.j, 3.+10.j, 5.+12.j])

dask.array.
indices
(dimensions, dtype=<class 'int'>, chunks=None)¶ Implements NumPy’s
indices
for Dask Arrays.Generates a grid of indices covering the dimensions provided.
The final array has the shape
(len(dimensions), *dimensions)
. The chunks are used to specify the chunking for axis 1 up tolen(dimensions)
. The 0th axis always has chunks of length 1.Parameters:  dimensions : sequence of ints
The shape of the index grid.
 dtype : dtype, optional
Type to use for the array. Default is
int
. chunks : sequence of ints
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  grid : dask array

dask.array.
insert
(arr, obj, values, axis=None)¶ Insert values along the given axis before the given indices.
Parameters:  arr : array_like
Input array.
 obj : int, slice or sequence of ints
Object that defines the index or indices before which values is inserted.
New in version 1.8.0.
Support for multiple insertions when obj is a single scalar or a sequence with one element (similar to calling insert multiple times).
 values : array_like
Values to insert into arr. If the type of values is different from that of arr, values is converted to the type of arr. values should be shaped so that
arr[...,obj,...] = values
is legal. axis : int, optional
Axis along which to insert values. If axis is None then arr is flattened first.
Returns:  out : ndarray
A copy of arr with values inserted. Note that insert does not occur inplace: a new array is returned. If axis is None, out is a flattened array.
See also
append
 Append elements at the end of an array.
concatenate
 Join a sequence of arrays along an existing axis.
delete
 Delete elements from an array.
Notes
Note that for higher dimensional inserts obj=0 behaves very different from obj=[0] just like arr[:,0,:] = values is different from arr[:,[0],:] = values.
Examples
>>> a = np.array([[1, 1], [2, 2], [3, 3]]) >>> a array([[1, 1], [2, 2], [3, 3]]) >>> np.insert(a, 1, 5) array([1, 5, 1, 2, 2, 3, 3]) >>> np.insert(a, 1, 5, axis=1) array([[1, 5, 1], [2, 5, 2], [3, 5, 3]])
Difference between sequence and scalars:
>>> np.insert(a, [1], [[1],[2],[3]], axis=1) array([[1, 1, 1], [2, 2, 2], [3, 3, 3]]) >>> np.array_equal(np.insert(a, 1, [1, 2, 3], axis=1), ... np.insert(a, [1], [[1],[2],[3]], axis=1)) True
>>> b = a.flatten() >>> b array([1, 1, 2, 2, 3, 3]) >>> np.insert(b, [2, 2], [5, 6]) array([1, 1, 5, 6, 2, 2, 3, 3])
>>> np.insert(b, slice(2, 4), [5, 6]) array([1, 1, 5, 2, 6, 2, 3, 3])
>>> np.insert(b, [2, 2], [7.13, False]) # type casting array([1, 1, 7, 0, 2, 2, 3, 3])
>>> x = np.arange(8).reshape(2, 4) >>> idx = (1, 3) >>> np.insert(x, idx, 999, axis=1) array([[ 0, 999, 1, 2, 999, 3], [ 4, 999, 5, 6, 999, 7]])

dask.array.
isclose
(a, b, rtol=1e05, atol=1e08, equal_nan=False)¶ Returns a boolean array where two arrays are elementwise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.
Parameters:  a, b : array_like
Input arrays to compare.
 rtol : float
The relative tolerance parameter (see Notes).
 atol : float
The absolute tolerance parameter (see Notes).
 equal_nan : bool
Whether to compare NaN’s as equal. If True, NaN’s in a will be considered equal to NaN’s in b in the output array.
Returns:  y : array_like
Returns a boolean array of where a and b are equal within the given tolerance. If both a and b are scalars, returns a single boolean value.
See also
Notes
New in version 1.7.0.
For finite values, isclose uses the following equation to test whether two floating point values are equivalent.
absolute(a  b) <= (atol + rtol * absolute(b))The above equation is not symmetric in a and b, so that isclose(a, b) might be different from isclose(b, a) in some rare cases.
Examples
>>> np.isclose([1e10,1e7], [1.00001e10,1e8]) array([True, False]) >>> np.isclose([1e10,1e8], [1.00001e10,1e9]) array([True, True]) >>> np.isclose([1e10,1e8], [1.0001e10,1e9]) array([False, True]) >>> np.isclose([1.0, np.nan], [1.0, np.nan]) array([True, False]) >>> np.isclose([1.0, np.nan], [1.0, np.nan], equal_nan=True) array([True, True])

dask.array.
iscomplex
(*args, **kwargs)¶ Returns a bool array, where True if input element is complex.
What is tested is whether the input has a nonzero imaginary part, not if the input type is complex.
Parameters:  x : array_like
Input array.
Returns:  out : ndarray of bools
Output array.
Examples
>>> np.iscomplex([1+1j, 1+0j, 4.5, 3, 2, 2j]) array([ True, False, False, False, False, True], dtype=bool)

dask.array.
isfinite
(x[, out])¶ Test elementwise for finiteness (not infinity or not Not a Number).
The result is returned as a boolean array.
Parameters:  x : array_like
Input values.
 out : ndarray, optional
Array into which the output is placed. Its type is preserved and it must be of the right shape to hold the output. See doc.ufuncs.
Returns:  y : ndarray, bool
For scalar input, the result is a new boolean with value True if the input is finite; otherwise the value is False (input is either positive infinity, negative infinity or Not a Number).
For array input, the result is a boolean array with the same dimensions as the input and the values are True if the corresponding element of the input is finite; otherwise the values are False (element is either positive infinity, negative infinity or Not a Number).
Notes
Not a Number, positive infinity and negative infinity are considered to be nonfinite.
Numpy uses the IEEE Standard for Binary FloatingPoint for Arithmetic (IEEE 754). This means that Not a Number is not equivalent to infinity. Also that positive infinity is not equivalent to negative infinity. But infinity is equivalent to positive infinity. Errors result if the second argument is also supplied when x is a scalar input, or if first and second arguments have different shapes.
Examples
>>> np.isfinite(1) True >>> np.isfinite(0) True >>> np.isfinite(np.nan) False >>> np.isfinite(np.inf) False >>> np.isfinite(np.NINF) False >>> np.isfinite([np.log(1.),1.,np.log(0)]) array([False, True, False], dtype=bool)
>>> x = np.array([np.inf, 0., np.inf]) >>> y = np.array([2, 2, 2]) >>> np.isfinite(x, y) array([0, 1, 0]) >>> y array([0, 1, 0])

dask.array.
isin
(element, test_elements, assume_unique=False, invert=False)¶

dask.array.
isinf
(x[, out])¶ Test elementwise for positive or negative infinity.
Returns a boolean array of the same shape as x, True where
x == +/inf
, otherwise False.Parameters:  x : array_like
Input values
 out : array_like, optional
An array with the same shape as x to store the result.
Returns:  y : bool (scalar) or boolean ndarray
For scalar input, the result is a new boolean with value True if the input is positive or negative infinity; otherwise the value is False.
For array input, the result is a boolean array with the same shape as the input and the values are True where the corresponding element of the input is positive or negative infinity; elsewhere the values are False. If a second argument was supplied the result is stored there. If the type of that array is a numeric type the result is represented as zeros and ones, if the type is boolean then as False and True, respectively. The return value y is then a reference to that array.
Notes
Numpy uses the IEEE Standard for Binary FloatingPoint for Arithmetic (IEEE 754).
Errors result if the second argument is supplied when the first argument is a scalar, or if the first and second arguments have different shapes.
Examples
>>> np.isinf(np.inf) True >>> np.isinf(np.nan) False >>> np.isinf(np.NINF) True >>> np.isinf([np.inf, np.inf, 1.0, np.nan]) array([ True, True, False, False], dtype=bool)
>>> x = np.array([np.inf, 0., np.inf]) >>> y = np.array([2, 2, 2]) >>> np.isinf(x, y) array([1, 0, 1]) >>> y array([1, 0, 1])

dask.array.
isnan
(x[, out])¶ Test elementwise for NaN and return result as a boolean array.
Parameters:  x : array_like
Input array.
Returns:  y : ndarray or bool
For scalar input, the result is a new boolean with value True if the input is NaN; otherwise the value is False.
For array input, the result is a boolean array of the same dimensions as the input and the values are True if the corresponding element of the input is NaN; otherwise the values are False.
Notes
Numpy uses the IEEE Standard for Binary FloatingPoint for Arithmetic (IEEE 754). This means that Not a Number is not equivalent to infinity.
Examples
>>> np.isnan(np.nan) True >>> np.isnan(np.inf) False >>> np.isnan([np.log(1.),1.,np.log(0)]) array([ True, False, False], dtype=bool)

dask.array.
isnull
(values)¶ pandas.isnull for dask arrays

dask.array.
isreal
(*args, **kwargs)¶ Returns a bool array, where True if input element is real.
If element has complex type with zero complex part, the return value for that element is True.
Parameters:  x : array_like
Input array.
Returns:  out : ndarray, bool
Boolean array of same shape as x.
Examples
>>> np.isreal([1+1j, 1+0j, 4.5, 3, 2, 2j]) array([False, True, True, True, True, False], dtype=bool)

dask.array.
ldexp
(x1, x2[, out])¶ Returns x1 * 2**x2, elementwise.
The mantissas x1 and twos exponents x2 are used to construct floating point numbers
x1 * 2**x2
.Parameters:  x1 : array_like
Array of multipliers.
 x2 : array_like, int
Array of twos exponents.
 out : ndarray, optional
Output array for the result.
Returns:  y : ndarray or scalar
The result of
x1 * 2**x2
.
See also
frexp
 Return (y1, y2) from
x = y1 * 2**y2
, inverse to ldexp.
Notes
Complex dtypes are not supported, they will raise a TypeError.
ldexp is useful as the inverse of frexp, if used by itself it is more clear to simply use the expression
x1 * 2**x2
.Examples
>>> np.ldexp(5, np.arange(4)) array([ 5., 10., 20., 40.], dtype=float32)
>>> x = np.arange(6) >>> np.ldexp(*np.frexp(x)) array([ 0., 1., 2., 3., 4., 5.])

dask.array.
linspace
(start, stop, num=50, chunks=None, dtype=None)¶ Return num evenly spaced values over the closed interval [start, stop].
TODO: implement the endpoint, restep, and dtype keyword args
Parameters:  start : scalar
The starting value of the sequence.
 stop : scalar
The last value of the sequence.
 num : int, optional
Number of samples to include in the returned dask array, including the endpoints.
 chunks : int
The number of samples on each block. Note that the last block will have fewer samples if num % blocksize != 0
Returns:  samples : dask array
See also

dask.array.
log
(x[, out])¶ Natural logarithm, elementwise.
The natural logarithm log is the inverse of the exponential function, so that log(exp(x)) = x. The natural logarithm is logarithm in base e.
Parameters:  x : array_like
Input value.
Returns:  y : ndarray
The natural logarithm of x, elementwise.
Notes
Logarithm is a multivalued function: for each x there is an infinite number of z such that exp(z) = x. The convention is to return the z whose imaginary part lies in [pi, pi].
For realvalued input data types, log always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, log is a complex analytical function that has a branch cut [inf, 0] and is continuous from above on it. log handles the floatingpoint negative zero as an infinitesimal negative number, conforming to the C99 standard.
References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Logarithm”. http://en.wikipedia.org/wiki/Logarithm Examples
>>> np.log([1, np.e, np.e**2, 0]) array([ 0., 1., 2., Inf])

dask.array.
log10
(x[, out])¶ Return the base 10 logarithm of the input array, elementwise.
Parameters:  x : array_like
Input values.
Returns:  y : ndarray
The logarithm to the base 10 of x, elementwise. NaNs are returned where x is negative.
See also
emath.log10
Notes
Logarithm is a multivalued function: for each x there is an infinite number of z such that 10**z = x. The convention is to return the z whose imaginary part lies in [pi, pi].
For realvalued input data types, log10 always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, log10 is a complex analytical function that has a branch cut [inf, 0] and is continuous from above on it. log10 handles the floatingpoint negative zero as an infinitesimal negative number, conforming to the C99 standard.
References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Logarithm”. http://en.wikipedia.org/wiki/Logarithm Examples
>>> np.log10([1e15, 3.]) array([15., NaN])

dask.array.
log1p
(x[, out])¶ Return the natural logarithm of one plus the input array, elementwise.
Calculates
log(1 + x)
.Parameters:  x : array_like
Input values.
Returns:  y : ndarray
Natural logarithm of 1 + x, elementwise.
See also
expm1
exp(x)  1
, the inverse of log1p.
Notes
For realvalued input, log1p is accurate also for x so small that 1 + x == 1 in floatingpoint accuracy.
Logarithm is a multivalued function: for each x there is an infinite number of z such that exp(z) = 1 + x. The convention is to return the z whose imaginary part lies in [pi, pi].
For realvalued input data types, log1p always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, log1p is a complex analytical function that has a branch cut [inf, 1] and is continuous from above on it. log1p handles the floatingpoint negative zero as an infinitesimal negative number, conforming to the C99 standard.
References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 67. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Logarithm”. http://en.wikipedia.org/wiki/Logarithm Examples
>>> np.log1p(1e99) 1e99 >>> np.log(1 + 1e99) 0.0

dask.array.
log2
(x[, out])¶ Base2 logarithm of x.
Parameters:  x : array_like
Input values.
Returns:  y : ndarray
Base2 logarithm of x.
Notes
New in version 1.3.0.
Logarithm is a multivalued function: for each x there is an infinite number of z such that 2**z = x. The convention is to return the z whose imaginary part lies in [pi, pi].
For realvalued input data types, log2 always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, log2 is a complex analytical function that has a branch cut [inf, 0] and is continuous from above on it. log2 handles the floatingpoint negative zero as an infinitesimal negative number, conforming to the C99 standard.
Examples
>>> x = np.array([0, 1, 2, 2**4]) >>> np.log2(x) array([Inf, 0., 1., 4.])
>>> xi = np.array([0+1.j, 1, 2+0.j, 4.j]) >>> np.log2(xi) array([ 0.+2.26618007j, 0.+0.j , 1.+0.j , 2.+2.26618007j])

dask.array.
logaddexp
(x1, x2[, out])¶ Logarithm of the sum of exponentiations of the inputs.
Calculates
log(exp(x1) + exp(x2))
. This function is useful in statistics where the calculated probabilities of events may be so small as to exceed the range of normal floating point numbers. In such cases the logarithm of the calculated probability is stored. This function allows adding probabilities stored in such a fashion.Parameters:  x1, x2 : array_like
Input values.
Returns:  result : ndarray
Logarithm of
exp(x1) + exp(x2)
.
See also
logaddexp2
 Logarithm of the sum of exponentiations of inputs in base 2.
Notes
New in version 1.3.0.
Examples
>>> prob1 = np.log(1e50) >>> prob2 = np.log(2.5e50) >>> prob12 = np.logaddexp(prob1, prob2) >>> prob12 113.87649168120691 >>> np.exp(prob12) 3.5000000000000057e50

dask.array.
logaddexp2
(x1, x2[, out])¶ Logarithm of the sum of exponentiations of the inputs in base2.
Calculates
log2(2**x1 + 2**x2)
. This function is useful in machine learning when the calculated probabilities of events may be so small as to exceed the range of normal floating point numbers. In such cases the base2 logarithm of the calculated probability can be used instead. This function allows adding probabilities stored in such a fashion.Parameters:  x1, x2 : array_like
Input values.
 out : ndarray, optional
Array to store results in.
Returns:  result : ndarray
Base2 logarithm of
2**x1 + 2**x2
.
See also
logaddexp
 Logarithm of the sum of exponentiations of the inputs.
Notes
New in version 1.3.0.
Examples
>>> prob1 = np.log2(1e50) >>> prob2 = np.log2(2.5e50) >>> prob12 = np.logaddexp2(prob1, prob2) >>> prob1, prob2, prob12 (166.09640474436813, 164.77447664948076, 164.28904982231052) >>> 2**prob12 3.4999999999999914e50

dask.array.
logical_and
(x1, x2[, out])¶ Compute the truth value of x1 AND x2 elementwise.
Parameters:  x1, x2 : array_like
Input arrays. x1 and x2 must be of the same shape.
Returns:  y : ndarray or bool
Boolean result with the same shape as x1 and x2 of the logical AND operation on corresponding elements of x1 and x2.
See also
Examples
>>> np.logical_and(True, False) False >>> np.logical_and([True, False], [False, False]) array([False, False], dtype=bool)
>>> x = np.arange(5) >>> np.logical_and(x>1, x<4) array([False, False, True, True, False], dtype=bool)

dask.array.
logical_not
(x[, out])¶ Compute the truth value of NOT x elementwise.
Parameters:  x : array_like
Logical NOT is applied to the elements of x.
Returns:  y : bool or ndarray of bool
Boolean result with the same shape as x of the NOT operation on elements of x.
See also
Examples
>>> np.logical_not(3) False >>> np.logical_not([True, False, 0, 1]) array([False, True, True, False], dtype=bool)
>>> x = np.arange(5) >>> np.logical_not(x<3) array([False, False, False, True, True], dtype=bool)

dask.array.
logical_or
(x1, x2[, out])¶ Compute the truth value of x1 OR x2 elementwise.
Parameters:  x1, x2 : array_like
Logical OR is applied to the elements of x1 and x2. They have to be of the same shape.
Returns:  y : ndarray or bool
Boolean result with the same shape as x1 and x2 of the logical OR operation on elements of x1 and x2.
See also
Examples
>>> np.logical_or(True, False) True >>> np.logical_or([True, False], [False, False]) array([ True, False], dtype=bool)
>>> x = np.arange(5) >>> np.logical_or(x < 1, x > 3) array([ True, False, False, False, True], dtype=bool)

dask.array.
logical_xor
(x1, x2[, out])¶ Compute the truth value of x1 XOR x2, elementwise.
Parameters:  x1, x2 : array_like
Logical XOR is applied to the elements of x1 and x2. They must be broadcastable to the same shape.
Returns:  y : bool or ndarray of bool
Boolean result of the logical XOR operation applied to the elements of x1 and x2; the shape is determined by whether or not broadcasting of one or both arrays was required.
See also
Examples
>>> np.logical_xor(True, False) True >>> np.logical_xor([True, True, False, False], [True, False, True, False]) array([False, True, True, False], dtype=bool)
>>> x = np.arange(5) >>> np.logical_xor(x < 1, x > 3) array([ True, False, False, False, True], dtype=bool)
Simple example showing support of broadcasting
>>> np.logical_xor(0, np.eye(2)) array([[ True, False], [False, True]], dtype=bool)

dask.array.
matmul
(a, b, out=None)¶ Matrix product of two arrays.
The behavior depends on the arguments in the following way.
 If both arguments are 2D they are multiplied like conventional matrices.
 If either argument is ND, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
 If the first argument is 1D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
 If the second argument is 1D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.
Multiplication by a scalar is not allowed, use
*
instead. Note that multiplying a stack of matrices with a vector will result in a stack of vectors, but matmul will not recognize it as such.matmul
differs fromdot
in two important ways. Multiplication by scalars is not allowed.
 Stacks of matrices are broadcast together as if the matrices were elements.
Warning
This function is preliminary and included in Numpy 1.10 for testing and documentation. Its semantics will not change, but the number and order of the optional arguments will.
New in version 1.10.0.
Parameters:  a : array_like
First argument.
 b : array_like
Second argument.
 out : ndarray, optional
Output argument. This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be Ccontiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
Returns:  output : ndarray
Returns the dot product of a and b. If a and b are both 1D arrays then a scalar is returned; otherwise an array is returned. If out is given, then it is returned.
Raises:  ValueError
If the last dimension of a is not the same size as the secondtolast dimension of b.
If scalar value is passed.
See also
Notes
The matmul function implements the semantics of the @ operator introduced in Python 3.5 following PEP465.
Examples
For 2D arrays it is the matrix product:
>>> a = [[1, 0], [0, 1]] >>> b = [[4, 1], [2, 2]] >>> np.matmul(a, b) array([[4, 1], [2, 2]])
For 2D mixed with 1D, the result is the usual.
>>> a = [[1, 0], [0, 1]] >>> b = [1, 2] >>> np.matmul(a, b) array([1, 2]) >>> np.matmul(b, a) array([1, 2])
Broadcasting is conventional for stacks of arrays
>>> a = np.arange(2*2*4).reshape((2,2,4)) >>> b = np.arange(2*2*4).reshape((2,4,2)) >>> np.matmul(a,b).shape (2, 2, 2) >>> np.matmul(a,b)[0,1,1] 98 >>> sum(a[0,1,:] * b[0,:,1]) 98
Vector, vector returns the scalar inner product, but neither argument is complexconjugated:
>>> np.matmul([2j, 3j], [2j, 3j]) (13+0j)
Scalar multiplication raises an error.
>>> np.matmul([1,2], 3) Traceback (most recent call last): ... ValueError: Scalar operands are not allowed, use '*' instead

dask.array.
max
(a, axis=None, out=None, keepdims=False)¶ Return the maximum of an array or maximum along an axis.
Parameters:  a : array_like
Input data.
 axis : None or int or tuple of ints, optional
Axis or axes along which to operate. By default, flattened input is used.
If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. See doc.ufuncs (Section “Output arguments”) for more details.
 keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  amax : ndarray or scalar
Maximum of a. If axis is None, the result is a scalar value. If axis is given, the result is an array of dimension
a.ndim  1
.
See also
amin
 The minimum value of an array along a given axis, propagating any NaNs.
nanmax
 The maximum value of an array along a given axis, ignoring any NaNs.
maximum
 Elementwise maximum of two arrays, propagating any NaNs.
fmax
 Elementwise maximum of two arrays, ignoring any NaNs.
argmax
 Return the indices of the maximum values.
Notes
NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.
Don’t use amax for elementwise comparison of 2 arrays; when
a.shape[0]
is 2,maximum(a[0], a[1])
is faster thanamax(a, axis=0)
.Examples
>>> a = np.arange(4).reshape((2,2)) >>> a array([[0, 1], [2, 3]]) >>> np.amax(a) # Maximum of the flattened array 3 >>> np.amax(a, axis=0) # Maxima along the first axis array([2, 3]) >>> np.amax(a, axis=1) # Maxima along the second axis array([1, 3])
>>> b = np.arange(5, dtype=np.float) >>> b[2] = np.NaN >>> np.amax(b) nan >>> np.nanmax(b) 4.0

dask.array.
maximum
(x1, x2[, out])¶ Elementwise maximum of array elements.
Compare two arrays and returns a new array containing the elementwise maxima. If one of the elements being compared is a NaN, then that element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are propagated.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape, or shapes that can be broadcast to a single shape.
Returns:  y : ndarray or scalar
The maximum of x1 and x2, elementwise. Returns scalar if both x1 and x2 are scalars.
See also
Notes
The maximum is equivalent to
np.where(x1 >= x2, x1, x2)
when neither x1 nor x2 are nans, but it is faster and does proper broadcasting.Examples
>>> np.maximum([2, 3, 4], [1, 5, 2]) array([2, 5, 4])
>>> np.maximum(np.eye(2), [0.5, 2]) # broadcasting array([[ 1. , 2. ], [ 0.5, 2. ]])
>>> np.maximum([np.nan, 0, np.nan], [0, np.nan, np.nan]) array([ NaN, NaN, NaN]) >>> np.maximum(np.Inf, 1) inf

dask.array.
mean
(a, axis=None, dtype=None, out=None, keepdims=False)¶ Compute the arithmetic mean along the specified axis.
Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis. float64 intermediate and return values are used for integer inputs.
Parameters:  a : array_like
Array containing numbers whose mean is desired. If a is not an array, a conversion is attempted.
 axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.
If this is a tuple of ints, a mean is performed over multiple axes, instead of a single axis or all the axes as before.
 dtype : datatype, optional
Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.
 out : ndarray, optional
Alternate output array in which to place the result. The default is
None
; if provided, it must have the same shape as the expected output, but the type will be cast if necessary. See doc.ufuncs for details. keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  m : ndarray, see dtype parameter above
If out=None, returns a new array containing the mean values, otherwise a reference to the output array is returned.
Notes
The arithmetic mean is the sum of the elements along the axis divided by the number of elements.
Note that for floatingpoint input, the mean is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higherprecision accumulator using the dtype keyword can alleviate this issue.
Examples
>>> a = np.array([[1, 2], [3, 4]]) >>> np.mean(a) 2.5 >>> np.mean(a, axis=0) array([ 2., 3.]) >>> np.mean(a, axis=1) array([ 1.5, 3.5])
In single precision, mean can be inaccurate:
>>> a = np.zeros((2, 512*512), dtype=np.float32) >>> a[0, :] = 1.0 >>> a[1, :] = 0.1 >>> np.mean(a) 0.546875
Computing the mean in float64 is more accurate:
>>> np.mean(a, dtype=np.float64) 0.55000000074505806

dask.array.
meshgrid
(*xi, **kwargs)¶ Return coordinate matrices from coordinate vectors.
Make ND coordinate arrays for vectorized evaluations of ND scalar/vector fields over ND grids, given onedimensional coordinate arrays x1, x2,…, xn.
Changed in version 1.9: 1D and 0D cases are allowed.
Parameters:  x1, x2,…, xn : array_like
1D arrays representing the coordinates of a grid.
 indexing : {‘xy’, ‘ij’}, optional
Cartesian (‘xy’, default) or matrix (‘ij’) indexing of output. See Notes for more details.
New in version 1.7.0.
 sparse : bool, optional
If True a sparse grid is returned in order to conserve memory. Default is False.
New in version 1.7.0.
 copy : bool, optional
If False, a view into the original arrays are returned in order to conserve memory. Default is True. Please note that
sparse=False, copy=False
will likely return noncontiguous arrays. Furthermore, more than one element of a broadcast array may refer to a single memory location. If you need to write to the arrays, make copies first.New in version 1.7.0.
Returns:  X1, X2,…, XN : ndarray
For vectors x1, x2,…, ‘xn’ with lengths
Ni=len(xi)
, return(N1, N2, N3,...Nn)
shaped arrays if indexing=’ij’ or(N2, N1, N3,...Nn)
shaped arrays if indexing=’xy’ with the elements of xi repeated to fill the matrix along the first dimension for x1, the second for x2 and so on.
See also
index_tricks.mgrid
 Construct a multidimensional “meshgrid” using indexing notation.
index_tricks.ogrid
 Construct an open multidimensional “meshgrid” using indexing notation.
Notes
This function supports both indexing conventions through the indexing keyword argument. Giving the string ‘ij’ returns a meshgrid with matrix indexing, while ‘xy’ returns a meshgrid with Cartesian indexing. In the 2D case with inputs of length M and N, the outputs are of shape (N, M) for ‘xy’ indexing and (M, N) for ‘ij’ indexing. In the 3D case with inputs of length M, N and P, outputs are of shape (N, M, P) for ‘xy’ indexing and (M, N, P) for ‘ij’ indexing. The difference is illustrated by the following code snippet:
xv, yv = meshgrid(x, y, sparse=False, indexing='ij') for i in range(nx): for j in range(ny): # treat xv[i,j], yv[i,j] xv, yv = meshgrid(x, y, sparse=False, indexing='xy') for i in range(nx): for j in range(ny): # treat xv[j,i], yv[j,i]
In the 1D and 0D case, the indexing and sparse keywords have no effect.
Examples
>>> nx, ny = (3, 2) >>> x = np.linspace(0, 1, nx) >>> y = np.linspace(0, 1, ny) >>> xv, yv = meshgrid(x, y) >>> xv array([[ 0. , 0.5, 1. ], [ 0. , 0.5, 1. ]]) >>> yv array([[ 0., 0., 0.], [ 1., 1., 1.]]) >>> xv, yv = meshgrid(x, y, sparse=True) # make sparse output arrays >>> xv array([[ 0. , 0.5, 1. ]]) >>> yv array([[ 0.], [ 1.]])
meshgrid is very useful to evaluate functions on a grid.
>>> x = np.arange(5, 5, 0.1) >>> y = np.arange(5, 5, 0.1) >>> xx, yy = meshgrid(x, y, sparse=True) >>> z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2) >>> h = plt.contourf(x,y,z)

dask.array.
min
(a, axis=None, out=None, keepdims=False)¶ Return the minimum of an array or minimum along an axis.
Parameters:  a : array_like
Input data.
 axis : None or int or tuple of ints, optional
Axis or axes along which to operate. By default, flattened input is used.
If this is a tuple of ints, the minimum is selected over multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. See doc.ufuncs (Section “Output arguments”) for more details.
 keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  amin : ndarray or scalar
Minimum of a. If axis is None, the result is a scalar value. If axis is given, the result is an array of dimension
a.ndim  1
.
See also
amax
 The maximum value of an array along a given axis, propagating any NaNs.
nanmin
 The minimum value of an array along a given axis, ignoring any NaNs.
minimum
 Elementwise minimum of two arrays, propagating any NaNs.
fmin
 Elementwise minimum of two arrays, ignoring any NaNs.
argmin
 Return the indices of the minimum values.
Notes
NaN values are propagated, that is if at least one item is NaN, the corresponding min value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmin.
Don’t use amin for elementwise comparison of 2 arrays; when
a.shape[0]
is 2,minimum(a[0], a[1])
is faster thanamin(a, axis=0)
.Examples
>>> a = np.arange(4).reshape((2,2)) >>> a array([[0, 1], [2, 3]]) >>> np.amin(a) # Minimum of the flattened array 0 >>> np.amin(a, axis=0) # Minima along the first axis array([0, 1]) >>> np.amin(a, axis=1) # Minima along the second axis array([0, 2])
>>> b = np.arange(5, dtype=np.float) >>> b[2] = np.NaN >>> np.amin(b) nan >>> np.nanmin(b) 0.0

dask.array.
minimum
(x1, x2[, out])¶ Elementwise minimum of array elements.
Compare two arrays and returns a new array containing the elementwise minima. If one of the elements being compared is a NaN, then that element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are propagated.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape, or shapes that can be broadcast to a single shape.
Returns:  y : ndarray or scalar
The minimum of x1 and x2, elementwise. Returns scalar if both x1 and x2 are scalars.
See also
Notes
The minimum is equivalent to
np.where(x1 <= x2, x1, x2)
when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.Examples
>>> np.minimum([2, 3, 4], [1, 5, 2]) array([1, 3, 2])
>>> np.minimum(np.eye(2), [0.5, 2]) # broadcasting array([[ 0.5, 0. ], [ 0. , 1. ]])
>>> np.minimum([np.nan, 0, np.nan],[0, np.nan, np.nan]) array([ NaN, NaN, NaN]) >>> np.minimum(np.Inf, 1) inf

dask.array.
modf
(x[, out1, out2])¶ Return the fractional and integral parts of an array, elementwise.
The fractional and integral parts are negative if the given number is negative.
Parameters:  x : array_like
Input array.
Returns:  y1 : ndarray
Fractional part of x.
 y2 : ndarray
Integral part of x.
Notes
For integer input the return values are floats.
Examples
>>> np.modf([0, 3.5]) (array([ 0. , 0.5]), array([ 0., 3.])) >>> np.modf(0.5) (0.5, 0)

dask.array.
moment
(a, order, axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)¶

dask.array.
nanargmax
(x, axis, **kwargs)¶

dask.array.
nanargmin
(x, axis, **kwargs)¶

dask.array.
nancumprod
(a, axis=None, dtype=None, out=None)¶ Return the cumulative product of array elements over a given axis treating Not a Numbers (NaNs) as one. The cumulative product does not change when NaNs are encountered and leading NaNs are replaced by ones.
Ones are returned for slices that are allNaN or empty.
New in version 1.12.0.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative product is computed. By default the input is flattened.
 dtype : dtype, optional
Type of the returned array, as well as of the accumulator in which the elements are multiplied. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used instead.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type of the resulting values will be cast if necessary.
Returns:  nancumprod : ndarray
A new array holding the result is returned unless out is specified, in which case it is returned.
See also
numpy.cumprod()
 Cumulative product across array propagating NaNs.
isnan
 Show which elements are NaN.
Examples
>>> np.nancumprod(1) array([1]) >>> np.nancumprod([1]) array([1]) >>> np.nancumprod([1, np.nan]) array([ 1., 1.]) >>> a = np.array([[1, 2], [3, np.nan]]) >>> np.nancumprod(a) array([ 1., 2., 6., 6.]) >>> np.nancumprod(a, axis=0) array([[ 1., 2.], [ 3., 2.]]) >>> np.nancumprod(a, axis=1) array([[ 1., 2.], [ 3., 3.]])

dask.array.
nancumsum
(a, axis=None, dtype=None, out=None)¶ Return the cumulative sum of array elements over a given axis treating Not a Numbers (NaNs) as zero. The cumulative sum does not change when NaNs are encountered and leading NaNs are replaced by zeros.
Zeros are returned for slices that are allNaN or empty.
New in version 1.12.0.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative sum is computed. The default (None) is to compute the cumsum over the flattened array.
 dtype : dtype, optional
Type of the returned array and of the accumulator in which the elements are summed. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type will be cast if necessary. See doc.ufuncs (Section “Output arguments”) for more details.
Returns:  nancumsum : ndarray.
A new array holding the result is returned unless out is specified, in which it is returned. The result has the same size as a, and the same shape as a if axis is not None or a is a 1d array.
See also
numpy.cumsum()
 Cumulative sum across array propagating NaNs.
isnan
 Show which elements are NaN.
Examples
>>> np.nancumsum(1) array([1]) >>> np.nancumsum([1]) array([1]) >>> np.nancumsum([1, np.nan]) array([ 1., 1.]) >>> a = np.array([[1, 2], [3, np.nan]]) >>> np.nancumsum(a) array([ 1., 3., 6., 6.]) >>> np.nancumsum(a, axis=0) array([[ 1., 2.], [ 4., 2.]]) >>> np.nancumsum(a, axis=1) array([[ 1., 3.], [ 3., 3.]])

dask.array.
nanmax
(a, axis=None, out=None, keepdims=False)¶ Return the maximum of an array or maximum along an axis, ignoring any NaNs. When allNaN slices are encountered a
RuntimeWarning
is raised and NaN is returned for that slice.Parameters:  a : array_like
Array containing numbers whose maximum is desired. If a is not an array, a conversion is attempted.
 axis : int, optional
Axis along which the maximum is computed. The default is to compute the maximum of the flattened array.
 out : ndarray, optional
Alternate output array in which to place the result. The default is
None
; if provided, it must have the same shape as the expected output, but the type will be cast if necessary. See doc.ufuncs for details.New in version 1.8.0.
 keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original a.
New in version 1.8.0.
Returns:  nanmax : ndarray
An array with the same shape as a, with the specified axis removed. If a is a 0d array, or if axis is None, an ndarray scalar is returned. The same dtype as a is returned.
See also
nanmin
 The minimum value of an array along a given axis, ignoring any NaNs.
amax
 The maximum value of an array along a given axis, propagating any NaNs.
fmax
 Elementwise maximum of two arrays, ignoring any NaNs.
maximum
 Elementwise maximum of two arrays, propagating any NaNs.
isnan
 Shows which elements are Not a Number (NaN).
isfinite
 Shows which elements are neither NaN nor infinity.
Notes
Numpy uses the IEEE Standard for Binary FloatingPoint for Arithmetic (IEEE 754). This means that Not a Number is not equivalent to infinity. Positive infinity is treated as a very large number and negative infinity is treated as a very small (i.e. negative) number.
If the input has a integer type the function is equivalent to np.max.
Examples
>>> a = np.array([[1, 2], [3, np.nan]]) >>> np.nanmax(a) 3.0 >>> np.nanmax(a, axis=0) array([ 3., 2.]) >>> np.nanmax(a, axis=1) array([ 2., 3.])
When positive infinity and negative infinity are present:
>>> np.nanmax([1, 2, np.nan, np.NINF]) 2.0 >>> np.nanmax([1, 2, np.nan, np.inf]) inf

dask.array.
nanmean
(a, axis=None, dtype=None, out=None, keepdims=False)¶ Compute the arithmetic mean along the specified axis, ignoring NaNs.
Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis. float64 intermediate and return values are used for integer inputs.
For allNaN slices, NaN is returned and a RuntimeWarning is raised.
New in version 1.8.0.
Parameters:  a : array_like
Array containing numbers whose mean is desired. If a is not an array, a conversion is attempted.
 axis : int, optional
Axis along which the means are computed. The default is to compute the mean of the flattened array.
 dtype : datatype, optional
Type to use in computing the mean. For integer inputs, the default is float64; for inexact inputs, it is the same as the input dtype.
 out : ndarray, optional
Alternate output array in which to place the result. The default is
None
; if provided, it must have the same shape as the expected output, but the type will be cast if necessary. See doc.ufuncs for details. keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.
Returns:  m : ndarray, see dtype parameter above
If out=None, returns a new array containing the mean values, otherwise a reference to the output array is returned. Nan is returned for slices that contain only NaNs.
Notes
The arithmetic mean is the sum of the nonNaN elements along the axis divided by the number of nonNaN elements.
Note that for floatingpoint input, the mean is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32. Specifying a higherprecision accumulator using the dtype keyword can alleviate this issue.
Examples
>>> a = np.array([[1, np.nan], [3, 4]]) >>> np.nanmean(a) 2.6666666666666665 >>> np.nanmean(a, axis=0) array([ 2., 4.]) >>> np.nanmean(a, axis=1) array([ 1., 3.5])

dask.array.
nanmin
(a, axis=None, out=None, keepdims=False)¶ Return minimum of an array or minimum along an axis, ignoring any NaNs. When allNaN slices are encountered a
RuntimeWarning
is raised and Nan is returned for that slice.Parameters:  a : array_like
Array containing numbers whose minimum is desired. If a is not an array, a conversion is attempted.
 axis : int, optional
Axis along which the minimum is computed. The default is to compute the minimum of the flattened array.
 out : ndarray, optional
Alternate output array in which to place the result. The default is
None<