Chainer – A flexible framework of neural networks
This is the Chainer documentation.
Install Guide
Before installing Chainer
We recommend these platforms.
Chainer is supported on Python 2.7.6+, 3.4.3+, 3.5.1+, and 3.6.0+. Chainer requires a C++ compiler such as g++, so install one before installing Chainer. Typical installation commands for each platform are:
# Ubuntu 14.04
$ apt-get install g++
# CentOS 7
$ yum install gcc-c++
If you use an old version of setuptools, upgrade it:
$ pip install -U setuptools
Install Chainer
Chainer depends on these Python packages:
CUDA support
cuDNN support
 cuDNN v2, v3, v4, v5, v5.1, v6
Caffe model support
 Protocol Buffers
 protobuf>=3.0.0 is required for Python 3
All these libraries are automatically installed with pip or setup.py.
Image dataset is optional
HDF5 serialization is optional
 h5py 2.5.0
Install Chainer from source
You can use setup.py to install Chainer from source:
$ tar zxf chainer-x.x.x.tar.gz
$ cd chainer-x.x.x
$ python setup.py install
When an error occurs...
Use the -vvvv option with the pip command.
It shows the full installation log, which may help you:
$ pip install chainer -vvvv
Install Chainer with CUDA
You need to install CUDA Toolkit before installing Chainer.
If you have CUDA in a default directory or have set CUDA_PATH correctly, the Chainer installer finds CUDA automatically:
$ pip install chainer
Note
The Chainer installer looks up the CUDA_PATH environment variable first.
If it is empty, the installer looks for the nvcc command in the PATH environment variable and uses its parent directory as the root directory of the CUDA installation.
If the nvcc command is not found either, the installer tries the default directory for Ubuntu, /usr/local/cuda.
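The lookup order described in this note can be sketched in plain Python. This is only an illustration under stated assumptions, not the installer's actual code: find_cuda_root is a hypothetical helper, and it assumes that nvcc lives in the bin directory directly under the CUDA root.

```python
import os
import shutil


def find_cuda_root(environ=None, which=shutil.which):
    """Guess the CUDA root directory using the order given in the note.

    Hypothetical helper for illustration; not part of Chainer.
    """
    environ = os.environ if environ is None else environ

    # 1. CUDA_PATH takes priority when it is set and non-empty.
    cuda_path = environ.get('CUDA_PATH', '')
    if cuda_path:
        return cuda_path

    # 2. Otherwise look for nvcc on PATH (assumed to be <root>/bin/nvcc).
    nvcc = which('nvcc')
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    # 3. Fall back to the default directory on Ubuntu.
    return '/usr/local/cuda'
```

Calling find_cuda_root() with no arguments inspects the real environment; passing a dictionary and a stand-in for which makes the three branches easy to try out.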
If you installed CUDA into a non-default directory, you need to specify the directory with the CUDA_PATH environment variable:
$ CUDA_PATH=/opt/nvidia/cuda pip install chainer
Warning
If you want to use sudo to install Chainer, note that the sudo command resets all environment variables.
Please specify the CUDA_PATH environment variable inside sudo like this:
$ sudo CUDA_PATH=/opt/nvidia/cuda pip install chainer
Install Chainer with CUDA and cuDNN
cuDNN is a library for deep neural networks that NVIDIA provides. Chainer can use cuDNN. If you want to enable cuDNN, install cuDNN and CUDA before installing Chainer. We recommend installing the developer library from the cuDNN deb package.
If you want to install the tar.gz version, we recommend installing it into the CUDA directory.
For example, if you use Ubuntu Linux, copy the .h files to the include directory and the .so files to the lib64 directory:
$ cp /path/to/cudnn.h $CUDA_PATH/include
$ cp /path/to/libcudnn.so* $CUDA_PATH/lib64
The destination directories depend on your environment.
If you want to use cuDNN installed in another directory, please set the CFLAGS, LDFLAGS and LD_LIBRARY_PATH environment variables before installing Chainer:
export CFLAGS=-I/path/to/cudnn/include
export LDFLAGS=-L/path/to/cudnn/lib
export LD_LIBRARY_PATH=/path/to/cudnn/lib:$LD_LIBRARY_PATH
Install Chainer for developers
Chainer uses Cython (>=0.24).
Developers need to use Cython to regenerate C++ sources from pyx files.
We recommend using pip with the -e option for editable mode:
$ pip install -U cython
$ cd /path/to/chainer/source
$ pip install -e .
Users do not need to install Cython, because the distribution package of Chainer contains only the generated sources.
Support image dataset
Install Pillow manually to activate image dataset support. This feature is optional:
$ pip install pillow
Support HDF5 serialization
Install h5py manually to activate HDF5 serialization. This feature is optional:
$ pip install h5py
Before installing h5py, you need to install libhdf5. The installation method depends on your environment:
# Ubuntu 14.04
$ apt-get install libhdf5-dev
# CentOS 7
$ yum -y install epel-release
$ yum install hdf5-devel
Uninstall Chainer
Use pip to uninstall Chainer:
$ pip uninstall chainer
Note
When you upgrade Chainer, pip sometimes leaves several versions of Chainer installed in site-packages.
Please run the uninstall command repeatedly until pip returns an error.
Reinstall Chainer
If you want to reinstall Chainer, please uninstall Chainer and then install it.
We recommend using the --no-cache-dir option, as pip sometimes uses its cache:
$ pip uninstall chainer
$ pip install chainer --no-cache-dir
If you installed Chainer without CUDA and later want to use CUDA, please reinstall Chainer. You also need to reinstall Chainer when you upgrade CUDA.
Run Chainer with Docker
We provide the official Docker image. Use the nvidia-docker command to run the Chainer image with GPU support. You can log in to the environment with bash and run the Python interpreter:
$ nvidia-docker run -it chainer/chainer /bin/bash
Or, run the interpreter directly:
$ nvidia-docker run -it chainer/chainer /usr/bin/python
What does “recommend” mean?
We test Chainer automatically with Jenkins. All supported environments are tested there. We cannot guarantee that Chainer works in other environments.
FAQ
The installer says “hdf5.h is not found”
You don’t have libhdf5. Please install hdf5. See Before installing Chainer.
MemoryError happens
You may have failed to install Cython. Please install it manually. See When an error occurs....
Examples say “cuDNN is not enabled”
You failed to build Chainer with cuDNN.
If you don’t need cuDNN, ignore this message.
Otherwise, retry installing Chainer with cuDNN.
The -vvvv option helps you.
See Install Chainer with CUDA and cuDNN.
Chainer Tutorial
Introduction to Chainer
This is the first section of the Chainer Tutorial. In this section, you will learn about the following things:
 Pros and cons of existing frameworks and why we are developing Chainer
 Simple example of forward and backward computation
 Usage of links and their gradient computation
 Construction of chains (a.k.a. “model” in most frameworks)
 Parameter optimization
 Serialization of links and optimizers
After reading this section, you will be able to:
 Compute gradients of some arithmetic expressions
 Write a multilayer perceptron with Chainer
Core Concept
As mentioned on the front page, Chainer is a flexible framework for neural networks. One major goal is flexibility, so it must enable us to write complex architectures simply and intuitively.
Most existing deep learning frameworks are based on the “Define-and-Run” scheme. That is, first a network is defined and fixed, and then the user periodically feeds it with minibatches. Since the network is statically defined before any forward/backward computation, all the logic must be embedded into the network architecture as data. Consequently, defining a network architecture in such systems (e.g. Caffe) follows a declarative approach. Note that one can still produce such a static network definition using imperative languages (e.g. torch.nn, Theano-based frameworks, and TensorFlow).
In contrast, Chainer adopts a “Define-by-Run” scheme, i.e., the network is defined on-the-fly via the actual forward computation. More precisely, Chainer stores the history of computation instead of programming logic. This strategy enables us to fully leverage the power of programming logic in Python. For example, Chainer does not need any magic to introduce conditionals and loops into the network definitions. The Define-by-Run scheme is the core concept of Chainer. We will show in this tutorial how to define networks dynamically.
This strategy also makes it easy to write multi-GPU parallelization, since logic comes closer to network manipulation. We will review such amenities in later sections of this tutorial.
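To make the Define-by-Run idea concrete, here is a toy sketch in plain Python (no Chainer required). The Var class and the forward function are hypothetical names invented for this illustration and are far simpler than Chainer's Variable: each result only remembers which inputs produced it, and an ordinary Python loop and conditional decide what the recorded graph looks like.

```python
class Var(object):
    """A scalar value that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (input Var, local derivative of self w.r.t. it).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        # Walk the recorded history backwards with the chain rule.
        # This naive recursion follows every path; fine for a toy graph.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


def forward(x, n_steps):
    # Ordinary Python control flow decides the shape of the graph.
    h = x
    for _ in range(n_steps):
        if h.value < 100.0:
            h = h * x
        else:
            h = h + x
    return h


x = Var(3.0)
y = forward(x, 2)       # the graph recorded here is x * x * x
y.backward()
print(y.value, x.grad)  # 27.0 27.0
```

With a different input the same forward function records a different graph, which is the point of the scheme: for x = 20 and three steps the conditional switches to additions after the first multiplication.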
Note
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported:
import numpy as np
import chainer
from chainer import cuda, Function, gradient_check, report, training, utils, Variable
from chainer import datasets, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
These imports appear widely in Chainer code and examples. For simplicity, we omit these imports in this tutorial.
Forward/Backward Computation
As described above, Chainer uses the “Define-by-Run” scheme, so forward computation itself defines the network.
In order to start forward computation, we have to set the input array to a Variable object.
Here we start with a simple ndarray with only one element:
>>> x_data = np.array([5], dtype=np.float32)
>>> x = Variable(x_data)
A Variable object has basic arithmetic operators. In order to compute \(y = x^2 - 2x + 1\), just write:
>>> y = x**2 - 2 * x + 1
The resulting y is also a Variable object, whose value can be extracted by accessing the data attribute:
>>> y.data
array([ 16.], dtype=float32)
What y holds is not only the result value.
It also holds the history of computation (or computational graph), which enables us to compute its derivative.
This is done by calling its backward() method:
>>> y.backward()
This runs error backpropagation (a.k.a. backprop or reverse-mode automatic differentiation).
Then, the gradient is computed and stored in the grad attribute of the input variable x:
>>> x.grad
array([ 8.], dtype=float32)
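The value 8 can be checked by hand: for y = x^2 - 2x + 1 the derivative is dy/dx = 2x - 2, which equals 8 at x = 5. The following plain-Python sketch compares the analytic derivative with a central-difference approximation; the helper names are chosen for this illustration and are not Chainer functions.

```python
def f(x):
    # The function from the example: y = x^2 - 2x + 1.
    return x ** 2 - 2 * x + 1


def analytic_grad(x):
    # dy/dx = 2x - 2
    return 2 * x - 2


def numerical_grad(func, x, eps=1e-4):
    # Central-difference approximation of the derivative at x.
    return (func(x + eps) - func(x - eps)) / (2 * eps)


print(f(5.0), analytic_grad(5.0))  # 16.0 8.0
print(abs(numerical_grad(f, 5.0) - analytic_grad(5.0)) < 1e-6)  # True
```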
We can also compute gradients of intermediate variables.
Note that Chainer, by default, releases the gradient arrays of intermediate variables for memory efficiency.
In order to preserve gradient information, pass the retain_grad argument to the backward method:
>>> z = 2*x
>>> y = x**2 - z + 1
>>> y.backward(retain_grad=True)
>>> z.grad
array([-1.], dtype=float32)
All these computations are easily generalized to a multi-element array input.
Note that if we want to start backward computation from a variable holding a multi-element array, we must set the initial error manually.
This is done simply by setting the grad attribute of the output variable:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad
array([[ 0., 2., 4.],
[ 6., 8., 10.]], dtype=float32)
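The result can be reproduced element by element. Because y is computed elementwise, the chain rule gives gx = gy * (2x - 2), where gy is the initial error stored in y.grad. A minimal sketch with nested Python lists standing in for the arrays:

```python
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
gy = [[1.0] * 3 for _ in range(2)]  # the initial error set on y.grad

# Chain rule, element by element: gx = gy * dy/dx = gy * (2x - 2).
gx = [[g * (2 * v - 2) for v, g in zip(x_row, g_row)]
      for x_row, g_row in zip(x, gy)]

print(gx)  # [[0.0, 2.0, 4.0], [6.0, 8.0, 10.0]]
```

An initial error of all ones is what backpropagation from the sum of all elements of y would produce.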
Links
In order to write neural networks, we have to combine functions with parameters and optimize the parameters. You can use links to do this. A link is an object that holds parameters (i.e. optimization targets).
The most fundamental ones are links that behave like regular functions while replacing some arguments by their parameters. We will introduce higher-level links later, but here think of links as simply functions with parameters.
One of the most frequently used links is the Linear link (a.k.a. fully-connected layer or affine transformation).
It represents a mathematical function \(f(x) = Wx + b\), where the matrix \(W\) and the vector \(b\) are parameters.
This link corresponds to its pure counterpart linear(), which accepts \(x, W, b\) as arguments.
A linear link from three-dimensional space to two-dimensional space is defined by the following line:
>>> f = L.Linear(3, 2)
Note
Most functions and links only accept minibatch input, where the first dimension of the input array is considered as the batch dimension. In the above Linear link case, input must have shape of (N, 3), where N is the minibatch size.
The parameters of a link are stored as attributes.
Each parameter is an instance of Variable.
In the case of the Linear link, two parameters, W and b, are stored.
By default, the matrix W is initialized randomly, while the vector b is initialized with zeros.
>>> f.W.data
array([[ 1.01847613,  0.23103087,  0.56507462],
       [ 1.29378033,  1.07823515, -0.56423163]], dtype=float32)
>>> f.b.data
array([ 0., 0.], dtype=float32)
An instance of the Linear link acts like a usual function:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = f(x)
>>> y.data
array([[ 3.1757617 , 1.75755572],
[ 8.61950684, 7.18090773]], dtype=float32)
Gradients of parameters are computed by the backward() method.
Note that gradients are accumulated by the method rather than overwritten.
So first you must clear gradients to renew the computation.
It can be done by calling the cleargrads() method.
>>> f.cleargrads()
Note
cleargrads() is introduced in v1.15 to replace zerograds() for efficiency.
zerograds() is left only for backward compatibility.
Now we can compute the gradients of parameters by simply calling the backward method.
>>> y.grad = np.ones((2, 2), dtype=np.float32)
>>> y.backward()
>>> f.W.grad
array([[ 5., 7., 9.],
[ 5., 7., 9.]], dtype=float32)
>>> f.b.grad
array([ 2., 2.], dtype=float32)
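These numbers follow from the definition of the link. For an output y[n][i] = sum_j W[i][j] * x[n][j] + b[i], the gradient with respect to W[i][j] is the sum over the minibatch of gy[n][i] * x[n][j], and the gradient with respect to b[i] is the sum over the minibatch of gy[n][i]. A sketch with nested Python lists standing in for the arrays (the names gW and gb are chosen for this illustration):

```python
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # minibatch of shape (N, 3)
gy = [[1.0, 1.0], [1.0, 1.0]]            # initial error of shape (N, 2)

n_out, n_in = 2, 3

# gW[i][j] = sum over the minibatch of gy[n][i] * x[n][j]
gW = [[sum(gy[n][i] * x[n][j] for n in range(len(x))) for j in range(n_in)]
      for i in range(n_out)]

# gb[i] = sum over the minibatch of gy[n][i]
gb = [sum(gy[n][i] for n in range(len(x))) for i in range(n_out)]

print(gW)  # [[5.0, 7.0, 9.0], [5.0, 7.0, 9.0]]
print(gb)  # [2.0, 2.0]
```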
Write a model as a chain
Most neural network architectures contain multiple links. For example, a multilayer perceptron consists of multiple linear layers. We can write complex procedures with parameters by combining multiple links like this:
>>> l1 = L.Linear(4, 3)
>>> l2 = L.Linear(3, 2)
>>> def my_forward(x):
...     h = l1(x)
...     return l2(h)
Here the L indicates the links module.
A procedure with parameters defined in this way is hard to reuse.
A more Pythonic way is to combine the links and procedures into a class:
>>> class MyProc(object):
...     def __init__(self):
...         self.l1 = L.Linear(4, 3)
...         self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)
In order to make it more reusable, we want to support parameter management, CPU/GPU migration, robust and flexible save/load features, etc.
These features are all supported by the Chain class in Chainer.
Then, what we have to do here is just define the above class as a subclass of Chain:
>>> class MyChain(Chain):
...     def __init__(self):
...         super(MyChain, self).__init__(
...             l1=L.Linear(4, 3),
...             l2=L.Linear(3, 2),
...         )
...
...     def __call__(self, x):
...         h = self.l1(x)
...         return self.l2(h)
Note
We often define a single forward method of a link by the __call__ operator.
Such links and chains are callable and behave like regular functions of Variables.
This example shows how a complex chain is constructed from simpler links.
Links like l1 and l2 are called child links of MyChain.
Note that Chain itself inherits Link.
It means we can define more complex chains that hold MyChain objects as their child links.
Another way to define a chain is using the ChainList class, which behaves like a list of links:
>>> class MyChain2(ChainList):
...     def __init__(self):
...         super(MyChain2, self).__init__(
...             L.Linear(4, 3),
...             L.Linear(3, 2),
...         )
...
...     def __call__(self, x):
...         h = self[0](x)
...         return self[1](h)
ChainList can conveniently hold an arbitrary number of links; however, if the number of links is fixed as in the above case, the Chain class is recommended as a base class.
Optimizer
In order to get good values for parameters, we have to optimize them by the Optimizer class.
It runs a numerical optimization algorithm on a given link.
Many algorithms are implemented in the optimizers module.
Here we use the simplest one, called Stochastic Gradient Descent (SGD):
>>> model = MyChain()
>>> optimizer = optimizers.SGD()
>>> optimizer.use_cleargrads()
>>> optimizer.setup(model)
The method use_cleargrads() is for efficiency. See use_cleargrads() for details.
The method setup() prepares for the optimization given a link.
Some parameter/gradient manipulations, e.g. weight decay and gradient clipping, can be done by setting hook functions to the optimizer. Hook functions are called after the gradient computation and right before the actual update of parameters. For example, we can set weight decay regularization by running the next line beforehand:
>>> optimizer.add_hook(chainer.optimizer.WeightDecay(0.0005))
Of course, you can write your own hook functions. A hook function should be a function or a callable object that takes the optimizer as its argument.
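The hook mechanism can be pictured with a small stand-alone sketch. Everything below is hypothetical and written for illustration only: ToyParam, ToyOptimizer and ToyWeightDecay are not Chainer classes, and the sketch assumes that weight decay simply adds rate times the parameter value to each gradient before the update.

```python
class ToyParam(object):
    """Hypothetical stand-in for a parameter with data and grad."""

    def __init__(self, data, grad):
        self.data = list(data)
        self.grad = list(grad)


class ToyOptimizer(object):
    """Hypothetical plain SGD that runs hooks right before each update."""

    def __init__(self, params, lr=0.01):
        self.params = params
        self.lr = lr
        self.hooks = []

    def add_hook(self, hook):
        self.hooks.append(hook)

    def update(self):
        # Hooks see the gradients after they are computed, before the update.
        for hook in self.hooks:
            hook(self)
        for p in self.params:
            p.data = [d - self.lr * g for d, g in zip(p.data, p.grad)]


class ToyWeightDecay(object):
    """Callable hook: adds rate * parameter to each gradient."""

    def __init__(self, rate):
        self.rate = rate

    def __call__(self, opt):
        for p in opt.params:
            p.grad = [g + self.rate * d for d, g in zip(p.data, p.grad)]


p = ToyParam(data=[1.0, -2.0], grad=[0.0, 0.0])
opt = ToyOptimizer([p], lr=0.1)
opt.add_hook(ToyWeightDecay(0.5))
opt.update()
print(p.grad)  # [0.5, -1.0]
```

Even with zero gradients from the loss, the hook pulls both parameters toward zero, which is the intended regularizing effect.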
There are two ways to use the optimizer.
One is using it via Trainer, which we will see in the following sections.
The other way is using it directly.
We here review the latter case.
If you only want to use the optimizer in the simple way, skip this section and go to the next one.
There are two further ways to use the optimizer directly.
One is manually computing gradients and then calling the update() method with no arguments.
Do not forget to clear the gradients beforehand!
>>> x = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> model.cleargrads()
>>> # compute gradient here...
>>> loss = F.sum(model(chainer.Variable(x)))
>>> loss.backward()
>>> optimizer.update()
The other way is just passing a loss function to the update() method.
In this case, cleargrads() is automatically called by the update method, so the user does not have to call it manually.
>>> def lossfun(arg1, arg2):
...     # calculate loss
...     loss = F.sum(model(arg1 - arg2))
...     return loss
>>> arg1 = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> arg2 = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> optimizer.update(lossfun, chainer.Variable(arg1), chainer.Variable(arg2))
See Optimizer.update() for the full specification.
Trainer
When we want to train neural networks, we have to run training loops that update the parameters many times. A typical training loop consists of the following procedures:
 1. Iterations over training datasets
 2. Preprocessing of extracted minibatches
 3. Forward/backward computations of the neural networks
 4. Parameter updates
 5. Evaluations of the current parameters on validation datasets
 6. Logging and printing of the intermediate results
Chainer provides a simple yet powerful way to write such training processes. The training loop abstraction mainly consists of two components:
 Dataset abstraction. It implements 1 and 2 in the above list. The core components are defined in the dataset module. There are also many implementations of datasets and iterators in the datasets and iterators modules, respectively.
 Trainer. It implements 3, 4, 5, and 6 in the above list. The whole procedure is implemented by Trainer. The way to update parameters (3 and 4) is defined by Updater, which can be freely customized. 5 and 6 are implemented by instances of Extension, which appends an extra procedure to the training loop. Users can freely customize the training procedure by adding extensions. Users can also implement their own extensions.
We will see how to use Trainer in the example section below.
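For orientation, the six procedures of a training loop listed in this section can be written out by hand. The sketch below is plain Python and does not use Chainer: it fits a one-dimensional linear model with SGD, and the function name train, the toy data and the log key are all made up for this illustration. Trainer packages the same steps behind reusable components.

```python
import random


def train(dataset, valid, n_epoch=200, batch_size=4, lr=0.05, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    log = []
    for epoch in range(n_epoch):
        # 1. Iterate over the training dataset in shuffled order.
        order = list(range(len(dataset)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            # 2. Extract (and here trivially preprocess) a minibatch.
            batch = [dataset[i] for i in order[start:start + batch_size]]
            # 3. Forward and backward computation for the squared error.
            gw = gb = 0.0
            for x, t in batch:
                err = (w * x + b) - t
                gw += 2 * err * x / len(batch)
                gb += 2 * err / len(batch)
            # 4. Parameter update (plain SGD).
            w -= lr * gw
            b -= lr * gb
        # 5. Evaluate the current parameters on the validation data.
        val_loss = sum((w * x + b - t) ** 2 for x, t in valid) / len(valid)
        # 6. Log the intermediate result.
        log.append({'epoch': epoch + 1, 'validation/loss': val_loss})
    return w, b, log


# Toy data on the line t = 2x + 1, reused as validation data for brevity.
data = [(i / 4.0, 2 * (i / 4.0) + 1) for i in range(-4, 5)]
w, b, log = train(data, data)
```

After training, w and b should be close to 2 and 1, and the logged validation loss should be near zero.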
Serializer
Before proceeding to the first example, we introduce Serializer, which is the last core feature described on this page.
Serializer is a simple interface to serialize or deserialize an object.
Link, Optimizer, and Trainer support serialization.
Concrete serializers are defined in the serializers module.
It supports NumPy NPZ and HDF5 formats.
For example, we can serialize a link object into an NPZ file with the serializers.save_npz() function:
>>> serializers.save_npz('my.model', model)
It saves the parameters of model into the file 'my.model' in NPZ format.
The saved model can be read by the serializers.load_npz() function:
>>> serializers.load_npz('my.model', model)
Note
Only the parameters and the persistent values are serialized by this serialization code.
Other attributes are not saved automatically.
You can register arrays, scalars, or any serializable objects as persistent values by the Link.add_persistent() method.
The registered values can be accessed by attributes of the name passed to the add_persistent method.
The state of an optimizer can also be saved by the same functions:
>>> serializers.save_npz('my.state', optimizer)
>>> serializers.load_npz('my.state', optimizer)
Note
Note that serialization of optimizer only saves its internal states including number of iterations, momentum vectors of MomentumSGD, etc. It does not save the parameters and persistent values of the target link. We have to explicitly save the target link with the optimizer to resume the optimization from saved states.
Support of the HDF5 format is enabled if the h5py package is installed.
Serialization and deserialization with the HDF5 format are almost identical to those with the NPZ format;
just replace save_npz() and load_npz() by save_hdf5() and load_hdf5(), respectively.
Example: Multilayer Perceptron on MNIST
Now you can solve a multiclass classification task using a multilayer perceptron (MLP).
We use a handwritten digits dataset called MNIST, which is one of the longstanding de facto “hello world” examples used in machine learning.
This MNIST example is also found in the examples/mnist directory of the official repository.
We show how to use Trainer to construct and run the training loop in this section.
We first have to prepare the MNIST dataset.
The MNIST dataset consists of 70,000 greyscale images of size 28x28 (i.e. 784 pixels) and corresponding digit labels.
The dataset is divided into 60,000 training images and 10,000 test images by default.
We can obtain the vectorized version (i.e., a set of 784-dimensional vectors) by datasets.get_mnist().
>>> train, test = datasets.get_mnist()
...
This code automatically downloads the MNIST dataset and saves the NumPy arrays to the $(HOME)/.chainer directory.
The returned train and test can be seen as lists of image-label pairs (strictly speaking, they are instances of TupleDataset).
We also have to define how to iterate over these datasets.
We want to shuffle the training dataset for every epoch, i.e. at the beginning of every sweep over the dataset.
In this case, we can use iterators.SerialIterator.
>>> train_iter = iterators.SerialIterator(train, batch_size=100, shuffle=True)
On the other hand, we do not have to shuffle the test dataset.
In this case, we can pass the shuffle=False argument to disable the shuffling.
It makes the iteration faster when the underlying dataset supports fast slicing.
>>> test_iter = iterators.SerialIterator(test, batch_size=100, repeat=False, shuffle=False)
We also pass repeat=False, which means we stop iteration when all examples are visited.
This option is usually required for the test/validation datasets; without this option, the iteration enters an infinite loop.
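The roles of the shuffle and repeat options can be illustrated with a small generator in plain Python. The function serial_batches is a hypothetical simplification written for this tutorial, not the SerialIterator implementation; in particular, it always ends a sweep with a possibly smaller final batch.

```python
import random


def serial_batches(dataset, batch_size, repeat=True, shuffle=True, seed=None):
    """Yield minibatches; a simplified picture of a serial iterator."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(dataset)))
        if shuffle:
            rng.shuffle(order)  # new order at the beginning of every epoch
        for start in range(0, len(order), batch_size):
            yield [dataset[i] for i in order[start:start + batch_size]]
        if not repeat:
            return  # stop after one sweep over the dataset


batches = list(serial_batches(list(range(10)), 4, repeat=False, shuffle=False))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With repeat=True the generator never finishes, so wrapping it in list() as above would not terminate; this is the infinite loop that repeat=False avoids for test data.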
Next, we define the architecture. We use a simple three-layer rectifier network with 100 units per layer as an example.
>>> class MLP(Chain):
...     def __init__(self, n_units, n_out):
...         super(MLP, self).__init__(
...             # the size of the inputs to each layer will be inferred
...             l1=L.Linear(None, n_units),  # n_in -> n_units
...             l2=L.Linear(None, n_units),  # n_units -> n_units
...             l3=L.Linear(None, n_out),  # n_units -> n_out
...         )
...
...     def __call__(self, x):
...         h1 = F.relu(self.l1(x))
...         h2 = F.relu(self.l2(h1))
...         y = self.l3(h2)
...         return y
This link uses relu() as an activation function.
Note that the 'l3' link is the final linear layer whose output corresponds to scores for the ten digits.
In order to compute loss values or evaluate the accuracy of the predictions, we define a classifier chain on top of the above MLP chain:
>>> class Classifier(Chain):
...     def __init__(self, predictor):
...         super(Classifier, self).__init__(predictor=predictor)
...
...     def __call__(self, x, t):
...         y = self.predictor(x)
...         loss = F.softmax_cross_entropy(y, t)
...         accuracy = F.accuracy(y, t)
...         report({'loss': loss, 'accuracy': accuracy}, self)
...         return loss
This Classifier class computes accuracy and loss, and returns the loss value.
The pair of arguments x and t corresponds to each example in the datasets (a tuple of an image and a label).
softmax_cross_entropy() computes the loss value given prediction and ground truth labels.
accuracy() computes the prediction accuracy.
We can set an arbitrary predictor link to an instance of the classifier.
The report() function reports the loss and accuracy values to the trainer.
For the detailed mechanism of collecting training statistics, see Reporter.
You can also collect other types of observations like activation statistics in a similar way.
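To see what the two quantities measure, here is a plain-Python sketch of a softmax cross entropy and an accuracy computed from raw scores. It assumes that the loss is averaged over the minibatch; the two functions are illustrative re-implementations on nested lists, not the Chainer functions themselves.

```python
import math


def softmax_cross_entropy(scores, labels):
    """Mean negative log-likelihood of the labels under softmax(scores)."""
    total = 0.0
    for row, t in zip(scores, labels):
        m = max(row)  # subtract the maximum for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[t]
    return total / len(labels)


def accuracy(scores, labels):
    """Fraction of examples whose highest score is at the true label."""
    hits = sum(1 for row, t in zip(scores, labels)
               if row.index(max(row)) == t)
    return hits / float(len(labels))


scores = [[0.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
print(accuracy(scores, labels))  # 0.5
```

For a row of equal scores over two classes the per-example loss is log 2, since the softmax assigns probability 1/2 to each class.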
Note that a class similar to the Classifier above is defined as chainer.links.Classifier.
So instead of using the above example, we will use this predefined Classifier chain.
>>> model = L.Classifier(MLP(100, 10)) # the input size, 784, is inferred
>>> optimizer = optimizers.SGD()
>>> optimizer.setup(model)
Now we can build a trainer object.
>>> updater = training.StandardUpdater(train_iter, optimizer)
>>> trainer = training.Trainer(updater, (20, 'epoch'), out='result')
The second argument (20, 'epoch') represents the duration of training.
We can use either epoch or iteration as the unit.
In this case, we train the multilayer perceptron by iterating over the training set 20 times.
In order to invoke the training loop, we just call the run() method.
>>> trainer.run()
This method executes the whole training sequence.
The above code just optimizes the parameters.
In most cases, we want to see how the training proceeds. For this purpose, we can use extensions, which are added before calling the run method.
>>> trainer.extend(extensions.Evaluator(test_iter, model))
>>> trainer.extend(extensions.LogReport())
>>> trainer.extend(extensions.PrintReport(['epoch', 'main/accuracy', 'validation/main/accuracy']))
>>> trainer.extend(extensions.ProgressBar())
>>> trainer.run()
These extensions perform the following tasks:
Evaluator
 Evaluates the current model on the test dataset at the end of every epoch.
LogReport
 Accumulates the reported values and emits them to the log file in the output directory.
PrintReport
 Prints the selected items in the LogReport.
ProgressBar
 Shows the progress bar.
There are many extensions implemented in the chainer.training.extensions module.
The most important one that is not included above is snapshot(), which saves the snapshot of the training procedure (i.e., the Trainer object) to a file in the output directory.
The example code in the examples/mnist directory additionally contains GPU support, though the essential part is the same as the code in this tutorial. We will review in later sections how to use GPU(s).
How to Write a New Network
Convolutional Network for Visual Recognition Tasks
In this section, you will learn how to write
 A small convolutional network with a model class that is inherited from Chain,
 A large convolutional network that has several building block networks with ChainList.
After reading this section, you will be able to:
 Write your own original convolutional network in Chainer
A convolutional network (ConvNet) is mainly comprised of convolutional layers. This type of network is commonly used for various visual recognition tasks, e.g., classifying handwritten digits or natural images into given object classes, detecting objects in an image, and labeling all pixels of an image with the object classes (semantic segmentation).
In such tasks, a typical ConvNet takes a set of images whose shape is \((N, C, H, W)\), where
 \(N\) denotes the number of images in a minibatch,
 \(C\) denotes the number of channels of those images,
 \(H\) and \(W\) denote the height and width of those images,
respectively. Then, it typically outputs a fixed-sized vector as membership probabilities over the target object classes. It can also output a set of feature maps that have the corresponding size to the input image for a pixel labeling task, etc.
Note
The below example code assumes that some packages are already imported. Please see the details here: tutorial/basic.
LeNet5
Here, let’s start by defining LeNet5 [LeCun98] in Chainer. This is a ConvNet model that has 5 layers comprised of 3 convolutional layers and 2 fully-connected layers. It was proposed in 1998 to classify handwritten digit images. In Chainer, the model can be written as follows:
class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__(
            conv1=L.Convolution2D(
                in_channels=1, out_channels=6, ksize=5, stride=1),
            conv2=L.Convolution2D(
                in_channels=6, out_channels=16, ksize=5, stride=1),
            conv3=L.Convolution2D(
                in_channels=16, out_channels=120, ksize=4, stride=1),
            fc4=L.Linear(None, 84),
            fc5=L.Linear(84, 10),
        )
        self.train = True
    def __call__(self, x):
        h = F.sigmoid(self.conv1(x))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv2(h))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv3(h))
        h = F.sigmoid(self.fc4(h))
        if self.train:
            return self.fc5(h)
        return F.softmax(self.fc5(h))
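It can help to trace the feature map shapes through this model. The sketch below is plain Python and assumes 28x28 single-channel inputs (the MNIST image size); conv_out and lenet5_shapes are hypothetical helpers for this illustration. It uses the usual output-size formula (size + 2 * pad - ksize) // stride + 1; the rounding mode does not matter here because every size divides evenly.

```python
def conv_out(size, ksize, stride=1, pad=0):
    # Output length of a convolution or pooling window along one axis.
    return (size + 2 * pad - ksize) // stride + 1


def lenet5_shapes(height=28, width=28):
    """Return (layer name, (channels, height, width)) after each layer."""
    shapes = []
    c, h, w = 1, height, width
    layers = [('conv1', 6, 5, 1), ('pool1', None, 2, 2),
              ('conv2', 16, 5, 1), ('pool2', None, 2, 2),
              ('conv3', 120, 4, 1)]
    for name, out_c, k, s in layers:
        c = c if out_c is None else out_c  # pooling keeps the channel count
        h, w = conv_out(h, k, s), conv_out(w, k, s)
        shapes.append((name, (c, h, w)))
    return shapes


for name, shape in lenet5_shapes():
    print(name, shape)
# conv1 (6, 24, 24)
# pool1 (6, 12, 12)
# conv2 (16, 8, 8)
# pool2 (16, 4, 4)
# conv3 (120, 1, 1)
```

Under this assumption conv3 produces 120 values per image, which is the input size that L.Linear(None, 84) infers for fc4.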
A typical way to write your network is to create a new class inherited from the Chain class. When defining your model in this way, typically, all the layers which have trainable parameters are registered to the model by giving the Link objects to the superclass’s constructor as keyword arguments (see the above __init__()).
There is also another way to do the same thing. For example, add_link() of the Chain class enables you to register the trainable layers (i.e., Links) to the model, so that the above __init__() can also be written as follows:
def __init__(self):
    super(LeNet5, self).__init__()
    self.add_link('conv1', L.Convolution2D(1, 6, 5, 1))
    self.add_link('conv2', L.Convolution2D(6, 16, 5, 1))
    self.add_link('conv3', L.Convolution2D(16, 120, 4, 1))
    self.add_link('fc4', L.Linear(None, 84))
    self.add_link('fc5', L.Linear(84, 10))
    self.train = True
(Arguments to Convolution2D are given without keywords here for simplicity.)
The model class is instantiated before the forward and backward computations. To give input images and label vectors simply by calling the model object like a function, __call__() is usually defined in the model class. This method performs the forward computation of the model. Chainer uses the powerful autograd system for any computational graphs written with Functions and Links (a Link actually calls a corresponding Function inside of it), so that you don’t need to explicitly write the code for backward computations in the model. Just prepare the data, then give it to the model. The way this works is that the resulting output Variable from the forward computation has a backward() method to perform autograd. In the above model, __call__() has an if statement at the end to switch its behavior by the model’s running mode, i.e., training mode or not. When it’s in training mode, this method returns the output value of the last layer as is to compute the loss later on; otherwise it returns a prediction result by calculating softmax().
If you don’t want to write conv1 and the other layers more than once, you can also write the model in this way:
class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        net = [('conv1', L.Convolution2D(1, 6, 5, 1))]
        net += [('_sigm1', F.Sigmoid())]
        net += [('_mpool1', F.MaxPooling2D(2, 2))]
        net += [('conv2', L.Convolution2D(6, 16, 5, 1))]
        net += [('_sigm2', F.Sigmoid())]
        net += [('_mpool2', F.MaxPooling2D(2, 2))]
        net += [('conv3', L.Convolution2D(16, 120, 4, 1))]
        net += [('_sigm3', F.Sigmoid())]
        net += [('_mpool3', F.MaxPooling2D(2, 2))]
        net += [('fc4', L.Linear(None, 84))]
        net += [('_sigm4', F.Sigmoid())]
        net += [('fc5', L.Linear(84, 10))]
        net += [('_sigm5', F.Sigmoid())]
        for n in net:
            if not n[0].startswith('_'):
                self.add_link(*n)
        self.forward = net
        self.train = True
    def __call__(self, x):
        for n, f in self.forward:
            if not n.startswith('_'):
                x = getattr(self, n)(x)
            else:
                x = f(x)
        if self.train:
            return x
        return F.softmax(x)
This code creates a list of all Links and Functions after calling its superclass’s constructor. Then the elements of the list are registered to this model as trainable layers when the name of an element doesn’t start with the _ character. This operation can be freely replaced with many other ways because those names are just designed to select only the Links from the list net easily. A Function doesn’t have any trainable parameters, so we can’t register it to the model with add_link(), but we want to use Functions for constructing a forward path. The list net is stored as the attribute forward so that __call__() can refer to it. In __call__(), it retrieves all layers in the network from self.forward sequentially regardless of what type of object (Link or Function) each is, and gives the input variable or the intermediate output from the previous layer to the current layer. The last part of __call__(), which switches its behavior by the training/inference mode, is the same as in the former way.
Ways to calculate loss
When you train the model with label vector t, the loss should be calculated using the output from the model. There are also several ways to calculate the loss:
model = LeNet5()
# Input data and label
x = np.random.rand(32, 1, 28, 28).astype(np.float32)
t = np.random.randint(0, 10, size=(32,)).astype(np.int32)
# Forward computation
y = model(x)
# Loss calculation
loss = F.softmax_cross_entropy(y, t)
This is a primitive way to calculate a loss value from the output of the model. On the other hand, the loss computation can be included in the model itself by wrapping the model object (a Chain or ChainList object) with a class inherited from Chain. The outer Chain should take the model defined above and register it through the constructor of its superclass or add_link(). Chain is itself inherited from Link, so a Chain can also be registered as a trainable Link to another Chain. In fact, there is already a Classifier class that can be used to wrap the model and include the loss computation as well. It can be used like this:
model = L.Classifier(LeNet5())
# Forward & loss calculation
loss = model(x, t)
This class takes a model object as an input argument and registers it under the predictor attribute, so that its parameters are trained. As shown above, the returned object can then be called like a function: we pass x and t as the input arguments, and the resulting loss value (which we recall is a Variable) is returned.
For details, see the reference for chainer.links.Classifier and check the implementation by looking at its source.
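To make the wrapping idea concrete, here is a minimal pure-Python sketch of a loss wrapper. It is not the Classifier implementation: LossWrapper and this list-based softmax_cross_entropy are hypothetical names that only mirror the pattern of holding a predictor and returning the loss from a call with x and t.

```python
import math


def softmax_cross_entropy(scores, label):
    """Cross entropy of softmax(scores) against an integer label."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


class LossWrapper(object):
    """Hypothetical wrapper: holds a predictor and computes the loss."""

    def __init__(self, predictor, lossfun=softmax_cross_entropy):
        self.predictor = predictor
        self.lossfun = lossfun
        self.loss = None

    def __call__(self, x, t):
        y = self.predictor(x)          # forward computation
        self.loss = self.lossfun(y, t) # loss calculation
        return self.loss


# A predictor that gives both classes the same score.
wrapper = LossWrapper(lambda x: [0.0, 0.0])
loss = wrapper(None, 0)   # uniform over two classes, so the loss is log(2)
```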
From the above examples, we can see that Chainer provides the flexibility to write our original network in many different ways. This flexibility is intended to make it intuitive for users to design new and complex models.
VGG16
Next, let’s write some larger models in Chainer. When you write a large network consisting of several building block networks, ChainList is useful. First, let’s see how to write a VGG16 [Simonyan14] model.
class VGG16(chainer.ChainList):
    def __init__(self):
        w = chainer.initializers.HeNormal()
        super(VGG16, self).__init__(
            VGGBlock(64),
            VGGBlock(128),
            VGGBlock(256, 3),
            VGGBlock(512, 3),
            VGGBlock(512, 3, True))
        self.train = True

    def __call__(self, x):
        for f in self.children():
            x = f(x, self.train)
        if self.train:
            return x
        return F.softmax(x)


class VGGBlock(chainer.Chain):
    def __init__(self, n_channels, n_convs=2, fc=False):
        w = chainer.initializers.HeNormal()
        super(VGGBlock, self).__init__(
            conv1=L.Convolution2D(None, n_channels, 3, 1, 1, initialW=w),
            conv2=L.Convolution2D(
                n_channels, n_channels, 3, 1, 1, initialW=w))
        if n_convs == 3:
            self.add_link('conv3', L.Convolution2D(
                n_channels, n_channels, 3, 1, 1, initialW=w))
        if fc:
            self.add_link('fc4', L.Linear(None, 4096, initialW=w))
            self.add_link('fc5', L.Linear(4096, 4096, initialW=w))
            self.add_link('fc6', L.Linear(4096, 1000, initialW=w))

        self.n_convs = n_convs
        self.fc = fc

    def __call__(self, x, train):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        if self.n_convs == 3:
            h = F.relu(self.conv3(h))
        h = F.max_pooling_2d(h, 2, 2)
        if self.fc:
            h = F.dropout(F.relu(self.fc4(h)), train=train)
            h = F.dropout(F.relu(self.fc5(h)), train=train)
            h = self.fc6(h)
        return h
That’s it. VGG16 is a model which won 1st place in the classification + localization task at ILSVRC 2014, and since then it has become one of the standard models for many different tasks as a pretrained model. It has 16 layers, so it’s called “VGG16”, but we can write this model without writing all layers independently. Since this model consists of several building blocks that have the same architecture, we can build the whole network by reusing the building block definition. Each part of the network consists of 2 or 3 convolutional layers, each followed by an activation function (relu()), and a max_pooling_2d() operation. This block is written as VGGBlock in the above example code. The whole network just calls these blocks one by one in a sequential manner.
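As a quick sanity check of the name, the layer count can be reproduced from the block configuration used in the VGG16 definition above. This is only an arithmetic sketch: the blocks list is a hand-copied summary of the five VGGBlock calls, not something read from the model.

```python
# (number of convolutional layers, has the fully-connected part) per VGGBlock
blocks = [(2, False), (2, False), (3, False), (3, False), (3, True)]

n_conv = sum(n_convs for n_convs, _ in blocks)
n_fc = sum(3 for _, fc in blocks if fc)   # fc4, fc5 and fc6
n_layers = n_conv + n_fc
```

Only layers with trainable weights are counted; the pooling and activation functions are not.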
ResNet152
How about ResNet? ResNet [He16] came in the following year’s ILSVRC. It is a much deeper model than VGG16, having up to 152 layers. This sounds super laborious to build, but it can be implemented in almost the same manner as VGG16. In other words, it’s easy. One possible way to write ResNet152 is:
class ResNet152(chainer.Chain):
    def __init__(self, n_blocks=[3, 8, 36, 3]):
        w = chainer.initializers.HeNormal()
        super(ResNet152, self).__init__(
            conv1=L.Convolution2D(
                None, 64, 7, 2, 3, initialW=w, nobias=True),
            bn1=L.BatchNormalization(64),
            res2=ResBlock(n_blocks[0], 64, 64, 256, 1),
            res3=ResBlock(n_blocks[1], 256, 128, 512),
            res4=ResBlock(n_blocks[2], 512, 256, 1024),
            res5=ResBlock(n_blocks[3], 1024, 512, 2048),
            fc6=L.Linear(2048, 1000))
        self.train = True

    def __call__(self, x):
        h = self.bn1(self.conv1(x), test=not self.train)
        h = F.max_pooling_2d(F.relu(h), 2, 2)
        h = self.res2(h, self.train)
        h = self.res3(h, self.train)
        h = self.res4(h, self.train)
        h = self.res5(h, self.train)
        h = F.average_pooling_2d(h, h.shape[2:], stride=1)
        h = self.fc6(h)
        if self.train:
            return h
        return F.softmax(h)


class ResBlock(chainer.ChainList):
    def __init__(self, n_layers, n_in, n_mid, n_out, stride=2):
        w = chainer.initializers.HeNormal()
        super(ResBlock, self).__init__()
        self.add_link(BottleNeck(n_in, n_mid, n_out, stride, True))
        for _ in range(n_layers - 1):
            self.add_link(BottleNeck(n_out, n_mid, n_out))

    def __call__(self, x, train):
        for f in self.children():
            x = f(x, train)
        return x


class BottleNeck(chainer.Chain):
    def __init__(self, n_in, n_mid, n_out, stride=1, proj=False):
        w = chainer.initializers.HeNormal()
        super(BottleNeck, self).__init__(
            conv1x1a=L.Convolution2D(
                n_in, n_mid, 1, stride, 0, initialW=w, nobias=True),
            conv3x3b=L.Convolution2D(
                n_mid, n_mid, 3, 1, 1, initialW=w, nobias=True),
            conv1x1c=L.Convolution2D(
                n_mid, n_out, 1, 1, 0, initialW=w, nobias=True),
            bn_a=L.BatchNormalization(n_mid),
            bn_b=L.BatchNormalization(n_mid),
            bn_c=L.BatchNormalization(n_out))
        if proj:
            self.add_link('conv1x1r', L.Convolution2D(
                n_in, n_out, 1, stride, 0, initialW=w, nobias=True))
            self.add_link('bn_r', L.BatchNormalization(n_out))
        self.proj = proj

    def __call__(self, x, train):
        h = F.relu(self.bn_a(self.conv1x1a(x), test=not train))
        h = F.relu(self.bn_b(self.conv3x3b(h), test=not train))
        h = self.bn_c(self.conv1x1c(h), test=not train)
        if self.proj:
            x = self.bn_r(self.conv1x1r(x), test=not train)
        return F.relu(h + x)
In the BottleNeck class, depending on the value of the proj argument supplied to the initializer, the block conditionally computes a convolutional layer conv1x1r, which extends the number of channels of the input x to match the number of channels of the output of conv1x1c, followed by a batch normalization layer, before the final ReLU layer. Writing the building block in this way improves the reusability of the class. The flag switches not only the behavior in __call__() but also the parameter registration. In this case, when proj is False, the BottleNeck doesn’t have the conv1x1r and bn_r layers, so it uses less memory than it would if it registered both anyway and just ignored them when proj is False.
Using nested Chains and ChainList for the sequential parts enables us to write complex and very deep models easily.
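The name ResNet152 can be checked against the n_blocks default in the same arithmetic spirit. The helper below is a hypothetical sketch; it assumes the usual counting convention, in which the first convolution, the three convolutions of every bottleneck, and the final fully-connected layer are counted, while the projection shortcuts (conv1x1r) are not.

```python
def count_layers(n_blocks, convs_per_bottleneck=3):
    """Count conv1, all bottleneck convolutions, and the final fc layer."""
    return 1 + convs_per_bottleneck * sum(n_blocks) + 1


depth_152 = count_layers([3, 8, 36, 3])    # the default used above
depth_101 = count_layers([3, 4, 23, 3])    # block counts of ResNet-101
depth_50 = count_layers([3, 4, 6, 3])      # block counts of ResNet-50
```

Under that convention, passing a different n_blocks list to the class above is all that is needed to obtain the shallower variants.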
Use Pretrained Models
Various ways to write your models were described above. It turns out that VGG16 and ResNet are very useful as general feature extractors for many kinds of tasks, including but not limited to image classification. So, Chainer provides you with the pretrained VGG16 and ResNet50/101/152 models with a simple API. You can use these models as follows:
from chainer.links import VGG16Layers
model = VGG16Layers()
When VGG16Layers is instantiated, the pretrained parameters are automatically downloaded from the author’s server. So you can immediately start to use VGG16 with pretrained weights as a good image feature extractor. See the details of this model here: chainer.links.VGG16Layers.
In the case of ResNet models, there are three variations differing in the number of layers. We have the chainer.links.ResNet50Layers, chainer.links.ResNet101Layers, and chainer.links.ResNet152Layers models with an easy parameter loading feature. ResNet’s pretrained parameters are not available for direct downloading, so you need to download the weights from the author’s web page first, and then place them into the directory $CHAINER_DATASET_ROOT/pfnet/chainer/models or your favorite place. Once the preparation is finished, the usage is the same as VGG16:
from chainer.links import ResNet152Layers
model = ResNet152Layers()
Please see the details of usage and how to prepare the pretrained weights for ResNet here: chainer.links.ResNet50Layers
References
[LeCun98]  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324, 1998.
[Simonyan14]  Simonyan, K. and Zisserman, A., Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
[He16]  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
Recurrent Nets and their Computational Graph
In this section, you will learn how to write
 recurrent nets with full backprop,
 recurrent nets with truncated backprop,
 evaluation of networks with little memory.
After reading this section, you will be able to:
 Handle input sequences of variable length
 Truncate upper stream of the network during forward computation
 Use volatile variables to prevent network construction
Recurrent Nets
Recurrent nets are neural networks with loops. They are often used to learn from sequential input/output. Given an input stream \(x_1, x_2, \dots, x_t, \dots\) and the initial state \(h_0\), a recurrent net iteratively updates its state by \(h_t = f(x_t, h_{t-1})\), and at some or every point in time \(t\), it outputs \(y_t = g(h_t)\). If we expand the procedure along the time axis, it looks like a regular feed-forward network except that the same parameters are repeatedly used within the network.
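The recurrence can be written out directly. The following is a minimal scalar sketch, not Chainer code: the tanh update, the linear output, and the weights w_x, w_h and w_y are illustrative assumptions chosen only to make f and g concrete.

```python
import math


def f(x_t, h_prev, w_x=0.5, w_h=0.9):
    """State update: the same weights are reused at every time step."""
    return math.tanh(w_x * x_t + w_h * h_prev)


def g(h_t, w_y=2.0):
    """Output computed from the current state."""
    return w_y * h_t


def run(xs, h0=0.0):
    """Unroll the recurrence along the time axis."""
    h, ys = h0, []
    for x_t in xs:
        h = f(x_t, h)
        ys.append(g(h))
    return ys, h


ys, h_last = run([1.0, 0.0])
```

Unrolling the loop for a fixed input length gives an ordinary feed-forward computation in which f appears once per time step with shared weights.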
Here we learn how to write a simple one-layer recurrent net. The task is language modeling: given a finite sequence of words, we want to predict the next word at each position without peeking at the subsequent words. Suppose there are 1,000 different word types, and that we use 100-dimensional real vectors to represent each word (a.k.a. word embedding).
Let’s start from defining the recurrent neural net language model (RNNLM) as a chain. We can use the chainer.links.LSTM link that implements a fully-connected stateful LSTM layer. This link looks like an ordinary fully-connected layer. On construction, you pass the input and output size to the constructor:
>>> l = L.LSTM(100, 50)
Then, calling this instance as l(x) executes one step of the LSTM layer:
>>> l.reset_state()
>>> x = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y = l(x)
Do not forget to reset the internal state of the LSTM layer before the forward computation! Every recurrent layer holds its internal state (i.e. the output of the previous call). At the first application of the recurrent layer, you must reset the internal state. Then, the next input can be directly fed to the LSTM instance:
>>> x2 = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y2 = l(x2)
Based on this LSTM link, let’s write our recurrent network as a new chain:
class RNN(Chain):
    def __init__(self):
        super(RNN, self).__init__(
            embed=L.EmbedID(1000, 100),  # word embedding
            mid=L.LSTM(100, 50),  # the first LSTM layer
            out=L.Linear(50, 1000),  # the feed-forward output layer
        )

    def reset_state(self):
        self.mid.reset_state()

    def __call__(self, cur_word):
        # Given the current word ID, predict the next word.
        x = self.embed(cur_word)
        h = self.mid(x)
        y = self.out(h)
        return y

rnn = RNN()
model = L.Classifier(rnn)
optimizer = optimizers.SGD()
optimizer.setup(model)
Here EmbedID is a link for word embedding. It converts input integers into corresponding fixed-dimensional embedding vectors. The last linear link out represents the feed-forward output layer.
The RNN chain implements a one-step-forward computation. It does not handle sequences by itself, but we can use it to process sequences by just feeding items in a sequence straight to the chain.
Suppose we have a list of word variables x_list. Then, we can compute loss values for the word sequence with a simple for loop.
def compute_loss(x_list):
    loss = 0
    for cur_word, next_word in zip(x_list, x_list[1:]):
        loss += model(cur_word, next_word)
    return loss
Of course, the accumulated loss is a Variable object with the full history of computation. So we can just call its backward() method to compute gradients of the total loss with respect to the model parameters:
# Suppose we have a list of word variables x_list.
rnn.reset_state()
model.cleargrads()
loss = compute_loss(x_list)
loss.backward()
optimizer.update()
Or equivalently we can use the compute_loss
as a loss function:
rnn.reset_state()
optimizer.update(compute_loss, x_list)
Truncate the Graph by Unchaining
Learning from very long sequences is also a typical use case of recurrent nets. Suppose the input and state sequence is too long to fit into memory. In such cases, we often truncate the backpropagation into a short time range. This technique is called truncated backprop. It is heuristic, and it makes the gradients biased. However, this technique works well in practice if the time range is long enough.
How can we implement truncated backprop in Chainer? Chainer has a smart mechanism to achieve truncation, called backward unchaining. It is implemented in the Variable.unchain_backward() method. Backward unchaining starts from the Variable object, and it chops the computation history backwards from the variable. The chopped variables are disposed of automatically (if they are not referenced explicitly from any other user object). As a result, they are no longer a part of the computation history, and are not involved in backprop anymore.
Let’s write an example of truncated backprop. Here we use the same network as the one used in the previous subsection. Suppose we are given a very long sequence, and we want to run backprop truncated at every 30 time steps. We can write truncated backprop using the model defined above:
loss = 0
count = 0
seqlen = len(x_list[1:])

rnn.reset_state()
for cur_word, next_word in zip(x_list, x_list[1:]):
    loss += model(cur_word, next_word)
    count += 1
    if count % 30 == 0 or count == seqlen:
        model.cleargrads()
        loss.backward()
        loss.unchain_backward()
        optimizer.update()
The state is updated in model(), and the losses are accumulated into the loss variable. Every 30 steps, backprop takes place on the accumulated loss. Then, the unchain_backward() method is called, which deletes the computation history backward from the accumulated loss. Note that the last state of model is not lost, since the RNN instance holds a reference to it.
The implementation of truncated backprop is simple, and since there is no complicated trick on it, we can generalize this method to different situations. For example, we can easily extend the above code to use different schedules between backprop timing and truncation length.
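The schedule itself is easy to inspect in isolation. The sketch below is a hypothetical helper (update_steps is not a Chainer function) that applies the same condition as the loop above and lists the step counts at which backprop, unchaining and the parameter update would be triggered.

```python
def update_steps(seqlen, bprop_len=30):
    """Step counts at which the truncated-backprop loop runs an update."""
    return [count for count in range(1, seqlen + 1)
            if count % bprop_len == 0 or count == seqlen]


steps_70 = update_steps(70)        # two full windows plus the remainder
steps_60 = update_steps(60)        # the last window ends exactly at seqlen
steps_short = update_steps(7, 3)   # a shorter truncation length
```

Changing bprop_len, or replacing the condition with a different rule, is one way to experiment with other schedules before touching the training code.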
Network Evaluation without Storing the Computation History
On evaluation of recurrent nets, there is typically no need to store the computation history. While unchaining enables us to walk through sequences of unlimited length with limited memory, it is a bit of a workaround.
As an alternative, Chainer provides an evaluation mode of forward computation which does not store the computation history. This is enabled by just passing the volatile flag to all input variables. Such variables are called volatile variables.
A volatile variable is created by passing volatile='on' at construction:
x_list = [Variable(..., volatile='on') for _ in range(100)] # list of 100 words
loss = compute_loss(x_list)
Note that we cannot call loss.backward() to compute the gradient here, since the volatile variable does not remember the computation history.
Volatile variables are also useful for evaluating feed-forward networks with a reduced memory footprint.
A variable’s volatility can be changed directly by setting the Variable.volatile attribute. This enables us to combine a fixed feature extractor network and a trainable predictor network. For example, suppose we want to train a feed-forward network predictor_func, which is located on top of another fixed pretrained network fixed_func. We want to train predictor_func without storing the computation history for fixed_func. This can be done with the following code snippet (suppose x_data and y_data indicate input data and label, respectively):
x = Variable(x_data, volatile='on')
feat = fixed_func(x)
feat.volatile = 'off'
y = predictor_func(feat)
y.backward()
At first, the input variable x is volatile, so fixed_func is executed in volatile mode, i.e. without memorizing the computation history. Then the intermediate variable feat is manually set to non-volatile, so predictor_func is executed in non-volatile mode, i.e., with memorizing the history of computation. Since the history of computation is only memorized between the variables feat and y, the backward computation stops at the feat variable.
Warning
It is not allowed to mix volatile and non-volatile variables as arguments to the same function. If you want to create a variable that behaves like a non-volatile variable but can be mixed with volatile ones, use the 'auto' flag instead of the 'off' flag.
Making it with Trainer
The code above is written with the plain Function/Variable APIs. When we write a training loop, it is better to use Trainer, since we can then easily add functionality through extensions.
Before implementing it on Trainer, let’s clarify the training settings. Here we use the Penn Treebank dataset as a set of sentences. Each sentence is represented as a word sequence. We concatenate all sentences into one long word sequence, in which each sentence is separated by a special word <eos>, which stands for “End of Sequence”. This dataset is easily obtained by chainer.datasets.get_ptb_words(). This function returns train, validation, and test datasets, each of which is represented as a long array of integers. Each integer represents a word ID.
Our task is to learn a recurrent neural net language model from the long word sequence. We use words in different locations to form minibatches. It means we maintain \(B\) indices pointing to different locations in the sequence, read from these indices at each iteration, and increment all indices after the read. Of course, when one index reaches the end of the whole sequence, we turn the index back to 0.
In order to implement this training procedure, we have to customize the following components of Trainer:
 Iterator. Built-in iterators do not support reading from different locations and aggregating them into a minibatch.
 Update function. The default update function does not support truncated BPTT.
When we write a dataset iterator dedicated to the dataset, the dataset implementation can be arbitrary; even the interface is not fixed. On the other hand, the iterator must support the Iterator interface. The important methods and attributes to implement are batch_size, epoch, epoch_detail, is_new_epoch, iteration, __next__, and serialize. The following code is from the official example in the examples/ptb directory.
from __future__ import division

class ParallelSequentialIterator(chainer.dataset.Iterator):
    def __init__(self, dataset, batch_size, repeat=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.epoch = 0
        self.is_new_epoch = False
        self.repeat = repeat
        self.offsets = [i * len(dataset) // batch_size for i in range(batch_size)]
        self.iteration = 0

    def __next__(self):
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            raise StopIteration
        cur_words = self.get_words()
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        return self.iteration * self.batch_size / len(self.dataset)

    def get_words(self):
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)

train_iter = ParallelSequentialIterator(train, 20)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
Although the code is slightly long, the idea is simple. First, this iterator creates offsets pointing to positions equally spaced within the whole sequence. The i-th example of each minibatch is read from the sequence at the i-th offset. The iterator returns a list of tuples of the current words and the next words. Each minibatch is converted to a tuple of integer arrays by the concat_examples function in the standard updater (see the previous tutorial).
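The offset arithmetic can be checked on a toy sequence. The sketch below reuses the two expressions from the iterator on a made-up dataset of ten word IDs with a batch size of 2; get_words and minibatch are stand-alone helper functions written for this illustration, not methods of the class above.

```python
dataset = list(range(10))   # toy "word IDs" 0..9
batch_size = 2
offsets = [i * len(dataset) // batch_size for i in range(batch_size)]


def get_words(iteration):
    """Words read at the given iteration, one per offset."""
    return [dataset[(offset + iteration) % len(dataset)] for offset in offsets]


def minibatch(iteration):
    """Pairs of (current word, next word), as returned by __next__."""
    return list(zip(get_words(iteration), get_words(iteration + 1)))


first = minibatch(0)
wrapped = minibatch(4)   # the second reader wraps around to the start
```

The two readers start at positions 0 and 5 and advance in lockstep, so each minibatch holds one (current, next) pair per reader.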
Backprop Through Time is implemented as follows.
def update_bptt(updater):
    loss = 0
    for i in range(35):
        batch = train_iter.__next__()
        x, t = chainer.dataset.concat_examples(batch)
        loss += model(chainer.Variable(x), chainer.Variable(t))

    model.cleargrads()
    loss.backward()
    loss.unchain_backward()  # truncate
    optimizer.update()

updater = training.StandardUpdater(train_iter, optimizer, update_bptt)
In this case, we update the parameters on every 35 consecutive words. The call of unchain_backward cuts the history of computation accumulated to the LSTM links. The rest of the code for setting up Trainer is almost the same as the one given in the previous tutorial.
In this section we have demonstrated how to write recurrent nets in Chainer and some fundamental techniques to manage the history of computation (a.k.a. computational graph). The example in the examples/ptb directory implements truncated backprop learning of an LSTM language model from the Penn Treebank corpus. In the next section, we will review how to use GPU(s) in Chainer.
Using GPU(s) in Chainer
In this section, you will learn about the following things:
 Relationship between Chainer and CuPy
 Basics of CuPy
 Single-GPU usage of Chainer
 Multi-GPU usage of model-parallel computing
 Multi-GPU usage of data-parallel computing
After reading this section, you will be able to:
 Use Chainer on a CUDA-enabled GPU
 Write model-parallel computing in Chainer
 Write data-parallel computing in Chainer
Relationship between Chainer and CuPy
Note
As of the release of v1.3.0, Chainer changes its GPU backend from PyCUDA to CuPy. CuPy covers all features of PyCUDA used by Chainer, though their interfaces are not compatible.
Chainer uses CuPy as its backend for GPU computation. In particular, the cupy.ndarray class is the GPU array implementation for Chainer. CuPy supports a subset of features of NumPy with a compatible interface. It enables us to write common code for CPU and GPU. It also supports PyCUDA-like user-defined kernel generation, which enables us to write fast implementations dedicated to GPU.
Note
The chainer.cuda module imports many important symbols from CuPy. For example, the cupy namespace is referred to as cuda.cupy in the Chainer code. Note that the chainer.cuda module can be imported even if CUDA is not installed.
Chainer uses a memory pool for GPU memory allocation. As shown in the previous sections, Chainer constructs and destructs many arrays during learning and evaluating iterations. This is not well suited to the CUDA architecture, since memory allocation and release in CUDA (i.e. the cudaMalloc and cudaFree functions) synchronize CPU and GPU computations, which hurts performance. In order to avoid memory allocation and deallocation during the computation, Chainer uses CuPy’s memory pool as the standard memory allocator. Chainer changes the default allocator of CuPy to the memory pool, so users can use functions of CuPy directly without dealing with the memory allocator.
Basics of cupy.ndarray
Note
CuPy does not require explicit initialization, so cuda.init()
function is deprecated.
CuPy is a GPU array backend that implements a subset of the NumPy interface. The cupy.ndarray class is at its core, which is a compatible GPU alternative of numpy.ndarray. CuPy implements many functions on cupy.ndarray objects. See the reference for the supported subset of the NumPy API. Understanding NumPy helps in using most features of CuPy. See the NumPy documentation for learning it.
The main difference of cupy.ndarray from numpy.ndarray is that the content is allocated on the device memory. The allocation takes place on the current device by default. The current device can be changed by a cupy.cuda.Device object as follows:
with cupy.cuda.Device(1):
    x_on_gpu1 = cupy.array([1, 2, 3, 4, 5])
Most operations of CuPy are done on the current device. Be careful: processing an array on a non-current device causes an error.
Chainer provides some convenient functions to automatically switch and choose the device. For example, the chainer.cuda.to_gpu() function copies a numpy.ndarray object to a specified device:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
x_gpu = cuda.to_gpu(x_cpu, device=1)
It is equivalent to the following code using CuPy:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
with cupy.cuda.Device(1):
    x_gpu = cupy.array(x_cpu)
Moving a device array to the host can be done by chainer.cuda.to_cpu()
as follows:
x_cpu = cuda.to_cpu(x_gpu)
It is equivalent to the following code using CuPy:
with x_gpu.device:
    x_cpu = x_gpu.get()
Note
The with statements in these code snippets are required to select the appropriate CUDA device. If you use only one device, this device switching is not needed. The chainer.cuda.to_cpu() and chainer.cuda.to_gpu() functions automatically switch the current device correctly.
Chainer also provides the convenient functions chainer.cuda.get_device_from_id() and chainer.cuda.get_device_from_array() to select a device. The former function accepts an integer or None. When None is given, it returns a dummy device object. Otherwise, it returns a corresponding device object. The latter function accepts a CuPy array or a NumPy array. When a NumPy array is given, it returns a dummy device object. Otherwise, it returns the device object corresponding to the given CuPy array. The dummy device object also supports with statements like the above example but does nothing.
Here are some other examples:
cuda.get_device_from_id(1).use()
x_gpu1 = cupy.empty((4, 3), dtype='f')  # 'f' indicates float32

with cuda.get_device_from_id(1):
    x_gpu1 = cupy.empty((4, 3), dtype='f')

with cuda.get_device_from_array(x_gpu1):
    y_gpu1 = x_gpu1 + 1
Since it accepts NumPy arrays, we can write a function that accepts both NumPy and CuPy arrays with correct device switching:
def add1(x):
    with cuda.get_device_from_array(x):
        return x + 1
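This pattern works for CPU arrays because the dummy device is a context manager that does nothing. The following is a minimal pure-Python sketch of that idea; the DummyDevice class and the device argument are hypothetical stand-ins written for this illustration, not the objects Chainer actually returns.

```python
class DummyDevice(object):
    """Hypothetical no-op device: entering and leaving change nothing."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False   # never suppress exceptions


def add1(x, device=DummyDevice()):
    # With a real GPU device the block would switch the current device;
    # with the dummy device it simply runs the body.
    with device:
        return x + 1


result = add1(41)
```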
The compatibility of CuPy with NumPy enables us to write CPU/GPU generic code. It can be made easy by the chainer.cuda.get_array_module() function. This function returns the numpy or cupy module based on its arguments. A CPU/GPU generic function is defined using it as follows:
# Stable implementation of log(1 + exp(x))
def softplus(x):
    xp = cuda.get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
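The stable form rests on the identity log(1 + exp(x)) = max(0, x) + log(1 + exp(-|x|)), in which the exponential never receives a positive argument. The scalar sketch below uses only the standard math module to compare it with the direct formula; softplus_naive, softplus_stable and overflows are names made up for this illustration.

```python
import math


def softplus_naive(x):
    """Direct formula; exp(x) overflows for large positive x."""
    return math.log(1.0 + math.exp(x))


def softplus_stable(x):
    """Rewritten so that exp() only ever sees a non-positive argument."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))


def overflows(fn, x):
    """Return True if fn(x) raises OverflowError."""
    try:
        fn(x)
    except OverflowError:
        return True
    return False


# The two forms agree wherever the direct formula is usable.
max_diff = max(abs(softplus_stable(x) - softplus_naive(x))
               for x in [-5.0, 0.0, 3.0])
large = softplus_stable(1000.0)
```

For large inputs the stable form returns x itself, while the direct formula fails inside exp().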
Run Neural Networks on a Single GPU
Single-GPU usage is very simple. All you have to do is transfer the Link and input arrays to the GPU beforehand. In this subsection, the code is based on our first MNIST example in this tutorial.
A Link object can be transferred to the specified GPU using the to_gpu() method.
This time, we make the number of input, hidden, and output units configurable.
The to_gpu() method also accepts a device ID like model.to_gpu(0). In this case, the link object is transferred to the appropriate GPU device. The current device is used by default.
If we use chainer.training.Trainer
, what we have to do is just let the updater know the device ID to send each minibatch.
updater = training.StandardUpdater(train_iter, optimizer, device=0)
trainer = training.Trainer(updater, (20, 'epoch'), out='result')
We also have to specify the device ID for the evaluator extension.
trainer.extend(extensions.Evaluator(test_iter, model, device=0))
When we write down the training loop by hand, we have to transfer each minibatch to the GPU manually:
model.to_gpu()
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(cuda.to_gpu(x_train[indexes[i : i + batchsize]]))
        t = Variable(cuda.to_gpu(y_train[indexes[i : i + batchsize]]))
        optimizer.update(model, x, t)
Model-parallel Computation on Multiple GPUs
Parallelization of machine learning is roughly classified into two types called “model-parallel” and “data-parallel”. Model-parallel means parallelization of the computations inside the model. In contrast, data-parallel means parallelization using data sharding. In this subsection, we show how to use the model-parallel approach on multiple GPUs in Chainer.
Recall the MNIST example. Now suppose that we want to modify this example by expanding the network to 6 layers with 2000 units each using two GPUs. In order to make multi-GPU computation efficient, we only make the two GPUs communicate at the third and sixth layer. The overall architecture looks like the following diagram:
(GPU0) input --+--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+--> output
               |                       |                       |
(GPU1)         +--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+
Using the MLP chain defined above, the same architecture can be drawn as the following diagram:
(GPU0) input --+--> mlp1 --+--> mlp2 --+--> output
               |           |           |
(GPU1)         +--> mlp1 --+--> mlp2 --+
Let’s write a link for the whole network.
class ParallelMLP(Chain):
    def __init__(self):
        super(ParallelMLP, self).__init__(
            # the input size, 784, is inferred
            mlp1_gpu0=MLP(1000, 2000).to_gpu(0),
            mlp1_gpu1=MLP(1000, 2000).to_gpu(1),

            # the input size, 2000, is inferred
            mlp2_gpu0=MLP(1000, 10).to_gpu(0),
            mlp2_gpu1=MLP(1000, 10).to_gpu(1),
        )

    def __call__(self, x):
        # assume x is on GPU 0
        z0 = self.mlp1_gpu0(x)
        z1 = self.mlp1_gpu1(F.copy(x, 1))

        # sync
        h0 = F.relu(z0 + F.copy(z1, 0))
        h1 = F.relu(z1 + F.copy(z0, 1))

        y0 = self.mlp2_gpu0(h0)
        y1 = self.mlp2_gpu1(h1)

        # sync
        y = y0 + F.copy(y1, 0)
        return y  # output is on GPU0
Recall that the Link.to_gpu() method returns the link itself. The copy() function copies an input variable to the specified GPU device and returns a new variable on the device. The copy supports backprop, which just transfers an output gradient back to the input device.
Note
The above code is not parallelized on the CPU, but it is parallelized on the GPUs. This is because all the functions in the above code run asynchronously to the host CPU.
An almost identical example code can be found at examples/mnist/train_mnist_model_parallel.py.
Data-parallel Computation on Multiple GPUs with Trainer
Data-parallel computation is another strategy to parallelize online processing. In the context of neural networks, it means that a different device does computation on a different subset of the input data. In this subsection, we review the way to achieve data-parallel learning on two GPUs.
Suppose again our task is the MNIST example. This time we want to directly parallelize the three-layer network. The simplest form of data-parallelization is parallelizing the gradient computation for a distinct set of data. First, define a model and optimizer instances:
model = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
optimizer = optimizers.SGD()
optimizer.setup(model)
Recall that the MLP link implements the multi-layer perceptron, and the Classifier link wraps it to provide a classifier interface. We used StandardUpdater in the previous example. In order to enable data-parallel computation with multiple GPUs, we only have to replace it with ParallelUpdater.
updater = training.ParallelUpdater(train_iter, optimizer,
                                   devices={'main': 0, 'second': 1})
The devices
option specifies which devices to use in dataparallel learning.
The device with name 'main'
is used as the main device.
The original model is sent to this device, so the optimization runs on the main device.
In the above example, the model is also cloned and sent to GPU 1.
Half of each minibatch is fed to this cloned model.
After every backward computation, the gradient is accumulated into the main device, the parameter update runs on it, and then the updated parameters are sent to GPU 1 again.
See also the example code in examples/mnist/train_mnist_data_parallel.py.
Data-parallel Computation on Multiple GPUs without Trainer
Here we introduce a way to write data-parallel computation without the help of Trainer.
Most users can skip this section.
If you are interested in how to write data-parallel computation by yourself, this section should be informative.
It is also helpful if you want to, e.g., customize the ParallelUpdater class.
We again start from the MNIST example.
This time, we use suffixes like _0 and _1 to distinguish objects on each device.
First, we define a model.
model_0 = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
We want to make two copies of this instance on different GPUs.
The Link.to_gpu() method runs in place, so we cannot use it to make a copy.
In order to make a copy, we can use copy.deepcopy(), which duplicates both the link hierarchy and the arrays it holds.
import copy
model_1 = copy.deepcopy(model_0)
model_0.to_gpu(0)
model_1.to_gpu(1)
By contrast, the Link.copy() method copies the link into another instance, but it just copies the link hierarchy and does not copy the arrays it holds.
Then, set up an optimizer:
optimizer = optimizers.SGD()
optimizer.setup(model_0)
Here we use the first copy of the model as the master model.
Before its update, gradients of model_1 must be aggregated to those of model_0.
Then, we can write a data-parallel learning loop as follows:
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x_batch = x_train[indexes[i : i + batchsize]]
        y_batch = y_train[indexes[i : i + batchsize]]

        # first half of the minibatch goes to GPU 0, second half to GPU 1
        x0 = Variable(cuda.to_gpu(x_batch[:batchsize//2], 0))
        t0 = Variable(cuda.to_gpu(y_batch[:batchsize//2], 0))
        x1 = Variable(cuda.to_gpu(x_batch[batchsize//2:], 1))
        t1 = Variable(cuda.to_gpu(y_batch[batchsize//2:], 1))

        loss_0 = model_0(x0, t0)
        loss_1 = model_1(x1, t1)

        model_0.cleargrads()
        model_1.cleargrads()

        loss_0.backward()
        loss_1.backward()

        model_0.addgrads(model_1)
        optimizer.update()

        model_1.copyparams(model_0)
Do not forget to clear the gradients of both model copies!
One half of the minibatch is forwarded to GPU 0, the other half to GPU 1.
Then the gradients are accumulated by the Link.addgrads() method.
This method adds the gradients of a given link to those of the link it is called on.
After the gradients are prepared, we can update the optimizer in the usual way.
Note that the update only modifies the parameters of model_0.
So we must manually copy them to model_1 using the Link.copyparams() method.
Note
If the batch size used in each model remains the same, the scale of the gradient is roughly proportional to the number of models when we aggregate gradients from all models by chainer.Link.addgrads(). So you need to adjust the batch size and/or learning rate of the optimizer accordingly.
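The scaling effect can be checked on a toy problem. The following is a minimal sketch in plain Python, assuming a one-parameter model and a mean-squared-error loss; the helper `mean_sq_loss_grad` and the data are invented for illustration and are not part of Chainer. Each "model" averages the loss over its own half of the data, and the two gradients are then added, as addgrads does.

```python
def mean_sq_loss_grad(w, xs, ts):
    """Gradient w.r.t. w of mean((w * x - t) ** 2) over the given batch."""
    n = len(xs)
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / n


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 3.0, 5.0, 7.0]

# Gradient of the loss averaged over the full batch on a single model.
g_full = mean_sq_loss_grad(w, xs, ts)

# Two model copies, each seeing one half and averaging over that half only.
g_0 = mean_sq_loss_grad(w, xs[:2], ts[:2])
g_1 = mean_sq_loss_grad(w, xs[2:], ts[2:])

# Adding the gradients of the second copy to those of the first.
g_sum = g_0 + g_1
```

With two equally sized halves, `g_sum` is twice `g_full`, which is why halving the learning rate (or otherwise rescaling) may be needed to match single-device training.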
Now you can use Chainer with GPUs.
All examples in the examples directory support GPU computation, so please refer to them if you want to learn more about using GPUs in practice.
In the next section, we will show how to define a differentiable (i.e. backpropable) function on Variable objects.
We will also show there how to write a simple (elementwise) CUDA kernel using Chainer’s CUDA utilities.
Define your own function
In this section, you will learn about the following things:
 How to define a function on variables
 Useful tools to write a function using a GPU
 How to test the function definition
After reading this section, you will be able to:
 Write your own functions
 Define simple kernels in the function definition
Differentiable Functions
Chainer provides a collection of functions in the functions module.
It covers typical use cases in deep learning, so many existing works can be implemented with them.
On the other hand, deep learning is evolving rapidly and we cannot cover all possible functions to define unseen architectures.
So it is important to learn how to define your own functions.
First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\).
While it is possible to implement this equation using a combination of the * and + functions, defining it as a single function may reduce memory consumption, so it is not only a toy example.
Here we call this function MulAdd.
Let’s start with defining MulAdd working on the CPU.
Any function must inherit the Function class.
The skeleton of a function looks like:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        # do forward computation on CPU
        return some_tuple

    def backward_cpu(self, inputs, grad_outputs):
        # do backward computation on CPU
        return some_tuple
We must implement the forward_cpu() and backward_cpu() methods.
The non-self arguments of these functions are tuples of array(s), and these functions must return a tuple of array(s).
Warning
Be careful to return a tuple of arrays even if you have just one array to return.
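In Python it is the trailing comma, not the parentheses, that makes a one-element tuple. A minimal illustration in plain Python, where the float `w` merely stands in for an output array:

```python
w = 3.0  # stand-in for an output array

as_tuple = (w,)   # one-element tuple: the form forward/backward should return
not_tuple = (w)   # just w itself; the parentheses have no effect

is_tuple = isinstance(as_tuple, tuple)
is_not_tuple = not isinstance(not_tuple, tuple)
```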
MulAdd is simple and implemented as follows:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_cpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
As per the warning above, the forward_cpu method returns a tuple with a single element.
Note that all arrays appearing in CPU functions are numpy.ndarray.
The forward function is straightforward:
It unpacks the input tuple, computes the output, and packs it into a tuple.
The backward function is a bit more complicated.
Recall the rule of differentiation of multiplication.
This example just implements the rule.
Looking at the return values, the function just packs the gradient of each input in the same order and returns them.
Given just the core computation of forward and backward, the Function class provides the chaining logic on top of it (i.e. storing the history of computation, etc.).
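Written out for the elementwise function \(w = x y + z\), the backward rule used above follows from the chain rule, where \(g_w\) denotes the gradient arriving from the output (gw in the code):

```latex
\[
\frac{\partial w}{\partial x} = y, \qquad
\frac{\partial w}{\partial y} = x, \qquad
\frac{\partial w}{\partial z} = 1
\quad\Longrightarrow\quad
g_x = y \, g_w, \qquad
g_y = x \, g_w, \qquad
g_z = g_w .
\]
```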
Note
Assume we implement a (forward) function \(y=f(x)\) which takes as input the vector \(x \in \mathbb{R}^n\) and produces as output a vector \(y \in \mathbb{R}^m\). Then the backward method has to compute
\[\lambda_i = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i} \, \gamma_j \quad \text{for } i = 1 \dots n\]
where \(\gamma\) is the grad_outputs. Note that the resulting vector \(\lambda\) must have the same shape as the arguments of the forward method.
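For a linear map, the quantity that backward has to return can be written down directly. The following is a minimal sketch in plain Python (lists instead of arrays; `forward` and `backward` here are stand-alone helper functions invented for illustration, not Chainer methods). For \(y = Ax\) we have \(\partial y_j / \partial x_i = A_{ji}\), so each entry \(\lambda_i\) is the sum over \(j\) of \(A_{ji} \gamma_j\):

```python
def forward(A, x):
    """y = A x, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def backward(A, gamma):
    """lambda_i = sum_j (dy_j / dx_i) * gamma_j, where dy_j / dx_i = A[j][i]."""
    m, n = len(A), len(A[0])
    return [sum(A[j][i] * gamma[j] for j in range(m)) for i in range(n)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # m = 2 outputs, n = 3 inputs
x = [1.0, 0.0, -1.0]
gamma = [1.0, 10.0]     # plays the role of grad_outputs

y = forward(A, x)
lam = backward(A, gamma)
```

Here `lam` has three entries, matching the shape of the input `x` rather than that of the output `y`.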
Now let’s define the corresponding GPU methods.
You can easily predict that the methods we have to write are named forward_gpu() and backward_gpu():
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
In GPU methods, arrays are of type cupy.ndarray.
We use arithmetic operators defined for this class.
These operators implement the basic elementwise arithmetic.
You may find that the definitions of the GPU methods are exactly the same as those of the CPU methods.
In that case, we can reduce them to forward() and backward() methods:
class MulAdd(Function):
    def forward(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
Since the cupy.ndarray class implements many methods of numpy.ndarray, we can write these unified methods in most cases.
The MulAdd function is used as follows:
x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w = MulAdd()(x, y, z)
It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:
def muladd(x, y, z):
    return MulAdd()(x, y, z)

w = muladd(x, y, z)
Unified forward/backward methods with NumPy/CuPy functions
CuPy also implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Consider that we want to write a backpropable function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:
class ExpAdd(Function):
    def forward_cpu(self, inputs):
        x, y = inputs
        z = np.exp(x) + np.exp(y)
        return z,

    def backward_cpu(self, inputs, grad_outputs):
        x, y = inputs
        gz, = grad_outputs

        gx = gz * np.exp(x)
        gy = gz * np.exp(y)
        return gx, gy

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y = inputs
        z = cupy.exp(x) + cupy.exp(y)
        return z,

    def backward_gpu(self, inputs, grad_outputs):
        cupy = cuda.cupy
        x, y = inputs
        gz, = grad_outputs

        gx = gz * cupy.exp(x)
        gy = gz * cupy.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)
Note
Here we used cuda.cupy instead of directly accessing cupy.
This is because the cupy module cannot be imported if CUDA is not installed.
In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy module.
Note that the chainer.cuda module can be imported even if CUDA is not installed.
Of course, the module in such an environment is almost useless, but if the interpreter does not run through the code accessing CUDA-dedicated functions, the code is still valid.
The CPU and GPU implementations are almost the same, except that numpy is replaced by cupy in the GPU methods.
We can unify these functions using the cuda.get_array_module() function.
This function accepts an arbitrary number of arrays, and returns an appropriate module for them.
See the following code:
class ExpAdd(Function):
    def forward(self, inputs):
        xp = cuda.get_array_module(*inputs)
        x, y = inputs
        z = xp.exp(x) + xp.exp(y)
        return z,

    def backward(self, inputs, grad_outputs):
        xp = cuda.get_array_module(*inputs)
        x, y = inputs
        gz, = grad_outputs

        gx = gz * xp.exp(x)
        gy = gz * xp.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)
Note that this code works correctly even if CUDA is not installed in the environment.
If CUDA is not found, the get_array_module function always returns numpy.
We often use the name xp for the variadic module name, which is analogous to the abbreviation np for NumPy and cp for CuPy.
Write an Elementwise Kernel Function
Let’s turn back to the MulAdd example.
The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores. However, it invokes two kernels during each of the forward and backward computations. This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes much bandwidth. We can reduce the number of invocations by defining our own kernel. It also reduces the memory consumption.
Most functions only require elementwise operations like MulAdd.
CuPy provides a useful tool to define elementwise kernels, the cupy.elementwise.ElementwiseKernel class, and Chainer wraps it with the cuda.elementwise() function.
Our MulAdd implementation can be improved as follows:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'float32 x, float32 y, float32 z',
            'float32 w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'float32 x, float32 y, float32 gw',
            'float32 gx, float32 gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz
The cuda.elementwise() function accepts the essential implementation of the kernel function, and returns a kernel invocation function (actually, it returns an ElementwiseKernel object, which is callable).
In typical usage, we pass four arguments to this function as follows:
 Input argument list. This is a comma-separated string, each entry of which consists of a type specification and an argument name.
 Output argument list in the same format as the input argument list.
 Body of parallel loop. We can use the input/output argument names as an element of these arrays.
 Name of the kernel function, which is shown in debuggers and profilers.
The above code is not compiled on every forward/backward computation thanks to two caching mechanisms provided by cuda.elementwise().
The first one is binary caching:
The cuda.elementwise() function caches the compiled binary in the $(HOME)/.cupy/kernel_cache directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value.
This caching mechanism is actually implemented in CuPy.
The second one is upload caching:
Given a compiled binary code, we have to upload it to the current GPU in order to execute it.
The cuda.elementwise() function memoizes the arguments and the current device, and if it is called with the same arguments for the same device, it reuses the previously uploaded kernel code.
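The memoization idea behind upload caching can be sketched in plain Python. This is a toy model under stated assumptions, not Chainer's or CuPy's implementation: `get_kernel`, `_compile_and_upload`, and the explicit `device_id` argument are all hypothetical names, and the "kernel" is just a tuple.

```python
_uploaded = {}       # (kernel arguments, device id) -> kernel object
compile_calls = []   # records every simulated compile-and-upload


def _compile_and_upload(args, device_id):
    """Stand-in for the expensive step; real code would build a GPU kernel."""
    compile_calls.append((args, device_id))
    return ('kernel', args, device_id)


def get_kernel(in_params, out_params, operation, name, device_id):
    """Return the cached kernel for these arguments on this device."""
    key = ((in_params, out_params, operation, name), device_id)
    if key not in _uploaded:
        _uploaded[key] = _compile_and_upload(key[0], device_id)
    return _uploaded[key]


# Same arguments, same device: the second call reuses the first result.
k1 = get_kernel('T x, T y, T z', 'T w', 'w = x * y + z', 'muladd_fwd', 0)
k2 = get_kernel('T x, T y, T z', 'T w', 'w = x * y + z', 'muladd_fwd', 0)
# Same arguments, different device: a separate upload is needed.
k3 = get_kernel('T x, T y, T z', 'T w', 'w = x * y + z', 'muladd_fwd', 1)
```

Keying the cache on both the arguments and the device is the design point: an uploaded kernel is only valid on the device it was uploaded to.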
The above MulAdd code only works for float32 arrays.
The ElementwiseKernel also supports type-variadic kernel definitions.
In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'T x, T y, T z',
            'T w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'T x, T y, T gw',
            'T gx, T gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz
The type placeholder T indicates an arbitrary data type that CuPy supports.
CuPy offers more functionality for user-defined kernels. See the CuPy documentation on user-defined kernels for more details.
Links that wrap functions
Some functions are meant to be combined with parameters.
In such cases, it is useful to write a small link that wraps the function.
We have already seen how to define a chain that wraps other links (by inheriting the Chain class).
Here we study how to define a link that does not hold any other links.
As the first example, suppose that we want to implement an elementwise product function between the input array and the parameter array. It can be defined as follows:
class EltwiseParamProduct(Link):
    def __init__(self, shape):
        # By passing a shape of the parameter, the initializer allocates a
        # parameter variable of the shape.
        super(EltwiseParamProduct, self).__init__(W=shape)
        self.W.data[...] = np.random.randn(*shape)

    def __call__(self, x):
        return self.W * x
We can also add the parameter after the link is initialized, using the Link.add_param() method:
class EltwiseParamProduct(Link):
    def __init__(self, shape):
        super(EltwiseParamProduct, self).__init__()
        self.add_param('W', shape)
        self.W.data[...] = np.random.randn(*shape)

    def __call__(self, x):
        return self.W * x
Note that the initializer and the add_param() method do not initialize elements of the parameter array.
We have to manually initialize the elements by random values, zeros, etc.
For another example, assume we want to define a simple linear layer.
It is already defined as Linear, so this is an educational example.
The linear layer is divided into two parts: a function and its wrapper link.
First, we have to define a function on variables:
class LinearFunction(Function):
    def forward(self, inputs):
        x, W, b = inputs
        return x.dot(W.T) + b,

    def backward(self, inputs, grad_outputs):
        x, W, b = inputs
        gy, = grad_outputs

        gx = gy.dot(W)
        gW = gy.T.dot(x)
        gb = gy.sum(axis=0)
        return gx, gW, gb

def linear(x, W, b):
    return LinearFunction()(x, W, b)
This function takes three arguments: input, weight, and bias. It can be used as a part of a model definition, though it is inconvenient since the user has to manage the weight and bias parameters directly. In order to make a convenient module, let’s wrap it into a link:
import math

class Linear(Link):
    def __init__(self, in_size, out_size):
        super(Linear, self).__init__(W=(out_size, in_size), b=out_size)
        # scale the random weights by 1 / sqrt(in_size); start with zero bias
        self.W.data[...] = np.random.randn(out_size, in_size) / math.sqrt(in_size)
        self.b.data.fill(0)

    def __call__(self, x):
        return linear(x, self.W, self.b)
This link hides the parameters of the linear layer.
Note
An advanced tip to implement functions: if you want to preserve some information between forward and backward computations (e.g. to cache some arrays), you can store it as attributes. Be careful that it might increase the memory consumption during the whole forward-backward computation. If you want to train very large networks on a GPU with limited memory, it is not recommended to cache arrays between forward and backward. There is one exception for this: caching the output arrays does not change the memory consumption, because they are also held by the output Variable objects.
Warning
You should not assume a one-to-one match of calls of forward and backward. Some users may call backward more than once after one forward call.
Testing Function
In order to isolate the cause of learning failure from implementation bugs, it is important to test function implementations.
Chainer provides simple utilities to help with writing unit tests.
They are defined in the gradient_check module.
The most important test utility is the numerical_grad() function.
This function computes the numerical gradient of a given function using finite differences.
It can be used as follows:
x = np.random.randn(4, 3).astype(np.float32)
gy = np.ones((4, 3), dtype=np.float32)
f = lambda: (x * x,)
gx = gradient_check.numerical_grad(f, (x,), (gy,))
f is a closure that returns a tuple of array(s) computed from input arrays.
The second and third arguments of numerical_grad() are tuples of input arrays and output gradient arrays, respectively.
The code above computes the numerical gradients of sum(f(x)), where sum indicates the summation over all elements.
The summation can be weighted by changing gy.
The numerical_grad() function also accepts an additional eps argument, which indicates the quantization width (step size) of the finite differences.
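The idea behind such a numerical gradient can be sketched in plain Python with central differences. This is only an illustration under simplifying assumptions: `numerical_grad_sketch` is a hypothetical helper, not Chainer's implementation, it works on flat lists of floats, and unlike numerical_grad() it takes f as a function of the input rather than as a closure.

```python
def numerical_grad_sketch(f, xs, gys, eps=1e-3):
    """Central-difference gradient of sum(gy * f(xs)) w.r.t. each entry of xs.

    f takes a list of floats and returns a list of floats.
    """
    grads = []
    for i in range(len(xs)):
        orig = xs[i]
        xs[i] = orig + eps
        y_plus = f(xs)
        xs[i] = orig - eps
        y_minus = f(xs)
        xs[i] = orig  # restore the perturbed entry
        diff = sum(gy * (yp - ym) for gy, yp, ym in zip(gys, y_plus, y_minus))
        grads.append(diff / (2 * eps))
    return grads


def square(v):
    return [e * e for e in v]


x = [0.5, -1.0, 2.0]

# Unweighted: gradient of sum(x * x), which is 2 * x.
gx = numerical_grad_sketch(square, x, [1.0, 1.0, 1.0])

# Weighted: gy rescales each output's contribution to the sum.
gx_weighted = numerical_grad_sketch(square, x, [2.0, 0.0, 1.0])
```

Each entry is perturbed by plus and minus `eps` in turn, so the cost grows with the number of input elements; this is acceptable for small test inputs.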
Note
The numerical_grad() function accepts both CPU and GPU arrays.
Note that we cannot mix CPU and GPU arrays.
Another utility is the chainer.testing.assert_allclose() function.
This is similar to the numpy.testing.assert_allclose() function.
The difference is that Chainer’s version accepts CPU and GPU arrays as inputs.
We can mix them in one invocation of chainer.testing.assert_allclose().
The default values of optional arguments are also different.
Here is a typical usage of gradient checking utilities.
This is a test example of the functions.relu() function:
import unittest

from chainer import testing

class TestReLU(unittest.TestCase):
    def test_backward_cpu(self):
        x = Variable(np.random.randn(3, 2).astype(np.float32))
        y = F.relu(x)
        y.grad = np.random.randn(3, 2).astype(np.float32)
        y.backward()

        f = lambda: (F.relu(x).data,)
        gx, = gradient_check.numerical_grad(f, (x.data,), (y.grad,))

        testing.assert_allclose(gx, x.grad)
The first four lines of the test code are a simple forward and backward computation of the ReLU function. The next two lines compute the numerical gradient using the same forward function without the backward routine. Finally, we compare these two results elementwise. Note that the above test code can be easily modified to test the GPU version just by replacing CPU arrays with GPU arrays.
You can find many examples of function tests under the tests/chainer_tests/function_tests directory.
Type check
In this section, you will learn about the following things:
 Basic usage of type check
 Detail of type information
 Internal mechanism of type check
 More complicated cases
 Call functions
 Typical type check example
After reading this section, you will be able to:
 Write code to check the types of input arguments of your own functions
Basic usage of type check
When you call a function with an invalid type of array, you sometimes receive no error, but get an unexpected result by broadcasting. When you use CUDA with an illegal type of array, it can cause memory corruption, and you get a serious error. These bugs are hard to fix. Chainer can check the preconditions of each function, which helps to prevent such problems. These conditions may also help a user to understand the specification of functions.
Each implementation of Function has a method for type check, check_type_forward().
This function is called just before the forward() method of the Function class.
You can override this method to check the conditions on the types and shapes of arguments.
check_type_forward() gets an argument in_types:
def check_type_forward(self, in_types):
    ...
in_types is an instance of TypeInfoTuple, which is a subclass of tuple.
To get type information about the first argument, use in_types[0].
If the function gets multiple arguments, we recommend using new variables for readability:
x_type, y_type = in_types
In this case, x_type represents the type of the first argument, and y_type represents the second one.
We describe usage of in_types with an example.
When you want to check if the number of dimensions of x_type equals 2, write this code:
utils.type_check.expect(x_type.ndim == 2)
When this condition is true, nothing happens. Otherwise this code throws an exception, and the user gets a message like this:
Traceback (most recent call last):
...
InvalidType: Expect: in_types[0].ndim == 2
Actual: 3 != 2
This error message means that “ndim of the first argument is expected to be 2, but it is actually 3”.
Detail of type information
You can access three pieces of information from x_type.
 .shape is a tuple of ints. Each value is the size of the corresponding dimension.
 .ndim is an int value representing the number of dimensions. Note that ndim == len(shape).
 .dtype is a numpy.dtype representing the data type of the value.
You can check all members. For example, to require that the size of the first dimension be positive, you can write:
utils.type_check.expect(x_type.shape[0] > 0)
You can also check data types with .dtype:
utils.type_check.expect(x_type.dtype == np.float64)
And an error is like this:
Traceback (most recent call last):
...
InvalidType: Expect: in_types[0].dtype == <type 'numpy.float64'>
Actual: float32 != <type 'numpy.float64'>
You can also check the kind of dtype.
This code checks if the type is floating point:
utils.type_check.expect(x_type.dtype.kind == 'f')
You can also compare variables with each other. For example, the following code checks if the first argument and the second argument have the same size in the second dimension:
utils.type_check.expect(x_type.shape[1] == y_type.shape[1])
Internal mechanism of type check
How does it show an error message like "in_types[0].ndim == 2"?
If x_type were an object containing an ndim member variable, we could not show such an error message, because the expression would be evaluated to a boolean value by the Python interpreter.
Actually x_type is an Expr object, and doesn’t have an ndim member variable itself.
Expr represents a syntax tree.
x_type.ndim makes an Expr object representing (getattr, x_type, 'ndim').
x_type.ndim == 2 makes an object like (eq, (getattr, x_type, 'ndim'), 2).
type_check.expect() gets an Expr object and evaluates it.
When it is True, it causes no error and shows nothing.
Otherwise, this method shows a readable error message.
If you want to evaluate an Expr object, call the eval() method:
actual_type = x_type.eval()
actual_type is an instance of TypeInfo, while x_type is an instance of Expr.
In the same way, x_type.shape[0].eval() returns an int value.
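The trick of keeping an expression unevaluated so that it can be printed later can be imitated in a few lines of plain Python. The sketch below is a toy, not Chainer's Expr: the classes `ToyExpr`, `ToyComparison`, and `FakeArrayType` are invented for illustration, and only attribute access and `==` are supported.

```python
class ToyExpr(object):
    """Toy lazy expression: remembers how it was written and how to evaluate it."""

    def __init__(self, text, thunk):
        self.text = text    # human-readable form, e.g. "in_types[0].ndim"
        self.thunk = thunk  # zero-argument callable producing the actual value

    def eval(self):
        return self.thunk()

    def __getattr__(self, name):
        # Called only for attributes that are not found normally.
        if name.startswith('_'):
            raise AttributeError(name)
        return ToyExpr('%s.%s' % (self.text, name),
                       lambda: getattr(self.eval(), name))

    def __eq__(self, other):
        # Build a comparison node instead of returning a boolean.
        return ToyComparison(self, other)


class ToyComparison(object):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def check(self):
        """Return None if the condition holds, else a readable message."""
        actual = self.lhs.eval()
        if actual == self.rhs:
            return None
        return 'Expect: %s == %r\nActual: %r != %r' % (
            self.lhs.text, self.rhs, actual, self.rhs)


class FakeArrayType(object):
    ndim = 3


x_type = ToyExpr('in_types[0]', lambda: FakeArrayType())
message = (x_type.ndim == 2).check()
no_message = (x_type.ndim == 3).check()
```

Because `x_type.ndim == 2` builds an object rather than a boolean, the checker still knows the text of the condition when it fails and can report both the expected and the actual value.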
More powerful methods
The Expr class is more powerful.
It supports all mathematical operators such as + and *.
You can write a condition that the first dimension of x_type is the first dimension of y_type times four:
utils.type_check.expect(x_type.shape[0] == y_type.shape[0] * 4)
When x_type.shape[0] == 3 and y_type.shape[0] == 1, users can get the error message below:
Traceback (most recent call last):
...
InvalidType: Expect: in_types[0].shape[0] == in_types[1].shape[0] * 4
Actual: 3 != 4
To compare against a member variable of your function, wrap the value with Variable to show a readable error message:
x_type.shape[0] == utils.type_check.Variable(self.in_size, "in_size")
This code checks a condition equivalent to the one below:
x_type.shape[0] == self.in_size
However, the latter condition doesn’t know the meaning of this value. When this condition is not satisfied, the latter code shows an unreadable error message:
InvalidType: Expect: in_types[0].shape[0] == 4 # what does '4' mean?
Actual: 3 != 4
Note that the second argument of utils.type_check.Variable is only for readability.
The former shows this message:
InvalidType: Expect: in_types[0].shape[0] == in_size # OK, `in_size` is a value that is given to the constructor
Actual: 3 != 4 # You can also check actual value here
Call functions
How can we check the sum of all values of shape?
Expr also supports function calls:
sum = utils.type_check.Variable(np.sum, 'sum')
utils.type_check.expect(sum(x_type.shape) == 10)
Why do we need to wrap the function numpy.sum with utils.type_check.Variable?
x_type.shape is not a tuple but an object of Expr as we have seen before.
Therefore, numpy.sum(x_type.shape) fails.
We need to evaluate this function lazily.
The above example produces an error message like this:
Traceback (most recent call last):
...
InvalidType: Expect: sum(in_types[0].shape) == 10
Actual: 7 != 10
More complicated cases
How can we write a more complicated condition that can’t be written with these operators?
You can evaluate Expr and get its result value with the eval() method.
Then check the condition and raise the error by hand:
x_shape = x_type.shape.eval() # get actual shape (int tuple)
if not more_complicated_condition(x_shape):
    expect_msg = 'Shape is expected to be ...'
    actual_msg = 'Shape is ...'
    raise utils.type_check.InvalidType(expect_msg, actual_msg)
Please write a readable error message. This code generates the following error message:
Traceback (most recent call last):
...
InvalidType: Expect: Shape is expected to be ...
Actual: Shape is ...
Typical type check example
We show a typical type check for a function.
First check the number of arguments:
utils.type_check.expect(in_types.size() == 2)
in_types.size() returns an Expr object representing the number of arguments.
You can check it in the same way.
And then, get each type:
x_type, y_type = in_types
Don’t get each value before checking in_types.size().
When the number of arguments is illegal, type_check.expect might output unhelpful error messages.
For example, this code doesn’t work when the size of in_types is 0:
utils.type_check.expect(
    in_types.size() == 2,
    in_types[0].ndim == 3,
)
After that, check each type:
utils.type_check.expect(
    x_type.dtype == np.float32,
    x_type.ndim == 3,
    x_type.shape[1] == 2,
)
The above example works correctly even when x_type.ndim == 0 as all conditions are evaluated lazily.
Chainer Reference Manual
Core functionalities
Variable

class chainer.Variable(data, volatile=OFF, name=None, grad=None)
Array with a structure to keep track of computation.
Every variable holds a data array of type either numpy.ndarray or cupy.ndarray.
A Variable object may be constructed in two ways: by the user or by some function. When a variable is created by some function as one of its outputs, the variable holds a reference to that function. This reference is used in error backpropagation (a.k.a. backprop). It is also used in backward unchaining. A variable that does not hold a reference to its creator is called a root variable. A variable is root if it is created by the user, or if the reference is deleted by unchain_backward().
Users can disable this chaining behavior by setting the volatile flag for the initial variables. When a function gets volatile variables as its inputs, the output variables do not hold references to the function. This acts like unchaining on every function application.
Variables:
 data – Data array of type either numpy.ndarray or cupy.ndarray.
 grad – Gradient array.
 creator – The function that creates this variable. It is None if the variable is not created by any function.
 volatile – Ternary Flag object. If 'ON', the variable does not keep track of any function applications. See Flag for the detail of ternary flags.

__getitem__(x, slices)
Extract elements from array with specified shape, axes and offsets.
Returns: A Variable object which contains the sliced array of x.
Note
It only supports types that are supported by CUDA’s atomicAdd when an integer array is included in slices. The supported types are numpy.float32, numpy.int32, numpy.uint32, numpy.uint64 and numpy.ulonglong.
Note
It does not support slices that contain multiple boolean arrays.
Note
See NumPy document for details of indexing.

__len__()
Returns the number of elements of the data array.
Returns: Number of elements of the data array.
Return type: int

addgrad(var)
Accumulates the gradient array from given source variable.
This method just runs self.grad += var.grad, except that the accumulation is even done across the host and different devices.
Parameters: var (Variable) – Source variable.

backward(retain_grad=False)
Runs error backpropagation (a.k.a. backprop) from this variable.
On backprop, Function.backward() is called on each Function object appearing in the backward graph starting from this variable. The backward graph is represented by backward references from variables to their creators, and from functions to their inputs. The backprop stops at all root variables. Some functions set None as gradients of some inputs, where further backprop does not take place at such input variables.
This method uses grad as the initial error array. User can manually set a gradient array before calling this method. If data contains only one element (i.e., it is scalar) and grad is None, then this method automatically complements 1.0 as the initial error. This is useful on starting backprop from some scalar loss value.
Parameters: retain_grad (bool) – If True, the gradient arrays of all intermediate variables are kept. Otherwise, grad of the intermediate variables are set to None on appropriate timing, which may reduce the maximum memory consumption.
In most cases of training some models, the purpose of backprop is to compute gradients of parameters, not of variables, so it is recommended to set this flag False.

copydata(var)
Copies the data array from given source variable.
This method just copies the data attribute from given variable to this variable, except that the copy is even done across the host and different devices.
Parameters: var (Variable) – Source variable.

label
Short text that represents the variable.

reshape(*shape)
Returns a variable of a different shape and the same content.
See also chainer.functions.reshape() for full documentation.

set_creator(gen_func)
Notifies the variable that the given function is its creator.
Parameters: gen_func (Function) – Function object that creates this variable as one of its outputs.

to_gpu(device=None)
Copies the data and gradient arrays to specified GPU.
Parameters: device – Target device specifier. If omitted, the current device is used.

transpose(*axes)
Permute the dimensions of an input variable without copy.
See also chainer.functions.transpose() for full documentation.

unchain_backward()
Deletes references between variables and functions backward.
After this method completes, intermediate variables and functions that are not referenced from anywhere are deallocated by reference count GC. Also this variable itself deletes the reference to its creator function, i.e. this variable becomes root in the computation graph. It indicates that backprop after unchaining stops at this variable. This behavior is useful to implement truncated BPTT.

zerograd
()[source]Â¶ Initializes the gradient array by zeros.
Deprecated since version v1.15: Use
cleargrad()
instead.
Flag

class chainer.Flag
Ternary flag object for variables.
It takes three values: ON, OFF, and AUTO.
ON and OFF flags can be evaluated as boolean values; they are converted to True and False, respectively. An AUTO flag cannot be converted to a boolean; attempting to do so raises ValueError.
Parameters: name (str, bool, or None) – Name of the flag. The following values are allowed:
 'on', 'ON', or True for the ON value
 'off', 'OFF', or False for the OFF value
 'auto', 'AUTO', or None for the AUTO value

chainer.ON = ON
Equivalent to Flag('on').

chainer.OFF = OFF
Equivalent to Flag('off').

chainer.AUTO = AUTO
Equivalent to Flag('auto').

chainer.flag.aggregate_flags(flags)
Returns an aggregated flag given a sequence of flags.
If both ON and OFF are found, this function raises an error. Otherwise, whichever of ON and OFF appears is returned. If all flags are AUTO, it returns AUTO.
Parameters: flags (sequence of Flag) – Input flags.
Returns: The result of aggregation.
Return type: Flag
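The aggregation rule can be illustrated with a minimal sketch in plain Python. This is not Chainer's implementation: the strings 'on', 'off' and 'auto' stand in for Flag objects, the function name aggregate is hypothetical, and raising ValueError on a conflict is an assumption (the text above only says that an error is raised).

```python
# Illustrative sketch of the aggregation rule; NOT Chainer's source code.
# Plain strings stand in for Flag objects (assumption for this example).

def aggregate(flags):
    """Return 'on', 'off', or 'auto' for a sequence of such strings."""
    result = 'auto'
    for flag in flags:
        if flag == 'auto':
            continue  # AUTO never decides the result
        if result != 'auto' and flag != result:
            # Both ON and OFF were found (error type is an assumption).
            raise ValueError('cannot aggregate both ON and OFF flags')
        result = flag
    return result


print(aggregate(['auto', 'on', 'auto']))
print(aggregate(['auto', 'auto']))
```

A sequence that mixes 'on' and 'off' makes the sketch raise, mirroring the conflict case described above.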
Function

class chainer.Function
Function on variables with backpropagation ability.
All function implementations defined in chainer.functions inherit this class.
The main feature of this class is keeping track of function applications as a backward graph. When a function is applied to Variable objects, its forward() method is called on the data fields of the input variables, and at the same time it chains references from the output variables to the function and from the function to its inputs.
Note
As of v1.5, a function instance cannot be used twice in any computational graphs. In order to reuse a function object multiple times, use copy.copy() before the function applications to make a copy of the instance.
This restriction also means that we cannot make a stateful function anymore. For example, it is now not allowed to let a function hold parameters. Define a function as a pure (stateless) procedure, and use Link to combine it with parameter variables.
Example
Let x be an instance of Variable and f an instance of Function taking only one argument. Then the lines
>>> import numpy, chainer, chainer.functions as F
>>> x = chainer.Variable(numpy.zeros(10))
>>> f = F.Identity()
>>> y = f(x)
compute a new variable y and create backward references. The backward references are set as per the following diagram:
x <--- f <--- y
If an application of another function g occurs as
>>> g = F.Identity()
>>> z = g(x)
then the graph grows with a branch:
    |--- f <--- y
x <-+
    |--- g <--- z
Note that the branching is correctly managed on backward computation, i.e. the gradients from f and g are accumulated to the gradient of x.
Every function implementation should provide forward_cpu(), forward_gpu(), backward_cpu() and backward_gpu(). Alternatively, one can provide forward() and backward() instead of separate methods. Backward methods have default implementations that just return None, which indicates that the function is non-differentiable.
Variables:
 inputs – A tuple or list of input variables.
 outputs – A tuple or list of output variables.
 type_check_enable – When it is True, the function checks types of input arguments. Set the CHAINER_TYPE_CHECK environment variable to 0 to disable type checking, or set the variable directly in your own program.
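The backward references and the accumulation of gradients at a branch can be pictured with a toy model in plain Python. The classes Var and Scale and the function backward below are hypothetical stand-ins written for this illustration, not Chainer's Variable or Function; the sketch only handles chains in which every intermediate variable has a single consumer.

```python
# Toy model of backward references; NOT Chainer's implementation.

class Var:
    def __init__(self, data, creator=None):
        self.data = data
        self.grad = None
        self.creator = creator      # backward reference: variable -> function


class Scale:
    """Computes y = k * x and remembers its input for the backward pass."""

    def __init__(self, k):
        self.k = k
        self.input = None

    def __call__(self, x):
        self.input = x              # backward reference: function -> input
        return Var(self.k * x.data, creator=self)

    def backward(self, gy):
        return self.k * gy


def backward(out):
    if out.grad is None:
        out.grad = 1.0              # scalar output: start from 1.0
    var = out
    while var.creator is not None:  # stop at a root variable
        func = var.creator
        gx = func.backward(var.grad)
        x = func.input
        # Accumulate, so gradients from several branches add up.
        x.grad = gx if x.grad is None else x.grad + gx
        var = x


x = Var(2.0)
y = Scale(3.0)(x)   # x <--- f <--- y
z = Scale(5.0)(x)   # x <--- g <--- z (separate function instance)
backward(y)
backward(z)
print(x.grad)
```

Two separate Scale instances are used for the two applications, in the spirit of the note that a function instance is not applied twice.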

__call__(*inputs)
Applies forward propagation with chaining backward references.
Basic behavior is described in the documentation of the Function class.
Note
If the data attribute of the input variables resides on a GPU device, then the appropriate device is selected before the forward() method is called, so in most cases implementers do not need to take care of device selection.
Parameters: inputs – Tuple of input Variable, numpy.ndarray or cupy.ndarray objects. The volatile flags of all input variables must agree. If an input is a numpy.ndarray or a cupy.ndarray, it is automatically wrapped with Variable.
Returns: One Variable object or a tuple of multiple Variable objects.

add_hook(hook, name=None)
Registers the function hook.
Parameters:
 hook (FunctionHook) – Function hook to be registered.
 name (str) – Name of the function hook. The name must be unique among the function hooks registered to the function. If None, the default name of the function hook is used.

backward(inputs, grad_outputs)
Applies backprop to output gradient arrays.
It delegates the procedure to backward_cpu() or backward_gpu() by default. Which one it selects is determined by the type of the input arrays and output gradient arrays. Implementations of Function must implement either the CPU/GPU methods or this method if the function is intended to be backpropagated through.
Parameters:
 inputs – Tuple of input arrays.
 grad_outputs – Tuple of output gradient arrays.
Returns: Tuple of input gradient arrays. Some or all of them can be None if the function is not differentiable with respect to the corresponding inputs.
Return type: tuple
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

backward_cpu(inputs, grad_outputs)
Applies backprop to output gradient arrays on CPU.
Parameters:
 inputs – Tuple of input numpy.ndarray object(s).
 grad_outputs – Tuple of output gradient numpy.ndarray object(s).
Returns: Tuple of input gradient numpy.ndarray object(s). Some or all of them can be None if the function is not differentiable with respect to the corresponding inputs.
Return type: tuple
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

backward_gpu(inputs, grad_outputs)
Applies backprop to output gradient arrays on GPU.
Parameters:
 inputs – Tuple of input cupy.ndarray object(s).
 grad_outputs – Tuple of output gradient cupy.ndarray object(s).
Returns: Tuple of input gradient cupy.ndarray object(s). Some or all of them can be None if the function is not differentiable with respect to the corresponding inputs.
Return type: tuple
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

check_type_forward(in_types)
Checks types of input data before forward propagation.
This function is called before forward(). You need to validate the types of the input data in this function using the type checking utilities.
Parameters: in_types (TypeInfoTuple) – The type information of input data for forward().

delete_hook(name)
Unregisters the function hook.
Parameters: name (str) – The name of the function hook to be unregistered.

forward(inputs)
Applies forward propagation to input arrays.
It delegates the procedure to forward_cpu() or forward_gpu() by default. Which one it selects is determined by the type of the input arrays. Implementations of Function must implement either the CPU/GPU methods or this method.
Parameters: inputs – Tuple of input array(s).
Returns: Tuple of output array(s).
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

forward_cpu(inputs)
Applies forward propagation to input arrays on CPU.
Parameters: inputs – Tuple of numpy.ndarray object(s).
Returns: Tuple of numpy.ndarray object(s).
Return type: tuple
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

forward_gpu(inputs)
Applies forward propagation to input arrays on GPU.
Parameters: inputs – Tuple of cupy.ndarray object(s).
Returns: Tuple of cupy.ndarray object(s).
Return type: tuple
Warning
Implementations of Function must ensure that the return value is a tuple even if it contains only one array.

label
Short text that represents the function.
The default implementation returns its type name. Each function should override it to give more information.

local_function_hooks
Ordered dictionary of registered function hooks.
Contrary to chainer.thread_local.function_hooks, which registers its elements to all functions, the function hooks in this property are specific to this function.

unchain()
Purges in/out variables and this function itself from the graph.
This method is called from the Variable.unchain_backward() method.

chainer.force_backprop_mode(*args, **kwds)
Enables backpropagation for Variables whose volatile is auto.
When you want to enable backpropagation inside no_backprop_mode(), call this method. In this context, a Variable object whose volatile attribute is 'auto' behaves like a non-volatile variable. That means you can disable no_backprop_mode() in this context.
If you call this method outside of a no_backprop_mode() context, it changes nothing, because a Variable object with volatile='auto' behaves like a non-volatile variable by default.
In this example, the volatility of x and y is 'auto'. In the no_backprop_mode() context, y does not have a computational graph, but in force_backprop_mode() it has a graph.
>>> x = chainer.Variable(numpy.array([1,], 'f'), volatile='auto')
>>> with chainer.no_backprop_mode():
...     # Variable with volatile='auto' behaves like volatile='on'
...     with chainer.force_backprop_mode():
...         # Variable with volatile='auto' behaves like volatile='off'
...         y = x + 1
See also
See no_backprop_mode() for details of backprop mode.

chainer.no_backprop_mode(*args, **kwds)
Disables backpropagation for Variables whose volatile is auto.
In the default setting, a Variable object whose volatile attribute is 'auto' behaves like a non-volatile variable. That means such a Variable object builds a computational graph, consumes memory to store the graph, and you can execute backpropagation for it. Within this context, such a Variable object behaves like a volatile variable, so you can easily switch between training and evaluation.
In this example, the volatility of x and y is 'auto', so y does not have a computational graph.
>>> x = chainer.Variable(numpy.array([1,], 'f'), volatile='auto')
>>> with chainer.no_backprop_mode():
...     y = x + 1
Link and Chain

class chainer.Link(**params)
Building block of model definitions.
Link is a building block of neural network models that supports various features like handling parameters, defining network fragments, serialization, etc.
Link is the primitive structure for model definitions. It supports management of parameter variables and persistent values that should be incorporated into serialization. Parameters are variables registered via the add_param() method, or given to the initializer method. Persistent values are arrays, scalars, or any other serializable values registered via the add_persistent() method.
Note
Whereas arbitrary serializable objects can be registered as persistent values, it is strongly recommended to register only values that should be treated as results of learning. A typical example of persistent values is values computed during training and required for testing, e.g. running statistics for batch normalization.
Parameters and persistent values are referred to by their names. They can be accessed as attributes of the link. The Link class itself manages the lists of names of parameters and persistent values to distinguish them from other attributes.
Links can be composed into more complex models. This composition feature is supported by child classes like Chain and ChainList. One can create a chain by combining one or more links. See the documents for these classes for details.
As noted above, Link supports the serialization protocol of the Serializer class. Note that only parameters and persistent values are saved and loaded. Other attributes are considered part of the user program (i.e. part of the network definition). In order to construct a link from a saved file, the other attributes must be identically reconstructed by user code.
Example
This is a simple example of a custom link definition. Chainer itself also provides many links defined under the links module. They might serve as examples, too.
Suppose we want to define a simple primitive link that implements a fully-connected layer based on the linear() function. Note that this function takes input units, a weight variable, and a bias variable as arguments. Then, the fully-connected layer can be defined as follows:
    import chainer
    import chainer.functions as F
    import numpy as np

    class LinearLayer(chainer.Link):

        def __init__(self, n_in, n_out):
            # Parameters are initialized as a numpy array of given shape.
            super(LinearLayer, self).__init__(
                W=(n_out, n_in),
                b=(n_out,),
            )
            self.W.data[...] = np.random.randn(n_out, n_in)
            self.b.data.fill(0)

        def __call__(self, x):
            return F.linear(x, self.W, self.b)
This example shows that a user can define arbitrary parameters and use them in any methods. Links typically implement the __call__ operator.
Parameters: params – Names, shapes, and optional dtypes of initial parameters. The keywords are used as the parameter names and the corresponding values consist of either the shape or a tuple of shape and dtype (shape, dtype). If only the shape is supplied, the default dtype will be used.
Variables: name (str) – Name of this link, given by the parent chain (if it exists).

add_param(name, shape, dtype=numpy.float32, initializer=None)
Registers a parameter to the link.
The registered parameter is saved and loaded on serialization and deserialization, and involved in the optimization. The data and gradient of the variable are initialized by NaN arrays. If initializer is not None, the data is initialized by initializer.
If the supplied name argument corresponds to an uninitialized parameter (that is, one that was added with the add_uninitialized_param() method), name will be removed from the set of uninitialized parameters.
The parameter is set to an attribute of the link with the given name.
Parameters:
 name (str) – Name of the parameter. This name is also used as the attribute name. Any uninitialized parameters with the same name will be removed.
 shape (int or tuple of ints) – Shape of the parameter array.
 dtype – Data type of the parameter array.
 initializer (chainer.initializer.Initializer) – If it is not None, the data is initialized with the given initializer. Note that in this case the dtype argument is ignored.

add_persistent(name, value)
Registers a persistent value to the link.
The registered value is saved and loaded on serialization and deserialization. The value is set to an attribute of the link.
Parameters:
 name (str) – Name of the persistent value. This name is also used for the attribute name.
 value – Value to be registered.

add_uninitialized_param(name)
Registers an uninitialized parameter to the link.
An uninitialized parameter is defined as a parameter that has a name but does not yet have a shape. If the shape of a parameter depends on the shape of the inputs to the __call__ operator, it can be useful to defer initialization (that is, setting the shape) until the first forward call of the link. Such parameters are intended to be defined as uninitialized parameters in the initializer and then initialized during the first forward call.
An uninitialized parameter is intended to be registered to a link by calling this method in the initializer method. Then, during the first forward call, the shape of the parameter is determined from the size of the inputs, and the parameter must be initialized by calling the add_param() method.
Parameters: name (str) – Name of the uninitialized parameter.

addgrads(link)
Accumulates gradient values from the given link.
This method adds each gradient array of the given link to the corresponding gradient array of this link. The accumulation works even across the host and different devices.
Parameters: link (Link) – Source link object.

children()
Returns a generator of all child links.
Returns: A generator object that generates all child links.

cleargrads()
Clears all gradient arrays.
This method should be called before the backward computation at every iteration of the optimization.

copy()
Copies the link hierarchy to a new one.
The whole hierarchy rooted at this link is copied. The copy is basically shallow, except that the parameter variables are also shallowly copied. This means that the parameter variables of the copy are different objects from those of the original link, while they share the data and gradient arrays.
The name of the link is reset on the copy, since the copied instance does not belong to the original parent chain (even if one exists).
Returns: Copied link object.
Return type: Link

copyparams(link)
Copies all parameters from the given link.
This method copies the data arrays of all parameters in the hierarchy. The copy works even across the host and devices. Note that this method does not copy the gradient arrays.
Parameters: link (Link) – Source link object.

has_uninitialized_params
Checks if the link has uninitialized parameters.
Returns: True if the link has any uninitialized parameters. Otherwise returns False.
Return type: bool

links(skipself=False)
Returns a generator of all links under the hierarchy.
Parameters: skipself (bool) – If True, then the generator skips this link and starts with the first child link.
Returns: A generator object that generates all links.

namedlinks(skipself=False)
Returns a generator of all (path, link) pairs under the hierarchy.
Parameters: skipself (bool) – If True, then the generator skips this link and starts with the first child link.
Returns: A generator object that generates all (path, link) pairs.

namedparams()
Returns a generator of all (path, param) pairs under the hierarchy.
Returns: A generator object that generates all (path, parameter) pairs. The paths are relative to this link.

params()
Returns a generator of all parameters under the link hierarchy.
Returns: A generator object that generates all parameters.

serialize(serializer)
Serializes the link object.
Parameters: serializer (AbstractSerializer) – Serializer object.

to_cpu()
Copies parameter variables and persistent values to CPU.
This method does not handle non-registered attributes. If some such attributes must be copied to CPU, the link implementation must override this method to do so.
Returns: self

to_gpu(device=None)
Copies parameter variables and persistent values to GPU.
This method does not handle non-registered attributes. If some such attributes must be copied to GPU, the link implementation must override this method to do so.
Parameters: device – Target device specifier. If omitted, the current device is used.
Returns: self

xp
Array module for this link.
Depending on whether this link is on CPU or GPU, this property returns numpy or cupy.

zerograds()
Initializes all gradient arrays by zero.
This method can be used for the same purpose as cleargrads(), but is less efficient. It is kept for backward compatibility.
Deprecated since version v1.15: Use cleargrads() instead.


class chainer.Chain(**links)
Composable link with object-like interface.
Composability is one of the most important features of neural nets. Neural net models consist of many reusable fragments, and each model itself might be embedded into a larger learnable system. Chain enables us to write a neural net based on composition, without bothering about routine work like collecting parameters, serialization, copying the structure with parameters shared, etc.
This class provides a way to compose one or more links into one structure. A chain can contain one or more child links. A child link is a link registered to the chain with its own name. The child link is stored as an attribute of the chain under that name. Users can write a whole model or a fragment of a neural net as a child class of Chain.
Each chain itself is also a link. Therefore, one can combine chains into higher-level chains. In this way, links and chains construct a link hierarchy. The link hierarchy forms a tree structure, where each node is identified by the path from the root. The path is represented by a string like a file path in UNIX, consisting of the names of the nodes on the path, joined by slashes (/).
Example
This is a simple example of a custom chain definition. Chainer itself also provides some chains defined under the links module. They might serve as examples, too.
Suppose we want to define a multi-layer perceptron consisting of two hidden layers with rectifiers as activation functions. We can use the Linear link as a building block:
    import chainer
    import chainer.functions as F
    import chainer.links as L

    class MultiLayerPerceptron(chainer.Chain):

        def __init__(self, n_in, n_hidden, n_out):
            # Create and register three layers for this MLP
            super(MultiLayerPerceptron, self).__init__(
                layer1=L.Linear(n_in, n_hidden),
                layer2=L.Linear(n_hidden, n_hidden),
                layer3=L.Linear(n_hidden, n_out),
            )

        def __call__(self, x):
            # Forward propagation
            h1 = F.relu(self.layer1(x))
            h2 = F.relu(self.layer2(h1))
            return self.layer3(h2)
Child links are registered via the initializer method. They can also be registered by the add_link() method. The forward propagation is often implemented as the __call__ operator, as in the above example, though it is not mandatory.
Parameters: links – Child links. The keywords are used as their names. The names are also set to the links.

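The path naming of a link hierarchy can be pictured with a small sketch in plain Python. A nested dict stands in for the tree of links and parameters; the function named_params and the leading slash in each path are assumptions made for this illustration, not a statement about Chainer's exact output.

```python
# Sketch of UNIX-like paths in a tree; a nested dict stands in for links.
# NOT Chainer code: named_params and the leading '/' are hypothetical.

def named_params(link, prefix=''):
    """Yield (path, value) pairs, joining node names with slashes."""
    for name, child in sorted(link.items()):
        path = prefix + '/' + name
        if isinstance(child, dict):
            # A dict plays the role of a child link: descend into it.
            for pair in named_params(child, path):
                yield pair
        else:
            # Anything else plays the role of a parameter.
            yield path, child


model = {
    'layer1': {'W': 'W1', 'b': 'b1'},
    'layer2': {'W': 'W2', 'b': 'b2'},
}
for path, value in named_params(model):
    print(path, value)
```

Each node is identified by the names on the way from the root, which is the property the hierarchy description relies on.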
add_link(name, link)
Registers a child link to this chain.
The registered link is saved and loaded on serialization and deserialization, and involved in the optimization. The registered link is called a child. The child link is set to an attribute of the chain with the given name.
This method also sets the name attribute of the registered link. If the given link already has the name attribute set, then it raises an error.
Parameters:
 name (str) – Name of the child link. This name is also used as the attribute name.
 link (Link) – The link object to be registered.


class chainer.ChainList(*links)
Composable link with list-like interface.
This is another example of a compositional link. Unlike Chain, this class can be used like a list of child links. Each child link is indexed by a non-negative integer, and the class maintains the current number of registered child links. The add_link() method inserts a new link at the end of the list. It is useful for writing a chain with an arbitrary number of child links, e.g. an arbitrarily deep multi-layer perceptron.
Note that this class does not implement all methods of list.
Parameters: links – Initial child links.

__getitem__(index)
Returns the child at the given index.
Parameters: index (int) – Index of the child in the list.
Returns: The index-th child link.
Return type: Link

add_link(link)
Registers a child link to this chain.
The registered link is saved and loaded on serialization and deserialization, and involved in the optimization. The registered link is called a child. The child link is accessible via the children() generator, which runs through the children in registered order.
This method also sets the name attribute of the registered link. If the given link already has the name attribute set, then it raises an error.
Parameters: link (Link) – The link object to be registered.

Optimizer

class chainer.Optimizer
Base class of all numerical optimizers.
This class provides basic features for all optimization methods. It optimizes parameters of a target link. The target link is registered via the setup() method, and then the update() method updates its parameters based on a given loss function.
Each optimizer implementation must be defined as a child class of Optimizer. It must override the update() method. An optimizer can use internal states, each of which is tied to one of the parameters. A state is a dictionary of serializable values (typically arrays of the same size as the corresponding parameters). In order to use state dictionaries, the optimizer must override the init_state() method (or its CPU/GPU versions, init_state_cpu() and init_state_gpu()).
If the optimizer is based on a single gradient computation (like most first-order methods), then it should inherit GradientMethod, which adds some features dedicated to first-order methods.
An Optimizer instance also supports hook functions. A hook function is registered by the add_hook() method. Each hook function is called in registration order in advance of the actual parameter update.
Variables:
 target – Target link object. It is set by the setup() method.
 t – Number of update steps. It must be incremented by the update() method.
 epoch – Current epoch. It is incremented by the new_epoch() method.
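The protocol described for the optimizer (register a target, run hooks in registration order, then update and increment t) can be sketched with a toy class in plain Python. ToySGD, the dict layout of the target, and the hook halve_grads are all hypothetical names invented for this illustration; none of this is Chainer's Optimizer implementation.

```python
# Toy model of the setup / add_hook / update protocol; NOT Chainer code.
from collections import OrderedDict


class ToySGD:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.t = 0                   # number of update steps
        self.hooks = OrderedDict()   # name -> hook, in registration order
        self.target = None

    def setup(self, target):
        # `target` is a dict with 'params' and 'grads' lists
        # (a hypothetical layout chosen for this sketch).
        self.target = target

    def add_hook(self, hook, name):
        if name in self.hooks:
            raise KeyError('hook %s already exists' % name)
        self.hooks[name] = hook

    def update(self):
        # Hooks run first, in registration order, then the update itself.
        for hook in self.hooks.values():
            hook(self.target)
        params, grads = self.target['params'], self.target['grads']
        for i, g in enumerate(grads):
            params[i] -= self.lr * g
        self.t += 1


def halve_grads(target):
    """Example hook: scale every gradient by 0.5 before the update."""
    target['grads'] = [g * 0.5 for g in target['grads']]


model = {'params': [1.0, 2.0], 'grads': [1.0, -2.0]}
opt = ToySGD(lr=0.1)
opt.setup(model)
opt.add_hook(halve_grads, 'halve')
opt.update()
print(model['params'], opt.t)
```

Here the gradients are assumed to be already computed, which corresponds to calling update() without a loss function.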

accumulate_grads(grads)
Accumulates gradients from another source.
This method just adds the given gradient arrays to the gradients that this optimizer holds. It is typically used in data-parallel optimization, where gradients for different shards are computed in parallel and aggregated by this method. This method correctly treats multiple GPU devices.
Parameters: grads (Iterable) – Iterable of gradient arrays to be accumulated.
Deprecated since version v1.5: Use the chainer.Link.addgrads() method of the target link instead.

add_hook(hook, name=None)
Registers a hook function.
A hook function is typically called right after the gradient computation, though the timing depends on the optimization method.
Parameters:

clip_grads(maxnorm)
Clips the norm of the whole gradients up to the threshold.
Parameters: maxnorm (float) – Threshold of the gradient L2 norm.
Deprecated since version v1.5: Use the GradientClipping hook function instead.

compute_grads_norm()
Computes the norm of the whole gradients.
Returns: L2 norm of the whole gradients, i.e. the square root of the sum of squares of all gradient elements.
Return type: float
Warning
This method returns a CPU-computed value, which means that it synchronizes between CPU and GPU if at least one of the gradients resides on the GPU.
Deprecated since version v1.5.

init_state(param, state)
Initializes the optimizer state corresponding to the parameter.
This method should add the needed items to the state dictionary. Each optimizer implementation that uses its own states should override this method or the CPU/GPU dedicated versions (init_state_cpu() and init_state_gpu()).
Parameters:
 param (Variable) – Parameter variable.
 state (dict) – State dictionary.

init_state_cpu(param, state)
Initializes the optimizer state on CPU.
This method is called from init_state() by default.
Parameters:
 param (Variable) – Parameter variable. Its data array is of type numpy.ndarray.
 state (dict) – State dictionary.

init_state_gpu(param, state)
Initializes the optimizer state on GPU.
This method is called from init_state() by default.
Parameters:
 param (Variable) – Parameter variable. Its data array is of type cupy.ndarray.
 state (dict) – State dictionary.

new_epoch()
Starts a new epoch.
This method increments the epoch count. Note that if the optimizer depends on the epoch count, the user should call this method at the beginning of each epoch.

prepare()
Prepares for an update.
This method initializes missing optimizer states (e.g. for parameters newly added after the setup), and copies the arrays in each state dictionary to CPU or GPU according to the corresponding parameter array.

remove_hook(name)
Removes a hook function.
Parameters: name (str) – Registered name of the hook function to remove.

serialize(serializer)
Serializes or deserializes the optimizer.
It only saves or loads the following things:
 Optimizer states
 Global states (t and epoch)
It neither saves nor loads the parameters of the target link. They should be saved or loaded separately.
Parameters: serializer (AbstractSerializer) – Serializer or deserializer object.

setup(link)
Sets a target link and initializes the optimizer states.
The given link is set to the target attribute. This method also prepares the optimizer state dictionaries corresponding to all parameters in the link hierarchy. The existing states are discarded.
Parameters: link (Link) – Target link object.

update(lossfun=None, *args, **kwds)
Updates the parameters and optimizer states.
This method updates the parameters of the target link and the corresponding optimizer states. Its behavior differs depending on whether lossfun is given.
If lossfun is given, this method initializes the gradients by zeros, calls lossfun with the given extra arguments, and calls the backward() method of its output to compute the gradients. The implementation might call lossfun more than once.
If lossfun is not given, this method assumes that the gradients of all parameters are already computed. An implementation that requires multiple gradient computations might raise an error in this case.
In both cases, this method invokes the update procedure for all parameters.
Parameters:
 lossfun (function) – Loss function. It accepts arbitrary arguments and returns one Variable object that represents the loss (or objective) value. This argument can be omitted for single gradient-based methods. In this case, this method assumes that the gradient arrays are already computed.
 args, kwds – Arguments for the loss function.

weight_decay(decay)
Applies weight decay to the parameter/gradient pairs.
Parameters: decay (float) – Coefficient of weight decay.
Deprecated since version v1.5: Use the WeightDecay hook function instead.

zero_grads()
Fills all gradient arrays by zeros.
Deprecated since version v1.5: Use the chainer.Link.cleargrads() method for the target link instead.

class chainer.GradientMethod
Base class of all single gradient-based optimizers.
This is an extension of the Optimizer class. Typical gradient methods that just require the gradient at the current parameter vector on an update can be implemented as its child class.
An implementation of a gradient method must override the following methods:
 init_state() or both init_state_cpu() and init_state_gpu()
 update_one() or both update_one_cpu() and update_one_gpu()
Note
It is recommended to call use_cleargrads() after creating a GradientMethod object for efficiency.

reallocate_cleared_grads()
Reallocates gradients cleared by cleargrad().
This method allocates arrays for all gradients that are None. It is called before and after every optimizer hook. If an inheriting optimizer does not require this allocation, the optimizer can override this method with a blank function.

update(lossfun=None, *args, **kwds)
Updates parameters based on a loss function or computed gradients.
This method runs in two ways.
 If lossfun is given, then it is used as a loss function to compute gradients.
 Otherwise, this method assumes that the gradients are already computed.
In both cases, the computed gradients are used to update parameters. The actual update routines are defined by the update_one() method (or its CPU/GPU versions, update_one_cpu() and update_one_gpu()).

update_one(param, state)
Updates a parameter based on the corresponding gradient and state.
This method calls the appropriate one of update_one_cpu() or update_one_gpu().
Parameters:
 param (Variable) – Parameter variable.
 state (dict) – State dictionary.

use_cleargrads(use=True)
Enables or disables use of cleargrads() in update.
Parameters: use (bool) – If True, this function enables use of cleargrads. If False, it disables use of cleargrads (zerograds is used).
Note
update() calls zerograds() by default for backward compatibility. It is recommended to call this method before the first call of update(), because cleargrads is more efficient than zerograds.
Hook functions

class chainer.optimizer.WeightDecay(rate)
Optimizer hook function for weight decay regularization.
This hook function adds a scaled parameter to the corresponding gradient. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay.
Variables: rate (float) – Coefficient for the weight decay.

class chainer.optimizer.Lasso(rate)
Optimizer hook function for Lasso regularization.
This hook function adds the sign of each weight, scaled by rate, to the corresponding gradient. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay.
Variables: rate (float) – Coefficient for the weight decay.
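The two regularization hooks above differ only in what they add to each gradient. The following plain-Python sketch illustrates both rules on lists of floats. It is not Chainer's implementation: the function names are hypothetical, and the Lasso rule shown (gradient plus rate times the sign of the weight) is an assumption based on the usual L1 penalty.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def weight_decay(params, grads, rate):
    """WeightDecay-style rule: g <- g + rate * p."""
    return [g + rate * p for p, g in zip(params, grads)]


def lasso(params, grads, rate):
    """Lasso-style rule (assumed): g <- g + rate * sign(p)."""
    return [g + rate * sign(p) for p, g in zip(params, grads)]


params = [1.0, -2.0, 0.0]
grads = [0.5, 0.25, 0.0]
decayed = weight_decay(params, grads, rate=0.5)  # pulls weights toward zero in proportion to their size
sparse = lasso(params, grads, rate=0.5)          # pulls every nonzero weight by the same amount
```

Weight decay penalizes large weights more strongly, while the Lasso-style rule applies a constant pull that tends to drive small weights to exactly zero.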

class chainer.optimizer.GradientClipping(threshold)
Optimizer hook function for gradient clipping.
This hook function scales all gradient arrays to fit to the defined L2 norm threshold.
Parameters: threshold (float) – L2 norm threshold.
Variables: threshold (float) – L2 norm threshold of gradient norm.
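A minimal sketch of norm-based clipping, assuming the norm is computed jointly over all gradient arrays (the text above does not spell this out). Gradients are represented as nested lists of floats, and clip_gradients is a hypothetical helper, not a Chainer function.

```python
import math


def clip_gradients(grads, threshold):
    """Scale all gradient lists so that their joint L2 norm is at most threshold."""
    total_norm = math.sqrt(sum(g * g for arr in grads for g in arr))
    if total_norm <= threshold:
        # Already within the threshold: return an unchanged copy.
        return [list(arr) for arr in grads]
    scale = threshold / total_norm
    return [[g * scale for g in arr] for arr in grads]


grads = [[3.0, 0.0], [0.0, 4.0]]  # joint L2 norm is 5.0
clipped = clip_gradients(grads, threshold=1.0)
clipped_norm = math.sqrt(sum(g * g for arr in clipped for g in arr))
```

Because every array is multiplied by the same factor, the direction of the overall gradient is preserved and only its length changes.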

class chainer.optimizer.GradientNoise(eta, noise_func=<function exponential_decay_noise>)
Optimizer hook function for adding gradient noise.
This hook function simply adds noise generated by noise_func to the gradient. By default it adds time-dependent annealed Gaussian noise to the gradient at every training step:
\[g_t \leftarrow g_t + N(0, \sigma_t^2)\]
where
\[\sigma_t^2 = \frac{\eta}{(1+t)^\gamma}\]
with \(\eta\) selected from {0.01, 0.3, 1.0} and \(\gamma = 0.55\).
Parameters:
 eta (float) – Parameter that defines the scale of the noise, which for the default noise function is recommended to be either 0.01, 0.3 or 1.0.
 noise_func (function) – Noise generating function, which by default is the one given in Adding Gradient Noise Improves Learning for Very Deep Networks.
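The annealing schedule above can be sketched with the standard library alone. The helper names below are hypothetical and the gradient is a plain list of floats; this illustrates the formula rather than the library's noise function.

```python
import random


def noise_variance(eta, t, gamma=0.55):
    """sigma_t^2 = eta / (1 + t)^gamma."""
    return eta / (1.0 + t) ** gamma


def add_gradient_noise(grad, eta, t, rng):
    """Return grad with independent N(0, sigma_t^2) noise added to each element."""
    std = noise_variance(eta, t) ** 0.5
    return [g + rng.gauss(0.0, std) for g in grad]


rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = add_gradient_noise([0.1, -0.2], eta=0.3, t=10, rng=rng)
```

The variance starts at eta when t is 0 and decays slowly with the step count, so the noise matters most early in training.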
Serializer

class chainer.AbstractSerializer
Abstract base class of all serializers and deserializers.

__call__(key, value)
Serializes or deserializes a value by the given name.
This operator saves or loads a value by the given name.
If this is a serializer, the value is simply saved at the key. Note that some type information might be lost depending on the implementation (and the target file format).
If this is a deserializer, the value is loaded by the key. Deserialization works differently for scalars and arrays. For scalars, the value argument is used only to determine the type to which the restored value is converted, and the converted value is returned. For arrays, the restored elements are copied directly into the value argument. String values are treated like scalars. If the value argument is None, the type of the restored value will typically be a numpy array, but it can depend on the particular subclass implementation.
Parameters:
Returns: Serialized or deserialized value.
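The scalar/array distinction is easier to see with a toy pair of classes. This sketch is an assumption-laden stand-in: it uses a dict as storage and Python lists in place of arrays, and DictSerializer and DictDeserializer are hypothetical names, not Chainer classes.

```python
class DictSerializer:
    """Toy serializer: stores each value under its key."""

    def __init__(self):
        self.storage = {}

    def __call__(self, key, value):
        # Copy lists so later changes to the source do not alter the saved state.
        self.storage[key] = list(value) if isinstance(value, list) else value
        return value


class DictDeserializer:
    """Toy deserializer: restores values saved by DictSerializer."""

    def __init__(self, storage):
        self.storage = storage

    def __call__(self, key, value):
        stored = self.storage[key]
        if value is None:
            return stored              # no type hint: return what was stored
        if isinstance(value, list):
            value[:] = stored          # arrays: copy elements in place
            return value
        return type(value)(stored)     # scalars: value only fixes the type


save = DictSerializer()
save('count', 3)
save('weights', [1.0, 2.0])

load = DictDeserializer(save.storage)
count = load('count', 0.0)   # restored as float because the hint is a float
weights = [0.0, 0.0]
load('weights', weights)     # weights is overwritten in place
```

Using the same call signature for saving and loading lets an object describe its state once and reuse that description in both directions.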

Dataset abstraction
Chainer supports a common interface for training and validation datasets. The dataset support consists of three components: datasets, iterators, and batch conversion functions.
Dataset represents a set of examples. Its interface is determined only by the iterators you want to use with it. The built-in iterators of Chainer require the dataset to support the __getitem__ and __len__ methods. In particular, the __getitem__ method should support indexing by both an integer and a slice. Slice indexing is easy to support by inheriting DatasetMixin, in which case users only have to implement the get_example() method for indexing. Some iterators also restrict the type of each example. Datasets are basically considered stateless objects, so the dataset does not need to be saved as part of a checkpoint of the training procedure.
Iterator iterates over the dataset, and at each iteration it yields a mini batch of examples as a list. Iterators should support the Iterator interface, which includes the standard iterator protocol of Python. Iterators manage where to read next, which means they are stateful.
Batch conversion function converts the mini batch into arrays to feed to the neural nets. It is also responsible for sending each array to an appropriate device. Chainer currently provides concat_examples() as the only example of batch conversion functions.
These components are all customizable, and are designed with a minimal interface that places few restrictions on the types of datasets and the ways to handle them. In most cases, though, the implementations provided by Chainer itself are enough to cover the usages.
Chainer also has a light system to download, manage, and cache concrete examples of datasets. All datasets managed through the system are saved under the dataset root directory, which is determined by the CHAINER_DATASET_ROOT environment variable and can also be set by the set_dataset_root() function.
Dataset representation
See Dataset examples for dataset implementations.

class chainer.dataset.DatasetMixin
Default implementation of dataset indexing.
DatasetMixin provides the __getitem__() operator. The default implementation uses get_example() to extract each example, and combines the results into a list. This mixin makes it easy to implement a new dataset that does not support efficient slicing.
A dataset implementation using DatasetMixin still has to provide the __len__() operator explicitly.

__getitem__(index)
Returns an example or a sequence of examples.
It implements the standard Python indexing and one-dimensional integer array indexing. It uses the get_example() method by default, but it may be overridden by the implementation to, for example, improve the slicing performance.
Parameters: index (int, slice, list or numpy.ndarray) – An index of an example or indexes of examples.
Returns: If index is int, returns an example created by get_example. If index is either slice or one-dimensional list or numpy.ndarray, returns a list of examples created by get_example.
Example
>>> import numpy
>>> from chainer import dataset
>>> class SimpleDataset(dataset.DatasetMixin):
...     def __init__(self, values):
...         self.values = values
...     def __len__(self):
...         return len(self.values)
...     def get_example(self, i):
...         return self.values[i]
...
>>> ds = SimpleDataset([0, 1, 2, 3, 4, 5])
>>> ds[1]  # Access by int
1
>>> ds[1:3]  # Access by slice
[1, 2]
>>> ds[[4, 0]]  # Access by one-dimensional integer list
[4, 0]
>>> index = numpy.arange(3)
>>> ds[index]  # Access by one-dimensional integer numpy.ndarray
[0, 1, 2]

get_example(i)
Returns the i-th example.
Implementations should override it. It should raise IndexError if the index is invalid.
Parameters: i (int) – The index of the example.
Returns: The i-th example.
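To show how a mixin can turn a single get_example() method into full indexing, here is a simplified sketch that does not depend on Chainer. ListIndexingMixin and Squares are hypothetical names; the sketch handles integers, slices and lists only, and leaves out numpy arrays and negative indices.

```python
class ListIndexingMixin:
    """Simplified stand-in for the indexing behavior described above."""

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Resolve the slice against the dataset length, then fetch one by one.
            start, stop, step = index.indices(len(self))
            return [self.get_example(i) for i in range(start, stop, step)]
        if isinstance(index, (list, tuple)):
            return [self.get_example(i) for i in index]
        return self.get_example(index)

    def get_example(self, i):
        raise NotImplementedError


class Squares(ListIndexingMixin):
    """Dataset whose i-th example is i squared."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def get_example(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


ds = Squares(5)
single = ds[2]
sliced = ds[1:4]
picked = ds[[4, 0]]
```

The subclass only defines __len__ and get_example(); slicing and list indexing come from the mixin.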

Iterator interface
See Iterator examples for dataset iterator implementations.

class chainer.dataset.Iterator
Base class of all dataset iterators.
Iterator iterates over the dataset, yielding a minibatch at each iteration. A minibatch is a list of examples. Each implementation should implement the iterator protocol (e.g., the __next__() method).
Note that, even if the iterator supports setting the batch size, it does not guarantee that each batch always contains the same number of examples. For example, if you let the iterator stop at the end of the sweep, the last batch may contain fewer examples.
The interface between the iterator and the underlying dataset is not fixed, and is up to the implementation.
Each implementation should provide the following attributes (they do not need to be writable).
 batch_size: Number of examples within each minibatch.
 epoch: Number of completed sweeps over the dataset.
 epoch_detail: Floating point number version of the epoch. For example, if the iterator is at the middle of the dataset at the third epoch, then this value is 2.5.
 previous_epoch_detail: The value of epoch_detail at the previous iteration. This value is None before the first iteration.
 is_new_epoch: True if the epoch count was incremented at the last update.
Each implementation should also support serialization to resume/suspend the iteration.
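The attributes listed above can be illustrated with a toy iterator over a Python list. ToySerialIterator is a hypothetical class written for this sketch; it does not shuffle, does not support serialization, and simply lets the last batch of each sweep be smaller.

```python
class ToySerialIterator:
    """Minimal iterator exposing the attributes described above."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        self.epoch = 0
        self.is_new_epoch = False
        self.previous_epoch_detail = None
        self._pos = 0  # index of the next example to read

    @property
    def epoch_detail(self):
        # Completed sweeps plus the fraction of the current sweep already read.
        return self.epoch + self._pos / len(self.dataset)

    def __iter__(self):
        return self

    def __next__(self):
        self.previous_epoch_detail = self.epoch_detail
        end = self._pos + self.batch_size
        batch = self.dataset[self._pos:end]
        if end >= len(self.dataset):
            # The sweep is finished: start over and count one more epoch.
            self._pos = 0
            self.epoch += 1
            self.is_new_epoch = True
        else:
            self._pos = end
            self.is_new_epoch = False
        return batch

    next = __next__  # Python 2 style alias


it = ToySerialIterator(list(range(5)), batch_size=2)
before = it.previous_epoch_detail
first = next(it)
detail_after_first = it.epoch_detail
second = next(it)
third = next(it)  # last batch of the sweep, contains a single example
```

With five examples and a batch size of two, the third batch has only one example, which matches the note that batches are not guaranteed to have equal size.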

__next__()
Returns the next batch.
This is a part of the iterator protocol of Python. It may raise the StopIteration exception when it stops the iteration.

finalize()
Finalizes the iterator and possibly releases the resources.
This method does nothing by default. Implementations may override it to better handle the internal resources.

next()
Python 2 alternative of __next__.
It calls __next__() by default.
Batch conversion function

chainer.dataset.concat_examples(batch, device=None, padding=None)
Concatenates a list of examples into array(s).
A dataset iterator yields a list of examples. If each example is an array, this function concatenates them along a newly inserted first axis (called the batch dimension) into one array. The basic behavior is the same for examples consisting of multiple arrays, i.e., corresponding arrays of all examples are concatenated.
For instance, consider the case where each example consists of two arrays (x, y). Then, this function concatenates the x's into one array and the y's into another array, and returns a tuple of these two arrays. As another example, consider the case where each example is a dictionary of two entries whose keys are 'x' and 'y', respectively, and whose values are arrays. Then, this function concatenates the x's into one array and the y's into another array, and returns a dictionary with two entries x and y whose values are the concatenated arrays.
When the arrays to concatenate have different shapes, the behavior depends on the padding value. If padding is None (default), it raises an error. Otherwise, it builds an array of the minimum shape into which the contents of all arrays can be placed. The padding value is then used for the extra elements of the resulting arrays.
TODO(beam2d): Add an example.
Parameters:
 batch (list) – A list of examples. This is typically given by a dataset iterator.
 device (int) – Device ID to which each array is sent. A negative value indicates the host memory (CPU). If it is omitted, all arrays are left in the original device.
 padding – Scalar value for extra elements. If this is None (default), an error is raised on shape mismatch. Otherwise, an array of the minimum dimensions that can accommodate all arrays is created, and elements outside of the examples are padded by this value.
Returns: Array, a tuple of arrays, or a dictionary of arrays. The type depends on the type of each example in the batch.
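The padding behavior can be sketched for the one-dimensional case using nested lists instead of arrays. concat_1d and concat_pairs are hypothetical helpers written for this illustration; they ignore the device argument and assume a non-empty batch.

```python
def concat_1d(batch, padding=None):
    """Stack 1-D sequences into a list of rows, padding short rows if requested."""
    lengths = {len(x) for x in batch}
    if len(lengths) > 1 and padding is None:
        raise ValueError('examples have different shapes and padding is None')
    width = max(lengths)  # smallest row length that fits every example
    return [list(x) + [padding] * (width - len(x)) for x in batch]


def concat_pairs(batch, padding=None):
    """Handle examples of the form (x, y): stack the x's and the y's separately."""
    xs, ys = zip(*batch)
    return concat_1d(xs, padding), concat_1d(ys, padding)


padded = concat_1d([[1, 2, 3], [4]], padding=0)
pairs = concat_pairs([([1], [2, 3]), ([4], [5, 6])])
```

Each row of the result corresponds to one example, so the outer list plays the role of the batch dimension.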

chainer.dataset.to_device(device, x)
Sends an array to a given device.
This method sends a given array to a given device. It is used in concat_examples(). You can also use this method in a custom converter method used in Updater and Extension implementations such as StandardUpdater and Evaluator.
Parameters:
 device (int or None) – Device ID to which an array is sent. If it is a negative value, the array is sent to CPU. If it is positive, the array is sent to GPU with the given ID. If it is None, the array is left in the original device.
 x (numpy.ndarray or cupy.ndarray) – An array to send.
Returns: Converted array.
Dataset management

chainer.dataset.get_dataset_root()
Gets the path to the root directory to download and cache datasets.
Returns: The path to the dataset root directory.
Return type: str

chainer.dataset.set_dataset_root(path)
Sets the root directory to download and cache datasets.
There are two ways to set the dataset root directory. One is by setting the environment variable CHAINER_DATASET_ROOT. The other is by using this function. If both are specified, the one specified via this function is used. The default dataset root is $HOME/.chainer/dataset.
Parameters: path (str) – Path to the new dataset root directory.

chainer.dataset.cached_download(url)
Downloads a file and caches it.
It downloads a file from the URL if there is no corresponding cache. After the download, this function stores a cache in the directory under the dataset root (see set_dataset_root()). If there is already a cache for the given URL, it just returns the path to the cache without downloading the same file.
Parameters: url (str) – URL to download from.
Returns: Path to the downloaded file.
Return type: str

chainer.dataset.cache_or_load_file(path, creator, loader)
Caches a file if it does not exist, or loads it otherwise.
This is a utility function used in dataset loading routines. The creator creates the file at the given path and returns the content. If the file already exists, the loader is called instead, and it loads the file and returns the content.
Note that the path passed to the creator is a temporary one, and is not the same as the path given to this function. This function safely renames the file created by the creator to the given path, even if this function is called simultaneously by multiple threads or processes.
Parameters:
 path (str) – Path to save the cached file.
 creator – Function that creates the file and returns the content. It takes a path to a temporary place as the argument. Before the creator is called, there is no file at the temporary path.
 loader – Function that loads the cached file and returns the content.
Returns: The value returned by the creator or the loader.
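The create-in-a-temporary-place-then-rename pattern described above can be sketched with the standard library. This is one possible way to get the behavior, not Chainer's actual code: cache_or_load is a hypothetical name, and the sketch relies on os.replace being atomic when source and destination are on the same filesystem.

```python
import os
import shutil
import tempfile


def cache_or_load(path, creator, loader):
    """Load path if it exists; otherwise create it via a temporary file."""
    if os.path.exists(path):
        return loader(path)
    directory = os.path.dirname(os.path.abspath(path))
    # A private temporary directory guarantees that no file exists at temp_path.
    temp_dir = tempfile.mkdtemp(dir=directory)
    try:
        temp_path = os.path.join(temp_dir, 'tmp')
        content = creator(temp_path)
        os.replace(temp_path, path)  # move the finished file into place
    finally:
        shutil.rmtree(temp_dir)
    return content


calls = []


def creator(temp_path):
    calls.append('create')
    with open(temp_path, 'w') as f:
        f.write('hello')
    return 'hello'


def loader(cached_path):
    calls.append('load')
    with open(cached_path) as f:
        return f.read()


work_dir = tempfile.mkdtemp()
target = os.path.join(work_dir, 'data.txt')
first = cache_or_load(target, creator, loader)   # file is missing: creator runs
second = cache_or_load(target, creator, loader)  # file exists: loader runs
shutil.rmtree(work_dir)
```

Writing to a temporary path first means other processes never observe a half-written cache file at the final path.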
Training loop abstraction
Chainer provides a standard implementation of the training loops under the chainer.training module. It is built on top of many other core features of Chainer, including Variable and Function, Link/Chain/ChainList, Optimizer, Dataset, and Reporter/Summary. Compared to the training loop abstraction of other machine learning tool kits, Chainer's training framework aims at maximal flexibility while keeping the simplicity for the typical usages. Most components are pluggable, and users can overwrite the definition.
The core of the training loop abstraction is Trainer, which implements the training loop itself. The training loop consists of two parts: one is Updater, which actually updates the parameters to train, and the other is Extension, which provides arbitrary functionalities other than the parameter update.
Updater and some extensions use dataset and Iterator to scan the datasets and load mini batches. The trainer also uses Reporter to collect the observed values, and some extensions use DictSummary to accumulate them and compute the statistics.
You can find many examples of the usage of these training utilities in the official examples. You can also search the extension implementations in Trainer extensions.
Trainer

class chainer.training.Trainer(updater, stop_trigger=None, out='result')
The standard training loop in Chainer.
Trainer is an implementation of a training loop. Users can invoke the training by calling the run() method.
Each iteration of the training loop proceeds as follows.
 Update of the parameters. It includes the minibatch loading, forward and backward computations, and an execution of the update formula. These are all done by the updater object held by the trainer.
 Invocation of trainer extensions in the descending order of their priorities. A trigger object is attached to each extension, and it decides at each iteration whether the extension should be executed. Trigger objects are callable objects that take the trainer object as the argument and return a boolean value indicating whether the extension should be called or not.
Extensions are callable objects that take the trainer object as the argument. There are three ways to define custom extensions: inheriting the Extension class, decorating functions by make_extension(), and defining any callable including lambda functions. See Extension for more details on custom extensions and how to configure them.
Users can register extensions to the trainer by calling the extend() method, where some configurations can be added.
 Trigger object, which is also explained above. In most cases, IntervalTrigger is used, in which case users can simply specify a tuple of the interval length and its unit, like (1000, 'iteration') or (1, 'epoch').
 The order of execution of extensions is determined by their priorities. Extensions of higher priorities are invoked earlier. There are three standard values for the priorities:
  PRIORITY_WRITER. This is the priority for extensions that write some records to the observation dictionary. It includes cases where the extension directly adds values to the observation dictionary, or the extension uses the chainer.report() function to report values to the observation dictionary.
  PRIORITY_EDITOR. This is the priority for extensions that edit the observation dictionary based on already reported values.
  PRIORITY_READER. This is the priority for extensions that only read records from the observation dictionary. This is also suitable for extensions that do not use the observation dictionary at all.
 Extensions with the invoke_before_training flag on are also invoked at the beginning of the training loop. Extensions that update the training status (e.g., changing learning rates) should have this flag set to True to ensure that resuming the training loop correctly recovers the training status.
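The interplay of priorities and triggers within one iteration can be sketched as follows. Everything here is a stand-in: the trainer is a plain dict, extensions are dicts with func, priority and trigger entries, and the numeric priority values are chosen for illustration only.

```python
# Illustrative priority values; only their relative order matters in this sketch.
PRIORITY_WRITER = 300
PRIORITY_EDITOR = 200
PRIORITY_READER = 100


def run_extensions(trainer, extensions):
    """Call the due extensions in descending priority order.

    sorted() is stable, so extensions with equal priority keep their
    registration order.
    """
    ordered = sorted(extensions, key=lambda ext: -ext['priority'])
    for ext in ordered:
        if ext['trigger'](trainer):
            ext['func'](trainer)


log = []
extensions = [
    {'func': lambda t: log.append('reader'), 'priority': PRIORITY_READER,
     'trigger': lambda t: True},
    {'func': lambda t: log.append('writer'), 'priority': PRIORITY_WRITER,
     'trigger': lambda t: t['iteration'] % 2 == 0},  # fires every 2nd iteration
    {'func': lambda t: log.append('editor'), 'priority': PRIORITY_EDITOR,
     'trigger': lambda t: True},
]

run_extensions({'iteration': 1}, extensions)  # writer is skipped by its trigger
run_extensions({'iteration': 2}, extensions)  # all three run, writer first
```

Running writers before editors and readers ensures that values are already in the observation by the time other extensions look at it.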
The current state of the trainer object and objects handled by the trainer can be serialized through the standard serialization protocol of Chainer. It enables us to easily suspend and resume the training loop.
Note
The serialization does not recover everything of the training loop. It only recovers the states which change over the training (e.g. parameters, optimizer states, the batch iterator state, extension states, etc.). You must initialize the objects correctly before deserializing the states.
On the other hand, it means that users can change the settings on deserialization. For example, the exit condition can be changed on the deserialization, so users can train the model for some iterations, suspend it, and then resume it with a larger number of total iterations.
During the training, it also creates a Reporter object to store observed values on each update. For each iteration, it creates a fresh observation dictionary and stores it in the observation attribute.
Links of the target model of each optimizer are registered to the reporter object as observers, where the name of each observer is constructed in the format <optimizer name><link name>. The link name is given by the chainer.Link.namedlink() method, which represents the path to each link in the hierarchy. Other observers can be registered by accessing the reporter object via the reporter attribute.
The default trainer is plain, i.e., it does not contain any extensions.
Parameters:
 updater (Updater) – Updater object. It defines how to update the models.
 stop_trigger – Trigger that determines when to stop the training loop. If it is not callable, it is passed to IntervalTrigger.
Variables:
 updater – The updater object for this trainer.
 stop_trigger – Trigger that determines when to stop the training loop. The training loop stops at the iteration on which this trigger returns True.
 observation – Observation of values made at the last update. See the Reporter class for details.
 out – Output directory.
 reporter – Reporter object to report observed values.

elapsed_time
Total time used for the training.
The time is in seconds. If the training is resumed from a snapshot, it includes the time of all the previous training to get the current state of the trainer.

extend(extension, name=None, trigger=None, priority=None, invoke_before_training=None)
Registers an extension to the trainer.
Extension is a callable object which is called after each update unless the corresponding trigger object decides to skip the iteration. The order of execution is determined by priorities: extensions with higher priorities are called earlier in each iteration. Extensions with the same priority are invoked in the order of registration.
If two or more extensions with the same name are registered, suffixes are added to the names of the second to last extensions. The suffix is _N where N is the ordinal of the extensions.
See Extension for the interface of extensions.
Parameters:
 extension – Extension to register.
 name (str) – Name of the extension. If it is omitted, the default_name attribute of the extension is used instead. Note that the name would be suffixed by an ordinal in case of duplicated names as explained above.
 trigger (tuple or Trigger) – Trigger object that determines when to invoke the extension. If it is None, extension.trigger is used instead. If it is None and the extension does not have the trigger attribute, the extension is triggered at every iteration by default. If the trigger is not callable, it is passed to IntervalTrigger to build an interval trigger.
 priority (int) – Invocation priority of the extension. Extensions are invoked in the descending order of priorities in each iteration. If this is None, extension.priority is used instead.
 invoke_before_training (bool or None) – If True, the extension is also invoked just before entering the training loop. If this is None, extension.invoke_before_training is used instead. This option is mainly used for extensions that alter the training configuration (e.g., learning rates); in such a case, resuming from snapshots requires the extension to be called to recover the configuration before any updates.
Updater

class chainer.training.Updater
Interface of updater objects for trainers.
TODO(beam2d): document it.

connect_trainer(trainer)
Connects the updater to the trainer that will call it.
The typical usage of this method is to register additional links to the reporter of the trainer. This method is called at the end of the initialization of Trainer. The default implementation does nothing.
Parameters: trainer (Trainer) – Trainer object to which the updater is registered.

finalize()
Finalizes the updater object.
This method is called at the end of training loops. It should finalize each dataset iterator used in this updater.

get_all_optimizers()
Gets a dictionary of all optimizers for this updater.
Returns: Dictionary that maps names to optimizers.
Return type: dict


class chainer.training.StandardUpdater(iterator, optimizer, converter=<function concat_examples>, device=None, loss_func=None)
Standard implementation of Updater.
This is the standard implementation of Updater. It accepts one or more training datasets and one or more optimizers. The default update routine assumes that there is only one training dataset and one optimizer. Users can override this update routine by inheriting this class and overriding the update_core() method. Each batch is converted to input arrays by concat_examples() by default, which can also be manually set by the converter argument.
Parameters:
 iterator – Dataset iterator for the training dataset. It can also be a dictionary of iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
 optimizer – Optimizer to update parameters. It can also be a dictionary of optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
 converter – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. concat_examples() is used by default.
 device – Device to which the training data is sent. A negative value indicates the host memory (CPU).
 loss_func – Loss function. The target link of the main optimizer is used by default.
Variables:
 converter – Converter function.
 loss_func – Loss function. If it is None, the target link of the main optimizer is used instead.
 device – Device to which the training data is sent.
 iteration – Current number of completed updates.

class chainer.training.ParallelUpdater(iterator, optimizer, converter=<function concat_examples>, models=None, devices=None, loss_func=None)
Implementation of a parallel GPU Updater.
This is an implementation of Updater that uses multiple GPUs. It behaves similarly to StandardUpdater. The update routine is modified to support data-parallel computation on multiple GPUs in one machine. It is based on synchronous parallel SGD: it parallelizes the gradient computation over a mini-batch, and updates the parameters only in the main device.
Parameters:
 iterator – Dataset iterator for the training dataset. It can also be a dictionary of iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
 optimizer – Optimizer to update parameters. It can also be a dictionary of optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
 converter – Converter function to build input arrays. Each batch extracted by the main iterator is split equally between the devices and then passed with the corresponding device option to this function. concat_examples() is used by default.
 models – Dictionary of models. The main model should be the same model attached to the 'main' optimizer.
 devices – Dictionary of devices to which the training data is sent. The devices should be arranged in a dictionary with the same structure as models.
 loss_func – Loss function. The model is used as a loss function by default.
Extension

class chainer.training.Extension
Base class of trainer extensions.
An extension of Trainer is a callable object that takes the trainer object as the argument. It also provides some default configurations as its attributes, e.g. the default trigger and the default priority. This class provides a set of typical default values for these attributes.
There are three ways to define users' own extensions: inheriting this class, decorating closures by make_extension(), or using any callable including lambda functions as extensions. The decorator can slightly reduce the overhead and is much easier to use, while this class provides more flexibility (for example, it can have methods to configure the behavior). Using a lambda function allows one-line coding for simple purposes, but users have to specify the configurations as arguments to Trainer.extend(). For a callable not inheriting this class, the default configurations of this class are used unless the user explicitly specifies them in the Trainer.extend() method.
Variables:
 trigger – Default value of trigger for this extension. It is set to (1, 'iteration') by default.
 priority – Default priority of the extension. It is set to PRIORITY_READER by default.
 invoke_before_training – Default flag to decide whether this extension should be invoked before the training starts. The default value is False.

__call__(trainer)
Invokes the extension.
Implementations should override this operator. This method is called at iterations which the corresponding trigger accepts.
Parameters: trainer (Trainer) – Trainer object that calls this operator.

default_name
Default name of the extension.
It is the name of the class by default. Implementations can override this property, or provide a class attribute to hide it.

chainer.training.make_extension(trigger=None, default_name=None, priority=None, invoke_before_training=False, finalizer=None)
Decorator to make given functions into trainer extensions.
This decorator just adds some attributes to a given function. The values of the attributes are given by the arguments of this decorator.
See Extension for details of trainer extensions. Most of the default values of arguments also follow those for this class.
Parameters:
 trigger – Default trigger of the extension.
 default_name – Default name of the extension. The name of a given function is used by default.
 priority (int) – Default priority of the extension.
 invoke_before_training (bool) – Default flag to decide whether the extension should be invoked before any training.
 finalizer – Finalizer function of this extension. The finalizer is called at the end of the training loop.
Trigger
A trigger is a callable object that decides when to process some specific event within the training loop. It takes a Trainer object as the argument, and returns True if the event should be fired.
It is mainly used to determine when to call an extension. It is also used to determine when to quit the training loop.

chainer.training.get_trigger(trigger)
Gets a trigger object.
A trigger object is a callable that accepts a Trainer object as an argument and returns a boolean value. When it returns True, various kinds of events can occur depending on the context in which the trigger is used. For example, if the trigger is passed to the Trainer as the stop trigger, the training loop breaks when the trigger returns True. If the trigger is passed to the extend() method of a trainer, then the registered extension is invoked only when the trigger returns True.
This function returns a trigger object based on the argument. If trigger is already a callable, it just returns the trigger. If trigger is None, it returns a trigger that never fires. Otherwise, it passes the value to IntervalTrigger.
Parameters: trigger – Trigger object. It can be either an already built trigger object (i.e., a callable object that accepts a trainer object and returns a bool value), or a tuple. In the latter case, the tuple is passed to IntervalTrigger.
Returns: trigger if it is a callable, otherwise an IntervalTrigger object made from trigger.
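The three cases handled by get_trigger() can be sketched without Chainer. ToyIntervalTrigger and toy_get_trigger are hypothetical stand-ins, and the trainer is simplified to an object with iteration, epoch and is_new_epoch attributes, which is an assumption of this sketch rather than the real Trainer layout.

```python
from types import SimpleNamespace


class ToyIntervalTrigger:
    """Fires every `period` iterations or epochs."""

    def __init__(self, period, unit):
        if unit not in ('iteration', 'epoch'):
            raise ValueError('unit must be "iteration" or "epoch"')
        self.period = period
        self.unit = unit

    def __call__(self, trainer):
        if self.unit == 'iteration':
            return trainer.iteration % self.period == 0
        # Epoch-based: fire only on the update that completed an epoch.
        return trainer.is_new_epoch and trainer.epoch % self.period == 0


def toy_get_trigger(trigger):
    """Return trigger unchanged if callable, a never-firing trigger for None,
    and an interval trigger built from the tuple otherwise."""
    if callable(trigger):
        return trigger
    if trigger is None:
        return lambda trainer: False
    return ToyIntervalTrigger(*trigger)


every_1000 = toy_get_trigger((1000, 'iteration'))
never = toy_get_trigger(None)
custom = toy_get_trigger(lambda trainer: trainer.iteration > 5)

state = SimpleNamespace(iteration=2000, epoch=3, is_new_epoch=False)
fires_at_2000 = every_1000(state)
state.iteration = 1500
fires_at_1500 = every_1000(state)
```

Accepting either a tuple or a callable keeps the common case short while still allowing arbitrary firing conditions.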
Debug mode
In debug mode, Chainer checks values of variables at runtime and shows more detailed error messages. It helps you to debug your programs. In exchange, it incurs additional overhead.
In debug mode, Chainer checks all results of forward and backward computation, and if it finds a NaN value, it raises RuntimeError.
Some functions and links also check the validity of input values.

chainer.is_debug()
Gets the debug mode.
Returns: True if Chainer is in debug mode.
Return type: bool

chainer.set_debug(debug)
Sets the debug mode.
Note
This method changes global state. When you use this method in a multi-threading environment, it may affect other threads.
Parameters: debug (bool) – New debug mode.

class chainer.DebugMode(debug)
Debug mode context.
This class provides a context manager for debug mode. When entering the context, it sets the debug mode to the value of the debug parameter while remembering its original value. When exiting the context, it sets the debug mode back to the original value.
Parameters: debug (bool) – Debug mode used in the context.
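The save-and-restore behavior of such a context manager can be sketched in a few lines. The names below (ToyDebugMode and the module-level _state dict with its accessors) are hypothetical stand-ins that mirror the description above, not Chainer's implementation.

```python
# Global flag held in a dict so the accessors below can modify it.
_state = {'debug': False}


def is_debug():
    return _state['debug']


def set_debug(debug):
    _state['debug'] = debug


class ToyDebugMode:
    """Context manager that sets the flag on entry and restores it on exit."""

    def __init__(self, debug):
        self._debug = debug

    def __enter__(self):
        self._old = is_debug()   # remember the original value
        set_debug(self._debug)

    def __exit__(self, *args):
        set_debug(self._old)     # restore even if the body raised


before = is_debug()
with ToyDebugMode(True):
    inside = is_debug()
after = is_debug()
```

Because the flag is global, the same caveat as for set_debug() applies: other threads see the changed value while the context is active.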
FunctionSet (deprecated)

class chainer.FunctionSet(**links)
Set of links (as "parameterized functions").
FunctionSet is a subclass of Chain. Function registration is done just by adding an attribute to the object.
Deprecated since version v1.5: Use Chain instead.

__getitem__(key)
Returns an attribute by name.
Parameters: key (str) – Name of the attribute.
Returns: Attribute.
Example
>>> model = chainer.FunctionSet(l1=L.Linear(10, 10),
...                             l2=L.Linear(10, 10))
>>> l1 = model['l1']  # equivalent to l1 = model.l1

collect_parameters()
Returns a tuple of parameters and gradients.
Returns: Tuple (pair) of two tuples. The first element is a tuple of parameter arrays, and the second is a tuple of gradient arrays.

copy_parameters_from(params)
Copies parameters from another source without reallocation.
Parameters: params (Iterable) – Iterable of parameter arrays.

gradients
Tuple of gradient arrays of all registered functions.
The order of gradients is consistent with the parameters() property.

parameters
Tuple of parameter arrays of all registered functions.
The order of parameters is consistent with the gradients property.

Utilities
CUDA utilities
Device, context and memory management on CuPy.
Chainer uses CuPy (with a very thin wrapper) to exploit the speed of GPU computation. The following modules and classes are imported into the cuda module for convenience (refer to this table when reading Chainer’s source code).
imported name            original name
chainer.cuda.cupy        cupy
chainer.cuda.ndarray     cupy.ndarray
chainer.cuda.cupy.cuda   cupy.cuda
chainer.cuda.Device      cupy.cuda.Device
chainer.cuda.Event       cupy.cuda.Event
chainer.cuda.Stream      cupy.cuda.Stream
Chainer replaces the default allocator of CuPy with its memory pool implementation. This allows device memory to be reused across multiple forward/backward computations and for temporary arrays in consecutive elementwise operations.
Devices

chainer.cuda.get_device(*args)
Gets the device from a device object, an ID integer or an array object.
Note
This API is deprecated. Please use chainer.cuda.get_device_from_id() or chainer.cuda.get_device_from_array() instead.
This is a convenient utility to select the correct device if the type of arg is unknown (i.e., one can use this function on arrays that may be on CPU or GPU). The returned device object supports Python’s context management protocol for the with statement.
Parameters: args – Values to specify a GPU device. The first device object, integer or cupy.ndarray object is used to select a device. If it is a device object, it is returned. If it is an integer, the corresponding device is returned. If it is a CuPy array, the device on which this array resides is returned. If none of the arguments is a device object, an integer or a CuPy array, a dummy device object representing CPU is returned.
Returns: Device object specified by the given args.
See also
See cupy.cuda.Device for device selection not based on arrays.

chainer.cuda.get_device_from_id(device_id)
Gets the device from an ID integer.
Parameters: device_id (int or None) – The ID of the device which this function returns.

chainer.cuda.get_device_from_array(*arrays)
Gets the device from a list of CuPy arrays or a single CuPy array.
The device on which the given CuPy array resides is returned.
Parameters: array (cupy.ndarray or list of cupy.ndarray) – A CuPy array whose device this function returns. If a list of cupy.ndarray objects is given, the device of the first array in the list is returned.
CuPy array allocation and copy
Note
As of v1.3.0, the following array construction wrappers are marked as deprecated. Use the corresponding functions of the cupy module instead. The main difference between them is that the default dtype changes from float32 to float64.
Deprecated functions      Recommended functions
chainer.cuda.empty        cupy.empty()
chainer.cuda.empty_like   cupy.empty_like()
chainer.cuda.zeros        cupy.zeros()
chainer.cuda.zeros_like   cupy.zeros_like()
chainer.cuda.ones         cupy.ones()
chainer.cuda.ones_like    cupy.ones_like()
chainer.cuda.full         cupy.full()
chainer.cuda.full_like    cupy.full_like()

chainer.cuda.copy(array, out=None, out_device=None, stream=None)
Copies a cupy.ndarray object using the default stream.
This function can copy the device array to the destination array on another device.
Parameters:
 array (cupy.ndarray) – Array to be copied.
 out (cupy.ndarray) – Destination array. If it is not None, then the out_device argument is ignored.
 out_device – Destination device specifier. The actual device object is obtained by passing this value to get_device().
 stream (cupy.cuda.Stream) – CUDA stream.
Returns: Copied array. If out is not specified, then the array is allocated on the device specified by the out_device argument.
Return type: cupy.ndarray

chainer.cuda.to_cpu(array, stream=None)
Copies the given GPU array to host CPU.
Parameters:
 array – Array to be sent to CPU.
 stream (cupy.cuda.Stream) – CUDA stream.
Returns: Array on CPU. If the given array is already on CPU, then this function just returns array without performing any copy.
Return type: numpy.ndarray

chainer.cuda.to_gpu(array, device=None, stream=None)
Copies the given CPU array to the specified device.
Parameters:
 array – Array to be sent to GPU.
 device – Device specifier.
 stream (cupy.cuda.Stream) – CUDA stream. If not None, the copy runs asynchronously.
Returns: Array on GPU. If array is already on GPU, then this function just returns array without performing any copy. Note that this function does not copy a cupy.ndarray into the specified device.
Return type: cupy.ndarray
Kernel definition utilities

chainer.cuda.memoize(for_each_device=False)
Makes a function memoizing the result for each argument and device.
This is similar to cupy.memoize(). The difference is that this function can be used in the global scope even if CUDA is not available. In such a case, this function does nothing.
Note
This decorator acts as a dummy if CUDA is not available. It cannot be used for general purpose memoization even if for_each_device is set to False.

chainer.cuda.clear_memo()
Clears the memoized results for all functions decorated by memoize.
This function works like cupy.clear_memo() as a counterpart for chainer.cuda.memoize(). It can be used even if CUDA is not available. In such a case, this function does nothing.

chainer.cuda.elementwise()
Creates an elementwise kernel function.
This function uses memoize() to cache the kernel object, i.e. the resulting kernel object is cached for each argument combination and CUDA device.
The arguments are the same as those for cupy.ElementwiseKernel, except that the name argument is mandatory.

chainer.cuda.reduce()
Creates a global reduction kernel function.
This function uses memoize() to cache the resulting kernel object, i.e. the resulting kernel object is cached for each argument combination and CUDA device.
The arguments are the same as those for cupy.ReductionKernel, except that the name argument is mandatory.
CPU/GPU generic code support

chainer.cuda.get_array_module(*args)
Gets the appropriate module, numpy or cupy.
This is almost equivalent to cupy.get_array_module(). The differences are that this function can be used even if CUDA is not available, and that for Variable arguments it returns the array module of their data arrays.
Parameters: args – Values to determine whether NumPy or CuPy should be used.
Returns: cupy or numpy is returned based on the types of the arguments.
Return type: module
Common algorithms

class chainer.utils.WalkerAlias(probs)
Implementation of Walker’s alias method.
This method generates a random sample from given probabilities \(p_1, \dots, p_n\) in \(O(1)\) time. It is more efficient than choice(). This class works on both CPU and GPU.
Parameters: probs (float list) – Probabilities of entries. They are normalized with sum(probs).
See: Wikipedia article

sample(shape)
Generates a random sample based on given probabilities.
Parameters: shape (tuple of int) – Shape of a return value.
Returns: Returns a generated array with the given shape. If a sampler is in CPU mode the return value is a numpy.ndarray object, and if it is in GPU mode the return value is a cupy.ndarray object.
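To make the \(O(1)\) claim concrete, the following is a minimal pure-Python sketch of Walker's alias method. It is not the WalkerAlias class itself: build_alias_table and sample_alias are hypothetical names, it runs on CPU only, and it draws one index at a time rather than an array of a given shape.

```python
import random


def build_alias_table(probs):
    """Build threshold and alias tables for Walker's alias method."""
    n = len(probs)
    total = float(sum(probs))
    # Normalize and scale so that the average entry equals 1.
    scaled = [p * n / total for p in probs]
    threshold = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        big = large.pop()
        threshold[s] = scaled[s]
        alias[s] = big
        # The large entry donates the mass needed to fill slot s.
        scaled[big] -= 1.0 - scaled[s]
        (small if scaled[big] < 1.0 else large).append(big)
    # Whatever is left has mass 1 up to rounding error.
    for i in small + large:
        threshold[i] = 1.0
        alias[i] = i
    return threshold, alias


def sample_alias(threshold, alias, rng=random):
    """Draw one index in O(1): pick a slot, keep it or take its alias."""
    i = rng.randrange(len(threshold))
    return i if rng.random() < threshold[i] else alias[i]


threshold, alias = build_alias_table([0.1, 0.2, 0.7])
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(100000):
    counts[sample_alias(threshold, alias, rng)] += 1
freqs = [c / 100000.0 for c in counts]
```

Building the tables costs \(O(n)\) once; each draw afterwards needs only one uniform index and one uniform float, independent of \(n\).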

Reporter
Reporter

class chainer.Reporter
Object to which observed values are reported.
Reporter is used to collect values that users want to watch. The reporter object holds a mapping from value names to the actually observed values. We call this mapping observations.
When a value is passed to the reporter, an object called observer can be optionally attached. In this case, the name of the observer is added as the prefix of the value name. The observer name should be registered beforehand.
See the following example:
>>> from chainer import Reporter, report, report_scope
>>>
>>> reporter = Reporter()
>>> observer = object()  # it can be an arbitrary (reference) object
>>> reporter.add_observer('my_observer:', observer)
>>> observation = {}
>>> with reporter.scope(observation):
...     reporter.report({'x': 1}, observer)
...
>>> observation
{'my_observer:x': 1}
There is also a global API to add values:
>>> observation = {}
>>> with report_scope(observation):
...     report({'x': 1}, observer)
...
>>> observation
{'my_observer:x': 1}
The most important application of Reporter is to report observed values from each link or chain in the training and validation procedures. Trainer and some extensions prepare their own Reporter object with the hierarchy of the target link registered as observers. We can use the report() function inside any links and chains to report the observed values (e.g., training loss, accuracy, activation statistics, etc.).
Variables: observation – Dictionary of observed values.
__exit__(exc_type, exc_value, traceback)
Recovers the previous reporter object to the current.

add_observer(name, observer)
Registers an observer of values.
Observer defines a scope of names for observed values. Values observed with the observer are registered with names prefixed by the observer name.
Parameters:
 name (str) – Name of the observer.
 observer – The observer object. Note that the reporter distinguishes the observers by their object ids (i.e., id(observer)), rather than by object equality.

add_observers(prefix, observers)
Registers multiple observers at once.
This is a convenient method to register multiple objects at once.
Parameters:
 prefix (str) – Prefix of each name of observers.
 observers – Iterator of name and observer pairs.

report(values, observer=None)
Reports observed values.
The values are written with the key, prefixed by the name of the observer object if given.
Parameters:
 values (dict) – Dictionary of observed values.
 observer – Observer object. Its object ID is used to retrieve the observer name, which is used as the prefix of the registration name of the observed value.

scope(*args, **kwds)
Creates a scope to report observed values to observation.
This is a context manager to be passed to with statements. In this scope, the observation dictionary is changed to the given one.
It also makes this reporter object current.
Parameters: observation (dict) – Observation dictionary. All observations reported inside of the with statement are written to this dictionary.


chainer.report(values, observer=None)
Reports observed values with the current reporter object.
Any reporter object can be set current by the with statement. This function calls the Reporter.report() method of the current reporter. If no reporter object is current, this function does nothing.
Example
The most typical example is a use within links and chains. Suppose that a link is registered to the current reporter as an observer (for example, the target link of the optimizer is automatically registered to the reporter of the Trainer). We can report some values from the link as follows:
class MyRegressor(chainer.Chain):
    def __init__(self, predictor):
        super(MyRegressor, self).__init__(predictor=predictor)

    def __call__(self, x, y):
        # This chain just computes the mean absolute and squared
        # errors between the prediction and y.
        pred = self.predictor(x)
        abs_error = F.sum(F.abs(pred - y)) / len(x.data)
        loss = F.mean_squared_error(pred, y)

        # Report the mean absolute and squared errors.
        report({'abs_error': abs_error, 'squared_error': loss}, self)

        return loss
If the link is named 'main' in the hierarchy (which is the default name of the target link in the StandardUpdater), these reported values are named 'main/abs_error' and 'main/squared_error'. If these values are reported inside the Evaluator extension, 'validation/' is added at the head of the link name, thus the item names are changed to 'validation/main/abs_error' and 'validation/main/squared_error' ('validation' is the default name of the Evaluator extension).
Parameters:
 values (dict) – Dictionary of observed values.
 observer – Observer object. Its object ID is used to retrieve the observer name, which is used as the prefix of the registration name of the observed value.
Summary and DictSummary

class chainer.Summary
Online summarization of a sequence of scalars.
Summary computes the statistics of given scalars online.

class chainer.DictSummary
Online summarization of a sequence of dictionaries.
DictSummary computes the statistics of a given set of scalars online. It only computes the statistics for scalar values and variables of scalar values in the dictionaries.
add(d)
Adds a dictionary of scalars.
Parameters: d (dict) – Dictionary of scalars to accumulate. Only elements of scalars, zero-dimensional arrays, and variables of zero-dimensional arrays are accumulated.

compute_mean()
Creates a dictionary of mean values.
It returns a single dictionary that holds a mean value for each entry added to the summary.
Returns: Dictionary of mean values.
Return type: dict

make_statistics()
Creates a dictionary of statistics.
It returns a single dictionary that holds mean and standard deviation values for every entry added to the summary. For an entry of name 'key', these values are added to the dictionary by names 'key' and 'key.std', respectively.
Returns: Dictionary of statistics of all entries.
Return type: dict
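One way to compute such statistics online is to keep a running count, sum, and sum of squares per key. The sketch below assumes that approach and a population (not sample) standard deviation; RunningSummary and make_statistics_sketch are hypothetical names, and the library's actual implementation may differ.

```python
import math


class RunningSummary(object):
    """Hypothetical online accumulator of count, sum and sum of squares."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def mean(self):
        return self.total / self.n

    def std(self):
        # Population standard deviation; clamp tiny negative rounding error.
        var = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(var, 0.0))


def make_statistics_sketch(dicts):
    """Accumulate a sequence of dicts and emit 'key' and 'key.std' entries."""
    summaries = {}
    for d in dicts:
        for key, value in d.items():
            summaries.setdefault(key, RunningSummary()).add(float(value))
    stats = {}
    for key, summary in summaries.items():
        stats[key] = summary.mean()
        stats[key + '.std'] = summary.std()
    return stats


stats = make_statistics_sketch([{'loss': 1.0}, {'loss': 3.0}])
```

Only three numbers are stored per key, so memory use does not grow with the number of reported dictionaries.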

Experimental feature annotation

chainer.utils.experimental(api_name)
Declares that user is using an experimental feature.
The developer of an API can mark it as experimental by calling this function. When users call experimental APIs, FutureWarning is issued. The presentation of FutureWarning is disabled by setting chainer.disable_experimental_warning to True, which is False by default.
The basic usage is to call it in the function or method we want to mark as experimental along with the API name.
from chainer import utils

def f(x):
    utils.experimental('chainer.foo.bar.f')
    # concrete implementation of f follows

f(1)
... FutureWarning: chainer.foo.bar.f is experimental. The interface can change in the future. ...
We can also make a whole class experimental. In that case, we should call this function in its __init__ method.
class C():
    def __init__(self):
        utils.experimental('chainer.foo.C')

C()
... FutureWarning: chainer.foo.C is experimental. The interface can change in the future. ...
If we want to mark the __init__ method only, rather than the class itself, it is recommended that we explicitly feed its API name.
class D():
    def __init__(self):
        utils.experimental('D.__init__')

D()
... FutureWarning: D.__init__ is experimental. The interface can change in the future. ...
Currently, we do not have any sophisticated way to mark some usage of a non-experimental function as experimental. But we can support such usage by explicitly branching it.
def g(x, experimental_arg=None):
    if experimental_arg is not None:
        utils.experimental('experimental_arg of chainer.foo.g')
Parameters: api_name (str) – The name of an API marked as experimental.
Assertion and Testing
Chainer provides some facilities to make debugging easy.
Type checking utilities
Function uses the systematic type checking provided by the chainer.utils.type_check module.
It enables users to easily find bugs in forward and backward implementations.
You can find examples of type checking in some function implementations.

class chainer.utils.type_check.Expr(priority)
Abstract syntax tree of an expression.
It represents an abstract syntax tree, and isn’t a value. You can get its actual value with the eval() function, and get its syntax representation with the __str__() method. Each comparison operator (e.g. ==) generates a new Expr object which represents the result of comparison between two expressions.
Example
Let x and y be instances of Expr, then
>>> x = Variable(1, 'x')
>>> y = Variable(1, 'y')
>>> c = (x == y)
is also an instance of Expr. To evaluate and get its value, call the eval() method:
>>> c.eval()
True
Call the str function to get a representation of the original equation:
>>> str(c)
'x == y'
You can actually compare an expression with a value:
>>> (x == 1).eval()
True
Note that you can’t use boolean operators such as and, as they try to cast expressions to boolean values:
>>> z = Variable(1, 'z')
>>> x == y and y == z  # raises an error
Traceback (most recent call last):
RuntimeError: Don't convert Expr to bool. Please call Expr.eval method to evaluate expression.
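The behaviour described above (comparison builds a new node, eval() computes the value, implicit conversion to bool is refused) can be imitated with a few lines of operator overloading. This is a simplified sketch, not the type_check implementation: LazyExpr, Var and Equal are hypothetical names, and only == is supported.

```python
class LazyExpr(object):
    """Hypothetical lazy expression node."""

    def eval(self):
        raise NotImplementedError

    def __eq__(self, other):
        # Build a new node instead of comparing immediately.
        return Equal(self, other)

    def __bool__(self):
        # `and` / `or` would call this, so refuse implicit truth testing.
        raise RuntimeError('Call eval() to evaluate the expression.')

    __nonzero__ = __bool__  # Python 2 spelling


class Var(LazyExpr):
    def __init__(self, value, name):
        self.value = value
        self.name = name

    def eval(self):
        return self.value

    def __str__(self):
        return self.name


class Equal(LazyExpr):
    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs

    def eval(self):
        # Plain values are used as-is; expression operands are evaluated.
        left = self.lhs.eval() if isinstance(self.lhs, LazyExpr) else self.lhs
        right = self.rhs.eval() if isinstance(self.rhs, LazyExpr) else self.rhs
        return left == right

    def __str__(self):
        return '{} == {}'.format(self.lhs, self.rhs)


x = Var(1, 'x')
y = Var(1, 'y')
c = (x == y)
```

Keeping the tree around is what lets an error message print the failing condition (via str) alongside the evaluated values.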

chainer.utils.type_check.expect(*bool_exprs)
Evaluates and tests all given expressions.
This function evaluates given boolean expressions in order. When at least one expression is evaluated as False, that means the given condition is not satisfied. You can check conditions with this function.
Parameters: bool_exprs (tuple of Bool expressions) – Bool expressions you want to evaluate.

class chainer.utils.type_check.TypeInfo(shape, dtype)
Type information of an input/gradient array.
It contains type information of an array, such as the shape of array and the number of dimensions. This information is independent of CPU or GPU array.
Gradient checking utilities
Most function implementations are numerically tested by gradient checking.
This method computes numerical gradients of forward routines and compares their results with the corresponding backward routines.
It helps pinpoint the source of the problem when a gradient computation is wrong.
The chainer.gradient_check module makes it easy to implement gradient checking.

chainer.gradient_check.check_backward(func, x_data, y_grad, params=(), eps=0.001, atol=1e-05, rtol=0.0001, no_grads=None, dtype=None)
Test backward procedure of a given function.
This function automatically checks the backward process of a given function. For example, when you have a Function class MyFunc that takes two arguments and returns one value, you can write its test like this:
>> def test_my_func(self):
>>   func = MyFunc()
>>   x1_data = xp.array(...)
>>   x2_data = xp.array(...)
>>   gy_data = xp.array(...)
>>   check_backward(func, (x1_data, x2_data), gy_data)
This method creates Variable objects with x_data and calls func with the Variables to get its result as a Variable. Then, it sets the y_grad array to the grad attribute of the result and calls the backward method to get gradients of the inputs. To check the correctness of the gradients, the function calls numerical_grad() to calculate the gradients numerically and compares the two sets of gradients with chainer.testing.assert_allclose(). If input objects (x1_data or/and x2_data in this example) represent integer variables, their gradients are ignored.
You can simplify a test when MyFunc takes only one argument:
>> check_backward(func, x1_data, gy_data)
If MyFunc is a loss function which returns a zero-dimensional array, pass None to gy_data. In this case, it sets 1 to the grad attribute of the result:
>> check_backward(my_loss_func, (x1_data, x2_data), None)
If MyFunc returns multiple outputs, pass all gradients for outputs as a tuple:
>> gy1_data = xp.array(...)
>> gy2_data = xp.array(...)
>> check_backward(func, x1_data, (gy1_data, gy2_data))
You can also test a Link. To check gradients of parameters of the link, pass a tuple of the parameters as the params argument:
>> check_backward(my_link, (x1_data, x2_data), gy_data,
>>                (my_link.W, my_link.b))
Note that params are not ndarrays, but Variables.
Function objects are acceptable as the func argument:
>> check_backward(lambda x1, x2: f(x1, x2),
>>                (x1_data, x2_data), gy_data)
Note
func is called many times to get numerical gradients for all inputs. This function doesn’t work correctly when func behaves randomly, as it gets different gradients.
Parameters:
 func (callable) – A function which takes Variables and returns Variables. func must return a tuple of Variables or one Variable. You can use a Function object, a Link object or a function satisfying the condition.
 x_data (ndarray or tuple of ndarrays) – A set of ndarrays to be passed to func. If x_data is one ndarray object, it is treated as (x_data,).
 y_grad (ndarray or tuple of ndarrays or None) – A set of ndarrays representing gradients of the return values of func. If y_grad is one ndarray object, it is treated as (y_grad,). If func is a loss function, y_grad should be set to None.
 params (Variable or tuple of Variable) – A set of Variables whose gradients are checked. When func is a Link object, set its parameters as params. If params is one Variable object, it is treated as (params,).
 eps (float) – Epsilon value to be passed to numerical_grad().
 atol (float) – Absolute tolerance to be passed to chainer.testing.assert_allclose().
 rtol (float) – Relative tolerance to be passed to chainer.testing.assert_allclose().
 no_grads (list of bool) – Flags to skip variables in the gradient assertion. It should have the same length as x_data.
 dtype (dtype) – x_data and y_grad are cast to this dtype when calculating numerical gradients. Only float types and None are allowed.
See: numerical_grad()

chainer.gradient_check.numerical_grad(f, inputs, grad_outputs, eps=0.001)
Computes numerical gradient by finite differences.
This function is used to implement gradient check. For usage examples, see the unit tests of chainer.functions.
Parameters:
 f (function) – Python function with no arguments that runs forward computation and returns the result.
 inputs (tuple of arrays) – Tuple of arrays that should be treated as inputs. Each element of them is slightly modified to realize numerical gradient by finite differences.
 grad_outputs (tuple of arrays) – Tuple of arrays that are treated as output gradients.
 eps (float) – Epsilon value of finite differences.
Returns: Numerical gradient arrays corresponding to inputs.
Return type: tuple
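The finite-difference idea can be sketched with NumPy alone. The version below assumes central differences and perturbs each input element in place, which is why f takes no arguments; numerical_grad_sketch is a hypothetical name and the library routine may differ in details such as GPU support.

```python
import numpy as np


def numerical_grad_sketch(f, inputs, grad_outputs, eps=1e-3):
    """Central-difference gradients of f with respect to each input array.

    f takes no arguments and reads `inputs` by reference, so perturbing an
    element in place changes what f() returns.
    """
    grads = [np.zeros_like(x) for x in inputs]
    for x, gx in zip(inputs, grads):
        for idx in np.ndindex(x.shape):
            orig = x[idx]
            x[idx] = orig + eps
            ys_plus = [y.copy() for y in f()]
            x[idx] = orig - eps
            ys_minus = [y.copy() for y in f()]
            x[idx] = orig  # restore the original element
            # Chain rule: weight each output difference by its upstream
            # gradient and sum over all outputs.
            gx[idx] = sum(
                float(((yp - ym) * gy).sum())
                for yp, ym, gy in zip(ys_plus, ys_minus, grad_outputs)
            ) / (2.0 * eps)
    return tuple(grads)


x = np.array([[1.0, 2.0], [3.0, 4.0]])
gy = np.ones_like(x)
gx, = numerical_grad_sketch(lambda: (x ** 2,), (x,), (gy,))
```

For \(y = x^2\) with an all-ones output gradient, the result should be close to \(2x\), which is what an analytical backward pass would be compared against.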
Standard Assertions
The assertions have the same names as their NumPy counterparts.
Unlike NumPy’s, they accept both numpy.ndarray and cupy.ndarray.
Function testing utilities
Chainer provides some utilities for testing its functions.

chainer.testing.unary_math_function_unittest(func, func_expected=None, label_expected=None, make_data=None)
Decorator for testing unary mathematical Chainer functions.
This decorator makes test classes test unary mathematical Chainer functions. Tested are forward and backward computations on CPU and GPU across parameterized shape and dtype.
Parameters:
 func (Function) – Chainer function to be tested by the decorated test class.
 func_expected – Function used to provide expected values for testing forward computation. If not given, a corresponding numpy function for func is implicitly picked up by its class name.
 label_expected (string) – String used to test labels of Chainer functions. If not given, the lowercased class name of func is implicitly used.
 make_data – Function to customize input and gradient data used in the tests. It takes shape and dtype as its arguments, and returns a tuple of input and gradient data. By default, a uniform distribution over [-1, 1] is used for both.
The decorated test class tests forward and backward computations on CPU and GPU across the following parameterize()ed parameters:
 shape: rank of zero, and rank of more than zero
 dtype: numpy.float16, numpy.float32 and numpy.float64
Additionally, it tests the label of the Chainer function.
Chainer functions tested by the test class decorated with the decorator should have the following properties:
 Unary, taking one parameter and returning one value
 dtype of input and output are the same
 Elementwise operation for the supplied ndarray
Example
The following code defines a test class that tests the sin() Chainer function, which takes a parameter with dtype of float and returns a value with the same dtype.
>>> import unittest
>>> from chainer import testing
>>> from chainer import functions as F
>>>
>>> @testing.unary_math_function_unittest(F.Sin())
... class TestSin(unittest.TestCase):
...     pass
Because the test methods are implicitly injected into the TestSin class by the decorator, it is enough to place pass in the class definition.
Now the test is run with the nose module.
>>> import nose
>>> nose.run(
...     defaultTest=__name__, argv=['', '-a', '!gpu'], exit=False)
True
To customize test data, the optional make_data parameter can be used. The following is an example of testing the sqrt Chainer function, which is tested in the positive value domain here instead of the default input.
>>> import numpy
>>>
>>> def make_data(shape, dtype):
...     x = numpy.random.uniform(0.1, 1, shape).astype(dtype)
...     gy = numpy.random.uniform(-1, 1, shape).astype(dtype)
...     return x, gy
...
>>> @testing.unary_math_function_unittest(F.Sqrt(),
...                                       make_data=make_data)
... class TestSqrt(unittest.TestCase):
...     pass
...
>>> nose.run(
...     defaultTest=__name__, argv=['', '-a', '!gpu'], exit=False)
True
Here, a make_data function that returns input and gradient data generated in the proper value domain for the given shape and dtype is defined and then passed to the decorator’s make_data parameter.
Standard Function implementations
Chainer provides basic Function implementations in the chainer.functions package. Most of them are wrapped by plain Python functions, which users should use.
Note
As of v1.5, the concept of parameterized functions is gone, and they are replaced by corresponding Link implementations. They are still kept in the functions namespace for backward compatibility, though it is strongly recommended to use them via the chainer.links package.
Activation functions
clipped_relu

chainer.functions.clipped_relu(x, z=20.0)
Clipped Rectifier Unit function.
For a clipping value \(z(>0)\), it computes
\[\text{ClippedReLU}(x, z) = \min(\max(0, x), z).\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_n)\)-shaped float array.
 z (float) – Clipping value. (default = 20.0)
Returns: Output variable. A \((s_1, s_2, ..., s_n)\)-shaped float array.
Return type: Variable
Example
>>> x = np.random.uniform(-100, 100, (10, 20)).astype('f')
>>> z = 10.0
>>> np.any(x < 0)
True
>>> np.any(x > z)
True
>>> y = F.clipped_relu(x, z=z)
>>> np.any(y.data < 0)
False
>>> np.any(y.data > z)
False
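As a point of comparison, clipping to the interval \([0, z]\) can be written in one line of NumPy. This is a reference sketch on plain arrays, not the Chainer function; clipped_relu_ref is a hypothetical name.

```python
import numpy as np


def clipped_relu_ref(x, z=20.0):
    """Clamp every element of x to the interval [0, z]."""
    return np.minimum(np.maximum(0, x), z)


x = np.array([-5.0, 0.5, 12.0, 30.0], dtype=np.float32)
y = clipped_relu_ref(x, z=10.0)
```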
crelu

chainer.functions.crelu(x, axis=1)
Concatenated Rectified Linear Unit function.
This function is expressed as follows
\[f(x) = (\max(0, x), \max(0, -x)).\]
Here, two output values are concatenated along an axis.
See: https://arxiv.org/abs/1603.05201
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 axis (int) – Axis that the output values are concatenated along. Default is 1.
Returns: Output variable of concatenated array. If the axis is 1, a \((s_1, s_2 \times 2, ..., s_N)\)-shaped float array.
Return type: Variable
Example
>>> x = np.array([[-1, 0], [2, -3]], 'f')
>>> x
array([[-1.,  0.],
       [ 2., -3.]], dtype=float32)
>>> y = F.crelu(x, axis=1)
>>> y.data
array([[ 0.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  3.]], dtype=float32)
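The doubling of the chosen axis follows directly from the concatenation. A plain NumPy sketch of the formula, with crelu_ref as a hypothetical name, makes this visible:

```python
import numpy as np


def crelu_ref(x, axis=1):
    """Concatenate the positive part of x and the positive part of -x."""
    return np.concatenate((np.maximum(0, x), np.maximum(0, -x)), axis=axis)


x = np.array([[-1, 0], [2, -3]], dtype=np.float32)
y = crelu_ref(x, axis=1)
```

With a (2, 2) input and axis=1 the result has shape (2, 4): the first two columns hold max(0, x) and the last two hold max(0, -x).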
elu

chainer.functions.elu(x, alpha=1.0)
Exponential Linear Unit function.
For a parameter \(\alpha\), it is expressed as
\[\begin{split}f(x) = \left \{ \begin{array}{ll} x & {\rm if}~ x \ge 0 \\ \alpha (\exp(x) - 1) & {\rm if}~ x < 0, \end{array} \right.\end{split}\]
See: https://arxiv.org/abs/1511.07289
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 alpha (float) – Parameter \(\alpha\). Default is 1.0.
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type: Variable
Example
>>> x = np.array([[-1, 0], [2, -3]], 'f')
>>> x
array([[-1.,  0.],
       [ 2., -3.]], dtype=float32)
>>> y = F.elu(x, alpha=1.)
>>> y.data
array([[-0.63212055,  0.        ],
       [ 2.        , -0.95021296]], dtype=float32)
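The piecewise definition translates to NumPy as follows. This is a sketch on plain arrays under the formula above; elu_ref is a hypothetical name, and expm1 on the clamped input is a choice made here to avoid overflow in the branch that is discarded for positive values.

```python
import numpy as np


def elu_ref(x, alpha=1.0):
    """x where x >= 0, alpha * (exp(x) - 1) elsewhere."""
    # np.where evaluates both branches, so clamp the exponent input to <= 0.
    negative_branch = alpha * np.expm1(np.minimum(x, 0))
    return np.where(x >= 0, x, negative_branch)


x = np.array([[-1, 0], [2, -3]], dtype=np.float32)
y = elu_ref(x, alpha=1.0)
```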
hard_sigmoid

chainer.functions.hard_sigmoid(x)
Elementwise hard-sigmoid function.
This function is defined as
\[\begin{split}f(x) = \left \{ \begin{array}{ll} 0 & {\rm if}~ x < -2.5 \\ 0.2 x + 0.5 & {\rm if}~ -2.5 < x < 2.5 \\ 1 & {\rm if}~ 2.5 < x. \end{array} \right.\end{split}\]
Parameters: x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type: Variable
Example
It maps the input values into the range of \([0, 1]\).
>>> x = np.array([-2.6, -1, 0, 1, 2.6])
>>> x
array([-2.6, -1. ,  0. ,  1. ,  2.6])
>>> F.hard_sigmoid(x).data
array([ 0. ,  0.3,  0.5,  0.7,  1. ])
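The three cases collapse into a single clip of the linear piece, since \(0.2x + 0.5\) reaches 0 at \(x = -2.5\) and 1 at \(x = 2.5\). A NumPy sketch, with hard_sigmoid_ref as a hypothetical name:

```python
import numpy as np


def hard_sigmoid_ref(x):
    """Clip the line 0.2 * x + 0.5 to the interval [0, 1]."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)


x = np.array([-2.6, -1, 0, 1, 2.6])
y = hard_sigmoid_ref(x)
```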
leaky_relu

chainer.functions.leaky_relu(x, slope=0.2)
Leaky Rectified Linear Unit function.
This function is expressed as
\[f(x)=\max(x, ax),\]
where \(a\) is a configurable slope value.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 slope (float) – Slope value \(a\).
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type: Variable
Example
>>> x = np.array([[-1, 0], [2, -3], [-2, 1]], 'f')
>>> x
array([[-1.,  0.],
       [ 2., -3.],
       [-2.,  1.]], dtype=float32)
>>> F.leaky_relu(x, slope=0.2).data
array([[-0.2       ,  0.        ],
       [ 2.        , -0.60000002],
       [-0.40000001,  1.        ]], dtype=float32)
log_softmax

chainer.functions.log_softmax(x, use_cudnn=True)
Channel-wise log-softmax function.
This function computes the logarithm of softmax along the second axis. Let \(c = (c_1, c_2, \dots, c_D)\) be the slice of x along the second axis. For each slice \(c\), it computes the logarithm of the function \(f(c)\) defined as
\[f(c) = {\exp(c) \over \sum_{d} \exp(c_d)}.\]
This method is theoretically equivalent to log(softmax(x)) but is more stable.
Note
log(softmax(x)) may cause underflow when x is too small, because softmax(x) may return 0. The log_softmax method is more stable.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \(n\)-dimensional (\(n \geq 2\)) float array.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable. A \(n\)-dimensional (\(n \geq 2\)) float array, which has the same shape as x.
Return type: Variable
Example
>>> x = np.array([[0, 1, 2], [0, 2, 4]], 'f')
>>> x
array([[ 0.,  1.,  2.],
       [ 0.,  2.,  4.]], dtype=float32)
>>> F.log_softmax(x).data
array([[-2.40760589, -1.40760589, -0.40760589],
       [-4.14293146, -2.14293146, -0.14293146]], dtype=float32)
>>> np.allclose(F.log_softmax(x).data, F.log(F.softmax(x)).data)
True
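The stability remark can be illustrated with the usual max-subtraction trick. The sketch below assumes that trick and works on plain NumPy arrays; log_softmax_ref is a hypothetical name and this is not necessarily how the library computes the result.

```python
import numpy as np


def log_softmax_ref(x, axis=1):
    """Log-softmax along `axis`, computed without forming softmax(x)."""
    # Subtracting the per-slice maximum keeps every exponent <= 0.
    shifted = x - x.max(axis=axis, keepdims=True)
    log_sum = np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    return shifted - log_sum


x = np.array([[0, 1, 2], [0, 2, 4]], dtype=np.float32)
y = log_softmax_ref(x)

# A slice where softmax underflows to 0 for the first entry.
extreme = np.array([[-1000.0, 0.0]])
stable = log_softmax_ref(extreme)
```

For the extreme slice, taking the logarithm of a softmax output would produce negative infinity for the first entry, whereas the shifted form returns a finite value.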
lstm

chainer.functions.lstm(c_prev, x)
Long Short-Term Memory units as an activation function.
This function implements LSTM units with forget gates. Let the previous cell state be c_prev and the input array be x.
First, the input array x is split into four arrays \(a, i, f, o\) of the same shape along the second axis. This means that the second axis of x must be 4 times as long as the second axis of c_prev.
The split input arrays correspond to:
 \(a\) : sources of cell input
 \(i\) : sources of input gate
 \(f\) : sources of forget gate
 \(o\) : sources of output gate
Second, it computes the updated cell state c and the outgoing signal h as:
\[\begin{split}c &= \tanh(a) \sigma(i) + c_{\text{prev}} \sigma(f), \\ h &= \tanh(c) \sigma(o),\end{split}\]
where \(\sigma\) is the elementwise sigmoid function. These are returned as a tuple of two variables.
This function supports variable-length inputs. The mini-batch size of the current input must be equal to or smaller than that of the previous one. When the mini-batch size of x is smaller than that of c, this function only updates c[0:len(x)] and doesn’t change the rest of c, c[len(x):]. So, please sort input sequences in descending order of length before applying the function.
Parameters:
 c_prev (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the previous cell state. The cell state should be a zero array or the output of the previous call of LSTM.
 x (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the sources of cell input, input gate, forget gate and output gate. Its second dimension must be four times the size of that of the cell state.
Returns: Two Variable objects c and h. c is the updated cell state. h indicates the outgoing signal.
Return type:
See the original paper proposing LSTM with forget gates: Long Short-Term Memory in Recurrent Neural Networks.
See also
Example
Assume that y is the current incoming signal, c is the previous cell state, and h is the previous outgoing signal from an lstm function. Each of y, c and h has n_units channels. The most typical preparation of x is:
>>> n_units = 100
>>> y = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> h = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> c = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> model = chainer.Chain(w=L.Linear(n_units, 4 * n_units),
...                       v=L.Linear(n_units, 4 * n_units),)
>>> x = model.w(y) + model.v(h)
>>> c, h = F.lstm(c, x)
This corresponds to calculating the input array x, or the input sources \(a, i, f, o\), from the current incoming signal y and the previous outgoing signal h. Different parameters are used for different kinds of input sources.
Note
We use the naming rule below.
 incoming signal
 The formal input of the formulation of LSTM (e.g. in NLP, a word vector or the output of a lower RNN layer). The input of chainer.links.LSTM is the incoming signal.
 input array
 The array which is linearly transformed from the incoming signal and the previous outgoing signal. The input array contains four sources: the sources of cell input, input gate, forget gate and output gate. The input of chainer.functions.LSTM is the input array.
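The update rule of lstm can be sketched in plain NumPy. This is only an illustration under stated assumptions, not Chainer's implementation: lstm_step and sigmoid are hypothetical helper names, and the sketch assumes that the four sources \(a, i, f, o\) occupy four contiguous blocks along the second axis of x, which may differ from the memory layout that F.lstm actually uses.

```python
import numpy as np


def sigmoid(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(c_prev, x):
    # Assumption: a, i, f, o are four contiguous blocks along axis 1.
    a, i, f, o = np.split(x, 4, axis=1)
    n = x.shape[0]  # current mini-batch size, may be smaller than len(c_prev)
    c = c_prev.copy()
    # Only the first n rows are updated; c[n:] is carried over unchanged.
    c[:n] = np.tanh(a) * sigmoid(i) + c_prev[:n] * sigmoid(f)
    h = np.tanh(c[:n]) * sigmoid(o)
    return c, h


c_prev = np.zeros((3, 2), dtype=np.float32)  # three sequences, two units
x = np.ones((2, 8), dtype=np.float32)        # only two sequences remain
c, h = lstm_step(c_prev, x)
print(c.shape, h.shape)
```

In this sketch the third row of c keeps its previous value because the current mini-batch holds only two sequences, which is why the inputs need to be sorted by length in descending order.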
maxout

chainer.functions.maxout(x, pool_size, axis=1)
Maxout activation function.
It accepts an input tensor x, reshapes the axis dimension (say the size being M * pool_size) into two dimensions (M, pool_size), and takes the maximum along the axis dimension.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. An \(n\)-dimensional (\(n \ge\) axis) float array. In general, its first dimension is assumed to be the minibatch dimension. The other dimensions are treated as one concatenated dimension.
 pool_size (int) – The size used for downsampling of pooling layer.
 axis (int) – The axis dimension to be reshaped. The size of the axis dimension should be M * pool_size.
Returns: Output variable. The shape of the output is the same as x except that the axis dimension is transformed from M * pool_size to M.
Return type:
See also
Example
Typically, x is the output of a linear layer or a convolution layer. The following is an example where we use maxout() in combination with a Linear link.
>>> in_size, out_size, pool_size = 10, 10, 10
>>> bias = np.arange(out_size * pool_size).astype('f')
>>> l = L.Linear(in_size, out_size * pool_size, initial_bias=bias)
>>> x = np.zeros((1, in_size), 'f')  # prepare data
>>> x = l(x)
>>> y = F.maxout(x, pool_size)
>>> x.shape
(1, 100)
>>> y.shape
(1, 10)
>>> x.reshape((out_size, pool_size)).data
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.],
       [ 30.,  31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.],
       [ 40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.,  49.],
       [ 50.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,  59.],
       [ 60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.],
       [ 70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,  79.],
       [ 80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,  89.],
       [ 90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.]], dtype=float32)
>>> y.data
array([[  9.,  19.,  29.,  39.,  49.,  59.,  69.,  79.,  89.,  99.]], dtype=float32)
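The reshape-then-max behaviour of maxout can be mimicked with NumPy alone. The function below is a hypothetical stand-in written for illustration (it is not the Chainer source); it assumes that consecutive groups of pool_size elements along axis form one pool, which is consistent with the example output above.

```python
import numpy as np


def maxout(x, pool_size, axis=1):
    # Hypothetical NumPy stand-in for the reshape-and-max step.
    size = x.shape[axis]
    if size % pool_size != 0:
        raise ValueError('axis dimension must be divisible by pool_size')
    # Split the axis dimension of size M * pool_size into (M, pool_size).
    shape = x.shape[:axis] + (size // pool_size, pool_size) + x.shape[axis + 1:]
    # Reduce over the newly created pool dimension.
    return x.reshape(shape).max(axis=axis + 1)


x = np.arange(100, dtype=np.float32).reshape(1, 100)
y = maxout(x, pool_size=10)
print(y.shape)
print(y)
```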
prelu

chainer.functions.prelu(x, W)
Parametric ReLU function.
It accepts two arguments: an input x and a weight array W, and computes the output as \(PReLU(x) = \max(x, W*x)\), where \(*\) is an elementwise multiplication for each sample in the batch.
When the PReLU function is combined with two-dimensional convolution, the elements of parameter \(W\) are typically shared across the same filter of different pixels. In order to support such usage, this function accepts a parameter array whose shape matches the leading dimensions of the input arrays, excluding the batch dimension.
For example, if \(W\) has the shape \((2, 3, 4)\), \(x\) must have the shape \((B, 2, 3, 4, S_1, ..., S_N)\), where \(B\) is the batch size and the number of trailing \(S\)’s is an arbitrary non-negative integer.
Parameters: Returns: Output variable
Return type: See also
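The shape rule for prelu can be illustrated with NumPy broadcasting. This is a sketch under assumptions: the function below is a hypothetical re-implementation that follows the formula \(\max(x, W*x)\) as stated in the text and is not taken from the Chainer source.

```python
import numpy as np


def prelu(x, W):
    # W covers the leading non-batch dimensions of x. Append singleton axes
    # so that W broadcasts over the batch axis and over the trailing axes.
    extra = x.ndim - W.ndim - 1
    if extra < 0 or x.shape[1:1 + W.ndim] != W.shape:
        raise ValueError('W must match the leading non-batch dimensions of x')
    W_b = W.reshape((1,) + W.shape + (1,) * extra)
    return np.maximum(x, W_b * x)  # formula from the text: max(x, W * x)


x = np.full((5, 2, 3, 4, 6), -2.0, dtype=np.float32)  # (B, 2, 3, 4, S1)
W = np.full((2, 3, 4), 0.25, dtype=np.float32)
y = prelu(x, W)
print(y.shape, y.min(), y.max())
```

Negative inputs are scaled by the matching element of W, while positive inputs pass through unchanged as long as the weights are below one.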
relu

chainer.functions.relu(x, use_cudnn=True)
Rectified Linear Unit function.
\[f(x)=\max(0, x).\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type:
Example
>>> x = np.array([[-1, 0], [2, -3], [-2, 1]], 'f')
>>> np.any(x < 0)
True
>>> y = F.relu(x)
>>> np.any(y.data < 0)
False
>>> y.shape
(3, 2)
sigmoid

chainer.functions.sigmoid(x, use_cudnn=True)
Elementwise sigmoid logistic function.
\[f(x)=(1 + \exp(-x))^{-1}.\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type:
Example
It maps the input values into the range of \([0, 1]\).
>>> x = np.arange(-2, 3, 2).astype('f')
>>> x
array([-2.,  0.,  2.], dtype=float32)
>>> F.sigmoid(x).data
array([ 0.11920291,  0.5       ,  0.88079709], dtype=float32)
slstm

chainer.functions.slstm(c_prev1, c_prev2, x1, x2)
S-LSTM units as an activation function.
This function implements the S-LSTM unit. It is an extension of the LSTM unit applied to tree structures. The function is applied to binary trees, in which each node has two child nodes. It takes four arguments: the previous cell states c_prev1 and c_prev2, and the input arrays x1 and x2.
First, both input arrays x1 and x2 are split into eight arrays \(a_1, i_1, f_1, o_1\), and \(a_2, i_2, f_2, o_2\). They have the same shape along the second axis. This means that the second axes of x1 and x2 must be 4 times as long as those of c_prev1 and c_prev2.
The split input arrays correspond to:
 \(a_i\) : sources of cell input
 \(i_i\) : sources of input gate
 \(f_i\) : sources of forget gate
 \(o_i\) : sources of output gate
It computes the updated cell state c and the outgoing signal h as:
\[\begin{split}c &= \tanh(a_1 + a_2) \sigma(i_1 + i_2) + c_{\text{prev}1} \sigma(f_1) + c_{\text{prev}2} \sigma(f_2), \\ h &= \tanh(c) \sigma(o_1 + o_2),\end{split}\]
where \(\sigma\) is the elementwise sigmoid function. The function returns c and h as a tuple.
Parameters:
 c_prev1 (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the previous cell state of the first child node. The cell state should be a zero array or the output of the previous call of LSTM.
 c_prev2 (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the previous cell state of the second child node.
 x1 (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the sources of cell input, input gate, forget gate and output gate from the first child node. Its second dimension must be four times the size of that of the cell state.
 x2 (Variable or numpy.ndarray or cupy.ndarray) – Variable that holds the input sources from the second child node.
Returns: Two Variable objects c and h. c is the cell state. h indicates the outgoing signal.
Return type:
See detail in paper: Long Short-Term Memory Over Tree Structures.
Example
Assume that c1 and c2 are the previous cell states of the children, and h1 and h2 are the previous outgoing signals from the children. Each of c1, c2, h1 and h2 has n_units channels. The most typical preparation of x1 and x2 is:
>>> n_units = 100
>>> h1 = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> h2 = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> c1 = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> c2 = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> model1 = chainer.Chain(w=L.Linear(n_units, 4 * n_units),
...                        v=L.Linear(n_units, 4 * n_units))
>>> model2 = chainer.Chain(w=L.Linear(n_units, 4 * n_units),
...                        v=L.Linear(n_units, 4 * n_units))
>>> x1 = model1.w(c1) + model1.v(h1)
>>> x2 = model2.w(c2) + model2.v(h2)
>>> c, h = F.slstm(c1, c2, x1, x2)
This corresponds to calculating the input array x1, or the input sources \(a_1, i_1, f_1, o_1\), from the previous cell state of the first child node c1 and the previous outgoing signal from the first child node h1. Different parameters are used for different kinds of input sources.
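The S-LSTM update for one tree node can be written out in NumPy as follows. This is a sketch only: slstm_step and sigmoid are hypothetical helper names, and the four sources of each input are assumed to be contiguous blocks along the second axis, which may not match the layout used by F.slstm.

```python
import numpy as np


def sigmoid(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))


def slstm_step(c_prev1, c_prev2, x1, x2):
    # Assumption: each input holds a, i, f, o as contiguous blocks on axis 1.
    a1, i1, f1, o1 = np.split(x1, 4, axis=1)
    a2, i2, f2, o2 = np.split(x2, 4, axis=1)
    # Cell input and input gate are shared; each child has its own forget gate.
    c = (np.tanh(a1 + a2) * sigmoid(i1 + i2)
         + c_prev1 * sigmoid(f1) + c_prev2 * sigmoid(f2))
    h = np.tanh(c) * sigmoid(o1 + o2)
    return c, h


n_units = 3
c1 = np.ones((1, n_units))
c2 = 3.0 * np.ones((1, n_units))
x1 = np.zeros((1, 4 * n_units))
x2 = np.zeros((1, 4 * n_units))
c, h = slstm_step(c1, c2, x1, x2)
print(c, h)
```

With all-zero inputs every gate equals 0.5, so the new cell state is the average of the two child cell states.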
softmax

chainer.functions.softmax(x, use_cudnn=True, axis=1)
Softmax function.
This function computes its softmax along an axis. Let \(x = (x_1, x_2, \dots, x_d)^{\top}\) be the d-dimensional index array and \(f(x)\) be the d-dimensional input array. For each index \(x\) of the input array \(f(x)\), it computes the probability \(p(x)\) defined as \(p(x) = {\exp(f(x)) \over \sum_{x_2} \exp(f(x))}\).
Parameters: Returns: Output variable.
Return type:
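The normalization along an axis can be checked with a small NumPy sketch. The function below is a hypothetical stand-in rather than the Chainer implementation; subtracting the per-row maximum is a common numerical-stability step added here as an assumption, and it does not change the result.

```python
import numpy as np


def softmax(x, axis=1):
    # Shift by the maximum along the axis to avoid overflow in exp.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    # Normalize so that the values along the axis sum to one.
    return e / e.sum(axis=axis, keepdims=True)


x = np.array([[0, 1, 2], [0, 2, 4]], dtype=np.float32)
p = softmax(x)
print(p)
print(p.sum(axis=1))
```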
softplus

chainer.functions.softplus(x, beta=1.0)
Elementwise softplus function.
The softplus function is a smooth approximation of ReLU.
\[f(x)=\frac{1}{\beta}\log(1 + \exp(\beta x)),\]
where \(\beta\) is a parameter. As \(\beta\) increases, the curve bends more sharply and the function becomes closer to ReLU.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 beta (float) – Parameter \(\beta\).
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type:
Example
>>> x = np.arange(-2, 3, 2).astype('f')
>>> x
array([-2.,  0.,  2.], dtype=float32)
>>> F.softplus(x, beta=1.0).data
array([ 0.126928  ,  0.69314718,  2.12692809], dtype=float32)
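The effect of \(\beta\) on softplus can be seen numerically with a short NumPy sketch. The helper below is hypothetical and uses np.logaddexp to evaluate \(\log(1 + \exp(\beta x))\) in a numerically stable way; it is not the Chainer implementation.

```python
import numpy as np


def softplus(x, beta=1.0):
    # log(1 + exp(beta * x)) / beta, evaluated without overflow.
    return np.logaddexp(0.0, beta * x) / beta


x = np.array([-2.0, 0.0, 2.0], dtype=np.float32)
relu = np.maximum(x, 0.0)
y1 = softplus(x, beta=1.0)
y10 = softplus(x, beta=10.0)
print(y1)
# The maximum gap to ReLU shrinks as beta grows.
print(np.abs(y1 - relu).max(), np.abs(y10 - relu).max())
```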
tanh

chainer.functions.tanh(x, use_cudnn=True)
Elementwise hyperbolic tangent function.
\[f(x)=\tanh(x).\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable. A \((s_1, s_2, ..., s_N)\)-shaped float array.
Return type:
Example
>>> x = np.arange(-1, 4, 2).astype('f')
>>> x
array([-1.,  1.,  3.], dtype=float32)
>>> F.tanh(x).data
array([-0.76159418,  0.76159418,  0.99505478], dtype=float32)
Array manipulations
broadcast

chainer.functions.broadcast(*args)
Broadcast given variables.
Parameters: args (Variable or numpy.ndarray or cupy.ndarray) – Input variables to be broadcasted. Each dimension of the shapes of the input variables must have the same size.
Returns: Variable or tuple of Variable objects which are broadcasted from given arguments.
Return type: Variable
Example
>>> x = np.random.uniform(0, 1, (3, 2)).astype('f')
>>> y = F.broadcast(x)
>>> np.all(x == y.data)
True
>>> z = np.random.uniform(0, 1, (3, 2)).astype('f')
>>> y, w = F.broadcast(x, z)
>>> np.all(x == y.data) & np.all(z == w.data)
True
broadcast_to

chainer.functions.broadcast_to(x, shape)
Broadcast a given variable to a given shape.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable to be broadcasted. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 shape (tuple) – Tuple of int of the shape of the output variable.
Returns: Output variable broadcasted to the given shape.
Return type:
Example
>>> x = np.arange(0, 3)
>>> x
array([0, 1, 2])
>>> y = F.broadcast_to(x, (3, 3))
>>> y.data
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
cast

chainer.functions.cast(x, typ)
Cast an input variable to a given type.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable to be cast. A \((s_1, s_2, ..., s_N)\)-shaped float array.
 typ (str of dtype or numpy.dtype) – Typecode or data type to cast.
Returns: Variable holding a cast array.
Return type:
Example
>>> x = np.arange(0, 3, dtype=np.float64)
>>> x.dtype
dtype('float64')
>>> y = F.cast(x, np.float32)
>>> y.dtype
dtype('float32')
>>> y = F.cast(x, 'float16')
>>> y.dtype
dtype('float16')
concat

chainer.functions.concat(xs, axis=1)
Concatenates given variables along an axis.
Parameters:
 xs (tuple of Variable or numpy.ndarray or cupy.ndarray) – Input variables to be concatenated. The variables must have the same shape, except in the dimension corresponding to axis.
 axis (int) – The axis along which the arrays will be joined. Default is 1.
Returns: The concatenated variable.
Return type:
Example
>>> x = np.arange(0, 12).reshape(3, 4)
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> y = np.arange(0, 3).reshape(3, 1)
>>> y
array([[0],
       [1],
       [2]])
>>> z = F.concat((x, y), axis=1)
>>> z.data
array([[ 0,  1,  2,  3,  0],
       [ 4,  5,  6,  7,  1],
       [ 8,  9, 10, 11,  2]])
copy

chainer.functions.copy(x, dst)
Copies the input variable onto the specified device.
This function copies the array of the input variable onto the device specified by dst. When dst == -1, it copies the array onto the host memory. This function supports copies from host to host, from host to device, from device to device and from device to host.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Variable to be copied.
 dst (int) – Target device specifier.
Returns: Output variable.
Return type:
Example
>>> import chainer.cuda as cuda
>>> x = np.random.uniform(-1, 1, (5, 10))
>>> cuda.get_device_from_array(x).id
-1
>>> y = F.copy(x, 0)  # from host to device0
>>> cuda.get_device_from_array(y.data).id
0
>>> z = F.copy(y, -1)  # from device0 to host
>>> cuda.get_device_from_array(z.data).id
-1
depth2space

chainer.functions.depth2space(X, r)
Computes the depth2space transformation for sub-pixel calculations.
Parameters:
 X (Variable or numpy.ndarray or cupy.ndarray) – Variable holding a 4d array of shape (batch, channel * r * r, dim1, dim2).
 r (int) – The upscaling factor.
Returns: A variable holding the upscaled array from interspersed depth layers. The shape is (batch, channel, dim1 * r, dim2 * r).
Return type:
Note
This can be used to compute super-resolution transformations. See https://arxiv.org/abs/1609.05158 for details.
See also
Example
>>> X = np.arange(24).reshape(1, 4, 2, 3).astype('f')
>>> X.shape
(1, 4, 2, 3)
>>> X
array([[[[  0.,   1.,   2.],
         [  3.,   4.,   5.]],
        [[  6.,   7.,   8.],
         [  9.,  10.,  11.]],
        [[ 12.,  13.,  14.],
         [ 15.,  16.,  17.]],
        [[ 18.,  19.,  20.],
         [ 21.,  22.,  23.]]]], dtype=float32)
>>> y = F.depth2space(X, 2)
>>> y.shape
(1, 1, 4, 6)
>>> y.data
array([[[[  0.,   6.,   1.,   7.,   2.,   8.],
         [ 12.,  18.,  13.,  19.,  14.,  20.],
         [  3.,   9.,   4.,  10.,   5.,  11.],
         [ 15.,  21.,  16.,  22.,  17.,  23.]]]], dtype=float32)
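The depth2space rearrangement can be reproduced with NumPy reshapes and transposes. The sketch below is one way to obtain the example output above; the function is a hypothetical stand-in, and its channel ordering for inputs with more than one output channel is an assumption rather than a statement about F.depth2space.

```python
import numpy as np


def depth2space(X, r):
    n, c, h, w = X.shape
    c //= r * r
    X = X.transpose(0, 2, 3, 1)        # (n, h, w, c * r * r)
    X = X.reshape(n, h, w, r, r, c)    # split depth into an r x r block
    X = X.transpose(0, 1, 3, 2, 4, 5)  # interleave block rows with image rows
    X = X.reshape(n, h * r, w * r, c)  # merge into the upscaled image
    return X.transpose(0, 3, 1, 2)     # back to (n, c, h * r, w * r)


X = np.arange(24, dtype=np.float32).reshape(1, 4, 2, 3)
y = depth2space(X, 2)
print(y.shape)
print(y[0, 0])
```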
dstack

chainer.functions.dstack(xs)
Concatenate variables along third axis (depth wise).
Parameters: xs (list of Variable or numpy.ndarray or cupy.ndarray) – Input variables to be concatenated. The variables must have the same ndim. When the variables have the third axis (i.e. \(ndim \geq 3\)), the variables must have the same shape along all but the third axis. When the variables do not have the third axis (i.e. \(ndim < 3\)), the variables must have the same shape.
Returns: Output variable. When the input variables have the third axis (i.e. \(ndim \geq 3\)), the shapes of inputs and output are the same along all but the third axis. The length of the third axis is the sum of the lengths of the inputs’ third axes. When the shape of the variables is (N1, N2) (i.e. \(ndim = 2\)), the shape of the output is (N1, N2, 2). When the shape of the variables is (N1,) (i.e. \(ndim = 1\)), the shape of the output is (1, N1, 2). When the shape of the variables is () (i.e. \(ndim = 0\)), the shape of the output is (1, 1, 2).
Return type: Variable
Example
>>> x1 = np.array((1, 2, 3))
>>> x1.shape
(3,)
>>> x2 = np.array((2, 3, 4))
>>> x2.shape
(3,)
>>> y = F.dstack((x1, x2))
>>> y.shape
(1, 3, 2)
>>> y.data
array([[[1, 2],
        [2, 3],
        [3, 4]]])
>>> x1 = np.arange(0, 6).reshape(3, 2)
>>> x1.shape
(3, 2)
>>> x1
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> x2 = np.arange(6, 12).reshape(3, 2)
>>> x2.shape
(3, 2)
>>> x2
array([[ 6,  7],
       [ 8,  9],
       [10, 11]])
>>> y = F.dstack([x1, x2])
>>> y.shape
(3, 2, 2)
>>> y.data
array([[[ 0,  6],
        [ 1,  7]],
       [[ 2,  8],
        [ 3,  9]],
       [[ 4, 10],
        [ 5, 11]]])
>>> x1 = np.arange(0, 12).reshape(3, 2, 2)
>>> x2 = np.arange(12, 18).reshape(3, 2, 1)
>>> y = F.dstack([x1, x2])
>>> y.shape
(3, 2, 3)
>>> y.data
array([[[ 0,  1, 12],
        [ 2,  3, 13]],
       [[ 4,  5, 14],
        [ 6,  7, 15]],
       [[ 8,  9, 16],
        [10, 11, 17]]])
expand_dims

chainer.functions.expand_dims(x, axis)
Expands dimensions of an input variable without copy.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable.
 axis (int) – Position where the new axis is to be inserted. The axis parameter is acceptable when \(-ndim - 1 \leq axis \leq ndim\). (ndim is the dimension of input variables). When \(axis < 0\), the result is the same as with \(ndim + 1 - |axis|\).
Returns: Variable that holds an expanded input. The ndim of the output is one greater than that of x.
Return type:
Example
>>> x = np.array([1, 2, 3])
>>> x.shape
(3,)
>>> y = F.expand_dims(x, axis=0)
>>> y.shape
(1, 3)
>>> y.data
array([[1, 2, 3]])
>>> y = F.expand_dims(x, axis=1)
>>> y.shape
(3, 1)
>>> y.data
array([[1],
       [2],
       [3]])
>>> y = F.expand_dims(x, axis=-2)
>>> y.shape
(1, 3)
>>> y.data
array([[1, 2, 3]])
flatten
fliplr
flipud
get_item

chainer.functions.get_item(x, slices)
Extract elements from array with specified shape, axes and offsets.
Parameters:
Returns: Variable object which contains sliced array of x.
Return type:
Note
It only supports types that are supported by CUDA’s atomicAdd when an integer array is included in slices. The supported types are numpy.float32, numpy.int32, numpy.uint32, numpy.uint64 and numpy.ulonglong.
Note
It does not support slices that contain multiple boolean arrays.
Note
See NumPy document for details of indexing.
hstack

chainer.functions.hstack(xs)
Concatenate variables horizontally (column wise).
Parameters: xs (list of Variable or numpy.ndarray or cupy.ndarray) – Input variables to be concatenated. The variables must have the same ndim. When the variables have the second axis (i.e. \(ndim \geq 2\)), the variables must have the same shape along all but the second axis. When the variables do not have the second axis (i.e. \(ndim < 2\)), the variables need not have the same shape.
Returns: Output variable. When the input variables have the second axis (i.e. \(ndim \geq 2\)), the shapes of inputs and output are the same along all but the second axis. The length of the second axis is the sum of the lengths of the inputs’ second axes. When the variables do not have the second axis (i.e. \(ndim < 2\)), the shape of the output is (N, ) (N is the sum of the input variables’ sizes).
Return type: Variable
Example
>>> x1 = np.array((1, 2, 3))
>>> x1.shape
(3,)
>>> x2 = np.array((2, 3, 4))
>>> x2.shape
(3,)
>>> y = F.hstack((x1, x2))
>>> y.shape
(6,)
>>> y.data
array([1, 2, 3, 2, 3, 4])
>>> x1 = np.arange(0, 12).reshape(3, 4)
>>> x1.shape
(3, 4)
>>> x1
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> x2 = np.arange(12, 18).reshape(3, 2)
>>> x2.shape
(3, 2)
>>> x2
array([[12, 13],
       [14, 15],
       [16, 17]])
>>> y = F.hstack([x1, x2])
>>> y.shape
(3, 6)
>>> y.data
array([[ 0,  1,  2,  3, 12, 13],
       [ 4,  5,  6,  7, 14, 15],
       [ 8,  9, 10, 11, 16, 17]])
im2col

chainer.functions.im2col(x, ksize, stride=1, pad=0, cover_all=False, dilate=1)
Extract patches from an image based on the filter.
This function rearranges patches of an image and puts them in the channel dimension of the output.
Patches are extracted at positions shifted by multiples of stride from the first position -pad for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.
Notation: here is a notation.
 \(n\) is the batch size.
 \(c\) is the number of the input channels.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(k_H\) and \(k_W\) are the height and width of the filters, respectively.
 \(s_Y\) and \(s_X\) are the strides of the filter.
 \(p_H\) and \(p_W\) are the spatial padding sizes.
 \(d_Y\) and \(d_X\) are the dilation factors of filter application.
The output size \((h_O, w_O)\) is determined by the following equations when cover_all = False:
\[\begin{split}h_O &= (h + 2p_H - k_H - (k_H - 1) * (d_Y - 1)) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W - (k_W - 1) * (d_X - 1)) / s_X + 1.\end{split}\]
When cover_all = True, the output size is determined by the following equations:
\[\begin{split}h_O &= (h + 2p_H - k_H - (k_H - 1) * (d_Y - 1) + s_Y - 1) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W - (k_W - 1) * (d_X - 1) + s_X - 1) / s_X + 1.\end{split}\]
Parameters:
 x (Variable) – Input variable of shape \((n, c, h, w)\).
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 cover_all (bool) – If True, all spatial locations are rearranged into some output pixels. It may make the output size larger.
 dilate (int or pair of ints) – Dilation factor of filter applications. dilate=d and dilate=(d, d) are equivalent.
Returns: Output variable whose shape is \((n, c \cdot k_H \cdot k_W, h_O, w_O)\)
Return type:
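The output-size equations of im2col can be evaluated with a small helper. The function name im2col_out_size is hypothetical, and the sketch assumes that the division in the equations is integer (floor) division, which the text does not state explicitly.

```python
def im2col_out_size(size, k, s, p, d=1, cover_all=False):
    # size: input height or width, k: filter size, s: stride,
    # p: padding, d: dilation factor (all along one spatial axis).
    base = size + 2 * p - k - (k - 1) * (d - 1)
    if cover_all:
        # Adding s - 1 rounds up so that every input location is covered.
        return (base + s - 1) // s + 1
    return base // s + 1  # assumption: "/" means floor division


h_o = im2col_out_size(7, k=3, s=2, p=0)
w_o = im2col_out_size(8, k=3, s=2, p=0)
w_o_cover = im2col_out_size(8, k=3, s=2, p=0, cover_all=True)
h_o_dilated = im2col_out_size(7, k=3, s=1, p=0, d=2)
print(h_o, w_o, w_o_cover, h_o_dilated)
```

For a width of 8 with a filter of size 3 and stride 2, the last input column is skipped unless cover_all is set, in which case one more output column is produced.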
pad

chainer.functions.pad(x, pad_width, mode, **keywords)
Pad an input variable.
Parameters:
 x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data.
 pad_width (int or array-like) – Number of values padded to the edges of each axis.
 mode (str) – Specifies how the function fills the periphery of the array.
 constant: Pads with constant values.
 constant_values (int or array-like) – The values are padded for each axis.
Returns: Output variable.
Return type:
permutate

chainer.functions.permutate(x, indices, axis=0, inv=False)
Permutates a given variable along an axis.
This function permutates x with the given indices. That means y[i] = x[indices[i]] for all i. Note that this result is the same as y = x.take(indices). indices must be a permutation of [0, 1, ..., len(x) - 1].
When inv is True, indices is treated as its inverse. That means y[indices[i]] = x[i].
Parameters:
Returns: Output variable.
Return type:
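Both directions of permutate can be expressed with NumPy indexing. The function below is a hypothetical NumPy stand-in for illustration; it realizes the inv=True case by taking with the inverse permutation, which is equivalent to the relation y[indices[i]] = x[i].

```python
import numpy as np


def permutate(x, indices, axis=0, inv=False):
    indices = np.asarray(indices)
    if inv:
        # argsort of a permutation yields its inverse permutation.
        indices = np.argsort(indices)
    return np.take(x, indices, axis=axis)


x = np.array([10, 20, 30, 40])
idx = [2, 0, 3, 1]
y = permutate(x, idx)            # y[i] = x[idx[i]]
z = permutate(y, idx, inv=True)  # z[idx[i]] = y[i], which undoes the step above
print(y, z)
```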
reshape
resize_images

chainer.functions.resize_images(x, output_shape)
Resize images to the given shape.
This function resizes 2D data to output_shape. Currently, only bilinear interpolation is supported as the sampling method.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) is the number of the input channels.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(h_O\) and \(w_O\) are the height and width of the output image.
Parameters: Returns: Resized image whose shape is \((n, c_I, h_O, w_O)\).
Return type:
rollaxis
select_item
separate

chainer.functions.separate(x, axis=0)
Separates an array along a given axis.
This function separates an array along a given axis. For example, suppose the shape of an array is (2, 3, 4). When the array is separated with axis=1, three (2, 4) arrays are returned.
This function is an inverse of chainer.functions.stack().
Parameters:
 x (chainer.Variable) – Variable to be separated.
 axis (int) – Axis along which variables are separated.
Returns: Output variables.
Return type: tuple of chainer.Variable
See also
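The relation between separate and stack can be checked with NumPy. The helper below is a hypothetical stand-in that mirrors the shapes described in the text; it is a sketch, not the Chainer implementation.

```python
import numpy as np


def separate(x, axis=0):
    # Split into slices of length one along axis, then drop that axis.
    parts = np.split(x, x.shape[axis], axis=axis)
    return tuple(np.squeeze(p, axis=axis) for p in parts)


x = np.arange(24).reshape(2, 3, 4)
ys = separate(x, axis=1)
print(len(ys), ys[0].shape)

# Stacking the pieces along the same axis restores the original array.
restored = np.stack(ys, axis=1)
print(np.array_equal(restored, x))
```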
space2depth

chainer.functions.space2depth(X, r)
Computes the space2depth transformation for sub-pixel calculations.
Parameters:
 X (Variable or numpy.ndarray or cupy.ndarray) – Variable holding a 4d array of shape (batch, channel, dim1 * r, dim2 * r).
 r (int) – The downscaling factor.
Returns: A variable holding the downscaled layer array from sub-pixel array sampling. The shape is (batch, channel * r * r, dim1, dim2).
Return type:
Note
This can be used to compute inverse super-resolution transformations. See https://arxiv.org/abs/1609.05158 for details.
See also
Example
>>> X = np.arange(24).reshape(1, 1, 4, 6).astype('f')
>>> X.shape
(1, 1, 4, 6)
>>> X
array([[[[  0.,   1.,   2.,   3.,   4.,   5.],
         [  6.,   7.,   8.,   9.,  10.,  11.],
         [ 12.,  13.,  14.,  15.,  16.,  17.],
         [ 18.,  19.,  20.,  21.,  22.,  23.]]]], dtype=float32)
>>> y = F.space2depth(X, 2)
>>> y.shape
(1, 4, 2, 3)
>>> y.data
array([[[[  0.,   2.,   4.],
         [ 12.,  14.,  16.]],
        [[  1.,   3.,   5.],
         [ 13.,  15.,  17.]],
        [[  6.,   8.,  10.],
         [ 18.,  20.,  22.]],
        [[  7.,   9.,  11.],
         [ 19.,  21.,  23.]]]], dtype=float32)
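The space2depth rearrangement can likewise be reproduced with a NumPy reshape and transpose. The sketch below is one way to obtain the example output above; the function is a hypothetical stand-in, and the channel ordering for inputs with more than one channel is an assumption.

```python
import numpy as np


def space2depth(X, r):
    n, c, h, w = X.shape
    # Split each spatial axis into (coarse position, offset inside r x r block).
    X = X.reshape(n, c, h // r, r, w // r, r)
    # Move the two block offsets in front of the channel axis.
    X = X.transpose(0, 3, 5, 1, 2, 4)
    # Merge offsets and channels into the new depth dimension.
    return X.reshape(n, r * r * c, h // r, w // r)


X = np.arange(24, dtype=np.float32).reshape(1, 1, 4, 6)
y = space2depth(X, 2)
print(y.shape)
print(y[0, 0])
```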
spatial_transformer_grid

chainer.functions.spatial_transformer_grid(theta, output_shape, use_cudnn=True)
2D Spatial Transformer grid.
This function generates coordinates of the points sampled from an image to perform warping described in Spatial Transformer Networks.
Given a coordinate in the warped image \((x_i^t, y_i^t)\), the point sampled from the source image \((x_i^s, y_i^s)\) is calculated by the following equation.
\[\begin{split}\left(\begin{matrix} x_i^s \\ y_i^s \end{matrix}\right) = \left(\begin{matrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{matrix}\right) \left(\begin{matrix} x_i^t \\ y_i^t \\ 1 \end{matrix}\right)\end{split}\]
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(h_O\) and \(w_O\) are the height and the width of the output image.
Parameters:  theta (Variable) – An array of shape \((n, 2, 3)\). This is a batch of \(2 \times 3\) matrix used for the warping described above.
 output_shape (tuple) – A tuple of 2 elements: \(h_O, w_O\).
 use_cudnn (bool) – If True, then this function uses cuDNN if available. Note that cuDNN supports SpatialTransformerGrid from version 5.0.0.
Returns: A variable of shape \((n, 2, h_O, w_O)\). In the 2nd dimension, the first element is the coordinate along the x axis, and the second element is the coordinate along the y axis. All the coordinates in the image are scaled to fit the range \([-1, 1]\). This means that the coordinate \((-1, -1)\) corresponds to the upper-left corner of the input image.
Return type:
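The affine mapping from target to source coordinates can be sketched in NumPy. The function affine_grid is a hypothetical name, and the sketch assumes that target coordinates are equally spaced over \([-1, 1]\) with both end points included; it is an illustration of the equation above, not the Chainer implementation.

```python
import numpy as np


def affine_grid(theta, output_shape):
    h_o, w_o = output_shape
    # Target coordinates; (-1, -1) is the upper-left corner (assumed spacing).
    ys, xs = np.meshgrid(np.linspace(-1, 1, h_o),
                         np.linspace(-1, 1, w_o), indexing='ij')
    # Homogeneous coordinates (x_t, y_t, 1) for every output pixel.
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h_o * w_o)])
    # (n, 2, 3) @ (3, h_o * w_o) -> (n, 2, h_o * w_o)
    grid = theta @ coords
    return grid.reshape(theta.shape[0], 2, h_o, w_o)


# The identity transform maps every target point onto itself.
theta = np.array([[[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]]])
grid = affine_grid(theta, (2, 3))
print(grid.shape)
print(grid[0, 0])  # x coordinates
print(grid[0, 1])  # y coordinates
```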
spatial_transformer_sampler

chainer.functions.spatial_transformer_sampler(x, grid, use_cudnn=True)
2D Spatial Transformer sampler.
This is a differentiable image sampler. With a set of sampling points grid and an input feature map x, this produces a sampled output feature map.
This function currently only supports bilinear interpolation as a sampling kernel.
When coordinates in grid are outside the range \([-1, 1]\), values are sampled from a zero-padded input image.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) is the number of the input channels.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(h_O\) and \(w_O\) are the height and width of the output image.
See detail in the following paper: Spatial Transformer Networks.
Parameters:
 x (Variable) – Input variable of shape \((n, c_I, h, w)\).
 grid (Variable) – Coordinate variable of shape \((n, 2, h_O, w_O)\). Each coordinate defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output. grid[idx, :, i, j] corresponds to the coordinate that is used to sample the values for an output pixel at location \((i, j)\).
 In the second dimension, the first coordinate corresponds to the location along the horizontal axis, and the second coordinate corresponds to the location along the vertical axis.
 The coordinate \((-1, -1)\) corresponds to the upper-left corner of the input image.
 use_cudnn (bool) – If True, then this function uses cuDNN if available. Note that cuDNN supports SpatialTransformerSampler from version 5.0.0.
Returns: Output feature map of shape \((n, c_I, h_O, w_O)\).
Return type:
split_axis

chainer.functions.split_axis(x, indices_or_sections, axis, force_tuple=False)
Splits given variables along an axis.
Parameters:
 x (tuple of Variables) – Variables to be split.
 indices_or_sections (int or 1-D array) – If this argument is an integer, N, the array will be divided into N equal arrays along axis. If it is a 1-D array of sorted integers, it indicates the positions where the array is split.
 axis (int) – Axis that the input array is split along.
 force_tuple (bool) – If True, this method returns a tuple even when the number of outputs is one.
Returns:
Return type:
Note
This function raises ValueError if at least one of the outputs is split to zero-size (i.e. the axis-th value of its shape is zero).
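The two forms of indices_or_sections and the zero-size check can be illustrated with np.split. The function below is a hypothetical NumPy stand-in written from the description above, including the assumption that a single output is returned bare unless force_tuple is set.

```python
import numpy as np


def split_axis(x, indices_or_sections, axis, force_tuple=False):
    parts = np.split(x, indices_or_sections, axis=axis)
    if any(p.shape[axis] == 0 for p in parts):
        raise ValueError('an output would have zero size along axis')
    if len(parts) == 1 and not force_tuple:
        return parts[0]
    return tuple(parts)


x = np.arange(12).reshape(2, 6)
a, b, c = split_axis(x, 3, axis=1)      # three equal parts
left, right = split_axis(x, [4], axis=1)  # split at column 4
single = split_axis(x, 1, axis=1, force_tuple=True)
print(a.shape, left.shape, right.shape, len(single))
```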
squeeze

chainer.functions.squeeze(x, axis=None)
Remove dimensions of size one from the shape of an ndarray.
Parameters:
 x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data.
 axis (None or int or tuple of ints) – A subset of the single-dimensional entries in the shape to remove. If None is supplied, all of them are removed. The dimension index starts at zero. If an axis with dimension greater than one is selected, an error is raised.
Returns: Variable whose dimensions of size 1 are removed.
Return type:
stack

chainer.functions.stack(xs, axis=0)
Concatenate variables along a new axis.
Parameters:
 xs (list of Variable or numpy.ndarray or cupy.ndarray) – Input variables to be concatenated. The variables must have the same shape.
 axis (int) – The axis along which the arrays will be stacked. The axis parameter is acceptable when \(-ndim - 1 \leq axis \leq ndim\). (ndim is the dimension of input variables). When \(axis < 0\), the result is the same as with \(ndim + 1 - |axis|\).
Returns: Output variable. Let x_1, x_2, ..., x_n and y be the input variables and the output variable; then y[:, ..., 0, ..., :] is x_1, y[:, ..., 1, ..., :] is x_2 and y[:, ..., n-1, ..., :] is x_n (the indexed axis indicates the axis).
Return type:
Example
>>> x1 = np.arange(0, 12).reshape(3, 4)
>>> x1.shape
(3, 4)
>>> x1
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> x2 = np.arange(12, 24).reshape(3, 4)
>>> x2.shape
(3, 4)
>>> x2
array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
>>> y = F.stack([x1, x2], axis=0)
>>> y.shape
(2, 3, 4)
>>> y.data
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],
       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> y = F.stack([x1, x2], axis=1)
>>> y.shape
(3, 2, 4)
>>> y.data
array([[[ 0,  1,  2,  3],
        [12, 13, 14, 15]],
       [[ 4,  5,  6,  7],
        [16, 17, 18, 19]],
       [[ 8,  9, 10, 11],
        [20, 21, 22, 23]]])
>>> y = F.stack([x1, x2], axis=2)
>>> y.shape
(3, 4, 2)
>>> y.data
array([[[ 0, 12],
        [ 1, 13],
        [ 2, 14],
        [ 3, 15]],
       [[ 4, 16],
        [ 5, 17],
        [ 6, 18],
        [ 7, 19]],
       [[ 8, 20],
        [ 9, 21],
        [10, 22],
        [11, 23]]])
>>> y = F.stack([x1, x2], axis=-1)
>>> y.shape
(3, 4, 2)
swapaxes
tile

chainer.functions.tile(x, reps)
Construct an array by tiling a given array.
Parameters:
 x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data.
 reps (int or tuple of ints) – The number of times x is replicated along each axis.
Returns: Variable holding the tiled array.
Return type:
transpose

chainer.functions.transpose(x, axes=None)
Permute the dimensions of an input variable without copy.
Parameters:
 x (Variable) – Input variable.
 axes (tuple of ints) – By default, reverse the dimensions, otherwise permute the axes according to the values given.
Returns: Variable whose axes are permuted.
Return type:
transpose_sequence
chainer.functions.transpose_sequence(xs)
Transpose a list of Variables.
This function transposes a list of Variables and returns a list of Variables. For example, if a user gives [(0, 1, 2, 3), (4, 5), (6)], the function returns [(0, 4, 6), (1, 5), (2), (3)]. Note that the given list must be sorted by the length of each Variable, longest first.
Parameters: xs (list of Variable) – Variables to transpose.
Returns: Transposed list.
Return type: tuple or Variable
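The transposition described above can be illustrated with plain NumPy arrays. This is only a sketch of the documented behavior, not Chainer's implementation; the helper name transpose_sequence_sketch is hypothetical.

```python
import numpy as np


def transpose_sequence_sketch(xs):
    """Transpose a list of sequences that is sorted by descending length."""
    lengths = [len(x) for x in xs]
    if any(a < b for a, b in zip(lengths, lengths[1:])):
        raise ValueError("sequences must be sorted by descending length")
    if not xs:
        return []
    # Output element t collects the t-th item of every sequence that is
    # long enough to have one.
    return [np.array([x[t] for x in xs if len(x) > t])
            for t in range(lengths[0])]


seqs = [np.array([0, 1, 2, 3]), np.array([4, 5]), np.array([6])]
transposed = transpose_sequence_sketch(seqs)
```

Each output array is no longer than the one before it, which is the ordering the n_step_* functions expect for their xs argument.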
vstack
chainer.functions.vstack(xs)
Concatenate variables vertically (row wise).
Parameters: xs (list of Variable or numpy.ndarray or cupy.ndarray) – Input variables to be concatenated. The variables must have the same ndim. When the variables have the second axis (i.e. \(ndim \geq 2\)), the variables must have the same shape along all but the first axis. When the variables do not have the second axis (i.e. \(ndim < 2\)), the variables must have the same shape.
Returns: Output variable. When the input variables have the second axis (i.e. \(ndim \geq 2\)), the shapes of inputs and output are the same along all but the first axis. The length of the first axis is the sum of the lengths of the inputs’ first axes. When the variables do not have the second axis (i.e. \(ndim < 2\)), the shape of the output is (n, N), where n is the number of input variables and N is the size of each input variable; two inputs give (2, N).
Return type: Variable
Example
>>> x1 = np.array((1, 2, 3))
>>> x1.shape
(3,)
>>> x2 = np.array((2, 3, 4))
>>> x2.shape
(3,)
>>> y = F.vstack((x1, x2))
>>> y.shape
(2, 3)
>>> y.data
array([[1, 2, 3],
       [2, 3, 4]])
>>> x1 = np.arange(0, 12).reshape(3, 4)
>>> x1.shape
(3, 4)
>>> x1
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> x2 = np.arange(12, 20).reshape(2, 4)
>>> x2.shape
(2, 4)
>>> x2
array([[12, 13, 14, 15],
       [16, 17, 18, 19]])
>>> y = F.vstack([x1, x2])
>>> y.shape
(5, 4)
>>> y.data
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
Neural network connections
bilinear
chainer.functions.bilinear(e1, e2, W, V1=None, V2=None, b=None)
Applies a bilinear function based on given parameters.
This is a building block of Neural Tensor Network (see the reference paper below). It takes two input variables and one or four parameters, and outputs one variable.
To be precise, denote six input arrays mathematically by \(e^1\in \mathbb{R}^{I\cdot J}\), \(e^2\in \mathbb{R}^{I\cdot K}\), \(W\in \mathbb{R}^{J \cdot K \cdot L}\), \(V^1\in \mathbb{R}^{J \cdot L}\), \(V^2\in \mathbb{R}^{K \cdot L}\), and \(b\in \mathbb{R}^{L}\), where \(I\) is the mini-batch size. In this document, we call \(V^1\), \(V^2\), and \(b\) linear parameters.
The output of forward propagation is calculated as
\[y_{il} = \sum_{jk} e^1_{ij} e^2_{ik} W_{jkl} + \sum_{j} e^1_{ij} V^1_{jl} + \sum_{k} e^2_{ik} V^2_{kl} + b_{l}.\]
Note that V1, V2, b are optional. If these are not given, then this function omits the last three terms in the above equation.
Note
This function accepts an input variable e1 or e2 of a non-matrix array. In this case, the leading dimension is treated as the batch dimension, and the other dimensions are reduced to one dimension.
Note
In the original paper, \(J\) and \(K\) must be equal and the author denotes \([V^1 V^2]\) (concatenation of matrices) by \(V\).
Parameters:
Returns: Output variable.
Return type: Variable
See: Reasoning With Neural Tensor Networks for Knowledge Base Completion [Socher+, NIPS2013].
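The forward equation for bilinear can be written directly with numpy.einsum. The following is a minimal NumPy sketch of that equation only, not Chainer's implementation; the name bilinear_forward is hypothetical, and it assumes that V1, V2 and b are either all given or all omitted.

```python
import numpy as np


def bilinear_forward(e1, e2, W, V1=None, V2=None, b=None):
    """y_il = sum_jk e1_ij e2_ik W_jkl (+ e1 V1 + e2 V2 + b)."""
    # Non-matrix inputs: keep the leading (batch) axis, flatten the rest.
    e1 = e1.reshape(len(e1), -1)
    e2 = e2.reshape(len(e2), -1)
    y = np.einsum('ij,ik,jkl->il', e1, e2, W)
    if V1 is not None:
        # The three linear terms are added together or not at all.
        y = y + e1 @ V1 + e2 @ V2 + b
    return y


rng = np.random.default_rng(0)
I, J, K, L = 2, 3, 4, 5
e1 = rng.standard_normal((I, J))
e2 = rng.standard_normal((I, K))
W = rng.standard_normal((J, K, L))
V1 = rng.standard_normal((J, L))
V2 = rng.standard_normal((K, L))
b = rng.standard_normal(L)
y = bilinear_forward(e1, e2, W, V1, V2, b)
y_quadratic_only = bilinear_forward(e1, e2, W)
```

The output has shape (I, L) in both cases; omitting the linear parameters leaves only the first term of the equation.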
convolution_2d
chainer.functions.convolution_2d(x, W, b=None, stride=1, pad=0, use_cudnn=True, cover_all=False, deterministic=False)
Two-dimensional convolution function.
This is an implementation of two-dimensional convolution in ConvNets. It takes three variables: the input image x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(h_I\) and \(w_I\) are the height and width of the input image, respectively.
 \(h_K\) and \(w_K\) are the height and width of the filters, respectively.
 \(h_P\) and \(w_P\) are the height and width of the spatial padding size, respectively.
Then the Convolution2D function computes correlations between filters and patches of size \((h_K, w_K)\) in x. Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at positions shifted by multiples of stride from the first position (-h_P, -w_P) for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.
Let \((s_Y, s_X)\) be the stride of filter application. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h_I + 2h_P - h_K) / s_Y + 1,\\ w_O &= (w_I + 2w_P - w_K) / s_X + 1.\end{split}\]
If the cover_all option is True, the filter will cover all spatial locations. So, if the last stride of the filter does not cover the end of the spatial locations, an additional stride will be applied to the end part of the spatial locations. In this case, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h_I + 2h_P - h_K + s_Y - 1) / s_Y + 1,\\ w_O &= (w_I + 2w_P - w_K + s_X - 1) / s_X + 1.\end{split}\]
If the bias vector is given, then it is added to all spatial locations of the output of convolution.
The two-dimensional convolution function is defined as follows.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable of shape \((n, c_I, h_I, w_I)\).
 W (Variable or numpy.ndarray or cupy.ndarray) – Weight variable of shape \((c_O, c_I, h_K, w_K)\).
 b (Variable or numpy.ndarray or cupy.ndarray) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 cover_all (bool) – If True, all spatial locations are convolved into some output pixels.
 deterministic (bool) – The output of this function can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v3.
Returns: Output variable of shape \((n, c_O, h_O, w_O)\).
Return type: Variable
See also
Example
>>> n = 10
>>> c_i, c_o = 3, 1
>>> h_i, w_i = 30, 40
>>> h_k, w_k = 10, 10
>>> h_p, w_p = 5, 5
>>> x = np.random.uniform(0, 1, (n, c_i, h_i, w_i)).astype('f')
>>> x.shape
(10, 3, 30, 40)
>>> W = np.random.uniform(0, 1, (c_o, c_i, h_k, w_k)).astype('f')
>>> W.shape
(1, 3, 10, 10)
>>> b = np.random.uniform(0, 1, (c_o,)).astype('f')
>>> b.shape
(1,)
>>> s_y, s_x = 5, 7
>>> y = F.convolution_2d(x, W, b, stride=(s_y, s_x), pad=(h_p, w_p))
>>> y.shape
(10, 1, 7, 6)
>>> h_o = int((h_i + 2 * h_p - h_k) / s_y + 1)
>>> w_o = int((w_i + 2 * w_p - w_k) / s_x + 1)
>>> y.shape == (n, c_o, h_o, w_o)
True
>>> y = F.convolution_2d(x, W, b, stride=(s_y, s_x), pad=(h_p, w_p), cover_all=True)
>>> y.shape == (n, c_o, h_o, w_o + 1)
True
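The two output-size formulas for convolution_2d can be checked without Chainer. The helper below is a hypothetical sketch (conv_outsize is not taken from this page) that assumes the division in the formulas is rounded down, as the int(...) calls in the example suggest.

```python
def conv_outsize(size, k, s, p, cover_all=False):
    """Output length along one spatial axis, using floor division."""
    if cover_all:
        return (size + 2 * p - k + s - 1) // s + 1
    return (size + 2 * p - k) // s + 1


# Same numbers as the example: 30x40 input, 10x10 filter, pad 5, stride (5, 7).
h_o = conv_outsize(30, k=10, s=5, p=5)
w_o = conv_outsize(40, k=10, s=7, p=5)
h_o_cover = conv_outsize(30, k=10, s=5, p=5, cover_all=True)
w_o_cover = conv_outsize(40, k=10, s=7, p=5, cover_all=True)
```

Along the height the stride divides the padded extent exactly, so cover_all changes nothing; along the width it adds one output position, which matches the w_o + 1 in the example.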
convolution_nd
chainer.functions.convolution_nd(x, W, b=None, stride=1, pad=0, use_cudnn=True, cover_all=False)
N-dimensional convolution function.
This is an implementation of N-dimensional convolution, which generalizes two-dimensional convolution in ConvNets. It takes three variables: the input x, the filter weight W and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(N\) is the number of spatial dimensions.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(d_1, d_2, ..., d_N\) are the size of each axis of the input’s spatial dimensions, respectively.
 \(k_1, k_2, ..., k_N\) are the size of each axis of the filters, respectively.
 \(l_1, l_2, ..., l_N\) are the size of each axis of the output’s spatial dimensions, respectively.
 \(p_1, p_2, ..., p_N\) are the size of each axis of the spatial padding size, respectively.
Then the convolution_nd function computes correlations between filters and patches of size \((k_1, k_2, ..., k_N)\) in x. Note that correlation here is equivalent to the inner product between expanded tensors. Patches are extracted at positions shifted by multiples of stride from the first position (-p_1, -p_2, ..., -p_N) for each spatial axis.
Let \((s_1, s_2, ..., s_N)\) be the stride of filter application. Then, the output size \((l_1, l_2, ..., l_N)\) is determined by the following equations:
\[l_n = (d_n + 2p_n - k_n) / s_n + 1 \ \ (n = 1, ..., N)\]
If the cover_all option is True, the filter will cover all spatial locations. So, if the last stride of the filter does not cover the end of the spatial locations, an additional stride will be applied to the end part of the spatial locations. In this case, the output size is determined by the following equations:
\[l_n = (d_n + 2p_n - k_n + s_n - 1) / s_n + 1 \ \ (n = 1, ..., N)\]
The N-dimensional convolution function is defined as follows.
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable of shape \((n, c_I, d_1, d_2, ..., d_N)\).
 W (Variable or numpy.ndarray or cupy.ndarray) – Weight variable of shape \((c_O, c_I, k_1, k_2, ..., k_N)\).
 b (Variable or numpy.ndarray or cupy.ndarray) – One-dimensional bias variable with length \(c_O\) (optional).
 stride (int or tuple of ints) – Stride of filter applications \((s_1, s_2, ..., s_N)\). stride=s is equivalent to (s, s, ..., s).
 pad (int or tuple of ints) – Spatial padding width for input arrays \((p_1, p_2, ..., p_N)\). pad=p is equivalent to (p, p, ..., p).
 use_cudnn (bool) – If True, then this function uses cuDNN if available. See below for the exact conditions.
 cover_all (bool) – If True, all spatial locations are convolved into some output pixels. It may make the output size larger. cover_all needs to be False if you want to use cuDNN.
Returns: Output variable of shape \((n, c_O, l_1, l_2, ..., l_N)\).
Return type: Variable
Note
This function uses the cuDNN implementation for its forward and backward computation if ALL of the following conditions are satisfied:
 cuda.cudnn_enabled is True
 use_cudnn is True
 The number of spatial dimensions is more than one.
 cover_all is False
 The input’s dtype is equal to the filter weight’s.
 The dtype is FP16, FP32 or FP64. (FP16 is only available when cuDNN version \(\geq\) v3.)
See also
Example
>>> n = 10
>>> c_i, c_o = 3, 1
>>> d1, d2, d3 = 30, 40, 50
>>> k1, k2, k3 = 10, 10, 10
>>> p1, p2, p3 = 5, 5, 5
>>> x = np.random.uniform(0, 1, (n, c_i, d1, d2, d3)).astype('f')
>>> x.shape
(10, 3, 30, 40, 50)
>>> W = np.random.uniform(0, 1, (c_o, c_i, k1, k2, k3)).astype('f')
>>> W.shape
(1, 3, 10, 10, 10)
>>> b = np.random.uniform(0, 1, (c_o)).astype('f')
>>> b.shape
(1,)
>>> s1, s2, s3 = 2, 4, 6
>>> y = F.convolution_nd(x, W, b, stride=(s1, s2, s3), pad=(p1, p2, p3))
>>> y.shape
(10, 1, 16, 11, 9)
>>> l1 = int((d1 + 2 * p1 - k1) / s1 + 1)
>>> l2 = int((d2 + 2 * p2 - k2) / s2 + 1)
>>> l3 = int((d3 + 2 * p3 - k3) / s3 + 1)
>>> y.shape == (n, c_o, l1, l2, l3)
True
>>> y = F.convolution_nd(x, W, b, stride=(s1, s2, s3), pad=(p1, p2, p3), cover_all=True)
>>> y.shape == (n, c_o, l1, l2, l3 + 1)
True
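To make the phrase "inner product between expanded tensors" concrete, here is a deliberately slow NumPy sketch of N-dimensional correlation with cover_all=False. It is an illustration under the shape conventions above, not Chainer's implementation, and the name convolution_nd_naive is hypothetical.

```python
import itertools

import numpy as np


def convolution_nd_naive(x, W, b=None, stride=1, pad=0):
    """x: (n, c_I, d_1..d_N), W: (c_O, c_I, k_1..k_N) -> (n, c_O, l_1..l_N)."""
    ndim = x.ndim - 2
    stride = (stride,) * ndim if isinstance(stride, int) else tuple(stride)
    pad = (pad,) * ndim if isinstance(pad, int) else tuple(pad)
    ksize = W.shape[2:]
    # Zero-pad the spatial axes only.
    padded = np.pad(x, [(0, 0), (0, 0)] + [(p, p) for p in pad],
                    mode='constant')
    out = tuple((d + 2 * p - k) // s + 1
                for d, p, k, s in zip(x.shape[2:], pad, ksize, stride))
    y = np.zeros((x.shape[0], W.shape[0]) + out, dtype=x.dtype)
    axes = list(range(1, ndim + 2))  # channel axis plus all filter axes
    for pos in itertools.product(*(range(l) for l in out)):
        # Patch whose corner sits at pos * stride in padded coordinates.
        window = tuple(slice(i * s, i * s + k)
                       for i, s, k in zip(pos, stride, ksize))
        patch = padded[(slice(None), slice(None)) + window]
        # Inner product of the patch with every filter: result is (n, c_O).
        y[(slice(None), slice(None)) + pos] = np.tensordot(
            patch, W, axes=(axes, axes))
    if b is not None:
        y += b.reshape((1, -1) + (1,) * ndim)
    return y


# One-dimensional case: sums of neighbouring pairs of 0..4.
x1 = np.arange(5, dtype=np.float64).reshape(1, 1, 5)
W1 = np.ones((1, 1, 2))
y1 = convolution_nd_naive(x1, W1)

# Three-dimensional case with per-axis strides.
x3 = np.random.uniform(0, 1, (2, 3, 6, 8, 10))
W3 = np.random.uniform(0, 1, (1, 3, 3, 3, 3))
y3 = convolution_nd_naive(x3, W3, stride=(1, 2, 3), pad=1)
```

The output shape follows the formula for l_n on each axis, here (6, 4, 4) for the three spatial dimensions.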
deconvolution_2d
chainer.functions.deconvolution_2d(x, W, b=None, stride=1, pad=0, outsize=None, use_cudnn=True, deterministic=False)
Two-dimensional deconvolution function.
This is an implementation of two-dimensional deconvolution. In most deep learning frameworks and papers, this function is called transposed convolution. But for historical reasons (e.g. the paper Deconvolutional Networks by Zeiler) and backward compatibility, this function is called deconvolution in Chainer.
It takes three variables: the input image x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(h_I\) and \(w_I\) are the height and width of the input image, respectively.
 \(h_K\) and \(w_K\) are the height and width of the filters, respectively.
 \(h_P\) and \(w_P\) are the height and width of the spatial padding size, respectively.
Let \((s_Y, s_X)\) be the stride of filter application. Then, the output size \((h_O, w_O)\) is estimated by the following equations:
\[\begin{split}h_O &= s_Y (h_I - 1) + h_K - 2h_P,\\ w_O &= s_X (w_I - 1) + w_K - 2w_P.\end{split}\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable of shape \((n, c_I, h_I, w_I)\).
 W (Variable or numpy.ndarray or cupy.ndarray) – Weight variable of shape \((c_I, c_O, h_K, w_K)\).
 b (Variable or numpy.ndarray or cupy.ndarray) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 outsize (tuple of int) – Expected output size of the deconvolution operation. It should be a pair of height and width \((h_O, w_O)\). The default value is None, in which case the output size is estimated from the input size, stride and pad.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 deterministic (bool) – The output of this function can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v3.
Returns: Output variable of shape \((n, c_O, h_O, w_O)\).
Return type: Variable
Example
>>> n = 10
>>> c_i, c_o = 1, 3
>>> h_i, w_i = 5, 10
>>> h_k, w_k = 10, 10
>>> h_p, w_p = 5, 5
>>> x = np.random.uniform(0, 1, (n, c_i, h_i, w_i)).astype('f')
>>> x.shape
(10, 1, 5, 10)
>>> W = np.random.uniform(0, 1, (c_i, c_o, h_k, w_k)).astype('f')
>>> W.shape
(1, 3, 10, 10)
>>> b = np.random.uniform(0, 1, c_o).astype('f')
>>> b.shape
(3,)
>>> s_y, s_x = 5, 5
>>> y = F.deconvolution_2d(x, W, b, stride=(s_y, s_x), pad=(h_p, w_p))
>>> y.shape
(10, 3, 20, 45)
>>> h_o = s_y * (h_i - 1) + h_k - 2 * h_p
>>> w_o = s_x * (w_i - 1) + w_k - 2 * w_p
>>> y.shape == (n, c_o, h_o, w_o)
True
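One way to see where the output size of deconvolution_2d comes from is to view transposed convolution as a scatter operation: every input pixel adds a weighted copy of the filter to the output, and pad pixels are then cropped from each border. The NumPy sketch below follows that view for the case where outsize is not given; it is an illustration, not Chainer's implementation, and deconvolution_2d_naive is a hypothetical name.

```python
import numpy as np


def deconvolution_2d_naive(x, W, b=None, stride=1, pad=0):
    """x: (n, c_I, h_I, w_I), W: (c_I, c_O, h_K, w_K) -> (n, c_O, h_O, w_O)."""
    s_y, s_x = (stride, stride) if isinstance(stride, int) else stride
    p_h, p_w = (pad, pad) if isinstance(pad, int) else pad
    n, c_i, h_i, w_i = x.shape
    _, c_o, h_k, w_k = W.shape
    # Uncropped canvas of size s * (d - 1) + k along each spatial axis.
    full = np.zeros((n, c_o, s_y * (h_i - 1) + h_k, s_x * (w_i - 1) + w_k),
                    dtype=x.dtype)
    for i in range(h_i):
        for j in range(w_i):
            # Input pixel (i, j) contributes its value times the filter,
            # summed over input channels, at offset (i * s_y, j * s_x).
            full[:, :, i * s_y:i * s_y + h_k, j * s_x:j * s_x + w_k] += \
                np.einsum('nc,cokl->nokl', x[:, :, i, j], W)
    # Cropping the padding gives s * (d - 1) + k - 2 * p.
    y = full[:, :, p_h:full.shape[2] - p_h, p_w:full.shape[3] - p_w]
    if b is not None:
        y = y + b.reshape(1, -1, 1, 1)
    return y


small = deconvolution_2d_naive(np.ones((1, 1, 2, 2)), np.ones((1, 1, 2, 2)))

x = np.random.uniform(0, 1, (10, 1, 5, 10))
W = np.random.uniform(0, 1, (1, 3, 10, 10))
b = np.random.uniform(0, 1, 3)
y = deconvolution_2d_naive(x, W, b, stride=5, pad=5)
```

With the same sizes as the example above, y has shape (10, 3, 20, 45). In the small 2x2 case the overlapping filter copies add up in the centre of the 3x3 output.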
deconvolution_nd
chainer.functions.deconvolution_nd(x, W, b=None, stride=1, pad=0, outsize=None, use_cudnn=True)
N-dimensional deconvolution function.
This is an implementation of N-dimensional deconvolution, which generalizes the two-dimensional one. In most deep learning frameworks and papers, this function is called transposed convolution. But for historical reasons (e.g. the paper Deconvolutional Networks by Zeiler) and backward compatibility, this function is called deconvolution in Chainer.
It takes three variables: the input x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(N\) is the number of spatial dimensions.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(d_1, d_2, ..., d_N\) are the size of each axis of the input’s spatial dimensions, respectively.
 \(k_1, k_2, ..., k_N\) are the size of each axis of the filters, respectively.
 \(p_1, p_2, ..., p_N\) are the size of each axis of the spatial padding size, respectively.
 \(s_1, s_2, ..., s_N\) are the stride of each axis of filter application, respectively.
If the outsize option is None, the output size \((l_1, l_2, ..., l_N)\) is determined by the following equations with the items in the above list:
\[l_n = s_n (d_n - 1) + k_n - 2 p_n \ \ (n = 1, ..., N)\]
If the outsize option is given, the output size is determined by outsize. In this case, the outsize \((l_1, l_2, ..., l_N)\) must satisfy the following equations:
\[d_n = \lfloor (l_n + 2p_n - k_n) / s_n \rfloor + 1 \ \ (n = 1, ..., N)\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable of shape \((n, c_I, d_1, d_2, ..., d_N)\).
 W (Variable or numpy.ndarray or cupy.ndarray) – Weight variable of shape \((c_I, c_O, k_1, k_2, ..., k_N)\).
 b (Variable or numpy.ndarray or cupy.ndarray) – One-dimensional bias variable with length \(c_O\) (optional).
 stride (int or tuple of ints) – Stride of filter applications \((s_1, s_2, ..., s_N)\). stride=s is equivalent to (s, s, ..., s).
 pad (int or tuple of ints) – Spatial padding width for input arrays \((p_1, p_2, ..., p_N)\). pad=p is equivalent to (p, p, ..., p).
 outsize (tuple of ints) – Expected output size of the deconvolution operation. It should be a tuple of ints \((l_1, l_2, ..., l_N)\). The default value is None, in which case the output size is estimated from the input size, stride and pad.
 use_cudnn (bool) – If True, then this function uses cuDNN if available. Note that cuDNN supports deconvolution only when there is more than one spatial dimension.
Returns: Output variable of shape \((n, c_O, l_1, l_2, ..., l_N)\).
Return type: Variable
See also
links.DeconvolutionND, deconvolution_2d()
Example
Example1: the case when outsize is not given.
>>> n = 10
>>> c_i, c_o = 3, 1
>>> d1, d2, d3 = 5, 10, 15
>>> k1, k2, k3 = 10, 10, 10
>>> p1, p2, p3 = 5, 5, 5
>>> x = np.random.uniform(0, 1, (n, c_i, d1, d2, d3)).astype('f')
>>> x.shape
(10, 3, 5, 10, 15)
>>> W = np.random.uniform(0, 1, (c_i, c_o, k1, k2, k3)).astype('f')
>>> W.shape
(3, 1, 10, 10, 10)
>>> b = np.random.uniform(0, 1, (c_o)).astype('f')
>>> b.shape
(1,)
>>> s1, s2, s3 = 2, 4, 6
>>> y = F.deconvolution_nd(x, W, b, stride=(s1, s2, s3), pad=(p1, p2, p3))
>>> y.shape
(10, 1, 8, 36, 84)
>>> l1 = s1 * (d1 - 1) + k1 - 2 * p1
>>> l2 = s2 * (d2 - 1) + k2 - 2 * p2
>>> l3 = s3 * (d3 - 1) + k3 - 2 * p3
>>> y.shape == (n, c_o, l1, l2, l3)
True
Example2: the case when outsize is given.
>>> n = 10
>>> c_i, c_o = 3, 1
>>> d1, d2, d3 = 5, 10, 15
>>> k1, k2, k3 = 10, 10, 10
>>> p1, p2, p3 = 5, 5, 5
>>> x = np.random.uniform(0, 1, (n, c_i, d1, d2, d3)).astype('f')
>>> x.shape
(10, 3, 5, 10, 15)
>>> W = np.random.uniform(0, 1, (c_i, c_o, k1, k2, k3)).astype('f')
>>> W.shape
(3, 1, 10, 10, 10)
>>> b = np.random.uniform(0, 1, (c_o)).astype('f')
>>> b.shape
(1,)
>>> s1, s2, s3 = 2, 4, 6
>>> l1, l2, l3 = 9, 38, 87
>>> d1 == int((l1 + 2 * p1 - k1) / s1) + 1
True
>>> d2 == int((l2 + 2 * p2 - k2) / s2) + 1
True
>>> d3 == int((l3 + 2 * p3 - k3) / s3) + 1
True
>>> y = F.deconvolution_nd(x, W, b, stride=(s1, s2, s3), pad=(p1, p2, p3), outsize=(l1, l2, l3))
>>> y.shape
(10, 1, 9, 38, 87)
>>> y.shape == (n, c_o, l1, l2, l3)
True
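The condition on outsize admits more than one value per axis whenever the stride is larger than one, because of the floor in the equation. The sketch below enumerates the admissible values for one axis; valid_outsizes is a hypothetical helper derived from the two equations above, not part of Chainer.

```python
def valid_outsizes(d, k, s, p):
    """All l with d == (l + 2 * p - k) // s + 1, for one spatial axis."""
    smallest = s * (d - 1) + k - 2 * p  # the default used when outsize is None
    # Adding 0 .. s - 1 does not change the floored quotient.
    return range(smallest, smallest + s)


# Same sizes as the examples: d = (5, 10, 15), k = 10, p = 5, s = (2, 4, 6).
axis1 = list(valid_outsizes(5, k=10, s=2, p=5))
axis2 = list(valid_outsizes(10, k=10, s=4, p=5))
axis3 = list(valid_outsizes(15, k=10, s=6, p=5))
```

The smallest value on each axis is the default output size from Example1, and the outsize (9, 38, 87) used in Example2 lies inside the three ranges.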
depthwise_convolution_2d
chainer.functions.depthwise_convolution_2d(x, W, b=None, stride=1, pad=0)
Two-dimensional depthwise convolution function.
This is an implementation of two-dimensional depthwise convolution. It takes two or three variables: the input image x, the filter weight W, and optionally, the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) is the number of input channels.
 \(c_M\) is the channel multiplier.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(h_O\) and \(w_O\) are the height and width of the output image, respectively.
 \(k_H\) and \(k_W\) are the height and width of the filters, respectively.
Parameters:
 x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input variable of shape \((n, c_I, h, w)\).
 W (Variable) – Weight variable of shape \((c_M, c_I, k_H, k_W)\).
 b (Variable) – Bias variable of length \(c_M * c_I\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
Returns: Output variable. Its shape is \((n, c_I * c_M, h_O, w_O)\).
Return type: Variable
Like Convolution2D, the DepthwiseConvolution2D function computes correlations between filters and patches of size \((k_H, k_W)\) in x. But unlike Convolution2D, DepthwiseConvolution2D does not sum over the input channels of the filters but concatenates the per-channel results. For that reason, the shape of the output of depthwise convolution is \((n, c_I * c_M, h_O, w_O)\), where \(c_M\) is called the channel multiplier.
\((h_O, w_O)\) is determined by the same equations as for Convolution2D.
If the bias vector is given, then it is added to all spatial locations of the output of convolution.
See: L. Sifre. Rigid-motion scattering for image classification
See also
Example
>>> x = np.random.uniform(0, 1, (2, 3, 4, 7))
>>> W = np.random.uniform(0, 1, (2, 3, 3, 3))
>>> b = np.random.uniform(0, 1, (6,))
>>> y = F.depthwise_convolution_2d(x, W, b)
>>> y.shape
(2, 6, 2, 5)
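The difference from ordinary convolution (no summation over input channels) can be sketched in NumPy as follows. This is a simplified illustration restricted to stride 1 and no padding, not Chainer's implementation; the function name is hypothetical, and the order in which the \(c_I * c_M\) output channels are laid out (input channel first) is an assumption of this sketch.

```python
import numpy as np


def depthwise_convolution_2d_naive(x, W, b=None):
    """x: (n, c_I, h, w), W: (c_M, c_I, k_H, k_W); stride 1, no padding."""
    n, c_i, h, w = x.shape
    c_m, _, k_h, k_w = W.shape
    h_o, w_o = h - k_h + 1, w - k_w + 1
    y = np.zeros((n, c_i, c_m, h_o, w_o), dtype=x.dtype)
    for i in range(h_o):
        for j in range(w_o):
            patch = x[:, :, i:i + k_h, j:j + k_w]  # (n, c_I, k_H, k_W)
            # Index c stays in the output: each input channel is correlated
            # with its own c_M filters and nothing is summed across channels.
            y[:, :, :, i, j] = np.einsum('nckl,mckl->ncm', patch, W)
    # Assumed layout: output channel index = c * c_M + m.
    y = y.reshape(n, c_i * c_m, h_o, w_o)
    if b is not None:
        y = y + b.reshape(1, -1, 1, 1)
    return y


x = np.random.uniform(0, 1, (2, 3, 4, 7))
W = np.random.uniform(0, 1, (2, 3, 3, 3))
b = np.random.uniform(0, 1, (6,))
y = depthwise_convolution_2d_naive(x, W, b)
```

With the same shapes as the example above, y has shape (2, 6, 2, 5): three input channels times a channel multiplier of two.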
dilated_convolution_2d
chainer.functions.dilated_convolution_2d(x, W, b=None, stride=1, pad=0, dilate=1, use_cudnn=True, cover_all=False)
Two-dimensional dilated convolution function.
This is an implementation of two-dimensional dilated convolution in ConvNets. It takes three variables: the input image x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(k_H\) and \(k_W\) are the height and width of the filters, respectively.
Parameters:
 x (Variable) – Input variable of shape \((n, c_I, h, w)\).
 W (Variable) – Weight variable of shape \((c_O, c_I, k_H, k_W)\).
 b (Variable) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 dilate (int or pair of ints) – Dilation factor of filter applications. dilate=d and dilate=(d, d) are equivalent.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 cover_all (bool) – If True, all spatial locations are convolved into some output pixels. It may make the output size larger.
Returns: Output variable.
Return type: Variable
The two-dimensional dilated convolution function is defined as follows. The DilatedConvolution2D function computes correlations between filters and patches of size \((k_H, k_W)\) in x. Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at intervals of the dilation factor and at positions shifted by multiples of stride from the first position -pad for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.
Let \((s_Y, s_X)\) be the stride of filter application, \((p_H, p_W)\) the spatial padding size, and \((d_Y, d_X)\) the dilation factor of filter application. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h + 2p_H - k_H - (k_H - 1) * (d_Y - 1)) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W - (k_W - 1) * (d_X - 1)) / s_X + 1.\end{split}\]
If the bias vector is given, then it is added to all spatial locations of the output of convolution.
See also
DilatedConvolution2D
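The dilated output-size equations amount to the ordinary convolution formula applied to an enlarged, "effective" filter of size \(k + (k - 1)(d - 1)\). A small sketch of that reading, with hypothetical helper names and floor division assumed:

```python
def effective_ksize(k, d):
    """Extent covered by a filter of size k with dilation factor d."""
    return k + (k - 1) * (d - 1)


def dilated_conv_outsize(size, k, s, p, d):
    """Output length along one axis for a dilated convolution."""
    return (size + 2 * p - effective_ksize(k, d)) // s + 1


plain = dilated_conv_outsize(10, k=3, s=1, p=0, d=1)    # ordinary convolution
dilated = dilated_conv_outsize(10, k=3, s=1, p=0, d=2)  # filter spans 5 pixels
same = dilated_conv_outsize(10, k=3, s=1, p=2, d=2)     # pad restores the size
```

With d=1 the formula reduces to the ordinary one; with d=2 a 3-tap filter spans 5 input pixels, so the unpadded output shrinks from 8 to 6.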
embed_id
chainer.functions.embed_id(x, W, ignore_label=None)
Efficient linear function for one-hot input.
This function implements so-called word embedding. It takes two arguments: a set of IDs (words) x as a \(B\)-dimensional integer vector, and a set of all ID (word) embeddings W as a \(V \times d\) float32 matrix. It outputs a \(B \times d\) matrix whose i-th row is the x[i]-th row of W.
This function is only differentiable on the input W.
Parameters:
Returns: Output variable.
Return type: Variable
See also
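The reason embed_id is described as an efficient linear function for one-hot input can be seen in NumPy: selecting rows of W gives the same result as multiplying one-hot vectors by W. This sketch covers only the forward lookup and leaves ignore_label aside.

```python
import numpy as np

V, d = 4, 3
W = np.arange(V * d, dtype=np.float32).reshape(V, d)  # one row per ID
x = np.array([2, 0, 2], dtype=np.int32)               # B = 3 IDs

y = W[x]                                   # row lookup, shape (B, d)
one_hot = np.eye(V, dtype=np.float32)[x]   # shape (B, V)
y_linear = one_hot @ W                     # same values via a matrix product
```

The lookup avoids building the \(B \times V\) one-hot matrix, which is what makes it cheaper than the equivalent linear function.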
linear
chainer.functions.linear(x, W, b=None)
Linear function, or affine transformation.
It accepts two or three arguments: an input mini-batch x, a weight matrix W, and optionally a bias vector b. It computes
\[Y = xW^\top + b.\]
Parameters:
 x (Variable or numpy.ndarray or cupy.ndarray) – Input variable, which is a \((s_B, s_1, s_2, ..., s_n)\)-shaped float array. Its first dimension \((s_B)\) is assumed to be the mini-batch dimension. The other dimensions are treated as one concatenated dimension whose size must be \((s_1 * ... * s_n = N)\).
 W (Variable or numpy.ndarray or cupy.ndarray) – Weight variable of shape \((M, N)\), where \((N = s_1 * ... * s_n)\).
 b (Variable or numpy.ndarray or cupy.ndarray) – Bias variable (optional) of shape \((M,)\).
Returns: Output variable. A float array with shape of \((s_B, M)\).
Return type: Variable
See also
Example
>>> x = np.random.uniform(0, 1, (3, 4)).astype('f')
>>> W = np.random.uniform(0, 1, (5, 4)).astype('f')
>>> b = np.random.uniform(0, 1, (5,)).astype('f')
>>> y = F.linear(x, W, b)
>>> y.shape
(3, 5)
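For inputs with more than two dimensions, the description of x above says the trailing dimensions are treated as one. A NumPy sketch of that reading of \(Y = xW^\top + b\) (an illustration only, with a hypothetical function name):

```python
import numpy as np


def linear_forward(x, W, b=None):
    """Flatten all but the first axis of x, then compute x W^T + b."""
    x2d = x.reshape(len(x), -1)  # (s_B, N) with N = s_1 * ... * s_n
    y = x2d @ W.T                # (s_B, M)
    if b is not None:
        y = y + b
    return y


x = np.random.uniform(0, 1, (3, 2, 2)).astype('f')  # N = 2 * 2 = 4
W = np.random.uniform(0, 1, (5, 4)).astype('f')
b = np.random.uniform(0, 1, (5,)).astype('f')
y = linear_forward(x, W, b)
```

Here a (3, 2, 2) input is handled like the (3, 4) input of the example and produces a (3, 5) output.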
n_step_bigru
chainer.functions.n_step_bigru(n_layers, dropout_ratio, hx, ws, bs, xs, train=True, use_cudnn=True)
Stacked Bidirectional Gated Recurrent Unit function.
This function calculates stacked Bidirectional GRU with sequences. This function gets an initial hidden state \(h_0\), an input sequence \(x\), weight matrices \(W\), and bias vectors \(b\). This function calculates hidden states \(h_t\) for each time \(t\) from input \(x_t\).
\[\begin{split}r^{f}_t &= \sigma(W^{f}_0 x_t + W^{f}_3 h_{t-1} + b^{f}_0 + b^{f}_3) \\
z^{f}_t &= \sigma(W^{f}_1 x_t + W^{f}_4 h_{t-1} + b^{f}_1 + b^{f}_4) \\
h^{f'}_t &= \tanh(W^{f}_2 x_t + b^{f}_2 + r^{f}_t \cdot (W^{f}_5 h_{t-1} + b^{f}_5)) \\
h^{f}_t &= (1 - z^{f}_t) \cdot h^{f'}_t + z^{f}_t \cdot h_{t-1} \\
r^{b}_t &= \sigma(W^{b}_0 x_t + W^{b}_3 h_{t-1} + b^{b}_0 + b^{b}_3) \\
z^{b}_t &= \sigma(W^{b}_1 x_t + W^{b}_4 h_{t-1} + b^{b}_1 + b^{b}_4) \\
h^{b'}_t &= \tanh(W^{b}_2 x_t + b^{b}_2 + r^{b}_t \cdot (W^{b}_5 h_{t-1} + b^{b}_5)) \\
h^{b}_t &= (1 - z^{b}_t) \cdot h^{b'}_t + z^{b}_t \cdot h_{t-1} \\
h_t &= [h^{f}_t; h^{b}_t]\end{split}\]
where \(W^{f}\) denotes the weight matrices for the forward GRU and \(W^{b}\) the weight matrices for the backward GRU.
As the function accepts a sequence, it calculates \(h_t\) for all \(t\) with one call. Six weight matrices and six bias vectors are required for each layer. So, when \(S\) layers exist, you need to prepare \(6S\) weight matrices and \(6S\) bias vectors.
If the number of layers n_layers is greater than \(1\), the input of the k-th layer is the hidden state h_t of the (k-1)-th layer. Note that all input variables except those of the first layer may have a different shape from the first layer.
Parameters:
 n_layers (int) – Number of layers.
 dropout_ratio (float) – Dropout ratio.
 hx (chainer.Variable) – Variable holding stacked hidden states. Its shape is (S, B, N) where S is the number of layers and is equal to n_layers, B is the mini-batch size, and N is the dimension of the hidden units.
 ws (list of list of chainer.Variable) – Weight matrices. ws[i] represents the weights for the i-th layer. Each ws[i] is a list containing six matrices. ws[i][j] corresponds to W_j in the equation. Only ws[0][j] where 0 <= j < 3 have shape (I, N), as they are multiplied with the input variables. All other matrices have shape (N, N).
 bs (list of list of chainer.Variable) – Bias vectors. bs[i] represents the biases for the i-th layer. Each bs[i] is a list containing six vectors. bs[i][j] corresponds to b_j in the equation. The shape of each vector is (N,) where N is the dimension of the hidden units.
 xs (list of chainer.Variable) – A list of Variable holding input values. Each element xs[t] holds the input value for time t. Its shape is (B_t, I), where B_t is the mini-batch size for time t, and I is the size of the input units. Note that this function supports variable-length sequences. When sequences have different lengths, sort the sequences in descending order by length, and transpose the sorted sequences. transpose_sequence() transposes a list of Variable holding sequences. So xs needs to satisfy xs[t].shape[0] >= xs[t + 1].shape[0].
 train (bool) – If True, this function executes dropout.
 use_cudnn (bool) – If True, this function uses cuDNN if available.
 use_bi_direction (bool) – If True, this function uses bidirectional GRU.
Returns:
 This function returns a tuple containing two elements, hy and ys.
 hy is the updated hidden states, whose shape is the same as hx.
 ys is a list of Variable. Each element ys[t] holds the hidden states of the last layer corresponding to the input xs[t]. Its shape is (B_t, N) where B_t is the mini-batch size for time t, and N is the size of the hidden units. Note that B_t is the same value as the mini-batch size of xs[t].
Return type: tuple
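The GRU equations above can be traced step by step in NumPy for a single layer and a single, unbatched sequence. This is a sketch under stated assumptions rather than Chainer's implementation: the names gru_step and bigru_layer are hypothetical, each weight is stored so that W[j] @ v is defined (shape (N, I) for j < 3 and (N, N) otherwise), and the backward direction is assumed to read the sequence in reverse order, as is usual for bidirectional RNNs.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W, b):
    """One GRU update for one direction, following the equations above."""
    r = sigmoid(W[0] @ x_t + W[3] @ h_prev + b[0] + b[3])
    z = sigmoid(W[1] @ x_t + W[4] @ h_prev + b[1] + b[4])
    h_cand = np.tanh(W[2] @ x_t + b[2] + r * (W[5] @ h_prev + b[5]))
    return (1.0 - z) * h_cand + z * h_prev


def bigru_layer(xs, h0_f, h0_b, W_f, b_f, W_b, b_b):
    """Run both directions over xs and concatenate the states per time step."""
    h, hs_f = h0_f, []
    for x_t in xs:
        h = gru_step(x_t, h, W_f, b_f)
        hs_f.append(h)
    h, hs_b = h0_b, []
    for x_t in reversed(xs):  # assumption: backward pass reads xs in reverse
        h = gru_step(x_t, h, W_b, b_b)
        hs_b.append(h)
    hs_b.reverse()
    return [np.concatenate([f, bwd]) for f, bwd in zip(hs_f, hs_b)]


# Sanity check with all-zero parameters: r = z = 0.5 and the candidate
# state is 0, so every step halves the previous hidden state.
I, N, T = 3, 2, 3
W0 = [np.zeros((N, I))] * 3 + [np.zeros((N, N))] * 3
b0 = [np.zeros(N)] * 6
xs = [np.ones(I) for _ in range(T)]
ys = bigru_layer(xs, np.ones(N), np.ones(N), W0, b0, W0, b0)
```

In this sketch each ys[t] has length 2N because the forward and backward states are concatenated, as in the last line of the equations.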
n_step_bilstm
chainer.functions.n_step_bilstm(n_layers, dropout_ratio, hx, cx, ws, bs, xs, train=True, use_cudnn=True)
Stacked Bidirectional Long Short-Term Memory function.
This function calculates stacked Bidirectional LSTM with sequences. This function gets an initial hidden state \(h_0\), an initial cell state \(c_0\), an input sequence \(x\), weight matrices \(W\), and bias vectors \(b\). This function calculates hidden states \(h_t\) and \(c_t\) for each time \(t\) from input \(x_t\).
\[\begin{split}i^{f}_t &= \sigma(W^{f}_0 x_t + W^{f}_4 h_{t-1} + b^{f}_0 + b^{f}_4), \\
f^{f}_t &= \sigma(W^{f}_1 x_t + W^{f}_5 h_{t-1} + b^{f}_1 + b^{f}_5), \\
o^{f}_t &= \sigma(W^{f}_2 x_t + W^{f}_6 h_{t-1} + b^{f}_2 + b^{f}_6), \\
a^{f}_t &= \tanh(W^{f}_3 x_t + W^{f}_7 h_{t-1} + b^{f}_3 + b^{f}_7), \\
c^{f}_t &= f^{f}_t \cdot c^{f}_{t-1} + i^{f}_t \cdot a^{f}_t, \\
h^{f}_t &= o^{f}_t \cdot \tanh(c^{f}_t), \\
i^{b}_t &= \sigma(W^{b}_0 x_t + W^{b}_4 h_{t-1} + b^{b}_0 + b^{b}_4), \\
f^{b}_t &= \sigma(W^{b}_1 x_t + W^{b}_5 h_{t-1} + b^{b}_1 + b^{b}_5), \\
o^{b}_t &= \sigma(W^{b}_2 x_t + W^{b}_6 h_{t-1} + b^{b}_2 + b^{b}_6), \\
a^{b}_t &= \tanh(W^{b}_3 x_t + W^{b}_7 h_{t-1} + b^{b}_3 + b^{b}_7), \\
c^{b}_t &= f^{b}_t \cdot c^{b}_{t-1} + i^{b}_t \cdot a^{b}_t, \\
h^{b}_t &= o^{b}_t \cdot \tanh(c^{b}_t), \\
h_t &= [h^{f}_t; h^{b}_t]\end{split}\]
where \(W^{f}\) denotes the weight matrices for the forward LSTM and \(W^{b}\) the weight matrices for the backward LSTM.
As the function accepts a sequence, it calculates \(h_t\) for all \(t\) with one call. Eight weight matrices and eight bias vectors are required for each layer. So, when \(S\) layers exist, you need to prepare \(8S\) weight matrices and \(8S\) bias vectors.
If the number of layers n_layers is greater than \(1\), the input of the k-th layer is the hidden state h_t of the (k-1)-th layer. Note that all input variables except those of the first layer may have a different shape from the first layer.
Parameters:
 n_layers (int) – Number of layers.
 dropout_ratio (float) – Dropout ratio.
 hx (chainer.Variable) – Variable holding stacked hidden states. Its shape is (S, B, N) where S is the number of layers and is equal to n_layers, B is the mini-batch size, and N is the dimension of the hidden units.
 cx (chainer.Variable) – Variable holding stacked cell states. It has the same shape as hx.
 ws (list of list of chainer.Variable) – Weight matrices. ws[i] represents the weights for the i-th layer. Each ws[i] is a list containing eight matrices. ws[i][j] corresponds to W_j in the equation. Only ws[0][j] where 0 <= j < 4 have shape (I, N), as they are multiplied with the input variables. All other matrices have shape (N, N).
 bs (list of list of chainer.Variable) – Bias vectors. bs[i] represents the biases for the i-th layer. Each bs[i] is a list containing eight vectors. bs[i][j] corresponds to b_j in the equation. The shape of each vector is (N,) where N is the dimension of the hidden units.
 xs (list of chainer.Variable) – A list of Variable holding input values. Each element xs[t] holds the input value for time t. Its shape is (B_t, I), where B_t is the mini-batch size for time t, and I is the size of the input units. Note that this function supports variable-length sequences. When sequences have different lengths, sort the sequences in descending order by length, and transpose the sorted sequences. transpose_sequence() transposes a list of Variable holding sequences. So xs needs to satisfy xs[t].shape[0] >= xs[t + 1].shape[0].
 train (bool) – If True, this function executes dropout.
 use_cudnn (bool) – If True, this function uses cuDNN if available.
Returns:
 This function returns a tuple containing three elements, hy, cy and ys.
 hy is the updated hidden states, whose shape is the same as hx.
 cy is the updated cell states, whose shape is the same as cx.
 ys is a list of Variable. Each element ys[t] holds the hidden states of the last layer corresponding to the input xs[t]. Its shape is (B_t, N) where B_t is the mini-batch size for time t, and N is the size of the hidden units. Note that B_t is the same value as the mini-batch size of xs[t].
Return type: tuple
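A single LSTM update for one direction, written out in NumPy from the equations above. As with any such sketch, this is not Chainer's implementation: lstm_step is a hypothetical name, inputs are unbatched vectors, and each weight is assumed to be stored so that W[j] @ v is defined (shape (N, I) for j < 4 and (N, N) otherwise).

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update: gates i, f, o, candidate a, then new c and h."""
    i = sigmoid(W[0] @ x_t + W[4] @ h_prev + b[0] + b[4])
    f = sigmoid(W[1] @ x_t + W[5] @ h_prev + b[1] + b[5])
    o = sigmoid(W[2] @ x_t + W[6] @ h_prev + b[2] + b[6])
    a = np.tanh(W[3] @ x_t + W[7] @ h_prev + b[3] + b[7])
    c = f * c_prev + i * a
    h = o * np.tanh(c)
    return h, c


# Sanity check with all-zero parameters: every gate is 0.5 and a is 0,
# so the cell state is halved and h = 0.5 * tanh(c).
I, N = 3, 2
W0 = [np.zeros((N, I))] * 4 + [np.zeros((N, N))] * 4
b0 = [np.zeros(N)] * 8
h1, c1 = lstm_step(np.ones(I), np.zeros(N), np.ones(N), W0, b0)
```

A bidirectional layer applies this step with \(W^{f}, b^{f}\) in one direction and \(W^{b}, b^{b}\) in the other, then concatenates the two hidden states per time step.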
n_step_birnn
chainer.functions.n_step_birnn(n_layers, dropout_ratio, hx, ws, bs, xs, train=True, use_cudnn=True, activation='tanh')