Chainer – A flexible framework of neural networks

Chainer is a powerful, flexible and intuitive deep learning framework.

  • Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.

  • Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.

  • Forward computation can include any control flow statements of Python without losing the ability of backpropagation. This makes code intuitive and easy to debug.

Note

As announced, Chainer is under the maintenance phase and further development will be limited to bug-fixes and maintenance only.

Chainer at a Glance

Welcome to Chainer!

Chainer is a rapidly growing neural network platform. The strengths of Chainer are:

  • Python-based – Chainer is developed in Python, allowing for inspection and customization of all code in Python and understandable Python messages at run time

  • Define-by-Run – neural network definitions are constructed on-the-fly at run time, allowing for dynamic network changes

  • NumPy-based syntax for working with arrays, thanks to the CuPy implementation

  • Fully customizable – since Chainer is pure Python, all classes and methods can be adapted to allow for the latest cutting-edge or specialized approaches

  • Broad and deep support – Chainer is actively used for most of the current approaches for neural nets (CNN, RNN, RL, etc.), aggressively adds new approaches as they’re developed, and provides support for many kinds of hardware as well as parallelization for multiple GPUs

Mushrooms – tasty or deadly?

Let’s take a look at a basic program of Chainer to see how it works. For a dataset, we’ll work with Kaggle’s edible vs. poisonous mushroom dataset, which has over 8,000 examples of mushrooms, each described by 22 features such as odor, cap color, and habitat, and labelled as edible or poisonous, in a mushrooms.csv file.

How will Chainer learn which mushrooms are edible and which mushrooms will kill you? Let’s see!

The code below is from the glance example in the examples/glance directory.

Code Breakdown

Initialization

Let’s start the program. Here are the typical imports for a Chainer program. Links in chainer.links contain trainable parameters; functions in chainer.functions do not.

import chainer as ch
from chainer import datasets
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions

import numpy as np

We’ll use Matplotlib for the graphs to show training progress.

import matplotlib
matplotlib.use('Agg')

Trainer Structure

A trainer is used to set up our neural network and data for training. The components of the trainer are generally hierarchical, and are organized as follows:

_images/trainer1.png

Each of the components is fed information from the components within it. Setting up the trainer starts at the inner components, and moves outward, with the exception of extensions, which are added after the trainer is defined.

Dataset

_images/trainer-dataset.png

Our first step is to format the dataset. From the raw mushrooms.csv, we format the data into a Chainer TupleDataset.

mushroomsfile = 'mushrooms.csv'
data_array = np.genfromtxt(
    mushroomsfile, delimiter=',', dtype=str, skip_header=1)
for col in range(data_array.shape[1]):
    data_array[:, col] = np.unique(data_array[:, col], return_inverse=True)[1]

X = data_array[:, 1:].astype(np.float32)
Y = data_array[:, 0].astype(np.int32)[:, None]
train, test = datasets.split_dataset_random(
    datasets.TupleDataset(X, Y), int(data_array.shape[0] * .7))

Iterator

_images/trainer-iterator.png

Configure iterators to step through batches of the data, one for training and one for test-time validation. In this case, we’ll use a batch size of 100. For the training iterator, repeating and shuffling are implicitly enabled, while they are explicitly disabled for the testing iterator.

train_iter = ch.iterators.SerialIterator(train, 100)
test_iter = ch.iterators.SerialIterator(
    test, 100, repeat=False, shuffle=False)

Model

_images/trainer-model.png

Next, we need to define the neural network for inclusion in our model. For our mushrooms, we’ll chain together two fully-connected, Linear, hidden layers between the input and output layers.

As an activation function, we’ll use standard Rectified Linear Units (relu()).

Using Sequential allows us to define the neural network model in a compact format.

# Network definition
def MLP(n_units, n_out):
    layer = ch.Sequential(L.Linear(n_units), F.relu)
    model = layer.repeat(2)
    model.append(L.Linear(n_out))

    return model
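
For reference, here is a hedged sketch (not part of the example script) of what MLP(44, 1) expands to: repeat(2) makes two copies of the (Linear, relu) pair, so the model is structurally equivalent to spelling the layers out.

# Structurally equivalent to MLP(44, 1): two hidden Linear+relu blocks
# followed by the output layer.
expanded = ch.Sequential(
    L.Linear(44), F.relu,
    L.Linear(44), F.relu,
    L.Linear(1),
)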

Since mushrooms are either edible or poisonous (no information on psychedelic effects!) in the dataset, we’ll use a Link Classifier for the output, with 44 units (double the features of the data) in the hidden layers and a single edible/poisonous category for classification.

model = L.Classifier(
    MLP(44, 1), lossfun=F.sigmoid_cross_entropy, accfun=F.binary_accuracy)

Note that in the two code snippets above we have not specified the size of the input layer. Once we start feeding the neural network with samples, Chainer will recognize the dimensionality of the input automatically and initialize the matrix for each layer with the appropriate shape. In the example above, that is 44×22 for the first hidden layer, 44×44 for the second hidden layer, and 1×44 for the output layer.
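
As an illustration (a hedged sketch, not part of the example script), feeding one dummy sample with 22 features is enough to trigger that lazy initialization:

# After this first forward pass, the Linear layers hold concrete
# 44x22, 44x44 and 1x44 weight matrices.
sample = np.zeros((1, 22), dtype=np.float32)
_ = model.predictor(sample)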

Optimizer

_images/trainer-optimizer.png

Pick an optimizer, and set up the model to use it.

# Setup an optimizer
optimizer = ch.optimizers.SGD().setup(model)

Updater

_images/trainer-updater.png

Now that we have the training iterator and optimizer set up, we link them both together into the updater. The updater uses the minibatches from the iterator, does the forward and backward processing of the model, and updates the parameters of the model according to the optimizer. Setting device=-1 sets the device to the CPU. To use a GPU, set device equal to the number of the GPU, usually device=0.

# Create the updater, using the optimizer
updater = training.StandardUpdater(train_iter, optimizer, device=-1)
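
As a hedged variant (it assumes CuPy and a CUDA-capable GPU are available), the same updater can be set up to run on the first GPU:

# Move the parameters to GPU 0 and let the updater run there.
if ch.backends.cuda.available:
    model.to_gpu(0)
    updater = training.StandardUpdater(train_iter, optimizer, device=0)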

Finally we create a Trainer object. The trainer processes minibatches using the updater defined above until a certain stop condition is met and allows the use of extensions during the training. We set it to run for 50 epochs and store all files created by the extensions (see below) in the result directory.

# Set up a trainer
trainer = training.Trainer(updater, (50, 'epoch'), out='result')

Extensions

_images/trainer-extensions.png

Extensions can be used to execute code at certain events during the training, such as every epoch or every 1000 iterations. This mechanism is used in Chainer to evaluate models during training, print progress messages, or dump intermediate model files.

First, use the testing iterator defined above to add an Evaluator extension to the trainer, which provides test scores. If using a GPU instead of the CPU, set device to the ID of the GPU, usually 0.

# Evaluate the model with the test dataset for each epoch
trainer.extend(extensions.Evaluator(test_iter, model, device=-1))

Save a computational graph from the loss variable at the first iteration. Here main refers to the target link of the main optimizer. The graph is saved in Graphviz’s dot format. The output location (directory) for the graph is set by the out argument of the trainer.

# Dump a computational graph from 'loss' variable at the first iteration
# The "main" refers to the target link of the "main" optimizer.
trainer.extend(extensions.DumpGraph('main/loss'))

Take a snapshot of the trainer object every 20 epochs.

trainer.extend(extensions.snapshot(), trigger=(20, 'epoch'))

Write a log of evaluation statistics for each epoch.

# Write a log of evaluation statistics for each epoch
trainer.extend(extensions.LogReport())

Save two plot images to the result directory.

# Save two plot images to the result dir
trainer.extend(
    extensions.PlotReport(['main/loss', 'validation/main/loss'],
                          'epoch', file_name='loss.png'))
trainer.extend(
    extensions.PlotReport(
        ['main/accuracy', 'validation/main/accuracy'],
        'epoch', file_name='accuracy.png'))

Print selected entries of the log to standard output.

# Print selected entries of the log to stdout
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
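
Besides the built-in extensions used above, any callable that takes the trainer can be registered as an extension. The following hedged sketch (the function name is ours, not part of the example) prints a message once per epoch:

# A minimal custom extension: print the epoch counter after every epoch.
@training.make_extension(trigger=(1, 'epoch'))
def print_epoch(trainer):
    print('finished epoch', trainer.updater.epoch)

trainer.extend(print_epoch)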

Main Loop

Finally, with the trainer and all the extensions set up, we can add the line that actually starts the main loop:

# Run the training
trainer.run()

Inference

Once the training is complete, only the model is necessary to make predictions. Let’s take a random line from the test data set and see if the inference is correct:

x, t = test[np.random.randint(len(test))]

predict = model.predictor(x[None]).array
predict = predict[0][0]

if predict >= 0:
    print('Predicted Poisonous, Actual ' + ['Edible', 'Poisonous'][t[0]])
else:
    print('Predicted Edible, Actual ' + ['Edible', 'Poisonous'][t[0]])

Output

Output for this instance will look like:

epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.550724    0.502818              0.733509       0.752821                  0.215426
2           0.454206    0.446234              0.805439       0.786926                  0.902108
3           0.402783    0.395893              0.838421       0.835979                  1.50414
4           0.362979    0.359988              0.862807       0.852632                  2.24171
5           0.32713     0.329881              0.88           0.874232                  2.83247
6           0.303469    0.31104               0.892456       0.887284                  3.45173
7           0.284755    0.288553              0.901754       0.903284                  3.9877
8           0.26801     0.272033              0.9125         0.907137                  4.54794
9           0.25669     0.261355              0.920175       0.917937                  5.21672
10          0.241789    0.251821              0.927193       0.917937                  5.79541
11          0.232291    0.238022              0.93           0.925389                  6.3055
12          0.222805    0.22895               0.934035       0.923389                  6.87083
13          0.21276     0.219291              0.93614        0.928189                  7.54113
14          0.204822    0.220736              0.938596       0.922589                  8.12495
15          0.197671    0.207017              0.938393       0.936042                  8.69219
16          0.190285    0.199129              0.941053       0.934842                  9.24302
17          0.182827    0.193303              0.944386       0.942695                  9.80991
18          0.176776    0.194284              0.94614        0.934042                  10.3603
19          0.16964     0.177684              0.945789       0.945242                  10.8531
20          0.164831    0.171988              0.949825       0.947347                  11.3876
21          0.158394    0.167459              0.952982       0.949747                  11.9866
22          0.153353    0.161774              0.956964       0.949347                  12.6433
23          0.148209    0.156644              0.957368       0.951747                  13.3825
24          0.144814    0.15322               0.957018       0.955495                  13.962
25          0.138782    0.148277              0.958947       0.954147                  14.6
26          0.135333    0.145225              0.961228       0.956695                  15.2284
27          0.129593    0.141141              0.964561       0.958295                  15.7413
28          0.128265    0.136866              0.962632       0.960547                  16.2711
29          0.123848    0.133444              0.966071       0.961347                  16.7772
30          0.119687    0.129579              0.967193       0.964547                  17.3311
31          0.115857    0.126606              0.968596       0.966547                  17.8252
32          0.113911    0.124272              0.968772       0.962547                  18.3121
33          0.111502    0.122548              0.968596       0.965095                  18.8973
34          0.107427    0.116724              0.970526       0.969747                  19.4723
35          0.104536    0.114517              0.970877       0.969095                  20.0804
36          0.099408    0.112128              0.971786       0.970547                  20.6509
37          0.0972982   0.107618              0.973158       0.970947                  21.2467
38          0.0927064   0.104918              0.973158       0.969347                  21.7978
39          0.0904702   0.101141              0.973333       0.969747                  22.3328
40          0.0860733   0.0984015             0.975263       0.971747                  22.8447
41          0.0829282   0.0942095             0.977544       0.974947                  23.5113
42          0.082219    0.0947418             0.975965       0.969347                  24.0427
43          0.0773362   0.0906804             0.977857       0.977747                  24.5252
44          0.0751769   0.0886449             0.977895       0.972147                  25.1722
45          0.072056    0.0916797             0.978246       0.977495                  26.0778
46          0.0708111   0.0811359             0.98           0.979347                  26.6648
47          0.0671919   0.0783265             0.982456       0.978947                  27.2929
48          0.0658817   0.0772342             0.981754       0.977747                  27.8119
49          0.0634615   0.0762576             0.983333       0.974947                  28.3876
50          0.0622394   0.0710278             0.982321       0.981747                  28.9067
Predicted Edible, Actual Edible

Our prediction was correct. Success!

The loss function:

_images/loss.png

And the accuracy:

_images/accuracy.png

Concepts Walkthrough

Define-by-Run

As mentioned on the top page, Chainer is a flexible framework for neural networks. One major goal is flexibility, so it must enable us to write complex architectures simply and intuitively.

Most existing deep learning frameworks are based on the “Define-and-Run” scheme. That is, first a network is defined and fixed, and then the user periodically feeds it with mini-batches of training data. Since the network is statically defined before any forward/backward computation, all the logic must be embedded into the network architecture as data. Consequently, defining a network architecture in such systems (e.g. Caffe) follows a declarative approach. Note that one can still produce such a static network definition using imperative languages (e.g. torch.nn, Theano-based frameworks, and TensorFlow).

In contrast, Chainer adopts a “Define-by-Run” scheme, i.e., the network is defined dynamically via the actual forward computation. More precisely, Chainer stores the history of computation instead of programming logic. This strategy enables us to fully leverage the power of programming logic in Python. For example, Chainer does not need any magic to introduce conditionals and loops into the network definitions. The Define-by-Run scheme is the core concept of Chainer. We will show in this tutorial how to define networks dynamically.
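
As a small illustration (a self-contained sketch, not taken from the Chainer examples), ordinary Python branching can decide what the network computes, and backprop follows whatever actually ran:

import numpy as np
from chainer import Variable

x = Variable(np.array([3.0], dtype=np.float32))
if x.array[0] > 0:       # plain Python control flow
    y = x * x
else:
    y = -x
y.backward()
print(x.grad)            # [6.], since y = x**2 was taken on this branch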

This strategy also makes it easy to write multi-GPU parallelization, since logic comes closer to network manipulation. We will review such amenities in later sections of this tutorial.

Variables and Derivatives

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

As described previously, Chainer uses the “Define-by-Run” scheme, so forward computation itself defines the network. In order to start forward computation, we have to set the input array to a chainer.Variable object. Here we start with a simple ndarray with only one element:

>>> x_data = np.array([5], dtype=np.float32)
>>> x = Variable(x_data)

A Variable object supports basic arithmetic operators. In order to compute \(y = x^2 - 2x + 1\), just write:

>>> y = x**2 - 2 * x + 1

The resulting y is also a Variable object, whose value can be extracted by accessing the array attribute:

>>> y.array
array([16.], dtype=float32)

Note

Variable has two attributes to represent the underlying array: array and data. There is no difference between the two; both refer to exactly the same object. However, it is not recommended that you use .data because it might be confused with the numpy.ndarray.data attribute.

What y holds is not only the result value. It also holds the history of computation (or computational graph), which enables us to compute its derivative. This is done by calling its backward() method:

>>> y.backward()

This runs error backpropagation (a.k.a. backprop or reverse-mode automatic differentiation). Then, the gradient is computed and stored in the grad attribute of the input variable x:

>>> x.grad
array([8.], dtype=float32)

Also we can compute gradients of intermediate variables. Note that Chainer, by default, releases the gradient arrays of intermediate variables for memory efficiency. In order to preserve gradient information, pass the retain_grad argument to the backward method:

>>> z = 2*x
>>> y = x**2 - z + 1
>>> y.backward(retain_grad=True)
>>> z.grad
array([-1.], dtype=float32)

All these computations can be generalized to a multi-element array input. While single-element arrays are automatically initialized to [1], to start backward computation from a variable holding a multi-element array, we must set the initial error manually. This is done simply by setting the grad attribute of the output variable:

>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad
array([[ 0.,  2.,  4.],
       [ 6.,  8., 10.]], dtype=float32)

Note

Many functions taking Variable object(s) are defined in the chainer.functions module. You can combine them to realize complicated functions with automatic backward computation.
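
For instance (a small sketch using the symbols imported above), summing with F.sum() yields a scalar, so backward() needs no manually set initial gradient:

x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
loss = F.sum(x**2 - 2*x + 1)
loss.backward()
print(x.grad)   # same gradients as in the element-wise example above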

Note

Instead of using backward(), you can also calculate gradients of any variables in a computational graph w.r.t. any other variables in the graph using the chainer.grad() function.
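
A hedged sketch of that alternative:

x = Variable(np.array([3.0], dtype=np.float32))
y = x**2
gx, = chainer.grad([y], [x])   # gradient of y w.r.t. x, without y.backward()
print(gx.array)                # [6.]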

Higher-Order Derivatives

Variable also supports higher-order derivatives (a.k.a. double backpropagation).

Let’s see a simple example. First calculate the first-order derivative. Note that enable_double_backprop=True is passed to y.backward().

>>> x = chainer.Variable(np.array([[0, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x ** 3
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward(enable_double_backprop=True)
>>> x.grad_var
variable([[  0.,  12.,  27.],
          [ 48.,  75., 108.]])
>>> assert x.grad_var.array is x.grad
>>> assert (x.grad == (3 * x**2).array).all()

chainer.Variable.grad_var is a Variable for chainer.Variable.grad (which is an ndarray). By passing enable_double_backprop=True to backward(), a computational graph for the backward calculation is recorded. So, you can start backpropagation from x.grad_var to calculate the second-order derivative.

>>> gx = x.grad_var
>>> x.cleargrad()
>>> gx.grad = np.ones((2, 3), dtype=np.float32)
>>> gx.backward()
>>> x.grad
array([[ 0., 12., 18.],
       [24., 30., 36.]], dtype=float32)
>>> assert (x.grad == (6 * x).array).all()

Define your own function

In this section, you will learn about the following things:

  • How to define a function on variables

  • Useful tools to write a function using a GPU

  • How to test the function definition

After reading this section, you will be able to:

  • Write your own functions

  • Define simple kernels in the function definition

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

Differentiable Functions

Chainer provides a collection of functions in the chainer.functions module. It covers typical use cases in deep learning, so many existing works can be implemented with them. On the other hand, deep learning is evolving rapidly and we cannot cover all possible functions to define unseen architectures. So it is important to learn how to define your own functions.

New-Style vs. Old-Style Functions

In Chainer, you can define a function in two ways: new-style and old-style.

  • New-style functions inherit from chainer.FunctionNode class (introduced in Chainer v3). Forward computation can be implemented using NumPy/CuPy. Backward computation needs to be implemented by using (possibly a composition of) other new-style functions.

  • Old-style functions inherit from chainer.Function class. Forward and backward computation can be implemented using NumPy/CuPy.

The primary advantage of using new-style functions is that they support computation of higher-order gradients (a.k.a. higher-order derivatives or double backpropagation). Higher-order gradients are used in some models, e.g., recently proposed GAN architectures. New-style functions are also better in terms of backward performance, as the interface allows an implementation to skip the computation of unneeded input gradients.

Currently, most built-in functions are implemented in the new style (with a few exceptions listed in #4449). Basically, we recommend you use the new style when implementing new functions. However, you can still continue to use existing old-style functions for the foreseeable future.

In the following sections, we describe the steps to implement user-defined functions in the new style. You can also refer to Implementing Old-Style Functions and Migrating From Old-Style Functions To New-Style Functions if you are interested.

Implementing New-Style Functions

First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\). While it is possible to implement this equation using a combination of the * and + functions, defining it as a single function may reduce memory consumption, so it is not only a toy example. Here we call this function MulAdd.

Let’s start with defining MulAdd working on the CPU. New-style functions must inherit the chainer.FunctionNode class. The skeleton of a function looks like:

class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        # do forward computation on CPU
        return some_tuple

    def backward(self, target_input_indexes, grad_outputs):
        # do backward computation
        return some_tuple

We must implement forward_cpu() and backward() methods.

  • In the forward_cpu() method, inputs is a tuple of array(s). You need to return a tuple of array(s), which is the result of the forward computation.

  • In the backward() method, grad_outputs is a tuple of Variable(s) which are gradients with regard to each output, i.e., the length of the grad_outputs tuple equals the number of outputs returned by forward_cpu(). You need to return a tuple of Variable(s) which are gradients with regard to each input, i.e., the length of the returned tuple equals the number of inputs to forward_cpu(). You can optionally use target_input_indexes (a tuple of indices required to compute gradients) to omit computing unnecessary gradients. We will show you the usage of target_input_indexes later.

Warning

Be careful to return a tuple even if you have just one array or Variable to return.

Note

Unlike old-style functions, inputs and outputs of backward method in new-style functions are Variables. In other words, the backward method is device agnostic; there are no backward_cpu or backward_gpu in FunctionNode.

MulAdd is simple and can be implemented as follows:

class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        # Unpack input arrays (``numpy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that they can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        # Unpack inputs retained in the forward process (``Variable``).
        x, y = self.get_retained_inputs()

        # Get gradients w.r.t. the output (Variable).
        gw, = grad_outputs

        # Compute gradients w.r.t the inputs.
        gx = y * gw
        gy = x * gw
        gz = gw

        # Return the result as a tuple.
        return gx, gy, gz

As per the warning above, the forward_cpu() method returns a tuple with a single element. Note that all arrays appearing in forward_cpu are numpy.ndarray. The forward function is straightforward; it unpacks the input tuple, computes the output, and packs it into a tuple. The backward function is a bit more complicated. Recall the rule of differentiation of multiplication. This example just implements the rule. Looking at the return values, the function just packs the gradient of each input in the same order and returns them.

By just defining the core computation of forward and backward, FunctionNode class provides a chaining logic on it (i.e., storing the history of computation, etc.).

Note

Assuming we implement a (forward) function \(y=f(x)\) which takes as input the vector \(x \in \mathbb{R}^n\) and produces as output a vector \(y \in \mathbb{R}^m\). Then the backward method has to compute

\[\lambda_i = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i} \, \gamma_j \,\, \text{for}\, i = 1 \dots n\]

where \(\gamma\) is the grad_outputs. Note that the resulting vector \(\lambda\) must have the same shape as the arguments of the forward method.

Now let’s define the corresponding GPU method. You can easily predict that the method we have to write is named forward_gpu():

class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        ...

    def forward_gpu(self, inputs):
        # Unpack input arrays (``cupy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that they can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        ...

In the forward_gpu method, arrays are of type cupy.ndarray. We use arithmetic operators defined for this class. These operators implement the basic elementwise arithmetic.

You may find that the definition of forward_gpu is exactly the same as that of forward_cpu. In that case, we can reduce them to forward().

class MulAdd(FunctionNode):
    def forward(self, inputs):
        # Unpack input arrays (``numpy.ndarray`` or ``cupy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that they can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

Since the cupy.ndarray class implements many methods of numpy.ndarray, we can write these unified methods in most cases.

The MulAdd function can be used as follows:

x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w, = MulAdd().apply((x, y, z))

It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:

def muladd(x, y, z):
    return MulAdd().apply((x, y, z))

w = muladd(x, y, z)

All functions under chainer.functions are implemented as wrapper functions like this.

Unified forward/backward methods with NumPy/CuPy functions

CuPy implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Consider that we want to write a backprop-able function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:

from chainer.backends import cuda

class ExpAdd(FunctionNode):
    def forward_cpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y = inputs
        z = np.exp(x) + np.exp(y)
        return z,

    def forward_gpu(self, inputs):
        self.retain_inputs((0, 1))
        cupy = cuda.cupy
        x, y = inputs
        z = cupy.exp(x) + cupy.exp(y)
        return z,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gz, = grad_outputs

        gx = gz * F.exp(x)
        gy = gz * F.exp(y)
        return gx, gy

def expadd(x, y):
    z, = ExpAdd().apply((x, y))
    return z

Note

Here we used chainer.backends.cuda.cupy instead of directly accessing cupy. This is because the cupy module cannot be imported if CUDA is not installed. In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy module. Note that the chainer.backends.cuda module can be imported even if CUDA is not installed. Of course, the module in such an environment is almost useless, but if the interpreter does not run through the code accessing CUDA-dedicated functions, the code is still valid.

The CPU and GPU implementations are almost the same, except that numpy is replaced by cupy in forward_gpu. We can unify these functions using the chainer.backend.get_array_module() function. This function accepts an arbitrary number of arrays and returns the appropriate module for them. See the following code:

class ExpAdd(FunctionNode):
    def forward(self, inputs):
        self.retain_inputs((0, 1))
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        z = xp.exp(x) + xp.exp(y)
        return z,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gz, = grad_outputs

        gx = gz * F.exp(x)
        gy = gz * F.exp(y)
        return gx, gy

def expadd(x, y):
    z, = ExpAdd().apply((x, y))
    return z

Note that this code works correctly even if CUDA is not installed in the environment. If CUDA is not found, the get_array_module() function always returns numpy. We often use the name xp for this module, analogous to the abbreviation np for NumPy and cp for CuPy.

Write an Elementwise Kernel Function

Let’s turn back to the MulAdd example.

The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores. However, it invokes two kernels during each of the forward (w = x * y + z) and backward (gx = y * gw and gy = x * gw) computations. This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes much bandwidth. We can reduce the number of invocations by defining our own kernel. It also reduces memory consumption.

CuPy provides a useful tool to define elementwise kernels, the cupy.ElementwiseKernel class, and Chainer wraps it by chainer.backends.cuda.elementwise() function. Our MulAdd implementation can be improved as follows:

class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y, z = inputs
        w = x * y + z
        return w,

    def forward_gpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y, z = inputs
        w = cuda.elementwise(
            'float32 x, float32 y, float32 z',
            'float32 w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gw, = grad_outputs
        return MulAddGrad().apply((x, y, gw))

class MulAddGrad(FunctionNode):
    def forward_cpu(self, inputs):
        x, y, gw = inputs
        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

    def forward_gpu(self, inputs):
        x, y, gw = inputs
        gx, gy = cuda.elementwise(
            'float32 x, float32 y, float32 gw',
            'float32 gx, float32 gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz

    def backward(self, target_input_indexes, grad_outputs):
        # You can leave this unimplemented unless you need to compute
        # higher-order derivative using this function.
        raise NotImplementedError()

The chainer.backends.cuda.elementwise() function accepts the essential implementation of the kernel function and returns a kernel invocation function (actually, it returns an ElementwiseKernel object, which is callable). In typical usage, we pass four arguments to this function as follows:

  1. Input argument list. This is a comma-separated string each entry of which consists of a type specification and an argument name.

  2. Output argument list in the same format as the input argument list.

  3. Body of the parallel loop. We can use the input/output argument names to refer to an element of the corresponding arrays.

  4. Name of the kernel function, which is shown in debuggers and profilers.

The above code is not compiled on every forward/backward computation, thanks to two caching mechanisms provided by chainer.backends.cuda.elementwise().

The first one is binary caching: chainer.backends.cuda.elementwise() function caches the compiled binary in the $(HOME)/.cupy/kernel_cache directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value. This caching mechanism is actually implemented in CuPy.

The second one is upload caching: Given a compiled binary code, we have to upload it to the current GPU in order to execute it. chainer.backends.cuda.elementwise() function memoizes the arguments and the current device, and if it is called with the same arguments for the same device, it reuses the previously uploaded kernel code.

The above MulAdd code only works for float32 arrays. ElementwiseKernel also supports type-variadic kernel definitions. In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:

class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'T x, T y, T z',
            'T w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'T x, T y, T gw',
            'T gx, T gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz

The type placeholder T indicates an arbitrary data type that CuPy supports.

There are more functionalities on user-defined kernels in CuPy. See the CuPy documentation on user-defined kernels for more details.

Advanced Topics

Write a function with training/test mode

We sometimes want to make a function behave differently in training and test modes. The training/test mode in Chainer is configured by chainer.config. This is a thread-local configuration object, and users can assign True or False to its train attribute. You can refer to Configuring Chainer to see how to configure this flag as well as other configuration items.

Here, we just show how to use this flag to make a function support training/test mode. You will need to check the value of the boolean flag chainer.config.train and branch appropriately.

For example, consider the following simple dropout function:

def dropout(x):
    xp = backend.get_array_module(x.array)
    mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
    return x * mask

This function applies dropout to each element and doubles the surviving elements to preserve the scale. The above implementation applies dropout even in test mode, but that is not the desired behavior. We can fix it as follows:

def dropout(x):
    if not chainer.config.train:
        return x

    xp = backend.get_array_module(x.array)
    mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
    return x * mask

The function now supports test mode. Note that you usually do not have to implement your own dropout function because dropout() is officially provided.
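
The train flag is usually flipped with the chainer.using_config context manager; a minimal sketch using the dropout function defined above:

x = Variable(np.random.rand(3, 2).astype(np.float32))
with chainer.using_config('train', False):
    y = dropout(x)   # test mode: the input is returned unchanged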

Testing Functions

In order to isolate the cause of learning failures from implementation bugs, it is important to test function implementations. Chainer provides simple utilities to help with writing unit tests. They are defined in the gradient_check module.

The most important test utility is the numerical_grad() function. This function computes the numerical gradient of a given function using finite differences. It can be used as follows:

x  = np.random.randn(4, 3).astype(np.float32)
gy = np.ones((4, 3), dtype=np.float32)
f  = lambda: (x * x,)
gx = gradient_check.numerical_grad(f, (x,), (gy,))

f is a closure that returns a tuple of array(s) computed from input arrays. The second and third arguments of numerical_grad() are tuples of input arrays and output gradient arrays, respectively. The code above computes the numerical gradients of sum(f(x)), where sum indicates the summation over all elements. The summation can be weighted by changing gy. The numerical_grad() function also accepts an additional eps argument, which indicates the quantization width of finite differences.
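
For example (a hedged variant of the call above), an explicit quantization width can be passed:

gx = gradient_check.numerical_grad(f, (x,), (gy,), eps=1e-2)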

Note

numerical_grad() function accepts both CPU and GPU arrays. Note that we cannot mix CPU and GPU arrays.

Another utility is chainer.testing.assert_allclose() function. This is similar to numpy.testing.assert_allclose() function. The difference is that Chainer’s version accepts CPU and GPU arrays as inputs. We can mix them in one invocation of chainer.testing.assert_allclose(). The default values of optional arguments are also different.

Here is a typical usage of gradient checking utilities. This is a test example of functions.relu() function:

import unittest

from chainer import testing

class TestReLU(unittest.TestCase):
    def test_backward_cpu(self):
        x = Variable(np.random.randn(3, 2).astype(np.float32))
        y = F.relu(x)
        y.grad = np.random.randn(3, 2).astype(np.float32)
        y.backward(retain_grad=True)

        def f():
            return F.relu(x).array,

        gx, = gradient_check.numerical_grad(f, (x.array,), (y.grad,))
        testing.assert_allclose(gx, x.grad)

The first four lines of the test code are simple forward and backward computation of the ReLU function. The next two lines compute the numerical gradient using the same forward function without the backward routine. And at last, we compare these two results elementwise. Note that the above test code can be easily modified to test the GPU version just by replacing CPU arrays with GPU arrays.

In most cases, we do not write the code like the above explicitly because Chainer offers a utility function chainer.gradient_check.check_backward() that follows this procedure.

import unittest

from chainer import gradient_check

class TestReLU(unittest.TestCase):
    def test_backward_cpu(self):

        def f(x):
            return F.relu(x)

        x = np.random.randn(3, 2).astype(np.float32)
        y_grad = np.random.randn(3, 2).astype(np.float32)

        gradient_check.check_backward(f, x, y_grad, atol=1e-4, rtol=1e-4)

You can find many examples of function tests under tests/chainer_tests/functions_tests directory.

You can use chainer.gradient_check.check_double_backward() to run a gradient check for the second-order gradient computed by new-style functions. This function runs two backpropagations; first to compute the gradient gx of y w.r.t. x, and second to compute the gradient of gx w.r.t. x. It can be used like check_backward(), but check_double_backward() expects an additional argument x_grad_grad, which is an array or a tuple of arrays used for initializing the gradient array of each gradient w.r.t. an input. In other words, this argument is used to initialize gx.grad for the second backprop.
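
A hedged sketch of such a test (it uses functions.square() as the target, since ReLU has a trivial second-order gradient; the values mirror the first-order test above):

import unittest

from chainer import gradient_check

class TestSquare(unittest.TestCase):
    def test_double_backward_cpu(self):
        x = np.random.randn(3, 2).astype(np.float32)
        y_grad = np.random.randn(3, 2).astype(np.float32)
        x_grad_grad = np.random.randn(3, 2).astype(np.float32)

        gradient_check.check_double_backward(
            F.square, x, y_grad, x_grad_grad, atol=1e-4, rtol=1e-4)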

Migrating From Old-Style Functions To New-Style Functions

Here are the key differences between Function and FunctionNode.

  • Implementing forward computation (difference between chainer.Function.forward() and chainer.FunctionNode.forward())

    • There is no difference between Function and FunctionNode except that the input arrays are NOT retained by default.

      If you want the inputs to be retained to use them in backward, call retain_inputs() explicitly. In other words, self.retain_inputs(()) has no effect in FunctionNode.

  • Implementing backward computation (difference between chainer.Function.backward() and chainer.FunctionNode.backward())

    • The arguments to the method have been changed.

      • inputs argument is no longer passed.

        You can use get_retained_inputs() and get_retained_outputs() to retrieve the inputs/outputs retained in the forward method. Note that grad_outputs and these retained inputs/outputs are all given as Variable objects, and the backward method must return a tuple of Variable objects.

      • target_input_indexes argument has been added.

        It contains the sorted indices of the input variables w.r.t. which the gradients are required. You can use it to skip calculation of unneeded gradients. The use of target_input_indexes is optional; it is acceptable to calculate and return all gradients.

    • All inputs (grad_outputs) and retained values are given as Variable objects in FunctionNode, whereas they are given as ndarrays in Function.

  • Invoking forward computation

    • Function is a callable, whereas FunctionNode is not.

      You need to use f.apply((x,)) instead of f(x). Note that apply() always returns outputs as a tuple even if the function generates only one output value.

When migrating from old-style to new-style, typically you will need to write a new function class that implements the first-order gradient of the original function. Here is an example of rewriting old-style MyOldFunc unary function to new-style MyFunc function.

class MyOldFunc(chainer.Function):

    def forward(self, inputs):
        x, = inputs
        ...  # forward computation code
        return y,

    def backward(self, inputs, grad_outputs):
        x, = inputs
        gy, = grad_outputs
        ...  # backward computation code
        return gx,

class MyFunc(chainer.FunctionNode):

    def forward(self, inputs):
        self.retain_inputs((0,))
        x, = inputs
        ...  # forward computation code in MyOldFunc
        return y,

    def backward(self, target_input_indexes, grad_outputs):
        x, = self.get_retained_inputs()
        gy, = grad_outputs
        gx, = MyFuncGrad().apply((x, gy))
        return gx,

class MyFuncGrad(chainer.FunctionNode):

    def forward(self, inputs):
        x, gy = inputs
        ...  # backward computation code in MyOldFunc
        return gx,

    def backward(self, target_input_indexes, grad_outputs):
        # You can leave this unimplemented unless you need to compute
        # higher-order derivative using this function.
        raise NotImplementedError()

Implementing Old-Style Functions

Note

As noted in New-Style vs. Old-Style Functions, we recommend that you use the new style for newly implemented functions. This section uses the same example as in Implementing New-Style Functions but using the old style.

First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\). While it is possible to implement this equation using a combination of the * and + functions, defining it as a single function may reduce memory consumption, so it is not only a toy example. Here we call this function MulAdd.

Let’s start with defining MulAdd working on the CPU. Old-style functions must inherit the Function class. The skeleton of a function looks like:

class MulAdd(Function):
    def forward_cpu(self, inputs):
        # do forward computation on CPU
        return some_tuple

    def backward_cpu(self, inputs, grad_outputs):
        # do backward computation on CPU
        return some_tuple

We must implement forward_cpu() and backward_cpu() methods. The non-self arguments of these functions are tuples of array(s), and these functions must return a tuple of array(s).

Warning

Be careful to return a tuple of arrays even if you have just one array to return.

MulAdd is simple and implemented as follows:

class MulAdd(Function):
    def forward_cpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_cpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

As per the warning above, the forward_cpu method returns a tuple with a single element. Note that all arrays appearing in CPU functions are numpy.ndarray. The forward function is straightforward; it unpacks the input tuple, computes the output, and packs it into a tuple. The backward function is a bit more complicated. Recall the rule of differentiation of multiplication. This example just implements the rule. Looking at the return values, the function just packs the gradient of each input in the same order and returns them.

By just defining the core computation of forward and backward, Function class provides a chaining logic on it (i.e., storing the history of computation, etc.).

Note

Assuming we implement a (forward) function \(y=f(x)\) which takes as input the vector \(x \in \mathbb{R}^n\) and produces as output a vector \(y \in \mathbb{R}^m\). Then the backward method has to compute

\[\lambda_i = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i} \, \gamma_j \,\, \text{for}\, i = 1 \dots n\]

where \(\gamma\) is the grad_outputs. Note that the resulting vector \(\lambda\) must have the same shape as the arguments of the forward method.

Now let’s define the corresponding GPU methods. You can easily predict that the methods we have to write are named forward_gpu() and backward_gpu():

class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

In GPU methods, arrays are of type cupy.ndarray. We use arithmetic operators defined for this class. These operators implement the basic elementwise arithmetic.

You may find that the definitions of the GPU methods are exactly the same as those of the CPU methods. In that case, we can reduce them to forward() and backward() methods.

class MulAdd(Function):
    def forward(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

Since the cupy.ndarray class implements many methods of numpy.ndarray, we can write these unified methods in most cases.

The MulAdd function can be used as follows:

x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w = MulAdd()(x, y, z)

It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:

def muladd(x, y, z):
    return MulAdd()(x, y, z)

w = muladd(x, y, z)

All functions under chainer.functions are implemented as wrapper functions like this.

Unified forward/backward methods with NumPy/CuPy functions

CuPy implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Consider that we want to write a backprop-able function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:

from chainer.backends import cuda

class ExpAdd(Function):
    def forward_cpu(self, inputs):
        x, y = inputs
        z = np.exp(x) + np.exp(y)
        return z,

    def backward_cpu(self, inputs, grad_outputs):
        x, y = inputs
        gz, = grad_outputs

        gx = gz * np.exp(x)
        gy = gz * np.exp(y)
        return gx, gy

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y = inputs
        z = cupy.exp(x) + cupy.exp(y)
        return z,

    def backward_gpu(self, inputs, grad_outputs):
        cupy = cuda.cupy
        x, y = inputs
        gz, = grad_outputs

        gx = gz * cupy.exp(x)
        gy = gz * cupy.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)

Note

Here we used chainer.backends.cuda.cupy instead of directly accessing cupy. This is because the cupy module cannot be imported if CUDA is not installed. In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy module. Note that the chainer.backends.cuda module can be imported even if CUDA is not installed. Of course, the module in such an environment is almost useless, but if the interpreter does not run through the code accessing CUDA-dedicated functions, the code is still valid.

The CPU and GPU implementations are almost the same, except that numpy is replaced by cupy in the GPU methods. We can unify these functions using the chainer.backend.get_array_module() function. This function accepts an arbitrary number of arrays and returns the appropriate module for them. See the following code:

class ExpAdd(Function):
    def forward(self, inputs):
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        z = xp.exp(x) + xp.exp(y)
        return z,

    def backward(self, inputs, grad_outputs):
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        gz, = grad_outputs

        gx = gz * xp.exp(x)
        gy = gz * xp.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)

Note that this code works correctly even if CUDA is not installed in the environment. If CUDA is not found, the get_array_module() function always returns numpy. We often use the name xp for this module, analogous to the abbreviation np for NumPy and cp for CuPy.

Write an Elementwise Kernel Function

Let’s turn back to the MulAdd example.

The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores. However, it invokes two kernels during each of the forward (w = x * y + z) and backward (gx = y * gw and gy = x * gw) computations. This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes much bandwidth. We can reduce the number of invocations by defining our own kernel. It also reduces memory consumption.

Most functions only require elementwise operations like MulAdd. CuPy provides a useful tool to define elementwise kernels, the cupy.ElementwiseKernel class, and Chainer wraps it by chainer.backends.cuda.elementwise() function. Our MulAdd implementation can be improved as follows:

class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'float32 x, float32 y, float32 z',
            'float32 w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'float32 x, float32 y, float32 gw',
            'float32 gx, float32 gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz

The chainer.backends.cuda.elementwise() function accepts the essential implementation of the kernel function and returns a kernel invocation function (actually, it returns an ElementwiseKernel object, which is callable). In typical usage, we pass four arguments to this function as follows:

  1. Input argument list. This is a comma-separated string each entry of which consists of a type specification and an argument name.

  2. Output argument list in the same format as the input argument list.

  3. Body of the parallel loop. We can use the input/output argument names to refer to an element of the corresponding arrays.

  4. Name of the kernel function, which is shown in debuggers and profilers.

The above code is not compiled on every forward/backward computation, thanks to two caching mechanisms provided by chainer.backends.cuda.elementwise().

The first one is binary caching: chainer.backends.cuda.elementwise() function caches the compiled binary in the $(HOME)/.cupy/kernel_cache directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value. This caching mechanism is actually implemented in CuPy.

The second one is upload caching: Given a compiled binary code, we have to upload it to the current GPU in order to execute it. chainer.backends.cuda.elementwise() function memoizes the arguments and the current device, and if it is called with the same arguments for the same device, it reuses the previously uploaded kernel code.

The above MulAdd code only works for float32 arrays. ElementwiseKernel also supports type-variadic kernel definitions. In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:

class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'T x, T y, T z',
            'T w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'T x, T y, T gw',
            'T gx, T gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz

The type placeholder T indicates an arbitrary data type that CuPy supports.

There are more functionalities on user-defined kernels in CuPy. See the CuPy documentation on user-defined kernels for more details.

Creating Models

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

Most neural network architectures contain multiple links. For example, a multi-layer perceptron consists of multiple linear layers. We can write complex procedures with parameters by combining multiple links like this:

>>> l1 = L.Linear(4, 3)
>>> l2 = L.Linear(3, 2)

>>> def my_forward(x):
...     h = l1(x)
...     return l2(h)

Here the L indicates the links module. A procedure with parameters defined in this way is hard to reuse. A more Pythonic way is to combine the links and procedures into a class:

>>> class MyProc(object):
...     def __init__(self):
...         self.l1 = L.Linear(4, 3)
...         self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)

In order to make it more reusable, we want to support parameter management, CPU/GPU migration, robust and flexible save/load features, etc. These features are all supported by the Chain class in Chainer. Then, what we have to do here is just define the above class as a subclass of Chain:

>>> class MyChain(Chain):
...     def __init__(self):
...         super(MyChain, self).__init__()
...         with self.init_scope():
...             self.l1 = L.Linear(4, 3)
...             self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)

This shows how a complex chain is constructed from simpler links. Links like l1 and l2 are called child links of MyChain. Note that Chain itself inherits from Link, which means we can define more complex chains that hold MyChain objects as their child links.

Note

We often define the whole forward computation of a link in its single forward method. Such links and chains are callable and behave like regular functions of Variables.

Another way to define a chain is using the ChainList class, which behaves like a list of links:

>>> class MyChain2(ChainList):
...     def __init__(self):
...         super(MyChain2, self).__init__(
...             L.Linear(4, 3),
...             L.Linear(3, 2),
...         )
...
...     def forward(self, x):
...         h = self[0](x)
...         return self[1](h)

ChainList is convenient when the number of links is arbitrary. However, if the number of links is fixed, as in the above case, the Chain class is recommended as a base class.
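
As a quick check, a chain defined as a Chain subclass can be instantiated and called like a function on an input array. The following is a minimal sketch using the MyChain class above; the input width matches its first Linear layer:

model = MyChain()
x = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)  # two samples, four features
y = model(x)       # calls MyChain.forward and returns a Variable
print(y.shape)     # (2, 2): two samples, two output units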

Optimizer

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

From the previous guide on Creating Models, let’s use the MyChain class:

>>> class MyChain(Chain):
...     def __init__(self):
...         super(MyChain, self).__init__()
...         with self.init_scope():
...             self.l1 = L.Linear(4, 3)
...             self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)

To tune parameter values so that the loss is minimized, we optimize them with the Optimizer class. It runs a numerical optimization algorithm on a given link. Many algorithms are implemented in the optimizers module. Here we use the simplest one, called Stochastic Gradient Descent (SGD):

>>> model = MyChain()
>>> optimizer = optimizers.SGD().setup(model)

The method setup() prepares for the optimization given a link.

Some parameter/gradient manipulations, e.g. weight decay and gradient clipping, can be done by setting hook functions to the optimizer. Hook functions are called after the gradient computation and right before the actual update of parameters. For example, we can set weight decay regularization by running the next line beforehand:

>>> optimizer.add_hook(chainer.optimizer_hooks.WeightDecay(0.0005))

Of course, you can also write your own hook functions. A hook should be a function or a callable object.
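
For instance, a hook that is called with the optimizer can iterate over the target link's parameters and manipulate their gradients directly. The GradientScaling class below is a hypothetical example, not part of Chainer:

class GradientScaling(object):
    # Hypothetical hook: multiplies every gradient by a constant factor.
    # Hooks without call_for_each_param are called with the optimizer itself.
    name = 'GradientScaling'

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, optimizer):
        for param in optimizer.target.params():
            if param.grad is not None:
                param.grad *= self.factor

optimizer.add_hook(GradientScaling(0.5))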

There are two ways to use the optimizer. One is using it via Trainer, which we will see in the following sections. The other way is using it directly. We here review the latter case. To use the optimizer in an automated fashion, see the Trainer guide.

There are two further ways to use the optimizer directly. One is manually computing gradients and then calling the update() method with no arguments. Do not forget to clear the gradients beforehand!

>>> x = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> model.cleargrads()
>>> # compute gradient here...
>>> loss = F.sum(model(chainer.Variable(x)))
>>> loss.backward()
>>> optimizer.update()

The other way is just passing a loss function to the update() method. In this case, cleargrads() is automatically called by the update method, so the user does not have to call it manually.

>>> def lossfun(arg1, arg2):
...     # calculate loss
...     loss = F.sum(model(arg1 - arg2))
...     return loss

>>> arg1 = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> arg2 = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> optimizer.update(lossfun, chainer.Variable(arg1), chainer.Variable(arg2))

See chainer.Optimizer.update() for the full specification.

Trainer

When we want to train neural networks, we have to run training loops that update the parameters many times. A typical training loop consists of the following procedures:

  1. Iterations over training datasets

  2. Preprocessing of extracted mini-batches

  3. Forward/backward computations of the neural networks

  4. Parameter updates

  5. Evaluations of the current parameters on validation datasets

  6. Logging and printing of the intermediate results

Chainer provides a simple yet powerful way to make it easy to write such training processes. The training loop abstraction mainly consists of two components:

  • Dataset abstraction. It implements 1 and 2 in the above list. The core components are defined in the dataset module. There are also many implementations of datasets and iterators in datasets and iterators modules, respectively.

  • Trainer. It implements 3, 4, 5, and 6 in the above list. The whole procedure is implemented by Trainer. The way to update parameters (3 and 4) is defined by Updater, which can be freely customized. 5 and 6 are implemented by instances of Extension, each of which appends an extra procedure to the training loop. Users can freely customize the training procedure by adding extensions, and can also implement their own extensions. A minimal setup combining these components is sketched after this list.
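
The sketch below shows how these pieces fit together, assuming model and train are already defined as in the earlier examples (e.g. an L.Classifier-wrapped chain and a dataset such as a TupleDataset):

train_iter = iterators.SerialIterator(train, batch_size=128)          # dataset abstraction (1, 2)
optimizer = optimizers.SGD().setup(model)
updater = training.updaters.StandardUpdater(train_iter, optimizer)    # forward/backward and updates (3, 4)
trainer = training.Trainer(updater, (20, 'epoch'), out='result')
trainer.extend(extensions.LogReport())                                # logging (6)
trainer.run()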

Trainer Extensions

In this section, you will learn about the following topics:

  • What a trainer Extension is

  • How to write an extension as a simple function

  • How to write an extension as a function decorated with @make_extension

  • How to write an extension as a class inherited from the Extension class

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

What is a trainer Extension?

An Extension is a callable object that takes a Trainer object as an argument. When you add an Extension to a Trainer using the extend() method, the Extension is called according to the schedule specified by a trigger object (see the details in 1. trigger below).

The Trainer object contains all the information used in a training loop, e.g., models, optimizers, updaters, iterators, and datasets. This makes it possible to change settings such as the learning rate of an optimizer.

Write a simple function

You can make a new Extension by writing a simple function which takes a Trainer object as its argument. For example, when you want to reduce the learning rate periodically during training, an lr_drop extension can be written as follows:

def lr_drop(trainer):
    trainer.updater.get_optimizer('main').lr *= 0.1

Then you can add this function to a Trainer object via the extend() method.

trainer.extend(lr_drop, trigger=(10, 'epoch'))

It lowers the learning rate every 10 epochs by multiplying the current learning rate by 0.1.

Write a function decorated with @make_extension

make_extension() is a decorator that adds some attributes to a given function. For example, the simple extension we created above can be written in this form:

@training.make_extension(trigger=(10, 'epoch'))
def lr_drop(trainer):
    trainer.updater.get_optimizer('main').lr *= 0.1

The difference from the previous example is whether the extension has a default trigger. Here, lr_drop() has its own default trigger, so unless another trigger is specified via the extend() method, the trigger given to make_extension() is used. The code below behaves the same as the former example, i.e., it reduces the learning rate every 10 epochs.

trainer.extend(lr_drop)

There are several attributes you can add using the make_extension() decorator.

1. trigger

trigger is an object that takes a Trainer object as an argument and returns a boolean value. If a tuple in the form (period, unit) is given as a trigger, it will be considered as an IntervalTrigger that invokes the extension every period unit. For example, when the given tuple is (10, 'epoch'), the extension will run every 10 epochs.

trigger can also be given to the extend() method that adds an extension to a Trainer object. The priority of triggers is as follows: a trigger passed to extend() takes precedence over the default trigger attached by make_extension(); if neither is given, the extension is invoked every iteration.

See the documentation of get_trigger() for more details.
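
For example, the default trigger attached by make_extension() can be overridden when the extension is registered. A small sketch reusing the lr_drop extension defined above:

# The trigger passed to extend() takes precedence over the default (10, 'epoch').
trainer.extend(lr_drop, trigger=(5, 'epoch'))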

2. default_name

Each Extension is kept in a dictionary held by the Trainer. This argument gives the name of the Extension. Users will see this name in the keys of the snapshot, which is a dictionary generated by serialization.

3. priority

As a Trainer object can be assigned multiple Extension objects, the execution order is defined according to the following three values:

  • PRIORITY_WRITER: The priority for extensions that write some records to the observation dictionary. It includes cases where the extension directly adds values to the observation dictionary, or uses the chainer.report() function to report values to the observation dictionary. Extensions that write to the reporter should go first, because extensions that read those values may be added later.

  • PRIORITY_EDITOR: The priority for extensions that edit the observation dictionary based on already reported values. Extensions that edit reported values should run after the extensions that write values to the reporter, but before the extensions that read the final values.

  • PRIORITY_READER: The priority for extensions that only read records from the observation dictionary. This is also suitable for extensions that do not use the observation dictionary at all. Extensions that read the reported values should be fired after all the extensions with other priorities, e.g., PRIORITY_WRITER and PRIORITY_EDITOR, because they should read the final values.

See the details in the documentation of Trainer for more information.

4. finalizer

You can specify a function to finalize the extension. It is called once at the end of the training loop, i.e., when run() has finished.

5. initializer

You can specify a function which takes a Trainer object as an argument to initialize the extension. It is called once before the training loop begins.

Write a class inherited from the Extension class

This is the way to define your own extension with the maximum degree of freedom. You can keep any values inside of the extension and serialize them.

As an example, let’s make an extension that drops the learning rate polynomially. It calculates the learning rate by this equation:

\[\eta = \eta_{\rm init} \left( 1 - \frac{t}{t_{\rm max}} \right)^{\rm power}\]

The learning rate will be dropped according to the curve below with \({\rm power} = 0.5\):

_images/polynomial.png

class PolynomialShift(training.Extension):

    def __init__(self, attr, power, stop_trigger, batchsize=None,
                 len_dataset=None):
        self._attr = attr
        self._power = power
        self._init = None
        self._t = 0
        self._last_value = 0

        if stop_trigger[1] == 'iteration':
            self._maxiter = stop_trigger[0]
        elif stop_trigger[1] == 'epoch':
            if batchsize is None or len_dataset is None:
                raise ValueError(
                    'When the unit of \'stop_trigger\' is \'epoch\', '
                    '\'batchsize\' and \'len_dataset\' should be '
                    'specified to calculate the maximum iteration.')
            n_iter_per_epoch = len_dataset / float(batchsize)
            self._maxiter = float(stop_trigger[0] * n_iter_per_epoch)

    def initialize(self, trainer):
        optimizer = trainer.updater.get_optimizer('main')
        # ensure that _init is set
        if self._init is None:
            self._init = getattr(optimizer, self._attr)

    def __call__(self, trainer):
        self._t += 1

        optimizer = trainer.updater.get_optimizer('main')
        value = self._init * ((1 - (self._t / self._maxiter)) ** self._power)
        setattr(optimizer, self._attr, value)
        self._last_value = value

    def serialize(self, serializer):
        self._t = serializer('_t', self._t)
        self._last_value = serializer('_last_value', self._last_value)
        if isinstance(self._last_value, np.ndarray):
            self._last_value = self._last_value.item()

stop_trigger = (10000, 'iteration')
trainer.extend(PolynomialShift('lr', 0.5, stop_trigger))

This extension PolynomialShift takes five arguments.

  • attr: The name of the optimizer property you want to update using this extension.

  • power: The power of the above equation to calculate the learning rate.

  • stop_trigger: The trigger given to the Trainer object to specify when to stop the training loop.

  • batchsize: The training mini-batchsize.

  • len_dataset: The length of the dataset, i.e., the number of examples in the training dataset.

This extension calculates the number of iterations which will be performed during training by using stop_trigger, batchsize, and len_dataset, then stores it as a property _maxiter. This property will be used in the __call__() method to update the learning rate. The initialize() method obtains the initial learning rate from the optimizer given to the Trainer object. The serialize() method stores or recovers the properties, _t (number of iterations) and _last_value (the latest learning rate), belonging to this extension.

Using GPU(s) in Chainer

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

In this section, you will learn about the following topics:

  • Relationship between Chainer and CuPy

  • Basics of CuPy

  • Single-GPU usage of Chainer

  • Multi-GPU usage of model-parallel computing

  • Multi-GPU usage of data-parallel computing

After reading this section, you will be able to:

  • Use Chainer on a CUDA-enabled GPU

  • Write model-parallel computing in Chainer

  • Write data-parallel computing in Chainer

Relationship between Chainer and CuPy

Note

Even if you have CUDA installed in your environment, you have to install CuPy separately to use GPUs. See Working with Custom CUDA Installation for the way to set up CUDA support.

Chainer uses CuPy as its backend for GPU computation. In particular, the cupy.ndarray class is the GPU array implementation for Chainer. CuPy supports a subset of NumPy features with a compatible interface. It enables us to write common code for CPU and GPU. It also supports PyCUDA-like user-defined kernel generation, which enables us to write fast implementations dedicated to the GPU.

Note

The chainer.backends.cuda module imports many important symbols from CuPy. For example, the cupy namespace is referred to as cuda.cupy in the Chainer code. Note that the chainer.backends.cuda module can be imported even if CUDA is not installed.

Chainer uses a memory pool for GPU memory allocation. As shown in the previous sections, Chainer constructs and destructs many arrays during learning and evaluating iterations. This pattern is not well suited to the CUDA architecture, since memory allocation and release in CUDA (i.e. the cudaMalloc and cudaFree functions) synchronize CPU and GPU computations, which hurts performance. In order to avoid memory allocation and deallocation during the computation, Chainer uses CuPy’s memory pool as the standard memory allocator. Chainer changes the default allocator of CuPy to the memory pool, so users can use CuPy functions directly without dealing with the memory allocator.
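
If you want to inspect or release the pooled memory yourself, CuPy exposes the default pool directly. A small sketch (requires CuPy to be installed):

import cupy

x = cupy.zeros((1000, 1000), dtype=cupy.float32)  # allocated through the memory pool
pool = cupy.get_default_memory_pool()
print(pool.used_bytes(), pool.total_bytes())      # bytes in use / bytes held by the pool
pool.free_all_blocks()                            # return unused cached blocks to the device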

Basics of cupy.ndarray

See the documentation of CuPy for the basic usage of cupy.ndarray.

CuPy is a GPU array backend that implements a subset of the NumPy interface. The cupy.ndarray class is at its core; it is a GPU-compatible alternative to numpy.ndarray. CuPy implements many functions on cupy.ndarray objects. See the reference for the supported subset of the NumPy API. Understanding NumPy helps you make the most of CuPy; see the NumPy documentation to learn more.

The main difference between cupy.ndarray and numpy.ndarray is that the content is allocated on the device memory. The allocation takes place on the current device by default. The current device can be changed using a cupy.cuda.Device object as follows:

with cupy.cuda.Device(1):
    x_on_gpu1 = cupy.array([1, 2, 3, 4, 5])

Most operations of CuPy are done on the current device. Be careful that processing an array stored on a non-current device causes an error.

Chainer provides some convenient functions to automatically switch and choose the device. For example, the chainer.backends.cuda.to_gpu() function copies a numpy.ndarray object to a specified device:

x_cpu = np.ones((5, 4, 3), dtype=np.float32)
x_gpu = cuda.to_gpu(x_cpu, device=1)

It is equivalent to the following code using CuPy:

x_cpu = np.ones((5, 4, 3), dtype=np.float32)
with cupy.cuda.Device(1):
    x_gpu = cupy.array(x_cpu)

Moving a device array to the host can be done by chainer.backends.cuda.to_cpu() as follows:

x_cpu = cuda.to_cpu(x_gpu)

It is equivalent to the following code using CuPy:

with x_gpu.device:
    x_cpu = x_gpu.get()

Note

The with statements in these code snippets are required to select the appropriate CUDA device. If you use only one device, this device switching is not needed. The chainer.backends.cuda.to_cpu() and chainer.backends.cuda.to_gpu() functions automatically switch the current device correctly.

Chainer also provides the convenient functions chainer.backends.cuda.get_device_from_id() and chainer.backends.cuda.get_device_from_array() to select a device. The former accepts an integer or None. When None is given, it returns a dummy device object; otherwise, it returns the corresponding device object. The latter accepts a CuPy array or a NumPy array. When a NumPy array is given, it returns a dummy device object; otherwise, it returns the device object corresponding to the given CuPy array. The dummy device object also supports with statements like the above example but does nothing. Here are some other examples:

cuda.get_device_from_id(1).use()
x_gpu1 = cupy.empty((4, 3), dtype=cupy.float32)

with cuda.get_device_from_id(1):
    x_gpu1 = cupy.empty((4, 3), dtype=cupy.float32)

with cuda.get_device_from_array(x_gpu1):
    y_gpu1 = x_gpu1 + 1

Since it accepts NumPy arrays, we can write a function that accepts both NumPy and CuPy arrays with correct device switching:

def add1(x):
    with cuda.get_device_from_array(x):
        return x + 1

The compatibility of CuPy with NumPy enables us to write CPU/GPU-generic code. This is made easy by the chainer.backend.get_array_module() function, which returns the numpy or cupy module depending on its arguments. A CPU/GPU-generic function can be defined using it as follows:

# Stable implementation of log(1 + exp(x))
def softplus(x):
    xp = backend.get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))

Run Neural Networks on a Single GPU

Single-GPU usage is very simple. All you have to do is transfer the Link and input arrays to the GPU beforehand. In this subsection, the code is based on our first MNIST example in this tutorial.

A Link object can be transferred to the specified GPU using the to_gpu() method.

This time, we make the number of input, hidden, and output units configurable. The to_gpu() method also accepts a device ID like model.to_gpu(0). In this case, the link object is transferred to the specified GPU device. The current device is used by default.
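
For example, assuming the MLP chain from the MNIST example, transferring a model looks like the following sketch:

model = L.Classifier(MLP(1000, 10))  # the input size, 784, is inferred
model.to_gpu()       # transfer the parameters to the current GPU
# model.to_gpu(0)    # or transfer them to GPU 0 explicitly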

If we use chainer.training.Trainer, all we have to do is let the updater know the device ID to which each mini-batch should be sent.

updater = training.updaters.StandardUpdater(train_iter, optimizer, device=0)
trainer = training.Trainer(updater, (20, 'epoch'), out='result')

We have to specify the device ID for the evaluator extension as well.

trainer.extend(extensions.Evaluator(test_iter, model, device=0))

When we write down the training loop by hand, we have to transfer each mini-batch to the GPU manually:

model.to_gpu()
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(cuda.to_gpu(x_train[indexes[i : i + batchsize]]))
        t = Variable(cuda.to_gpu(y_train[indexes[i : i + batchsize]]))
        optimizer.update(model, x, t)

Model-parallel Computation on Multiple GPUs

Parallelization of machine learning is roughly classified into two types called “model-parallel” and “data-parallel”. Model-parallel means parallelizations of the computations inside the model. In contrast, data-parallel means parallelizations using data sharding. In this subsection, we show how to use the model-parallel approach on multiple GPUs in Chainer.

Recall the MNIST example. Now suppose that we want to modify this example by expanding the network to 6 layers with 2000 units each using two GPUs. In order to make multi-GPU computation efficient, we only make the two GPUs communicate at the third and sixth layer. The overall architecture looks like the following diagram:

(GPU0) input --+--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+--> output
               |                       |                       |
(GPU1)         +--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+

Using the above MLP chain, this can be drawn as the following diagram:

(GPU0) input --+--> mlp1 --+--> mlp2 --+--> output
               |           |           |
(GPU1)         +--> mlp1 --+--> mlp2 --+

Let’s write a link for the whole network.

class ParallelMLP(Chain):
    def __init__(self):
        super(ParallelMLP, self).__init__()
        with self.init_scope():
            # the input size, 784, is inferred
            self.mlp1_gpu0 = MLP(1000, 2000).to_gpu(0)
            self.mlp1_gpu1 = MLP(1000, 2000).to_gpu(1)

            # the input size, 2000, is inferred
            self.mlp2_gpu0 = MLP(1000, 10).to_gpu(0)
            self.mlp2_gpu1 = MLP(1000, 10).to_gpu(1)

    def forward(self, x):
        # assume x is on GPU 0
        z0 = self.mlp1_gpu0(x)
        z1 = self.mlp1_gpu1(F.copy(x, 1))

        # sync
        h0 = F.relu(z0 + F.copy(z1, 0))
        h1 = F.relu(z1 + F.copy(z0, 1))

        y0 = self.mlp2_gpu0(h0)
        y1 = self.mlp2_gpu1(h1)

        # sync
        y = y0 + F.copy(y1, 0)
        return y  # output is on GPU0

Recall that the Link.to_gpu() method returns the link itself. The copy() function copies an input variable to the specified GPU device and returns a new variable on that device. The copy supports backprop, which simply transfers the output gradient back to the input device.

Note

The above code is not parallelized on the CPU, but is parallelized on the GPU. This is because all the functions in the above code run asynchronously with respect to the host CPU.

An almost identical example code can be found at examples/mnist/train_mnist_model_parallel.py.

Data-parallel Computation on Multiple GPUs with Trainer

Data-parallel computation is another strategy to parallelize online processing. In the context of neural networks, it means that a different device does computation on a different subset of the input data. In this subsection, we review the way to achieve data-parallel learning on two GPUs.

Suppose again that our task is the MNIST example. This time we want to directly parallelize the three-layer network. The simplest form of data-parallelization is parallelizing the gradient computation over distinct subsets of the data. First, define model and optimizer instances:

model = L.Classifier(MLP(1000, 10))  # the input size, 784, is inferred
optimizer = optimizers.SGD()
optimizer.setup(model)

Recall that the MLP link implements the multi-layer perceptron, and the Classifier link wraps it to provide a classifier interface. We used StandardUpdater in the previous example. In order to enable data-parallel computation with multiple GPUs, we only have to replace it with ParallelUpdater.

updater = training.updaters.ParallelUpdater(train_iter, optimizer,
                                   devices={'main': 0, 'second': 1})

The devices option specifies which devices to use in data-parallel learning. The device with name 'main' is used as the main device. The original model is sent to this device, so the optimization runs on the main device. In the above example, the model is also cloned and sent to GPU 1. Half of each mini-batch is fed to this cloned model. After every backward computation, the gradient is accumulated into the main device, the parameter update runs on it, and then the updated parameters are sent to GPU 1 again.

See also the example code in examples/mnist/train_mnist_data_parallel.py.

Data-parallel Computation on Multiple GPUs without Trainer

We here introduce a way to write data-parallel computation without the help of Trainer. Most users can skip this section. If you are interested in how to write a data-parallel computation by yourself, this section should be informative. It is also helpful to, e.g., customize the ParallelUpdater class.

We again start from the MNIST example. This time, we use suffixes like _0 and _1 to distinguish objects on each device. First, we define a model.

model_0 = L.Classifier(MLP(1000, 10))  # the input size, 784, is inferred

We want to make two copies of this instance on different GPUs. The Link.to_gpu() method runs in place, so we cannot use it to make a copy. Instead, we can use the Link.copy() method.

model_1 = model_0.copy()
model_0.to_gpu(0)
model_1.to_gpu(1)

The Link.copy() method copies the link into another instance. It just copies the link hierarchy, and does not copy the arrays it holds.

Then, set up an optimizer:

optimizer = optimizers.SGD()
optimizer.setup(model_0)

Here we use the first copy of the model as the master model. Before its update, gradients of model_1 must be aggregated to those of model_0.

Then, we can write a data-parallel learning loop as follows:

batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x_batch = x_train[indexes[i : i + batchsize]]
        y_batch = y_train[indexes[i : i + batchsize]]

        x0 = Variable(cuda.to_gpu(x_batch[:batchsize//2], 0))
        t0 = Variable(cuda.to_gpu(y_batch[:batchsize//2], 0))
        x1 = Variable(cuda.to_gpu(x_batch[batchsize//2:], 1))
        t1 = Variable(cuda.to_gpu(y_batch[batchsize//2:], 1))

        loss_0 = model_0(x0, t0)
        loss_1 = model_1(x1, t1)

        model_0.cleargrads()
        model_1.cleargrads()

        loss_0.backward()
        loss_1.backward()

        model_0.addgrads(model_1)
        optimizer.update()

        model_1.copyparams(model_0)

Do not forget to clear the gradients of both model copies! One half of the mini-batch is forwarded to GPU 0, and the other half to GPU 1. Then the gradients are accumulated with the Link.addgrads() method, which adds the gradients of the given link to those of self. After the gradients are prepared, we can update the optimizer in the usual way. Note that the update only modifies the parameters of model_0, so we must manually copy them to model_1 using the Link.copyparams() method.

Note

If the batch size used by each model remains the same, the scale of the gradient is roughly proportional to the number of models when we aggregate gradients from all models by chainer.Link.addgrads(). So you need to adjust the batch size and/or learning rate of the optimizer accordingly.


Now you can use Chainer with GPUs. All examples in the examples directory support GPU computation, so please refer to them if you want to know more about using GPUs in practice. In the next section, we will show how to define a differentiable (i.e. backprop-able) function on Variable objects. We will also show how to write a simple (elementwise) CUDA kernel using Chainer’s CUDA utilities.

Type Checks

In this section, you will learn about the following things:

  • Basic usage of type check

  • Detail of type information

  • Internal mechanism of type check

  • More complicated cases

  • Call functions

  • Typical type check example

After reading this section, you will be able to:

  • Write a code to check types of input arguments of your own functions

Basic usage of type check

When you call a function with an invalid type of array, you sometimes receive no error but get an unexpected result due to broadcasting. When you use CUDA with an illegal type of array, it causes memory corruption, and you get a serious error. These bugs are hard to fix. Chainer can check the preconditions of each function and thereby helps prevent such problems. These conditions may also help users understand the specification of functions.

Each implementation of Function has a method for type check, check_type_forward(). This function is called just before the forward() method of the Function class. You can override this method to check the condition on types and shapes of arguments.

check_type_forward() gets an argument in_types:

def check_type_forward(self, in_types):
  ...

in_types is an instance of TypeInfoTuple, which is a sub-class of tuple. To get type information about the first argument, use in_types[0]. If the function gets multiple arguments, we recommend using new variables for readability:

x_type, y_type = in_types

In this case, x_type represents the type of the first argument, and y_type represents the second one.

We describe the usage of in_types with an example. When you want to check whether the number of dimensions of x_type equals 2, write this code:

utils.type_check.expect(x_type.ndim == 2)

When this condition is true, nothing happens. Otherwise this code throws an exception, and the user gets a message like this:

Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].ndim == 2
Actual: 3 != 2

This error message means that “the ndim of the first argument is expected to be 2, but it is actually 3”.

Detail of type information

You can access three kinds of information through x_type.

  • .shape is a tuple of ints. Each value is the size of the corresponding dimension.

  • .ndim is an int value representing the number of dimensions. Note that ndim == len(shape).

  • .dtype is a numpy.dtype representing the data type of the value.

You can check all of these members. For example, to require that the size of the first dimension is positive, you can write:

utils.type_check.expect(x_type.shape[0] > 0)

You can also check data types with .dtype:

utils.type_check.expect(x_type.dtype == np.float64)

And an error is like this:

Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].dtype == <class 'numpy.float64'>
Actual: float32 != <class 'numpy.float64'>

You can also check the kind of the dtype. This code checks whether the type is floating point:

utils.type_check.expect(x_type.dtype.kind == 'f')

You can compare members between variables. For example, the following code checks whether the first argument and the second argument have the same size along their second dimension:

utils.type_check.expect(x_type.shape[1] == y_type.shape[1])

Internal mechanism of type check

How does it show an error message like "in_types[0].ndim == 2"? If x_type were an object containing an ndim member variable, we could not show such an error message, because the expression would be evaluated to a boolean value by the Python interpreter.

Actually, x_type is an Expr object and doesn’t have an ndim member variable itself. Expr represents a syntax tree. x_type.ndim makes an Expr object representing (getattr, x_type, 'ndim'). x_type.ndim == 2 makes an object like (eq, (getattr, x_type, 'ndim'), 2). expect() gets an Expr object and evaluates it. When it is True, it causes no error and shows nothing. Otherwise, this method shows a readable error message.

If you want to evaluate an Expr object, call its eval() method:

actual_type = x_type.eval()

actual_type is an instance of TypeInfo, while x_type is an instance of Expr. In the same way, x_type.shape[0].eval() returns an int value.

More powerful methods

The Expr class is more powerful. It supports all mathematical operators such as + and *. You can write a condition that the first dimension of x_type is four times the first dimension of y_type:

utils.type_check.expect(x_type.shape[0] == y_type.shape[0] * 4)

When x_type.shape[0] == 3 and y_type.shape[0] == 1, users can get the error message below:

Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_types[1].shape[0] * 4
Actual: 3 != 4

To compare against a member variable of your function, wrap the value with utils.type_check.Variable to get a readable error message:

x_type.shape[0] == utils.type_check.Variable(self.in_size, "in_size")

This code can check the equivalent condition below:

x_type.shape[0] == self.in_size

However, the latter condition doesn’t know the meaning of this value. When this condition is not satisfied, the latter code shows an unreadable error message:

chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == 4  # what does '4' mean?
Actual: 3 != 4

Note that the second argument of utils.type_check.Variable is only for readability.

The former shows this message:

chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_size  # OK, `in_size` is a value that is given to the constructor
Actual: 3 != 4  # You can also check actual value here

Call functions

How do we check the sum of all values of a shape? Expr also supports function calls:

sum = utils.type_check.Variable(np.sum, 'sum')
utils.type_check.expect(sum(x_type.shape) == 10)

Why do we need to wrap the function numpy.sum with utils.type_check.Variable? x_type.shape is not a tuple but an Expr object, as we have seen before. Therefore, numpy.sum(x_type.shape) fails. We need to evaluate this function lazily.

The above example produces an error message like this:

Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: sum(in_types[0].shape) == 10
Actual: 7 != 10

More complicated cases

How do we write a more complicated condition that can’t be expressed with these operators? Evaluate the Expr and get its result value with the eval() method, then check the condition and raise the error by hand:

x_shape = x_type.shape.eval()  # get actual shape (int tuple)
if not more_complicated_condition(x_shape):
    expect_msg = 'Shape is expected to be ...'
    actual_msg = 'Shape is ...'
    raise utils.type_check.InvalidType(expect_msg, actual_msg)

Please write a readable error message. This code generates the following error message:

Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: Shape is expected to be ...
Actual: Shape is ...

Typical type check example

We show a typical type check for a function.

First check the number of arguments:

utils.type_check.expect(in_types.size() == 2)

in_types.size() returns an Expr object representing the number of arguments. You can check it in the same way.

And then, get each type:

x_type, y_type = in_types

Don’t unpack the values before checking in_types.size(). When the number of arguments is invalid, type_check.expect might output unhelpful error messages. For example, this code doesn’t work when the size of in_types is 0:

utils.type_check.expect(
  in_types.size() == 2,
  in_types[0].ndim == 3,
)

After that, check each type:

utils.type_check.expect(
  x_type.dtype == np.float32,
  x_type.ndim == 3,
  x_type.shape[1] == 2,
)

The above example works correctly even when x_type.ndim == 0 as all conditions are evaluated lazily.
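
Putting the pieces together, a complete check_type_forward() might look like the following sketch (the function, shapes, and dtypes here are illustrative, not a real Chainer function):

class MyFunction(Function):
    # Hypothetical function taking two 2-dimensional floating-point arrays
    # whose first dimensions must match.
    def check_type_forward(self, in_types):
        utils.type_check.expect(in_types.size() == 2)
        x_type, y_type = in_types
        utils.type_check.expect(
            x_type.dtype.kind == 'f',
            x_type.ndim == 2,
            y_type.dtype == x_type.dtype,
            x_type.shape[0] == y_type.shape[0],
        )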

Serializers – saving and loading

A Serializer is a simple interface to serialize or deserialize an object. Link, Optimizer, and Trainer all support serialization.

Concrete serializers are defined in the serializers module. It supports NumPy NPZ and HDF5 formats.

For example, we can serialize a link object into an NPZ file with the save_npz() function. Assuming we have already defined a model:

>>> from chainer import serializers
>>> serializers.save_npz('my.model', model)

This saves the parameters of model into the file 'my.model' in NPZ format. The saved parameters can be read back from my.model into model by the load_npz() function:

>>> serializers.load_npz('my.model', model)

Note

Note that only the parameters and the persistent values are serialized by this serialization code. Other attributes are not saved automatically. You can register arrays, scalars, or any serializable objects as persistent values with the add_persistent() method. The registered values can then be accessed as attributes with the name passed to add_persistent().
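
As a sketch, a chain can register a running statistic as a persistent value in its constructor (the class and attribute names here are hypothetical):

class MyChainWithStats(Chain):
    def __init__(self):
        super(MyChainWithStats, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(4, 3)
        # Registered as a persistent value: saved and loaded together with
        # the parameters, and accessible as self.running_mean.
        self.add_persistent('running_mean', np.zeros(3, dtype=np.float32))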

The state of an optimizer can also be saved by the same functions:

>>> serializers.save_npz('my.state', optimizer)
>>> serializers.load_npz('my.state', optimizer)

Note

Note that serialization of an optimizer only saves its internal state, including the number of iterations, the momentum vectors of MomentumSGD, etc. It does not save the parameters and persistent values of the target link. We have to explicitly save the target link with the optimizer to resume the optimization from the saved state. This can be done by saving the entire Trainer object, like this:

>>> serializers.save_npz('my.state', trainer)

Support for the HDF5 format is enabled if the h5py package is installed. Serialization and deserialization with the HDF5 format are almost identical to those with the NPZ format; just replace save_npz() and load_npz() with save_hdf5() and load_hdf5(), respectively.
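
For example, assuming h5py is installed:

>>> serializers.save_hdf5('my.model', model)
>>> serializers.load_hdf5('my.model', model)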

Customize your own logging

In this section, you will learn about the following things:

  • What the Reporter is and how observed values are collected

  • The naming rule for the reported values

After reading this section, you will be able to:

  • Write your own report.

What is Reporter?

chainer.Reporter is used to collect values that users want to watch. The reporter object manages a dictionary from value names to the actually observed values. We call this dictionary the observation.

See the following example:

>>> from chainer import Reporter, report, report_scope
>>>
>>> reporter = Reporter()
>>> observer = object()  # it can be an arbitrary (reference) object
>>> reporter.add_observer('my_observer:', observer)
>>> observation = {}
>>> with reporter.scope(observation):
...     reporter.report({'x': 1}, observer)
...
>>> observation
{'my_observer:/x': 1}

When a value is passed to the reporter, an object called the observer can optionally be attached. In this case, the name of the observer is added as a prefix to the value name. The observer name should be registered beforehand. Using reporter.scope, you can select which observation dictionary the observed values are saved to.

There is also a global API, chainer.report(), which reports observed values to the current reporter object. Here, current refers to the reporter whose scope (established by a with statement) the current line of code is in. This function calls the Reporter.report() method of the current reporter.

>>> observation = {}
>>> with reporter.scope(observation):
...     report({'x': 1}, observer)
...
>>> observation
{'my_observer:/x': 1}

Naming rule for the reported values

Now you know almost everything about Reporter. However, there is one more thing: the naming rule for the reported values, especially when the values are reported from a link that is not the root of the link hierarchy.

As we explained in the previous section, the root link is named 'main' by the StandardUpdater, and the names of values reported in the root have the prefix 'main/'. When the values are reported from a link that is not the root of the link hierarchy, the prefix of the names is determined by the link hierarchy, i.e., by namedlinks().

See the following example:

>>> class MLP(Chain):
...     def __init__(self, n_units, n_out):
...         super(MLP, self).__init__()
...         with self.init_scope():
...             # the size of the inputs to each layer will be inferred
...             self.l1 = L.Linear(None, n_units)  # n_in -> n_units
...             self.l2 = L.Linear(None, n_units)  # n_units -> n_units
...             self.l3 = L.Linear(None, n_out)    # n_units -> n_out
...
...     def forward(self, x):
...         h1 = F.relu(self.l1(x))
...         h2 = F.relu(self.l2(h1))
...         y = self.l3(h2)
...         report({'sum_y': F.sum(y)}, self)
...         return y
...
>>> model = L.Classifier(MLP(100, 10))
>>> for name, observer in model.namedlinks(skipself=True):
...     print(name)  
/predictor
/predictor/l1
/predictor/l2
/predictor/l3

You can traverse the link hierarchy with namedlinks(). In this example, we report 'loss' and 'accuracy' in the root link, and 'sum_y' in the '/predictor' link. So, you can access the reported values as 'main/loss', 'main/accuracy', and 'main/predictor/sum_y'.

Let’s confirm that what we explained is correct:

>>> train, test = datasets.get_mnist()
>>> train_iter = iterators.SerialIterator(train, batch_size=100, shuffle=True)
>>> test_iter = iterators.SerialIterator(test, batch_size=100, repeat=False, shuffle=False)
>>> optimizer = optimizers.SGD()
>>> optimizer.setup(model)
>>> updater = training.StandardUpdater(train_iter, optimizer)
>>> trainer = training.Trainer(updater, (1, 'epoch'), out='result')
>>> trainer.extend(extensions.Evaluator(test_iter, model))
>>> trainer.extend(extensions.LogReport())
>>> trainer.extend(extensions.PrintReport(
...     ['epoch', 'main/accuracy', 'main/loss', 'main/predictor/sum_y', 'validation/main/accuracy']))
>>> trainer.run()
epoch       main/accuracy  main/loss   main/predictor/sum_y  validation/main/accuracy
1           0.662317       1.38345     47.9927               0.8498

Neural Net Examples

MNIST using Trainer

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

By using Trainer, you don’t need to write the training loop explicitly any more. Furthermore, Chainer provides many useful extensions that can be used with Trainer to visualize your results, evaluate your model, and store and manage log files more easily.

This example will show how to use the Trainer to train a fully-connected feed-forward neural network on the MNIST dataset.

Note

If you would like to know how to write a training loop without using the Trainer, please check MNIST with a Manual Training Loop instead of this tutorial.

1. Prepare the dataset

Load the MNIST dataset, which contains a training set of images and class labels as well as a corresponding test set.

from chainer.datasets import mnist

train, test = mnist.get_mnist()

Note

You can use a Python list as a dataset. That’s because Iterator can take any object as a dataset whose elements can be accessed via the [] accessor and whose length can be obtained with the len() function. For example,

train = [(x1, t1), (x2, t2), ...]

a list of tuples like this can be used as a dataset.

There are many utility dataset classes defined in datasets. It is recommended that you utilize them in the actual applications.

For example, if your dataset consists of a number of image files, it would take a large amount of memory to load the data into a list as above. In that case, you can use ImageDataset, which just keeps the paths to the image files. The actual image data is loaded from the disk only when the corresponding element is requested via the [] accessor. Until then, no images are loaded into memory, which reduces memory use.
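
A small sketch of lazy image loading with ImageDataset (the file and directory names here are hypothetical):

from chainer.datasets import ImageDataset

# 'image_files.txt' lists one image path per line, relative to root.
img_dataset = ImageDataset('image_files.txt', root='images/')
first_image = img_dataset[0]   # the image file is read from disk only here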

2. Prepare the dataset iterations

Iterator creates a mini-batch from the given dataset.

batchsize = 128

train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize, False, False)

3. Prepare the model

Here, we are going to use the same model as the one defined in MNIST with a Manual Training Loop.

class MLP(Chain):

    def __init__(self, n_mid_units=100, n_out=10):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_mid_units)
            self.l2 = L.Linear(None, n_mid_units)
            self.l3 = L.Linear(None, n_out)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

gpu_id = 0  # Set to -1 if you use CPU

model = MLP()
if gpu_id >= 0:
    model.to_gpu(gpu_id)

4. Prepare the Updater

Trainer is a class that holds all of the necessary components needed for training. The main components are shown below.

_images/trainer.png

Basically, all you need to pass to Trainer is an Updater. However, the Updater contains an Iterator and an Optimizer. Since the Iterator can access the dataset and the Optimizer has references to the model, the Updater can access the model to update its parameters.

So, Updater can perform the training procedure as shown below:

  1. Retrieve the data from dataset and construct a mini-batch (Iterator)

  2. Pass the mini-batch to the model and calculate the loss

  3. Update the parameters of the model (Optimizer)

Now let’s create the Updater object!

max_epoch = 10

# Wrap your model by Classifier and include the process of loss calculation within your model.
# Since we do not specify a loss function here, the default 'softmax_cross_entropy' is used.
model = L.Classifier(model)

# selection of your optimizing method
optimizer = optimizers.MomentumSGD()

# Give the optimizer a reference to the model
optimizer.setup(model)

# Get an updater that uses the Iterator and Optimizer
updater = training.updaters.StandardUpdater(train_iter, optimizer, device=gpu_id)

Note

Here, the model defined above is passed to Classifier and changed to a new Chain. Classifier, which in fact inherits from the Chain class, keeps the given Chain model in its predictor attribute. Once you give the input data and the corresponding class labels to the model by the () operator,

  1. forward() of the model is invoked. The data is then given to predictor to obtain the output y.

  2. Next, together with the given labels, the output y is passed to the loss function which is determined by lossfun argument in the constructor of Classifier.

  3. The loss is returned as a Variable.

In Classifier, lossfun is set to softmax_cross_entropy() by default.
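
For reference, the loss function can also be specified explicitly; the following sketch is equivalent to the default behavior and is shown only to illustrate the lossfun argument:

# Equivalent to L.Classifier(MLP()): softmax cross entropy is the default lossfun.
classifier = L.Classifier(MLP(), lossfun=F.softmax_cross_entropy)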

StandardUpdater is the simplest class among several updaters. There are also the ParallelUpdater and the MultiprocessParallelUpdater to utilize multiple GPUs. The MultiprocessParallelUpdater uses the NVIDIA NCCL library, so you need to install NCCL and re-install CuPy before using it.

5. Setup Trainer

Lastly, we will set up the Trainer. The only requirement for creating a Trainer is to pass the Updater object that we created above. You can also pass a stop_trigger as the second argument, a tuple like (length, unit), to tell the trainer when to stop the training. The length is given as an integer and the unit is given as a string, which should be either 'epoch' or 'iteration'. Without setting stop_trigger, the training will never stop.

# Setup a Trainer
trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='mnist_result')

The out argument specifies the output directory used to save the log files and the image files of plots showing the progress of loss, accuracy, etc. over time when you use the PlotReport extension. Next, we will explain how to display or save that information by using trainer Extensions.

6. Add Extensions to the Trainer object

The Trainer extensions provide the following capabilities:

  • Save log files automatically (LogReport)

  • Display the training information to the terminal periodically (PrintReport)

  • Visualize the loss progress by plotting a graph periodically and save it as an image file (PlotReport)

  • Automatically serialize the state periodically (snapshot() / snapshot_object())

  • Display a progress bar to the terminal to show the progress of training (ProgressBar)

  • Save the model architecture as a Graphviz’s dot file (DumpGraph())

To use this wide variety of tools for your training task, pass Extension objects to the extend() method of your Trainer object.

from chainer.training import extensions

trainer.extend(extensions.LogReport())
trainer.extend(extensions.snapshot(filename='snapshot_epoch-{.updater.epoch}'))
trainer.extend(extensions.snapshot_object(model.predictor, filename='model_epoch-{.updater.epoch}'))
trainer.extend(extensions.Evaluator(test_iter, model, device=gpu_id))
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
trainer.extend(extensions.DumpGraph('main/loss'))

LogReport

Collect loss and accuracy automatically every epoch or iteration and store the information in a log file under the directory specified by the out argument when you create a Trainer object.

snapshot()

The snapshot() method saves the Trainer object at the designated timing (default: every epoch) in the directory specified by out. The Trainer object, as mentioned before, has an Updater which contains an Optimizer and a model inside. Therefore, as long as you have the snapshot file, you can use it to come back to the training or make inferences using the previously trained model later.

snapshot_object()

However, when you keep the whole Trainer object, in some cases it is tedious to retrieve only the model from it. By using snapshot_object(), you can save a particular object (in this case, the model wrapped by Classifier) as a separate snapshot. Classifier is a Chain object which keeps the model, itself a Chain, as its predictor property, and all the parameters are under predictor, so taking a snapshot of predictor is enough to keep all the trained parameters.

DumpGraph()

This extension saves the structure of the computational graph of the model. The graph is saved in Graphviz dot format under the output directory of the Trainer.

Evaluator

The Evaluator extension requires an iterator over the evaluation dataset and the model object. It evaluates the model using the given dataset (typically a validation dataset) at the specified timing interval.

PrintReport

This extension outputs the specified values to the standard output.

PlotReport

This extension plots the values specified by its arguments and saves the plot as an image file.

This is not an exhaustive list of built-in extensions. Please take a look at Extensions for more of them.

7. Start Training

Just call the run() method of the Trainer object to start training.

trainer.run()
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           1.53241     0.638409       0.74935               0.835839                  4.93409
2           0.578334    0.858059       0.444722              0.882812                  7.72883
3           0.418569    0.886844       0.364943              0.899229                  10.4229
4           0.362342    0.899089       0.327569              0.905558                  13.148
5           0.331067    0.906517       0.304399              0.911788                  15.846
6           0.309019    0.911964       0.288295              0.917722                  18.5395
7           0.292312    0.916128       0.272073              0.921776                  21.2173
8           0.278291    0.92059        0.261351              0.923457                  23.9211
9           0.266266    0.923541       0.253195              0.927314                  26.6612
10          0.255489    0.926739       0.242415              0.929094                  29.466

Let’s see the plot of loss progress saved in the mnist_result directory.

_images/mnist_loss.png

How about the accuracy?

_images/mnist_accuracy.png

Furthermore, let’s visualize the computational graph saved with DumpGraph() using Graphviz.

% dot -Tpng mnist_result/cg.dot -o mnist_result/cg.png
_images/mnist_graph.png

From the top to the bottom, you can see the data flow in the computational graph. It basically shows how data and parameters are passed to the Functions.

8. Evaluate a pre-trained model

Evaluating with a saved snapshot of a model is as easy as what was explained in MNIST with a Manual Training Loop.

import matplotlib.pyplot as plt

model = MLP()
serializers.load_npz('mnist_result/model_epoch-10', model)

# Show the output
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)

y = model(x[None, ...])

print('predicted_label:', y.array.argmax(axis=1)[0])
_images/mnist_output.png
label: 7
predicted_label: 7

The prediction looks correct. Success!

MNIST with a Manual Training Loop

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

In this tutorial section, we will learn how to train a deep neural network to classify images of hand-written digits in the popular MNIST dataset. This dataset contains 60,000 training examples and 10,000 test examples. Each example consists of a 28 x 28 greyscale image and a corresponding class label. Since the digits from 0 to 9 are used, there are 10 classes for the labels.

Chainer provides a feature called Trainer that can simplify the training procedure of your model. However, it is also good to know how training works in Chainer before starting to use the convenient Trainer class that hides the actual process. Writing your own training loop can be useful for learning how Trainer works or for implementing features not included in the standard trainer.

The complete training procedure consists of the following steps:

  1. Prepare a dataset

  2. Create a dataset iterator

  3. Define a network

  4. Select an optimization algorithm

  5. Write a training loop

    1. Retrieve a set of examples (mini-batch) from the training dataset.

    2. Feed the mini-batch to your network.

    3. Run a forward pass of the network and compute the loss.

    4. Call the backward() method on the loss Variable to compute the gradients for all trainable parameters.

    5. Run the optimizer to update those parameters.

  6. Save the trained model

  7. Perform classification by the saved model and check the network performance on validation/test sets.

1. Prepare a dataset

Chainer contains some built-in functions to use some popular datasets like MNIST, CIFAR10/100, etc. Those can automatically download the data from servers and provide dataset objects which are easy to use.

The code below shows how to retrieve the MNIST dataset from the server and save an image from its training split to make sure the images are correctly obtained.

from __future__ import print_function
import matplotlib.pyplot as plt
from chainer.datasets import mnist

# Download the MNIST data if you haven't downloaded it yet
train, test = mnist.get_mnist(withlabel=True, ndim=1)

# Display an example from the MNIST dataset.
# `x` contains the input image array and `t` contains the target class
# label as an integer.
x, t = train[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('5.png')
print('label:', t)
label: 5

The saved image 5.png will look like:

_images/5.png

2. Create a dataset iterator

Although this is an optional step, we’d like to introduce the Iterator class, which retrieves a set of data and labels from the given dataset to easily make a mini-batch. There are several subclasses that perform the same thing in different ways, e.g., using multi-processing to parallelize the data loading part, etc.

Here, we use SerialIterator, which is also a subclass of Iterator in the example code below. The SerialIterator can provide mini-batches with or without shuffling the order of data in the given dataset.

Every Iterator produces a new mini-batch by calling its next() method. Iterators also have properties that tell you how many times we have taken all the data from the given dataset (epoch), whether the next mini-batch will be the start of a new epoch (is_new_epoch), and so on.

The code below shows how to create a SerialIterator object from a dataset object.

from chainer import iterators

# Choose the minibatch size.
batchsize = 128

train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize,
                                     repeat=False, shuffle=False)

Note

Iterators can take a built-in Python list as a given dataset. This means that the example code below works,

train = [(x1, t1), (x2, t2), ...]  # A list of tuples
train_iter = iterators.SerialIterator(train, batchsize)

where x1, x2, ... denote the input data and t1, t2, ... denote the corresponding labels.

Details of SerialIterator
  • SerialIterator is a built-in subclass of Iterator that can retrieve a mini-batch from a given dataset in either sequential or shuffled order.

  • The Iterator's constructor takes two arguments: a dataset object and a mini-batch size.

  • If you want to use the same dataset repeatedly during the training process, set the repeat argument to True (default). Otherwise, the dataset will be iterated over only once. The latter case is intended for evaluation.

  • If you want to shuffle the training dataset every epoch, set the shuffle argument to True. Otherwise, the data will be retrieved from the dataset in the same order at every epoch.

In the example code shown above, we set batchsize = 128 for both train_iter and test_iter, so these iterators will provide 128 images and the corresponding labels at a time.
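
For illustration, a minimal sketch of pulling a single mini-batch out of such an iterator (using a throwaway iterator so the training iterator stays untouched):

from chainer import iterators

demo_iter = iterators.SerialIterator(train, batch_size=128)

# next() returns a list of (image, label) tuples -- 128 of them here.
batch = demo_iter.next()
print(len(batch))               # 128
print(batch[0][0].shape)        # (784,) -- the first image as a flat array
print(demo_iter.epoch)          # number of completed passes over the dataset
print(demo_iter.is_new_epoch)   # True right after an epoch boundary is crossed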

3. Define a network

Now let’s define a neural network that we will train to classify the MNIST images. For simplicity, we use a three-layer perceptron here. We set each hidden layer to have 100 units and the output layer to have 10 units, which corresponds to the number of class labels in MNIST.

Create your network as a subclass of Chain

You can create your network by writing a new subclass of Chain. The main steps are twofold:

  1. Register the network components which have trainable parameters to the subclass. Each of them must be instantiated and assigned to a property in the scope specified by init_scope().

  2. Define a forward() method that represents the actual forward computation of your network. This method takes one or more Variable, numpy.ndarray, or cupy.ndarray as its inputs and calculates the forward pass using them.

    class MyNetwork(Chain):
    
        def __init__(self, n_mid_units=100, n_out=10):
            super(MyNetwork, self).__init__()
            with self.init_scope():
                self.l1 = L.Linear(None, n_mid_units)
                self.l2 = L.Linear(n_mid_units, n_mid_units)
                self.l3 = L.Linear(n_mid_units, n_out)
    
        def forward(self, x):
            h = F.relu(self.l1(x))
            h = F.relu(self.l2(h))
            return self.l3(h)
    
    model = MyNetwork()
    
    gpu_id = 0  # Set to -1 if you use CPU
    if gpu_id >= 0:
        model.to_gpu(gpu_id)
    

Link, Chain, ChainList, and their subclasses that contain trainable parameters should be registered to the model by assigning them as properties inside init_scope(). For example, a FunctionNode does not contain any trainable parameters, so there is no need to keep such an object as a property of your network. When you want to use relu() in your network, simply calling it as a function inside forward() works correctly.

In Chainer, the Python code that implements the forward computation itself represents the network. In other words, we can conceptually think of the computation graph for our network being constructed dynamically as this forward computation code executes. This allows Chainer to describe networks in which different computations can be performed in each iteration, such as branched networks, intuitively and with a high degree of flexibility. This is the key feature of Chainer that we call Define-by-Run.

4. Select an optimization algorithm

Chainer provides a wide variety of optimization algorithms that can be used to optimize the network parameters during training. They are located in the optimizers module.

Here, we are going to use the stochastic gradient descent (SGD) method with momentum, which is implemented by MomentumSGD. To use the optimizer, we give the network object (typically it’s a Chain or ChainList) to the setup() method of the optimizer object to register it. In this way, the Optimizer can automatically find the model parameters and update them during training.

You can easily try out other optimizers as well; please test and observe their results. For example, you could change MomentumSGD to Adam, RMSprop, etc.

from chainer import optimizers

# Choose an optimizer algorithm
optimizer = optimizers.MomentumSGD(lr=0.01, momentum=0.9)

# Give the optimizer a reference to the model so that it
# can locate the model's parameters.
optimizer.setup(model)

Note

In the above example, we set lr to 0.01 in the constructor. This value is known as the “learning rate”, one of the most important hyperparameters that needs to be adjusted in order to obtain the best performance. The various optimizers may each have different hyperparameters, so be sure to check the documentation for the details.
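
For instance, a minimal sketch of swapping in Adam instead (alpha plays the role of the learning rate for Adam, and 0.001 is simply its default value):

from chainer import optimizers

# Adam instead of MomentumSGD; the rest of the training loop stays unchanged.
optimizer = optimizers.Adam(alpha=0.001)
optimizer.setup(model)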

5. Write a training loop

We now show how to write the training loop. Since we are working on a digit classification problem, we will use softmax_cross_entropy() as the loss function for the optimizer to minimize. For other types of problems, such as regression models, other loss functions might be more appropriate. See the Chainer documentation for detailed information on the various loss functions.

Our training loop will be structured as follows.

  1. We will first get a mini-batch of examples from the training dataset.

  2. We will then feed the batch into our network by calling it (a Chain object) like a function. This will execute the forward-pass code written in the forward() method.

  3. This will return the network output, which represents the class label predictions. We supply it to the loss function along with the true (that is, target) values. The loss function will output the loss as a Variable object.

  4. We then clear any previous gradients in the network and perform the backward pass by calling the backward() method on the loss variable which computes the parameter gradients. We need to clear the gradients first because the backward() method accumulates gradients instead of overwriting the previous values.

  5. Since the optimizer already has a reference to the network, it has access to the parameters and the computed gradients so that we can now call the update() method of the optimizer which will update the model parameters.

In addition to the above steps, you might want to check the performance of the network with a validation dataset. This allows you to observe how well it generalizes to new data so far, that is, whether it is overfitting to the training data. The code below checks the performance on the test set at the end of each epoch. It has the same structure as the training code except that no backpropagation is performed, and we also compute the accuracy on the test data using the accuracy() function.

The training loop code is as follows:

import numpy as np
from chainer.dataset import concat_examples
from chainer.backends.cuda import to_cpu

max_epoch = 10

while train_iter.epoch < max_epoch:

    # ---------- One iteration of the training loop ----------
    train_batch = train_iter.next()
    image_train, target_train = concat_examples(train_batch, gpu_id)

    # Calculate the prediction of the network
    prediction_train = model(image_train)

    # Calculate the loss with softmax_cross_entropy
    loss = F.softmax_cross_entropy(prediction_train, target_train)

    # Calculate the gradients in the network
    model.cleargrads()
    loss.backward()

    # Update all the trainable parameters
    optimizer.update()
    # --------------------- until here ---------------------

    # Check the validation accuracy of prediction after every epoch
    if train_iter.is_new_epoch:  # If this iteration is the final iteration of the current epoch

        # Display the training loss
        print('epoch:{:02d} train_loss:{:.04f} '.format(
            train_iter.epoch, float(to_cpu(loss.array))), end='')

        test_losses = []
        test_accuracies = []
        for test_batch in test_iter:
            image_test, target_test = concat_examples(test_batch, gpu_id)

            # Forward the test data
            prediction_test = model(image_test)

            # Calculate the loss
            loss_test = F.softmax_cross_entropy(prediction_test, target_test)
            test_losses.append(to_cpu(loss_test.array))

            # Calculate the accuracy
            accuracy = F.accuracy(prediction_test, target_test)
            accuracy.to_cpu()
            test_accuracies.append(accuracy.array)

        test_iter.reset()

        print('val_loss:{:.04f} val_accuracy:{:.04f}'.format(
            np.mean(test_losses), np.mean(test_accuracies)))
Output
epoch:01 train_loss:0.8072 val_loss:0.7592 val_accuracy:0.8289
epoch:02 train_loss:0.5021 val_loss:0.4467 val_accuracy:0.8841
epoch:03 train_loss:0.3539 val_loss:0.3673 val_accuracy:0.9007
epoch:04 train_loss:0.2524 val_loss:0.3307 val_accuracy:0.9067
epoch:05 train_loss:0.4232 val_loss:0.3076 val_accuracy:0.9136
epoch:06 train_loss:0.3033 val_loss:0.2910 val_accuracy:0.9167
epoch:07 train_loss:0.2004 val_loss:0.2773 val_accuracy:0.9222
epoch:08 train_loss:0.2885 val_loss:0.2679 val_accuracy:0.9239
epoch:09 train_loss:0.2818 val_loss:0.2579 val_accuracy:0.9266
epoch:10 train_loss:0.2403 val_loss:0.2484 val_accuracy:0.9307

6. Save the trained model

Chainer provides two types of serializers that can be used to save and restore model state: one supports the HDF5 format and the other supports the NumPy NPZ format. For this example, we are going to use the NPZ format to save our model since it is easy to use with NumPy and doesn’t require installing any additional dependencies or libraries.

serializers.save_npz('my_mnist.model', model)
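
If you prefer the HDF5 format instead, the corresponding calls are save_hdf5() and load_hdf5() (a minimal sketch; note that they require the h5py package to be installed):

from chainer import serializers

serializers.save_hdf5('my_mnist.h5', model)  # save in HDF5 format
serializers.load_hdf5('my_mnist.h5', model)  # ...and restore it later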

7. Perform classification by the saved model

Let’s use the saved model to classify a new image. In order to load the trained model parameters, we need to perform the following two steps:

  1. Instantiate the same network as the one you trained.

  2. Overwrite all parameters in the model instance with the saved weights using the load_npz() function.

Once the model is restored, it can be used to predict image labels on new input data.

from chainer import serializers

# Create an instance of the network you trained
model = MyNetwork()

# Load the saved parameters into the instance
serializers.load_npz('my_mnist.model', model)

# Get a test image and label
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('7.png')
print('label:', t)
label: 7

The saved test image looks like:

_images/7.png
# Change the shape of the minibatch.
# In this example, the size of the minibatch is 1.
# Inference can be performed using any mini-batch size.

print(x.shape, end=' -> ')
x = x[None, ...]
print(x.shape)

# Forward computation of the model by feeding x
y = model(x)

# The result is given as a Variable, so we can inspect its contents via the .array attribute.
y = y.array

# Look up the most probable digit number using argmax
pred_label = y.argmax(axis=1)

print('predicted label:', pred_label[0])
(784,) -> (1, 784)
predicted label: 7

The prediction result looks correct. Yay!

Convolutional Network for Visual Recognition Tasks

In this section, you will learn how to write

  • A small convolutional network with a model class that is inherited from Chain,

  • A large convolutional network that has several building block networks with ChainList.

After reading this section, you will be able to:

  • Write your own original convolutional network in Chainer

A convolutional network (ConvNet) is mainly comprised of convolutional layers. This type of network is commonly used for various visual recognition tasks, e.g., classifying hand-written digits or natural images into given object classes, detecting objects in an image, labeling all pixels of an image with object classes (semantic segmentation), and so on.

In such tasks, a typical ConvNet takes a set of images whose shape is \((N, C, H, W)\), where

  • \(N\) denotes the number of images in a mini-batch,

  • \(C\) denotes the number of channels of those images,

  • \(H\) and \(W\) denote the height and width of those images,

respectively. Then, it typically outputs a fixed-size vector as membership probabilities over the target object classes. It can also output a set of feature maps whose size corresponds to the input image, for a pixel-labeling task, etc.

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

LeNet5

Here, let’s start by defining LeNet5 [LeCun98] in Chainer. In this example, we show a simplified version of LeNet5 introduced in Deep Learning Tutorials. It is a ConvNet model with 5 layers: 3 convolutional layers and 2 fully-connected layers. It was proposed for classifying hand-written digit images in 1998. In Chainer, the model can be written as follows:

class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(
                in_channels=1, out_channels=6, ksize=5, stride=1)
            self.conv2 = L.Convolution2D(
                in_channels=6, out_channels=16, ksize=5, stride=1)
            self.conv3 = L.Convolution2D(
                in_channels=16, out_channels=120, ksize=4, stride=1)
            self.fc4 = L.Linear(None, 84)
            self.fc5 = L.Linear(84, 10)

    def forward(self, x):
        h = F.sigmoid(self.conv1(x))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv2(h))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv3(h))
        h = F.sigmoid(self.fc4(h))
        if chainer.config.train:
            return self.fc5(h)
        return F.softmax(self.fc5(h))

A typical way to write your network is to create a new class that inherits from the Chain class. When defining a model in this way, all the layers which have trainable parameters are typically registered to the model by assigning the corresponding Link objects as attributes.

The model class is instantiated before the forward and backward computations. To give input images and label vectors simply by calling the model object like a function, forward() is usually defined in the model class. This method performs the forward computation of the model. Chainer uses the powerful autograd system for any computational graph written with FunctionNodes and Links (a Link actually calls a corresponding FunctionNode inside of it), so you don’t need to explicitly write the code for backward computations in the model. Just prepare the data, then give it to the model; the output Variable resulting from the forward computation has a backward() method to perform autograd. In the above model, forward() has an if statement at the end to switch its behavior depending on Chainer’s running mode, i.e., whether it is in training mode or not. Chainer exposes the running mode as a global configuration, chainer.config.train. In training mode, forward() returns the output value of the last layer as is so that the loss can be computed later; otherwise it returns a prediction result by applying softmax().

It is recommended that you use the global configuration chainer.config.train to switch the running mode.
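A common way to switch the mode temporarily is the using_config() context manager; a minimal sketch (assuming a LeNet5 instance model and an input mini-batch x, and relying on the imports listed above):

# Training mode (the default): forward() returns the raw output of fc5.
with chainer.using_config('train', True):
    logits = model(x)

# Inference mode: forward() applies softmax and returns class probabilities.
with chainer.using_config('train', False):
    probs = model(x)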

If you don’t want to write conv1 and the other layers more than once, you can also write the same model in the following way:

from functools import partial

class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        net = [('conv1', L.Convolution2D(1, 6, 5, 1))]
        net += [('_sigm1', F.sigmoid)]
        net += [('_mpool1', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('conv2', L.Convolution2D(6, 16, 5, 1))]
        net += [('_sigm2', F.sigmoid)]
        net += [('_mpool2', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('conv3', L.Convolution2D(16, 120, 4, 1))]
        net += [('_sigm3', F.sigmoid)]
        net += [('_mpool3', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('fc4', L.Linear(None, 84))]
        net += [('_sigm4', F.sigmoid)]
        net += [('fc5', L.Linear(84, 10))]
        net += [('_sigm5', F.sigmoid)]
        with self.init_scope():
            for n in net:
                if not n[0].startswith('_'):
                    setattr(self, n[0], n[1])
        self.layers = net

    def forward(self, x):
        for n, f in self.layers:
            if not n.startswith('_'):
                x = getattr(self, n)(x)
            else:
                x = f(x)
        if chainer.config.train:
            return x
        return F.softmax(x)

Note

You can also use Sequential to write the above model more simply. Please note that Sequential is an experimental feature introduced in Chainer v4 and its interface may be changed in the future versions.
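
As a rough Sequential sketch of the same layer stack (mirroring the list-based definition above, including its trailing sigmoid; this relies on the imports listed above plus functools.partial):

from functools import partial

lenet5 = chainer.Sequential(
    L.Convolution2D(1, 6, 5, 1), F.sigmoid,
    partial(F.max_pooling_2d, ksize=2, stride=2),
    L.Convolution2D(6, 16, 5, 1), F.sigmoid,
    partial(F.max_pooling_2d, ksize=2, stride=2),
    L.Convolution2D(16, 120, 4, 1), F.sigmoid,
    partial(F.max_pooling_2d, ksize=2, stride=2),
    L.Linear(None, 84), F.sigmoid,
    L.Linear(84, 10), F.sigmoid,
)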

The list-based version above creates a list of pairs of a component name (e.g., conv1, _sigm1, etc.) and the corresponding Link or function (e.g., F.sigmoid, which internally invokes a FunctionNode) after calling the superclass’s constructor. Components whose name starts with _ are functions (FunctionNode), which do not have any trainable parameters, so we do not register (setattr) them to the model. The others (conv1, fc4, etc.) are Links, i.e., trainable layers that hold parameters. This operation can be freely replaced with many other implementations, because the component names are only used to select the Links from the list net easily. The list net is stored as the attribute layers so that it can be referred to in forward(). In forward(), all layers are retrieved from self.layers sequentially, and the input variable or the intermediate output from the previous layer is given to the current layer. The last part of forward(), which switches the behavior depending on the training/inference mode, is the same as in the former way.

Ways to calculate loss

When you train the model with label vector t, the loss should be calculated using the output from the model. There also are several ways to calculate the loss:

model = LeNet5()

# Input data and label
x = np.random.rand(32, 1, 28, 28).astype(np.float32)
t = np.random.randint(0, 10, size=(32,)).astype(np.int32)

# Forward computation
y = model(x)

# Loss calculation
loss = F.softmax_cross_entropy(y, t)

This is a primitive way to calculate a loss value from the output of the model. On the other hand, the loss computation can also be included in the model itself by wrapping the model object (a Chain or ChainList object) with a class inherited from Chain. The outer Chain should take the model defined above and register it within init_scope(). Chain itself is inherited from Link, so a Chain can also be registered as a trainable Link to another Chain. In fact, there is already a Classifier class that wraps the model and adds the loss computation. It can be used like this:

model = L.Classifier(LeNet5())

# Forward & Loss calculation
loss = model(x, t)

This class takes a model object as an input argument and registers it to its predictor property as a trainable link. As shown above, the returned object can then be called like a function, to which we pass x and t as the input arguments, and the resulting loss value (which, as we recall, is a Variable) is returned.

See the detailed implementation of Classifier here: chainer.links.Classifier.

From the above examples, we can see that Chainer provides the flexibility to write our original networks in many different ways. Such flexibility is intended to make it intuitive for users to design new and complex models.

VGG16

Next, let’s write some larger models in Chainer. When you write a large network consisting of several building block networks, ChainList is useful. First, let’s see how to write a VGG16 [Simonyan14] model.

class VGG16(chainer.ChainList):
    def __init__(self):
        super(VGG16, self).__init__(
            VGGBlock(64),
            VGGBlock(128),
            VGGBlock(256, 3),
            VGGBlock(512, 3),
            VGGBlock(512, 3, True))

    def forward(self, x):
        for f in self.children():
            x = f(x)
        if chainer.config.train:
            return x
        return F.softmax(x)


class VGGBlock(chainer.Chain):
    def __init__(self, n_channels, n_convs=2, fc=False):
        w = chainer.initializers.HeNormal()
        super(VGGBlock, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, n_channels, 3, 1, 1, initialW=w)
            self.conv2 = L.Convolution2D(
                n_channels, n_channels, 3, 1, 1, initialW=w)
            if n_convs == 3:
                self.conv3 = L.Convolution2D(
                    n_channels, n_channels, 3, 1, 1, initialW=w)
            if fc:
                self.fc4 = L.Linear(None, 4096, initialW=w)
                self.fc5 = L.Linear(4096, 4096, initialW=w)
                self.fc6 = L.Linear(4096, 1000, initialW=w)

        self.n_convs = n_convs
        self.fc = fc

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        if self.n_convs == 3:
            h = F.relu(self.conv3(h))
        h = F.max_pooling_2d(h, 2, 2)
        if self.fc:
            h = F.dropout(F.relu(self.fc4(h)))
            h = F.dropout(F.relu(self.fc5(h)))
            h = self.fc6(h)
        return h

That’s it. VGG16 is the model that won 1st place in the classification + localization task at ILSVRC 2014 and has since become one of the standard pre-trained models for many different tasks. It has 16 layers, hence the name “VGG-16”, but we can write this model without writing all the layers independently. Since this model consists of several building blocks that share the same architecture, we can build the whole network by re-using the building block definition. Each part of the network consists of 2 or 3 convolutional layers, each followed by an activation function (relu()), and a max_pooling_2d() operation. This block is written as VGGBlock in the above example code, and the whole network just calls this block one by one in a sequential manner.

ResNet152

How about ResNet? ResNet [He16] came in the following year’s ILSVRC. It is a much deeper model than VGG16, having up to 152 layers. This sounds laborious to build, but it can be implemented in almost the same manner as VGG16. In other words, it’s easy. One possible way to write ResNet-152 is:

class ResNet152(chainer.Chain):
    def __init__(self, n_blocks=[3, 8, 36, 3]):
        w = chainer.initializers.HeNormal()
        super(ResNet152, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, 64, 7, 2, 3, initialW=w, nobias=True)
            self.bn1 = L.BatchNormalization(64)
            self.res2 = ResBlock(n_blocks[0], 64, 64, 256, 1)
            self.res3 = ResBlock(n_blocks[1], 256, 128, 512)
            self.res4 = ResBlock(n_blocks[2], 512, 256, 1024)
            self.res5 = ResBlock(n_blocks[3], 1024, 512, 2048)
            self.fc6 = L.Linear(2048, 1000)

    def forward(self, x):
        h = self.bn1(self.conv1(x))
        h = F.max_pooling_2d(F.relu(h), 2, 2)
        h = self.res2(h)
        h = self.res3(h)
        h = self.res4(h)
        h = self.res5(h)
        h = F.average_pooling_2d(h, h.shape[2:], stride=1)
        h = self.fc6(h)
        if chainer.config.train:
            return h
        return F.softmax(h)


class ResBlock(chainer.ChainList):
    def __init__(self, n_layers, n_in, n_mid, n_out, stride=2):
        super(ResBlock, self).__init__()
        self.add_link(BottleNeck(n_in, n_mid, n_out, stride, True))
        for _ in range(n_layers - 1):
            self.add_link(BottleNeck(n_out, n_mid, n_out))

    def forward(self, x):
        for f in self.children():
            x = f(x)
        return x


class BottleNeck(chainer.Chain):
    def __init__(self, n_in, n_mid, n_out, stride=1, proj=False):
        w = chainer.initializers.HeNormal()
        super(BottleNeck, self).__init__()
        with self.init_scope():
            self.conv1x1a = L.Convolution2D(
                n_in, n_mid, 1, stride, 0, initialW=w, nobias=True)
            self.conv3x3b = L.Convolution2D(
                n_mid, n_mid, 3, 1, 1, initialW=w, nobias=True)
            self.conv1x1c = L.Convolution2D(
                n_mid, n_out, 1, 1, 0, initialW=w, nobias=True)
            self.bn_a = L.BatchNormalization(n_mid)
            self.bn_b = L.BatchNormalization(n_mid)
            self.bn_c = L.BatchNormalization(n_out)
            if proj:
                self.conv1x1r = L.Convolution2D(
                    n_in, n_out, 1, stride, 0, initialW=w, nobias=True)
                self.bn_r = L.BatchNormalization(n_out)
        self.proj = proj

    def forward(self, x):
        h = F.relu(self.bn_a(self.conv1x1a(x)))
        h = F.relu(self.bn_b(self.conv3x3b(h)))
        h = self.bn_c(self.conv1x1c(h))
        if self.proj:
            x = self.bn_r(self.conv1x1r(x))
        return F.relu(h + x)

In the BottleNeck class, depending on the value of the proj argument supplied to the initializer, a convolutional layer conv1x1r is conditionally computed; it extends the number of channels of the input x to match the number of channels of the output of conv1x1c, and is followed by a batch normalization layer before the final ReLU. Writing the building block in this way improves the re-usability of the class. It switches not only the behavior of forward() by flags but also the parameter registration. In this case, when proj is False, the BottleNeck does not have the conv1x1r and bn_r layers, so memory usage is more efficient than registering both anyway and simply ignoring them when proj is False.

Using nested Chains and ChainLists for sequential parts enables us to write complex and very deep models easily.

Use Pre-trained Models

Various ways to write your models were described above. It turns out that VGG16 and ResNet are very useful as general feature extractors for many kinds of tasks, including but not limited to image classification. So, Chainer provides you with the pre-trained VGG16 and ResNet-50/101/152 models with a simple API. You can use these models as follows:

from chainer.links import VGG16Layers

model = VGG16Layers()

When VGG16Layers is instantiated, the pre-trained parameters are automatically downloaded from the author’s server, so you can immediately start using VGG16 with pre-trained weights as a good image feature extractor. See the details of this model here: chainer.links.VGG16Layers.
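
For instance, a rough sketch of using it as a feature extractor through its extract() method (the image path below is just a placeholder, and 'fc7' is one of the layer names the link exposes):

import chainer
from PIL import Image
from chainer.links import VGG16Layers

model = VGG16Layers()
img = Image.open('path/to/image.jpg')  # placeholder path

# Extract the fc7 activations as an image feature (no backprop needed).
with chainer.using_config('train', False), chainer.no_backprop_mode():
    feature = model.extract([img], layers=['fc7'])['fc7']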

In the case of ResNet models, there are three variations differing in the number of layers: chainer.links.ResNet50Layers, chainer.links.ResNet101Layers, and chainer.links.ResNet152Layers, each with an easy parameter-loading feature. ResNet’s pre-trained parameters are not available for direct download, so you need to download the weights from the author’s web page first and then place them in the directory $CHAINER_DATASET_ROOT/pfnet/chainer/models or any location you prefer. Once the preparation is finished, the usage is the same as for VGG16:

from chainer.links import ResNet152Layers

model = ResNet152Layers()
Traceback (most recent call last):
OSError: The pre-trained caffemodel does not exist. Please download it from 'https://github.com/KaimingHe/deep-residual-networks', and place it on ...

Please see the details of usage and how to prepare the pre-trained weights for ResNet here: chainer.links.ResNet50Layers

References
LeCun98

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324, 1998.

Simonyan14

Simonyan, K. and Zisserman, A., Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.

He16

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

DCGAN: Generate images with Deep Convolutional GAN

0. Introduction

In this tutorial, we generate images with generative adversarial networks (GAN). GAN are a kind of deep neural network used for generative modeling and are often applied to image generation. GAN-based models are also used in PaintsChainer, an automatic colorization service.

_images/generated-images.gif

In this tutorial, you will learn the following things:

  1. Generative Adversarial Networks (GAN)

  2. Implementation of DCGAN in Chainer

1. Generative Adversarial Networks (GAN)

1.1 What are GAN?

As explained in the GAN tutorial at NIPS 2016 [1], generative models can be classified into the categories shown in the following figure:

_images/class-generative-model.png

cited from [1]

Besides GAN, other famous generative models include Fully visible belief networks (FVBNs) and Variational autoencoder (VAE). Unlike FVBNs and VAE, GAN do not explicitly model the probability distribution \(p({\bf s})\) that generates training data. Instead, we model a generator \(G: {\bf z} \mapsto {\bf s}\). The generator \(G\) samples \({\bf s} \sim p({\bf s})\) from the latent variable \({\bf z}\). Apart from the generator \(G\), we create a discriminator \(D({\bf x})\) which discriminates between samples from the generator G and examples from training data. While training the discriminator \(D\), the generator \(G\) tries to maximize the probability of the discriminator \(D\) making a mistake. So, the generator \(G\) tries to create samples that seem to be drawn from the same distribution as the training data.

The advantages of GAN are low sampling cost and its state-of-the-art performance in image generation. The disadvantage is that we cannot calculate the likelihood \(p_{\mathrm {model}}({\bf s})\) because we do not model any probability distribution, and we cannot infer the latent variable \({\bf z}\) from a sample.

1.2 How GAN work?

As explained above, GAN use the two models, the generator and the discriminator. When training the networks, we should match the data distribution \(p({\bf s})\) with the distribution of the samples \({\bf s} = G ({\bf z})\) generated from the generator.

_images/gan-overview.png

The generator \(G\) learns the target distribution, and ideally eventually reaches a Nash equilibrium [2] of game theory. In detail, while training the discriminator \(D\), the generator \(G\) is also trained, so that the discriminator \(D\) makes a mistake.

As an intuitive example, the relationship between counterfeiters of banknotes and the police is frequently used. The counterfeiters try to make counterfeit notes that look like real banknotes. The police try to distinguish real banknotes from counterfeit notes. It is supposed that the ability of the police gradually rises, so that real banknotes and counterfeit notes can be recognized well. Then, the counterfeiters will not be able to use their counterfeit banknotes, so they will create counterfeit banknotes that appear more realistic. As the police improve their skill further, they can distinguish real and counterfeit notes… Eventually, the counterfeiters will be able to produce counterfeit banknotes that look as real as genuine ones.

The training process is explained by the following mathematical expressions. First, since the discriminator \(D({\bf s})\) represents the probability that a sample \({\bf s}\) comes from the data distribution rather than from the generator, it can be expressed as follows:

\[D({\bf s}) = \frac{p({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})}\]

Then, when we match the data distribution \({\bf s} \sim p({\bf s})\) and the distribution of the samples generated by \(G\), it means that we should minimize the dissimilarity between the two distributions. It is common to use the Jensen-Shannon divergence \(D_{\mathrm{JS}}\) to measure the dissimilarity between distributions [3].

The \(D_{\mathrm{JS}}\) of \(p_{\mathrm{model}}({\bf s})\) and \(p({\bf s})\) can be written as follows by using \(D({\bf s})\):

\[\begin{split}2 D_{\mathrm{JS}} &=& D_{\mathrm{KL}}(p({\bf s})||\bar{p}({\bf s})) + D_{\mathrm{KL}}(p_{\mathrm{model}}({\bf s})||\bar{p}({\bf s})) \\ &=& \mathbb{E}_{p({\bf s})} \left[ \log \frac{2p({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})} \right] + \mathbb{E}_{p_{\mathrm{model}}} \left[ \log \frac{2p_{\mathrm{model}}({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})} \right] \\ &=& \mathbb{E}_{p({\bf s})} \log D({\bf s}) + \mathbb{E}_{p_{\mathrm{model}}} \log (1-D({\bf s})) + \log 4 \\ &=& \mathbb{E}_{p({\bf s})} \log D({\bf s}) + \mathbb{E}_{p_{\bf z}} \log (1-D(G({\bf z}))) + \log 4\end{split}\]

where \(\bar{p}({\bf s}) = \frac{p({\bf s}) + p_{\rm model}({\bf s})}{2}\). The \(D_{\mathrm{JS}}\) will be maximized by the discriminator \(D\) and minimized by the generator \(G\), namely \(p_{\mathrm{model}}\). In other words, the training solves the following min-max problem, and the distribution \(p_{\mathrm{model}}({\bf s})\) generated by \(G({\bf z})\) can then match the data distribution \(p({\bf s})\):

\[\min_{G} \max_{D} \mathbb{E}_{p({\bf s})} \log D({\bf s}) + \mathbb{E}_{p_{\bf z}} \log (1-D(G({\bf z})))\]

When we actually train the model, the above min-max problem is solved by alternately updating the discriminator \(D({\bf s})\) and the generator \(G({\bf z})\) [4]. The actual training procedure is described as follows:

_images/update-gan.png

cited from [4]

1.3 What are DCGAN?

In this section, we will introduce the model called DCGAN (Deep Convolutional GAN) proposed by Radford et al. [5]. As its name suggests, it is a model that uses CNNs (Convolutional Neural Networks), as shown below.

_images/dcgan.png

cited from [5]

In addition, although GAN are known for being difficult to train, this paper introduces various techniques for successful training:

  1. Convert max-pooling layers to convolution layers with larger or fractional strides

  2. Convert fully connected layers to global average pooling layers in the discriminator

  3. Use batch normalization layers in the generator and the discriminator

  4. Use leaky ReLU activation functions in the discriminator

2. Implementation of DCGAN in Chainer

There is an example of DCGAN in the official repository of Chainer, so we will explain how to implement DCGAN based on this: chainer/examples/dcgan

2.1 Define the generator model

First, let’s define a network for the generator.

train_dcgan.py
class Generator(chainer.Chain):

    def __init__(self, n_hidden, bottom_width=4, ch=512, wscale=0.02):
        super(Generator, self).__init__()
        self.n_hidden = n_hidden
        self.ch = ch
        self.bottom_width = bottom_width

        with self.init_scope():
            w = chainer.initializers.Normal(wscale)
            self.l0 = L.Linear(self.n_hidden, bottom_width * bottom_width * ch,
                               initialW=w)
            self.dc1 = L.Deconvolution2D(ch, ch // 2, 4, 2, 1, initialW=w)
            self.dc2 = L.Deconvolution2D(ch // 2, ch // 4, 4, 2, 1, initialW=w)
            self.dc3 = L.Deconvolution2D(ch // 4, ch // 8, 4, 2, 1, initialW=w)
            self.dc4 = L.Deconvolution2D(ch // 8, 3, 3, 1, 1, initialW=w)
            self.bn0 = L.BatchNormalization(bottom_width * bottom_width * ch)
            self.bn1 = L.BatchNormalization(ch // 2)
            self.bn2 = L.BatchNormalization(ch // 4)
            self.bn3 = L.BatchNormalization(ch // 8)

    def make_hidden(self, batchsize):
        dtype = chainer.get_dtype()
        return numpy.random.uniform(-1, 1, (batchsize, self.n_hidden, 1, 1))\
            .astype(dtype)

    def forward(self, z):
        h = F.reshape(F.relu(self.bn0(self.l0(z))),
                      (len(z), self.ch, self.bottom_width, self.bottom_width))
        h = F.relu(self.bn1(self.dc1(h)))
        h = F.relu(self.bn2(self.dc2(h)))
        h = F.relu(self.bn3(self.dc3(h)))
        x = F.sigmoid(self.dc4(h))
        return x

When we make a network in Chainer, there are some conventions:

  1. Define a network class which inherits Chain.

  2. Make instances of chainer.links in the init_scope() block of the initializer __init__.

  3. Define the network connections in forward() by using the instances of chainer.links and chainer.functions.

If you are not familiar with constructing a new network, please refer to this tutorial.

As we can see from the initializer __init__, the Generator uses deconvolution layers Deconvolution2D and batch normalization layers BatchNormalization. In forward(), each layer is called and followed by relu, except the last layer.

Because the first argument of L.Deconvolution2D is the input channel size and the second is the output channel size, we can see that each layer halves the channel size. When we construct the Generator with ch=1024, the network is the same as in the above image.

Note

Be careful when passing the output of a fully connected layer to a convolution layer, because the convolutional layer needs additional dimensions for its input. As we can see in the first line of forward(), the output of the fully connected layer is reshaped by reshape() to add the dimensions for the channel, the width, and the height of the images.

2.2 Define the discriminator model

In addition, let’s define the network for the discriminator.

train_dcgan.py
class Discriminator(chainer.Chain):

    def __init__(self, bottom_width=4, ch=512, wscale=0.02):
        w = chainer.initializers.Normal(wscale)
        super(Discriminator, self).__init__()
        with self.init_scope():
            self.c0_0 = L.Convolution2D(3, ch // 8, 3, 1, 1, initialW=w)
            self.c0_1 = L.Convolution2D(ch // 8, ch // 4, 4, 2, 1, initialW=w)
            self.c1_0 = L.Convolution2D(ch // 4, ch // 4, 3, 1, 1, initialW=w)
            self.c1_1 = L.Convolution2D(ch // 4, ch // 2, 4, 2, 1, initialW=w)
            self.c2_0 = L.Convolution2D(ch // 2, ch // 2, 3, 1, 1, initialW=w)
            self.c2_1 = L.Convolution2D(ch // 2, ch // 1, 4, 2, 1, initialW=w)
            self.c3_0 = L.Convolution2D(ch // 1, ch // 1, 3, 1, 1, initialW=w)
            self.l4 = L.Linear(bottom_width * bottom_width * ch, 1, initialW=w)
            self.bn0_1 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_0 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_1 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_0 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_1 = L.BatchNormalization(ch // 1, use_gamma=False)
            self.bn3_0 = L.BatchNormalization(ch // 1, use_gamma=False)

    def forward(self, x):
        device = self.device
        h = add_noise(device, x)
        h = F.leaky_relu(add_noise(device, self.c0_0(h)))
        h = F.leaky_relu(add_noise(device, self.bn0_1(self.c0_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn1_0(self.c1_0(h))))
        h = F.leaky_relu(add_noise(device, self.bn1_1(self.c1_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn2_0(self.c2_0(h))))
        h = F.leaky_relu(add_noise(device, self.bn2_1(self.c2_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn3_0(self.c3_0(h))))
        return self.l4(h)

The Discriminator network is almost a mirror of the Generator network. However, there are a few differences:

  1. Use leaky_relu as the activation function

  2. It is deeper than the Generator

  3. Add some noise to every intermediate output before giving it to the next layer

train_dcgan.py
def add_noise(device, h, sigma=0.2):
    if chainer.config.train:
        xp = device.xp
        # TODO(niboshi): Support random.randn in ChainerX
        if device.xp is chainerx:
            fallback_device = device.fallback_device
            with chainer.using_device(fallback_device):
                randn = device.send(fallback_device.xp.random.randn(*h.shape))
        else:
            randn = xp.random.randn(*h.shape)
        return h + sigma * randn
    else:
        return h

2.3 Prepare dataset and iterator

Let’s retrieve the CIFAR-10 dataset by using Chainer’s dataset utility function get_cifar10. CIFAR-10 is a set of small natural images. Each example is an RGB color image of size 32x32. In the original images, each of the R, G, and B channels of a pixel is represented by a one-byte unsigned integer (i.e., from 0 to 255). This function changes the scale of the pixel values into [0, scale] float values.

    train, _ = chainer.datasets.get_cifar10(withlabel=False, scale=255.)
train_dcgan.py
train_iter = chainer.iterators.SerialIterator(train, args.batchsize)

2.4 Prepare model and optimizer

Let’s make the instances of the generator and the discriminator.

train_dcgan.py
gen = Generator(n_hidden=args.n_hidden)
dis = Discriminator()

gen.to_device(device)  # Copy the model to the device
dis.to_device(device)


Next, let’s make optimizers for the models created above.

train_dcgan.py
def make_optimizer(model, alpha=0.0002, beta1=0.5):
    optimizer = chainer.optimizers.Adam(alpha=alpha, beta1=beta1)
    optimizer.setup(model)
    optimizer.add_hook(
        chainer.optimizer_hooks.WeightDecay(0.0001), 'hook_dec')
    return optimizer

opt_gen = make_optimizer(gen)
opt_dis = make_optimizer(dis)

2.5 Prepare updater

GAN need two models: the generator and the discriminator. Usually, the default updaters pre-defined in Chainer take only one model, so we need to define a custom updater for GAN training.

The definition of DCGANUpdater is a little complicated. However, it just minimizes the loss of the discriminator and that of the generator alternately.

As you can see in the class definition, DCGANUpdater inherits from StandardUpdater. In this case, almost all the necessary functionality is already defined in StandardUpdater, so we only override __init__ and update_core.

Note

loss_dis and loss_gen do not strictly need to be separate methods, since they are called only from update_core; defining them separately just improves readability.

train_dcgan.py
class DCGANUpdater(chainer.training.updaters.StandardUpdater):

    def __init__(self, *args, **kwargs):
        self.gen, self.dis = kwargs.pop('models')
        super(DCGANUpdater, self).__init__(*args, **kwargs)

    def loss_dis(self, dis, y_fake, y_real):
        batchsize = len(y_fake)
        L1 = F.sum(F.softplus(-y_real)) / batchsize
        L2 = F.sum(F.softplus(y_fake)) / batchsize
        loss = L1 + L2
        chainer.report({'loss': loss}, dis)
        return loss

    def loss_gen(self, gen, y_fake):
        batchsize = len(y_fake)
        loss = F.sum(F.softplus(-y_fake)) / batchsize
        chainer.report({'loss': loss}, gen)
        return loss

    def update_core(self):
        gen_optimizer = self.get_optimizer('gen')
        dis_optimizer = self.get_optimizer('dis')

        batch = self.get_iterator('main').next()
        device = self.device
        x_real = Variable(self.converter(batch, device)) / 255.

        gen, dis = self.gen, self.dis
        batchsize = len(batch)

        y_real = dis(x_real)

        z = Variable(device.xp.asarray(gen.make_hidden(batchsize)))
        x_fake = gen(z)
        y_fake = dis(x_fake)

        dis_optimizer.update(self.loss_dis, dis, y_fake, y_real)
        gen_optimizer.update(self.loss_gen, gen, y_fake)

In the initializer __init__, an additional keyword argument models is required, as you can see in the code below. Also, we use the keyword arguments iterator, optimizer, and device. It should be noted that the optimizer argument takes a dictionary: the two different models require two different optimizers, so to specify them we pass the dictionary {'gen': opt_gen, 'dis': opt_dis} to the optimizer argument. In the DCGANUpdater, you can access the iterator with self.get_iterator('main'), and the optimizers with self.get_optimizer('gen') and self.get_optimizer('dis').

In update_core, the two loss functions loss_dis and loss_gen are minimized by the optimizers. In the first two lines, we access the optimizers. Then, we create the next minibatch of training data with self.get_iterator('main').next(), copy the batch to the device with self.converter, and wrap it in a Variable object. After that, we minimize the loss functions with the optimizers.

Note

When defining update_core, we may want to manipulate the underlying array of a Variable with the numpy or cupy library. Note that the type of arrays on the CPU is numpy.ndarray, while the type of arrays on the GPU is cupy.ndarray. However, users do not need to write an if condition explicitly, because the appropriate array module can be obtained by xp = chainer.backend.get_array_module(variable.array). If variable is on the GPU, cupy is assigned to xp; otherwise numpy is assigned to xp.
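
A rough sketch of such device-agnostic code (add_gaussian_noise is a hypothetical helper, similar in spirit to the add_noise function above):

import chainer

def add_gaussian_noise(v, sigma=0.2):
    # Works for a Variable on either CPU or GPU:
    # numpy is picked for CPU arrays and cupy for GPU arrays.
    xp = chainer.backend.get_array_module(v.array)
    noise = sigma * xp.random.randn(*v.shape).astype(v.dtype)
    return v + noise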

train_dcgan.py
updater = DCGANUpdater(
    models=(gen, dis),
    iterator=train_iter,
    optimizer={
        'gen': opt_gen, 'dis': opt_dis},
    device=device)

2.6 Prepare trainer and run
train_dcgan.py
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

snapshot_interval = (args.snapshot_interval, 'iteration')
display_interval = (args.display_interval, 'iteration')
trainer.extend(
    extensions.snapshot(filename='snapshot_iter_{.updater.iteration}.npz'),
    trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    gen, 'gen_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    dis, 'dis_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.LogReport(trigger=display_interval))
trainer.extend(extensions.PrintReport([
    'epoch', 'iteration', 'gen/loss', 'dis/loss',
]), trigger=display_interval)
trainer.extend(extensions.ProgressBar(update_interval=10))
trainer.extend(
    out_generated_image(
        gen, dis,
        10, 10, args.seed, args.out),
    trigger=snapshot_interval)

train_dcgan.py
trainer.run()


2.7 Start training

We can run the example as follows.

$ pwd
/root2chainer/chainer/examples/dcgan
$ python train_dcgan.py --gpu 0
GPU: 0
# Minibatch-size: 50
# n_hidden: 100
# epoch: 1000

epoch       iteration   gen/loss    dis/loss  ................]  0.01%
0           100         1.2292      1.76914
     total [..................................................]  0.02%
this epoch [#########.........................................] 19.00%
       190 iter, 0 epoch / 1000 epochs
    10.121 iters/sec. Estimated time to finish: 1 day, 3:26:26.372445.

The results will be saved in the directory /root2chainer/chainer/examples/dcgan/result/. The image below was generated by the generator trained for 1000 epochs, and the GIF image at the top of this page shows images generated after every 10 epochs.

_images/generated-image-epoch1000.png

3. Reference

Recurrent Nets and their Computational Graph

In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.

import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions

In this section, you will learn how to write

  • recurrent nets with full backprop,

  • recurrent nets with truncated backprop,

  • evaluation of networks with little memory.

After reading this section, you will be able to:

  • Handle input sequences of variable length

  • Truncate upper stream of the network during forward computation

  • Use no-backprop mode to prevent network construction

Recurrent Nets

Recurrent nets are neural networks with loops. They are often used to learn from sequential input/output. Given an input stream \(x_1, x_2, \dots, x_t, \dots\) and the initial state \(h_0\), a recurrent net iteratively updates its state by \(h_t = f(x_t, h_{t-1})\), and at some or every point in time \(t\), it outputs \(y_t = g(h_t)\). If we expand the procedure along the time axis, it looks like a regular feed-forward network except that the same parameters are repeatedly used within the network.

Here we learn how to write a simple one-layer recurrent net. The task is language modeling: given a finite sequence of words, we want to predict the next word at each position without peeking at the successive words. Suppose there are 1,000 different word types, and that we use 100-dimensional real vectors to represent each word (a.k.a. word embeddings).

Let’s start from defining the recurrent neural net language model (RNNLM) as a chain. We can use the chainer.links.LSTM link that implements a fully-connected stateful LSTM layer. This link looks like an ordinary fully-connected layer. On construction, you pass the input and output size to the constructor:

>>> l = L.LSTM(100, 50)

Then, calling this instance as l(x) executes one step of the LSTM layer:

>>> l.reset_state()
>>> x = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y = l(x)

Do not forget to reset the internal state of the LSTM layer before the forward computation! Every recurrent layer holds its internal state (i.e. the output of the previous call). At the first application of the recurrent layer, you must reset the internal state. Then, the next input can be directly fed to the LSTM instance:

>>> x2 = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y2 = l(x2)

Based on this LSTM link, let’s write our recurrent network as a new chain:

class RNN(Chain):
    def __init__(self):
        super(RNN, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(1000, 100)  # word embedding
            self.mid = L.LSTM(100, 50)  # the first LSTM layer
            self.out = L.Linear(50, 1000)  # the feed-forward output layer

    def reset_state(self):
        self.mid.reset_state()

    def forward(self, cur_word):
        # Given the current word ID, predict the next word.
        x = self.embed(cur_word)
        h = self.mid(x)
        y = self.out(h)
        return y

rnn = RNN()
model = L.Classifier(rnn)
optimizer = optimizers.SGD()
optimizer.setup(model)

Here EmbedID is a link for word embedding. It converts input integers into corresponding fixed-dimensional embedding vectors. The last linear link out represents the feed-forward output layer.

The RNN chain implements a one-step-forward computation. It does not handle sequences by itself, but we can use it to process sequences by just feeding items in a sequence straight to the chain.

Suppose we have a list of word variables x_list. Then, we can compute the loss values for the word sequence with a simple for loop.

def compute_loss(x_list):
    loss = 0
    for cur_word, next_word in zip(x_list, x_list[1:]):
        loss += model(cur_word, next_word)
    return loss

Of course, the accumulated loss is a Variable object with the full history of computation, so we can just call its backward() method to compute the gradients of the total loss with respect to the model parameters:

# Suppose we have a list of word variables x_list.
rnn.reset_state()
model.cleargrads()
loss = compute_loss(x_list)
loss.backward()
optimizer.update()

Or equivalently we can use the compute_loss as a loss function:

rnn.reset_state()
optimizer.update(compute_loss, x_list)

Truncate the Graph by Unchaining

Learning from very long sequences is also a typical use case of recurrent nets. Suppose the input and state sequence is too long to fit into memory. In such cases, we often truncate the backpropagation to a short time range. This technique is called truncated backprop. It is a heuristic, and it makes the gradients biased. However, this technique works well in practice if the time range is long enough.

How can we implement truncated backprop in Chainer? Chainer has a smart mechanism to achieve truncation, called backward unchaining. It is implemented in the Variable.unchain_backward() method. Backward unchaining starts from the Variable object and chops the computation history backwards from that variable. The chopped variables are disposed of automatically (if they are not referenced explicitly by any other user object). As a result, they are no longer part of the computation history and are not involved in backprop anymore.

Let’s write an example of truncated backprop. Here we use the same network as the one used in the previous subsection. Suppose we are given a very long sequence, and we want to run backprop truncated at every 30 time steps. We can write truncated backprop using the model defined above:

loss = 0
count = 0
seqlen = len(x_list[1:])

rnn.reset_state()
for cur_word, next_word in zip(x_list, x_list[1:]):
    loss += model(cur_word, next_word)
    count += 1
    if count % 30 == 0 or count == seqlen:
        model.cleargrads()
        loss.backward()
        loss.unchain_backward()
        optimizer.update()

The state is updated at each call of model(), and the losses are accumulated in the loss variable. Every 30 steps, backprop is run on the accumulated loss. Then, the unchain_backward() method is called, which deletes the computation history backward from the accumulated loss. Note that the last state of the model is not lost, since the RNN instance holds a reference to it.

The implementation of truncated backprop is simple, and since it involves no complicated trick, we can generalize this method to different situations. For example, we can easily extend the above code to use different schedules for backprop timing and truncation length, as sketched below.
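
For instance, here is a minimal sketch (reusing the rnn, model, and optimizer objects defined above) that makes the backprop interval a parameter, so that the schedule can be varied; it is only one possible variant:

def train_truncated(x_list, bprop_interval=30):
    # Run truncated BPTT over one long word sequence, backpropagating and
    # truncating the graph every ``bprop_interval`` steps.
    rnn.reset_state()
    loss = 0
    count = 0
    seqlen = len(x_list[1:])
    for cur_word, next_word in zip(x_list, x_list[1:]):
        loss += model(cur_word, next_word)
        count += 1
        if count % bprop_interval == 0 or count == seqlen:
            model.cleargrads()
            loss.backward()
            loss.unchain_backward()  # truncate the graph here
            optimizer.update()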

Network Evaluation without Storing the Computation History

When evaluating recurrent nets, there is typically no need to store the computation history. While unchaining enables us to walk through sequences of unlimited length with limited memory, it is a bit of a workaround.

As an alternative, Chainer provides an evaluation mode of forward computation which does not store the computation history. This is enabled by just entering the no_backprop_mode() context:

with chainer.no_backprop_mode():
    x_list = [Variable(...) for _ in range(100)]  # list of 100 words
    loss = compute_loss(x_list)

Note that we cannot call loss.backward() to compute the gradient here, since the variable created in the no-backprop context does not remember the computation history.

The no-backprop context is also useful for evaluating feed-forward networks with a reduced memory footprint.

We can combine a fixed feature extractor network and a trainable predictor network using no_backprop_mode(). For example, suppose we want to train a feed-forward network predictor_func, which is located on top of another fixed pre-trained network fixed_func. We want to train predictor_func without storing the computation history for fixed_func. This is simply done by the following code snippet (suppose x_data and y_data indicate the input data and label, respectively):

with chainer.no_backprop_mode():
    x = Variable(x_data)
    feat = fixed_func(x)
y = predictor_func(feat)
y.backward()

At first, the input variable x is in no-backprop mode, so fixed_func does not memorize the computation history. Then predictor_func is executed in backprop mode, i.e., with the computation history memorized. Since the history of computation is only memorized between the variables feat and y, the backward computation stops at the feat variable.

Making it with Trainer

The above code is written with the plain Function/Variable APIs. When we write a training loop, it is better to use Trainer, since we can then easily add functionality via extensions.

Before implementing it with Trainer, let’s clarify the training settings. Here we use the Penn Tree Bank dataset as a set of sentences. Each sentence is represented as a word sequence. We concatenate all sentences into one long word sequence, in which sentences are separated by a special word <eos>, which stands for “End of Sequence”. This dataset is easily obtained by chainer.datasets.get_ptb_words(). This function returns the train, validation, and test datasets, each of which is represented as a long array of integers. Each integer represents a word ID.

Our task is to learn a recurrent neural net language model from the long word sequence. We use words at different locations to form mini-batches. This means we maintain \(B\) indices pointing to different locations in the sequence, read from these indices at each iteration, and increment all of them after the read. Of course, when an index reaches the end of the whole sequence, we wrap it back to 0.
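
As a toy illustration of this reading scheme (this snippet is not part of the example code; seq and offsets are stand-in names), the indices advance in parallel and wrap around modulo the sequence length:

import numpy as np

seq = np.arange(100)  # stand-in for the long word sequence
B = 4                 # number of parallel read positions
offsets = [i * len(seq) // B for i in range(B)]

for iteration in range(3):
    batch = [seq[(offset + iteration) % len(seq)] for offset in offsets]
    print(batch)  # B words, one per parallel position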

In order to implement this training procedure, we have to customize the following components of Trainer:

  • Iterator. Built-in iterators do not support reading from different locations and aggregating them into a mini-batch.

  • Update function. The default update function does not support truncated BPTT.

When we write a dataset iterator dedicated to the dataset, the dataset implementation can be arbitrary; even its interface is not fixed. On the other hand, the iterator must support the Iterator interface. The important methods and attributes to implement are batch_size, epoch, epoch_detail, is_new_epoch, iteration, __next__, and serialize. The following is code from the official example in the examples/ptb directory.

from __future__ import division

class ParallelSequentialIterator(chainer.dataset.Iterator):
    def __init__(self, dataset, batch_size, repeat=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.epoch = 0
        self.is_new_epoch = False
        self.repeat = repeat
        self.offsets = [i * len(dataset) // batch_size for i in range(batch_size)]
        self.iteration = 0

    def __next__(self):
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            raise StopIteration
        cur_words = self.get_words()
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        return self.iteration * self.batch_size / len(self.dataset)

    def get_words(self):
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)

train_iter = ParallelSequentialIterator(train, 20)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)

Although the code is slightly long, the idea is simple. First, this iterator creates offsets pointing to positions equally spaced within the whole sequence. The i-th example of each mini-batch refers to the sequence at the i-th offset. The iterator returns a list of tuples of the current words and the next words. Each mini-batch is converted to a tuple of integer arrays by the concat_examples function in the standard updater (see the previous tutorial), as sketched below.
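
For example, a minimal sketch of what the iterator yields and how the standard converter turns it into arrays (assuming the train_iter defined above):

from chainer.dataset import concat_examples

batch = train_iter.__next__()  # a list of (current word, next word) tuples
x, t = concat_examples(batch)  # a tuple of two integer arrays of shape (batch_size,)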

Backprop Through Time is implemented as follows.

class BPTTUpdater(training.updaters.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len):
        super(BPTTUpdater, self).__init__(train_iter, optimizer)
        self.bprop_len = bprop_len

    # The core part of the update routine can be customized by overriding.
    def update_core(self):
        loss = 0
        # When we pass one iterator and optimizer to StandardUpdater.__init__,
        # they are automatically named 'main'.
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Progress the dataset iterator for bprop_len words at each iteration.
        for i in range(self.bprop_len):
            # Get the next batch (a list of tuples of two word IDs)
            batch = train_iter.__next__()

            # Concatenate the word IDs to matrices and send them to the device
            # self.converter does this job
            # (it is chainer.dataset.concat_examples by default)
            x, t = self.converter(batch)

            # Compute the loss at this time step and accumulate it
            loss += optimizer.target(chainer.Variable(x), chainer.Variable(t))

        optimizer.target.cleargrads()  # Clear the parameter gradients
        loss.backward()  # Backprop
        loss.unchain_backward()  # Truncate the graph
        optimizer.update()  # Update the parameters

updater = BPTTUpdater(train_iter, optimizer, bprop_len)  # instantiation

In this case, we update the parameters every bprop_len consecutive words. The call of unchain_backward cuts off the history of computation accumulated in the LSTM links. The rest of the code for setting up Trainer is almost the same as the one given in the previous tutorial.


In this section we have demonstrated how to write recurrent nets in Chainer and some fundamental techniques to manage the history of computation (a.k.a. the computational graph). The example in the examples/ptb directory implements truncated backprop learning of an LSTM language model from the Penn Treebank corpus. In the next section, we will review how to use GPU(s) in Chainer.

RNN Language Models

0. Introduction

A language model models the probability of generating natural language sentences or documents. You can use a language model to estimate how natural a sentence or a document is. Also, with a language model, you can generate new sentences or documents.

Let’s start with modeling the probability of generating sentences. We represent a sentence as \({\bf X} = ({\bf x}_0, {\bf x}_1, ..., {\bf x}_T)\), in which \({\bf x}_t\) is a one-hot vector. Generally, \({\bf x}_0\) is the one-hot vector of BOS (beginning of sentence), and \({\bf x}_T\) is that of EOS (end of sentence).

A language model models the probability of a word’s occurrence conditioned on its preceding words in a sentence. Letting \({\bf X}_{[i, j]}\) be \(({\bf x}_i, {\bf x}_{i+1}, ..., {\bf x}_j)\), the occurrence probability of the sentence \(\bf X\) can be represented as follows:

\[P({\bf X}) = P({\bf x}_0) \prod_{t=1}^T P({\bf x}_t|{\bf X}_{[0, t-1]})\]

So, the language model \(P({\bf X})\) can be decomposed into word probabilities conditioned on their preceding words. In this tutorial, we model \(P({\bf x}_t|{\bf X}_{[0, t-1]})\) with a recurrent neural network to obtain a language model \(P({\bf X})\).

1. Basic Idea of Recurrent Neural Net Language Model

1.1 Recurrent Neural Net Language Model

A Recurrent Neural Net Language Model (RNNLM) is a type of neural net language model which contains an RNN in the network. Since an RNN can deal with variable-length inputs, it is suitable for modeling sequential data such as sentences in natural language.

We show one layer of an RNNLM with the following parameters:

  • \({\bf x}_t\): the one-hot vector of the \(t\)-th word

  • \({\bf y}_t\): the \(t\)-th output

  • \({\bf h}_t^{(i)}\): the \(t\)-th hidden state of the \(i\)-th layer

  • \({\bf p}_t\): the probability distribution of the next word, given the \(t\)-th word

  • \({\bf E}\): the embedding matrix

  • \({\bf W}_h\): the hidden layer matrix

  • \({\bf W}_o\): the output layer matrix

_images/rnnlm.png
The process to get the next-word prediction from the \(t\)-th input word \({\bf x}_t\) is as follows (a toy NumPy sketch follows the note below):
  1. Get the embedding vector: \({\bf h}_t^{(0)} = {\bf E} {\bf x}_t\)

  2. Calculate the hidden layer: \({\bf h}_t^{(1)} = {\rm tanh} \left( {\bf W}_h \left[ \begin{array}{cc} {\bf h}_t^{(0)} \\ {\bf h}_{t-1}^{(1)} \end{array} \right] \right)\)

  3. Calculate the output layer: \({\bf y}_t = {\bf W}_o {\bf h}_t^{(1)}\)

  4. Transform to probability: \({\bf p}_t = {\rm softmax}({\bf y}_t)\)

Note

  • Note that \(\rm tanh\) in the above equation is applied to the input vector in an element-wise manner.

  • Note that \(\left[ \begin{array}{cc} {\bf a} \\ {\bf b} \end{array} \right]\) denotes a concatenated vector of \({\bf a}\) and \({\bf b}\).

  • Note that \({\rm softmax}\) in the above equation converts an arbitrary real vector to a probability vector whose elements sum to \(1\).
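
To make the four steps above concrete, here is a toy NumPy sketch of a single RNNLM step; the sizes and the randomly initialized matrices are stand-ins, not part of the PTB example:

import numpy as np

V, D, H = 10, 4, 5               # toy vocabulary, embedding, and hidden sizes
E = np.random.randn(D, V)        # embedding matrix
W_h = np.random.randn(H, D + H)  # hidden layer matrix
W_o = np.random.randn(V, H)      # output layer matrix

x_t = np.eye(V)[3]               # one-hot vector of the t-th word
h_prev = np.zeros(H)             # previous hidden state h_{t-1}^(1)

h0 = E.dot(x_t)                                      # 1. embedding vector
h1 = np.tanh(W_h.dot(np.concatenate([h0, h_prev])))  # 2. hidden layer
y_t = W_o.dot(h1)                                    # 3. output layer
p_t = np.exp(y_t) / np.exp(y_t).sum()                # 4. softmax probabilities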

1.2 Perplexity (Evaluation of the language model)

Perplexity is the common evaluation metric for a language model. Generally, it measures how well the proposed probability model \(P_{\rm model}({\bf X})\) represents the target data \(P^*({\bf X})\). Let the validation dataset be \(D = \{{\bf X}^{(n)}\}_{n=1}^{|D|}\), a set of sentences, where the \(n\)-th sentence has length \(T^{(n)}\) and the vocabulary size of the dataset is \(|\mathcal{V}|\). The perplexity is then defined as follows:

\[b^z \ \ s.t. \ \ z = - \frac{1}{\sum_{n=1}^{|D|} T^{(n)}} \sum_{n=1}^{|D|} \sum_{t=1}^{T^{(n)}} \log_b P_{\rm model}({\bf x}_t^{(n)} \mid {\bf X}_{[0, t-1]}^{(n)})\]

We usually use \(b = 2\) or \(b = e\). The perplexity shows how spread out the predicted distribution over the next word is. When a language model represents the dataset well, it assigns a high probability to the correct next word, so the cross entropy \(z\) is small; therefore, a smaller perplexity means a better model.

During training, we minimize the cross entropy below:

\[\mathcal{H}(\hat{P}, P_{\rm model}) = - \sum_{{\bf X}} \hat{P}({\bf X}) \log P_{\rm model}({\bf X})\]

where \(\hat P\) is the empirical distribution of a sequence in the training dataset.
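
As a toy numeric check of the relation between the cross entropy and the perplexity with \(b = e\) (the probabilities below are made-up values, not model outputs):

import numpy as np

# made-up probabilities assigned to the correct next words
probs = np.array([0.2, 0.5, 0.1, 0.4])
cross_entropy = -np.mean(np.log(probs))  # z in the definition above
perplexity = np.exp(cross_entropy)       # b**z with b = e
print(perplexity)                        # about 4.0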

2. Implementation of Recurrent Neural Net Language Model

There is an example of an RNN language model in the official repository, so we will explain how to implement an RNNLM in Chainer based on it: examples/ptb

2.1 Model Overview
_images/rnnlm_example.png

The RNNLM used in this tutorial is depicted in the figure above. The symbols appearing in the figure are defined as follows:

  • \({\bf x}_t\): the one-hot vector of the \(t\)-th word

  • \({\bf y}_t\): the \(t\)-th output

  • \({\bf h}_t^{(i)}\): the \(t\)-th hidden state of the \(i\)-th layer

  • \({\bf p}_t\): the probability distribution of the next word, given the \(t\)-th word

  • \({\bf E}\): the embedding matrix

  • \({\bf W}_h\): the hidden layer matrix

  • \({\bf W}_o\): the output layer matrix

LSTMs (long short-term memory) are used for the connections between hidden layers. An LSTM is one of the major recurrent neural net modules. It is designed to capture long-term dependencies, so it can relate distant words, such as a word at the beginning of a sentence and one at the end. We also apply dropout before each LSTM and linear transformation. Dropout is a regularization technique for preventing overfitting on the training dataset.

2.2 Step-by-step Implementation
2.2.1 Import Package

First, let’s import necessary packages.

train_ptb.py
from __future__ import division
import argparse
import sys

import numpy as np

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions

2.2.2 Define Training Settings

Define all training settings here.

train_ptb.py
parser.add_argument('--batchsize', '-b', type=int, default=20,
                    help='Number of examples in each mini-batch')
parser.add_argument('--bproplen', '-l', type=int, default=35,
                    help='Number of words in each mini-batch '
                         '(= length of truncated BPTT)')
parser.add_argument('--epoch', '-e', type=int, default=39,
                    help='Number of sweeps over the dataset to train')
parser.add_argument('--device', '-d', type=str, default='-1',
                    help='Device specifier. Either ChainerX device '
                    'specifier or an integer. If non-negative integer, '
                    'CuPy arrays with specified device id are used. If '
                    'negative integer, NumPy arrays are used')
parser.add_argument('--gradclip', '-c', type=float, default=5,
                    help='Gradient norm threshold to clip')
parser.add_argument('--out', '-o', default='result',
                    help='Directory to output the result')
parser.add_argument('--resume', '-r', type=str,
                    help='Resume the training from snapshot')
parser.add_argument('--test', action='store_true',
                    help='Use tiny datasets for quick tests')
parser.set_defaults(test=False)
parser.add_argument('--unit', '-u', type=int, default=650,
                    help='Number of LSTM units in each layer')
parser.add_argument('--model', '-m', default='model.npz',
                    help='Model file name to serialize')
2.2.3 Define Network Structure

An RNNLM written in Chainer is shown below. It implements the model depicted in the above figure.

train_ptb.py
class RNNForLM(chainer.Chain):

    def __init__(self, n_vocab, n_units):
        super(RNNForLM, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.LSTM(n_units, n_units)
            self.l2 = L.LSTM(n_units, n_units)
            self.l3 = L.Linear(n_units, n_vocab)

        for param in self.params():
            param.array[...] = np.random.uniform(-0.1, 0.1, param.shape)

    def reset_state(self):
        self.l1.reset_state()
        self.l2.reset_state()

    def forward(self, x):
        h0 = self.embed(x)
        h1 = self.l1(F.dropout(h0))
        h2 = self.l2(F.dropout(h1))
        y = self.l3(F.dropout(h2))
        return y
  • When we instantiate this class for making a model, we give the vocabulary size to n_vocab and the size of hidden vectors to n_units.

  • This network uses chainer.links.LSTM, chainer.links.Linear, and chainer.functions.dropout as its building blocks. All the layers are registered and initialized in the context with self.init_scope().

  • You can access all the parameters in those layers by calling self.params().

  • In the constructor, it initializes all parameters with values sampled from a uniform distribution \(U(-0.1, 0.1)\).

  • The forward method takes a word ID x, calculates the probability vector of the next word by forwarding it through the network, and returns the output.

  • Note that the word ID x is automatically converted to a \(|\mathcal{V}|\)-dimensional one-hot vector and then multiplied with the input embedding matrix in self.embed(x) to obtain the embedding vector h0 in the first line of forward.
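
A small sketch of this behaviour of EmbedID with toy sizes (the sizes here are arbitrary and unrelated to the PTB example):

import numpy as np
import chainer.links as L

embed = L.EmbedID(10, 4)              # vocabulary size 10, embedding size 4
x = np.array([3, 7], dtype=np.int32)  # a mini-batch of two word IDs
h0 = embed(x)                         # each ID selects a row of the embedding matrix
print(h0.shape)                       # (2, 4)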

2.2.4 Load the Penn Tree Bank Long Word Sequence Dataset

In this tutorial, we use the Penn Tree Bank dataset, which contains a number of sentences. Chainer provides a utility function to download this dataset from the server and convert it into a long single sequence of word IDs. chainer.datasets.get_ptb_words() actually returns three separate datasets: train, validation, and test.

Let’s download and make dataset objects using it:

train_ptb.py

# Load the Penn Tree Bank long word sequence dataset
train, val, test = chainer.datasets.get_ptb_words()
2.2.5 Define Iterator for Making a Mini-batch from the Dataset

The dataset iterator creates mini-batches of word pairs at different positions, namely pairs of a current word and its next word. Each example is a part of the sequence starting from a different offset, with the offsets equally spaced within the whole sequence.

train_ptb.py
class ParallelSequentialIterator(chainer.dataset.Iterator):

    def __init__(self, dataset, batch_size, repeat=True):
        super(ParallelSequentialIterator, self).__init__()
        self.dataset = dataset
        self.batch_size = batch_size  # batch size
        self.repeat = repeat
        length = len(dataset)
        # Offsets maintain the position of each sequence in the mini-batch.
        self.offsets = [i * length // batch_size for i in range(batch_size)]
        self.reset()

    def reset(self):
        # Number of completed sweeps over the dataset. In this case, it is
        # incremented if every word is visited at least once after the last
        # increment.
        self.epoch = 0
        # True if the epoch is incremented at the last iteration.
        self.is_new_epoch = False
        # NOTE: this is not a count of parameter updates. It is just a count of
        # calls of ``__next__``.
        self.iteration = 0
        # use -1 instead of None internally
        self._previous_epoch_detail = -1.

    def __next__(self):
        # This iterator returns a list representing a mini-batch. Each item
        # indicates a different position in the original sequence. Each item is
        # represented by a pair of two word IDs. The first word is at the
        # "current" position, while the second word at the next position.
        # At each iteration, the iteration count is incremented, which pushes
        # forward the "current" position.
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            # If not self.repeat, this iterator stops at the end of the first
            # epoch (i.e., when all words are visited once).
            raise StopIteration
        cur_words = self.get_words()
        self._previous_epoch_detail = self.epoch_detail
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        # Floating point version of epoch.
        return self.iteration * self.batch_size / len(self.dataset)

    @property
    def previous_epoch_detail(self):
        if self._previous_epoch_detail < 0:
            return None
        return self._previous_epoch_detail

    def get_words(self):
        # It returns a list of current words.
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        # It is important to serialize the state to be recovered on resume.
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)
        try:
            self._previous_epoch_detail = serializer(
                'previous_epoch_detail', self._previous_epoch_detail)
        except KeyError:
            # guess previous_epoch_detail for older version
            self._previous_epoch_detail = self.epoch + \
                (self.current_position - self.batch_size) / len(self.dataset)
            if self.epoch_detail > 0:
                self._previous_epoch_detail = max(
                    self._previous_epoch_detail, 0.)
            else:
                self._previous_epoch_detail = -1.
2.2.6 Define Updater

We use backpropagation through time (BPTT) to optimize the RNNLM. BPTT can be implemented by overriding the update_core() method of StandardUpdater. First, the constructor of BPTTUpdater takes bprop_len as an argument in addition to the other arguments StandardUpdater needs. bprop_len defines the length of the sequence \(T\) used to calculate the loss:

\[\mathcal{L} = - \sum_{t=0}^T \sum_{n=1}^{|\mathcal{V}|} \hat{P}({\bf x}_{t+1}^{(n)}) \log P_{\rm model}({\bf x}_{t+1}^{(n)} \mid {\bf x}_t^{(n)})\]

where \(\hat{P}({\bf x}_t^{(n)})\) is the empirical probability that the \(n\)-th word in the vocabulary occurs at position \(t\) in the training data sequence.

train_ptb.py
class BPTTUpdater(training.updaters.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len, device):
        super(BPTTUpdater, self).__init__(
            train_iter, optimizer, device=device)
        self.bprop_len = bprop_len

    # The core part of the update routine can be customized by overriding.
    def update_core(self):
        loss = 0
        # When we pass one iterator and optimizer to StandardUpdater.__init__,
        # they are automatically named 'main'.
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Progress the dataset iterator for bprop_len words at each iteration.
        for i in range(self.bprop_len):
            # Get the next batch (a list of tuples of two word IDs)
            batch = train_iter.__next__()

            # Concatenate the word IDs to matrices and send them to the device
            # self.converter does this job
            # (it is chainer.dataset.concat_examples by default)
            x, t = self.converter(batch, self.device)

            # Compute the loss at this time step and accumulate it
            loss += optimizer.target(x, t)

        optimizer.target.cleargrads()  # Clear the parameter gradients
        loss.backward()  # Backprop
        loss.unchain_backward()  # Truncate the graph
        optimizer.update()  # Update the parameters
2.2.7 Define Evaluation Function (Perplexity)

Define a function to calculate the perplexity from the loss value. If we take \(b = e\) in the above definition of perplexity, calculating the perplexity amounts to raising \(e\) to the power of the loss value, i.e. \(e^{\rm loss}\):

train_ptb.py
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])
2.2.8 Create Iterator

Here, the code below just creates iterator objects from dataset splits (train/val/test).

train_ptb.py

train_iter = ParallelSequentialIterator(train, args.batchsize)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
test_iter = ParallelSequentialIterator(test, 1, repeat=False)

2.2.9 Create RNN and Classification Model

Instantiate the RNNLM model and wrap it with chainer.links.Classifier, because it calculates softmax cross entropy as the loss.

train_ptb.py
rnn = RNNForLM(n_vocab, args.unit)
model = L.Classifier(rnn)
model.compute_accuracy = False  # we only want the perplexity

Note that Classifier computes not only the loss but also the accuracy based on a given input/label pair. To learn the RNN language model, we only need the loss (cross entropy) in the Classifier, because we calculate the perplexity instead of the classification accuracy to check the performance of the model. So, we turn off accuracy computation by setting the model.compute_accuracy attribute to False.

2.2.10 Setup Optimizer

Prepare an optimizer. Here, we use GradientClipping to prevent gradient explosion. It automatically clips the norm of the gradients used to update the model parameters to the given constant gradclip.

train_ptb.py
optimizer = chainer.optimizers.SGD(lr=1.0)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer_hooks.GradientClipping(args.gradclip))

2.2.11 Setup and Run Trainer

Let’s make a trainer object and start the training! Note that we add an eval_hook to the Evaluator extension to reset the internal states before starting the evaluation. This prevents the RNN state built up on training data from affecting the evaluation.

train_ptb.py
updater = BPTTUpdater(train_iter, optimizer, args.bproplen, device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

eval_model = model.copy()  # Model with shared params and distinct states
eval_rnn = eval_model.predictor
trainer.extend(extensions.Evaluator(
    val_iter, eval_model, device=device,
    # Reset the RNN state at the beginning of each evaluation
    eval_hook=lambda _: eval_rnn.reset_state()))

interval = 10 if args.test else 500
trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                    trigger=(interval, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'perplexity', 'val_perplexity']
), trigger=(interval, 'iteration'))
trainer.extend(extensions.ProgressBar(
    update_interval=1 if args.test else 10))
trainer.extend(extensions.snapshot())
trainer.extend(extensions.snapshot_object(
    model, 'model_iter_{.updater.iteration}'))
if args.resume is not None:
    chainer.serializers.load_npz(args.resume, trainer)

trainer.run()

2.2.12 Evaluate the trained model on test dataset

Let’s see the perplexity on the test split. A Trainer extension can also be called as a normal function outside of a Trainer.

train_ptb.py
print('test')
eval_rnn.reset_state()
evaluator = extensions.Evaluator(test_iter, eval_model, device=device)
result = evaluator()
print('test perplexity: {}'.format(np.exp(float(result['main/loss']))))

2.3 Run Example
2.3.1 Training the model

You can train the model with the script: examples/ptb/train_ptb.py

$ pwd
/root2chainer/chainer/examples/ptb
$ python train_ptb.py --test  # run by test mode. If you want to use all data, remove "--test".
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt...
#vocab = 10000
test
test perplexity: 29889.9857364
2.3.2 Generating sentences

You can generate a sentence that starts with a word in the vocabulary. In this example, we generate a sentence that starts with the word apple. We use the script in the PTB example of the official repository: examples/ptb/gentxt.py

$ pwd
/root2chainer/chainer/examples/ptb
$ python gentxt.py -m model.npz -p apple
apple a new u.s. economist with <unk> <unk> fixed more than to N the company said who is looking back to

Word2Vec: Obtain word embeddings

0. Introduction

Word2vec is a tool for generating distributed representations of words, proposed by Mikolov et al.[1]. It assigns a real-valued vector to each word such that the closer the meanings of two words are, the greater the similarity their vectors indicate.

Distributed representation means assigning a real-valued vector to each word and representing the word by that vector. When a word is represented this way, the vector is called a word embedding. In this tutorial, we aim to explain how to obtain word embeddings from the Penn Tree Bank dataset.

Let’s think about what the meaning of a word is. Since we are human, we can understand that the words “animal” and “dog” are deeply related to each other. But what information does Word2vec use to learn the vectors for words? The words “animal” and “dog” should have similar vectors, but the words “food” and “dog” should be far from each other. How can such features of words be learned automatically?

1. Basic Idea

Word2vec learns the similarity of word meanings from simple information. It learns the representations of words from sentences. The core idea is based on the assumption that the meaning of a word is affected by the words around it. This idea follows the distributional hypothesis[2].

The word we focus on to learn its representation is called the center word, and the words around it are called context words. The window size \(C\) determines how many context words are considered.

Here, let’s see the algorithm by using an example sentence: “The cute cat jumps over the lazy dog.”.

  • All of the following figures consider “cat” as the center word.

  • You can see that the number of context words changes according to the window size \(C\).

_images/center_context_word.png

2. Main Algorithm

Word2vec, the tool for creating word embeddings, actually consists of two models, called Skip-gram and CBoW.

To explain the models with the figures below, we will use the following symbols.

  • \(|\mathcal{V}|\): the size of the vocabulary

  • \(D\): the size of the embedding vector

  • \({\bf v}_t\): a one-hot center word vector

  • \(V_{t \pm C}\): a set of \(2C\) context vectors around \({\bf v}_t\), namely \(\{{\bf v}_{t+c}\}_{c=-C}^C \backslash {\bf v}_t\)

  • \({\bf l}_H\): an embedding vector of an input word vector

  • \({\bf l}_O\): an output vector of the network

  • \({\bf W}_H\): the embedding matrix for inputs

  • \({\bf W}_O\): the embedding matrix for outputs

Note

Using negative sampling or hierarchical softmax for the loss function is very common; however, in this tutorial, we will use the softmax over all words and skip the other variants for the sake of simplicity.

2.1 Skip-gram

This model learns to predict the context words \(V_{t \pm C}\) when a center word \({\bf v}_t\) is given. In the model, each column of the input embedding matrix \({\bf W}_H\) becomes the word embedding of the corresponding word.

When you input a center word \({\bf v}_t\) into the network, you can predict one of context words \(\hat {\bf v}_{t+c} \in V_{t \pm C}\) as follows:

  1. Calculate an embedding vector of the input center word vector: \({\bf l}_H = {\bf W}_H {\bf v}_t\)

  2. Calculate an output vector of the embedding vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)

  3. Calculate a probability vector of a context word: \(\hat {\bf v}_{t+c} = \text{softmax}({\bf l}_O)\)

Each element of the \(|\mathcal{V}|\)-dimensional vector \(\hat {\bf v}_{t+c}\) is a probability that a word in the vocabulary turns out to be a context word at position \(c\). So, the probability \(p({\bf v}_{t+c}|{\bf v}_t)\) can be estimated by a dot product of the one-hot vector \({\bf v}_{t+c}\) which represents the actual word at the position \(c\) and the output vector \(\hat {\bf v}_{t+c}\).

\[p({\bf v}_{t+c}|{\bf v}_t) = {\bf v}_{t+c}^T \hat {\bf v}_{t+c}\]

The loss function to predict all the context words \(V_{t \pm C}\) given a center word \({\bf v}_t\) is defined as follows:

\[\begin{split}L(V_{t \pm C} | {\bf v}_t; {\bf W}_H, {\bf W}_O) &= \sum_{V_{t \pm C}} -\log\left(p({\bf v}_{t+c} \mid {\bf v}_t)\right) \\ &= \sum_{V_{t \pm C}} -\log({\bf v}_{t+c}^T \hat{\bf v}_{t+c})\end{split}\]
2.2 Continuous Bag of Words (CBoW)

This model learns to predict the center word \({\bf v}_t\) when the context words \(V_{t \pm C}\) are given. When you give a set of context words \(V_{t \pm C}\) to the network, you can estimate the probability of the center word \(\hat {\bf v}_t\) as follows:

  1. Calculate a mean embedding vector over all context words: \({\bf l}_H = \frac{1}{2C} \sum_{V_{t \pm C}} {\bf W}_H {\bf v}_{t+c}\)

  2. Calculate an output vector of the embedding vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)

  3. Calculate a probability vector of a center word: \(\hat {\bf v}_t = \text{softmax}({\bf l}_O)\)

Each element of the \(|\mathcal{V}|\)-dimensional vector \(\hat {\bf v}_t\) is a probability that a word in the vocabulary turns out to be a center word. So, the probability \(p({\bf v}_t|V_{t \pm C})\) can be estimated by a dot product of the one-hot vector \({\bf v}_t\) which represents the actual center word and the output vector \(\hat {\bf v}_t\).

\[p({\bf v}_t|V_{t \pm C}) = {\bf v}_t^T \hat {\bf v}_t\]

The loss function to predict the center word \({\bf v}_t\) given context words \(V_{t \pm C}\) is defined as follows:

\[\begin{split}L({\bf v}_t | V_{t \pm C}; {\bf W}_H, {\bf W}_O) &= -\log\left(p({\bf v}_t \mid V_{t \pm C})\right) \\ &= -\log({\bf v}_t^T \hat {\bf v}_t)\end{split}\]
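
A toy NumPy sketch of the three CBoW steps above; the matrices are random stand-ins and the word indices are chosen arbitrarily for illustration:

import numpy as np

V, D, C = 10, 2, 2
W_H = np.random.randn(D, V)              # input embedding matrix
W_O = np.random.randn(V, D)              # output embedding matrix

contexts = np.eye(V)[[1, 4, 5, 7]]       # 2C one-hot context word vectors
l_H = W_H.dot(contexts.T).mean(axis=1)   # 1. mean embedding over the context words
l_O = W_O.dot(l_H)                       # 2. output vector
v_hat = np.exp(l_O) / np.exp(l_O).sum()  # 3. softmax over the vocabulary

v_center = np.eye(V)[3]                  # one-hot vector of the actual center word
loss = -np.log(v_center.dot(v_hat))      # negative log-likelihood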

3. Details of Skip-gram

In this tutorial, we mainly explain Skip-gram model because

  1. Its algorithm is easier to understand than that of CBoW.

  2. Even when the number of words increases, its accuracy is largely maintained, so it is more scalable.

So, let’s think about a concrete example of calculating Skip-gram under this setup:

  • The size of vocabulary \(|\mathcal{V}|\) is 10.

  • The size of embedding vector \(D\) is 2.

  • Center word is “dog”.

  • Context word is “animal”.

Since there is generally more than one context word, the following process is repeated for each context word (a toy NumPy sketch of these steps follows the figure below).

  1. The one-hot vector of “dog” is [0 0 1 0 0 0 0 0 0 0] and you input it as the center word.

  2. The column of the embedding matrix \({\bf W}_H\) corresponding to “dog” (its third column) is used as the word embedding \({\bf l}_H\).

  3. Then, multiply \({\bf W}_O\) with \({\bf l}_H\) to obtain the output vector \({\bf l}_O\).

  4. Give \({\bf l}_O\) to the softmax function to make it a predicted probability vector \(\hat {\bf v}_{t+c}\) for a context word at the position \(c\).

  5. Calculate the error between \(\hat {\bf v}_{t+c}\) and the one-hot vector of “animal”; [1 0 0 0 0 0 0 0 0 0].

  6. Propagate the error back to the network to update the parameters.

_images/skipgram_detail.png
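
A toy NumPy sketch of steps 1-6 under this setup; the embedding matrices are random stand-ins, and the assignment of “dog” to index 2 and “animal” to index 0 is assumed only for illustration:

import numpy as np

V, D = 10, 2
W_H = np.random.randn(D, V)              # input embedding matrix
W_O = np.random.randn(V, D)              # output embedding matrix

v_dog = np.eye(V)[2]                     # 1. one-hot vector of "dog"
l_H = W_H.dot(v_dog)                     # 2. embedding of "dog" (a column of W_H)
l_O = W_O.dot(l_H)                       # 3. output vector
v_hat = np.exp(l_O) / np.exp(l_O).sum()  # 4. softmax prediction for a context word

v_animal = np.eye(V)[0]                  # one-hot vector of "animal"
loss = -np.log(v_animal.dot(v_hat))      # 5. error against the actual context word
# 6. the gradient of ``loss`` w.r.t. W_H and W_O would then be backpropagated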

4. Implementation of Skip-gram in Chainer

There is an example of Word2vec in the official repository of Chainer, so we will explain how to implement Skip-gram based on this: examples/word2vec

4.1 Preparation

First, let’s import necessary packages:

train_word2vec.py
import argparse
import collections
import os
import six
import warnings

import numpy as np

import chainer
from chainer.backends import cuda
import chainer.functions as F
import chainer.initializers as I
import chainer.links as L
import chainer.optimizers as O
from chainer import reporter
4.2 Define a Skip-gram model

Next, let’s define a network for Skip-gram.

train_word2vec.py
class SkipGram(chainer.Chain):
    """Definition of Skip-gram Model"""

    def __init__(self, n_vocab, n_units, loss_func):
        super(SkipGram, self).__init__()

        with self.init_scope():
            self.embed = L.EmbedID(
                n_vocab, n_units, initialW=I.Uniform(1. / n_units))
            self.loss_func = loss_func

    def forward(self, x, contexts):
        e = self.embed(contexts)
        batch_size, n_context, n_units = e.shape
        x = F.broadcast_to(x[:, None], (batch_size, n_context))
        e = F.reshape(e, (batch_size * n_context, n_units))
        x = F.reshape(x, (batch_size * n_context,))
        loss = self.loss_func(e, x)
        reporter.report({'loss': loss}, self)
        return loss
train_word2vec.py
class SoftmaxCrossEntropyLoss(chainer.Chain):
    """Softmax cross entropy loss function preceded by linear transformation.

    """

    def __init__(self, n_in, n_out):
        super(SoftmaxCrossEntropyLoss, self).__init__()
        with self.init_scope():
            self.out = L.Linear(n_in, n_out, initialW=0)

    def forward(self, x, t):
        return F.softmax_cross_entropy(self.out(x), t)

Note

  • The weight matrix self.embed.W is the embedding matrix for input vector x.

  • The forward method takes the word ID of a center word x and the word IDs of context words contexts as inputs, and outputs the error calculated by the loss function loss_func, such as SoftmaxCrossEntropyLoss.

  • Note that the initial shapes of x and contexts are (batch_size,) and (batch_size, n_context), respectively.

  • batch_size means the size of the mini-batch, and n_context means the number of context words.

First, we obtain the embedding vectors of the context words by e = self.embed(contexts). Then F.broadcast_to(x[:, None], (batch_size, n_context)) broadcasts x (whose shape is (batch_size,)) to (batch_size, n_context) by copying the same value n_context times to fill the second axis. The broadcast x is then reshaped into the 1-D vector (batch_size * n_context,), while e is reshaped to (batch_size * n_context, n_units). In the Skip-gram model, predicting a context word from the center word is the same as predicting the center word from a context word, because the center word is always a context word when the context word is treated as the center word. So, we create batch_size * n_context center-word predictions by applying the self.out linear layer to the embedding vectors of the context words, and then calculate the softmax cross entropy between the broadcast center word IDs x and the predictions.
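
A small shape trace of these operations with dummy data (the sizes batch_size=3, n_context=4, and n_units=5 are arbitrary):

import numpy as np
import chainer.functions as F

batch_size, n_context, n_units = 3, 4, 5
e = np.random.randn(batch_size, n_context, n_units).astype(np.float32)
x = np.arange(batch_size).astype(np.int32)                 # center word IDs, shape (3,)

x_b = F.broadcast_to(x[:, None], (batch_size, n_context))  # shape (3, 4)
e_r = F.reshape(e, (batch_size * n_context, n_units))      # shape (12, 5)
x_r = F.reshape(x_b, (batch_size * n_context,))            # shape (12,)
print(e_r.shape, x_r.shape)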

4.3 Prepare dataset and iterator

Let’s retrieve the Penn Tree Bank (PTB) dataset by using Chainer’s dataset utility get_ptb_words() method.

train, val, _ = chainer.datasets.get_ptb_words()
counts = collections.Counter(train)

Then define an iterator to make mini-batches that contain a set of center words with their context words. train and val are the training data and the validation data. Each contains a list of word IDs:

>>> train
array([ 0,  1,  2, ..., 39, 26, 24], dtype=int32)
>>> val
array([2211,  396, 1129, ...,  108,   27,   24], dtype=int32)
train_word2vec.py
class WindowIterator(chainer.dataset.Iterator):
    """Dataset iterator to create a batch of sequences at different positions.

    This iterator returns a pair of the current words and the context words.
    """

    def __init__(self, dataset, window, batch_size, repeat=True):
        self.dataset = np.array(dataset, np.int32)
        self.window = window  # size of context window
        self.batch_size = batch_size
        self._repeat = repeat
        # order is the array which is shuffled ``[window, window + 1, ...,
        # len(dataset) - window - 1]``
        self.order = np.random.permutation(
            len(dataset) - window * 2).astype(np.int32)
        self.order += window
        self.current_position = 0
        # Number of completed sweeps over the dataset. In this case, it is
        # incremented if every word is visited at least once after the last
        # increment.
        self.epoch = 0
        # True if the epoch is incremented at the last iteration.
        self.is_new_epoch = False

    def __next__(self):
        """This iterator returns a list representing a mini-batch.

        Each item indicates a different position in the original sequence.
        """
        if not self._repeat and self.epoch > 0:
            raise StopIteration

        i = self.current_position
        i_end = i + self.batch_size
        position = self.order[i:i_end]
        w = np.random.randint(self.window - 1) + 1
        offset = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])
        pos = position[:, None] + offset[None, :]
        contexts = self.dataset.take(pos)
        center = self.dataset.take(position)

        if i_end >= len(self.order):
            np.random.shuffle(self.order)
            self.epoch += 1
            self.is_new_epoch = True
            self.current_position = 0
        else:
            self.is_new_epoch = False
            self.current_position = i_end

        return center, contexts

    @property
    def epoch_detail(self):
        return self.epoch + float(self.current_position) / len(self.order)

    def serialize(self, serializer):
        self.current_position = serializer('current_position',
                                           self.current_position)
        self.epoch = serializer('epoch', self.epoch)
        self.is_new_epoch = serializer('is_new_epoch', self.is_new_epoch)
        if self.order is not None:
            serializer('order', self.order)
  • In the constructor, we create an array self.order which denotes shuffled indices of [window, window + 1, ..., len(dataset) - window - 1] in order to choose a center word randomly from dataset in a mini-batch.

  • The iterator definition __next__ returns batch_size sets of center word and context words.

  • The code self.order[i:i_end] returns the indices for a set of center words from the random-ordered array self.order. The center word IDs center at the random indices are retrieved by self.dataset.take.

  • np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)]) creates a set of offsets to retrieve context words from the dataset.

  • The code position[:, None] + offset[None, :] generates the indices of context words for each center word index in position. The context word IDs contexts are retrieved by self.dataset.take.

4.4 Prepare model, optimizer, and updater
train_word2vec.py
    model = SkipGram(n_vocab, args.unit, loss_func)
train_word2vec.py
optimizer = O.Adam()
optimizer.setup(model)

train_word2vec.py
train_iter = WindowIterator(train, args.window, args.batchsize)
val_iter = WindowIterator(val, args.window, args.batchsize, repeat=False)

# Set up an updater
updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=device)

train_word2vec.py
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

trainer.extend(extensions.Evaluator(
    val_iter, model, converter=convert, device=device))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss']))
trainer.extend(extensions.ProgressBar())

trainer.extend(
    extensions.snapshot(filename='snapshot_epoch_{.updater.epoch}'),
    trigger=(args.snapshot_interval, 'epoch'))

if args.resume is not None:
    chainer.serializers.load_npz(args.resume, trainer)
trainer.run()

4.5 Start training
$ pwd
/root2chainer/chainer/examples/word2vec
$ python train_word2vec.py --test  # run by test mode. If you want to use all data, remove "--test".
GPU: -1
# unit: 100
Window: 5
Minibatch-size: 1000
# epoch: 20
Training model: skipgram
Output type: hsm

n_vocab: 10000
data length: 100
epoch       main/loss   validation/main/loss
1           4233.75     2495.33
2           1411.14     4990.66
3           4233.11     1247.66
4           2821.66     4990.65
5           4231.94     1247.66
6           5642.04     2495.3
7           5640.82     4990.64
8           5639.31     2495.28
9           2817.89     4990.62
10          1408.03     3742.94
11          5633.11     1247.62
12          4221.71     2495.21
13          4219.3      4990.56
14          4216.57     2495.16
15          4213.52     2495.12
16          5616.03     1247.55
17          5611.34     3742.78
18          2800.31     3742.74
19          1397.79     2494.95
20          2794.1      3742.66
4.6 Search for similar words
$ pwd
/root2chainer/chainer/examples/word2vec
$ python search.py
>> apple
query: apple
compaq: 0.6169619560241699
chip: 0.49579331278800964
retailer: 0.4904134273529053
maker: 0.4684058427810669
computer: 0.4652436673641205
>> animal
query: animal
beauty: 0.5680124759674072
human: 0.5404794216156006
insulin: 0.5365156531333923
cell: 0.5186758041381836
photographs: 0.5077002048492432

5. Reference

Write a Sequence to Sequence (seq2seq) Model

0. Introduction

The sequence to sequence (seq2seq) model[1][2] is a learning model that converts an input sequence into an output sequence. In this context, a sequence is a list of symbols, corresponding to the words in a sentence. The seq2seq model has achieved great success in fields such as machine translation, dialogue systems, question answering, and text summarization. All of these tasks can be regarded as learning a model that converts an input sequence into an output sequence.

1. Basic Idea of Seq2seq Model

1.1 Overview of Seq2seq Model
The Notations of Sequence

The seq2seq model converts an input sequence into an output sequence. Let the input sequence and the output sequence be \(\bf X\) and \(\bf Y\). The \(i\)-th element of the input sequence is represented as \({\bf x}_i\), and the \(j\)-th element of the output sequence as \({\bf y}_j\). Generally, each \({\bf x}_i\) and \({\bf y}_j\) is a one-hot vector of a symbol. For example, in natural language processing (NLP), the one-hot vector represents a word and its size equals the vocabulary size.

Let’s think about the seq2seq model in the context of NLP. Letting the vocabularies of the inputs and the outputs be \({\mathcal V}^{(s)}\) and \({\mathcal V}^{(t)}\), all the elements \({\bf x}_i\) and \({\bf y}_j\) satisfy \({\bf x}_i \in \mathbb{R}^{|{\mathcal V}^{(s)}|}\) and \({\bf y}_j \in \mathbb{R}^{|{\mathcal V}^{(t)}|}\). The input sequence \(\bf X\) and the output sequence \(\bf Y\) are represented as the following equations:

\[\begin{split}{\bf X} &= ({\bf x}_1, ..., {\bf x}_I) = ({\bf x}_i)_{i=1}^I \\ {\bf Y} &= ({\bf y}_1, ..., {\bf y}_J) = ({\bf y}_j)_{j=1}^J\end{split}\]

\(I\) and \(J\) are the lengths of the input sequence and the output sequence. Using the typical NLP notation, \({\bf y}_0\) is the one-hot vector of BOS, the virtual word representing the beginning of the sentence, and \({\bf y}_{J+1}\) is that of EOS, the virtual word representing the end of the sentence.

The Notations of Conditional Probability \(P({\bf Y}|{\bf X})\)

Next, let’s think about the conditional probability \(P({\bf Y}|{\bf X})\) of generating the output sequence \(\bf Y\) when the input sequence \(\bf X\) is given. The purpose of the seq2seq model is to model the probability \(P({\bf Y}|{\bf X})\). However, the seq2seq model does not model the probability \(P({\bf Y}|{\bf X})\) directly. Actually, it models the probability \(P({\bf y}_j|{\bf Y}_{<j}, {\bf X})\), which is the probability of generating the \(j\)-th element of the output sequence \({\bf y}_j\) given \({\bf Y}_{<j}\) and \({\bf X}\). \({\bf Y}_{<j}\) means the output sequence from position \(1\) to \(j-1\), i.e. \(({\bf y}_k)_{k=1}^{j-1}\). In this notation, you can write the model \(P_{\theta}({\bf Y}|{\bf X})\) as the product of \(P_{\theta}({\bf y}_j|{\bf Y}_{<j}, {\bf X})\):

\[P_{\theta}({\bf Y}|{\bf X}) = \prod_{j=1}^{J+1} P_{\theta}({\bf y}_j|{\bf Y}_{<j}, {\bf X})\]
Processing Steps in Seq2seq Model

Now, let’s think about the processing steps in the seq2seq model. The key feature of the seq2seq model is that it consists of two processes:

  1. The process that generates the fixed size vector \(\bf z\) from the input sequence \(\bf X\)

  2. The process that generates the output sequence \(\bf Y\) from \(\bf z\)

In other words, the information of \(\bf X\) is conveyed by \(\bf z\), and \(P_{\theta}({\bf y}_j|{\bf Y}_{<j}, {\bf X})\) is actually calculated by \(P_{\theta}({\bf y}_j|{\bf Y}_{<j}, {\bf z})\).

First, we represent the process of generating \(\bf z\) from \(\bf X\) by the function \(\Lambda\):

\[{\bf z} = \Lambda({\bf X})\]

The function \(\Lambda\) may be a recurrent neural net such as an LSTM.

Second, we represent the process of generating \(\bf Y\) from \(\bf z\) by the following formulas:

\[\begin{split}P_{\theta}({\bf y}_j|{\bf Y}_{<j}, {\bf X}) = \Upsilon({\bf h}_j^{(t)}, {\bf y}_j) \\ {\bf h}_j^{(t)} = \Psi({\bf h}_{j-1}^{(t)}, {\bf y}_{j-1})\end{split}\]

\(\Psi\) is the function to generate the hidden vectors \({\bf h}_j^{(t)}\), and \(\Upsilon\) is the function to calculate the generative probability of the one-hot vector \({\bf y}_j\). When \(j=1\), \({\bf h}_{j-1}^{(t)}\) or \({\bf h}_0^{(t)}\) is \(\bf z\) generated by \(\Lambda({\bf X})\), and \({\bf y}_{j-1}\) or \({\bf y}_0\) is the one-hot vector of BOS.

1.2 Model Architecture of Seq2seq Model

In this section, we describe the architecture of the seq2seq model. To simplify the explanation, we use the most basic architecture. The seq2seq model can be separated into five major components:

  1. Encoder Embedding Layer

  2. Encoder Recurrent Layer

  3. Decoder Embedding Layer

  4. Decoder Recurrent Layer

  5. Decoder Output Layer

_images/seq2seq.png

The encoder consists of two layers: the embedding layer and the recurrent layer, and the decoder consists of three layers: the embedding layer, the recurrent layer, and the output layer.

In the explanation, we use the following symbols:

  • \(H\): the size of the hidden vector

  • \(D\): the size of the embedding vector

  • \({\bf x}_i\): the one-hot vector of the \(i\)-th word in the input sentence

  • \({\bf \bar x}_i\): the embedding vector of the \(i\)-th word in the input sentence

  • \({\bf E}^{(s)}\): the embedding matrix of the encoder

  • \({\bf h}_i^{(s)}\): the \(i\)-th hidden vector of the encoder

  • \({\bf y}_j\): the one-hot vector of the \(j\)-th word in the output sentence

  • \({\bf \bar y}_j\): the embedding vector of the \(j\)-th word in the output sentence

  • \({\bf E}^{(t)}\): the embedding matrix of the decoder

  • \({\bf h}_j^{(t)}\): the \(j\)-th hidden vector of the decoder

1.2.1 Encoder Embedding Layer

The first layer, the encoder embedding layer, converts each word in the input sentence into an embedding vector. When processing the \(i\)-th word in the input sentence, the input and the output of the layer are the following:

  • The input is \({\bf x}_i\) : the one-hot vector which represents \(i\)-th word

  • The output is \({\bf \bar x}_i\) : the embedding vector which represents \(i\)-th word

Each embedding vector is calculated by the following equation:

\[{\bf \bar x}_i = {\bf E}^{(s)} {\bf x}_i\]

\({\bf E}^{(s)} \in {\mathbb R}^{D \times |{\mathcal V}^{(s)}|}\) is the embedding matrix of the encoder.

1.2.2 Encoder Recurrent Layer

The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the \(i\)-th embedding vector, the input and the output of the layer are the following:

  • The input is \({\bf \bar x}_i\) : the embedding vector which represents the \(i\)-th word

  • The output is \({\bf h}_i^{(s)}\) : the hidden vector of the \(i\)-th position

For example, when using a one-layer uni-directional RNN, the process can be represented by the following function \(\Psi^{(s)}\):

\[\begin{split}{\bf h}_i^{(s)} &= \Psi^{(s)}({\bf \bar x}_i, {\bf h}_{i-1}^{(s)}) \\ &= {\rm tanh} \left( {\bf W}^{(s)} \left[ \begin{array}{cc} {\bf h}_{i-1}^{(s)} \\ {\bf \bar x}_{i} \end{array} \right] + {\bf b}^{(s)} \right)\end{split}\]

In this case, we use \({\rm tanh}\) as the activation function.
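
A toy NumPy sketch of this encoder step \(\Psi^{(s)}\); the sizes and random weights are stand-ins:

import numpy as np

H, D = 3, 2                      # toy hidden and embedding sizes
W_s = np.random.randn(H, H + D)  # weight matrix W^(s)
b_s = np.zeros(H)                # bias b^(s)

h_prev = np.zeros(H)             # previous hidden vector h_{i-1}^(s)
x_bar = np.random.randn(D)       # embedding vector of the i-th input word

h_i = np.tanh(W_s.dot(np.concatenate([h_prev, x_bar])) + b_s)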

1.2.3 Decoder Embedding Layer

The decoder embedding layer converts each word in the output sentence into an embedding vector. When processing the \(j\)-th word in the output sentence, the input and the output of the layer are the following:

  • The input is \({\bf y}_{j-1}\) : the one-hot vector which represents the \((j-1)\)-th word generated by the decoder output layer

  • The output is \({\bf \bar y}_j\) : the embedding vector which represents the \((j-1)\)-th word

Each embedding vector is calculated by the following equation:

\[{\bf \bar y}_j = {\bf E}^{(t)} {\bf y}_{j-1}\]

\({\bf E}^{(t)} \in {\mathbb R}^{D \times |{\mathcal V}^{(t)}|}\) is the embedding matrix of the decoder.

1.2.4 Decoder Recurrent Layer

The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the \(j\)-th embedding vector, the input and the output of the layer are the following:

  • The input is \({\bf \bar y}_j\) : the embedding vector

  • The output is \({\bf h}_j^{(t)}\) : the hidden vector of \(j\)-th position

For example, when using a one-layer uni-directional RNN, the process can be represented by the following function \(\Psi^{(t)}\):

\[\begin{split}{\bf h}_j^{(t)} &= \Psi^{(t)}({\bf \bar y}_j, {\bf h}_{j-1}^{(t)}) \\ &= {\rm tanh} \left( {\bf W}^{(t)} \left[ \begin{array}{cc} {\bf h}_{j-1}^{(t)} \\ {\bf \bar y}_{j} \end{array} \right] + {\bf b}^{(t)} \right)\end{split}\]

In this case, we use \({\rm tanh}\) as the activation function. We must use the encoder’s hidden vector at the last position as the decoder’s hidden vector at the first position, as follows:

\[{\bf h}_0^{(t)} = {\bf z} = {\bf h}_I^{(s)}\]
1.2.5 Decoder Output Layer

The decoder output layer generates the probability of the \(j\)-th word of the output sentence from the hidden vector. When processing the \(j\)-th hidden vector, the input and the output of the layer are the following:

  • The input is \({\bf h}_j^{(t)}\) : the hidden vector of \(j\)-th position

  • The output is \(p_j\) : the probability of generating the one-hot vector \({\bf y}_j\) of the \(j\)-th word

\[\begin{split}p_j &= P_{\theta}({\bf y}_j|{\bf Y}_{<j}) = {\rm softmax}({\bf o}_j) \cdot {\bf y}_j \\ &= {\rm softmax}({\bf W}^{(o)}{\bf h}_j^{(t)} + {\bf b}^{(o)}) \cdot {\bf y}_j\end{split}\]

Note

There are many varieties of seq2seq models. We can use different RNN models in terms of: (1) directionality (unidirectional or bidirectional), (2) depth (single-layer or multi-layer), (3) type (a vanilla RNN, a long short-term memory (LSTM), or a gated recurrent unit (GRU)), and (4) additional functionality (such as an attention mechanism).

2. Implementation of Seq2seq Model

The official Chainer repository includes a neural machine translation example using the seq2seq model. We will now provide an overview of the example and explain its implementation in detail. chainer/examples/seq2seq

2.1 Model Overview

In this simple example, an input sequence is processed by a stacked LSTM-RNN (long short-term memory recurrent neural network) and encoded as a fixed-size vector. The output sequence is also processed by another stacked LSTM-RNN. At decoding time, the output sequence is generated using argmax.

_images/lstm-rnn.png
2.2 Step-by-step Implementation
2.2.1 Import Package

First, let’s import necessary packages.

seq2seq.py
import io

from nltk.translate import bleu_score
import numpy
import progressbar
import six

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
2.2.2 Define Training Settings

Define all training settings here.

seq2seq.py
parser.add_argument('SOURCE', help='source sentence list')
parser.add_argument('TARGET', help='target sentence list')
parser.add_argument('SOURCE_VOCAB', help='source vocabulary file')
parser.add_argument('TARGET_VOCAB', help='target vocabulary file')
parser.add_argument('--validation-source',
                    help='source sentence list for validation')
parser.add_argument('--validation-target',
                    help='target sentence list for validation')
parser.add_argument('--batchsize', '-b', type=int, default=64,
                    help='number of sentence pairs in each mini-batch')
parser.add_argument('--epoch', '-e', type=int, default=20,
                    help='number of sweeps over the dataset to train')
parser.add_argument('--resume', '-r', type=str,
                    help='resume the training from snapshot')
parser.add_argument('--save', '-s', type=str,
                    help='save a snapshot of the training')
parser.add_argument('--unit', '-u', type=int, default=1024,
                    help='number of units')
parser.add_argument('--layer', '-l', type=int, default=3,
                    help='number of layers')
parser.add_argument('--use-dataset-api', default=False,
                    action='store_true',
                    help='use TextDataset API to reduce CPU memory usage')
parser.add_argument('--min-source-sentence', type=int, default=1,
                    help='minimum length of source sentence')
parser.add_argument('--max-source-sentence', type=int, default=50,
                    help='maximum length of source sentence')
parser.add_argument('--min-target-sentence', type=int, default=1,
                    help='minimum length of target sentence')
parser.add_argument('--max-target-sentence', type=int, default=50,
                    help='maximum length of target sentence')
parser.add_argument('--log-interval', type=int, default=200,
                    help='number of iterations to show log')
parser.add_argument('--validation-interval', type=int, default=4000,
                    help='number of iterations to evaluate the model '
                    'with validation dataset')
parser.add_argument('--device', '-d', type=str, default='-1',
                    help='Device specifier. Either ChainerX device '
                    'specifier or an integer. If non-negative integer, '
                    'CuPy arrays with specified device id are used. If '
                    'negative integer, NumPy arrays are used')
parser.add_argument('--out', '-o', default='result',
                    help='directory to output the result')
group = parser.add_argument_group('deprecated arguments')
group.add_argument('--gpu', '-g', dest='device',
                   type=int, nargs='?', const=0,
                   help='GPU ID (negative value indicates CPU)')
2.2.3 Define Network Structure

The Chainer implementation of seq2seq is shown below. It implements the model depicted in the above figure.

seq2seq.py
class Seq2seq(chainer.Chain):

    def __init__(self, n_layers, n_source_vocab, n_target_vocab, n_units):
        super(Seq2seq, self).__init__()
        with self.init_scope():
            self.embed_x = L.EmbedID(n_source_vocab, n_units)
            self.embed_y = L.EmbedID(n_target_vocab, n_units)
            self.encoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.decoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.W = L.Linear(n_units, n_target_vocab)

        self.n_layers = n_layers
        self.n_units = n_units

    def forward(self, xs, ys):
        xs = [x[::-1] for x in xs]

        eos = self.xp.array([EOS], numpy.int32)
        ys_in = [F.concat([eos, y], axis=0) for y in ys]
        ys_out = [F.concat([y, eos], axis=0) for y in ys]

        # Both xs and ys_in are lists of arrays.
        exs = sequence_embed(self.embed_x, xs)
        eys = sequence_embed(self.embed_y, ys_in)

        batch = len(xs)
        # None represents a zero vector in an encoder.
        hx, cx, _ = self.encoder(None, None, exs)
        _, _, os = self.decoder(hx, cx, eys)

        # It is faster to concatenate data before calculating loss
        # because only one matrix multiplication is called.
        concat_os = F.concat(os, axis=0)
        concat_ys_out = F.concat(ys_out, axis=0)
        loss = F.sum(F.softmax_cross_entropy(
            self.W(concat_os), concat_ys_out, reduce='no')) / batch

        chainer.report({'loss': loss}, self)
        n_words = concat_ys_out.shape[0]
        perp = self.xp.exp(loss.array * batch / n_words)
        chainer.report({'perp': perp}, self)
        return loss

    def translate(self, xs, max_length=100):
        batch = len(xs)
        with chainer.no_backprop_mode(), chainer.using_config('train', False):
            xs = [x[::-1] for x in xs]
            exs = sequence_embed(self.embed_x, xs)
            h, c, _ = self.encoder(None, None, exs)
            ys = self.xp.full(batch, EOS, numpy.int32)
            result = []
            for i in range(max_length):
                eys = self.embed_y(ys)
                eys = F.split_axis(eys, batch, 0)
                h, c, ys = self.decoder(h, c, eys)
                cys = F.concat(ys, axis=0)
                wy = self.W(cys)
                ys = self.xp.argmax(wy.array, axis=1).astype(numpy.int32)
                result.append(ys)

        # Using `xp.concatenate(...)` instead of `xp.stack(result)` here to
        # support NumPy 1.9.
        result = chainer.get_device('@numpy').send(
            self.xp.concatenate([x[None, :] for x in result]).T)

        # Remove EOS tags
        outs = []
        for y in result:
            inds = numpy.argwhere(y == EOS)
            if len(inds) > 0:
                y = y[:inds[0, 0]]
            outs.append(y)
        return outs
  • In Seq2seq, three methods are defined: the constructor __init__, the forward computation forward, and the translation method translate.

seq2seq.py
    def __init__(self, n_layers, n_source_vocab, n_target_vocab, n_units):
        super(Seq2seq, self).__init__()
        with self.init_scope():
            self.embed_x = L.EmbedID(n_source_vocab, n_units)
            self.embed_y = L.EmbedID(n_target_vocab, n_units)
            self.encoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.decoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.W = L.Linear(n_units, n_target_vocab)

        self.n_layers = n_layers
        self.n_units = n_units
  • When we instantiate this class to make a model, we pass the number of stacked LSTMs as n_layers, the vocabulary size of the source language as n_source_vocab, the vocabulary size of the target language as n_target_vocab, and the size of the hidden vectors as n_units.

  • This network uses chainer.links.NStepLSTM, chainer.links.EmbedID, and chainer.links.Linear as its building blocks. All the layers are registered and initialized in the context with self.init_scope().

  • You can access all the parameters in those layers by calling self.params(), as in the sketch after this list.

  • In the constructor, it initializes all parameters with values sampled from a uniform distribution \(U(-1, 1)\).
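For instance, a hedged sketch of using this constructor (the vocabulary sizes below are placeholders; the layer and unit counts are just the script's defaults) that also counts the registered parameters via params():

# Instantiate the model and count the trainable parameters registered in init_scope().
model = Seq2seq(n_layers=3, n_source_vocab=40000, n_target_vocab=40000, n_units=1024)
n_params = sum(p.array.size for p in model.params())
print('number of parameters:', n_params)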

seq2seq.py
    def forward(self, xs, ys):
        xs = [x[::-1] for x in xs]

        eos = self.xp.array([EOS], numpy.int32)
        ys_in = [F.concat([eos, y], axis=0) for y in ys]
        ys_out = [F.concat([y, eos], axis=0) for y in ys]

        # Both xs and ys_in are lists of arrays.
        exs = sequence_embed(self.embed_x, xs)
        eys = sequence_embed(self.embed_y, ys_in)

        batch = len(xs)
        # None represents a zero vector in an encoder.
        hx, cx, _ = self.encoder(None, None, exs)
        _, _, os = self.decoder(hx, cx, eys)

        # It is faster to concatenate data before calculating loss
        # because only one matrix multiplication is called.
        concat_os = F.concat(os, axis=0)
        concat_ys_out = F.concat(ys_out, axis=0)
        loss = F.sum(F.softmax_cross_entropy(
            self.W(concat_os), concat_ys_out, reduce='no')) / batch

        chainer.report({'loss': loss}, self)
        n_words = concat_ys_out.shape[0]
        perp = self.xp.exp(loss.array * batch / n_words)
        chainer.report({'perp': perp}, self)
        return loss
  • The forward method takes the sequences of the source language’s word IDs xs and the sequences of the target language’s word IDs ys. Each sequence represents a sentence, and the length of xs equals the mini-batch size.

  • Note that the sequences of word IDs xs and ys are converted to vocabulary-size one-hot vectors and then multiplied with the embedding matrix in sequence_embed to obtain the embedding vectors exs and eys.

    seq2seq.py
    def sequence_embed(embed, xs):
        x_len = [len(x) for x in xs]
        x_section = numpy.cumsum(x_len[:-1])
        ex = embed(F.concat(xs, axis=0))
        exs = F.split_axis(ex, x_section, 0)
        return exs
    
  • self.encoder and self.decoder are the encoder and the decoder of the seq2seq model. Each element of the decoder output os is \(h_{[1:J]}^{(t)}\) in the figure above.

  • After calculating the recurrent layer output, the loss loss and the perplexity perp are calculated, and the values are logged by chainer.report.

Note

It is well known that the seq2seq model learns much better when the source sentences are reversed. The paper[1] says that “While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.” So, in the first line of forward, the input sentences are reversed: xs = [x[::-1] for x in xs].

seq2seq.py
    def translate(self, xs, max_length=100):
        batch = len(xs)
        with chainer.no_backprop_mode(), chainer.using_config('train', False):
            xs = [x[::-1] for x in xs]
            exs = sequence_embed(self.embed_x, xs)
            h, c, _ = self.encoder(None, None, exs)
            ys = self.xp.full(batch, EOS, numpy.int32)
            result = []
            for i in range(max_length):
                eys = self.embed_y(ys)
                eys = F.split_axis(eys, batch, 0)
                h, c, ys = self.decoder(h, c, eys)
                cys = F.concat(ys, axis=0)
                wy = self.W(cys)
                ys = self.xp.argmax(wy.array, axis=1).astype(numpy.int32)
                result.append(ys)

        # Using `xp.concatenate(...)` instead of `xp.stack(result)` here to
        # support NumPy 1.9.
        result = chainer.get_device('@numpy').send(
            self.xp.concatenate([x[None, :] for x in result]).T)

        # Remove EOS tags
        outs = []
        for y in result:
            inds = numpy.argwhere(y == EOS)
            if len(inds) > 0:
                y = y[:inds[0, 0]]
            outs.append(y)
        return outs
  • After the model has learned its parameters, the method translate is called to generate the translated sentences outs from the source sentences xs.

  • So as not to change the parameters, the translation code is nested in the scopes chainer.no_backprop_mode() and chainer.using_config('train', False).

2.2.4 Load French-English Corpus from WMT15 Dataset

In this tutorial, we use the French-English corpus from the WMT15 website, which contains 10^9 documents. We must prepare additional libraries, the dataset, and the parallel corpus. To understand the pre-processing, see 2.3.1 Requirements.

After pre-processing the dataset, let’s make the dataset objects:

seq2seq.py

# Load pre-processed dataset
print('[{}] Loading dataset... (this may take several minutes)'.format(
    datetime.datetime.now()))
source_ids = load_vocabulary(args.SOURCE_VOCAB)
target_ids = load_vocabulary(args.TARGET_VOCAB)

if args.use_dataset_api:
    # By using TextDataset, you can avoid loading whole dataset on memory.
    # This significantly reduces the host memory usage.
    def _filter_func(s, t):
        sl = len(s.strip().split())  # number of words in source line
        tl = len(t.strip().split())  # number of words in target line
        return (
            args.min_source_sentence <= sl <= args.max_source_sentence and
            args.min_target_sentence <= tl <= args.max_target_sentence)

    train_data = load_data_using_dataset_api(
        source_ids, args.SOURCE,
        target_ids, args.TARGET,
        _filter_func,
    )
else:
    # Load all records on memory.
    train_source = load_data(source_ids, args.SOURCE)
    train_target = load_data(target_ids, args.TARGET)
    assert len(train_source) == len(train_target)

    train_data = [
        (s, t)
        for s, t in six.moves.zip(train_source, train_target)
        if (args.min_source_sentence <= len(s) <= args.max_source_sentence
            and
            args.min_target_sentence <= len(t) <= args.max_target_sentence)
    ]
print('[{}] Dataset loaded.'.format(datetime.datetime.now()))

if not args.use_dataset_api:
    # Skip printing statistics when using TextDataset API, as it is slow.
    train_source_unknown = calculate_unknown_ratio(
        [s for s, _ in train_data])
    train_target_unknown = calculate_unknown_ratio(
        [t for _, t in train_data])

    print('Source vocabulary size: %d' % len(source_ids))
    print('Target vocabulary size: %d' % len(target_ids))
    print('Train data size: %d' % len(train_data))
    print('Train source unknown ratio: %.2f%%' % (
        train_source_unknown * 100))
    print('Train target unknown ratio: %.2f%%' % (
        train_target_unknown * 100))

target_words = {i: w for w, i in target_ids.items()}
source_words = {i: w for w, i in source_ids.items()}

  • This code uses the utility functions below:

    seq2seq.py
    def load_vocabulary(path):
        with io.open(path, encoding='utf-8') as f:
            # +2 for UNK and EOS
            word_ids = {line.strip(): i + 2 for i, line in enumerate(f)}
        word_ids['<UNK>'] = 0
        word_ids['<EOS>'] = 1
        return word_ids
    
    seq2seq.py
    def load_data(vocabulary, path):
        n_lines = count_lines(path)
        bar = progressbar.ProgressBar()
        data = []
        print('loading...: %s' % path)
        with io.open(path, encoding='utf-8') as f:
            for line in bar(f, max_value=n_lines):
                words = line.strip().split()
                array = numpy.array([vocabulary.get(w, UNK)
                                     for w in words], numpy.int32)
                data.append(array)
        return data
    
    seq2seq.py
    def calculate_unknown_ratio(data):
        unknown = sum((s == UNK).sum() for s in data)
        total = sum(s.size for s in data)
        return unknown / total
    
2.2.5 Define Evaluation Function (BLEU Score)

BLEU[3] (bilingual evaluation understudy) is an evaluation metric for the quality of text that has been machine-translated from one natural language to another.

seq2seq.py
class CalculateBleu(chainer.training.Extension):

    trigger = 1, 'epoch'
    priority = chainer.training.PRIORITY_WRITER

    def __init__(
            self, model, test_data, key, device, batch=100, max_length=100):
        self.model = model
        self.test_data = test_data
        self.key = key
        self.batch = batch
        self.device = device
        self.max_length = max_length

    def __call__(self, trainer):
        device = self.device

        with chainer.no_backprop_mode():
            references = []
            hypotheses = []
            for i in range(0, len(self.test_data), self.batch):
                sources, targets = zip(*self.test_data[i:i + self.batch])
                references.extend([[t.tolist()] for t in targets])

                sources = [device.send(x) for x in sources]
                ys = [y.tolist()
                      for y in self.model.translate(sources, self.max_length)]
                hypotheses.extend(ys)

        bleu = bleu_score.corpus_bleu(
            references, hypotheses,
            smoothing_function=bleu_score.SmoothingFunction().method1)
        chainer.report({self.key: bleu})
2.2.6 Create Iterator

Here, the code below just creates iterator objects.

seq2seq.py
train_iter = chainer.iterators.SerialIterator(train_data, args.batchsize)

2.2.7 Create RNN and Classification Model

Instantiate the Seq2seq model.

seq2seq.py
model = Seq2seq(args.layer, len(source_ids), len(target_ids), args.unit)
2.2.8 Setup Optimizer

Prepare an optimizer. We use chainer.optimizers.Adam.

seq2seq.py
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)

2.2.9 Setup and Run Trainer

Let’s make a trainer object.

seq2seq.py
updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)
trainer.extend(extensions.LogReport(
    trigger=(args.log_interval, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'main/loss', 'main/perp',
     'validation/main/bleu', 'elapsed_time']),
    trigger=(args.log_interval, 'iteration'))

trainer.extend(
    extensions.snapshot(filename='snapshot_epoch_{.updater.iteration}'),
    trigger=(args.validation_interval, 'iteration'))

Set up the trainer extensions to report the BLEU score on the test data.

seq2seq.py
    test_source = load_data(source_ids, args.validation_source)
    test_target = load_data(target_ids, args.validation_target)
    assert len(test_source) == len(test_target)
    test_data = list(six.moves.zip(test_source, test_target))
    test_data = [(s, t) for s, t in test_data if 0 < len(s) and 0 < len(t)]
    test_source_unknown = calculate_unknown_ratio(
        [s for s, _ in test_data])
    test_target_unknown = calculate_unknown_ratio(
        [t for _, t in test_data])

    print('Validation data: %d' % len(test_data))
    print('Validation source unknown ratio: %.2f%%' %
          (test_source_unknown * 100))
    print('Validation target unknown ratio: %.2f%%' %
          (test_target_unknown * 100))

    @chainer.training.make_extension()
    def translate(trainer):
        source, target = test_data[numpy.random.choice(len(test_data))]
        result = model.translate([model.xp.array(source)])[0]

        source_sentence = ' '.join([source_words[x] for x in source])
        target_sentence = ' '.join([target_words[y] for y in target])
        result_sentence = ' '.join([target_words[y] for y in result])
        print('# source : ' + source_sentence)
        print('# result : ' + result_sentence)
        print('# expect : ' + target_sentence)

    trainer.extend(
        translate, trigger=(args.validation_interval, 'iteration'))
    trainer.extend(
        CalculateBleu(
            model, test_data, 'validation/main/bleu', device),
        trigger=(args.validation_interval, 'iteration'))

if args.resume is not None:
    # Resume from a snapshot
    chainer.serializers.load_npz(args.resume, trainer)

Let’s start the training!

seq2seq.py

trainer.run()

if args.save is not None:
    # Save a snapshot
    chainer.serializers.save_npz(args.save, trainer)


2.3 Run Example
2.3.1 Requirements

Before running the example, you must prepare the additional libraries, the dataset, and the parallel corpus.

2.3.2 Training the model

You can train the model with the script: chainer/examples/seq2seq/seq2seq.py

$ pwd
/root2chainer/chainer/examples/seq2seq
$ python seq2seq.py --gpu=0 giga-fren.preprocess.en giga-fren.preprocess.fr \
vocab.en vocab.fr \
--validation-source newstest2013.preprocess.en \
--validation-target newstest2013.preprocess.fr > log
100% (22520376 of 22520376) |#############| Elapsed Time: 0:09:20 Time: 0:09:20
100% (22520376 of 22520376) |#############| Elapsed Time: 0:10:36 Time: 0:10:36
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
epoch       iteration   main/loss   validation/main/loss  main/perp   validation/main/perp  validation/main/bleu  elapsed_time
0           200         171.449                           991.556                                                 85.6739
0           400         143.918                           183.594                                                 172.473
0           600         133.48                            126.945                                                 260.315
0           800         128.734                           104.127                                                 348.062
0           1000        124.741                           91.5988                                                 436.536
...

Note

Before running the script, be careful about the locale and Python’s encoding. Please set them up to use UTF-8 encoding.

2.3.3 Validate the model

While you are training the model, you can get the validation results:

...
# source : We knew the Government had tried many things , like launching <UNK> with <UNK> or organising speed dating evenings .
# result : Nous savions que le gouvernement avait <UNK> plusieurs fois , comme le <UNK> <UNK> , le <UNK> ou le <UNK> <UNK> .
# expect : Nous savions que le gouvernement avait tenté plusieurs choses comme lancer des parfums aux <UNK> ou organiser des soirées de <UNK>
...

3. Reference

API Reference

Variable and Parameter

Variable classes and utilities

chainer.Variable

Array with a structure to keep track of computation.

chainer.as_array

Returns the underlying array from a variable or an array.

chainer.as_variable

Converts an array or a variable into Variable.

chainer.backward

Runs backpropagation from variables simultaneously.

chainer.Parameter

Parameter variable that can be registered to a link.

chainer.variable.VariableNode

Node in the backward computational graph representing a variable.

N-dimensional array

chainer.Variable holds its value as an n-dimensional array (ndarray). Chainer supports NumPy (numpy.ndarray), CuPy (cupy.ndarray), and ChainerX (chainerx.ndarray) arrays as this value.

Note

Python scalars (float, etc.) and NumPy scalars (numpy.float16, numpy.float32, etc.) cannot be used as chainer.Variable.array. See also chainer.utils.force_array().
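For example, a small sketch (not taken from the reference itself) of wrapping a scalar with chainer.utils.force_array():

import numpy as np
import chainer
from chainer.utils import force_array

x = chainer.Variable(np.array([1.0, 2.0], dtype=np.float32))  # an ndarray is accepted
# A bare NumPy scalar is not an ndarray; wrap it with force_array() to get a 0-dim array.
y = chainer.Variable(force_array(np.float32(3.0)))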

Functions

Chainer provides a variety of built-in function implementations in the chainer.functions package. These functions usually return a Variable object or a tuple of multiple Variable objects. For a Variable argument of a function, an N-dimensional array can be passed if you do not need its gradient. Some functions additionally support scalar arguments.
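A small usage sketch of this behavior (the values are illustrative):

import numpy as np
import chainer.functions as F

x = np.array([[-1.0, 2.0]], dtype=np.float32)
y = F.relu(x)            # an ndarray can be passed when its gradient is not needed
print(type(y))           # <class 'chainer.variable.Variable'>
print(y.array)           # [[0. 2.]]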

Note

Functions implemented in Chainer consist of the following two parts:

  • A class that inherits FunctionNode, which defines forward/backward computation.

  • A “wrapper” function around the class.

APIs listed on this page are “wrappers” of FunctionNode implementations. In most cases, you don’t have to use FunctionNode classes directly.

For example, chainer.functions.sum() is a wrapper function defined as def sum(...): in chainer/functions/math/sum.py, and it calls its corresponding FunctionNode implementation, Sum. Some functions may not have a corresponding FunctionNode implementation; one example is chainer.functions.average(), defined in chainer/functions/math/average.py, which calls other wrapper functions to calculate the average.

If you are implementing your own functions, please see Define your own function.

Arithmetic functions

Basic arithmetic operations for Variables are implemented as operators. Refer to the Notes section of Variable for details.

chainer.functions.add() provides better performance when accumulating three or more Variables at once.
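For example (a sketch with illustrative values):

import numpy as np
import chainer
import chainer.functions as F

a = chainer.Variable(np.array([1.0], dtype=np.float32))
b = chainer.Variable(np.array([2.0], dtype=np.float32))
c = chainer.Variable(np.array([3.0], dtype=np.float32))

d = a + b * c            # operators build the computational graph
e = F.add(a, b, c)       # single accumulation; can be faster than a + b + c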

chainer.functions.add

Element-wise addition.

Activation functions

chainer.functions.clipped_relu

Clipped Rectifier Unit function.

chainer.functions.crelu

Concatenated Rectified Linear Unit function.

chainer.functions.elu

Exponential Linear Unit function.

chainer.functions.hard_sigmoid

Element-wise hard-sigmoid function.

chainer.functions.leaky_relu

Leaky Rectified Linear Unit function.

chainer.functions.log_softmax

Channel-wise log-softmax function.

chainer.functions.lstm

Long Short-Term Memory units as an activation function.

chainer.functions.maxout

Maxout activation function.

chainer.functions.prelu

Parametric ReLU function.

chainer.functions.rrelu

Randomized Leaky Rectified Liner Unit function.

chainer.functions.relu

Rectified Linear Unit function.

chainer.functions.relu6

Rectifier Unit function clipped at 6.

chainer.functions.selu

Scaled Exponential Linear Unit function.

chainer.functions.sigmoid

Element-wise sigmoid logistic function.

chainer.functions.slstm

S-LSTM units as an activation function.

chainer.functions.softmax

Softmax function.

chainer.functions.softplus

Element-wise softplus function.

chainer.functions.swish

Swish activation function.

chainer.functions.tanh

Elementwise hyperbolic tangent function.

chainer.functions.tree_lstm

TreeLSTM unit as an activation function.

Array manipulations

chainer.functions.as_strided

Create a new view of array with the given shape, strides, and offset.

chainer.functions.broadcast

Broadcast given variables.

chainer.functions.broadcast_to

Broadcast a given variable to a given shape.

chainer.functions.cast

Cast an input variable to a given type.

chainer.functions.concat

Concatenates given variables along an axis.

chainer.functions.copy

Copies the input variable onto the specified device.

chainer.functions.depth2space

Computes the depth2space transformation for subpixel calculations.

chainer.functions.diagonal

Take diagonal

chainer.functions.dstack

Concatenate variables along third axis (depth wise).

chainer.functions.expand_dims

Expands dimensions of an input variable without copy.

chainer.functions.flatten

Flatten a given array into one dimension.

chainer.functions.flip

Flips an input variable in reverse order along the given axis.

chainer.functions.fliplr

Flip array in the left/right direction.

chainer.functions.flipud

Flip array in the up/down direction.

chainer.functions.get_item

Extract elements from array with specified shape, axes and offsets.

chainer.functions.hstack

Concatenate variables horizontally (column wise).

chainer.functions.im2col

Extract patches from an image based on the filter.

chainer.functions.moveaxis

Move the source axes to the destination.

chainer.functions.pad

Pad an input variable.

chainer.functions.pad_sequence

Pad given arrays to make a matrix.

chainer.functions.permutate

Permutates a given variable along an axis.

chainer.functions.repeat

Construct an array by repeating a given array.

chainer.functions.reshape

Reshapes an input variable without copy.

chainer.functions.resize_images

Resize images to the given shape.

chainer.functions.rollaxis

Roll the axis backwards to the given position.

chainer.functions.scatter_add

Adds given values to specified elements of an array.

chainer.functions.select_item

Select elements stored in given indices.

chainer.functions.separate

Separates an array along a given axis.

chainer.functions.space2depth

Computes the space2depth transformation for subpixel calculations.

chainer.functions.spatial_transformer_grid

2D Spatial Transformer grid.

chainer.functions.spatial_transformer_sampler

2D Spatial Transformer sampler.

chainer.functions.split_axis

Splits given variables along an axis.

chainer.functions.squeeze

Remove dimensions of size one from the shape of a ndarray.

chainer.functions.stack

Concatenate variables along a new axis.

chainer.functions.swapaxes

Swap two axes of a variable.

chainer.functions.tile

Construct an array by tiling a given array.

chainer.functions.transpose

Permute the dimensions of an input variable without copy.

chainer.functions.transpose_sequence

Transpose a list of Variables.

chainer.functions.vstack

Concatenate variables vertically (row wise).

chainer.functions.where

Choose elements depending on condition.

Neural network connections

chainer.functions.bilinear

Applies a bilinear function based on given parameters.

chainer.functions.convolution_1d

1-dimensional convolution function.

chainer.functions.convolution_2d

Two-dimensional convolution function.

chainer.functions.convolution_3d

3-dimensional convolution function.

chainer.functions.convolution_nd

N-dimensional convolution function.

chainer.functions.deconvolution_1d

1-dimensional deconvolution function.

chainer.functions.deconvolution_2d

Two dimensional deconvolution function.

chainer.functions.deconvolution_3d

3-dimensional deconvolution function.

chainer.functions.deconvolution_nd

N-dimensional deconvolution function.

chainer.functions.depthwise_convolution_2d

Two-dimensional depthwise convolution function.

chainer.functions.deformable_convolution_2d_sampler

Two-dimensional deformable convolution function using computed offset.

chainer.functions.dilated_convolution_2d

Two-dimensional dilated convolution function.

chainer.functions.embed_id

Efficient linear function for one-hot input.

chainer.functions.linear

Linear function, or affine transformation.

chainer.functions.local_convolution_2d

Two-dimensional local convolution function.

chainer.functions.n_step_bigru

Stacked Bi-directional Gated Recurrent Unit function.

chainer.functions.n_step_bilstm

Stacked Bi-directional Long Short-Term Memory function.

chainer.functions.n_step_birnn

Stacked Bi-directional RNN function for sequence inputs.

chainer.functions.n_step_gru

Stacked Uni-directional Gated Recurrent Unit function.

chainer.functions.n_step_lstm

Stacked Uni-directional Long Short-Term Memory function.

chainer.functions.n_step_rnn

Stacked Uni-directional RNN function for sequence inputs.

chainer.functions.shift

Shift function.

Evaluation functions

chainer.functions.accuracy

Computes multiclass classification accuracy of the minibatch.

chainer.functions.binary_accuracy

Computes binary classification accuracy of the minibatch.

chainer.functions.classification_summary

Calculates Precision, Recall, F beta Score, and support.

chainer.functions.f1_score

chainer.functions.precision

chainer.functions.r2_score

Computes R^2 (coefficient of determination) regression score function.

chainer.functions.recall

Loss functions

chainer.functions.absolute_error

Element-wise absolute error function.

chainer.functions.bernoulli_nll

Computes the negative log-likelihood of a Bernoulli distribution.

chainer.functions.black_out

BlackOut loss function.

chainer.functions.connectionist_temporal_classification

Connectionist Temporal Classification loss function.

chainer.functions.contrastive

Computes contrastive loss.

chainer.functions.crf1d

Calculates negative log-likelihood of linear-chain CRF.

chainer.functions.argmax_crf1d

Computes a state that maximizes a joint probability of the given CRF.

chainer.functions.cross_covariance

Computes the sum-squared cross-covariance penalty between y and z

chainer.functions.decov

Computes the DeCov loss of h

chainer.functions.discriminative_margin_based_clustering_loss

Discriminative margin-based clustering loss function

chainer.functions.gaussian_kl_divergence

Computes the KL-divergence of Gaussian variables from the standard one.

chainer.functions.gaussian_nll

Computes the negative log-likelihood of a Gaussian distribution.

chainer.functions.hinge

Computes the hinge loss for a one-of-many classification task.

chainer.functions.huber_loss

Computes the Huber loss.

chainer.functions.mean_absolute_error

Mean absolute error function.

chainer.functions.mean_squared_error

Mean squared error function.

chainer.functions.negative_sampling

Negative sampling loss function.

chainer.functions.sigmoid_cross_entropy

Computes cross entropy loss for pre-sigmoid activations.

chainer.functions.softmax_cross_entropy

Computes cross entropy loss for pre-softmax activations.

chainer.functions.squared_error

Squared error function.

chainer.functions.triplet

Computes triplet loss.

Mathematical functions

chainer.functions.absolute

Element-wise absolute.

chainer.functions.arccos

Elementwise arccosine function.

chainer.functions.arcsin

Elementwise arcsine function.

chainer.functions.arctan

Elementwise arctangent function.

chainer.functions.arctan2

Elementwise arctangent function with two arguments.

chainer.functions.arctanh

Elementwise inverse hyperbolic tangent function.

chainer.functions.argmax

Returns index which holds maximum of array elements over a given axis.

chainer.functions.argmin

Returns index which holds minimum of array elements over a given axis.

chainer.functions.average

Calculate weighted average of array elements over a given axis.

chainer.functions.batch_inv

Computes the inverse of a batch of square matrices.

chainer.functions.batch_l2_norm_squared

L2 norm (a.k.a. Euclidean norm) squared.

chainer.functions.batch_matmul

Computes the batch matrix multiplications of two sets of arrays.

chainer.functions.bias

Elementwise summation with broadcasting.

chainer.functions.ceil

Elementwise ceil function.

chainer.functions.cholesky

Cholesky Decomposition

chainer.functions.clip

Clips (limits) elements of input variable.

chainer.functions.cos

Elementwise cos function.

chainer.functions.cosh

Elementwise hyperbolic cosine function.

chainer.functions.cumprod

Cumulative prod of array elements over a given axis.

chainer.functions.cumsum

Cumulative sum of array elements over a given axis.

chainer.functions.det

Computes the determinant of a single square matrix.

chainer.functions.batch_det

Computes the determinant of a batch of square matrices.

chainer.functions.digamma

Digamma function.

chainer.functions.einsum

Einstein summation

chainer.functions.erf

Elementwise error function.

chainer.functions.erfc

Elementwise complementary error function.

chainer.functions.erfcinv

Elementwise inverse function of complementary error function.

chainer.functions.erfcx

Elementwise scaled complementary error function.

chainer.functions.erfinv

Elementwise inverse function of error function.

chainer.functions.exp

Elementwise exponential function.

chainer.functions.expm1

Elementwise exponential minus one function.

chainer.functions.fft

Fast Fourier transform.

chainer.functions.fix

Elementwise fix function.

chainer.functions.fmod

Elementwise mod function.

chainer.functions.floor

Elementwise floor function.

chainer.functions.identity

Just returns input variables.

chainer.functions.ifft

Inverse fast Fourier transform.

chainer.functions.inv

Computes the inverse of square matrix.

chainer.functions.lgamma

logarithm of gamma function.

chainer.functions.linear_interpolate

Elementwise linear-interpolation function.

chainer.functions.log

Elementwise natural logarithm function.

chainer.functions.log10

Elementwise logarithm function to the base 10.

chainer.functions.log1p

Elementwise natural logarithm plus one function.

chainer.functions.log2

Elementwise logarithm function to the base 2.

chainer.functions.log_ndtr

Logarithm of cumulative distribution function of normal distribution.

chainer.functions.logsumexp

Log-sum-exp of array elements over a given axis.

chainer.functions.matmul

Computes the matrix multiplication of two arrays.

chainer.functions.max

Maximum of array elements over a given axis.

chainer.functions.maximum

Element-wise maximum of input variables.

chainer.functions.mean

Calculate weighted average of array elements over a given axis.

chainer.functions.min

Minimum of array elements over a given axis.

chainer.functions.minimum

Element-wise minimum of input variables.

chainer.functions.ndtr

Elementwise cumulative distribution function of normal distribution.

chainer.functions.ndtri

Elementwise inverse function of ndtr.

chainer.functions.prod

Product of array elements over a given axis.

chainer.functions.polygamma

Polygamma function.

chainer.functions.rsqrt

Computes elementwise reciprocal of square root of input \(x_i\).

chainer.functions.scale

Elementwise product with broadcasting.

chainer.functions.sin

Elementwise sin function.

chainer.functions.sinh

Elementwise hyperbolic sine function.

chainer.functions.sign

Elementwise sign function.

chainer.functions.sparse_matmul

Computes the batched multiplication of sparse and dense matrix.

chainer.functions.sqrt

Elementwise square root function.

chainer.functions.square

Elementwise square function.

chainer.functions.squared_difference

Squared difference function.

chainer.functions.sum

Sum of array elements over a given axis.

chainer.functions.sum_to

Sum elements along axes to output an array of a given shape.

chainer.functions.tanh

Elementwise hyperbolic tangent function.

chainer.functions.tan

Elementwise tan function.

chainer.functions.tensordot

Returns the tensor dot product of two arrays along specified axes.

chainer.functions.zeta

Zeta function.

Noise injections

chainer.functions.dropout

Drops elements of input variable randomly.

chainer.functions.gaussian

Gaussian sampling function.

chainer.functions.gumbel_softmax

Gumbel-Softmax sampling function.

chainer.functions.simplified_dropconnect

Linear unit regularized by simplified dropconnect.

chainer.functions.zoneout

Drops elements of input variable and sets to previous variable randomly.

Normalization functions

chainer.functions.batch_normalization

Batch normalization function.

chainer.functions.batch_renormalization

Batch renormalization function.

chainer.functions.decorrelated_batch_normalization

Decorrelated batch normalization function.

chainer.functions.fixed_batch_normalization

Batch normalization function with fixed statistics.

chainer.functions.fixed_batch_renormalization

chainer.functions.fixed_decorrelated_batch_normalization

Decorrelated batch normalization function with fixed statistics.

chainer.functions.group_normalization

Group normalization function.

chainer.functions.layer_normalization

Layer normalization.

chainer.functions.local_response_normalization

Local response normalization across neighboring channels.

chainer.functions.normalize

Normalize input by L2 norm.

Spatial pooling

chainer.functions.average_pooling_1d

1-dimensional spatial average pooling function.

chainer.functions.average_pooling_2d

Spatial average pooling function.

chainer.functions.average_pooling_3d

3-dimensional spatial average pooling function.

chainer.functions.average_pooling_nd

N-dimensionally spatial average pooling function.

chainer.functions.max_pooling_1d

1-dimensional spatial max pooling function.

chainer.functions.max_pooling_2d

Spatial max pooling function.

chainer.functions.max_pooling_3d

3-dimensional spatial max pooling function.

chainer.functions.max_pooling_nd

N-dimensionally spatial max pooling function.

chainer.functions.roi_average_align_2d

Spatial Region of Interest (ROI) average align function.

chainer.functions.roi_average_pooling_2d

Spatial Region of Interest (ROI) average pooling function.

chainer.functions.roi_max_align_2d

Spatial Region of Interest (ROI) max align function.

chainer.functions.roi_max_pooling_2d

Spatial Region of Interest (ROI) max pooling function.

chainer.functions.roi_pooling_2d

Spatial Region of Interest (ROI) pooling function.

chainer.functions.spatial_pyramid_pooling_2d

Spatial pyramid pooling function.

chainer.functions.unpooling_1d

Inverse operation of 1-dimensional spatial pooling.

chainer.functions.unpooling_2d

Inverse operation of pooling for 2d array.

chainer.functions.unpooling_3d

Inverse operation of 3-dimensional spatial pooling.

chainer.functions.unpooling_nd

Inverse operation of N-dimensional spatial pooling.

chainer.functions.upsampling_2d

Upsampling using pooling indices.

Utility functions

chainer.functions.forget

Calls a function without storing intermediate results.

Function base

chainer.Function

Old-style interface of a differentiable function.

chainer.FunctionAdapter

Adapter class to wrap Function with FunctionNode.

chainer.FunctionNode

Function node of the computational graph.

chainer.force_backprop_mode

Make a context manager which enables back-propagation.

chainer.no_backprop_mode

Make a context manager which disables back-propagation.

chainer.grad

Computes the gradient of output variables w.r.t. the input variables.

Function hooks

Chainer provides a function-hook mechanism that enriches the behavior of forward and backward propagation of FunctionNode and Function.

chainer.function_hooks.CUDAProfileHook

chainer.function_hooks.CupyMemoryProfileHook

Function hook for measuring memory usage of functions in cupy memory pool.

chainer.function_hooks.PrintHook

Function hook that prints debug information.

chainer.function_hooks.TimerHook

Function hook for measuring elapsed time of functions.

You can also implement your own function-hook to inject arbitrary code before/after the forward/backward propagation.
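For example, a hedged sketch using the built-in TimerHook as a context manager (the shapes are illustrative):

import numpy as np
import chainer.functions as F
from chainer.function_hooks import TimerHook

x = np.random.rand(100, 100).astype(np.float32)
W = np.random.rand(50, 100).astype(np.float32)
with TimerHook() as hook:          # the hook observes every function call in this block
    y = F.relu(F.linear(x, W))
hook.print_report()                # per-function elapsed times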

chainer.FunctionHook

Base class of hooks for Functions.

Probability Distributions

Chainer provides many Distribution implementations in the chainer.distributions package.
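As a small sketch (the parameters are illustrative), a Normal distribution can be built and queried like this:

import numpy as np
from chainer import distributions

d = distributions.Normal(
    loc=np.zeros(2, dtype=np.float32), scale=np.ones(2, dtype=np.float32))
print(d.mean.array)                                  # [0. 0.]
print(d.log_prob(np.zeros(2, dtype=np.float32)))     # log-density at the mean
print(d.sample(sample_shape=(3,)).shape)             # (3, 2)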

Distributions

chainer.distributions.Bernoulli

Bernoulli Distribution.

chainer.distributions.Beta

Beta Distribution.

chainer.distributions.Categorical

Categorical Distribution.

chainer.distributions.Cauchy

Cauchy Distribution.

chainer.distributions.Chisquare

Chi-Square Distribution.

chainer.distributions.Dirichlet

Dirichlet Distribution.

chainer.distributions.Exponential

Exponential Distribution.

chainer.distributions.Gamma

Gamma Distribution.

chainer.distributions.Geometric

Geometric Distribution.

chainer.distributions.Gumbel

Gumbel Distribution.

chainer.distributions.Independent

Independent distribution.

chainer.distributions.Laplace

Laplace Distribution.

chainer.distributions.LogNormal

Logarithm Normal Distribution.

chainer.distributions.MultivariateNormal

MultivariateNormal Distribution.

chainer.distributions.Normal

Normal Distribution.

chainer.distributions.OneHotCategorical

OneHotCategorical Distribution.

chainer.distributions.Pareto

Pareto Distribution.

chainer.distributions.Poisson

Poisson Distribution.

chainer.distributions.Uniform

Uniform Distribution.

Functionals of distribution

chainer.cross_entropy

Computes Cross entropy.

chainer.kl_divergence

Computes Kullback-Leibler divergence.

chainer.register_kl

Decorator to register KL divergence function.

Base classes

chainer.Distribution

Interface of Distribution

Optimizers

chainer.optimizers.AdaDelta

Zeiler's ADADELTA.

chainer.optimizers.AdaGrad

AdaGrad optimizer.

chainer.optimizers.Adam

Adam optimizer.

chainer.optimizers.AdamW

AdamW optimizer.

chainer.optimizers.AMSGrad

AMSGrad optimizer.

chainer.optimizers.AdaBound

AdaBound optimizer.

chainer.optimizers.AMSBound

AMSBound optimizer.

chainer.optimizers.CorrectedMomentumSGD

Momentum SGD optimizer.

chainer.optimizers.MomentumSGD

Momentum SGD optimizer.

chainer.optimizers.NesterovAG

Nesterov's Accelerated Gradient.

chainer.optimizers.MSVAG

M-SVAG optimizer.

chainer.optimizers.RMSprop

RMSprop optimizer.

chainer.optimizers.RMSpropGraves

Alex Graves's RMSprop.

chainer.optimizers.SGD

Vanilla Stochastic Gradient Descent.

chainer.optimizers.SMORMS3

Simon Funk's SMORMS3.

Optimizer base classes

chainer.Optimizer

Base class of all numerical optimizers.

chainer.UpdateRule

Base class of all update rules.

chainer.optimizer.Hyperparameter

Set of hyperparameter entries of an optimizer.

chainer.GradientMethod

Base class of all single gradient-based optimizers.

Hook functions

chainer.optimizer_hooks.WeightDecay

Optimizer/UpdateRule hook function for weight decay regularization.

chainer.optimizer_hooks.Lasso

Optimizer/UpdateRule hook function for Lasso regularization.

chainer.optimizer_hooks.GradientClipping

Optimizer hook function for gradient clipping.

chainer.optimizer_hooks.GradientHardClipping

Optimizer/UpdateRule hook function for gradient clipping.

chainer.optimizer_hooks.GradientNoise

Optimizer/UpdateRule hook function for adding gradient noise.

chainer.optimizer_hooks.GradientLARS

Optimizer/UpdateRule hook function for layer wise adaptive rate scaling.

Weight Initializers

Weight initializers are used to initialize arrays. They destructively modify the content of numpy.ndarray or cupy.ndarray. Typically, weight initializers are passed to Links to initialize their weights and biases.

A weight initializer can be any of the following objects.

If an initializer object has the dtype attribute, the initializer can assume that the array to feed the data into has that dtype. If the required dtype, depending on the context where the initializer is used, does not match the dtype attribute, Chainer will report an error.
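For example, a sketch (the layer sizes are illustrative) of passing initializers to a link:

import chainer.links as L
from chainer import initializers

layer = L.Linear(100, 50,
                 initialW=initializers.HeNormal(),
                 initial_bias=initializers.Zero())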

Base class

chainer.Initializer

Initializes array.

Concrete initializers

chainer.initializers.Identity

Initializes array with the identity matrix.

chainer.initializers.Constant

Initializes array with constant value.

chainer.initializers.Zero

Initializes array to all-zero.

chainer.initializers.One

Initializes array to all-one.

chainer.initializers.NaN

Initializes array to all-NaN.

chainer.initializers.Normal

Initializes array with a normal distribution.

chainer.initializers.LeCunNormal

Initializes array with scaled Gaussian distribution.

chainer.initializers.GlorotNormal

Initializes array with scaled Gaussian distribution.

chainer.initializers.HeNormal

Initializes array with scaled Gaussian distribution.

chainer.initializers.Orthogonal

Initializes array with an orthogonal system.

chainer.initializers.Uniform

Initializes array with a scaled uniform distribution.

chainer.initializers.LeCunUniform

Initializes array with a scaled uniform distribution.

chainer.initializers.GlorotUniform

Initializes array with a scaled uniform distribution.

chainer.initializers.HeUniform

Initializes array with scaled uniform distribution.

chainer.initializers.UpsamplingDeconvFilter

Initializes array with upsampling filter.

chainer.initializers.DownsamplingConvFilter

Initializes array with downsampling filter.

Helper function

chainer.initializers.generate_array

Return initialized array.

Snapshot Writers

chainer.training.extensions.snapshot_writers.Writer

Base class of snapshot writers.

chainer.training.extensions.snapshot_writers.SimpleWriter

The most simple snapshot writer.

chainer.training.extensions.snapshot_writers.ThreadWriter

Snapshot writer that uses a separate thread.

chainer.training.extensions.snapshot_writers.ProcessWriter

Snapshot writer that uses a separate process.

chainer.training.extensions.snapshot_writers.QueueWriter

Base class of queue snapshot writers.

chainer.training.extensions.snapshot_writers.ThreadQueueWriter

Snapshot writer that uses a thread queue.

chainer.training.extensions.snapshot_writers.ProcessQueueWriter

Snapshot writer that uses process queue.

Training Tools

Chainer provides a standard implementation of the training loops under the chainer.training module. It is built on top of many other core features of Chainer, including Variable and Function, Link/Chain/ChainList, Optimizer, Dataset, and Reporter/Summary. Compared to the training loop abstraction of other machine learning tool kits, Chainer’s training framework aims at maximal flexibility while keeping simplicity for typical usages. Most components are pluggable, and users can override their definitions.

The core of the training loop abstraction is Trainer, which implements the training loop itself. The training loop consists of two parts: one is Updater, which actually updates the parameters to train, and the other is Extension for arbitrary functionalities other than the parameter update.

Updater and some extensions use chainer.dataset and Iterator to scan the datasets and load mini-batches. The trainer also uses Reporter to collect the observed values, and some extensions use DictSummary to accumulate them and compute the statistics.

You can find many usage examples of these training utilities in the official examples. You can also find the built-in extension implementations under Extensions.
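A minimal, self-contained sketch of this structure, using a toy random dataset (all names, sizes, and hyperparameters are illustrative):

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions

# Toy data and a one-layer classifier, just to exercise the training loop.
x = np.random.rand(100, 4).astype(np.float32)
t = np.random.randint(0, 2, size=(100,)).astype(np.int32)
train_data = chainer.datasets.TupleDataset(x, t)

model = L.Classifier(L.Linear(4, 2), lossfun=F.softmax_cross_entropy)
optimizer = chainer.optimizers.SGD()
optimizer.setup(model)

train_iter = chainer.iterators.SerialIterator(train_data, batch_size=10)
updater = training.updaters.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (3, 'epoch'), out='result')
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy']))
trainer.run()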

Trainer

chainer.training.Trainer

The standard training loop in Chainer.

Updaters

chainer.training.Updater

Interface of updater objects for trainers.

chainer.training.updaters.StandardUpdater

Standard implementation of Updater.

chainer.training.updaters.ParallelUpdater

Implementation of a parallel GPU Updater.

chainer.training.updaters.MultiprocessParallelUpdater

Implementation of a multiprocess parallel GPU Updater.

We have two kinds of updaters for multi-GPU training. The pros/cons of the updaters are as follows (a small data-parallel sketch follows the lists):

ParallelUpdater:

  • (+) Can use the same iterator for any number of GPUs

  • (-) No parallelism at CPU side

  • (-) GPUs used later may be blocked due to the limit of kernel-launch queue size

MultiprocessParallelUpdater:

  • (+) Parallelism at CPU side

  • (+) No degradation due to the kernel-launch queue size

  • (-) Needs a per-process data iterator

  • (-) Reporter cannot collect data except for one of the devices
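As a hedged fragment (it assumes two GPUs and a train_iter and optimizer defined as in the seq2seq example above; it is not a complete script), ParallelUpdater shares a single iterator across the devices:

from chainer import training
from chainer.dataset import concat_examples

updater = training.updaters.ParallelUpdater(
    train_iter, optimizer, converter=concat_examples,
    devices={'main': 0, 'second': 1})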

Extensions

An extension is a callable object that can perform arbitrary actions during the training loop. Extensions can be registered to Trainer by using the Trainer.extend() method, and they are invoked when the Trigger condition is satisfied.

In addition to the built-in extensions listed below, you can define your own extension by implementing Extension or using the make_extension() decorator. See Trainer Extensions for details.
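For example, a sketch of a user-defined extension via the make_extension() decorator (the function name is made up; it would be registered on an existing trainer with trainer.extend()):

from chainer import training

@training.make_extension(trigger=(1, 'epoch'))
def report_iteration(trainer):
    print('iteration:', trainer.updater.iteration)

# trainer.extend(report_iteration)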

Common

chainer.training.Extension

Base class of trainer extensions.

chainer.training.make_extension

Decorator to make given functions into trainer extensions.

Evaluation and Metrics Collection

These extensions provide features to collect additional metrics. The typical use case is to use Evaluator to perform evaluation with a validation dataset to compute validation loss/accuracy.

chainer.training.extensions.Evaluator

Trainer extension to evaluate models on a validation set.

chainer.training.extensions.MicroAverage

Calculates micro-average ratio.

chainer.training.extensions.FailOnNonNumber

Trainer extension to raise RuntimeError if parameters contain NaN or Inf.

chainer.training.extensions.ParameterStatistics

Trainer extension to report parameter statistics.

chainer.training.extensions.observe_lr

Returns a trainer extension to record the learning rate.

chainer.training.extensions.observe_value

Returns a trainer extension to continuously record a value.

Optimizer Behavior Control

These extensions provide features to adjust optimizer behavior. The typical use case is to change the learning rate of the optimizer over time.

chainer.training.extensions.ExponentialShift

Trainer extension to exponentially shift an optimizer attribute.

chainer.training.extensions.InverseShift

Trainer extension to shift an optimizer attribute.

chainer.training.extensions.LinearShift

Trainer extension to change an optimizer attribute linearly.

chainer.training.extensions.MultistepShift

Trainer extension to shift an optimizer attribute in several steps.

chainer.training.extensions.PolynomialShift

Trainer extension to polynomially shift an optimizer attribute.

chainer.training.extensions.WarmupShift

Trainer extension to gradually initialize an optimizer attribute.

chainer.training.extensions.StepShift

Trainer extension to shift an optimizer attribute in "steps".

Reporting

These extensions provide features to perform reporting of metrics and various statistics to the console or files.

chainer.training.extensions.PrintReport

Trainer extension to print the accumulated results.

chainer.training.extensions.ProgressBar

Trainer extension to print a progress bar and recent training status.

chainer.training.extensions.LogReport

Trainer extension to output the accumulated results to a log file.

chainer.training.extensions.PlotReport

Trainer extension to output plots.

chainer.training.extensions.VariableStatisticsPlot

Trainer extension to plot statistics for Variables.

chainer.training.extensions.DumpGraph

Trainer extension to dump a computational graph.

Snapshot

These extensions provide features to take snapshots of models.

chainer.training.extensions.snapshot

Returns a trainer extension to take snapshots of the trainer.

chainer.training.extensions.snapshot_object

Returns a trainer extension to take snapshots of a given object.

Memory Release

These extensions provide features to release memories.

chainer.training.extensions.unchain_variables

Trainer extension to unchain all computational graphs.

Triggers

A trigger is a callable object to decide when to process some specific event within the training loop. It takes a Trainer object as the argument, and returns True if some event should be fired.

It is mainly used to determine when to call an extension. It is also used to determine when to quit the training loop.
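For example (a small sketch; the custom condition is hypothetical):

from chainer import training

# get_trigger() turns a (period, unit) tuple into an IntervalTrigger.
trigger = training.get_trigger((100, 'iteration'))

# Any callable that takes the trainer and returns a bool can also act as a trigger.
def long_enough(trainer):
    return trainer.elapsed_time > 3600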

chainer.training.get_trigger

Gets a trigger object.

chainer.training.triggers.BestValueTrigger

Trigger invoked when specific value becomes best.

chainer.training.triggers.EarlyStoppingTrigger

Trigger for Early Stopping

chainer.training.triggers.IntervalTrigger

Trigger based on a fixed interval.

chainer.training.triggers.ManualScheduleTrigger

Trigger invoked at specified point(s) of iterations or epochs.

chainer.training.triggers.MaxValueTrigger

Trigger invoked when specific value becomes maximum.

chainer.training.triggers.MinValueTrigger

Trigger invoked when specific value becomes minimum.

chainer.training.triggers.OnceTrigger

Trigger based on the starting point of the iteration.

chainer.training.triggers.TimeTrigger

Trigger based on a fixed time interval.

Datasets

Dataset Abstraction (chainer.dataset)

Chainer supports a common interface for training and validation of datasets. The dataset support consists of three components: datasets, iterators, and batch conversion functions.

Dataset represents a set of examples. The interface is determined only by its combination with the iterators you want to use on it. The built-in iterators of Chainer require the dataset to support the __getitem__ and __len__ methods. In particular, the __getitem__ method should support indexing by both an integer and a slice. We can easily support slice indexing by inheriting DatasetMixin, in which case users only have to implement the get_example() method for indexing. Basically, datasets are considered stateless objects, so we do not need to save the dataset as a checkpoint of the training procedure.
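For example, a sketch of a custom dataset built on DatasetMixin (the dataset itself is made up): only __len__() and get_example() are implemented, and integer and slice indexing then work automatically.

import numpy as np
import chainer

class SquaresDataset(chainer.dataset.DatasetMixin):
    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def get_example(self, i):
        return np.float32(i), np.float32(i * i)

ds = SquaresDataset(10)
print(ds[3])      # a single example
print(ds[2:5])    # slice indexing provided by DatasetMixin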

Iterator iterates over the dataset, and at each iteration, it yields a mini-batch of examples as a list. Iterators should support the Iterator interface, which includes the standard iterator protocol of Python. Iterators manage where to read next, which means they are stateful.

The batch conversion function converts the mini-batch into arrays to feed to the neural nets. It is also responsible for sending each array to an appropriate device. Chainer currently provides two implementations: concat_examples() and ConcatWithAsyncTransfer (see Batch Conversion Function below).

These components are all customizable, and designed to have a minimum interface to restrict the types of datasets and ways to handle them. In most cases, though, implementations provided by Chainer itself are enough to cover the usages.

Chainer also has a light system to download, manage, and cache concrete examples of datasets. All datasets managed through the system are saved under the dataset root directory, which is determined by the CHAINER_DATASET_ROOT environment variable, and can also be set by the set_dataset_root() function.
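For example (the path is illustrative), the dataset root can also be changed programmatically:

import chainer

chainer.dataset.set_dataset_root('/data/chainer')
print(chainer.dataset.get_dataset_root())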

Dataset Representation

See Dataset Examples (chainer.datasets) for dataset implementations.

chainer.dataset.DatasetMixin

Default implementation of dataset indexing.

Tabular Dataset Representation

chainer.dataset.TabularDataset

An abstract class that represents tabular dataset.

Tabular Dataset Helpers

chainer.dataset.tabular.DelegateDataset

A helper class to implement a TabularDataset.

chainer.dataset.tabular.from_data

Create a TabularDataset from lists/arrays/callables.

Iterator Interface

See Iterator for dataset iterator implementations.

chainer.dataset.Iterator

Base class of all dataset iterators.

Batch Conversion Function

chainer.dataset.Converter

Base class of converters.

chainer.dataset.converter

Decorator to make a converter.

chainer.dataset.concat_examples

Concatenates a list of examples into array(s).

chainer.dataset.ConcatWithAsyncTransfer

Interface to concatenate data and transfer them to GPU asynchronously.

chainer.dataset.to_device

Send an array to a given device.

Dataset Management

chainer.dataset.get_dataset_root

Gets the path to the root directory to download and cache datasets.

chainer.dataset.set_dataset_root

Sets the root directory to download and cache datasets.

chainer.dataset.cached_download

Downloads a file and caches it.

chainer.dataset.cache_or_load_file

Caches a file if it does not exist, or loads it otherwise.

Dataset Examples (chainer.datasets)

The most basic dataset implementation is an array. Both NumPy and CuPy arrays can be used directly as datasets.

In many cases, though, the simple arrays are not enough to write the training procedure. In order to cover most of such cases, Chainer provides many built-in implementations of datasets.

These built-in datasets are divided into two groups. One is a group of general datasets. Most of them are wrappers of other datasets to introduce some structures (e.g., tuple or dict) to each data point. The other one is a group of concrete, popular datasets. These concrete examples use the downloading utilities in the chainer.dataset module to cache downloaded and converted datasets.

General Datasets

General datasets are further divided into four types.

The first one is DictDataset and TupleDataset, both of which combine other datasets and introduce some structures on them.

The second one is ConcatenatedDataset and SubDataset. ConcatenatedDataset represents a concatenation of existing datasets. It can be used to merge datasets and make a larger dataset. SubDataset represents a subset of an existing dataset. It can be used to separate a dataset for hold-out validation or cross validation. Convenient functions to make random splits are also provided.

The third one is TransformDataset, which wraps around a dataset by applying a function to data indexed from the underlying dataset. It can be used to modify behavior of a dataset that is already prepared.

The last one is a group of domain-specific datasets. Currently, implementations for datasets of images (ImageDataset, LabeledImageDataset, etc.) and text (TextDataset) are provided.
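A minimal sketch combining some of these wrappers (the arrays below are made up):

import numpy as np
from chainer import datasets

x = np.random.rand(100, 3).astype(np.float32)
y = np.random.randint(0, 2, size=(100,)).astype(np.int32)

# Pair the arrays so that data[i] == (x[i], y[i]).
data = datasets.TupleDataset(x, y)

# Hold out 20 examples for validation with a reproducible random split.
train, valid = datasets.split_dataset_random(data, 80, seed=0)

# Apply a per-example transformation lazily at access time.
scaled = datasets.TransformDataset(train, lambda ex: (ex[0] * 2.0, ex[1]))
print(scaled[0])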

DictDataset

chainer.datasets.DictDataset

Dataset of a dictionary of datasets.

TupleDataset

chainer.datasets.TupleDataset

Dataset of tuples from multiple equal-length datasets.

ConcatenatedDataset

chainer.datasets.ConcatenatedDataset

Dataset which concatenates some base datasets.

SubDataset

chainer.datasets.SubDataset

Subset of a base dataset.

chainer.datasets.split_dataset

Splits a dataset into two subsets.

chainer.datasets.split_dataset_random

Splits a dataset into two subsets randomly.

chainer.datasets.get_cross_validation_datasets

Creates a set of training/test splits for cross validation.

chainer.datasets.get_cross_validation_datasets_random

Creates a set of training/test splits for cross validation randomly.

TransformDataset

chainer.datasets.TransformDataset

Dataset that indexes the base dataset and transforms the data.

ImageDataset

chainer.datasets.ImageDataset

Dataset of images built from a list of paths to image files.

chainer.datasets.ZippedImageDataset

Dataset of images built from a zip file.

chainer.datasets.MultiZippedImageDataset

Dataset of images built from a list of paths to zip files.

LabeledImageDataset

chainer.datasets.LabeledImageDataset

Dataset of image and label pairs built from a list of paths and labels.

chainer.datasets.LabeledZippedImageDataset

Dataset of zipped image and label pairs.

TextDataset

chainer.datasets.TextDataset

Dataset of a line-oriented text file.

PickleDataset

chainer.datasets.PickleDataset

Dataset stored in a storage using pickle.

chainer.datasets.PickleDatasetWriter

Writer class that makes PickleDataset.

chainer.datasets.open_pickle_dataset

Opens a dataset stored in a given path.

chainer.datasets.open_pickle_dataset_writer

Opens a writer to make a PickleDataset.

Concrete Datasets

chainer.datasets.get_mnist

Gets the MNIST dataset.

chainer.datasets.get_kuzushiji_mnist

Gets the Kuzushiji-MNIST dataset.

chainer.datasets.get_kuzushiji_mnist_labels

Provides a list of labels for the Kuzushiji-MNIST dataset.

chainer.datasets.get_fashion_mnist_labels

Provides a list of the string names of the labels.

chainer.datasets.get_fashion_mnist

Gets the Fashion-MNIST dataset.

chainer.datasets.get_cifar10

Gets the CIFAR-10 dataset.

chainer.datasets.get_cifar100

Gets the CIFAR-100 dataset.

chainer.datasets.get_ptb_words

Gets the Penn Tree Bank dataset as long word sequences.

chainer.datasets.get_ptb_words_vocabulary

Gets the Penn Tree Bank word vocabulary.

chainer.datasets.get_svhn

Gets the SVHN dataset.

Note

ChainerCV supports implementations of datasets that are useful for computer vision problems, which can be found in chainercv.datasets. Here is a subset of data loaders supported by ChainerCV:

  • Bounding Box Datasets
    • chainercv.datasets.VOCBboxDataset

    • chainercv.datasets.COCOBboxDataset

  • Semantic Segmentation Datasets
    • chainercv.datasets.ADE20KSemanticSegmentationDataset

    • chainercv.datasets.CamVidDataset

    • chainercv.datasets.CityscapesSemanticSegmentationDataset

    • chainercv.datasets.VOCSemanticSegmentationDataset

  • Instance Segmentation Datasets
    • chainercv.datasets.COCOInstanceSegmentationDataset

    • chainercv.datasets.VOCInstanceSegmentationDataset

  • Classification Datasets
    • chainercv.datasets.CUBLabelDataset

    • chainercv.datasets.OnlineProductsDataset

Iterator

Chainer provides some iterators that implement typical strategies to create mini-batches by iterating over datasets. SerialIterator is the simplest one, which extracts mini-batches in the main thread. MultiprocessIterator and MultithreadIterator are parallelized versions of SerialIterator. They maintain worker subprocesses and subthreads, respectively, to load the next mini-batch in parallel.
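A minimal sketch of iterating over a dataset with SerialIterator (a plain Python list already satisfies the required dataset interface):

from chainer import iterators

data = list(range(10))   # any object with __getitem__ and __len__ works

it = iterators.SerialIterator(data, batch_size=4, repeat=False, shuffle=False)
for batch in it:
    # Each batch is a list of examples, e.g. [0, 1, 2, 3].
    print(batch)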

chainer.iterators.SerialIterator

Dataset iterator that serially reads the examples.

chainer.iterators.MultiprocessIterator

Dataset iterator that loads examples in parallel.

chainer.iterators.MultithreadIterator

Dataset iterator that loads examples in parallel using worker threads.

chainer.iterators.DaliIterator

(Experimental) Iterator for DALI pipeline.

Order sampler examples

An Iterator iterates over a dataset according to an order represented by a 1-D array of indices. Order samplers are callables that are used by those iterators to generate this array.

chainer.iterators.OrderSampler

Base class of all order samplers.

chainer.iterators.ShuffleOrderSampler

Sampler that generates random orders.

Serializers

Serialization in NumPy NPZ format

NumPy serializers can be used in arbitrary environments that Chainer runs in. The serializer and deserializer are asymmetric because numpy.savez() does not support online serialization. Serialization therefore requires a two-step manipulation: the objects are first packed into a flat dictionary, which is then serialized into NPZ format.
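In practice, the convenience functions save_npz() and load_npz() listed below wrap these two steps; a minimal sketch (the link and file name are arbitrary):

import chainer.links as L
from chainer import serializers

model = L.Linear(3, 2)

# Save the parameters (and persistent values) to an NPZ file.
serializers.save_npz('linear.npz', model)

# Later, restore them into a freshly constructed link.
model2 = L.Linear(3, 2)
serializers.load_npz('linear.npz', model2)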

chainer.serializers.DictionarySerializer

Serializer for dictionary.

chainer.serializers.NpzDeserializer

Deserializer for NPZ format.

chainer.serializers.save_npz

Saves an object to the file in NPZ format.

chainer.serializers.load_npz

Loads an object from the file in NPZ format.

Serialization in HDF5 format

chainer.serializers.HDF5Serializer

Serializer for HDF5 format.

chainer.serializers.HDF5Deserializer

Deserializer for HDF5 format.

chainer.serializers.save_hdf5

Saves an object to the file in HDF5 format.

chainer.serializers.load_hdf5

Loads an object from the file in HDF5 format.

Serializers base classes

chainer.Serializer

Base class of all serializers.

chainer.AbstractSerializer

Abstract base class of all serializers and deserializers.

chainer.Deserializer

Base class of all deserializers.

Backends and Devices

Common Classes and Utilities

chainer.backend.Device

A base class of unified devices.

chainer.get_device

Returns a device object.

chainer.using_device

Context manager to apply the thread-local device state.

chainer.backend.get_device_from_array

Gets the device from arrays.

chainer.backend.get_array_module

Gets an appropriate NumPy-compatible module to process arguments

chainer.DeviceResident

A base class of objects with multi-device hierarchy.

chainer.device_resident.DeviceResidentsVisitor

Base class of visitors that visits device resident objects recursively.

chainer.backend.copyto

Copies the elements of an ndarray to those of another one.
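A minimal sketch of the unified device API (the GPU specifier is only usable when CuPy and a GPU are available):

import numpy as np
import chainer

device = chainer.get_device('@numpy')    # or '@cupy:0' for GPU 0
x = np.arange(6, dtype=np.float32)

with chainer.using_device(device):
    # device.xp is the array module of the device (NumPy here).
    y = device.xp.sqrt(x)

# Transfer an array to the device (a no-op for the NumPy device).
x_on_device = device.send(x)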

Concrete Device Classes

chainer.backend.CpuDevice

Device for CPU (NumPy) backend

chainer.backend.GpuDevice

Device for GPU (CuPy) backend

chainer.backend.Intel64Device

Device for Intel64 (Intel Architecture) backend with iDeep

chainer.backend.ChainerxDevice

Device for ChainerX backend

GPU (CuPy)

Device, context and memory management on CuPy.

Note

The package chainer.cuda has been renamed to chainer.backends.cuda as of v4.0.0, but the previous module path chainer.cuda is also available.

Chainer uses CuPy (with a very thin wrapper) to exploit the speed of GPU computation. The following modules and classes defined in CuPy are imported into the chainer.backends.cuda module for convenience (refer to the table below when reading Chainer’s source code).

imported name                        original name
chainer.backends.cuda.cupy           cupy
chainer.backends.cuda.cupyx          cupyx
chainer.backends.cuda.ndarray        cupy.ndarray
chainer.backends.cuda.cupy.cuda      cupy.cuda
chainer.backends.cuda.Device         cupy.cuda.Device
chainer.backends.cuda.Event          cupy.cuda.Event
chainer.backends.cuda.Stream         cupy.cuda.Stream

Chainer replaces the default allocator of CuPy with its memory pool implementation. This allows device memory to be reused over multiple forward/backward computations and across temporary arrays for consecutive elementwise operations.

Devices

chainer.backends.cuda.get_device

Gets the device from a device object, an ID integer or an array object.

chainer.backends.cuda.get_device_from_id

Gets the device from an ID integer.

chainer.backends.cuda.get_device_from_array

Gets the device from a list of CuPy arrays or a single CuPy array.

CuPy array allocation and copy

chainer.backends.cuda.copy

Copies a cupy.ndarray object using the default stream.

chainer.backends.cuda.to_cpu

Copies the given GPU array to host CPU.

chainer.backends.cuda.to_gpu

Copies the given CPU array to the specified device.
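A minimal sketch (only meaningful when CuPy and a GPU are available):

import numpy as np
from chainer.backends import cuda

x_cpu = np.arange(6, dtype=np.float32)

# Copy the array to GPU 0, then back to the host.
x_gpu = cuda.to_gpu(x_cpu, device=0)
x_back = cuda.to_cpu(x_gpu)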

Kernel definition utilities

chainer.backends.cuda.memoize

Makes a function memoizing the result for each argument and device.

chainer.backends.cuda.clear_memo

Clears the memoized results for all functions decorated by memoize.

chainer.backends.cuda.elementwise

Creates an elementwise kernel function.

chainer.backends.cuda.raw

Creates a raw kernel function.

chainer.backends.cuda.reduce

Creates a global reduction kernel function.
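A minimal sketch of defining an elementwise kernel (requires CuPy; the kernel body is made up for illustration):

from chainer.backends import cuda

# Compiles (and memoizes) a CUDA kernel computing z = (x - y)**2 elementwise.
squared_diff = cuda.elementwise(
    'float32 x, float32 y',    # input arguments
    'float32 z',               # output arguments
    'z = (x - y) * (x - y)',   # elementwise operation
    'squared_diff')            # kernel name

xp = cuda.cupy
z = squared_diff(xp.arange(5, dtype='float32'), xp.ones(5, dtype='float32'))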

CPU/GPU generic code support

chainer.backends.cuda.get_array_module

Gets an appropriate one from numpy or cupy.

cuDNN support

chainer.backends.cuda.set_max_workspace_size

Sets the workspace size for cuDNN.

chainer.backends.cuda.get_max_workspace_size

Gets the workspace size for cuDNN.

Intel64 (iDeep)

iDeep is a module that provides a NumPy-like API and DNN acceleration using MKL-DNN for Intel CPUs. See Tips and FAQs and Performance Best Practices for details.

chainer.backends.intel64.is_ideep_available

Returns if iDeep is available.

ChainerX

chainer.backend.from_chx

Converts an array or arrays from ChainerX to NumPy or CuPy ones.

chainer.backend.to_chx

Converts an array or arrays to ChainerX.

Utilities

Convolution/Deconvolution utilities

chainer.utils.get_conv_outsize

Calculates output size of convolution.

chainer.utils.get_deconv_outsize

Calculates output size of deconvolution.

Common algorithms

chainer.utils.WalkerAlias

Implementation of Walker's alias method.

Common utilities

chainer.print_runtime_info

Shows Chainer runtime information.

Reporter

chainer.Reporter

Object to which observed values are reported.

chainer.get_current_reporter

Returns the current reporter object.

chainer.report

Reports observed values with the current reporter object.

chainer.report_scope

Returns a report scope with the current reporter.

Summary and DictSummary

chainer.Summary

Online summarization of a sequence of scalars.

chainer.DictSummary

Online summarization of a sequence of dictionaries.

Sparse utilities

A chainer.Variable can be converted into a sparse matrix, e.g., in COO (coordinate list) format. A sparse matrix stores the same data as the original object but with a different internal representation, optimized for efficient operations on sparse data, i.e., data with many zero elements.

The following is a list of supported sparse matrix formats and utilities for converting between a chainer.Variable and these representations.

Note

Please be aware that only certain functions accept sparse matrices as inputs, such as chainer.functions.sparse_matmul().

chainer.utils.CooMatrix

A sparse matrix in COO format.

chainer.utils.to_coo

Returns a single or a batch of matrices in COO format.
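A minimal sketch of converting a dense batch to COO format and multiplying it with a dense matrix (the shapes and values are arbitrary):

import numpy as np
import chainer.functions as F
from chainer import utils

a = np.array([[0, 2, 0],
              [1, 0, 0]], dtype=np.float32)    # mostly-zero matrix
b = np.random.rand(3, 4).astype(np.float32)

a_coo = utils.to_coo(a)          # CooMatrix holding only the nonzero entries
y = F.sparse_matmul(a_coo, b)    # same result as a.dot(b)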

Experimental feature annotation

chainer.utils.experimental

Declares that user is using an experimental feature.

Configuring Chainer

Chainer provides some global settings that affect the behavior of some functionalities. Such settings can be configured using the unified configuration system. The system provides a transparent way to manage the configuration for each process and for each thread.

The configuration is managed by two global objects: chainer.global_config and chainer.config.

  • The global_config object maintains the configuration shared in the Python process. This is an instance of the GlobalConfig class. It can be used just as a plain object, and users can freely set any attributes on it.

  • The config object, on the other hand, maintains the configuration for the current thread. This is an instance of the LocalConfig class. It behaves like a thread-local object, and any attribute modifications are only visible to the current thread.

If no value is set to config for a given key, global_config is transparently referred to. Thanks to this transparent lookup, users can always read any configuration through config: the thread-local configuration is used if available, and otherwise the default global setting is used.

The following entries of the configuration are currently provided by Chainer. Some entries support environment variables to set the default values. Note that the default values are set in the global config.

Configuration Keys

  • cudnn_deterministic (default: False)

    Flag to configure deterministic computations in cuDNN APIs.

    If it is True, convolution functions that use cuDNN use the deterministic mode (i.e., the computation is reproducible). Otherwise, the results of convolution functions using cuDNN may be non-deterministic in exchange for better performance.

  • debug (default: False)

    Debug mode flag.

    If it is True, Chainer runs in debug mode. Enabling debug mode may introduce some performance overhead. See Debug Mode for more information about the debug mode.

    You can change the default value to True by setting CHAINER_DEBUG environment variable to 1.

  • dtype (default: numpy.float32)

    Default floating point data type.

    Chainer uses this dtype to construct arrays when the dtype is not specified (e.g. initializers).

    You can change the default value by setting the CHAINER_DTYPE environment variable to one of mixed16, float16, float32, or float64.

    Note

    If you want to use float16 for better performance, it is recommended that you use mixed16 instead of float16.

  • enable_backprop (default: True)

    Flag to enable backpropagation support.

    If it is True, computational graphs are created during forward passes by FunctionNodes, allowing backpropagation to start from any Variable in the graph. Otherwise, computational graphs are not created and memory consumption is reduced, so calling backward() on the results of a function will not compute any gradients of any input.

  • keep_graph_on_report (default: False)

    Flag to configure whether or not to let report() keep the computational graph.

    If it is False, report() does not keep the computational graph when a Variable object is reported. It means that report() stores a copy of the Variable object which is purged from the computational graph. If it is True, report() just stores the Variable object as is with the computational graph left attached.

    You can change the default value to True by setting CHAINER_KEEP_GRAPH_ON_REPORT environment variable to 1.

  • warn_nondeterministic (default: False)

    Flag to give a warning when a non-deterministic function is used. This feature is experimental.

    If it is True, functions that rely on non-deterministic operations that cannot be seeded, such as atomicAdd, will emit a warning when executed. Functions that can take a seed argument, such as split_dataset_random(), should be seeded when the function is called and are not flagged by this setting.

    Note that this feature is provided on a best-effort basis. It cannot guarantee that every non-deterministic function is detected. For example, SSE computations in CPU mode may cause non-deterministic behavior that does not raise a warning.

    Also, deterministic outputs may still result even if this flag produces a non-determinism warning. For example, reduction over a 1-dimensional axis should always be deterministic, but it may raise a warning.

  • train (default: True)

    Training mode flag.

    If it is True, Chainer runs in training mode. Otherwise, it runs in the testing (evaluation) mode.

    This configuration is used by Functions and Links that need to behave differently between the training phase and the evaluation (inference) phase. One example is chainer.links.BatchNormalization, which updates its statistics using input data only when train is set to True. Another example is chainer.functions.dropout(), which does nothing when train is set to False.

    Generally, you are responsible for changing the configuration to False during evaluation. If you are using Trainer with the Evaluator extension, the train configuration is automatically switched to False during evaluation in the training loop.

    Note that this parameter does not reduce memory consumption or affect the creation of computational graphs required in order to compute gradients.

  • type_check (default: True)

    Type checking mode flag.

    If it is True, Chainer checks the types (data types and shapes) of inputs on Function applications. Otherwise, it skips type checking.

    You can change the default value to False by setting CHAINER_TYPE_CHECK environment variable to 0.

  • use_cudnn (default: 'auto')

    Flag to configure whether or not to use cuDNN.

    This is a ternary flag with 'always', 'auto', and 'never' as its allowed values. The meaning of each flag is as follows.

    • If it is 'always', Chainer will try to use cuDNN everywhere if possible.

    • If it is 'auto', Chainer will use cuDNN only if it is known that the usage does not degrade the performance.

    • If it is 'never', Chainer will never use cuDNN anywhere.

    You can change the default value by setting CHAINER_USE_CUDNN environment variable to any of 'always', 'auto' or 'never'.

  • use_ideep (default: 'never')

    Flag to configure whether or not to use iDeep.

    This is a ternary flag with 'always', 'auto', and 'never' as its allowed values. The meaning of each flag is as follows.

    • If it is 'always', Chainer will try to use iDeep everywhere if possible.

    • If it is 'auto', Chainer will use iDeep only if it is known that the usage does not degrade the performance.

    • If it is 'never', Chainer will never use iDeep anywhere.

    You can change the default value by setting CHAINER_USE_IDEEP environment variable to any of 'always', 'auto' or 'never'.

    Note that regardless of this configuration, optimizers use iDeep if and only if the link has been manually converted to iDeep (e.g., by model.to_intel64()).

  • lazy_grad_sum (default: False)

    Flag to control the behavior of gradient accumulation.

    If it is True, gradients are accumulated in batch for performance. Otherwise gradients are accumulated one by one.

    You can change the default value to True by setting CHAINER_LAZY_GRAD_SUM environment variable to 1.

  • use_cudnn_tensor_core (default: 'auto')

    Flag to configure whether or not to enable Tensor Core operations in cuDNN.

    This is a ternary flag with 'always', 'auto', and 'never' as its allowed values. The meaning of each flag is as follows.

    • If it is 'always', Chainer uses cuDNN’s Tensor Core operations.

    • If it is 'never', Chainer does not use cuDNN’s Tensor Core operations.

    • If it is 'auto', Chainer checks the cuDNN version, the data type of the input, and the compute capability of the GPU used, and then decides whether or not to use cuDNN’s Tensor Core operations.

  • autotune (default: False)

    Autotune for convolutional networks flag.

    If it is True, Chainer uses the cuDNN autotune feature to find the fastest calculation process for chainer.links.Convolution2D, ConvolutionND, Deconvolution2D, or DeconvolutionND links.

  • cudnn_fast_batch_normalization (default: False)

    Flag to configure whether or not to enable the use of the fast implementation of batch normalization in cuDNN.

    If True, Chainer will try to use the fast implementation for batch normalization in cuDNN by setting cuDNN’s batch normalization mode to CUDNN_BATCHNORM_SPATIAL_PERSISTENT. You can change the default value to True by setting CHAINER_CUDNN_FAST_BATCH_NORMALIZATION environment variable to 1.

  • in_recomputing (default: False)

    This flag is automatically set by chainer.functions.forget() and not intended to be changed by users. You can use this flag when implementing your own Link to avoid updating the internal states during recomputation done by chainer.functions.forget(). See the documentation of chainer.functions.forget() for details.

  • use_static_graph (default: True)

    Flag to configure whether or not to use the static subgraph optimization feature. Where the static subgraph optimization decorator is used, we generally assume that the feature should be used, so the default value is True. However, if you want to run the same code without the feature, you can simply set the flag to False instead of removing the decorators. This is useful, for instance, when running your model with ChainerX, since ChainerX is not supported by the static subgraph optimization feature.

User-defined Keys

Users can also define their own configurations. There are two ways:

  1. Use Chainer’s configuration objects. In this case, it is strongly recommended that the name be prefixed by “user_” to avoid name conflicts with configurations introduced to Chainer in the future (see the sketch after this list).

  2. Use your own configuration objects. Users can define their own configuration objects using chainer.configuration.GlobalConfig and chainer.configuration.LocalConfig. In this case, there is no need to take care of the name conflicts.
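A minimal sketch of both approaches (the key names below are made up for illustration):

import chainer
from chainer import configuration

# Way 1: piggy-back on Chainer's configuration objects with a "user_" prefix.
chainer.global_config.user_verbose_logging = False   # process-wide default
chainer.config.user_verbose_logging = True           # thread-local override
if chainer.config.user_verbose_logging:
    print('verbose logging enabled in this thread')

# Way 2: use completely separate configuration objects, free of name conflicts.
my_global_config = configuration.GlobalConfig()
my_config = configuration.LocalConfig(my_global_config)
my_global_config.verbosity = 1
print(my_config.verbosity)   # falls back to the global value: 1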

Changing Configuration

If you want to share a setting within the process, set an attribute on the global configuration. The value is then also visible through the local config, thanks to the transparent lookup.

>>> chainer.global_config.train
True
>>> chainer.config.train
True

>>> chainer.global_config.train = False

>>> chainer.global_config.train
False
>>> chainer.config.train
False

If you set an attribute to the local configuration, the value is only visible to the current thread.

>>> chainer.global_config.train
True
>>> chainer.config.train
True

>>> chainer.config.train = False

>>> chainer.global_config.train
True
>>> chainer.config.train
False

If you want to temporarily modify the configuration for the specific scope, you can use using_config(). For example, if you only want to enable debug mode in a fragment of code, write as follows.

>>> with chainer.using_config('debug', True):
...     pass  # code running in debug mode

If you want to switch to the test mode for an evaluation, you can do that in the same way.

>>> # Do training here
>>> with chainer.using_config('train', False):
...     pass  # Perform evaluation here

Note that Evaluator automatically switches to the test mode, so you do not need to switch it manually in the loss function used for evaluation.

You can also make your own code behave differently in training and test modes as follows.

if chainer.config.train:
    pass  # code only running in the training mode
else:
    pass  # code only running in the test mode

chainer.global_config

chainer.config

Thread-local configuration of Chainer.

chainer.using_config

Context manager to temporarily change the thread-local configuration.

chainer.configuration.GlobalConfig

chainer.configuration.LocalConfig

Thread-local configuration of Chainer.

Environment Variables

Here are the environment variables Chainer uses.

CHAINER_SEED

Default seed value of random number generators for CUDA. If it is not set, the seed value is generated from the Python random module. Set an integer value in decimal format.

CHAINER_DATASET_ROOT

Default directory path to store the downloaded datasets. See Datasets for details.

CHAINER_CUDNN

Set 0 to completely disable cuDNN in Chainer. In this case, cuDNN will not be used regardless of CHAINER_USE_CUDNN and chainer.config.use_cudnn configuration. Otherwise cuDNN is enabled automatically.

CHAINER_USE_CUDNN

Used as the default value for chainer.config.use_cudnn configuration. The value must be any of 'always', 'auto' or 'never'. If CHAINER_CUDNN is set to 0, this environment variable has no effect. See Configuring Chainer for details.

CHAINER_CUDNN_FAST_BATCH_NORMALIZATION

Used as the default value for chainer.config.cudnn_fast_batch_normalization configuration. Set 1 to enable use of fast implementation for batch normalization in cuDNN. See Configuring Chainer for details.

CHAINER_USE_IDEEP

Used as the default value for chainer.config.use_ideep configuration. The value must be any of 'always', 'auto' or 'never'. See Configuring Chainer for details.

CHAINER_LAZY_GRAD_SUM

Used as the default value for chainer.config.lazy_grad_sum configuration. Set 1 to enable batch accumulation of gradients. See Configuring Chainer for details.

CHAINER_DTYPE

Used as the default value for chainer.config.dtype configuration. The value must be any of 'mixed16', 'float16', 'float32' or 'float64'. See Configuring Chainer for details.

CHAINER_TYPE_CHECK

Used as the default value for chainer.config.type_check configuration. Set 0 to disable type checking. Otherwise type checking is enabled automatically. See Configuring Chainer and Type checking utilities for details.

CHAINER_DEBUG

Used as the default value for chainer.config.debug configuration. Set 1 to enable debug mode. It is disabled by default. In debug mode, Chainer performs various runtime checks that can help debug user’s code at the cost of some overhead. See Configuring Chainer and Debug Mode for details.

CHAINER_KEEP_GRAPH_ON_REPORT

Used as the default value for chainer.config.keep_graph_on_report configuration. Set 1 to let report() keep the computational graph. See Configuring Chainer for details.

CHAINER_PYTHON_350_FORCE

Set 1 to force using Chainer with Python 3.5.0. Note that Chainer does not work with Python 3.5.0. Use Python 3.5.2+ or other supported versions (see Installation).

The following environment variables are only effective when running unit tests.

CHAINER_TEST_GPU_LIMIT

Number of GPUs available for unit tests. When running unit tests, test cases that require more GPUs than the specified value will be skipped. Set 0 to skip all test cases that require a GPU. See Unit Testing for details.

CHAINER_TEST_RANDOM_NONDETERMINISTIC

Set 1 to use non-fixed seed for random number generators, even for test cases annotated with fix_random.

Debug Mode

In debug mode, Chainer checks the values of variables at runtime and shows more detailed error messages. It helps you debug your programs, but it adds some runtime overhead.

If you want to enable debug mode for the entire code, you can set CHAINER_DEBUG environment variable to 1.

You can also enable or disable debug mode for the specific scope of code with chainer.using_config() or by changing chainer.config.debug configuration.

with chainer.using_config('debug', True):
   ...

See Configuring Chainer for the details of Chainer’s configuration mechanism.

In debug mode, Chainer checks all results of forward and backward computation, and if it finds a NaN value, it raises a RuntimeError. Some functions and links also check validity of input values more strictly.

You can check if debug mode is enabled with chainer.is_debug() function.

chainer.is_debug

Returns if the debug mode is enabled or not in the current thread.

chainer.set_debug

Enables or disables the debug mode in the current thread.

Visualization of Computational Graph

As neural networks get larger and more complicated, it gets much harder to confirm whether their architectures are constructed properly. Chainer supports visualization of computational graphs. Users can generate computational graphs by invoking build_computational_graph(). Generated computational graphs are dumped in the specified format (currently, only the DOT language is supported).

Basic usage is as follows:

import chainer.computational_graph as c
...
g = c.build_computational_graph(vs)
with open('path/to/output/file', 'w') as o:
    o.write(g.dump())

where vs is a list of Variable instances and g is an instance of ComputationalGraph. This code generates the computational graph that is backward-reachable (i.e., reachable by repeatedly tracing steps backward) from at least one of vs.

Here is an example of (a part of) the generated graph (inception(3a) in GoogLeNet). This example is from examples/imagenet.

_images/googlenet.png

chainer.computational_graph.build_computational_graph

Builds a graph of functions and variables backward-reachable from outputs.

chainer.computational_graph.ComputationalGraph

Class that represents computational graph.

Static Subgraph Optimizations: Usage

Note

This is an experimental feature and so the API might change in the future as it is developed.

This feature intends to improve runtime performance by optimizing the execution of the static subgraphs in a model. When this feature is enabled, the first iteration runs as normal, except that an execution trace is also collected. The trace is then used to generate optimized code that will be called instead of the define-by-run code starting from the second iteration.

chainer.static_graph

Decorator to mark a Chain's __call__() as a static sub-graph.

Basic usage

To enable static graph optimizations, it is only necessary to add the chainer.static_graph() decorator to a chain’s __call__() method. We will now show how the Chainer MNIST example can be modified to use this feature. The modified version with static subgraph optimizations is located at examples/static_graph_optimizations/mnist.

The first step is to import the necessary packages:

train_mnist.py
24from chainer import static_code
25from chainer import static_graph

Since the neural network model MLP corresponds to a static graph, we can annotate it as a static graph by using the chainer.static_graph() decorator on the chain’s __call__() method. This lets the framework know that the define-by-run code of the chain always creates the same graph (that is, it always performs the same sequence of computations) each time it is called. We will refer to such a chain as a static chain in the documentation.

train_mnist.py
34# Network definition
35class MLP(chainer.Chain):
36
37    """A fully-connected neural network for digit classification.
38
39    """
40
41    def __init__(self, n_units, n_out):
42        super(MLP, self).__init__()
43        with self.init_scope():
44            # the size of the inputs to each layer will be inferred
45            self.l1 = L.Linear(None, n_units)  # n_in -> n_units
46            self.l2 = L.Linear(None, n_units)  # n_units -> n_units
47            self.l3 = L.Linear(None, n_out)  # n_units -> n_out
48
49    @static_graph
50    def __call__(self, x):
51        h1 = F.relu(self.l1(x))
52        h2 = F.relu(self.l2(h1))
53        return self.l3(h2)

Note

If your model’s define-by-run code has any control flow operations that could cause it to potentially call different Chainer functions/links each time it is called, then you cannot use this decorator.

Note

There are currently some restrictions on how variables can be passed into a static chain’s __call__() method. Refer to the documentation of chainer.static_graph() for details.

Recall that the define-by-run code of a static chain’s __call__() method only actually runs during the first iteration and is then replaced by optimized static schedule code. The current implementation only knows how to do this auto-replacement for calls to Chainer functions and links. Any other code that the user puts in __call__() (which we refer to as “side-effect code”) will only ever get called once by default, since the define-by-run code is only executed during the first iteration. In order to make sure such “side effect” code actually gets called each iteration, we need to put it inside a function or method decorated by static_code(). We expect there will rarely be a need to use side-effect code but for completeness, an example of a model that uses it is available in the MLPSideEffect Chain of the static graph MNIST example.

In this example, we only need to use chainer.static_graph() on the model chain, since the whole model is static. However, in more general dynamic models, each of the largest static subgraphs (which should each be written as a chain) should also use chainer.static_graph().

Note

Nested application of chainer.static_graph() is not allowed. That is, if a chainer.static_graph()-decorated chain calls other chains, only the outermost chain should use the decorator.

Calling a static chain multiple times in the same iteration

In a general dynamic graph network, it is not possible to know in advance how many times a static chain will be called in any particular iteration. Note that during training, it is necessary to maintain separate internal state (such as intermediate activations) for each of these calls so that the gradients can be computed in the backward pass. So, although the layer functions of the static schedule will be identical each time the same static chain is called, any internal state must be distinct. It is also possible that a static chain could be called multiple times with inputs of different shapes and/or types during the same iteration. To avoid confusion, “static schedule” will refer to both the functions and any corresponding internal state such as activations.

If backpropagation mode is disabled (chainer.config.enable_backprop is False), it is safe for the implementation to simply compute a static schedule for the first call and reuse it for subsequent calls, provided that the cached schedule is compatible with the input shapes/types. However, during training, it is necessary to maintain distinct internal state for each call in order to compute the gradients for the backward pass, which prevents us from reusing the same static schedule for each of the multiple calls of a static chain in an iteration.

The current implementation handles this issue as follows. A cache of static schedules, which is initially empty, is associated with each static chain. The size of this cache will be equal to the maximum number of times that the static chain has been called in any previous iteration, and the cache is reset whenever certain chain configuration flags change, such as the training mode and the backpropagation mode. At the start of a given iteration, all cached schedules are available for use, and the number of available schedules is decremented each time the static chain is called. If the chain is called when the cache size is zero, then its define-by-run code will execute to create a new static schedule.

In order for such an implementation to work, each static chain must be notified when the forward pass has ended (or when the forward pass is started) so that all cached schedules can be made available for use again. In the current implementation, this is accomplished by calling the backward() method on a loss variable in the model. This is expected to handle the typical use cases. However, in some models it may be necessary to perform multiple forward passes before calling backward(). In such a case, to signal to a static chain that the forward pass (and the iteration) has ended, call my_chain.schedule_manager.end_forward(). The schedule_manager attribute of a static chain is an instance of a class called StaticScheduleFunction that will be available after the chain has been called.

Effects on model debugging

Note that since the code in the static chain’s __call__() only runs during the first iteration, you will only be able to debug this code as define-by-run during the first iteration. It is assumed that if the chain actually is static, any problems in its define-by-run code should be apparent during the first iteration, and it should not be (as) necessary to debug this code in later iterations. However, this feature does provide some functionality to help with debugging. For example, it is possible to obtain and inspect the current static schedules. It is also possible to directly step through the code of the static schedule if you wish (by debugging the forward() method of StaticScheduleFunction in static_graph).

Disabling the static subgraph optimization

It is possible to turn off the static subgraph optimization feature by setting chainer.config.use_static_graph to False. If set to False, the chainer.static_graph() decorator will simply call the wrapped function without any further side effects.

Limitations and future work

  • Optimization switches to let the user select the trade-off between runtime performance and memory usage: The current implementation achieves its speedups mainly by reducing the amount of Python code that needs to run, but does not yet implement advanced optimizations for memory usage or runtime performance. Ideally, the user should be able to adjust performance tuning parameters to control the trade-off between memory consumption and runtime performance.

  • Incompatibility with GRU and LSTM links: This feature requires that all input variables to a chain need to explicitly appear in the arguments to the chain’s __call__() method. However, the GRU and LSTM links with state maintain variable attributes of the chain for the RNN state variables. Design changes to support such links and/or modifications to these links are being considered. These links may still be used with the current implementation, as long as the corresponding RNN is unrolled inside of a static chain. For an example of this, see the modified ptb example at examples/static_graph_optimizations/ptb

  • Memory usage: The current implementation caches all static schedules which can lead to high memory usage in some cases. For example, separate schedules are created when the training mode or mini-batch size changes.

  • Advanced graph optimizations: Advanced optimizations such as fusion of operations is not yet implemented.

  • Constraints on arguments to a static chain: The current version requires that all input variables used inside __call__() of a static chain must either appear in the arguments of this method or be defined in the define-by-run code. Furthermore, any variables that appear in the arguments list must appear by themselves or be contained inside a list or tuple. Arbitrary levels of nesting are allowed.

  • Model export: In the case where the complete computation graph for the model is static, it should be possible in principle to export the static schedule in a format that can be run on other platforms and languages. One of the other original motivations for this feature was to support exporting static Chainer models to run on C/C++ and/or optimize the static schedule execution code in Cython/C/C++. However, it seems that ONNX is now fulfilling this purpose and there is a separate ONNX exporter already in development for Chainer. Perhaps these two features can be merged at some point in the future.

  • Double-backward support: This feature was designed to support double-backward (gradient of gradient) but it has not been tested.

  • ChainerX is not supported. If you have code written using this feature but would like to run the model with ChainerX, please set the chainer.config.use_static_graph configuration to False. The code should then work without any additional changes.

Examples

For additional examples that use this feature, refer to the examples in examples/static_graph_optimizations.

Static Subgraph Optimizations: Design Notes

This documentation is intended to provide information on the architecture and design of the static subgraph optimizations feature for those who are interested in contributing to its development. It also describes how existing Chainer functions can be modified to run more efficiently when static subgraph optimizations are enabled.

Overview of dynamic and static graph frameworks

Existing deep learning frameworks can roughly be classified as either a “static graph” or “dynamic graph” framework. In a static graph framework, which we also call “define-and-run”, the computation graph is defined before the model is run. This implies that the same neural network model will be used each iteration without modifications, hence the name “static.” This allows various graph optimizations to potentially be performed to improve the runtime performance and/or reduce memory usage. The optimized code for the computation graph is then used when the model is run.

However, in a “dynamic graph” (also called “define-by-run”) framework such as Chainer, the computation graph is not defined before the model is run. Rather, it is constructed incrementally and automatically by the framework as the computations of the forward pass are executed. In Chainer, the user writes code to perform the computations of the forward pass in terms of Chainer functions, which have an API similar to an array library like NumPy. As these functions execute, the computation graph is incrementally built so that it will be available after the last function in the forward pass has been called. This has some advantages, such as allowing easier debugging compared to a static graph framework, since the user can step through the computations of the forward pass in a debugger. Define-by-run also provides the flexibility to include control flow operations so that a modified or even completely different graph can be constructed each iteration. Unfortunately, this flexibility also tends to make dynamic graph frameworks slower than static graph frameworks. For example, in Chainer there is a performance penalty involved in dynamically constructing the graph each iteration, since it involves creating many objects; each function call creates a new FunctionNode object, as well as a new VariableNode and an array memory allocation for each output of the function. There are also various dynamic type checks and graph traversals that need to be performed, adding to the runtime overhead. Further, we cannot perform some optimizations such as function/kernel fusion and in-place operations.

Static subgraph optimizations feature

This feature is motivated by the observation that typical deep neural networks correspond to a static computation graph and that even those that correspond to a dynamic graph are typically mostly static. By “mostly static”, we mean that the largest static subgraphs each tend to contain many function nodes (that is, layers) so that the total number of function nodes in the graph tends to be much larger than the total number of largest static subgraphs. If the graph is at least mostly static, then a naive implementation of define-by-run will result in a large amount of redundant operations being performed each iteration to rebuild exactly the same subgraphs, perform the same dynamic type-checking operations, etc., which can sometimes be slow in Python; it will also result in lost opportunities to perform potential graph optimizations. A key assumption motivating this feature is that the main performance bottlenecks tend to occur inside the largest static subgraphs. So, if we can optimize these static subgraphs, it might be fine for any remaining framework code to remain implemented in pure Python. Although such Python code would be slow, it could have negligible runtime overhead.

The solution proposed by this feature is to retain the existing define-by-run style for specifying the model, but to also optionally allow the user to annotate the largest static subgraphs in a model. These “static graph” annotations will then allow the framework to automatically replace the define-by-run code of the static subgraphs with more performance-optimized code. The define-by-run code will still execute during the first iteration, to retain ease of debugging. However, as this code executes, a trace of the needed computations is also collected so that optimized static schedules can be generated for the annotated static subgraphs. Then, starting from the second iteration, this optimized code will automatically be run in place of the original define-by-run code. Note that in the common case in which the whole model is static, the user only needs to add a single “static graph” annotation and their code will then run with the performance of a static graph framework, while still supporting the define-by-run coding style.

The benefit of annotating the static subgraphs in the model is that it allows the define-by-run code to be replaced with an optimized static schedule, which can then potentially support a user-controllable trade-off between runtime performance and memory usage. This is possible because having the full computation graph available enables various optimizations that cannot safely or automatically be performed in define-by-run. Examples (which we have not yet implemented; contributions from the open source community are welcomed) include sub-linear memory usage [1], exploiting graph parallelism, operator fusion, and in-place optimizations.

The current implementation achieves its speedup by retaining only the code that is actually needed to compute the forward pass, backward pass, and so on. This allows us to remove most of the Python interpreter overhead because the Python code that performs dynamic operations such as allocating FunctionNode and Variable objects, checking types, and traversing the backward graph is not included in the optimized static schedule code.

Adding support to existing functions

Most functions and links will not need to be modified at all in order to support this feature, since the framework code will attempt to auto-wrap them inside a @static_code-decorated function. However, some functions might see a performance benefit if static graph support is added manually, since it may result in less redundant code being included in the static schedule. For example, any dynamic checking code that will return the same result every iteration does not need to be included in the static schedule.

An existing function (that is, a subclass of FunctionNode) can be modified to support static graph optimizations as follows. The basic idea is to wrap any code that needs to be called each iteration inside a method that is decorated with @static_code. Note that code that should only run once, such as initializing parameters, should not be wrapped.

It is also necessary to set the _supports_static_optimizations = True class attribute. Note that this attribute is False by default in FunctionNode.

Since the function is part of a static graph, any parameters and output arrays should ideally be statically allocated during the first iteration (while the define-by-run code is executing) and then reused starting from the second iteration. The @static_code-decorated functions that are called each iteration will perform the various deep learning computations, writing results in-place into these static arrays. Since the results are written in-place, there is no need for an @static_code-decorated function to explicitly return a result. Rather, any results arrays should be passed as inputs along with any other input arguments to the function. However, it also is allowed to return dynamically allocated arrays so that existing Chainer functions can be easily supported. The following code shows the typical pattern for performing the forward computations in a FunctionNode:

    @static_code
    def static_forward(self, inputs, outputs):
        # This function will get included in the static
        # schedule and called each iteration.
        # Any input arrays must be passed in a list
        # to the `inputs` keyword argument.
        x = inputs[0]
        # Any output arrays must be passed in a list
        # to the `outputs` keyword argument, and must
        # have already been initialized to the required
        # shape. Results are written in-place into the
        # output arrays.
        y = outputs[0]

        # Read from x, write results into y in-place.
        # Don't forget to zero y if necessary.
        y *= 0.0  # (if necessary)
        y[:] = 3.0 * x  # for example

    def forward(self, inputs):
        # Initialization/type checking code.
        # (only gets called once, during the first iteration)
        type_check_blah(inputs)
        x, = inputs

        # Allocate the output array. Note that since this line
        # is not wrapped using @static_code, it will only ever
        # get called once, during the first iteration.
        y = xp.empty(y_shape).astype(x.dtype)

        # Call the static function
        # (it will get called every iteration from the optimized schedule).
        self.static_forward(inputs=[x], outputs=[y])
        return y,

It should not be necessary to modify the backward() implementation. As of Chainer v3, when double-backward (i.e., grad of grad) support was added, the backward() method of FunctionNode actually calls the forward() method of other FunctionNodes, and so it is only necessary that the forward() functions be wrapped.

For an example of how to add support to an existing function, see the Linear function.

Reference

[1] Training deep nets with sublinear memory cost

Caffe Model Support

Caffe is a popular framework maintained by BVLC at UC Berkeley. It is widely used by computer vision communities, and aims at fast computation and easy usage without any programming. The BVLC team provides trained reference models in their Model Zoo, which can reduce training time required for a new task.

Import

Chainer can import the reference models and emulate the network by Link implementations. This functionality is provided by the chainer.links.caffe.CaffeFunction class.

chainer.links.caffe.CaffeFunction

Caffe emulator based on the model file of Caffe.
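A minimal sketch of loading and running an imported model (the file path and blob names are placeholders that depend on the model you download):

import numpy as np
from chainer.links.caffe import CaffeFunction

# Parse the binary protobuf model file; this can take a while for large models.
func = CaffeFunction('bvlc_googlenet.caffemodel')

# Feed the 'data' blob and fetch the 'loss3/classifier' blob of the network.
x = np.zeros((1, 3, 224, 224), dtype=np.float32)
y, = func(inputs={'data': x}, outputs=['loss3/classifier'])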

Export

Chainer can export a model from Link.

chainer.exporters.caffe.export

(Experimental) Export a computational graph as Caffe format.

Assertion and Testing

Chainer provides some facilities to make debugging easy.

Type checking utilities

FunctionNode uses the systematic type checking of the chainer.utils.type_check module. It enables users to easily find bugs in forward and backward implementations. You can find examples of type checking in some function implementations.

chainer.utils.type_check.Expr

Abstract syntax tree of an expression.

chainer.utils.type_check.eval

chainer.utils.type_check.expect

Evaluates and tests all given expressions.

chainer.utils.type_check.TypeInfo

Type information of an input/gradient array.

chainer.utils.type_check.TypeInfoTuple

Type information of input/gradient tuples.

chainer.utils.type_check.Variable

Gradient checking utilities

Most function implementations are numerically tested by gradient checking. This method computes numerical gradients of the forward routines and compares their results with the corresponding backward routines. It helps make the source of issues clear when we hit an error in gradient computations. The chainer.gradient_check module makes it easy to implement gradient checking.
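A minimal sketch of gradient checking for a built-in function using check_backward() (listed below; the tolerances are arbitrary):

import numpy as np
import chainer.functions as F
from chainer import gradient_check

x = np.random.uniform(-1, 1, (3, 4)).astype(np.float32)
gy = np.random.uniform(-1, 1, (3, 4)).astype(np.float32)

# Compares the analytic backward pass of F.sigmoid with numerical gradients.
gradient_check.check_backward(F.sigmoid, x, gy, atol=1e-4, rtol=1e-4)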

chainer.gradient_check.check_backward

Test backward procedure of a given function.

chainer.gradient_check.check_double_backward

Test twice differentiation of a given procedure.

chainer.gradient_check.numerical_grad

Computes numerical gradient by finite differences.

Standard Assertions

The assertions have the same names as NumPy’s. The difference from NumPy is that they can accept both numpy.ndarray and cupy.ndarray.

chainer.testing.assert_allclose

Asserts if some corresponding element of x and y differs too much.
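For example, as a minimal sketch:

import numpy as np
from chainer import testing

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y = x + 1e-6   # a tiny numerical difference

# Passes because the difference is within the default tolerances;
# cupy.ndarray inputs are accepted in exactly the same way.
testing.assert_allclose(x, y)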

chainer.testing.assert_warns

Function testing utilities

Utilities for testing functions.

chainer.testing.FunctionTestCase

A base class for function test cases.

chainer.testing.unary_math_function_unittest

Decorator for testing unary mathematical Chainer functions.

Serialization testing utilities

Utilities for testing serializable objects.

chainer.testing.save_and_load

Saves src and loads it to dst using a de/serializer.

chainer.testing.save_and_load_hdf5

Saves src to an HDF5 file and loads it to dst.

chainer.testing.save_and_load_npz

Saves src to an NPZ file and loads it to dst.

Trainer Extension Testing Utilities

Utilities for testing trainer extensions.

chainer.testing.get_trainer_with_mock_updater

Returns a Trainer object with mock updater.

Repeat decorators

These decorators make a decorated test run multiple times in a single invocation. The criteria for passing or failing the test change according to the type of decorator. See the documentation of each decorator for details.

chainer.testing.condition.repeat_with_success_at_least

Decorator for multiple trial of the test case.

chainer.testing.condition.repeat

Decorator that requires the test to succeed in all trials.

chainer.testing.condition.retry

Decorator that requires the test to succeed at least once.

Unit test annotation

Decorators for annotating unit tests.

chainer.testing.attr.gpu

Decorator to indicate that GPU is required to run the test.

chainer.testing.attr.multi_gpu

Decorator to indicate number of GPUs required to run the test.

chainer.testing.with_requires

Run a test case only when given requirements are satisfied.

chainer.testing.fix_random

Decorator that fixes random numbers in a test.

Parameterized test

Decorators for making a unit test parameterized.

chainer.testing.parameterize

chainer.testing.product

chainer.testing.product_dict

chainer.testing.inject_backend_tests

Installation

Requirements

You need to have the following components to use Chainer.

  • Python
    • Supported Versions: 3.5.2+, 3.6.0+, 3.7.0+ and 3.8.0+.

  • NumPy
    • Supported Versions: 1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16 and 1.17.

    • NumPy will be installed automatically during the installation of Chainer.

Before installing Chainer, we recommend that you upgrade setuptools and pip:

$ pip install -U setuptools pip

Note

Python 2 is not supported in Chainer v7.x releases. Please consider migrating to Python 3 or use Chainer v6.x, which is the last version that supports Python 2.

Hardware Acceleration Support

You can accelerate performance of Chainer by installing the following optional components.

Note

CuPy v7.8.0 is the recommended version for Chainer v7 series.

Optional Features

The following packages are optional dependencies. Chainer can be installed without them, in which case the corresponding features are not available.

  • Image dataset support
    • pillow 2.3+

    • Run pip install pillow to install.

  • HDF5 serialization support
    • h5py 2.5+

    • Run pip install h5py to install.

  • Distributed Deep Learning using ChainerMN

Install Chainer

Using pip

We recommend installing Chainer via pip:

$ pip install chainer

Note

Any optional dependencies (including CuPy) can be added after installing Chainer. Chainer automatically detects the available packages and enables/disables the optional features appropriately.

Using Tarball

The tarball of the source tree is available via pip download chainer or from the release notes page. You can install Chainer from the tarball:

$ pip install chainer-x.x.x.tar.gz

You can also install the development version of Chainer from a cloned Git repository:

$ git clone https://github.com/chainer/chainer.git
$ cd chainer
$ pip install .

Enable CUDA/cuDNN support

In order to enable CUDA support, you have to install CuPy manually. If you also want to use cuDNN, you have to install CuPy with cuDNN support. See CuPy’s installation guide to install CuPy. Once CuPy is correctly set up, Chainer will automatically enable CUDA support.

You can refer to the following flags to confirm if CUDA/cuDNN support is actually available.

chainer.backends.cuda.available

True if Chainer successfully imports cupy.

chainer.backends.cuda.cudnn_enabled

True if cuDNN support is available.

Google Colaboratory

You can install Chainer and CuPy using the following snippet on Google Colaboratory:

!curl https://colab.chainer.org/install | sh -

See chainer/google-colaboratory for more details and examples.

Uninstall Chainer

Use pip to uninstall Chainer:

$ pip uninstall chainer

Note

When you upgrade Chainer, pip sometimes installs the new version without removing the old one from site-packages. In this case, pip uninstall only removes the latest one. To ensure that Chainer is completely removed, run the above command repeatedly until pip returns an error.

Upgrade Chainer

Just use pip with -U option:

$ pip install -U chainer

Reinstall Chainer

If you want to reinstall Chainer, uninstall it first and then install it again. We recommend using the --no-cache-dir option, as pip sometimes uses a cached package:

$ pip uninstall chainer
$ pip install chainer --no-cache-dir

Run Chainer with Docker

We provide an official Docker image. Use the nvidia-docker command to run the Chainer image with GPU support. You can log in to the environment with bash and run the Python interpreter:

$ nvidia-docker run -it chainer/chainer /bin/bash

Or run the interpreter directly:

$ nvidia-docker run -it chainer/chainer /usr/bin/python

FAQ

Warning message “cuDNN is not enabled” appears

This means CuPy was built without cuDNN. If you don’t need cuDNN, ignore this message. Otherwise, retry installing CuPy with cuDNN; the pip install -vvvv option helps you diagnose the build. There is no need to re-install Chainer itself. See CuPy’s installation guide for more details.

CuPy always raises cupy.cuda.compiler.CompileException

See FAQ section of CuPy’s installation guide for details.

h5py installation failed

If the installation failed with an error saying that hdf5.h is not found, you need to install libhdf5 first. How to install it depends on your environment:

# Ubuntu 14.04/16.04
$ apt-get install libhdf5-dev

# CentOS 7
$ yum -y install epel-release
$ yum install hdf5-devel

Note that h5py is not required unless you need HDF5 serialization support.

ChainerX Documentation

Warning

This feature is still in the earliest stage of its development. The behavior and interface are subject to change.

ChainerX is an ndarray implementation with Define-by-Run automatic differentiation capability. It roughly corresponds to “NumPy/CuPy + Chainer Variable”, with the following additional advantages:

  • Speed: The whole ndarray and autograd implementation is written in C++, with a thin Python binding. It lowers the overhead existing in the pure Python implementation of Chainer.

  • Extensibility: The backend is pluggable, making it much easier to add support for new devices.

The speed is best achieved by directly using ChainerX APIs, while it also provides a compatibility layer through the conventional chainer.Variable interface for easier adoption of ChainerX in existing projects. See ChainerX Tutorial for more details.

Installation

ChainerX, or chainerx, can be installed as a top level Python package along with Chainer by configuring the environment variables below.

Note

Chainer must currently be installed from source in order to include ChainerX, but this is expected to change in the near future.

Installing from source

The following environment variables are available for building ChainerX from source.

  • CHAINER_BUILD_CHAINERX: 1 to build the chainerx package along with chainer. 0 to skip. Default is 0.

  • CHAINERX_BUILD_CUDA: 1 to build chainerx with CUDA support. 0 to skip. Default is 0. See also the CUDA support section below.

  • CHAINERX_ENABLE_BLAS: 1 to enable BLAS, 0 to disable it. Default is 1. If BLAS is enabled, it is searched for and used if found. If not found, ChainerX behaves as if BLAS were disabled and uses a basic implementation instead.

  • CHAINERX_ENABLE_LAPACK: 1 to enable LAPACK, 0 to disable it. Default is 1. If LAPACK is enabled, it is searched for and used if found. If not found, ChainerX behaves as if LAPACK were disabled, which may cause runtime errors.

Simply run pip install chainer after configuring the above environment variables. See Examples below.

CUDA support

When installing with CUDA support, you also need to specify the cuDNN installation path.

You can set either of the following environment variables to specify where to look for the cuDNN installation.

  • CUDNN_ROOT_DIR: Path to your cuDNN installation.

  • CHAINERX_CUDNN_USE_CUPY: 1 to search for the cuDNN library and include files in an existing CuPy installation. Only applicable to CuPy installed via a wheel (binary) distribution. Other variables related to cuDNN paths (such as CUDNN_ROOT_DIR) are ignored. Be warned that the resulting executable will be invalidated if CuPy is uninstalled, moved or replaced.

To support the NumPy/CuPy fallback mechanism, ChainerX with CUDA support currently requires CuPy to be installed alongside it.

Examples

Install ChainerX without CUDA support:

$ export CHAINER_BUILD_CHAINERX=1
$ export MAKEFLAGS=-j8  # Using 8 parallel jobs.
$ pip install chainer

Install ChainerX depending on CuPy wheel distribution:

$ pip install cupy_cuda101  # Note: Choose the proper CUDA SDK version number.
$ export CHAINER_BUILD_CHAINERX=1
$ export CHAINERX_BUILD_CUDA=1
$ export CHAINERX_CUDNN_USE_CUPY=1
$ export MAKEFLAGS=-j8  # Using 8 parallel jobs.
$ pip install chainer

Install ChainerX with CuPy built from source:

$ export CHAINER_BUILD_CHAINERX=1
$ export CHAINERX_BUILD_CUDA=1
$ export CUDNN_ROOT_DIR=path/to/cudnn
$ export MAKEFLAGS=-j8  # Using 8 parallel jobs.
$ pip install cupy
$ pip install chainer

ChainerX Tutorial

ChainerX, or chainerx, is meant to be a drop-in replacement for NumPy and CuPy, with additional operations specific to neural networks. As its core is implemented in C++, you can reduce the Python overhead for both the forward and backward passes compared to Chainer, speeding up your training and inference. This section guides you through the essential Chainer APIs for utilizing ChainerX, as well as how to use ChainerX on its own.

Introduction to ChainerX

The module chainerx aims to support a NumPy-compatible interface with additional operations specific to neural networks. For instance, it provides chainerx.conv() for N-dimensional convolutions and chainerx.batch_norm() for batch normalization. Additionally, and most importantly, the array in ChainerX, chainerx.ndarray, distinguishes itself from NumPy and CuPy arrays in the following two aspects.

Automatic differentiation

Graph construction and backpropagation is built into the array, meaning that any function, including the NumPy-like functions, can be backpropagated through. In Chainer terms, it is a NumPy/CuPy array with chainer.Variable properties.

Device agnostic

Arrays can be allocated on any device belonging to any backend, in contrast to NumPy/CuPy arrays which are implemented for specific computing platforms (i.e. CPUs/GPUs respectively).

These differences are explained in more detail in the sections below.

The array chainerx.ndarray

The following example demonstrates how you can create an array and access its most basic attributes. Note that the APIs are identical to those of NumPy and CuPy. Other array creation routines, including chainerx.ones(), chainerx.ones_like() and chainerx.random.normal(), are all listed in the reference below.

import chainerx as chx

x = chx.array([[0, 1, 2], [3, 4, 5]], dtype=chx.float32)

x.shape  # (2, 3)
x.dtype  # dtype('float32')
x.size  # 6
x.ndim  # 2
Backends and devices

Chainer distinguishes between CPU and GPU arrays using NumPy and CuPy, but ChainerX arrays may be allocated on any device of any backend. You can specify the device during instantiation or transfer the array to a different device after it has been created.

x = chx.array([1, 2, 3])
x.device  # native:0

x = chx.array([1, 2, 3], device='cuda:0')
x.device  # cuda:0

x = x.to_device('cuda:1')
x.device  # cuda:1

The left-hand side of the colon shows the name of the backend to which the device belongs. native in this case refers to the CPU, and cuda to CUDA GPUs. The integer on the right-hand side is the device index. Together, they uniquely identify a physical device on which an array is allocated.

If you do not want to specify the device each time you create an array, it is possible to change the default device with chainerx.using_device().

with chx.using_device('cuda:0'):
    x = chx.array([1, 2, 3])
x.device  # cuda:0

Note

Currently, two backends are built into ChainerX.

  1. The native backend, which is built by default.

  2. The cuda backend which is optional (See installation).

This backend abstraction allows developers to implement their own backends and plug them into ChainerX to perform computations on basically any other platform.

Array operations and backpropagation

Arrays support basic arithmetic and can be passed to functions just as you would expect. By marking an array as requiring gradients with chainerx.ndarray.require_grad(), further computations involving that array construct a computational graph, allowing backpropagation directly from the array. The following code shows how you could implement an affine transformation and backpropagate through it to compute the gradients of the output w.r.t. the weight and bias.

x = chx.ones(784, dtype=chx.float32)
W = chx.random.normal(size=(784, 1000)).astype(chx.float32).require_grad()
b = chx.random.normal(size=(1000)).astype(chx.float32).require_grad()

y = x.dot(W) + b

y.grad = chx.ones_like(y)  # Initial upstream gradients, i.e. `grad_outputs`.
y.backward()

assert type(W.grad) is chx.ndarray
assert type(b.grad) is chx.ndarray

Note

The code above is device agnostic, meaning that you can execute it on any backend by simply wrapping the code with a chainerx.using_device().

Relation to Chainer

A chainerx.ndarray can be wrapped in a chainer.Variable and passed to any existing Chainer code.

var = ch.Variable(x)  # x is a chainerx.ndarray.

# Your Chainer code...

When further applying functions to the var, the computational graph is recorded in the underlying C++ ndarray implementation, not in the chainer.Variable or chainer.FunctionNode as in conventional Chainer. This eliminates the heavy Python overhead of graph construction. Similarly, calling chainer.Variable.backward() on any resulting variable delegates the work to C++ by calling chainerx.ndarray.backward(), spending no time in the Python world.

NumPy/CuPy fallback

As the features above require ChainerX to provide an implementation corresponding to every chainer.FunctionNode implementation in Chainer, ChainerX utilizes a fallback mechanism while gradually extending the support. This approach is taken because the integration with Chainer takes time and we do not want existing Chainer users to have to make severe changes to their code bases in order to try ChainerX. The fallback logic simply casts the chainerx.ndarrays inside the chainer.Variable to numpy.ndarrays or cupy.ndarrays (without copy) and calls the forward and backward methods respectively.

Run your Chainer code with ChainerX

In order to utilize chainerx, you first need to transfer your model to a ChainerX device using chainer.Link.to_device(). This is a new method that has been introduced to replace chainer.Link.to_cpu() and chainer.Link.to_gpu(), extending device transfer to arbitrary devices. Similarly, you have to transfer the data (chainer.Variables) to the same device before feeding them to the model.
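
Below is a minimal sketch of such a transfer. The model and input here are placeholders made up for illustration; 'native:0' refers to the ChainerX native (CPU) backend, and chainer.get_device() / Device.send() are assumed to resolve and transfer to that device:

import numpy as np
import chainer
import chainer.links as L

model = L.Linear(784, 10)                 # any chainer.Link works the same way
device = chainer.get_device('native:0')   # a ChainerX device specifier
model.to_device(device)                   # replaces to_cpu()/to_gpu()

x = device.send(np.zeros((1, 784), dtype=np.float32))  # move the data too
y = model(x)                              # forward pass now runs on ChainerX arrays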

Will my FunctionNode work with ChainerX?

Our expectation is that it should work because of the fallback mechanism explained above, but in practice you may need some occasional fixes, depending on how the function was implemented. Also, you will not see any performance improvements from the fallback (but most likely a degradation because of the additional conversions).

To support ChainerX with your chainer.FunctionNode, you need to implement chainer.FunctionNode.forward_chainerx() with the same signature as chainer.FunctionNode.forward(), but where given inputs are of type chainerx.ndarray. It is expected to return a tuple just like chainer.FunctionNode.forward().

The example below shows how chainer.functions.matmul() is extended to support ChainerX. Note that chainer.Fallback can be returned in case the function cannot be implemented using ChainerX functions. This is also the default behavior in case the method is not implemented at all.

class MatMul(function_node.FunctionNode):

    def forward_chainerx(self, x):
        a, b = x
        if self.transa or self.transb or self.transc:
            return chainer.Fallback
        if a.dtype != b.dtype:
            return chainer.Fallback
        if a.ndim != 2 or b.ndim != 2:
            return chainer.Fallback
        if self.dtype is not None and self.dtype != a.dtype:
            return chainer.Fallback
        return chainerx.dot(a, b),  # Fast C++ implementation

Limitations

There are some non-obvious limitations in ChainerX:

  • ChainerX only supports a limited set of dtypes: bool_, int8, int16, int32, int64, uint8, float32 and float64.

  • Operations with mixed dtypes are not supported. You need to explicitly convert dtypes using either chainerx.astype() or F.cast().

  • Python’s true division, where 2/3 returns 0.66 rather than 0, is not supported yet. Given an ndarray a of dtype int32, a / a does not return an array of float64 but an array of int32.

  • Only a limited set of Chainer functions are well tested with the ChainerX integration.

  • ChainerX CUDA backend requires cuDNN. See installation for details.

  • As ChainerX arrays carry a computational graph of their own, some operations are prohibited for safety:

    • Unless an array is free from the computational graph, in-place modification of its data is prohibited.

      a = chainerx.zeros((2,), chainerx.float32)
      a.require_grad()  # install the computational graph on `a`.
      
      a += 1  # ! error
      

      The reason for this limitation is that backward operations may depend on the value of a, so the backward gradients might be unexpectedly affected if it were altered.

      You may circumvent this limitation by making a disconnected view:

      # A memory-shared view of `a` which is disconnected from the computational graph of `a`.
      b = a.as_grad_stopped()
      
      b += 1
      

      Note, however, that this operation is inherently dangerous. You should be very careful to ensure that it does not affect backward computations.

      Note also that this may be restricted further in the future, so that even in-place modification of a disconnected view is only allowed when it is actually safe.

    • If an array is wrapped with a Variable with requires_grad=True (which is default), you won’t be able to re-assign the array:

      a = chainerx.zeros((2,), chainerx.float32)
      b = chainerx.zeros((2,), chainerx.float32)
      var = chainer.Variable(a)
      
      var.array = b  # ! error
      

      You may circumvent this by using in-place assignment on var.array:

      var.array[:] = b
      

      This workaround may also be dangerous just as in the previous limitation.

Reference

Multi-Dimensional Array (ndarray)

chainerx.ndarray

Multi-dimensional array, the central data structure of ChainerX.

Utility functions

chainerx.to_numpy

Converts a ChainerX array to a NumPy array.

Array Operations

Array creation routines

chainerx.empty

Returns an array without initializing the elements.

chainerx.empty_like

Returns a new array with same shape and dtype of a given array.

chainerx.eye

Returns a 2-D array with ones on the diagonals and zeros elsewhere.

chainerx.identity

Returns a 2-D identity array.

chainerx.ones

Returns a new array of given shape and dtype, filled with ones.

chainerx.ones_like

Returns an array of ones with same shape and dtype as a given array.

chainerx.zeros

Returns a new array of given shape and dtype, filled with zeros.

chainerx.zeros_like

Returns an array of zeros with same shape and dtype as a given array.

chainerx.full

Returns a new array of given shape and dtype, filled with a given value.

chainerx.full_like

Returns a full array with same shape and dtype as a given array.

chainerx.array

Creates an array.

chainerx.asarray

Converts an object to an array.

chainerx.asanyarray

Converts an object to an array.

chainerx.ascontiguousarray

Returns a C-contiguous array.

chainerx.copy

Creates a copy of a given array.

chainerx.frombuffer

Returns a 1-D array interpretation of a buffer.

chainerx.fromfile

Constructs an array from data in a text or binary file.

chainerx.fromfunction

Constructs an array by executing a function over each coordinate.

chainerx.fromiter

Constructs a new 1-D array from an iterable object.

chainerx.fromstring

Constructs a new 1-D array initialized from text data in a string.

chainerx.loadtxt

Constructs an array by loading data from a text file.

chainerx.arange

Returns an array with evenly spaced values within a given interval.

chainerx.linspace

Returns an array with evenly spaced numbers over a specified interval.

chainerx.diag

Returns a diagonal or a diagonal array.

chainerx.diagflat

Creates a diagonal array from the flattened input.

chainerx.meshgrid

Returns coordinate matrices from coordinate vectors.

chainerx.tri

Returns a 2-D array with ones at and below the given diagonal and zeros elsewhere.

chainerx.tril

Lower triangle of an array.

chainerx.triu

Upper triangle of an array.

Activation functions

chainerx.log_softmax

The log of the softmax of input array.

chainerx.tanh

Element-wise hyperbolic tangent function.

chainerx.relu

Rectified Linear Unit function.

chainerx.sigmoid

Element-wise sigmoid logistic function.

chainerx.slstm

S-LSTM units as an activation function.

chainerx.tree_lstm

TreeLSTM unit as an activation function.

Array manipulation routines

chainerx.reshape

Returns a reshaped array.

chainerx.ravel

Returns a flattened array.

chainerx.transpose

Permutes the dimensions of an array.

chainerx.broadcast_to

Broadcasts an array to a given shape.

chainerx.squeeze

Removes size-one axes from the shape of an array.

chainerx.asarray

Converts an object to an array.

chainerx.ascontiguousarray

Returns a C-contiguous array.

chainerx.concatenate

Joins arrays along an axis.

chainerx.stack

Stacks arrays along a new axis.

chainerx.hstack

Stack arrays in sequence horizontally (column wise).

chainerx.vstack

Stack arrays in sequence vertically (row wise).

chainerx.dstack

Stack arrays in sequence depth wise (along third axis).

chainerx.atleast_2d

View inputs as arrays with at least two dimensions.

chainerx.atleast_3d

View inputs as arrays with at least three dimensions.

chainerx.split

Splits an array into multiple sub arrays along a given axis.

chainerx.dsplit

Split array into multiple sub-arrays along the 3rd axis (depth).

chainerx.vsplit

Splits an array into multiple sub-arrays vertically (row-wise).

chainerx.hsplit

Split an array into multiple sub-arrays horizontally (column-wise).

chainerx.swapaxes

Interchange two axes of an array.

chainerx.repeat

Constructs an array by repeating a given array.

chainerx.expand_dims

Expand the shape of an array.

chainerx.flip

Reverse the order of elements in an array along the given axis.

chainerx.fliplr

Flip array in the left/right direction.

chainerx.flipud

Flip array in the up/down direction.

chainerx.moveaxis

Move axes of an array to new positions.

Evaluation routines

chainerx.accuracy

Computes multiclass classification accuracy of the minibatch.

Indexing routines

chainerx.take

Takes elements from an array along an axis.

chainerx.where

Return elements chosen from x or y depending on condition.

chainerx.nonzero

Return the indices of the elements that are non-zero.

Linear algebra

chainerx.dot

Returns a dot product of two arrays.

chainerx.linalg.cholesky

Computes the Cholesky decomposition of a matrix.

chainerx.linalg.qr

Compute the qr factorization of a matrix.

chainerx.linalg.svd

Singular Value Decomposition.

chainerx.linalg.eigh

Compute the eigenvalues and eigenvectors of a real symmetric matrix.

chainerx.linalg.eigvalsh

Compute the eigenvalues of a real symmetric matrix.

chainerx.linalg.solve

Solves a linear matrix equation, or system of linear scalar equations.

chainerx.linalg.inv

Computes the inverse of a matrix.

chainerx.linalg.pinv

Compute the (Moore-Penrose) pseudo-inverse of a matrix.

Logic functions

chainerx.all

Test whether all array elements along a given axis evaluate to True.

chainerx.any

Test whether any array element along a given axis evaluates to True.

chainerx.isinf

Test element-wise for positive or negative infinity.

chainerx.isnan

Test element-wise for NaN and return result as a boolean array.

chainerx.logical_and

Returns an array of x1 AND x2 element-wise.

chainerx.logical_or

Returns an array of x1 OR x2 element-wise.

chainerx.logical_xor

Returns an array of x1 XOR x2 element-wise.

chainerx.logical_not

Returns an array of NOT x element-wise.

chainerx.greater

Returns an array of (x1 > x2) element-wise.

chainerx.greater_equal

Returns an array of (x1 >= x2) element-wise.

chainerx.less

Returns an array of (x1 < x2) element-wise.

chainerx.less_equal

Returns an array of (x1 <= x2) element-wise.

chainerx.equal

Returns an array of (x1 == x2) element-wise.

chainerx.not_equal

Returns an array of (x1 != x2) element-wise.

Loss functions

chainerx.absolute_error

Element-wise absolute error function.

chainerx.squared_error

Element-wise squared error function.

chainerx.huber_loss

Element-wise Huber loss.

chainerx.gaussian_kl_divergence

Element-wise KL-divergence of Gaussian variables from the standard one.

chainerx.sigmoid_cross_entropy

Element-wise cross entropy loss for pre-sigmoid activations.

chainerx.softmax_cross_entropy

Element-wise cross entropy loss for pre-softmax activations.

Mathematical functions

chainerx.negative

Numerical negative, element-wise.

chainerx.add

Add arguments, element-wise.

chainerx.subtract

Subtract arguments, element-wise.

chainerx.multiply

Multiply arguments, element-wise.

chainerx.divide

Divide arguments, element-wise.

chainerx.mod

Return element-wise remainder of division.

chainerx.remainder

Return element-wise remainder of division.

chainerx.sum

Sum of array elements over a given axis.

chainerx.maximum

Maximum arguments, element-wise.

chainerx.minimum

Minimum arguments, element-wise.

chainerx.exp

Numerical exponential, element-wise.

chainerx.log

Natural logarithm, element-wise.

chainerx.log10

Base 10 logarithm, element-wise.

chainerx.log2

Base 2 logarithm, element-wise.

chainerx.log1p

Natural logarithm of one plus the input, element-wise.

chainerx.logsumexp

The log of the sum of exponentials of input array.

chainerx.log_softmax

The log of the softmax of input array.

chainerx.sqrt

Non-negative square-root, element-wise

chainerx.sin

Sine, element-wise

chainerx.cos

Cosine, element-wise

chainerx.tan

Tangent, element-wise

chainerx.arcsin

Inverse sine, element-wise

chainerx.arccos

Trigonometric inverse cosine, element-wise

chainerx.arctan

Trigonometric inverse tangent, element-wise

chainerx.arctan2

Element-wise arc tangent of \(\frac{x_1}{x_2}\) choosing the quadrant correctly.

chainerx.sinh

Hyperbolic Sine, element-wise

chainerx.cosh

Hyperbolic Cosine, element-wise

chainerx.tanh

Element-wise hyperbolic tangent function.

chainerx.arcsinh

Inverse hyperbolic sine, element-wise

chainerx.arccosh

Inverse hyperbolic cosine, element-wise

chainerx.square

Returns the element-wise square of the input.

chainerx.clip

Clips the values of an array to a given interval.

chainerx.fabs

Compute the absolute values element-wise.

chainerx.sign

Returns an element-wise indication of the sign of a number.

chainerx.ceil

Return the ceiling of the input, element-wise.

chainerx.floor

Return the floor of the input, element-wise.

chainerx.bitwise_and

Compute the bit-wise AND of two arrays element-wise.

chainerx.bitwise_or

Compute the bit-wise OR of two arrays element-wise.

chainerx.bitwise_xor

Compute the bit-wise XOR of two arrays element-wise.

chainerx.left_shift

Shift the bits of an integer to the left.

chainerx.right_shift

Shift the bits of an integer to the right.

Random sampling

chainerx.random.normal

Draws random samples from a normal (Gaussian) distribution.

chainerx.random.uniform

Draws samples from a uniform distribution.

Sorting, searching, and counting

chainerx.argmax

Returns the indices of the maximum along an axis.

chainerx.argmin

Returns the indices of the minimum along an axis.

Statistics

chainerx.amax

Returns the maximum of an array or the maximum along an axis.

chainerx.mean

Compute the arithmetic mean along the specified axis.

chainerx.var

Compute the variance along the specified axis.

Connection

chainerx.conv

N-dimensional convolution.

chainerx.conv_transpose

N-dimensional transposed convolution.

chainerx.linear

Linear function, or affine transformation.

chainerx.lstm

Long Short-Term Memory units as an activation function.

Normalization

chainerx.batch_norm

Batch normalization function.

chainerx.fixed_batch_norm

Batch normalization function with fixed statistics.

Pooling

chainerx.max_pool

Spatial max pooling function.

chainerx.average_pool

Spatial average pooling function.

RNN

chainerx.n_step_lstm

Stacked Uni-directional Long Short-Term Memory function.

chainerx.n_step_bilstm

Stacked Bi-directional Long Short-Term Memory function.

chainerx.n_step_gru

Stacked Uni-directional Gated Recurrent Unit function.

chainerx.n_step_bigru

Stacked Bi-directional Gated Recurrent Unit function.

chainerx.n_step_rnn

Stacked Uni-directional RNN function for sequence inputs.

chainerx.n_step_birnn

Stacked Bi-directional RNN function for sequence inputs.

Context

chainerx.Context

An isolated execution environment of ChainerX.

Backend and Device

ChainerX adds a level of abstraction between the higher level array operations and the lower level computations and resource management. This abstraction is managed by the Backend and the Device classes. Native (CPU) and CUDA backends are two concrete implementations currently provided by ChainerX but the abstraction allows you to plug any backend into the framework.

Backend

chainerx.Backend

Pluggable entity that abstracts various computing platforms.

chainerx.get_backend

Returns a backend specified by the name.

Device

chainerx.Device

Represents a physical computing unit.

chainerx.get_device

Returns a device specified by the arguments.

chainerx.get_default_device

Returns the default device associated with the current thread.

chainerx.set_default_device

Sets the given device as the default device of the current thread.

chainerx.using_device

Creates a context manager to temporarily set the default device.
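
For example (a small sketch using only the functions listed above; 'native:0' is the default CPU device):

import chainerx as chx

backend = chx.get_backend('native')   # the native (CPU) backend
device = chx.get_device('native:0')   # a device belonging to that backend

chx.set_default_device('native:0')
a = chx.zeros((2, 3), chx.float32)
print(a.device)                       # native:0
print(chx.get_default_device())       # native:0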

Utilities for Backpropagation

chainerx.backward

Runs backpropagation.

chainerx.no_backprop_mode

Creates a context manager which temporarily disables backpropagation.

chainerx.force_backprop_mode

Creates a context manager which temporarily enables backpropagation.

chainerx.is_backprop_required

Returns whether the backpropagation is enabled in the current thread.
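
For example, chainerx.no_backprop_mode() can be used to temporarily disable graph construction (a sketch; the assertions only check the thread-local flag listed above):

import chainerx as chx

x = chx.ones((3,), dtype=chx.float32).require_grad()

with chx.no_backprop_mode():
    assert not chx.is_backprop_required()
    y = x * 2                      # no graph is recorded inside this context

assert chx.is_backprop_required()  # backprop is enabled again outside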

Contribution Guide

This is a guide for contributors to ChainerX, which is mostly implemented in C++. It describes how to build the project and how to run the test suite so that you can get started contributing.

Note

Please refer to the Chainer Contribution Guide for the more general contribution guidelines that are not specific to ChainerX, e.g. how to download the source code, manage git branches, send pull requests or contribute to Chainer’s Python code base.

Note

There is a public ChainerX Product Backlog.

Building the shared library

You can build the C++ ChainerX project to generate a shared library just like any other cmake project. Run the following commands from the root of the repository to generate chainerx_cc/build/chainerx/libchainerx.so:

$ mkdir chainerx_cc/build
$ cd chainerx_cc/build
$ cmake ..
$ make

CUDA support is enabled by either setting CHAINERX_BUILD_CUDA=1 as an environment variable or specifying -DCHAINERX_BUILD_CUDA=1 in cmake. When building with CUDA support, either the CUDNN_ROOT_DIR environment variable or -DCUDNN_ROOT_DIR is required to locate the cuDNN installation path.

Note

CUDA without cuDNN is currently not supported.

Then, to install the headers and the library, run:

$ make install

You can specify the installation path by passing -DCMAKE_INSTALL_PREFIX=<...> to cmake.

Running the test suite

The test suite is not built by default; it can be built by passing -DCHAINERX_BUILD_TEST=ON to cmake. Once built, run the suite with ctest from within the build directory, as shown below.
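
For example (a sketch, assuming the build directory created in the previous section), configure and build with the flag enabled before invoking ctest:

$ cd chainerx_cc/build
$ cmake -DCHAINERX_BUILD_TEST=ON ..
$ make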

$ cd chainerx_cc/build
$ ctest -V

Coding standards

The ChainerX C++ coding standard is mostly based on the Google C++ Style Guide and principles.

Formatting

ChainerX is formatted using clang-format. To fix the formatting in place, run the following command from the chainerx_cc directory:

$ cd chainerx_cc
$ scripts/run-clang-format.sh --in-place
Lint checking

ChainerX uses cpplint and clang-tidy for lint checking. Note that clang-tidy requires that you have already run cmake. To run cpplint, run scripts/run-cpplint.sh from the chainerx_cc directory:

$ cd chainerx_cc
$ scripts/run-cpplint.sh

To run clang-tidy, run make clang-tidy from the build directory:

$ cd chainerx_cc/build
$ make clang-tidy

Thread sanitizer

The thread sanitizer can be used to detect thread-related bugs, such as data races. To enable the thread sanitizer, pass -DCHAINERX_ENABLE_THREAD_SANITIZER=ON to cmake.

You can run the test with ctest -V as usual and you will get warnings if the thread sanitizer detects any issues.

The CUDA runtime is known to cause a thread leak error as a false alarm. In such a case, disable thread leak detection using the environment variable TSAN_OPTIONS='report_thread_leaks=0'.

Python contributions and unit tests

To test the Python binding, run the following command at the repository root:

$ pytest

The above command runs all the tests in the repository, including Chainer and ChainerMN. To run only ChainerX tests, specify the test directory:

$ pytest tests/chainerx_tests

Run tests with coverage:

$ pytest --cov --no-cov-on-fail --cov-fail-under=80 tests/chainerx_tests

Run tests without CUDA GPU:

$ pytest -m 'not cuda' tests/chainerx_tests

Tips and FAQs

Can I use ChainerX without Chainer?

Yes, it is possible. See the code samples below.

What does the C++ interface look like?

It is almost identical to the Python interface, with a 1-to-1 mapping. The interface is still subject to change, but example code is available.

GPU memory consumption is too high when used with CuPy

Both ChainerX and CuPy use their own GPU memory pools, meaning that GPU memory is not utilized efficiently (unused memory may be held without being freed by either ChainerX or CuPy). You can set the environment variable CHAINERX_CUDA_CUPY_SHARE_ALLOCATOR to 1 before running your script to enable an experimental feature that makes ChainerX and CuPy share the same memory pool, reducing your peak GPU memory usage. You may also invoke chainerx._cuda.cupy_share_allocator instead of setting the environment variable, for the same effect. In this case, it is recommended that you call the function prior to any GPU memory allocation.
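
For example, a minimal sketch of the second option (the call must happen before any GPU memory is allocated):

import chainerx

# Share the GPU memory pool between ChainerX and CuPy (experimental).
chainerx._cuda.cupy_share_allocator()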

Distributed Deep Learning with ChainerMN

ChainerMN enables multi-node distributed deep learning with the following features:

  • Scalable — it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,

  • Flexible — even dynamic neural networks can be trained in parallel thanks to Chainer’s flexibility, and

  • Easy — minimal changes to existing user code are required.

This blog post provides our benchmark results using up to 128 GPUs.

ChainerMN can be used for both intra-node (i.e., multiple GPUs inside a node) and inter-node settings. For inter-node settings, we highly recommend using high-speed interconnects such as InfiniBand.

ChainerMN examples are available on GitHub. These examples are based on the examples of Chainer and the differences are highlighted.

Installation

Installation Guide

Requirements

ChainerMN depends on the following software libraries: CUDA-Aware MPI, NVIDIA NCCL, and a few Python packages including CuPy and MPI4py.

Note

In Chainer v5, ChainerMN became a part of the Chainer package. Installing Chainer (pip install chainer) automatically makes ChainerMN available. Note that you still need to separately install the requirements described below to actually run code using ChainerMN.

Before upgrading from Chainer v4 to v5 or later, make sure to remove the existing chainermn package (pip uninstall chainermn).

CUDA-Aware MPI

ChainerMN relies on MPI. In particular, for efficient communication between GPUs, it uses CUDA-aware MPI. For details about CUDA-aware MPI, see this introduction article. (If you use only the CPU mode, MPI does not need to be CUDA-Aware. See Installation on Non-GPU Environments for more details.)

The CUDA-aware features depend on several MPI packages, which need to be configured and built properly. The following are examples for Open MPI and MVAPICH.

Open MPI (for details, see Open MPI’s official instructions):

$ ./configure --with-cuda
$ make -j4
$ sudo make install

MVAPICH (for details, see Mvapich’s official instructions):

$ ./configure --enable-cuda
$ make -j4
$ sudo make install
$ export MV2_USE_CUDA=1  # Should be set all the time when using ChainerMN
NCCL

Note

If you are installing CuPy using wheels (i.e., pip install cupy-cudaXX where XX is the CUDA version), you don’t have to install NCCL manually. The latest NCCL 2.x library is bundled with CuPy wheels.

See CuPy Installation Guide for the detailed steps to install CuPy.

To enable efficient intra- and inter-node GPU-to-GPU communication, we use NVIDIA Collective Communications Library (NCCL). See NCCL’s official instructions for installation.

ChainerMN requires NCCL even if you have only one GPU per node. The only exception is when you run ChainerMN on CPU-only environments. See Installation on Non-GPU Environments for more details.

Note

We recommend NCCL 2 but NCCL 1 can be used. However, for NCCL 1, PureNcclCommunicator is not supported in ChainerMN. If you use NCCL 1, please properly configure environment variables to expose NCCL both when you install and use ChainerMN. Typical configurations should look like the following:

export NCCL_ROOT=<path to NCCL directory>
export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH

If you change the installed version of NCCL, you have to reinstall CuPy, because ChainerMN uses NCCL through CuPy. See CuPy’s official instructions for reinstallation.

MPI4py

You can install MPI4py by:

$ pip install mpi4py

Please be sure to properly configure environment variables so that MPI is available at installation time, because MPI4py links to the MPI library at installation time. In particular, if you have multiple MPI implementations installed in your environment, please expose the implementation that you want to use both when you install and when you use ChainerMN.

As of writing, MPI4py does not support Open MPI 4.x. Please use versions from the Tested Environments section below.

CuPy

Chainer and ChainerMN rely on CuPy to use GPUs. Please refer to CuPy Installation Guide for the detailed steps to install CuPy.

In most cases it is recommended that you install CuPy using a wheel distribution (precompiled binary) rather than a source distribution. If you are installing from source, the NCCL library must be installed before installing CuPy to enable the NCCL feature in CuPy. Refer to NCCL for the installation steps of the NCCL library. See Check if NCCL is enabled in CuPy, if you want to check whether NCCL is enabled in your CuPy.

Chainer and ChainerMN can be installed without CuPy, in which case the corresponding features are not available. See Installation on Non-GPU Environments for more details.

Tested Environments

We tested ChainerMN on all the following environments.

  • OS

    • Ubuntu 14.04 LTS 64bit

    • Ubuntu 16.04 LTS 64bit

  • Python 2.7.13, 3.5.2, 3.6.1

  • MPI

    • Open MPI 2.1.6, 3.0.4, 3.1.4

  • MPI4py 3.0.0

  • NCCL 2.3.2, 2.4.2

Note

Note that the following versions of Open MPI have some bugs that might cause ChainerMN programs to hang: 3.0.[0-2] and 3.1.[0-2]. For more details, see Open MPI Issue #3972 and Chainer Issue #5740 .

Also, mpi4py does not support Open MPI 4.0.x.

Installation on Non-GPU Environments

Users who want to try ChainerMN in a CPU-only environment may skip the installation of CuPy. A non-GPU setup may not be as performant as a GPU-enabled setup, but it is useful for testing or debugging a training program in non-GPU environments such as laptops or CI jobs.

In this case, MPI does not have to be CUDA-aware. Only the naive communicator works in CPU-only mode.

Step-by-Step Troubleshooting

This section is a step-by-step troubleshooting guide for ChainerMN. Please follow these steps to identify and fix your problem.

We assume that you are using Linux or another Unix-like environment.

Single-node environment
Basic MPI installation

Although ChainerMN stands for “Chainer MultiNode,” it is a good idea to start with single-node execution. First of all, you need MPI. If MPI is correctly installed, you will see the mpicc and mpiexec commands in your PATH.

Below is an example of the output from MVAPICH on Linux:

$ which mpicc
/usr/local/bin/mpicc

$ mpicc -show
gcc -I/usr/local/include ...(snip)... -lmpi

$ which mpiexec
/usr/local/bin/mpiexec

$ mpiexec --version
HYDRA build details:
Version:                                 3.1.4
Release Date:                            Wed Sep  7 14:33:43 EDT 2016
CC:                              gcc
CXX:                             g++
F77:
F90:
Configure options:  (snip)
Process Manager:                         pmi
Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available:            hwloc
Resource management kernels available:   user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available:                 poll select

If you see any errors in the above commands, please go back to the CUDA-Aware MPI section and check your MPI installation.

Check what MPI you are using

In the CUDA-Aware MPI section, we mention both Open MPI and MVAPICH. If MPI is provided by the system administrator and you are not sure which MPI you are using, check the output of mpiexec --version.

  • If the output contains HYDRA, then it’s MVAPICH (or possibly MPICH).

  • If the output contains OpenRTE, then it’s Open MPI.

However, in such a case, you should make sure that the MPI is CUDA-aware, as mentioned below. We recommend building your own MPI.

Check if MPI is CUDA-aware

Your MPI must be configured as CUDA-aware. You can use the following C program to check it.

/* check_cuda_aware.c */
#include <assert.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define CUDA_CALL(expr) do {                  \
  cudaError_t err;                            \
  err = expr;                                 \
  assert(err == cudaSuccess);                 \
} while(0)

int main(int argc, char **argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *sendbuf_d = NULL;
  int *recvbuf_d = NULL;

  CUDA_CALL(cudaMalloc((void**)&sendbuf_d, sizeof(int)));
  CUDA_CALL(cudaMalloc((void**)&recvbuf_d, sizeof(int)));
  CUDA_CALL(cudaMemcpy(sendbuf_d, &rank, sizeof(int), cudaMemcpyDefault));

  MPI_Reduce(sendbuf_d, recvbuf_d, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    int sum = -1;
    CUDA_CALL(cudaMemcpy(&sum, recvbuf_d, sizeof(int), cudaMemcpyDefault));
    if (sum == (size-1) * size / 2) {
      printf("OK.\n");
    } else {
      printf("Error.\n");
    }
  }

  cudaFree(sendbuf_d);
  cudaFree(recvbuf_d);

  MPI_Finalize();
}

Save the code to a file named check_cuda_aware.c. You can compile and run it with the following commands:

$ export MPICH_CC=nvcc  # if you use Mvapich
$ export OMPI_CC=nvcc   # if you use Open MPI
$ $(mpicc -show check_cuda_aware.c -arch sm_53 | sed -e 's/-Wl,/-Xlinker /g' | sed -e 's/-pthread/-Xcompiler -pthread/')
$ ./a.out
OK.

If the program prints OK., your MPI is correctly configured.

Check mpi4py

Next, let’s check that mpi4py is correctly installed. You can use the following script to check it:

# coding: utf-8
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

for i in range(size):
  if i == rank:
    print("{} {}".format(os.uname()[1], i))
  comm.Barrier()

Save the script into a file named check_mpi4py.py and run it. The output from the script should look like this:

$ mpiexec -np 4 python check_mpi4py.py
host00 0
host00 1
host00 2
host00 3

The script prints the hostname and rank (process id in MPI) of each MPI process in a sequential manner. host00 is the hostname of the machine on which you are running the processes. If you get output like the following, it indicates that something is wrong with your installation:

# Wrong output !
$ mpiexec -n 4 python check_mpi4py.py
host00 0
host00 0
host00 0
host00 0

A common problem is that the mpicc used to build mpi4py and the mpiexec used to run the script are from different MPI installations.

Finally, run pytest to check that the single-node configuration is ready:

$ git clone git@github.com:chainer/chainer.git
Cloning into 'chainer'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 168242 (delta 1), reused 2 (delta 0), pack-reused 168235
Receiving objects: 100% (168242/168242), 41.15 MiB | 1.65 MiB/s, done.
Resolving deltas: 100% (123696/123696), done.
Checking connectivity... done.
$ cd chainer/
$ pytest tests/chainermn_tests/
......S.S...S.S...S.S...S.S.........SS
----------------------------------------------------------------------
Ran 38 tests in 63.083s

OK (SKIP=10)
Check if NCCL is enabled in CuPy

CuPy must be built with NCCL enabled. You can check it with the following command:

$ python -c 'from cupy.cuda import nccl'

If you get output like the following, NCCL is not enabled in CuPy. Please check CuPy’s installation guide:

Traceback (most recent call last):

 File "<string>", line 1, in <module>

ImportError: cannot import name 'nccl'
Multi-node environment
Check SSH connection and environment variables

To use ChainerMN on multiple hosts, you need to be able to connect to the computing hosts, including the one you are currently logged in to, via ssh without password authentication (and preferably without specifying a username):

$ ssh host00 'hostname'
host00   # without hitting the password

$ ssh host01 'hostname'
host01   # without hitting the password

...

You may get a message like this:

The authenticity of host 'host01 (xxx.xxx.xxx.xxx)' can't be established.
ECDSA key fingerprint is SHA256:haGUMcCeC5A8lGh1lpjpwL5dF4xCglZArhhxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)?

This message appears when you log in to a host for the first time. Just type yes and the message won’t appear again. You need to repeat this process for all computing hosts.

Also, you need to pay attention to the environment variables on the remote hosts. The MPI runtime connects to the remote hosts in non-interactive mode, and the environment variables may differ from those of your interactive login sessions:

$ ssh host00 'env' | grep LD_LIBRARY_PATH
# Check the values and compare it to the local value.

$ ssh host01 'env' | grep LD_LIBRARY_PATH
# Check the values and compare it to the local value.

...

In particular, check the following variables, which are critical to executing MPI programs:

  • PATH

  • LD_LIBRARY_PATH

  • MV2_USE_CUDA (if you use MVAPICH)

  • MV2_SMP_USE_CMA (if you use MVAPICH)

In addition, you need to make sure that the same mpiexec binary is used to run MPI programs:

$ ssh host00 'which mpiexec'
/usr/local/bin/mpiexec

$ ssh host01 'which mpiexec'
/usr/local/bin/mpiexec

All the commands should give the same mpiexec binary path.

Program files and data

When you run MPI programs, all hosts must have the same Python binary and script files in the same path. First, check that the python binary and its version are identical among the hosts. Be careful if you are using pyenv or Anaconda:

$ ssh host00 'which python; python --version'
/home/username/.pyenv/shims/python
Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

$ ssh host01 'which python; python --version'
/home/username/.pyenv/shims/python
Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

...

Also, the script file (and possibly data files) must be in the same path on each host.

$ ls yourscript.py  # in the current directory
yourscript.py

$ ssh host00 "ls $PWD/yourscript.py"
/home/username/your/dir/yourscript.py

$ ssh host01 "ls $PWD/yourscript.py"
/home/username/your/dir/yourscript.py

...

If you are using NFS, everything should be okay. If not, you need to transfer all the necessary files manually.

In particular, when you run the ImageNet example in ChainerMN repository, all data files must be available on all computing hosts.

hostfile

The next step is to create a hostfile. A hostfile is a list of hosts on which MPI processes run:

$ vi hostfile
$ cat hostfile
host00
host01
host02
host03

Then, you can run your MPI program using the hostfile. To check if the MPI processes run over multiple hosts, save the following script to a file and run it via mpiexec:

# print_rank.py
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

for i in range(size):
  if i == rank:
    print("{} {}".format(os.uname()[1], i))
  comm.Barrier()

If you get output like the following, it is working correctly:

$ mpiexec -n 4 --hostfile hostfile python print_rank.py
host00 0
host01 1
host02 2
host03 3

If you have multiple GPUs, you may want to run multiple processes on each host. You can modify the hostfile to specify the number of processes to run on each host:

# If you are using Mvapich:
$ cat hostfile
host00:4
host01:4
host02:4
host03:4

# If you are using Open MPI
$ cat hostfile
host00 cpu=4
host01 cpu=4
host02 cpu=4
host03 cpu=4

With this hostfile, try running mpiexec again:

$ mpiexec -n 8 --hostfile hostfile python print_rank.py
host00 0
host00 1
host00 2
host00 3
host01 4
host01 5
host01 6
host01 7

You will find that the first 4 processes run on host00 and the remaining 4 on host01.

You can also specify computing hosts and resource mapping/binding using command-line options of mpiexec. Please refer to the MPI manual for more advanced use of the mpiexec command.

If you get a runtime error

If you get the following error messages, please check the specified section of the troubleshooting or installation guide.

[hostxxx:mpi_rank_0][MPIDI_CH3I_SMP_init] CMA is not available. Set MV2_SMP_USE_CMA=0 to disable CMA.
[cli_0]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error, error stack:
MPIR_Init_thread(514)....:
MPID_Init(365)...........: channel initialization failed
MPIDI_CH3_Init(404)......:
MPIDI_CH3I_SMP_Init(2132): process_vm_readv: Operation not permitted


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 20327 RUNNING AT hostxxx
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

-> Check the value of MV2_SMP_USE_CMA (see CUDA-Aware MPI and Check SSH connection and environment variables).

[hostxx:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 20643 RUNNING AT hostxx
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

-> Check the value of MV2_USE_CUDA (see CUDA-Aware MPI and Check SSH connection and environment variables)

Tutorial

Overview

Data Parallelism

ChainerMN employs the data parallel approach for distributed training. In the data parallel approach, each worker owns a copy of the model and computes a gradient from its own batch. The workers then collaborate to update the model using the gradients of all workers.

_images/parallelism.png
Training Iterations

What ChainerMN does for distributed training is actually quite simple. Let us look at what we do in each iteration. The following figure illustrates an iteration of standard training using Chainer (without ChainerMN). It consists of three steps: forward, backward and optimize.

_images/iteration_chainer.png

When using ChainerMN, an additional step all-reduce is inserted after the backward step. In this step, workers communicate to obtain the averaged gradient over gradients of all workers. Then, the aggregated gradient is used to improve the model in the optimization step.

_images/iteration_chainermn.png
MPI

ChainerMN is built on MPI. MPI invokes our training script in an SPMD (single program, multiple data) fashion. ChainerMN is designed to create one process per GPU. For example, suppose you have two nodes with four GPUs each and want to run train_imagenet.py. Then you would invoke eight Python processes running train_imagenet.py by using mpiexec or mpirun.

Step 1: Communicators and Optimizers

In the following, we explain how to modify your code using Chainer to enable distributed training with ChainerMN. We take Chainer’s MNIST example and modify it in a step-by-step manner to see the standard way of using ChainerMN.

Creating a Communicator

We first need to create a communicator. A communicator is in charge of communication between workers. A communicator can be created as follows:

comm = chainermn.create_communicator()

Workers in a node have to use different GPUs. For this purpose, the intra_rank property of communicators is useful. Each worker in a node is assigned a unique intra_rank starting from zero. Therefore, it is often convenient to use the intra_rank-th GPU.

The following line of code is found in the original MNIST example:

chainer.cuda.get_device_from_id(args.gpu).use()

which we modify as follows:

device = comm.intra_rank
chainer.cuda.get_device_from_id(device).use()
Creating a Multi-Node Optimizer

This is the most important step. We need to insert the communication right after backprop and right before optimization. In ChainerMN, it is done by creating a multi-node optimizer.

The method create_multi_node_optimizer receives a standard Chainer optimizer and returns a new optimizer, called a multi-node optimizer. It behaves exactly the same as the supplied original optimizer (e.g., you can add hooks such as WeightDecay; see the sketch below), except that it communicates model parameters and gradients properly in a multi-node setting.

The following is the code line found in the original MNIST example:

optimizer = chainer.optimizers.Adam()

To obtain a multi-node optimizer, we modify that part as follows:

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)
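
As a sketch of adding such a hook (the WeightDecay rate below is arbitrary, and model is assumed to be your chainer.Link):

optimizer.setup(model)
optimizer.add_hook(chainer.optimizer_hooks.WeightDecay(1e-4))  # hooks work as usual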
Run

With the above two changes, your script is ready for distributed training. Invoke your script with mpiexec or mpirun (see your MPI’s manual for details). The following is an example of executing the training with four processes at localhost:

$ mpiexec -n 4 python train_mnist.py

In non-GPU mode, you may see a warning like the one shown below. This message is harmless, and you can ignore it for now:

Warning: using naive communicator because only naive supports CPU-only execution

If you have multiple GPUs on the localhost, 4 for example, you may also want to try:

$ mpiexec -n 4 python train_mnist.py --gpu
Multi-node execution

If you can successfully run the multi-process version of the MNIST example, you are almost ready for multi-node execution. The simplest way is to specify the --host argument to the mpiexec command. Let’s suppose you have two GPU-equipped computing nodes: host00 and host01, each of which has 4 GPUs, and so you have 8 GPUs in total:

$ mpiexec -n 8 -host host00,host01 python train_mnist.py

The script should print similar results to the previous intra-node execution.

Copying datasets

In the MNIST example, the rank 0 process reads the entire dataset and scatters it to the other processes. In some applications, such as the ImageNet ChainerMN example, however, only the paths to each data file are scattered and each process reads the actual data files. In such cases, all datasets must be readable on all computing nodes in the same location. You don’t need to worry about this if you use NFS (Network File System) or any other similar data synchronizing system. Otherwise, you need to manually copy data files between nodes using scp or rsync.

If you have trouble

If you have any trouble running the sample programs in your environment, go to the Step-by-Step Troubleshooting page and follow the steps to check your environment and configuration.

Next Steps

With only the above two changes, distributed training is already performed: the model parameters are updated using gradients aggregated over all the workers. However, this MNIST example still has a few areas in need of improvement. In the next page, we will see how to address the following problems:

  • Training period is wrong; ‘one epoch’ is not one epoch.

  • Evaluation is not parallelized.

  • Status outputs to stdout are repeated and annoying.

Step 2: Datasets and Evaluators

Following from the previous step, we continue to explain the general steps for modifying your code for ChainerMN through the MNIST example. All of the steps below are optional, although useful in many cases.

Scattering Datasets

If you want to keep the definition of ‘one epoch’ correct, you need to scatter the dataset to all workers.

For this purpose, ChainerMN provides the method scatter_dataset. It scatters the dataset of the specified root worker (by default, the worker whose comm.rank is 0) to all workers. The datasets given by the other workers are ignored. The dataset is split into sub-datasets of equal size, duplicating some elements if necessary, and scattered to the workers. To create a sub-dataset, chainer.datasets.SubDataset is used.

The following line of code from the original MNIST example loads the dataset:

train, test = chainer.datasets.get_mnist()

We modify it as follows. Only worker 0 loads the dataset, and then it is scattered to all the workers:

if comm.rank == 0:
    train, test = chainer.datasets.get_mnist()
else:
    train, test = None, None

train = chainermn.scatter_dataset(train, comm)
test = chainermn.scatter_dataset(test, comm)
Creating A Multi-Node Evaluator

This step is also optional, but useful when validation takes a considerable amount of time. In this case, you can parallelize the validation by using multi-node evaluators.

Similarly to multi-node optimizers, you can create a multi-node evaluator from a standard evaluator by using the method create_multi_node_evaluator. It behaves exactly the same as the given original evaluator, except that it reports the average of the results over all workers.

The following line from the original MNIST example adds an evaluator extension to the trainer:

trainer.extend(extensions.Evaluator(test_iter, model, device=args.gpu))

To create and use a multi-node evaluator, we modify that part as follows:

evaluator = extensions.Evaluator(test_iter, model, device=device)
evaluator = chainermn.create_multi_node_evaluator(evaluator, comm)
trainer.extend(evaluator)
Suppressing Unnecessary Extensions

Some extensions should be invoked by only one of the workers. For example, if the PrintReport extension is invoked by all of the workers, many redundant lines will appear in your console. Therefore, it is convenient to register these extensions only on the worker of rank zero, as follows:

if comm.rank == 0:
    trainer.extend(extensions.DumpGraph('main/loss'))
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(
        ['epoch', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.ProgressBar())

Tips and FAQs

Using MultiprocessIterator

If you are using MultiprocessIterator and communication goes through InfiniBand, you will probably face crashes. This is because MultiprocessIterator creates child processes with the fork system call, which is incompatible with the design of MPI and InfiniBand. To cope with this issue, call multiprocessing.set_start_method and explicitly fork a process right after, before the communicator is created, as follows:

multiprocessing.set_start_method('forkserver')
p = multiprocessing.Process()
p.start()
p.join()

communicator = chainermn.create_communicator(...)

Either forkserver mode or spawn mode should work. See our ImageNet example script for working sample code of MultiprocessIterator and forkserver. Unfortunately, multiprocessing.set_start_method is only available in Python 3.4+.

Using Your Own Evaluator

Method create_multi_node_evaluator can also be used for customized evaluator classes that inherit from chainer.training.extensions.Evaluator. Specifically, it wraps the evaluate method and returns the averaged values over all workers. Please also refer to our ImageNet example, where a customized evaluator is used.

Using MPI4py Communicator

ChainerMN is based on MPI4py. For advanced users (e.g., those who want to parallelize preprocessing, create a custom extension, etc.), we encourage you to make use of MPI4py communicators. Let comm be a ChainerMN communicator; then you can obtain the MPI4py communicator via comm.mpi_comm. Please refer to the MPI4py API reference.
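
For instance, a minimal sketch of mixing ChainerMN and raw MPI4py calls could look like the following; the barrier after a preprocessing step is only an illustration:

import chainermn

comm = chainermn.create_communicator()
mpi_comm = comm.mpi_comm        # underlying mpi4py.MPI.Comm object

# Hypothetical use: synchronize all processes after some preprocessing step.
mpi_comm.Barrier()
rank = mpi_comm.Get_rank()      # same value as comm.rank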

Using FP16

FP16 (16-bit half-precision floating point values) is supported by the pure_nccl ChainerMN communicator.
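
For example, a hedged sketch of enabling FP16 gradient exchange via the allreduce_grad_dtype argument (described in the API Reference below):

import numpy as np
import chainermn

# Exchange gradients in FP16 over NCCL (sketch; requires NCCL2).
comm = chainermn.create_communicator(
    'pure_nccl', allreduce_grad_dtype=np.float16)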

MPI process hangs after an unhandled Python exception.

An MPI runtime is expected to kill all of its child processes if one of them exits abnormally or without calling MPI_Finalize(). However, when a Python program runs on mpi4py, the MPI runtime often fails to detect the process failure, and the remaining processes hang indefinitely. This is especially problematic when you run your ChainerMN program in a cloud environment, where you are charged on a time basis.

This tiny program demonstrates the issue (note that it is not specific to ChainerMN):

# test.py
def func():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')

    mpi4py.MPI.COMM_WORLD.Barrier()

if __name__ == '__main__':
    func()

# mpiexec -n 2 python test.py

mpi4py offers a solution to force all processes to abort if an uncaught exception occurs:

$ mpiexec -n 2 python -m mpi4py yourscript.py ...

This also works well with ChainerMN. See here for more details.

If you cannot apply this solution (i.e., you have no control over how the Python interpreter is invoked), you can inject a snippet like the following sketch into your script file; it hooks uncaught exceptions and aborts all MPI processes:

import sys

# === begin code snippet (sketch of a global error handler) ===
_old_hook = sys.excepthook

# Global error handler: report the uncaught exception, then abort every
# MPI process so that the job does not hang.
def global_except_hook(exctype, value, traceback):
    try:
        _old_hook(exctype, value, traceback)
        sys.stderr.flush()
    finally:
        import mpi4py.MPI
        mpi4py.MPI.COMM_WORLD.Abort(1)

sys.excepthook = global_except_hook
# === end code snippet ===
Alternatively, setting the environment variable CHAINERMN_FORCE_ABORT_ON_EXCEPTION makes ChainerMN install such a handler automatically:

$ mpiexec -n 2 -x CHAINERMN_FORCE_ABORT_ON_EXCEPTION=1 python yourscript.py ...

You can also explicitly call chainermn.global_except_hook.add_hook() from your code:

import chainermn

chainermn.global_except_hook.add_hook()

The handler hooks uncaught exceptions and calls MPI_Abort() to ensure that all processes are terminated.

You can choose any of these solutions depending on your environment and restrictions.

NOTE: These techniques are effective only for unhandled Python exceptions. If your program crashes due to lower-level issues such as SIGSEGV, the MPI process may still hang.

Model Parallel

Overview

Model Parallelism

Even though ChainerMN mainly supports the data parallel approach for distributed training, it also has experimental APIs for the model parallel approach. The model parallel approach splits a given model into subcomponents loaded on several processes. This approach is useful in cases where:

  • a large mini-batch or high-resolution input is needed,

  • the model is too huge to run on a single process, or

  • a mixture of experts is trained.

_images/parallelism.png
Philosophy

ChainerMN takes the following three approaches to realize model parallelism.

1. Communication as Function

ChainerMN provides several special functions for communication, such as chainermn.functions.bcast and chainermn.functions.alltoall, which wrap raw MPI communications. Users define communications between processes as Chainer function calls in the model definition. This enables highly flexible communication patterns. Moreover, gradients are propagated automatically during backpropagation through the backward methods defined in these communication functions.

_images/communication_as_function.png
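
As an illustrative sketch (not from the original example code), such a communication function can appear directly in a model's forward computation; the layer sizes and model structure below are hypothetical:

import chainer
import chainer.functions as F
import chainer.links as L
import chainermn

class BroadcastingModel(chainer.Chain):
    # Hypothetical model: the intermediate feature computed on rank 0 is
    # broadcast to all processes, replacing their locally computed value,
    # and every process then continues the computation with its own l2.
    def __init__(self, comm, n_units, n_out):
        super(BroadcastingModel, self).__init__()
        self.comm = comm
        with self.init_scope():
            self.l1 = L.Linear(None, n_units)
            self.l2 = L.Linear(None, n_out)

    def __call__(self, x):
        h = F.relu(self.l1(x))                         # local computation
        h = chainermn.functions.bcast(self.comm, h, root=0)
        return self.l2(h)                              # runs on every process
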
2. Synchronous Model Parallel

ChainerMN restricts itself to synchronous SGD. Although the asynchronous counterpart may seem more computationally efficient, asynchronous SGD often suffers from the stale-gradient problem and makes debugging difficult. ChainerMN's synchronous communication model keeps SGD simple.

3. Single-Program-Multiple-Data (SPMD)

In principle, ChainerMN supports single-program-multiple-data (SPMD), which means the same program is invoked and different data are used on each process.

_images/spmd.png

Synchronous model parallelism fits well with the MPI programming style and the SPMD model.

Model Parallel on ChainerMN

Step 1: Communicators

To perform multi-node communications, a communicator is needed. Basic usage is the same as in the data parallel case; see Step 1: Communicators and Optimizers:

comm = chainermn.create_communicator()

If you want to define collective communications among a limited number of processes later, it is useful to split the communicator:

subcomm = comm.split(comm.rank % 2, comm.rank)
_images/comm_split.png

For further details about splitting communicators, please refer to an MPI tutorial.

Step 2: Datasets and Iterators

In model parallel training, each process follows at least one of the following dataset input patterns:

  1. model inputs come from datasets, and each process takes different mini-batches

  2. model inputs come from datasets, and several processes share the same mini-batches

  3. model inputs come from other processes

1. scatter_dataset

For the first case, you may use scatter_dataset as introduced in Step 2: Datasets and Evaluators.

_images/scatter_dataset.png
2. multi node iterator

For the second case, the iterator needs to be modified; create_multi_node_iterator is useful here:

train, test = chainer.datasets.get_mnist()
train_iter = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(train, batchsize), comm)
test_iter = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(test, batchsize), comm)

The resulting iterators return the same mini-batches among processes specified by the communicator.

_images/multi_node_iterator.png
3. empty dataset

For the last case, you may use create_empty_dataset, which returns a dataset consisting of the same number of empty tuples as the original dataset:

train, test = chainer.datasets.get_mnist()
train = chainermn.datasets.create_empty_dataset(train)
test = chainermn.datasets.create_empty_dataset(test)

This input pattern appears in subsequent examples such as Example 1: Simple MLP. Note that Chainer's updater API requires a dataset; the empty dataset can be used as a dummy.

_images/empty_dataset.png
Step 3: Define Communications

ChainerMN supports most of the MPI communications as Chainer functions, including point-to-point and collective communications. For the usage of each communication, please refer to the API Reference.

Example 1: Point-to-point Communication

Here is an example using point-to-point communication:

def __call__(self, x):
    h = f(x)
    h = chainermn.functions.send(h, comm, rank=1)
    return h

The communication target is specified by the rank parameter. Note that the return value of send must not be ignored. Please refer to Note: Define-by-Run and Model Parallelism.

Example 2: Collective Communication

Here is another example using collective communication:

def __call__(self, x):
    h = f(x)
    h = chainermn.functions.allgather(comm, h)
    h = F.stack(h, axis=0)
    h = F.average(h, axis=0)
    return h

This pattern often appears in averaging ensemble training.

Note: Define-by-Run and Model Parallelism

In model parallel training, the model on each process may become a non-connected computational graph. Let's take a look at an example.

_images/delegate_variable_0.png

A naive implementation of the model on process #0 could be:

class Model_0(chainer.Chain):
    def __call__(self, x):
        # first component
        z = f(x)
        chainermn.functions.send(z, comm, rank=1)

        # second component
        z = chainermn.functions.recv(comm, rank=1)
        y = h(z)

        return y

One may notice that there is no connection between the first and second components of the computational graph. Since we rely on a define-by-run framework, we cannot build a backward path from the second component to the first component. In order to build the backward path, a dummy variable, which we call a delegate_variable, is needed.

_images/delegate_variable_1.png

The variable \(\phi\) in the above figure is the delegate_variable; it is the return value of send and is passed as an argument to recv:

class Model_0(chainer.Chain):
    def __call__(self, x):
        # first component
        z = f(x)
        phi = chainermn.functions.send(z, comm, rank=1)

        # second component
        z = chainermn.functions.recv(comm, rank=1, delegate_variable=phi)
        y = h(z)

        return y

class Model_1(chainer.Chain):
    def __call__(self, _):
        z = chainermn.functions.recv(comm, rank=0)
        z = g(z)
        phi = chainermn.functions.send(z, comm, rank=0)
        return phi

Model_1 also needs to return a delegate variable \(\phi\) so that its computational graph can be backtracked to compute gradients. This guarantees the backward computation; otherwise, backward computation will cause a deadlock.

Note: Delegate Variable and Pseudo Connect

As we have just seen above, delegate variables must be handled appropriately to avoid potential deadlock. However, there are still some pathological cases. Let's consider sending variables twice.

_images/pseudo_connect_0.png

Here, we must guarantee that backward tracking can find both send calls, but each model can return only one delegate variable. pseudo_connect is a special function that combines one delegate variable with another variable.

_images/pseudo_connect_1.png

In the above case, the variable \(\psi\) returned from pseudo_connect behaves as if it were \(\phi_2\), while its backward backtracks both \(\phi_1\) and \(\phi_2\):

class Model_0(chainer.Chain):
    def __call__(self, x):
        z1, z2 = f(x)
        phi1 = chainermn.functions.send(z1, comm, rank=1)
        phi2 = chainermn.functions.send(z2, comm, rank=1)
        psi = chainermn.functions.pseudo_connect(phi1, phi2)
        return psi

class Model_1(chainer.Chain):
    def __call__(self, _):
        z1 = chainermn.functions.recv(comm, rank=0)
        z2 = chainermn.functions.recv(comm, rank=0)
        y = g(z1, z2)
        return y

Example 1: Simple MLP

Here is the first example of model parallelism: a simple MLP split over two processes.

_images/model_parallel_mlp.png

First, let’s create a ChainerMN communicator:

if args.gpu:
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank
else:
    comm = chainermn.create_communicator('naive')
    device = -1

As we saw in Model Parallel on ChainerMN, a naive implementation would use point-to-point communication such as send and recv:

class MLP0(chainer.Chain):
    def __init__(self, comm, n_out):
        super(MLP0, self).__init__(
            l1=L.Linear(784, n_out))
        self.comm = comm

    def __call__(self, x):
        h0 = F.relu(self.l1(x))
        phi = chainermn.functions.send(h0, self.comm, rank=1)
        # Note: do not forget to pass the delegate variable
        y = chainermn.functions.recv(self.comm, rank=1, delegate_variable=phi)
        return y

class MLP1(chainer.Chain):
    def __init__(self, comm, n_units, n_out):
        super(MLP1, self).__init__(
            l2=L.Linear(None, n_units),
            l3=L.Linear(None, n_out))
        self.comm = comm

    def __call__(self, _):
        h0 = chainermn.functions.recv(self.comm, rank=0)
        h1 = F.relu(self.l2(h0))
        return chainermn.functions.send(self.l3(h1), self.comm, rank=0)

One should note that

  • MLP0: the delegate variable passed from send to recv is indispensable.

  • MLP1: the return value of send must be returned from __call__; it is used to track back the computational graph.

On each process, a different model is trained:

if comm.rank == 0:
    model = L.Classifier(MLP0(comm, 100))
elif comm.rank == 1:
    model = MLP1(comm, 100, 10)

Since MLP1 receives its inputs from MLP0 over point-to-point communication, let's use create_empty_dataset instead of the usual dataset:

# Iterate dataset only on worker 0.
train, test = chainer.datasets.get_mnist()
if comm.rank == 1:
    train = chainermn.datasets.create_empty_dataset(train)
    test = chainermn.datasets.create_empty_dataset(test)

Now we can run a model parallel architecture.

There is an alternative API to define the same model without explicitly defining communication paths:

class MLP0SubA(chainer.Chain):
    def __init__(self, comm, n_out):
        super(MLP0SubA, self).__init__(
            l1=L.Linear(784, n_out))

    def __call__(self, x):
        return F.relu(self.l1(x))

class MLP0SubB(chainer.Chain):
    def __init__(self, comm):
        super(MLP0SubB, self).__init__()

    def __call__(self, y):
        return y

class MLP0(chainermn.MultiNodeChainList):
    # Model on worker 0.
    def __init__(self, comm, n_out):
        super(MLP0, self).__init__(comm=comm)
        self.add_link(MLP0SubA(comm, n_out), rank_in=None, rank_out=1)
        self.add_link(MLP0SubB(comm), rank_in=1, rank_out=None)

class MLP1Sub(chainer.Chain):
    def __init__(self, n_units, n_out):
        super(MLP1Sub, self).__init__(
            l2=L.Linear(None, n_units),
            l3=L.Linear(None, n_out))

    def __call__(self, h0):
        h1 = F.relu(self.l2(h0))
        return self.l3(h1)

class MLP1(chainermn.MultiNodeChainList):
    # Model on worker 1.
    def __init__(self, comm, n_units, n_out):
        super(MLP1, self).__init__(comm=comm)
        self.add_link(MLP1Sub(n_units, n_out), rank_in=0, rank_out=0)

MultiNodeChainList makes it possible to define a multi-node model architecture by adding non-connected components with add_link. The two arguments rank_in and rank_out specify from which process the added link receives its inputs, and to which process it sends its outputs.

Although a model of this size hardly needs to be parallelized, the same approach is useful for training an MLP with so many layers and parameters that the entire model cannot be loaded on a single GPU. The entire training code is available here.

Example 2: seq2seq

This example shows how to parallelize models that involve an RNN.

_images/seq2seq_0.png

The figure above depicts a typical encoder-decoder model, split into an encoder and a decoder that run in two separate processes. When f or g is a large model that consumes huge memory, such as a CNN, model parallelism like this is useful. In the forward computation, the encoder invokes the send function to send its context vectors, and the decoder invokes recv to receive them. The backward computation must be built with pseudo_connect. As this communication pattern is very common in RNNs, MultiNodeNStepRNN is a ready-made utility link that replaces this complicated communication pattern.

_images/seq2seq_1.png

MultiNodeNStepRNN can be created by create_multi_node_n_step_rnn:

rnn = chainermn.links.create_multi_node_n_step_rnn(
    L.NStepLSTM(n_layers, n_units, n_units, 0.1),
    comm, rank_in=None, rank_out=1)

where comm is a ChainerMN communicator (see Step 1: Communicators).

The overall model definition can be written as follows:

class Encoder(chainer.Chain):

    def __init__(self, comm, n_layers, n_units):
        super(Encoder, self).__init__(
            # Corresponding decoder LSTM will be invoked on process 1.
            mn_encoder=chainermn.links.create_multi_node_n_step_rnn(
                L.NStepLSTM(n_layers, n_units, n_units, 0.1),
                comm, rank_in=None, rank_out=1
            ),
        )
        self.comm = comm
        self.n_layers = n_layers
        self.n_units = n_units

    def __call__(self, *xs):
        exs = f(xs)
        c, h, _, phi = self.mn_encoder(exs)
        return phi

class Decoder(chainer.Chain):

    def __init__(self, comm, n_layers, n_units):
        super(Decoder, self).__init__(
            # Corresponding encoder LSTM will be invoked on process 0.
            mn_decoder=chainermn.links.create_multi_node_n_step_rnn(
                L.NStepLSTM(n_layers, n_units, n_units, 0.1),
                comm, rank_in=0, rank_out=None),
        )
        self.comm = comm
        self.n_layers = n_layers
        self.n_units = n_units

    def __call__(self, *ys):
        c, h, os, _ = self.mn_decoder(ys)
        # compute loss (omitted)

An example code with a training script is available here.

Example 3: Channel-wise Parallel Convolution

This is an example of parallelizing a CNN in a channel-wise manner. This parallelization is useful with large batch sizes or with high-resolution images.

_images/parallel_conv.png

On each process, the basic strategy is

  1. to pick the channels that the process is responsible for,

  2. to apply the convolution, and

  3. to use allgather to combine the outputs over all channels into a single tensor.

A parallel convolution implementation could look like this:

class ParallelConvolution2D(chainer.links.Convolution2D):
    def __init__(self, comm, in_channels, out_channels, *args, **kwargs):
        self.comm = comm
        self.in_channels = in_channels
        self.out_channels = out_channels
        super(ParallelConvolution2D, self).__init__(
            self._in_channel_size, self._out_channel_size, *args, **kwargs)

    def __call__(self, x):
        x = x[:, self._channel_indices, :, :]
        y = super(ParallelConvolution2D, self).__call__(x)
        ys = chainermn.functions.allgather(self.comm, y)
        return F.concat(ys, axis=1)

    def _channel_size(self, n_channel):
        # Return the size of the corresponding channels.
        n_proc = self.comm.size
        i_proc = self.comm.rank
        return n_channel // n_proc + (1 if i_proc < n_channel % n_proc else 0)

    @property
    def _in_channel_size(self):
        return self._channel_size(self.in_channels)

    @property
    def _out_channel_size(self):
        return self._channel_size(self.out_channels)

    @property
    def _channel_indices(self):
        # Return the indices of the corresponding channel.
        indices = np.arange(self.in_channels)
        indices = indices[indices % self.comm.size == 0] + self.comm.rank
        return [i for i in indices if i < self.in_channels]

where comm is a ChainerMN communicator (see Step 1: Communicators).
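
For example, with comm.size == 4 and 10 channels, _channel_size assigns 3, 3, 2 and 2 channels to ranks 0 through 3. A hedged usage sketch, with purely illustrative layer sizes and assuming comm is already created, is:

# Hypothetical layer sizes: each process convolves its share of the
# 64 input channels, produces its share of the 128 output channels,
# and allgather + concat recombines the outputs channel-wise.
conv = ParallelConvolution2D(comm, 64, 128, ksize=3, pad=1)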

ParallelConvolution2D can simply replace the original Convolution2D. For the first convolution layer, all processes must feed the same images to the model. The multi-node iterator distributes the same batches to all processes every iteration:

if comm.rank != 0:
    train = chainermn.datasets.create_empty_dataset(train)
    test = chainermn.datasets.create_empty_dataset(test)

train_iter = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(train, args.batchsize), comm)
test_iter = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(test, args.batchsize,
                                     repeat=False, shuffle=False),
    comm)

An example code with a training script for VGG16 parallelization is available here.

Example 4: Ensemble

Ensembling is a training technique to obtain better classification performance by combining multiple base classifiers. Averaging ensemble is one of the simplest examples; it takes the average of all classifier outputs in the test phase. Model parallelism and collective communications can effectively help to implement it.

_images/averaging.png

The following wrapper makes model parallel averaging ensemble easier:

class Averaging(chainer.Chain):
    def __init__(self, comm, block):
        super(Averaging, self).__init__()
        self.comm = comm
        with self.init_scope():
            self.block = block

    def __call__(self, x):
        y = self.block(x)

        if not chainer.config.train:
            y = chainermn.functions.allgather(self.comm, y)
            y = F.stack(y, axis=0)
            y = F.average(y, axis=0)

        return y

Then, any links wrapped by Averaging are ready to be parallelized and averaged:

class Model(chainer.Chain):
    def __init__(self, comm):
        super(Model, self).__init__()
        self.comm = comm
        with self.init_scope():
            self.l1 = L.Linear(d0, d1)
            self.l2 = L.Linear(d1, d2)
            self.l3 = Averaging(self.comm, L.Linear(d2, d3))

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        y = F.relu(self.l3(h))
        return y

From the perspective of model inputs/outputs, the averaged model is compatible with the original model. Thus, we only need to replace the last layer with the averaged layer.

In averaging ensemble, each base classifier is trained independently and the models are ensembled in the test phase. This can be implemented by using the multi-node iterator only for the test iterator:

# train = (training dataset)
# test = (test dataset)

if comm.rank != 0:
    train = chainermn.datasets.create_empty_dataset(train)
    test = chainermn.datasets.create_empty_dataset(test)

train_iter = chainer.iterators.SerialIterator(train, batchsize)
test_iter = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(test, batchsize,
                                     repeat=False, shuffle=False),
    comm)

API Reference

Communicators

chainermn.create_communicator(communicator_name='pure_nccl', mpi_comm=None, **kwargs)

Create a ChainerMN communicator.

Different communicators provide different approaches to communication, so they have different performance characteristics. The default communicator pure_nccl is expected to perform well in a variety of environments, so you need not change the communicator in most cases. However, you may need to choose another communicator depending on your computing platform and the availability of the NCCL library. The following communicators are available.

Name        CPU   GPU   NCCL               Recommended Use Cases
pure_nccl         OK    Required (>= v2)   pure_nccl is recommended when NCCL2 is available in the environment.
flat              OK    N/A
naive       OK    OK                       Testing on CPU mode

The pure_nccl communicator supports multiple data types, FP32 and FP16, in gradient exchange. The communication data type is determined based on chainer.global_config.dtype and allreduce_grad_dtype. When allreduce_grad_dtype is the default value None, FP32 is used when chainer.global_config.dtype is numpy.float32, and FP16 otherwise. The allreduce_grad_dtype parameter, which is either numpy.float16 or numpy.float32, overrides chainer.global_config.dtype.

The table below summarizes the data type selection in gradient exchange.

global_config.dtype    allreduce_grad_dtype
                       None   numpy.float16   numpy.float32
chainer.mixed16        FP16   FP16            FP32
numpy.float16          FP16   FP16            FP32
numpy.float32          FP32   FP16            FP32

Other communicators, namely flat and naive, support only float32 communication, no matter what the model is. This is due to MPI’s limited support of float16.

Parameters
  • communicator_name – The name of communicator (naive, flat, or pure_nccl)

  • mpi_comm – MPI4py communicator

  • allreduce_grad_dtype – Data type of gradient used in All-Reduce. If None, the dtype of a model is used.

Returns

ChainerMN communicator that implements methods defined in chainermn.CommunicatorBase
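
A minimal usage sketch, mirroring the examples earlier in this document:

import chainermn

comm = chainermn.create_communicator('pure_nccl')
device = comm.intra_rank   # GPU index within the node, as used in the examples above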

class chainermn.CommunicatorBase

Interface definition of all communicators.

All communicators that have a set of methods compatible with this class are supposed to work in ChainerMN's parallel computation implementation. The methods are named after MPI functions; for example, bcast() comes from MPI_Bcast().

There are two types of methods: those that handle Python objects have the _obj suffix, while those without a suffix handle ndarrays and arrays filled with scalar values. So the set of methods would be

[send, recv, bcast, gather, allreduce] * [ '_obj', '']

(with the exceptions of alltoall, multi_node_mean_grad, split and bcast_data so far). The methods are supposed to be written in this order. All of these methods must be implemented in an implementation class; otherwise, it cannot be instantiated at runtime.

Note

As most implementations of the _obj-suffixed methods involve Python object pickling and unpickling, there is an implicit size limit.

TODO(kuenishi): as of now no implementation class actually has allreduce method.

abstract allgather(x)

A primitive of inter-process all-gather communication.

This method tries to invoke all-gather communication within the communicator. All processes in the communicator are expected to invoke allgather(). This method relies on mpi4py fast communication optimized for numpy arrays, as well as send() and recv().

Note that this method can only handle data of the same shape across all processes, and cannot handle tuple data.

Parameters

x (numpy/cupy array) – Array to be gathered.

Returns

Received arrays.

Return type

ys (tuple of numpy/cupy array)

abstract allreduce(data)

Allreduce operation among processes

Performs an aggregation operation using the data from all processes and returns the result of the aggregation to all processes.

TODO(kuenishi): add op argument once we find a use case for operations other than ‘SUM’.

Parameters

data (ndarray) – the data to aggregate among all nodes.

Returns

Sum of all data from all processes.

allreduce_grad(model, zero_fill=False)

Averages Chainer model gradients over all processes.

Deprecated since version v7.0.0: This API is deprecated. Please use multi_node_mean_grad() instead.

Parameters
  • link (Link) – Link object.

  • zero_fill – A knob to control whether to fill the gradients of initialized but unused Links (which are None internally) with zero-valued arrays. All gradients must be arrays on every process to perform all-reduce, but after backward computation a gradient may be an array or None. Gradients of uninitialized Links are always skipped. If False, gradients of unused Links are simply skipped.

abstract allreduce_obj(obj)

Apply a reduce operation to all objects and spread the result.

For example of integers and summation, equivalent local code is:

>>> from functools import reduce
>>> reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
15

The only operation currently supported is summation.

TODO(kuenishi): support other operations such as ‘MAX’, ‘MIN’ and ‘PROD’ with op argument once we need any of them.

Parameters

obj – An arbitrary object to which the reduce operation is applied. It must have the corresponding operation method, e.g. __add__().

Returns

The result of the operation applied to all objects.

abstract alltoall(xs)

All-to-all implementation for ndarray

Parameters

xs (tuple of numpy/cupy array) –

Returns

Received arrays. The length of the tuple equals the communicator size.

Return type

ys (tuple of numpy/cupy array)

abstract bcast(data, max_buf_len=None, root=0)

Broadcasts an ndarray from root process to all processes

Parameters
  • data (numpy/cupy array) – for root process, the data to broadcast. For non-root processes, this argument is ignored.

  • max_buf_len (int) – Length of send buffer.

  • root (int) – the process who has the data to broadcast.

Returns

The data sent from root process

Return type

ys (numpy/cupy array)

abstract bcast_data(model)

Broadcast Chainer model parameter data

abstract bcast_obj(obj, max_buf_len=None, root=0)

Broadcasts an arbitrary object from root to all non-root processes.

Parameters
  • obj – arbitrary object to broadcast to all other non-root processes. Will be ignored at all non-root processes.

  • max_buf_len (int) – max length of the send buffer

  • root (int) – rank of the root processes who sends an object

Returns

an object sent from the root process.

finalize()

Finalizes and cleans up internal resource.

The communicator SHALL NOT be used after calling this finalize(). The behaviour is undefined when calling finalize on the same communicator multiple times.

abstract gather(data, root=0)

Gathers an ndarray from all processes to root process

Parameters
  • data (ndarray or scalar) – For the root process, this is ignored. For non-root processes, the data to send to the root process.

  • root (int) – rank of the process who receives the data.

Returns

For the root process, the ndarrays sent from the non-root processes. For non-root processes, the return value is unspecified.

abstract gather_obj(obj, root=0)

Gathers arbitrary objects from all non-root processes to the root.

Parameters
  • obj – arbitrary object to send to the root process. The root process will receive this argument included in the returned list.

  • root (int) – rank of the root node who receives all objects.

Returns

A list of objects sent from all processes.

TODO(kuenishi): make sure the ordering of objects in the returned list.

get_config(name=None)

Get configuration value(s)

Parameters

name (str) – Name of the configuration to get. If it is None, all config names and values are returned.

Returns

The actual value of the configuration if it is on, or None if it is off. If None is given as name, None or a dictionary of configuration names and values is returned.

property inter_rank

The rank of this node in the cluster.

property inter_size

Number of nodes that participate in the cluster.

property intra_rank

Intra rank (process id in the machine) of this process.

abstract multi_node_mean_grad(model, zero_fill=False)

Averages Chainer model gradients over all processes.

Parameters
  • link (Link) – Link object.

  • zero_fill – A knob to control whether to fill the gradients of initialized but unused Links (which are None internally) with zero-valued arrays. All gradients must be arrays on every process to perform all-reduce, but after backward computation a gradient may be an array or None. Gradients of uninitialized Links are always skipped. If False, gradients of unused Links are simply skipped.

property rank

Rank (process id in the cluster) of this process as an integer.

abstract recv(source, tag)

Receives an ndarray from source.

To receive the message, sender must send the data.

Parameters
  • source (int) – Rank of the source process

  • tag (int) – The tag to specifically receive the message

Returns

The data sent from source process

abstract recv_obj(source, tag)

Receives an arbitrary Python object from source process with a tag.

Parameters
  • source (int) – Rank number of sender process, to selectively receive the object.

  • tag – tag to identify the message.

Returns

an object sent from the source by send_obj.

abstract scatter(xs, root=0)

A primitive of inter-process scatter communication.

This method tries to invoke scatter communication within the communicator. All processes in the communicator are expected to invoke scatter().

Parameters
  • xs (tuple of numpy/cupy array) – Arrays to be scattered.

  • root (int) – Rank of root process.

Returns

Received arrays.

Return type

ys (numpy/cupy array)

abstract send(data, dest, tag)

Sends an ndarray to destination

Receiver must invoke recv() to wait for the message.

Parameters
  • data – data to be sent (tuple, list or raw numpy/cupy array)

  • dest (int) – Rank of the destination process

  • tag (int) – The tag to identify the message
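
As a hedged point-to-point sketch using these low-level methods (distinct from the differentiable chainermn.functions.send/recv), assuming two processes and a ChainerMN communicator comm created earlier:

import numpy as np

# Hypothetical two-process exchange: rank 0 sends an array, rank 1 receives it.
if comm.rank == 0:
    comm.send(np.arange(10, dtype=np.float32), dest=1, tag=0)
elif comm.rank == 1:
    data = comm.recv(source=0, tag=0)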

abstract send_obj(obj, dest, tag)

Sends an arbitrary Python object to destination with a tag.

Parameters
  • obj – Arbitrary object to send to receiver.

  • dest (int) – Rank number of receiver process (destination).

  • tag – tag to identify the message.

set_config(name, **kwargs)

Set configurations(s) on/off

The usage of configurations depends on each communicator. See create_communicator() for available configurations.

Parameters
  • name (str) – Name of configuration to set.

  • value – Give arbitrary object to set.

  • kwargs – Arbitrary arguments depending on each configuration.

property size

Number of processes of the cluster.

abstract split(color, key)

A function analogous to MPI_Comm_Split.

This method splits the internal MPI communicator and returns a wrapped ChainerMN communicator.

Parameters
  • color (int) – Index of new group. The process with the same color will be assigned to the same group.

  • key (int) – Controls the rank assignment. Each process is assigned a rank in the new group ordered by the value of key. If you do not care about the rank, you can simply specify the original rank.

Returns

CommunicatorBase

Optimizers and Evaluators

chainermn.create_multi_node_optimizer(actual_optimizer, communicator, double_buffering=False, zero_fill=True)

Create a multi node optimizer from a Chainer optimizer.

Parameters
  • actual_optimizer – Chainer optimizer (e.g., chainer.optimizers.Adam).

  • communicator – ChainerMN communicator.

  • double_buffering – If True, all-reduce and other processing (such as forward and backward) are overlapped using double buffering. There are cases where accuracy is affected because the gradients of the previous iteration are used for update. This flag is supported by PureNcclCommunicator only.

  • zero_fill – A knob to control whether to fill the gradients of initialized but unused Links (which are None internally) with zero-valued arrays. All gradients must be arrays on every process to perform all-reduce, but after backward computation a gradient may be an array or None. Gradients of uninitialized Links are always skipped. If False, gradients of unused Links are simply skipped.

Returns

The multi node optimizer based on actual_optimizer.
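
A minimal usage sketch, following the data parallel steps described earlier and assuming comm and model already exist:

import chainer
import chainermn

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)
optimizer.setup(model)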

chainermn.create_multi_node_evaluator(actual_evaluator, communicator)

Create a multi node evaluator from a normal evaluator.

Actually, this method patches the evaluator to work in a multi-node environment. It adds several hidden attributes starting with the _mn_ prefix.

Parameters
  • actual_evaluator – evaluator to be patched (e.g., chainer.training.extensions.Evaluator)

  • communicator – ChainerMN communicator

Returns

The multi-node patched actual_evaluator.

Note

Once patched, the original evaluator does not work correctly in a non-MPI environment.

class chainermn.extensions.GenericMultiNodeEvaluator(comm, iterator, target, device=None, converter=<chainer.dataset.convert._ArbitraryCallableConverter object>, root=0, **kwargs)

Generic multi-node evaluator for non-allreducible evaluation.

This is for evaluating a dataset that cannot be evenly divided across all processes in the communicator, or for evaluation calculations to which a simple add-and-divide style of averaging among processes is not applicable.

Users are recommended to implement their own local calculation calc_local() (e.g., at each distributed GPU) and the aggregation aggregate() of its results, although built-in implementations of both methods are provided.

It has several drawbacks: 1) users must implement additional aggregation code, and 2) it is not compatible with Evaluator.

Note

No automatic support of Reporter is provided; set it up in the initialize() method.

Parameters
  • comm – ChainerMN communicator object

  • iterator – An iterator for test dataset. Must be non-repeated.

  • target (callable) – A model to evaluate with test dataset

  • device (int or chainer.backend.Device) – A device indicator to send data with converter. Not used when the converter is not using any devices.

  • converter (callable) – A converter. Default value is chainer.dataset.concat_examples() .

  • root (int) – Rank number of root process to run bcast and gather with.

  • progress_hook (callable) – A callable that receives a single argument for progress indicators. The callable is only called at the root process.

aggregate(results)

A generic aggregation method.

Override this method for your own aggregation calculation. By default, it does nothing but return the input. This method is called once and only once across the cluster, at the root process. Reporting can be run here.

Parameters

results (list) – List of the return values of calc_local() obtained from all nodes.

calc_local(*args, **kwargs)

A generic method for local calculation.

Override this method to run your own local calculation. Otherwise, results are calculated with the original target and the test dataset.

Parameters
  • args – Result of converter when it is tuple.

  • kwargs – Result of converter when it is dict.

Returns

An arbitrary value may be returned, but it must not be None.
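
A hedged sketch of a custom subclass follows; the model, the loss calculation, and the aggregation are all hypothetical, and the exact arguments passed to calc_local depend on the converter, as described above:

import chainer
import chainer.functions as F
import chainermn

class MeanLossEvaluator(chainermn.extensions.GenericMultiNodeEvaluator):
    # Hypothetical evaluator: keep our own reference to the model and
    # average locally computed losses at the root process.
    def __init__(self, model, comm, iterator, **kwargs):
        super(MeanLossEvaluator, self).__init__(comm, iterator, model, **kwargs)
        self.model = model

    def calc_local(self, x, t):
        # x, t come from the converter applied to a local mini-batch.
        loss = F.softmax_cross_entropy(self.model(x), t)
        return float(loss.array)

    def aggregate(self, results):
        # Called only at the root process, with values gathered from all nodes.
        print('validation mean loss:', sum(results) / len(results))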

Dataset Utilities

chainermn.scatter_dataset(dataset, comm, root=0, shuffle=False, seed=None, max_buf_len=268435456, *, force_equal_length=True)

Scatter the given dataset to the workers in the communicator.

The dataset of the root worker (i.e., the worker whose comm.rank is root) is scattered to all workers. The datasets given on the other workers are ignored. The dataset is split into sub datasets of almost equal sizes and scattered to the workers. To create a sub dataset, chainer.datasets.SubDataset is used.

Note

Make sure the force_equal_length flag is not off for multi-node evaluators or multi-node updaters, which assume that the iterators have the same length among processes in order to work correctly.

Parameters
  • dataset – A dataset (e.g., list, numpy.ndarray, chainer.datasets.TupleDataset, …).

  • comm – ChainerMN communicator or MPI4py communicator.

  • shuffle (bool) – If True, the order of examples is shuffled before being scattered.

  • root (int) – The root process of the scatter operation.

  • seed (int) – Seed the generator used for the permutation of indexes. If an integer being convertible to 32 bit unsigned integers is specified, it is guaranteed that each sample in the given dataset always belongs to a specific subset. If None, the permutation is changed randomly.

  • max_buf_len (int) – Max buffer size to be used at broadcasting binaries. Must not be larger than 2147483647.

  • force_equal_length (bool) – Force the scattered fragments of the dataset to have equal length. If True, the number of scattered examples is guaranteed to be equal among processes, and the scattered datasets may have duplication among processes. Otherwise, the number of scattered examples may not be equal among processes, but the scattered examples are guaranteed to have no duplication among processes; this is intended for strict evaluation on a test dataset, to avoid duplicated examples.

Returns

Scattered dataset.
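
A hedged usage sketch with shuffled scattering (shuffle and seed as described above), assuming comm is a ChainerMN communicator:

import chainer
import chainermn

if comm.rank == 0:
    train, _ = chainer.datasets.get_mnist()
else:
    train = None
train = chainermn.scatter_dataset(train, comm, shuffle=True, seed=0)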

chainermn.scatter_index(n_total_samples, comm, root=0, *, force_equal_length=True)

Scatters only the indices to avoid a heavy dataset broadcast.

This is the core functionality of scatter_dataset, and is almost equivalent to the following code snippet:

(b, e) = scatter_index(len(dataset), comm)
order = None
if shuffle:
    order = numpy.random.RandomState(seed).permutation(
        n_total_samples)
    order = comm.bcast_obj(order)
dataset = SubDataset(dataset, b, e, order)
Note

Make sure the force_equal_length flag is not off for multi-node evaluators or multi-node updaters, which assume that the iterators have the same length among processes in order to work correctly.

Parameters
  • n_total_samples (int) – number of total samples to scatter

  • comm – ChainerMN communicator object

  • root (int) – root rank to coordinate the operation

  • force_equal_length (bool) – Force the scattered fragments of the index range to have equal length. If True, the number of scattered indices is guaranteed to be equal among processes, and the scattered datasets may have duplication among processes. Otherwise, the number of scattered indices may not be equal among processes, but the scattered indices are guaranteed to have no duplication among processes; this is intended for strict evaluation on a test dataset, to avoid duplicated examples.

Returns

A tuple of two integers that stand for the beginning and ending offsets of the assigned subset of samples. The ending offset is exclusive.

chainermn.datasets.create_empty_dataset(dataset)

Creates an empty dataset for models with no inputs and outputs.

This function generates an empty dataset, i.e., __getitem__() only returns None. Its length is the same as that of the original dataset. Such datasets are used for models which neither take any inputs nor return any outputs. For example, we expect models whose forward() starts with chainermn.functions.recv() and ends with chainermn.functions.send().

Parameters

dataset – Dataset to convert.

Returns

A dataset consisting of as many empty tuples as the original dataset.

Return type

TransformDataset

Functions

chainermn.functions.send(x, communicator, rank, tag=0)

Send elements to target process.

This function returns a dummy variable only holding the computational graph. If backward() is invoked by this dummy variable, it will try to receive gradients from the target process and send them back to the parent nodes.

Parameters
  • x (Variable) – Variable holding a matrix which you would like to send.

  • communicator (chainer.communicators.CommunicatorBase) – ChainerMN communicator.

  • rank (int) – Target process specifier.

  • tag (int) – Optional message ID (MPI feature).

Returns

A dummy variable with no actual data, only holding the computational graph. Please refer to chainermn.functions.pseudo_connect for details.

Return type

Variable

chainermn.functions.recv(communicator, rank, delegate_variable=None, tag=0, force_tuple=False)

Receive elements from target process.

This function returns the data received from the target process. If backward() is invoked, it will try to send gradients to the target process. The received array will be on the current CUDA device if the corresponding send() is invoked with arrays on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Note

If you define a non-connected computational graph on one process, you have to use delegate_variable to specify the output of the previous computational graph component. Otherwise, backward() does not work well. Please refer to chainermn.functions.pseudo_connect for details.

Parameters
  • communicator (chainer.communicators.CommunicatorBase) – ChainerMN communicator.

  • rank (int) – Target process specifier.

  • delegate_variable (chainer.Variable) – Pointer to the other non-connected component.

  • tag (int) – Optional message ID (MPI feature).

  • force_tuple (bool) – If False (the default) a Variable will be returned when the number of outputs is one. Otherwise, this method returns a tuple even when the number of outputs is one.

Returns

Data received from target process. If backward() is invoked by this variable, it will send gradients to the target process.

Return type

Variable

chainermn.functions.pseudo_connect(delegate_variable, *actual_variables)

Connects independent connected graph components.

This function is implemented to return received arguments directly, except the first delegate_variable. In backward computation, it returns received gradients directly, adding a zero grad corresponding to delegate_variable. The detail of delegate_variable is described in the following notes.

Note

In model-parallel framework, models on each process might have many non-connected components. Here we call a given graph non-connected when multiple inter-process communications are needed for its computation. For example, consider the following example:

class ConnectedGraph(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(ConnectedGraph, self).__init__(comm)
        self.add_link(ConnectedGraphSub(), rank_in=3, rank_out=1)

This model receives inputs from the rank=3 process and sends its outputs to the rank=1 process. The entire graph can be seen as one connected component ConnectedGraphSub. Please refer to the documentation of MultiNodeChainList for details.

On the other hand, see the next example:

class NonConnectedGraph(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(NonConnectedGraph, self).__init__(comm)
        self.add_link(NonConnectedGraphSubA(), rank_in=3, rank_out=1)
        self.add_link(NonConnectedGraphSubB(), rank_in=1, rank_out=2)

This model consists of two components: at first, NonConnectedGraphSubA receives inputs from rank=3 process and sends its outputs to rank=1 process, and then NonConnectedGraphSubB receives inputs from rank=1 process and sends its outputs to rank=2 process. Here multiple inter-process communications are invoked between NonConnectedGraphSubA and NonConnectedGraphSubB, so it is regarded as non-connected.

Such non-connected models can be problematic in backward computation. Chainer traces back the computational graph from the output variable; however, a naive implementation of chainermn.functions.recv does not take any inputs but rather receives inputs via MPI_Recv, so the backward path vanishes.

To prevent this, dummy variables, which we call delegate_variable, are used. In principle, chainermn.functions.send does not return any outputs because it sends data to the other process by MPI_Send. However, in our implementation chainermn.functions.send returns a dummy, empty variable, which is called a delegate_variable. This variable does not hold any data; it is used only for retaining the backward computation path. We can guarantee the backward computation just by passing the delegate_variable to the next chainermn.functions.recv (chainermn.functions.recv has an optional argument to receive a delegate_variable).

Note

In some cases the intermediate graph component returns model outputs. See the next example:

class NonConnectedGraph2(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(NonConnectedGraph2, self).__init__(comm)
        self.add_link(NonConnectedGraphSubA(), rank_in=1, rank_out=None)
        self.add_link(NonConnectedGraphSubB(), rank_in=None, rank_out=1)

This model first receives inputs from the rank=1 process and makes model outputs (specified by rank_out=None) in NonConnectedGraphSubA. Then, using model inputs (specified by rank_in=None), NonConnectedGraphSubB sends its outputs to the rank=1 process. Since MultiNodeChainList.__call__ returns the outputs of the last component (in this case, the outputs of NonConnectedGraphSubB), a naive implementation cannot output the returned value of NonConnectedGraphSubA as the model outputs. In this case, pseudo_connect should be used.

pseudo_connect takes two arguments. The first one, delegate_variable, is what we explained in the above note; in this case, the returned value of NonConnectedGraphSubB corresponds to delegate_variable. The second one, actual_variables, is "what we want delegate_variable to imitate". In NonConnectedGraph2, we obtain the returned value of NonConnectedGraphSubB as the model outputs, but what we actually want is the returned value of NonConnectedGraphSubA. At the same time, we want to trace back this resulting variable in backward computation. Using pseudo_connect, we can make a variable whose data is the same as the returned value of NonConnectedGraphSubA, and whose backward traces back NonConnectedGraphSubB first.

pseudo_connect should also be used in some pathological cases, for example, where multiple chainermn.functions.send calls occur sequentially.

Parameters
  • delegate_variable (chainer.Variable) – Pointer to the previous non-connected graph component.

  • actual_variables (tuple of chainer.Variable) – Actual values which delegate_variable imitate.

Returns

A variable with the given values combined with the delegate variable.

Return type

tuple of chainer.Variable

chainermn.functions.bcast(comm, x, root=0)

Differentiable broadcast communication between workers.

This function invokes broadcast communications among the processes specified by the communicator. Backward will be invoked as with ordinary Chainer functions, with gradients gathered to the root process and summed up.

The received array will be on the current CUDA device if x on the invoking process is on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Parameters
  • comm – ChainerMN communicator.

  • x (chainer.Variable) – Variable to be sent.

Returns

Broadcasted variable.

Return type

y (chainer.Variable)

chainermn.functions.gather(comm, x, root=0)

Differentiable gather communication between workers.

This function invokes gather communications among the processes specified by the communicator. Backward will be invoked as with ordinary Chainer functions, with gradients scattered from the root process to each slave.

The received array will be on the current CUDA device if x on the root process is on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Parameters
  • comm – ChainerMN communicator.

  • x (chainer.Variable) – Variable to be sent.

Returns

Gathered variables. None for slaves.

Return type

ys (chainer.Variable)

chainermn.functions.scatter(comm, xs, root=0)

Differentiable scatter communication between workers.

This function invokes scatter communications among the processes specified by the communicator. Backward will be invoked as with ordinary Chainer functions, with gradients gathered to the root process.

The received array will be on the current CUDA device if xs on the root process is on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Parameters
  • comm – ChainerMN communicator.

  • xs (list of chainer.Variable) – Variables to be scattered for master process. None for slave process.

Returns

Scattered variable.

Return type

y (chainer.Variable)

chainermn.functions.alltoall(comm, xs)

Differentiable all-to-all communication between workers.

This function invokes all-to-all communications among the processes specified by the communicator. Backward will be invoked as with ordinary Chainer functions, just passing input gradients back. Unlike point-to-point communication such as chainermn.functions.send and chainermn.functions.recv, users need not care about delegate variables, since backward() will not be invoked until all gradients from the output direction arrive. Please refer to chainermn.functions.pseudo_connect for the details of delegate variables.

The received arrays will be on the current CUDA device of the invoking process if xs is on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Parameters
  • comm – ChainerMN communicator.

  • xs (list of chainer.Variables) – Variables to send.

Returns

Received variables.

Return type

ys (list of chainer.Variables)

chainermn.functions.allgather(comm, x)

Differentiable all-gather communication between workers.

This function invokes all-gather communications among the processes specified by the communicator. Backward will be invoked as with ordinary Chainer functions, with gradients reduced to each process.

The received arrays will be on the current CUDA device of the invoking process if x is on a GPU. Please be aware that the current CUDA device must be the intended one. (https://docs-cupy.chainer.org/en/stable/tutorial/basic.html#current-device)

Parameters
  • comm – ChainerMN communicator.

  • x (chainer.Variable) – Variable to send.

Returns

Received variables.

Return type

ys (list of chainer.Variables)

Iterators

chainermn.iterators.create_multi_node_iterator(actual_iterator, communicator, rank_master=0)

Create a multi node iterator from a Chainer iterator.

This iterator shares the same batches across multiple processes, simply broadcasting batches from the master process to the slave processes in each iteration. The master process obtains batches from actual_iterator, for which you can specify any Chainer iterator (e.g. chainer.iterators.SerialIterator).

Here is an example situation. When we train a sequence-to-sequence model in which the encoder and the decoder are located on two different processes, we want to share the same batches on each process, so that the inputs for the encoder and the teacher signals for the decoder stay consistent.

In order to use the multi node iterator, first create the iterator from Chainer iterator and ChainerMN communicator:

iterator = chainermn.iterators.create_multi_node_iterator(
    chainer.iterators.SerialIterator(
        dataset, batch_size, shuffle=True),
    communicator)

Then you can use it as the ordinary Chainer iterator:

updater = chainer.training.StandardUpdater(iterator, optimizer)
trainer = training.Trainer(updater)
trainer.run()

Since this iterator shares batches over the network in each iteration, communication may be heavy. If you train your model-parallel network on an extremely large dataset, you can also consider using chainermn.iterators.create_synchronized_iterator.

The current multi-node iterator supports numpy.float32, or a tuple of numpy.float32, as the data type of the batch elements.

Note

create_multi_node_iterator and serialize of the created iterators must be called at the same time by the master and the slaves; otherwise they fall into deadlock, because they synchronize the internal states of the iterators.

Parameters
  • actual_iterator – Chainer iterator (chainer.iterators.SerialIterator and chainer.iterators.MultiprocessIterator are supported).

  • communicator – ChainerMN communicator.

  • rank_master – process rank to be master.

Returns

The master-slave iterator based on actual_iterator.

chainermn.iterators.create_synchronized_iterator(actual_iterator, communicator)

Create a synchronized iterator from a Chainer iterator.

This iterator shares the same batches across multiple processes, using the same random number generators so that the batch shuffling order stays identical.

Here is an example situation. When we train a sequence-to-sequence model in which the encoder and the decoder are located on two different processes, we want to share the same batches on each process, so that the inputs for the encoder and the teacher signals for the decoder stay consistent.

In order to use the synchronized iterator, first create the iterator from Chainer iterator and ChainerMN communicator:

iterator = chainermn.iterators.create_synchronized_iterator(
    chainer.iterators.SerialIterator(
        dataset, batch_size, shuffle=True),
    communicator)

Then you can use it as the ordinary Chainer iterator:

updater = chainer.training.StandardUpdater(iterator, optimizer)
trainer = training.Trainer(updater)
trainer.run()

The resulting iterator shares the same shuffling order among processes in the specified communicator.

Parameters
  • actual_iterator – Chainer iterator (e.g., chainer.iterators.SerialIterator).

  • communicator – ChainerMN communicator.

Returns

The synchronized iterator based on actual_iterator.

Trainer extensions

class chainermn.extensions.AllreducePersistent(model, comm)

Chainer extension to average persistent variables over workers.

When called, this extension invokes all-reduce communication among workers to compute averages of persistent variables in the model. Persistent variables are updated to the averages. Currently, we ignore integer persistent variables, and only float persistent variables are handled.

This extension mainly serves to improve the running mean and variance of BatchNormalization by increasing the effective number of examples. It does not need to be called frequently; call it just before saving or evaluating the model.

Parameters
  • model (chainer.link.Link) – Target link object.

  • comm (ChainerMN communicator) – communicator to compute averages.
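
A hedged usage sketch, assuming trainer, model, and comm already exist; the trigger interval is illustrative:

# Average BatchNormalization statistics across workers once per epoch,
# e.g. before evaluation or snapshotting.
trainer.extend(
    chainermn.extensions.AllreducePersistent(model, comm),
    trigger=(1, 'epoch'))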

chainermn.extensions.multi_node_snapshot(comm, snapshot, replica_sets)

Create trainer extension for multi-node snapshots

Provides a generic multi-node snapshot saving and auto-load feature for multi-node environments, leveraging the power of single-node snapshots.

In many cases the snapshot target differs between processes, e.g. often only the trainer of the rank 0 process has extensions such as LogReport, so as not to clutter the terminal output. Just loading the snapshot at one process and broadcasting it to the other processes does not work in that case.

This wrapper addresses that issue by defining sets of replicas, where within each set the target object is replicated and supposed to be the same among processes. For example, consider a trainer where only the trainer at rank 0 has special extensions and the others do not:

trainer = Trainer(updater)
if comm.rank == 0:
    trainer.extend(extensions.DumpGraph('main/loss'))
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(
        ['epoch', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.ProgressBar())

This case can be described with two replica sets, where each set can be represented as a single integer indicating a rank number, or as an iterable set/list/generator of integers, like this:

replica_sets = [[0], range(1, comm.size)]

Here the first replica set is described as [0], or in short just 0, and the second replica set is range(1, comm.size), representing the rest of the processes other than 0. The trailing replica set can be omitted, so in that case the definition can be simplified further:

replica_sets = [0,]

In this case, the snapshot will be saved at the rank 0 process and at the rank 1 process. The latter represents the replica set range(1, comm.size). In this case, auto-loading at initialization of the snapshot extension works cleanly after a restart, even if the size of the communicator differs.

Once the replica sets are defined, it can be easily extended:

replica_sets = [0,]
snapshot = multi_node_snapshot(comm, extensions.snapshot(),
                               replica_sets)
trainer.extend(snapshot, trigger=(1, 'epoch'))

More examples of replica set representations follow:

  code                nproc   actual sets
  [0]                 4       [{0}, {1, 2, 3}]
  [0, 1]              4       [{0}, {1}, {2, 3}]
  [[0, 1], [2, 3]]    4       [{0, 1}, {2, 3}]
  []                  4       [{0, 1, 2, 3}]
  [range(0, 8, 2)]    8       [set(range(0, 8, 2)), set(range(1, 8, 2))]

Parameters
  • comm (ChainerMN communicator) – communicator object

  • snapshot – Snapshot extension object obtained via snapshot().

  • replica_sets – list of replica set definition, where a replica set can be defined by single integer as rank number, or iterable integers.

Returns

Trainer extension that wraps snapshot and properly controls the number of snapshots.

chainermn.create_multi_node_checkpointer(name, comm, cp_interval=5, gc_interval=5, path=None)

Create multi-node checkpointer object

Generational snapshot extension to allow fault tolerance; it keeps several old snapshots so that a synchronized snapshot can be rolled back at each MPI process. Snapshot files are identified as ‘<name>.<rank>.<iteration>’.

  • <name> … identifier of the run for which the snapshot is kept

  • <rank> … rank of the process that owned the model

  • <iteration> … iteration number

This extension keeps several files for each execution and allows users to resume the whole job from the latest snapshot of each MPI process, at the iteration where all snapshots agree.

As this object is a usual Chainer extension, users can just create it and pass it to the trainer as an extension:

checkpointer = create_multi_node_checkpointer(name=run_id, comm=comm)
trainer.extend(checkpointer, trigger=(25, 'iteration'))

To run recovery at startup, before the first iteration, run

checkpointer.maybe_load(trainer, optimizer)

before trainer.run(). If nothing is recovered (i.e., no snapshot is found), trainer.updater.iteration will remain 0. Otherwise it will take the value from the snapshot and training will resume from that iteration. optimizer is optional, but passing it lets the multi-node optimizer avoid the initial broadcast when the snapshot data among nodes are all in sync.

Note

Make sure that checkpointer.maybe_load is called after all stateful extensions, such as ExponentialShift, are added to the trainer.

Note

The checkpointer is deprecated. Please use chainermn.extensions.multi_node_snapshot() instead.

After training finishes without errors, all those temporary checkpoints will be cleaned up at all nodes.

Another example that uses the checkpointer without a trainer would be:

checkpointer = create_multi_node_checkpointer(name=run_id, comm=comm)
checkpointer.maybe_load(obj_you_want_to_snap, optimizer)

while True: ## Training loop
    ...
    updater.update()
    ...
    checkpointer.save(obj_you_want_to_snap)  # Make a checkpoint

Parameters
  • name (str) – unique id of the run

  • comm – communicator in ChainerMN

  • cp_interval (int) – minimum number of checkpoints to preserve

  • gc_interval (int) – interval to collect non-preserved checkpoints

Configurations

Environmental Variables
CHAINERMN_FORCE_ABORT_ON_EXCEPTIONS

If this variable is set to a non-empty value, ChainerMN installs a global hook to Python’s sys.excepthook to call MPI_Abort() when an unhandled exception occurs. See MPI process hangs after an unhandled Python exception.

ChainerMN issue #236 may also help to understand the problem.

Execution Control
chainermn.global_except_hook.add_hook()

Add a global hook function that captures all unhandled exceptions.

The function calls MPI_Abort() to force all processes abort. It is useful when you run your training script on a cloud platform.
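A minimal sketch of installing the hook at the top of a distributed training script (the communicator creation and the rest of the script are placeholders):

import chainermn
import chainermn.global_except_hook

chainermn.global_except_hook.add_hook()  # abort all MPI processes on unhandled exceptions

comm = chainermn.create_communicator()
# ... rest of the training script ...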

Export Chainer to ONNX

Introduction

ONNX-Chainer converts a Chainer model to the ONNX format and exports it.

Installation

Install dependencies using pip via PyPI:

$ pip install 'onnx<1.7.0'

Quick Start

First, install ChainerCV to get the pre-trained models.

import numpy as np

import chainer
import chainercv.links as C
import onnx_chainer

model = C.VGG16(pretrained_model='imagenet')

# Pseudo input
x = np.zeros((1, 3, 224, 224), dtype=np.float32)

onnx_chainer.export(model, x, filename='vgg16.onnx')

The vgg16.onnx file will be exported.

Other export examples are available in onnx_chainer/examples. Please check them.

Supported Functions

Currently, 82 Chainer functions are supported for export in the ONNX format.

Activation

  • ClippedReLU

  • ELU

  • HardSigmoid

  • LeakyReLU

  • LogSoftmax

  • PReLUFunction

  • ReLU

  • Sigmoid

  • Softmax

  • Softplus

  • Tanh

Array

  • Cast

  • Concat

  • Copy

  • Depth2Space

  • Dstack

  • ExpandDims

  • GetItem

  • Hstack

  • Pad 1 2

  • Permutate

  • Repeat

  • Reshape

  • ResizeImages

  • Separate

  • Shape 5

  • Space2Depth

  • SplitAxis

  • Squeeze

  • Stack

  • Swapaxes

  • Tile

  • Transpose

  • Vstack

  • Where

Connection

  • Convolution2DFunction

  • ConvolutionND

  • Deconvolution2DFunction

  • DeconvolutionND

  • EmbedIDFunction 3

  • LinearFunction

Loss

  • SoftmaxCrossEntropy

Math

  • Absolute

  • Add

  • AddConstant

  • ArgMax

  • ArgMin

  • BroadcastTo

  • Clip

  • Div

  • DivFromConstant

  • Exp

  • Identity

  • LinearInterpolate

  • LogSumExp

  • MatMul

  • Max

  • Maximum

  • Mean

  • Min

  • Minimum

  • Mul

  • MulConstant

  • Neg

  • PowConstVar

  • PowVarConst

  • PowVarVar

  • Prod

  • RsqrtGPU

  • Sqrt

  • Square

  • Sub

  • SubFromConstant

  • Sum

Noise

  • Dropout 4

Normalization

  • BatchNormalization

  • FixedBatchNormalization

  • LocalResponseNormalization

  • NormalizeL2

Pooling

  • AveragePooling2D

  • AveragePoolingND

  • MaxPooling2D

  • MaxPoolingND

  • ROIPooling2D

  • Unpooling2D

1. mode should be either ‘constant’, ‘reflect’, or ‘edge’

2. ONNX doesn’t support multiple constant values for the Pad operation

3. Current ONNX doesn’t support ignore_label for EmbedID

4. In test mode, all dropout layers aren’t included in the exported file

5. Chainer doesn’t support the Shape function

Tested Environments

  • OS

    • Ubuntu 16.04, 18.04

    • Windows 10

  • Python 3.5.5, 3.6.7, 3.7.2

  • ONNX 1.4.1, 1.5.0, 1.6.0

    • opset version 7, 8, 9, 10, 11

  • ONNX-Runtime 0.5.0

Run Test

1. Install test modules

First, install the modules required for testing:

$ pip install -e .[test]
$ pip install onnxruntime

Testing in a GPU environment requires CuPy:

$ pip install cupy  # or cupy-cudaXX is useful

2. Run tests

Next, run pytest:

$ pytest -m "not gpu" tests/onnx_chainer_tests

on GPU environment:

$ pytest tests/onnx_chainer_tests

Contribution

Any contribution to ONNX-Chainer is welcome!

Module Reference

Export

ONNX-Chainer exports Chainer model to ONNX graph with various options.

onnx_chainer.export

Export function for chainer.Chain in ONNX format.

onnx_chainer.export_testcase

Export model and I/O tensors of the model in protobuf format.

Export Utilities

ONNX-Chainer provides some utility functions to help exporting.

onnx_chainer.replace_func.fake_as_funcnode

Wraps the target function so that it is treated as a FunctionNode during export.

onnx_chainer.replace_func.as_funcnode

Wraps the target function so that it is treated as a FunctionNode during export.

Convert Utilities

These utilities help convert a Chainer model to the ONNX format; they are mainly used internally.

Testing Utilities

onnx_chainer.testing.input_generator.increasing

Returns a monotonically increasing ndarray for test inputs.

onnx_chainer.testing.input_generator.nonzero_increasing

Returns a monotonically increasing ndarray for test inputs.

onnx_chainer.testing.input_generator.positive_increasing

Returns a monotonically increasing ndarray for test inputs.


API Compatibility Policy

This documentation explains the design policy on compatibilities of Chainer APIs. The development team should follow this policy when deciding to add, extend, or change APIs and their behaviors.

This documentation is written for both users and developers. Users can decide the level of dependency on Chainer’s implementation in their code based on this document. Developers should read through this documentation before creating pull requests that contain interface changes. Note that this documentation may contain ambiguities on the level of supported compatibilities.

Versioning and Backward Compatibility

The versioning of Chainer follows the PEP 440 and a part of Semantic versioning. See Contribution Guide for details of versioning.

The backward compatibility is kept for revision updates and minor updates, which are applied to the stable version. A major update from the latest release candidate basically keeps the backward compatibility, although it is not guaranteed. Any pre-releases may break the backward compatibility.

Breaking the Compatibility

We sometimes need to break the backward compatibility to improve the framework design and to support new kinds of machine learning methods. Such a change is only made into pre-releases (alpha, beta, and release candidate) and sometimes into the major update.

A change that breaks the compatibility affects user codes. We try to lower the cost of adapting your code to the newer version. The following list shows an example of what we can do to reduce the cost (Note: this is not a promise; what kind of actions we can take depends on the situation).

  • When an argument is removed from an existing API, passing the argument to the updated API will emit an error with a special error message. The error message tells you how to fix your code.

  • When a function or a class is removed, we make the current stable version emit a deprecation warning. Note that the deprecation warning is not printed by default in Python. You have to manually turn on the deprecation warning by warnings.simplefilter('always', DeprecationWarning).

  • When a definition of a link is changed, we try to enable it to deserialize a model dumped with an older version of Chainer. In most cases, we cannot guarantee that a model serialized with a newer version of Chainer is loadable by an older version of Chainer.

Experimental APIs

Thanks to many contributors, we have introduced many new features to Chainer.

However, we have sometimes released new features only to later notice that their APIs are not appropriate. In particular, we sometimes know that an API is likely to be modified in the near future because we do not have enough knowledge about how well the current design fits real usage. The objective of experimental APIs is to declare that the APIs are likely to be updated in the near future so that users can decide whether or not to rely on them.

Any newly added API can be marked as experimental. Any API that is not experimental is called stable in this document.

Note

Undocumented behaviors are not considered as APIs, so they can be changed at any time (even in a revision update). The treatment of undocumented behaviors is described in the Undocumented behaviors section.

When users use experimental APIs for the first time, warnings are raised once for each experimental API, unless users explicitly disable the emission of the warnings in advance.

See the documentation of chainer.utils.experimental() to know how developers mark APIs as experimental and how users enable or disable the warnings practically.

Note

It is up to developers whether APIs should be annotated as experimental or not. We recommend marking APIs as experimental if they implement large modules or make a decision among several design choices.

Supported Backward Compatibility

This section defines backward compatibilities that revision updates must maintain.

Documented Interface

Chainer has the official API documentation. Many applications can be written based on the documented features. We support backward compatibility of documented features. In other words, code based only on documented features runs correctly with revision-updated versions.

Developers are encouraged to use names that make implementation details apparent. For example, attributes outside of the documented APIs should have one or more underscores as a prefix of their names.

Note

Although it is not stated as a rule, we also try to keep the compatibility for any interface that looks like a stable feature. For example, if the name of a symbol (function, class, method, attribute, etc.) is not prefixed by an underscore and the API is not experimental, the API should be kept over revision updates even if it is not documented.

Undocumented behaviors

Behaviors of Chainer implementation not stated in the documentation are undefined. Undocumented behaviors are not guaranteed to be stable between different revision versions.

Even revision updates may contain changes to undefined behaviors. One of the typical examples is a bug fix. Another example is an improvement on implementation, which may change the internal object structures not shown in the documentation. As a consequence, even revision updates do not support compatibility of pickling, unless the full layout of pickled objects is clearly documented.

Documentation Error

Compatibility is basically determined based on the documentation, although it sometimes contains errors. It may make the APIs confusing to assume that the documentation is always authoritative over the implementation. We therefore may fix documentation errors in any update, even if doing so breaks compatibility with respect to the documentation.

Note

Developers should not fix the documentation and implementation of the same functionality at the same time in revision updates as a “bug fix” unless the bug is so critical that no users are expected to be using the old version correctly.

Object Attributes and Properties

Object attributes and properties are sometimes replaced by each other. This does not break user code, except for code that depends on how the attributes and properties are implemented.

Functions and Methods

Methods may be replaced by callable attributes, keeping the compatibility of parameters and return values. This does not break user code, except for code that depends on how the methods and callable attributes are implemented.

Exceptions and Warnings

The specifications of raising exceptions are considered a part of standard backward compatibility. Future revision versions will not raise new exceptions for correct usages that the documentation allows.

On the other hand, warnings may be added at any revision updates for any APIs. It means revision updates do not keep backward compatibility of warnings.

Model Format Compatibility

Links and chains serialized by the official serializers that Chainer provides are correctly loaded by future versions. They might not be correctly loaded by older versions of Chainer.

Note

Current serialization APIs do not support versioning. It prevents us from introducing changes in the layout of objects that support serialization. We are discussing versioning in serialization APIs.

Installation Compatibility

The installation process is another concern of compatibilities.

Any changes on the set of dependent libraries that force modifications of existing environments should be done in pre-releases and major updates. Such changes include the following cases:

  • dropping supported versions of dependent libraries (e.g. dropping cuDNN v2)

  • adding new mandatory dependencies (e.g. adding h5py to setup_requires)

Note

We sometimes have to narrow the supported versions due to bugs in the specific versions of libraries. In such a case, we may drop the support of those versions even in revision updates unless a workaround is found for the issue.

Contribution Guide

Chainer is an open source software hosted on GitHub and welcomes contributors to take part in the development of the framework. This is a document aimed towards such contributors. Anyone who for instance would like to file an issue or send a pull request (PR) is encouraged to go through it.

Note

As announced, Chainer is under the maintenance phase and further development will be limited to bug-fixes and maintenance only. Pull-requests for new features, enhancements, or backward-compatibility breaking changes will not be accepted.

Issues and Pull Requests

First steps in contributing to Chainer often involve filing an issue or creating a PR. This section describes how to do so.

How to File an Issue

To file an issue on GitHub, you often only need to follow instructions given by the template. Write precise explanations on how you want Chainer to behave or include necessary and sufficient conditions to reproduce the bugs. Feature requests should include what you want to do and preferably why. You may additionally suggest how.

Warning

If you have a question regarding the usage of Chainer, it is recommended that you send a post to StackOverflow or the Chainer User Group instead of the issue tracker. The issue tracker is not a place to share knowledge on practices.

How to Send a Pull Request

If you can write code to fix an issue, it is encouraged to send a PR.

In that case, confirm the following points before starting to write any code.

  • Read Coding Guidelines and Unit Testing.

  • Check the appropriate branch to which you should send a PR, following Git Branches. If you are unsure about which branch to target, choose the master branch. The current source tree of the chosen branch is the starting point of your change.

After writing your code (including unit tests and hopefully documentations!), send a PR on GitHub. You have to write a precise explanation of what and how in the description; this is the first documentation of your code and an important part of your PR.

However, even if your code is not complete, you can send a PR as a work-in-progress (WIP) PR by prefixing the PR title with [WIP]. If you just describe the PR, the core team and other contributors can join the discussion about how to proceed with it. WIP PRs are occasionally useful for discussions based on concrete code.

When a PR is created (or updated), it is automatically tested in one of our CI environments, namely Travis CI. There are other CI environments as well, often triggered manually by the reviewer. The various CIs are required to test, for instance, different platforms or CUDA environments. Once the tests in all CI environments pass and/or the PR is approved by the reviewer, the PR will be merged.

Note

If you are planning to add a new feature or modify existing APIs, it is recommended that you open an issue and discuss the design first. Following the outcome of the discussion, you can send a PR that is reviewed smoothly in a shorter time.

Issue/Pull Request Labels

Issues and PRs are labeled on GitHub so that they can be grouped, filtered and better maintained. For instance, a label can indicate that a ticket needs a response from the PR author, or that an issue needs immediate action in case of a critical bug. Please refer to the list of labels on GitHub.

Coding Guidelines

We follow PEP 8 and partially OpenStack Style Guidelines as basic style guidelines. Any contributions in terms of code are expected to follow these guidelines.

You can use the autopep8 and the flake8 commands to check whether or not your code follows the guidelines. In order to avoid confusion from using different tool versions, we pin the versions of those tools. Install them with the following command (from within the top directory of the Chainer repository):

$ pip install -e '.[stylecheck]'

And check your code with:

$ autopep8 path/to/your/code.py
$ flake8 path/to/your/code.py

autopep8 can automatically correct Python code to conform to the PEP 8 style guide:

$ autopep8 --in-place path/to/your/code.py

The flake8 command lets you know parts of your code that are not following the style guidelines.

Note that the flake8 command is not perfect. It does not check some of the style guidelines. Here is a (non-exhaustive) list of the rules that flake8 cannot check.

  • Relative imports are prohibited. [H304]

  • Importing non-module symbols is prohibited.

  • Import statements must be organized into three parts: standard libraries, third-party libraries, and internal imports. [H306]

In addition, we restrict the usage of shortcut aliases in any global-scope code. In particular, you cannot use shortcut aliases to designate a parent class in global-scope class definitions. When you want to make a class inheriting another class defined in another module, you have to spell out the full module name instead of importing a module that provides an alias.

For example, the following code is not allowed.

import chainer

class MyLink(chainer.Link): ...

Instead, import chainer.link and use that.

import chainer.link

class MyLink(chainer.link.Link): ...

If you feel the code is too verbose, you can also use from ... import or import ... as.

from chainer import link

class MyLink(link.Link): ...

Note

From v3.0, we allow shortcut aliases used inside of functions and methods that are not called from any global scope code. For example, you can write chainer.Variable instead of chainer.variable.Variable inside of functions and methods. Use of such aliases was prohibited in the past for avoiding confusing errors related to cyclic dependencies; we relaxed the rule so that the library code looks similar to user code.

When you use such shortcut aliases, please be careful of cyclic imports. One of the typical pitfalls is the way chainer.functions is imported. An import like import chainer.functions as F within modules under chainer.functions does not work. An import like from chainer import functions works well with Python 3, but does not with Python 2. We recommend that you use import chainer.functions and spell it out like chainer.functions.foo in your methods.

Unit Testing

Testing is one of the most important aspects of your PR. You should write test cases and verify your implementation by following the testing guide above. If you modify code related to existing unit tests, you must run appropriate commands and confirm that the tests still pass.

Note that we are using pytest and the mock package for testing. They are not included in Chainer and need to be installed as follows:

$ pip install pytest mock

How to Run Tests

You can run all unit tests with the following command from the root directory of the Chainer repository:

$ python -m pytest

Or specify a test script that you want to run:

$ python -m pytest path/to/your/test.py

You can also run all unit tests under a specific directory:

$ python -m pytest tests/chainer_tests/<directory name>

Some tests require CUDA and cuDNN by default. In order to run unit tests that do not require CUDA and cuDNN, set an environment variable and filter using test marks as follows:

$ export CHAINER_TEST_GPU_LIMIT=0
$ python -m pytest path/to/your/test.py -m='not cudnn'

Some GPU tests involve multiple GPUs. If you want to run GPU tests with an insufficient number of GPUs, specify the number of available GPUs in CHAINER_TEST_GPU_LIMIT. For example, if you only have a single GPU, launch pytest with the following command to skip multi-GPU tests:

$ export CHAINER_TEST_GPU_LIMIT=1
$ python -m pytest path/to/gpu/test.py

Some tests take too much time. If you want to skip such tests, pass the -m='not slow' option to the command:

$ python -m pytest path/to/your/test.py -m='not slow'

Test File and Directory Naming Conventions

Tests are found in the tests/chainer_tests directory. In order to enable the test runner to find test scripts correctly, we are using a special naming convention for the test subdirectories and the test scripts.

  • The name of each subdirectory of tests must end with the _tests suffix.

  • The name of each test script must start with the test_ prefix.

When we write a test for a module, we use the appropriate path and file name for the test script whose correspondence to the tested module is clear. For example, if you want to write a test for a module chainer.x.y.z, the test script must be located at tests/chainer_tests/x_tests/y_tests/test_z.py.

How to Write Tests

There are many examples of unit tests under the tests directory, so reading some of them is a good and recommended way to learn how to write tests for Chainer. They use the unittest package of the standard library, while some tests are additionally using utilities from chainer.testing.

In addition to the Coding Guidelines mentioned above, the following rules apply to the test code:

  • All test classes must inherit from unittest.TestCase.

  • Use unittest features to write tests, except for the following cases:

    • Use assert statement instead of self.assert* methods (e.g., write assert x == 1 instead of self.assertEqual(x, 1)).

    • Use with pytest.raises(...): instead of with self.assertRaises(...):.

Note

We are incrementally applying the above style. Some existing tests may be using the old style (self.assertRaises, etc.), but all newly written tests should follow the above style.
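Below is a minimal sketch of a test written in this style; the tested function (F.relu) and the expected values are used only for illustration:

import unittest

import numpy as np
import pytest

import chainer.functions as F
from chainer.utils import type_check


class TestReLU(unittest.TestCase):

    def test_forward(self):
        x = np.array([-1.0, 2.0], dtype=np.float32)
        y = F.relu(x)
        # Plain assert instead of self.assertEqual.
        assert (y.array == np.array([0.0, 2.0], dtype=np.float32)).all()

    def test_invalid_dtype(self):
        # pytest.raises instead of self.assertRaises.
        with pytest.raises(type_check.InvalidType):
            F.relu(np.arange(3))  # integer input is rejected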

Even if your patch includes GPU-related code, your tests should not fail without GPU capability. Test functions that require CUDA must be tagged with the chainer.testing.attr.gpu decorator:

import unittest
from chainer.testing import attr

class TestMyFunc(unittest.TestCase):
    ...

    @attr.gpu
    def test_my_gpu_func(self):
        ...

The functions tagged with the gpu decorator are skipped if CHAINER_TEST_GPU_LIMIT=0 environment variable is set. We also have the chainer.testing.attr.cudnn decorator to let pytest know that the test depends on cuDNN. The test functions decorated with cudnn are skipped if -m='not cudnn' is given.

The test functions decorated with gpu must not depend on multiple GPUs. In order to write tests for multiple GPUs, use the chainer.testing.attr.multi_gpu() decorator instead:

import unittest
from chainer.testing import attr

class TestMyFunc(unittest.TestCase):
    ...

    @attr.multi_gpu(2)  # specify the number of required GPUs here
    def test_my_two_gpu_func(self):
        ...

If your test requires too much time, add the chainer.testing.attr.slow decorator. The test functions decorated with slow are skipped if -m='not slow' is given:

import unittest
from chainer.testing import attr

class TestMyFunc(unittest.TestCase):
    ...

    @attr.slow
    def test_my_slow_func(self):
        ...

Note

If you want to specify more than two attributes, use the and operator, like -m='not cudnn and not slow'. See the pytest documentation for details.

Documentation

When adding a new feature to the framework, you should also document it in the reference so that other users can find it in the official documentation. For example, if you are adding a new function under chainer.functions, Functions should be updated.

The documentation source is stored under docs directory and written in reStructuredText format.

To build the documentation, you need to install Sphinx:

$ pip install sphinx sphinx_rtd_theme

Note

Docstrings (documentation comments in the source code) are collected from the installed Chainer module. If you have edited docstrings in checked-out source files and want to see those changes reflected in the generated HTML, Chainer must be installed in develop mode. To do this, run pip install -e . from the top of the Chainer directory.

Then you can build the documentation in HTML format locally:

$ cd docs
$ make html

HTML files are generated under build/html directory. Open index.html with the browser and see if it is rendered as expected.

Note

If you are unsure about how to write the documentation or failed to build it locally, you can submit a PR without documentation. Reviewers will help you with it.

Other Forms of Contribution

There are several other ways in which you can contribute to Chainer without directly working with the code base. Following are such contributions.

Development Cycle

This section explains the development process of Chainer.

Versioning

The versioning of Chainer follows PEP 440 and a part of Semantic versioning. The version number consists of three or four parts: X.Y.Zw where X denotes the major version, Y denotes the minor version, Z denotes the revision number, and the optional w denotes the pre-release suffix. While the major, minor, and revision numbers follow the rule of semantic versioning, the pre-release suffix follows PEP 440, the Python community standards.

Note that a major update basically does not contain compatibility-breaking changes from the last release candidate (RC). This is not a strict rule, though; if there is a critical bug in the API that needs to be fixed for the major version, breaking changes may be introduced.

For more on backward compatibility, please refer to the API Compatibility Policy.

Release Cycle

A milestone for each upcoming release is published on GitHub. The GitHub milestones are used to group issues and PRs belonging to a release.

Git Branches

master branch is used for Chainer v7.x development.

Tips and FAQs

It takes too long to compile a computational graph. Can I skip it?

Chainer does not compile computational graphs, so you cannot skip it, or, I mean, you have already skipped it :).

It seems you have actually seen on-the-fly compilations of CUDA kernels. CuPy compiles kernels on demand so that they are optimized for the number of dimensions and element types of the input arguments. Pre-compilation is not available, because we would have to compile an exponential number of kernels to support all CuPy functionalities. This restriction is unavoidable because Python cannot call CUDA/C++ template functions in a generic way. Note that every framework using CUDA requires compilation at some point; the difference between statically-compiled frameworks (such as cutorch) and Chainer is whether a kernel is compiled at installation time or at first use.

These compilations should run only at the first use of the kernels. The compiled binaries are cached in the $(HOME)/.cupy/kernel_cache directory by default. If you see that compilations run every time you run the same script, then caching has failed. Please check that the directory is kept as is between multiple executions of the script. If your home directory is not suited to caching the kernels (e.g., when it is on NFS), change the kernel caching directory by setting the CUPY_CACHE_DIR environment variable to an appropriate path. See CuPy Overview for more details.

MNIST example does not converge in CPU mode on Mac OS X

Note

Mac OS X is not an officially supported OS.

Many users have reported that the MNIST example does not work correctly when using vecLib as the NumPy backend on Mac OS X. vecLib is the default BLAS library installed on Mac OS X.

We recommend using other BLAS libraries such as OpenBLAS.

To use an alternative BLAS library, it is necessary to reinstall NumPy. Here are instructions to install NumPy with OpenBLAS using Conda.

$ conda install -c conda-forge numpy

Otherwise, to install NumPy without Conda, you may need to install NumPy from source.

Use Homebrew to install OpenBLAS.

$ brew install openblas

Uninstall the existing NumPy installation:

$ pip uninstall numpy

You’ll need to create a file called .numpy-site.cfg in your home directory (~/) with the following:

[openblas]
libraries = openblas
library_dirs = /usr/local/opt/openblas/lib
include_dirs = /usr/local/opt/openblas/include

Install NumPy from the source code:

$ pip install --no-binary :all: numpy

Confirm NumPy has been installed with OpenBLAS by running this command:

$ python -c "import numpy; print(numpy.show_config())"

You should see the following information:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  libraries = ['openblas', 'openblas']
  library_dirs = ['/usr/local/opt/openblas/lib']
  language = c
  define_macros = [('HAVE_CBLAS', None)]
  runtime_library_dirs = ['/usr/local/opt/openblas/lib']
 ...

Once this is done, you should be able to import chainer without OpenBLAS errors.

For details of this problem, see issue #704.

How do I fix InvalidType error?

Chainer raises an InvalidType exception when invalid inputs are given to functions. If you get InvalidType, you generally need to check whether the dtype and/or shape of the inputs are valid for the function.

Here are some examples of InvalidType errors:

import chainer.functions as F
import numpy as np

arr = np.arange(10) - 5
F.relu(arr)
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType:
Invalid operation is performed in: ReLU (Forward)

Expect: x.dtype.kind == f
Actual: i != f

In this case, the kind of x (the first argument of the function relu()) is expected to be f (floating-point), whereas the input was i (signed integer). You need to cast the input appropriately before passing it to the function (e.g., x.astype(np.float32)).
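A minimal sketch of the fix, casting the input before the call:

import chainer.functions as F
import numpy as np

arr = np.arange(10) - 5
y = F.relu(arr.astype(np.float32))  # no InvalidType is raised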

import chainer.functions as F
import numpy as np

x = np.ones((4, 4))
y = np.ones((3, 3))
F.concat([x, y])
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType:
Invalid operation is performed in: Concat (Forward)

Expect: in_types[0].shape[0] == in_types[1].shape[0]
Actual: 4 != 3

In this case, the function expects that x.shape[0] is equal to y.shape[0], but actually they were 4 and 3, respectively.

See Type Checks for the detailed behavior of the type checking system in Chainer.

How do I accelerate my model using Chainer Backend for Intel Architecture?

Follow these steps to utilize Chainer Backend for Intel Architecture in your model.

Install Chainer Backend for Intel Architecture

The following environments are recommended by Chainer Backend for Intel Architecture.

  • Ubuntu 14.04 / 16.04 LTS (64-bit) and CentOS 7 (64-bit)

  • Python 2.7.6+, 3.5.2+, and 3.6.0+

On recommended systems, you can install Chainer Backend for Intel Architecture wheel (binary distribution) by:

$ pip install 'ideep4py<2.1'

Note

ideep4py v1.0.x is incompatible with v2.0.x, and is not supported in Chainer v5.0 or later.

Enable Chainer Backend for Intel Architecture Configuration

Currently Chainer Backend for Intel Architecture is disabled by default because it is an experimental feature. You need to manually enable it by changing chainer.config.use_ideep configuration to 'auto'. See Configuring Chainer for details.

The easiest way to change the configuration is to set an environment variable as follows:

export CHAINER_USE_IDEEP="auto"

You can also use chainer.using_config() to change the configuration.

x = np.ones((3, 3), dtype='f')
with chainer.using_config('use_ideep', 'auto'):
    y = chainer.functions.relu(x)
print(type(y.data))
<class 'ideep4py.mdarray'>

Convert Your Model to Chainer Backend for Intel Architecture

You need to call model.to_intel64() (in the same way you call model.to_gpu() to transfer your link to GPU) to convert the link to Chainer Backend for Intel Architecture.
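A minimal sketch, assuming iDeep (ideep4py) is installed and enabled; L.Linear stands in here for your own model:

import chainer
import chainer.links as L

model = L.Linear(10, 10)
chainer.config.use_ideep = 'auto'
model.to_intel64()  # converts the link for iDeep acceleration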

Run Your Model

Now your model is accelerated by Chainer Backend for Intel Architecture!

Please note that not all functions and optimizers support Chainer Backend for Intel Architecture acceleration. Also note that Chainer Backend for Intel Architecture will not be used depending on the shape and data type of the input data.

My training process gets stuck when using MultiprocessIterator

When you are using OpenCV somewhere in your code and MultiprocessIterator is used in the training code, the training loop may get stuck at some point. In such a situation, there are several workarounds to prevent the process from getting stuck.

  1. Set the environment variable as follows: OMP_NUM_THREADS=1

  2. Add cv2.setNumThreads(0) right after import cv2 in your training script.

  3. Use MultithreadIterator instead of MultiprocessIterator.

This problem was originally reported here: A training loop got stuck in a certain condition with multi-processing updater and opencv for Chainer, and the discussion on related problems is still ongoing here: OpenCV + Python multiprocessing breaks on OSX.

Performance Best Practices

This guide explains some tips and advice for maximizing the performance of Chainer.

Use the Latest Version

It is generally recommended that you use the latest version of Chainer and its dependent libraries (CUDA, cuDNN, iDeep, etc.). Some of the new features and performance optimizations introduced in newer versions of dependent libraries may not be available in older versions of Chainer. Also, Chainer itself is incrementally being improved to provide better performance.

If you are using Chainer v4 or later, you can check the version configuration by:

chainer.print_runtime_info()
Chainer: 4.0.0
NumPy: 1.14.3
CuPy:
  CuPy Version          : 4.0.0
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 9000
  CUDA Driver Version   : 9000
  CUDA Runtime Version  : 9000
  cuDNN Build Version   : 7100
  cuDNN Version         : 7100
  NCCL Build Version    : 2102

Generally, the Chainer team is maintaining the API between minor updates (e.g., v4.0 to v4.1) so that users can upgrade Chainer without modifying their code (see API Compatibility Policy for our policy). As for major updates, please refer to the Upgrade Guide to understand what should be done for migration.

Enable Hardware Accelerations

Using GPU

In most cases, running on GPU will give you better performance than on CPU. When using GPU, also make sure to install cuDNN, which is a library to accelerate deep neural network computations.

Note

You don’t have to manually install cuDNN if you are using CuPy wheels, which includes the latest version of cuDNN. Check the output of chainer.print_runtime_info(); if you see the cuDNN version number, it is installed properly and will be used by Chainer automatically.

Note

If you wish, you can manually disable use of cuDNN using chainer.config.use_cudnn configuration option. See Configuring Chainer for details.

Using CPU

If you are running Chainer on CPU, you can use iDeep to utilize vector instructions of CPU. See Tips and FAQs for steps to run your model with iDeep.

You can also improve performance by building NumPy linked to Intel MKL. See Numpy/Scipy with Intel® MKL and Intel® Compilers for the detailed instructions.

Note

If you installed numpy package using Anaconda, you may already have MKL-linked NumPy. Check the output of numpy.show_config() to see what linear algebra library is linked.

Note

Use of iDeep and use of MKL-linked NumPy are orthogonal. You can use both of them at once to maximize performance.

Migrate Data Preprocessing Code from NumPy to CuPy

If you are preprocessing your dataset or running data augmentation using NumPy, you may be able to use CuPy as a substitution to improve performance.

Note

It is not always efficient to use CuPy instead of NumPy, especially when the computation is not very heavy, or it cannot be done in batch.

Avoid Data Transfer

If you are using GPU, be aware of data transfer between CPU and GPU. For example, printing chainer.Variable on GPU (e.g., for debugging) will cause memory transfer from GPU to CPU, which will incur synchronization overhead.

You can use NVIDIA Visual Profiler to diagnose this kind of issue.

Optimize cuDNN Convolution

Workspace Size

Some convolution algorithms in cuDNN use additional GPU memory as a temporary buffer. This is called the “workspace,” and users can adjust the upper limit of its size. By increasing the limit of the workspace size, cuDNN may be able to use a better (i.e., more memory-consuming but faster) algorithm.

The default size (in bytes) is:

>>> chainer.backends.cuda.get_max_workspace_size()
8388608

and can be adjusted using chainer.backends.cuda.set_max_workspace_size().
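For example, a minimal sketch that raises the limit to 512 MiB (the value is an arbitrary example, not a recommendation):

import chainer

chainer.backends.cuda.set_max_workspace_size(512 * 1024 * 1024)
print(chainer.backends.cuda.get_max_workspace_size())  # 536870912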

Maximum required workspace size may vary depending on various conditions such as GPU hardware and batch size of inputs.

Auto-Tuner

Some convolution algorithms in cuDNN support the auto-tuner feature that finds the fastest convolution algorithm for given inputs. You can turn on this feature by setting autotune configuration to True.

See Configuring Chainer for detailed descriptions.
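A minimal sketch, assuming model and x are a GPU model and input defined elsewhere:

import chainer

with chainer.using_config('autotune', True):
    y = model(x)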

Note

Auto-tuner tries to find the best algorithm for every first observation of the input shape combination. Therefore, the first batch will become slower when auto-tuner is enabled. The result of auto-tuner is cached on memory so that it can be reused for data with the same input shape combination. In other words, algorithm selected in the first batch will be reused for the second and later batches, as long as the input shape combination is the same.

If you set autotune configuration to False, the default convolution algorithm will always be selected, regardless of the previous auto-tuner results.

Note

Auto-tuner always uses the maximum workspace size.

Fine-Tune Configuration

There are some Chainer configuration values that affect performance. Although the default values work well in most cases, you can adjust the following configurations for better performance.

  • enable_backprop

    If you are running your model for inference (i.e., you don’t have to use back propagation because you are not training the model), you can set this configuration to False to improve performance and reduce memory consumption.

  • type_check

    By default, Chainer checks the integrity between input data and functions. This makes it possible to display a friendly message when, for example, data with an invalid dtype or shape is given to a function. By setting this configuration to False, you can let Chainer skip such checks to improve performance. It is recommended that you turn off the checks only for well-tested code and input data.

See Configuring Chainer for detailed descriptions.
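A minimal inference sketch with both configurations disabled, assuming model and x are defined elsewhere:

import chainer

with chainer.using_config('enable_backprop', False), \
        chainer.using_config('type_check', False):
    y = model(x)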

Load Datasets Concurrently

If the loading process of your dataset is I/O-bound or CPU-bound, consider using chainer.iterators.MultithreadIterator or chainer.iterators.MultiprocessIterator to load the dataset concurrently using multiple threads or processes, instead of chainer.iterators.SerialIterator, which works in a single thread in a single process.

Use Multiple GPUs

You can utilize multiple GPUs to make the training process faster.

For data parallelism, you can use chainer.training.updaters.ParallelUpdater or chainer.training.updaters.MultiprocessParallelUpdater instead of chainer.training.updaters.StandardUpdater. For model parallelism, you need to manually transfer each chainer.Link in your model to each device.
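As a minimal sketch, data parallelism over two GPUs might look like the following, assuming train_iter and optimizer are already set up and the device numbers are illustrative:

from chainer.training import updaters

updater = updaters.ParallelUpdater(
    train_iter, optimizer, devices={'main': 0, 'second': 1})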

See Using GPU(s) in Chainer for the working examples of each case.

Use Multiple Nodes

You can scale out the training process of your Chainer model to a multi-node cluster by using the ChainerMN module, which enables distributed deep learning.

Upgrade Guide

This is a list of changes introduced in each release that users should be aware of when migrating from older versions. Most changes are carefully designed not to break existing code; however, changes that may possibly break it are highlighted with a box.

Chainer v7

Dropping Support of Python 2.7

In Chainer v7, Python 2.7 is no longer supported as it reaches its end-of-life (EOL) in January 2020. Python 3.5.2 is the minimum Python version supported by Chainer v7. Please upgrade the Python version if you are using Python 2.7 to any later versions listed under Installation.

CuPy v7

Chainer v7 requires CuPy v7 if you need GPU support. Please see the Upgrade Guide for CuPy v7 for details.

Chainer v6

Dropping Support of Python 3.4

In Chainer v6, Python 3.4 is no longer supported as it reaches its end-of-life (EOL) in March 2019. Python 3.5.1 is the minimum Python 3 version supported by Chainer v6. Please upgrade the Python version if you are using Python 3.4 to any later versions listed under Installation.

CuPy Needs To Be Manually Updated

Prior to Chainer v6, CuPy was automatically updated to the appropriate version when updating Chainer (i.e., pip install -U chainer updated the CuPy package). In Chainer v6, Chainer does not perform this automatic update. You need to manually update the CuPy package when updating the Chainer package.

This is because the automatic update made it difficult for users to switch between CuPy packages (e.g., cupy-cuda90, cupy-cuda92, etc.). See #5425 for details.

Deprecation Notice on Communicators and Old NCCL versions

Chainer v6 only supports NCCL 2.3 and newer versions. Old NCCL versions are deprecated and will be removed in future versions. Along with the deprecation of old NCCL versions, several communicators built for them are deprecated as well:

  • hierarchical

  • two_dimensional

  • single_node

They will be removed in future versions. Also, the default communicator has been changed from hierarchical to pure_nccl.
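A minimal sketch that creates the now-default communicator explicitly (to be run under MPI, e.g. via mpiexec):

import chainermn

comm = chainermn.create_communicator('pure_nccl')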

CuPy v6

Chainer v6 requires CuPy v6 if you need GPU support. Please see the Upgrade Guide for CuPy v6 for details.

Chainer v5

ChainerMN Became Part of Chainer

ChainerMN, which enables multi-node distributed deep learning using Chainer, has been merged to Chainer v5.

Prior to Chainer v5, ChainerMN was provided as a separate chainermn package. In Chainer v5, ChainerMN became a part of Chainer; ChainerMN is installed just by installing the chainer package. If you are using the chainermn package, make sure to remove it by running pip uninstall chainermn before upgrading to Chainer v5 or later.

For documentation of ChainerMN, see Distributed Deep Learning with ChainerMN.

FunctionNode Classes are Hidden from chainer.functions

Prior to Chainer v5, FunctionNode classes (e.g., chainer.functions.MaxPooling2D) are exposed under chainer.functions. In Chainer v5, these classes are hidden from chainer.functions. Use the equivalent wrapper functions listed in Functions (e.g., chainer.functions.max_pooling_2d()) instead.

Some wrapper functions now provide options to access internal states to avoid directly using FunctionNode classes.

For example, suppose your existing code needs to access MaxPooling2D.indexes to later perform upsampling:

p = F.MaxPooling2D(2, 2)
h = p.apply((x,))[0]
...
y = F.upsampling_2d(h, p.indexes, ksize=2)

The above code may raise this error in Chainer v5:

AttributeError: module 'chainer.functions' has no attribute 'MaxPooling2D'

You can rewrite the above code using return_indices option of chainer.functions.max_pooling_2d():

h, indices = F.max_pooling_2d(x, 2, 2, return_indices=True)
...
y = F.upsampling_2d(h, indices, ksize=2)

Updaters Automatically Call Optimizer.new_epoch

This change should affect only a minority of users (who call new_epoch() while using a trainer, or who implement their own Updater class).

Optimizers provide the new_epoch() method, which can be used to change the behavior of optimizers depending on the current epoch number. Prior to Chainer v5, this method was expected to be called by users. In Chainer v5, updaters have been changed to call new_epoch() automatically. If you have been calling the new_epoch() method manually while using a trainer (or an updater), you may need one of the following fixes:

  • Pass auto_new_epoch=False to the constructor of the updater (e.g., StandardUpdater) to stop new_epoch() from being called automatically by the updater.

  • Avoid calling new_epoch() method manually.

If you implement your own Updater class, you may need to update your code to automatically call new_epoch() (you can refer to the changes introduced in #4608 to understand how to fix your updater).

Extending the Backend Namespace

In addition to chainer.backends, we introduced chainer.backend. This subpackage contains utility functions that span several backends. For instance, it includes chainer.backend.get_array_module(), which used to be defined as chainer.backends.cuda.get_array_module(). Both can be used, but the latter will be deprecated.
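A minimal sketch of backend-agnostic code written against the new namespace:

import numpy as np
import chainer

x = np.ones((2, 3), dtype=np.float32)
xp = chainer.backend.get_array_module(x)  # numpy here, cupy for GPU arrays
y = xp.exp(x)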

get_device_from_array Returns Actual Device for Empty Arrays

Prior to Chainer v5, chainer.backends.cuda.get_device_from_array() returned chainer.backends.cuda.DummyDeviceType if the array is empty. In Chainer v5, it has been changed to return the actual cupy.cuda.Device object:

>>> x = cupy.array([])
>>> chainer.backends.cuda.get_device_from_array(x)
<CUDA Device 0>

Update of Docker Images

Chainer official Docker images (see Installation for details) are now updated to use CUDA 9.2 and cuDNN 7.

To use these images, you may need to upgrade the NVIDIA driver on your host. See Requirements of nvidia-docker for details.

CuPy v5

Chainer v5 requires CuPy v5 if you need GPU support. Please see the Upgrade Guide for CuPy v5 for details.

Chainer v4

Introduction of Backend Namespace

We introduced the chainer.backends subpackage for future support of various backend libraries other than NumPy and CuPy. With this change, the chainer.cuda module has been moved to chainer.backends.cuda.

This does not break the existing code; you can safely continue to use chainer.cuda (e.g., from chainer import cuda) but it is now encouraged to use from chainer.backends import cuda instead.

Namespace Changes for Updaters

chainer.training.StandardUpdater and chainer.training.ParallelUpdater are now moved to chainer.training.updaters.StandardUpdater and chainer.training.updaters.ParallelUpdater respectively, to align with the namespace convention of other subpackages. See the discussion in #2982 for more details.

This change does not break the existing code; you can safely continue to use updater classes directly under chainer.training but it is now encouraged to use chainer.training.updaters instead.

Namespace Changes for Optimizer Hooks

Optimizer hook functions have been moved from chainer.optimizer.* to chainer.optimizer_hooks.*. For example, chainer.optimizer.WeightDecay is now located at chainer.optimizer_hooks.WeightDecay.

If the existing code is using hooks directly under chainer.optimizer, DeprecationWarning will be shown. You are now encouraged to use chainer.optimizer_hooks instead.
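A minimal sketch using the new namespace, assuming model is a chainer.Link defined elsewhere:

import chainer
from chainer import optimizer_hooks

optimizer = chainer.optimizers.SGD()
optimizer.setup(model)
optimizer.add_hook(optimizer_hooks.WeightDecay(1e-4))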

Prohibition of Mixed Use of Arrays on Different Devices in Function Arguments

Argument validation of functions has been made stricter to check device consistency of argument variables and to provide better error messages to users. Consider the following code:

v1 = chainer.Variable(np.arange(10, dtype=np.float32))      # CPU
v2 = chainer.Variable(cupy.arange(10, dtype=cupy.float32))  # GPU

# The line below raises an exception, because arguments are on different device.
F.maximum(v1, v2)

Prior to v4, the above code raised an exception like ValueError: object __array__ method not producing an array, which was difficult to understand. In v4, the error message becomes TypeError: incompatible array types are mixed in the forward input (Maximum). This kind of error usually occurs by mistake (for example, forgetting to call to_gpu for some variables).

Attention

As the argument validation has become stricter, calls of functions that intentionally mix NumPy/CuPy arrays in their arguments will not work in Chainer v4. Please transfer all arrays to the same device before calling functions.

References to Function Nodes Not Retained in TimerHook and CupyMemoryProfileHook

To reduce memory consumption, references to the function nodes will no longer be retained in the chainer.function_hooks.CupyMemoryProfileHook and chainer.function_hooks.TimerHook. See the discussion in #4300 for more details.

Attention

The existing code that uses function nodes retained in the call_history attribute of these hooks will not work. The first element of call_history became the name of the function, instead of the function node instance itself. You can define your own function hook if you need to access the function node instances.

Update of Docker Images

Chainer official Docker images (see Installation for details) are now updated to use CUDA 8.0 and cuDNN 6.0. This change was introduced because CUDA 7.5 does not support NVIDIA Pascal GPUs.

To use these images, you may need to upgrade the NVIDIA driver on your host. See Requirements of nvidia-docker for details.

CuPy v4

Chainer v4 requires CuPy v4 if you need GPU support. Please see the Upgrade Guide for CuPy v4 for details.

Chainer v3

Introduction of New-style Functions

This release introduces new-style functions (classes inheriting from FunctionNode) that support double backward (gradient of gradient). See the Release Note for v3.0.0 for the usage of this feature.

Many of the functions listed in Functions have already been migrated to the new style, although some functions are still old-style (classes inheriting from Function). We are going to migrate more old-style functions to the new style in upcoming minor releases.

This does not break the existing code. Old-style functions (classes inheriting from Function) are still supported in v3 and future versions of Chainer.

If you are going to write new functions, it is encouraged to use FunctionNode to support double backward.

Attention

Users relying on undocumented function APIs (directly instantiating old-style classes) may experience an error like TypeError: 'SomeFunction' object is not callable after upgrading to v3. Please use the function APIs documented in Functions.

Changed Behavior of matmul Function

The behavior of chainer.functions.matmul() has been changed to behave like the corresponding NumPy function (numpy.matmul()). See the discussion in #2426 for more details.

Attention

The existing code using chainer.functions.matmul() may require modification to work with Chainer v3.

Also note that chainer.functions.batch_matmul() is now deprecated by this change. You can rewrite it using chainer.functions.matmul().

Removed use_cudnn Argument in spatial_transformer_grid and spatial_transformer_sampler Functions

use_cudnn argument has been removed from chainer.functions.spatial_transformer_grid() and chainer.functions.spatial_transformer_sampler(). See the discussion in #2955 for more details.

Attention

The existing code using the use_cudnn argument of chainer.functions.spatial_transformer_grid() and chainer.functions.spatial_transformer_sampler() requires modification to work with Chainer v3. Please use the configuration context (e.g., with chainer.using_config('use_cudnn', 'auto'):) to enable or disable use of cuDNN. See Configuring Chainer for details.

CuPy v2

Chainer v3 requires CuPy v2 if you need GPU support. Please see the Upgrade Guide for CuPy v2 for details.

Chainer v2

See Upgrade Guide from v1 to v2 for the changes introduced in Chainer v2.

Upgrade Guide from v1 to v2

This documentation provides detailed information on the differences between Chainer v1 and v2. By reading it, you will know which parts of your code are required (or recommended) to be fixed when you upgrade Chainer from v1 to v2.

CuPy
CuPy has been separated from Chainer into a separate package

CuPy, which was originally a part of Chainer, has been separated into a different Python package since Chainer v2. It changes the way to set up Chainer with CUDA support. In particular, you have to separately install cupy package to enable CUDA support. See Installation for the recommended installation steps.

Fortunately, there is no need to update your source code to catch up with this change.

Global configurations
Training mode is configured by a thread-local flag

In Chainer v2, the concept of training mode is added. It is represented by a thread-local flag chainer.config.train, which is a part of the unified configuration. When chainer.config.train is True, functions of Chainer run in the training mode, and otherwise they run in the test mode. For example, BatchNormalization and dropout() behave differently in each mode.

In Chainer v1, such a behavior was configured by the train or test argument of each function. This train/test argument has been removed in Chainer v2. If your code is using the train or test argument, you have to update it. In most cases, what you have to do is just removing the train / test argument from any function calls.

Example

Consider the following model definition and the code to call it in test mode written for Chainer v1.

# Chainer v1
import chainer.functions as F

class MyModel(chainer.Link):
    ...

    def __call__(self, x, train=True):
        return f(F.dropout(x, train=train))

m = MyModel(...)
y = m(x, train=False)

In Chainer v2, it should be updated into the following code:

# Chainer v2
import chainer
import chainer.functions as F

class MyModel(chainer.Link):
    ...

    def __call__(self, x):
        return f(F.dropout(x))

m = MyModel(...)
with chainer.using_config('train', False):
    y = m(x)
Configurations are added and replace some of existing global flags

Besides the training mode, many other global settings have been moved to the unified configuration. The following is the complete list of the configuration entries that have corresponding features in Chainer v1.

chainer.config.cudnn_deterministic

It corresponds to the deterministic argument of some convolution functions in Chainer v1. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.cudnn_deterministic flag to change the behavior of the convolution functions.

chainer.config.debug

It corresponds to the debug mode in Chainer v1, which was configured by set_debug() and queried by is_debug(). These functions are also available in Chainer v2, so you basically do not need to update the code related to the debug mode.

chainer.config.enable_backprop

It corresponds to the backprop mode in Chainer v1. The functions no_backprop_mode() and force_backprop_mode() are still available in Chainer v2; they automatically turn the enable_backprop flag off and on. One important difference from Chainer v1 is that the volatile flag is removed from Variable. Therefore, there are more situations in which you need to modify the enable_backprop flag.

chainer.config.keep_graph_on_report

This flag configures whether or not to keep the computational graph alive for a reported variable. In Chainer v2, when a Variable object is reported by report(), a copy of the variable isolated from the computational graph is created and stored by default. By setting this flag to True, you can change this behavior so that the original Variable object is stored as is. See When a variable is reported, the variable is copied with the graph purged for the details.

chainer.config.train

It corresponds to the train or test argument of some functions in Chainer v1. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.train flag instead. See Training mode is configured by a thread-local flag for more details.

chainer.config.type_check

It corresponds to the Function.type_check_enable flag. If your code touches this flag, you have to use chainer.config.type_check instead. Note that the environment variable CHAINER_TYPE_CHECK is still available in Chainer v2, so if you are only using the environment variable, there is no need to update your code.

chainer.config.use_cudnn

It corresponds to the use_cudnn argument of many functions that have cuDNN implementations. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.use_cudnn flag instead. Note that this flag is ternary, not binary. See Configuring Chainer for more details.

These configurations can be modified in two ways.

  • Simply assigning a new value to an entry, like chainer.config.train = False.

  • Using the chainer.using_config context manager. It can be used with the with statement of Python as follows:

    with chainer.using_config('train', False):
        do something  # this code runs with chainer.config.train == False
    

    It restores the original configuration after exiting the with block.

The chainer.config manages the thread-local configuration. You can also set the global configuration by modifying chainer.global_config. Note that the global configuration is used only if the entry of the thread-local configuration is not explicitly set up.
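
For example, the interplay between the thread-local and the global configuration can be sketched as follows (using the train entry purely as an illustration):

import chainer

# Set a process-wide default through the global configuration...
chainer.global_config.train = False

# ...and override it thread-locally (and temporarily) where needed.
with chainer.using_config('train', True):
    assert chainer.config.train       # thread-local value

assert not chainer.config.train       # falls back to the global value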

Variable
Volatile flag is removed

The Variable.volatile flag has been removed since Chainer v2.

Instead, the configuration chainer.config.enable_backprop can be used to enable/disable the automatic differentiation feature. If it is True, Chainer always creates a computational graph during the forward propagation, which corresponds to passing non-volatile variables in Chainer v1. Otherwise, Chainer does not create a graph, which corresponds to passing volatile variables in Chainer v1. The biggest difference is that enable_backprop is a thread-local flag, whereas volatile was a flag local to each Variable object. Note that the enable_backprop flag already existed in Chainer v1, where it took effect only if all the inputs to the function had volatile == 'auto'.

The chainer.config.enable_backprop flag can be modified directly or by using using_config(). See Configuring Chainer for details. There is also a convenience function, no_backprop_mode(), to turn off the flag.

If you are using the Variable.volatile flag, you have to stop setting this flag (it will not take effect), and set the enable_backprop flag instead.

Example

Let model be your model, and consider the following code that calls it in volatile mode.

# Chainer v1
x_data = ...   # ndarray
x = chainer.Variable(x_data, volatile=True)
y = model(x)

In Chainer v2, it should be updated as follows.

# Chainer v2
x_data = ...   # ndarray
x = chainer.Variable(x_data)
with chainer.no_backprop_mode():
    y = model(x)
Variable is not a part of a computational graph anymore

The Variable class has been separated into two distinct classes, the Variable class and the VariableNode class, since Chainer v2. Every Variable object owns its own VariableNode object. A computational graph consists of Function objects and VariableNode objects. When one applies a Function to a Variable, the VariableNode object of the variable is extracted and set to one of the inputs of the function.

Note that the underlying data array of the variable is still held by the Variable object. It allows each Function implementation to release unneeded arrays from the computational graph, resulting in greatly reduced memory consumption.

This change does not affect most users’ code. If you are directly traversing the computational graph by yourself or modifying the graph ad-hoc, you may have to update your code. In most cases, it is enough to just change Variable into VariableNode in the code traversing the computational graph.

Parameter has to be an instance of Parameter class

Chainer v2 has a subclass of Variable called Parameter. This class provides a convenient interface for setting up a parameter variable registered to a Link.

You basically do not need to update your code because Link.add_param() creates a Parameter object in Chainer v2. There is, however, a new recommended way of registering parameters to a link in Chainer v2. See here for the recommended way of parameter registration.
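
A minimal sketch of the new style, assuming the init_scope() context introduced in Chainer v2 (the parameter names and shapes are illustrative):

import chainer
import chainer.initializers as I

class MyLink(chainer.Link):
    def __init__(self):
        super(MyLink, self).__init__()
        with self.init_scope():
            # Parameters assigned inside init_scope() are registered
            # to the link automatically.
            self.W = chainer.Parameter(I.Normal(0.01), (3, 4))
            self.b = chainer.Parameter(0.0, (3,))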

Small changes to Variable

There are some changes on the interface and specification of methods.

  • len(variable) returns the length of the first axis of the underlying array in Chainer v2. This is equivalent to len(variable.data). It is different from the behavior of Chainer v1, in which len returned the total number of elements in the underlying array.

  • repr(variable) returns a NumPy-like text representation of the underlying array in Chainer v2. In Chainer v1, it just returned a string showing the name of the variable (see the short illustration below).
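
Assuming a small float32 array just for the example:

import numpy as np
import chainer

v = chainer.Variable(np.zeros((2, 3), dtype=np.float32))
len(v)   # 2 in Chainer v2 (length of the first axis); 6 in Chainer v1
repr(v)  # NumPy-like text of the underlying array in v2; just the name in v1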

Function
The force_tuple option of split_axis is True by default

In Chainer v2, the force_tuple argument of functions.split_axis() is set to True by default. Therefore, it always returns a tuple regardless of the number of sections made after the split. It was False by default in Chainer v1.

Type check APIs are updated to enable lazy building of the error messages

In Chainer v2, the type check APIs are updated so that the overhead of checking types is greatly reduced. In order to achieve the overhead reduction, some APIs are changed.

If you have custom Function implementations that do type checking, you have to update your code. The following list shows which part has to be updated.

Background of this change: in Chainer v1, the type checking APIs build an abstract syntax tree (AST) based on each expression that tests some condition. The AST is used to emit a readable error message. However, building an AST requires constructing many Python objects, which adds large Python overheads. In Chainer v2, the Function.type_check_forward() method is called once or twice. At the first call, the type checking APIs run in a light-weight mode, where they do not build an AST and just check the condition. The second call is made only if some test fails, in which case the AST is built. This change makes the ordinary path of type checking much faster, while keeping the readable error messages.

Methods to release unneeded arrays are added

As written above, Chainer v2 introduced a new mechanism to reduce the memory consumption of each Function implementation. In many cases, a Function implementation does not need some of its input arrays in its backward computation. A new method called Function.retain_inputs() can be used to specify which input arrays are actually needed. This method must not be called from outside of Function.forward().

Example

For example, consider the following simple addition function.

class AddFunction(chainer.Function):
    def forward(self, inputs):
        return inputs[0] + inputs[1],

    def backward(self, inputs, grad_outputs):
        return grad_outputs[0], grad_outputs[0]

It can be seen that the backward computation of this function does not use any of the inputs. Then, specifying an empty tuple of indexes to retain_inputs() will reduce the memory overhead.

class AddFunction(chainer.Function):
    def forward(self, inputs):
        self.retain_inputs(())  # does not retain both inputs
        return inputs[0] + inputs[1],

    def backward(self, inputs, grad_outputs):
        return grad_outputs[0], grad_outputs[0]

In some cases, a function can (or has to) use the output arrays instead of the inputs in its backward computation. In Chainer v1, we wrote code that stores the output arrays in attributes of the Function object and reuses them in the backward() method. In Chainer v2, it is recommended to use Function.retain_outputs() to declare which outputs are required in the backward computation. The retained output arrays can be accessed via Function.output_data.

Note

Existing Function implementations that store the output arrays in their own attributes will still run correctly in Chainer v2; there is currently no memory overhead. It is nevertheless recommended to use retain_outputs() so that further memory optimizations can be incorporated in the future.

Example

For example, consider the following simple implementation of the tanh function.

class TanhFunction(chainer.Function):
    def forward(self, inputs):
        xp = chainer.cuda.get_array_module(inputs[0])
        self.y = xp.tanh(inputs[0])
        return self.y,

    def backward(self, inputs, grad_outputs):
        one = self.y.dtype.type(1)  # avoid type promotion
        return grad_outputs[0] * (one - self.y * self.y),

We can use retain_outputs() instead of preserving the output array by ourselves as follows.

class TanhFunction(chainer.Function):
    def forward(self, inputs):
        self.retain_outputs((0,))
        xp = chainer.cuda.get_array_module(inputs[0])
        return xp.tanh(inputs[0]),

    def backward(self, inputs, grad_outputs):
        y = self.output_data[0]
        one = y.dtype.type(1)  # avoid type promotion
        return grad_outputs[0] * (one - y * y),
Optimizer
Deprecated methods of Optimizer are removed

The following methods are removed from Optimizer. These methods had already been deprecated in past versions. If you are using any of them, you have to update your code (see the sketch after this list).

  • zero_grads: use Link.zerograds() instead.

  • compute_grads_norm: you can compute the gradient norm by iterating the list of parameters by Link.params().

  • clip_grads: use GradientClipping instead.

  • weight_decay: use WeightDecay instead.

  • accumulate_grads: use Link.addgrads() instead.
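
A minimal sketch of replacing the removed shortcuts with hook objects; the model below is a placeholder link chosen just for the example.

import chainer.links as L
from chainer import optimizers
from chainer.optimizer import GradientClipping, WeightDecay

model = L.Linear(3, 2)          # placeholder model
opt = optimizers.SGD(lr=0.01)
opt.setup(model)

# Chainer v1: opt.weight_decay(0.0005); opt.clip_grads(1.0)
# Chainer v2: register the corresponding hook objects instead.
opt.add_hook(WeightDecay(0.0005))
opt.add_hook(GradientClipping(1.0))

# zero_grads / accumulate_grads are replaced by Link methods.
model.zerograds()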

GradientMethod is redesigned to allow parameter-specific update rules

In Chainer v2, the new class UpdateRule is used to define an update rule specific to each Parameter object. The UpdateRule is set to each Parameter object, and is used at each update step. This object implements an update formula using the data and gradient arrays.

Each UpdateRule object has an enabled flag, which configures whether the update rule should be applied to that parameter on update. By setting the flag to False, you can freeze the parameter. There are also convenient methods Link.enable_update() and Link.disable_update(), which configure the flag of every parameter under the link hierarchy. In other frameworks, a similar feature is called layer freezing. In Chainer v2, it is officially supported by these methods.

Each UpdateRule object can also hold its own hook functions similar to Optimizer. The built-in hook functions except for GradientClipping can also be used as a hook function of UpdateRule.

In most cases, you do not have to update your code because each optimizer automatically sets up an appropriate UpdateRule object for each parameter.

If you are using a custom gradient-based optimizer implementation, you need to update the implementation. The following list shows what you have to do.

  • Write a subclass of UpdateRule that implements the update rule.

  • Rewrite your GradientMethod implementation. The new implementation only has to set up the update rule for each parameter in the target link.

You can see live examples in the optimizer implementations provided by Chainer.
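
As a rough, minimal sketch (not the actual Chainer implementation; the class names, the learning-rate hyperparameter, and the use of update_core() here are illustrative assumptions):

from chainer import optimizer

class MySGDRule(optimizer.UpdateRule):
    """Illustrative update rule: plain gradient descent for one parameter."""

    def __init__(self, lr=0.01):
        super(MySGDRule, self).__init__()
        self.hyperparam.lr = lr

    def update_core(self, param):
        # param.data and param.grad are the raw arrays of a single Parameter.
        if param.grad is not None:
            param.data -= self.hyperparam.lr * param.grad

class MySGD(optimizer.GradientMethod):
    """Illustrative GradientMethod: it only sets up the rule for each parameter."""

    def __init__(self, lr=0.01):
        super(MySGD, self).__init__()
        self.hyperparam.lr = lr

    def create_update_rule(self):
        return MySGDRule(self.hyperparam.lr)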

Serializer
None is serializable

In Chainer v2, all serializers support serializing and deserializing None values. User code can rely on this feature, i.e., it can serialize and deserialize None with any given serializer. This change only affects your code if it provides its own serializer implementations.

Trainer and Extension
Updater and Evaluator pass raw data arrays to the loss function

In Chainer v2, Updater and Evaluator pass raw data arrays to the loss function without wrapping them with Variable. You might need to update your code so that the loss function (in most cases, the model’s __call__) accepts raw arrays.

Note that raw arrays can be directly passed to any Function; they are automatically wrapped by Variable. For example, if the input is directly passed to a Function object (or any function under chainer.functions), you do not need to update the code.

Example

Consider the following code that obtains the shape of the input via Variable.data.

# Chainer v1
class MyLink(chainer.Link):
    def __call__(self, x):
        shape = x.data.shape  # valid if x is Variable, invalid if x is ndarray
        ...

It should be updated so that the link also accepts a raw array as the input. In this case, we have Variable.shape which is equivalent to data.shape, so you can simply write as follows.

# Chainer v2
class MyLink(chainer.Link):
    def __call__(self, x):
        shape = x.shape  # valid regardless of x being Variable or ndarray
        ...
trigger option is removed from snapshot and snapshot_object

In Chainer v2, the trigger option is removed from the snapshot() and snapshot_object() extensions. The effect of this option duplicated that of the trigger option of Trainer.extend. If you are passing the trigger argument to these extensions, you have to update your code by passing the value to the corresponding Trainer.extend call instead.

Example

Assume that trainer is an instance of Trainer, and consider that you were adding a snapshot() extension as follows.

# Chainer v1
trainer.extend(chainer.training.extensions.snapshot(trigger=(1000, 'iteration')))

It should be updated as follows (note that this code also works with Chainer v1).

# Chainer v1/v2
trainer.extend(chainer.training.extensions.snapshot(), trigger=(1000, 'iteration'))
Extension.invoke_before_training is removed

In Chainer v2, the attribute invoke_before_training of Extension is removed. Instead, the Extension.initialize method is added. This method is called by Trainer.run before entering the training loop.

In Chainer v1, the extension was simply called before entering the training loop when invoke_before_training was True. If you have a custom extension with invoke_before_training=True, you have to update the code: remove the invoke_before_training flag and override the initialize() method. If you are using the make_extension() decorator, you can set the initialize function by passing the initializer argument to make_extension().
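
A minimal sketch of the update, using a hypothetical MyExtension class:

from chainer.training import Extension

# Chainer v1 style (no longer supported):
#
# class MyExtension(Extension):
#     invoke_before_training = True
#
#     def __call__(self, trainer):
#         ...

# Chainer v2 style: override initialize() instead of setting the flag.
class MyExtension(Extension):

    def initialize(self, trainer):
        # Called once by Trainer.run before the training loop starts.
        pass

    def __call__(self, trainer):
        pass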

The dump_graph extension dumps the valid graph only at its first invocation

In Chainer v2, the dump_graph() extension dumps the valid computational graph only at its first invocation. If you want to dump the graph more than once, you have to fix the code. The easiest fix is setting the chainer.config.keep_graph_on_report flag to True, although this cancels the memory-consumption improvement made in Chainer v2. A more memory-efficient fix is to dump the graph without using an extension, e.g. by customizing the loss function or the updater.

Here is the background of this change. In Chainer v2, the Reporter copies reported variables with the computational graph purged by default. On the other hand, the dump_graph() extension requires the computational graph to be reachable from the reported variable. In order to make the graph available, the dump_graph() extension turns on the chainer.config.keep_graph_on_report flag in its initializer (i.e., it turns on the flag before entering the training loop). Since we also want memory efficiency, the dump_graph() extension turns the flag off after dumping the graph at its first invocation (strictly speaking, it restores the original value). As a result, the computational graph is not available from the second invocation onward.

Since dump_graph() restores the original flag value after dumping, you can have the graph dumped at every invocation by changing that original flag value, as in the sketch below.
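
For example, assuming trainer is an instance of Trainer (as in the earlier snapshot example), you could change the default flag value before training:

import chainer
from chainer.training import extensions

# Keep the computational graph on reported variables so that dump_graph
# can record it at every invocation (at the cost of extra memory).
chainer.config.keep_graph_on_report = True
trainer.extend(extensions.dump_graph('main/loss'))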

Reporter
When a variable is reported, the variable is copied with the graph purged

In Chainer v2, when a Variable object is reported using the report() function (or directly using Reporter), a copy of the variable is made without preserving the computational graph. If your code depends on the reachability of the computational graph from the reported variable, you have to update it. The easiest way is to set chainer.config.keep_graph_on_report to True; Chainer will then keep the computational graph reachable from the reported variable.

The possible examples that are affected by this change are as follows (not exhaustive).

  • A custom extension that runs backprop from a reported variable. It is definitely an example of assuming the reachability of the computational graph from the reported variable.

  • An extension that visualizes the computational graph from a reported variable. If you are writing such an extension by yourself, you have to turn on the keep_graph_on_report flag. The dump_graph() extension is another example, for which see the above item for the details.

This change was made for memory performance reasons: with it, the memory used by the computational graph for training is released immediately before invoking extensions. Therefore, changing the behavior by overwriting chainer.config.keep_graph_on_report may increase memory consumption. It may cause an out-of-memory error if the computational graph of the loss function consumes almost all of the memory available in your environment and an extension uses a certain amount of memory (e.g. Evaluator).

Other utilities
Some obsolete classes and functions are removed

The following classes and functions are removed in Chainer v2.

License

Copyright (c) 2015 Preferred Infrastructure, Inc.

Copyright (c) 2015 Preferred Networks, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
