Chainer Colab Notebooks : An easy way to learn and use Deep Learning

You can run each notebook on Colaboratory by clicking the “Show on Colaboratory” link on its page.

Chainer Beginner’s Hands-on

How to write a training loop in Chainer

In this notebook session we will learn how to train a deep neural network to classify hand-written digits using the popular MNIST dataset. This dataset contains 60000 training examples and 10000 test examples. Each example consists of a 28x28 greyscale image and a corresponding class label for the digit. Since the digits 0-9 are used, there are 10 class labels.

Chainer provides a feature called Trainer that can be used to simplify the training process. However, we think it is good for first-time users to understand how the training process works before using the Trainer feature. Even advanced users might sometimes want to write their own training loop and so we will explain how to do so here. The complete training process consists of the following steps:

  1. Prepare datasets that contain the train/validation/test examples.
  2. Optionally, set up iterators for the datasets.
  3. Write a training loop that performs the following operations in each iteration:
    1. Retrieve batches of examples from the training dataset.
    2. Feed the batches into the model.
    3. Run the forward pass on the model to compute the loss.
    4. Run the backward pass on the model to compute the gradients.
    5. Run the optimizer on the model to update the parameters.
    6. (Optional): Occasionally check the performance on a validation/test set.
[2]:
# Install Chainer and CuPy!

!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirement already satisfied: cupy-cuda80==4.0.0b3 from https://github.com/kmaehashi/chainer-colab/releases/download/2018-02-06/cupy_cuda80-4.0.0b3-cp36-cp36m-linux_x86_64.whl in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: chainer==4.0.0b3 in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from protobuf>=3.0.0->chainer==4.0.0b3)

1. Prepare the dataset

Chainer contains some built-in functions that can be used to download and return Chainer-formatted versions of popular datasets used by the ML and deep learning communities. In this example, we will use the built-in function that retrieves the MNIST dataset.

[3]:
from chainer.datasets import mnist

# Download the MNIST data if you haven't downloaded it yet
train, test = mnist.get_mnist(withlabel=True, ndim=1)

# Set up matplotlib so that we can display plots inline in this notebook
%matplotlib inline
import matplotlib.pyplot as plt

# Display an example from the MNIST dataset.
# `x` contains the input image array and `t` contains the target class
# label as an integer.
x, t = train[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)
/usr/local/lib/python3.6/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
_images/notebook_hands_on_chainer_begginers_hands_on_01_Write_the_training_loop_3_1.png
label: 5

2. Create the dataset iterators

Although this is an optional step, it can often be convenient to use iterators that operate on a dataset and return a certain number of examples (often called a “mini-batch”) at a time. The number of examples that is returned at a time is called the “batch size” or “mini-batch size.” Chainer already has an Iterator class and some subclasses that can be used for this purpose and it is straightforward for users to write their own as well.

We will use the SerialIterator subclass of Iterator in this example. The SerialIterator can either return the examples in the same order that they appear in the dataset (that is, in sequential order) or can shuffle the examples so that they are returned in a random order.

An Iterator can return a new minibatch by calling its ‘next()’ method. An Iterator also has properties that help manage the training, such as ‘epoch’ (how many times we have gone through the entire dataset) and ‘is_new_epoch’ (whether the current iteration is the first iteration of a new epoch).

[ ]:
from chainer import iterators

# Choose the minibatch size.
batchsize = 128

train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize,
                                     repeat=False, shuffle=False)
Details about SerialIterator
  • SerialIterator is a built-in subclass of Iterator that can be used to retrieve a dataset in either sequential or shuffled order.
  • The Iterator initializer takes two arguments: the dataset object and a batch size.
  • When data need to be used repeatedly for training, set the ‘repeat’ argument to ‘True’ (the default). When the data only need to be iterated over once, for example for evaluation, set ‘repeat’ to ‘False’.
  • When you want to shuffle the training dataset for every epoch, set the ‘shuffle’ argument to ‘True’.

In the example above, we set ‘batchsize = 128’; ‘train_iter’ is the Iterator for the training dataset, and ‘test_iter’ is the Iterator for the test dataset. Each call to these iterators will therefore return a bundle of 128 image examples.
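To make this concrete, here is a small sketch (not part of the original notebook) that pulls a single minibatch from the training iterator and inspects it; the iterator is reset afterwards so the training loop later is unaffected:

batch = train_iter.next()   # a list of 128 (image, label) tuples
x0, t0 = batch[0]
print(len(batch))           # 128
print(x0.shape, t0)         # (784,) and an integer class label
print(train_iter.epoch, train_iter.is_new_epoch)

# Rewind so training later starts from the beginning of the data.
train_iter.reset()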

3. Define the model

Now let’s define a neural network that we will train to classify the MNIST images. For simplicity, we will use a fully-connected network with three layers. We will set each hidden layer to have 100 units and set the output layer to have 10 units, corresponding to the 10 class labels for the MNIST digits 0-9.

We first briefly explain Link, Function, Chain, and Variable which are the basic components used for defining and running a model in Chainer.

Chain
  • Chain is a class that can hold multiple links and/or functions. It is a subclass of Link and so it is also a Link.
  • This means that a chain can contain parameters, which are the parameters of any links that it deeply contains.
  • In this way, Chain allows us to construct models with a potentially deep hierarchy of functions and links.
  • It is often convenient to use a single chain that contains all of the layers (other chains, links, and functions) of the model. This is because we will need to optimize the model’s parameters during training, and if all of the parameters are contained by a single chain, it turns out to be straightforward to pass these parameters into an optimizer (which we describe in more detail below).
Variable

In Chainer, both the activations (that is, the inputs and outputs of functions and links) and the model parameters are instances of the Variable class. A Variable holds two arrays: a data array that contains the values that are read/written during the forward pass (or the parameter values), and a grad array that contains the corresponding gradients that are computed during the backward pass.

A Variable can potentially contain two types of arrays as well, depending on whether the array resides in CPU or GPU memory. By default, the CPU is used and these will be NumPy arrays. However, it is possible to move or create these arrays on the GPU as well, in which case they will be CuPy arrays. Fortunately, CuPy uses an API that is nearly identical to NumPy. This is convenient because in addition to making it easier for users to learn (there is almost nothing new to learn if you are already familiar with NumPy), it often allows us to reuse the same code for both NumPy and CuPy arrays.
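As a small illustrative sketch (not part of the original notebook), the following shows a Variable’s data and grad arrays on the CPU; calling to_gpu() on the variable would turn the underlying array into a CuPy array instead:

import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.array([[1., 2., 3.]], dtype=np.float32))
y = F.sum(x * x)       # forward pass: y = 1 + 4 + 9

y.backward()           # backward pass fills in the gradients
print(x.data)          # the data array: [[1. 2. 3.]]
print(x.grad)          # the grad array: [[2. 4. 6.]]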

Create our model as a subclass of ‘Chain’

We can create our model by writing a new subclass of Chain. The two main steps are:

  1. Any links (possibly also including other chains) that we wish to call during the forward computation of our chain must first be registered in the chain’s __init__ method. After __init__ has been called, these links will then be accessible as attributes of our chain object. This means that we also need to provide the attribute name that we want to use for each link that is supplied. We do this by assigning each link object to an attribute name inside the init_scope() context of __init__, as we will do in the MLP chain below.
  2. We need to define a __call__ method that allows our chain to be called like a function. This method takes one or more Variable objects as input (that is, the input activations) and returns one or more Variable objects. This method executes the forward pass of the model by calling any of the links that we supplied to __init__ earlier as well as any functions.

Note that the links only need to be supplied to __init__, not __call__. This is because they contain parameters. Since functions do not contain any parameters, they can be called in __call__ without having to supply them to the chain beforehand. For example, we can use a function such as F.relu by simply calling it in __call__ but a link such as L.Linear would need to first be supplied to the chain’s __init__ in order to call it in __call__.

If we decide that we want to call a link in a chain after __init__ has already been called, we can use the add_link method of Chain to add a new link at any time.
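For example, a minimal sketch of add_link usage (a hypothetical chain, not part of the original notebook):

import chainer
import chainer.links as L

chain = chainer.Chain()
chain.add_link('extra', L.Linear(None, 10))  # register a link after __init__
print(chain.extra)  # the new link is now accessible as an attribute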

In Chainer, the Python code that implements the forward computation itself represents the model. In other words, we can conceptually think of the computation graph for our model being constructed dynamically as this forward computation code executes. This allows Chainer to describe networks in which different computations can be performed in each iteration, such as branched networks, intuitively and with a high degree of flexibility. This is the key feature of Chainer that we call Define-by-Run.
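As a toy sketch of Define-by-Run (a hypothetical chain, not part of the original notebook), ordinary Python control flow can change the graph that is built on each forward pass:

import chainer
import chainer.functions as F
import chainer.links as L

class BranchingNet(chainer.Chain):
    def __init__(self):
        super(BranchingNet, self).__init__()
        with self.init_scope():
            self.branch_a = L.Linear(None, 10)
            self.branch_b = L.Linear(None, 10)

    def __call__(self, x):
        # The branch taken, and hence the computational graph, is decided
        # at run time by plain Python control flow.
        if x.shape[0] > 1:
            return F.relu(self.branch_a(x))
        return F.relu(self.branch_b(x))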

How to run a model on GPU
  • The Link and Chain classes have a to_gpu method that takes a GPU id argument specifying which GPU to use. This method sends all of the model parameters to GPU memory.
  • By default, the CPU is used.
[ ]:
import chainer
import chainer.links as L
import chainer.functions as F

class MLP(chainer.Chain):

    def __init__(self, n_mid_units=100, n_out=10):
        # register the links (layers with parameters) inside init_scope
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1=L.Linear(None, n_mid_units)
            self.l2=L.Linear(None, n_mid_units)
            self.l3=L.Linear(None, n_out)

    def __call__(self, x):
        # describe the forward pass, given x (input data)
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

gpu_id = 0  # change to -1 if not using GPU

model = MLP()
if gpu_id >= 0:
    model.to_gpu(gpu_id)
NOTE

The L.Linear class is a link that represents a fully connected layer. When ‘None’ is passed as the first argument, the number of necessary input units (n_input), and hence the sizes of the weight and bias parameters, are automatically determined at runtime during the first forward pass. We call this feature parameter shape placeholder. This can be a very helpful feature when defining deep neural network models, since it would often be tedious to manually determine these input sizes.
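A small sketch of this behaviour (not part of the original notebook; assumes a NumPy input on the CPU): the weight of a Linear link created with None is only allocated on the first forward pass.

import numpy as np
import chainer.links as L

layer = L.Linear(None, 100)
print(layer.W.data)      # None: the weight array is not allocated yet

x = np.zeros((1, 784), dtype=np.float32)
layer(x)                 # the first forward pass fixes the input size
print(layer.W.shape)     # (100, 784)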

As mentioned previously, a Link can contain multiple parameter arrays. For example, the L.Linear link contains two parameter arrays: the weights W and bias b. Recall that for a given link or chain, such as the MLP chain above, the links it contains can be accessed as attributes (or properties). The parameters of a link can also be accessed as attributes. For example, the following code shows how to access the bias parameter of layer l1:

[6]:
print('The shape of the bias of the first layer, l1, in the model:', model.l1.b.shape)
print('The values of the bias of the first layer in the model after initialization:', model.l1.b.data)
The shape of the bias of the first layer, l1, in the model: (100,)
The values of the bias of the first layer in the model after initialization: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]

4. Select an optimization algorithm

Chainer provides a wide variety of optimization algorithms that can be used to optimize the model parameters during training. They are located in the chainer.optimizers module.

Here, we are going to use the basic stochastic gradient descent (SGD) method, which is implemented by optimizers.SGD. The model (recall that it is a Chain object) we created is passed to the optimizer object by providing the model as an argument to the optimizer’s ‘setup’ method. In this way, Optimizer can automatically find the model parameters to be optimized.

You can easily try out other optimizers as well. Please test and observe the results of various optimizers. For example, you could change ‘SGD’ in ‘chainer.optimizers.SGD’ to ‘MomentumSGD’, ‘RMSprop’, ‘Adam’, etc., and run your training loop.

[ ]:
from chainer import optimizers

# Choose an optimizer algorithm
optimizer = optimizers.SGD(lr=0.01)
# Give the optimizer a reference to the model so that it
# can locate the model's parameters.
optimizer.setup(model)
NOTE

Observe that above, we set lr to 0.01 in the SGD constructor. This value is known as the “learning rate”, one of the most important hyperparameters that need to be adjusted in order to obtain the best performance. The various optimizers may each have different hyperparameters, so be sure to check the documentation for the details.
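For instance, a hedged sketch of swapping optimizers (the hyperparameter names below are the ones these optimizers expose in Chainer; check the documentation before relying on them):

from chainer import optimizers

optimizer = optimizers.MomentumSGD(lr=0.01, momentum=0.9)  # SGD with momentum
# optimizer = optimizers.Adam(alpha=0.001)  # or Adam; alpha is its base learning rate
optimizer.setup(model)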

5. Write the training loop

We now show how to write the training loop. Since we are working on a digit classification problem, we will use softmax_cross_entropy as the loss function for the optimizer to minimize. For other types of problems, such as regression models, other loss functions might be more appropriate. See the Chainer documentation for detailed information on the various loss functions that are available.

Our training loop will be structured as follows. We will first get a mini-batch of examples from the training dataset. We will then feed the batch into our model by calling our model (a Chain object) like a function. This will execute the forward-pass code that we wrote for the chain’s __call__ method above. This will cause the model to output class label predictions that we supply to the loss function along with the true (that is, target) values. The loss function will output the loss as a Variable object. We then clear any previous gradients and perform the backward pass by calling the backward method on the loss variable, which computes the parameter gradients. We need to clear the gradients first because the backward method accumulates gradients instead of overwriting the previous values. Since the optimizer was already given a reference to the model, it already has access to the parameters and the newly-computed gradients, and so we can now call the update method of the optimizer, which will update the model parameters.

At this point you might be wondering how calling backward on the loss variable could possibly compute the gradients for all of the model parameters. This works as follows. First recall that all activation and parameter arrays in the model are instances of Variable. During the forward pass, as each function is called on its inputs, we save references in each output variable that refer to the function that created it and its input variables. In this way, by the time the final loss variable is computed, it actually contains backward references that lead all the way back to the input variables of the model. That is, the loss variable contains a representation of the entire computational graph of the model, which is recomputed each time the forward pass is performed. By following these backward references from the loss variable, each function has a backward method that gets called to compute any parameter gradients. Thus, by the time the end of the backward graph is reached (at the input variables of the model), all parameter gradients have been computed.
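Here is a toy sketch of that mechanism (not part of the original notebook; the shapes and values are arbitrary): calling backward on a scalar loss fills in the grad arrays of the parameters that produced it, and cleargrads resets them before the next iteration.

import numpy as np
import chainer.functions as F
import chainer.links as L

layer = L.Linear(3, 2)
x = np.array([[1., 2., 3.]], dtype=np.float32)
t = np.array([1], dtype=np.int32)

loss = F.softmax_cross_entropy(layer(x), t)  # forward pass builds the graph

layer.cleargrads()      # clear old gradients; backward() accumulates into .grad
loss.backward()         # walk the graph back to the parameters
print(layer.W.grad)     # gradient array with the same shape as W, i.e. (2, 3)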

Thus, there are four steps in a single training loop iteration, as shown below.

  1. Obtain and pass a mini-batch of example images into the model and obtain the output digit predictions prediction_train.
  2. Compute the loss function, giving it the predicted labels from the output of our model and also the true “target” label values.
  3. Clear any previous gradients and call the backward method of ‘Variable’ to compute the parameter gradients for the model.
  4. Call the ‘update’ method of Optimizer, which performs one optimization step and updates all of the model parameters.

In addition to the above steps, it is good to occasionally check the performance of our model on a validation and/or test set. This allows us to observe how well it can generalize to new data and also check whether it is overfitting. The code below checks the performance on the test set at the end of each epoch. The code has the same structure as the training code except that no backpropagation is performed and we also compute the accuracy on the test set using the F.accuracy function.

We can write the training loop code as follows:

[8]:
import numpy as np
from chainer.dataset import concat_examples
from chainer.cuda import to_cpu

max_epoch = 10

while train_iter.epoch < max_epoch:

    # ---------- One iteration of the training loop ----------
    train_batch = train_iter.next()
    image_train, target_train = concat_examples(train_batch, gpu_id)

    # calculate the prediction of the model
    prediction_train = model(image_train)

    # calculation of loss function, softmax_cross_entropy
    loss = F.softmax_cross_entropy(prediction_train, target_train)

    # calculate the gradients in the model
    model.cleargrads()
    loss.backward()

    # update the parameters of the model
    optimizer.update()
    # --------------- one training iteration ends here ---------------

    # Check if the generalization of the model is improving
    # by measuring the accuracy of prediction after every epoch

    if train_iter.is_new_epoch:  # after finishing each epoch

        # display the result of the loss function
        print('epoch:{:02d} train_loss:{:.04f} '.format(
            train_iter.epoch, float(to_cpu(loss.data))), end='')

        test_losses = []
        test_accuracies = []
        for test_batch in test_iter:
            image_test, target_test = concat_examples(test_batch, gpu_id)

            # forward the test data
            prediction_test = model(image_test)

            # calculate the loss function
            loss_test = F.softmax_cross_entropy(prediction_test, target_test)
            test_losses.append(to_cpu(loss_test.data))

            # calculate the accuracy
            accuracy = F.accuracy(prediction_test, target_test)
            accuracy.to_cpu()
            test_accuracies.append(accuracy.data)

        test_iter.reset()

        print('val_loss:{:.04f} val_accuracy:{:.04f}'.format(
            np.mean(test_losses), np.mean(test_accuracies)))
epoch:01 train_loss:0.9583 val_loss:0.7864 val_accuracy:0.8211
epoch:02 train_loss:0.4803 val_loss:0.4501 val_accuracy:0.8812
epoch:03 train_loss:0.3057 val_loss:0.3667 val_accuracy:0.8982
epoch:04 train_loss:0.2477 val_loss:0.3292 val_accuracy:0.9054
epoch:05 train_loss:0.2172 val_loss:0.3059 val_accuracy:0.9133
epoch:06 train_loss:0.2202 val_loss:0.2880 val_accuracy:0.9176
epoch:07 train_loss:0.3009 val_loss:0.2758 val_accuracy:0.9214
epoch:08 train_loss:0.3399 val_loss:0.2632 val_accuracy:0.9252
epoch:09 train_loss:0.3497 val_loss:0.2538 val_accuracy:0.9287
epoch:10 train_loss:0.1691 val_loss:0.2430 val_accuracy:0.9311

6. Save the trained model

Chainer provides two types of serializers that can be used to save and restore model state. One supports the HDF5 format and the other supports the NumPy NPZ format. For this example, we are going to use the NPZ format to save our model, since it is easy to use with NumPy without requiring any additional dependencies or libraries.

[9]:
from chainer import serializers

serializers.save_npz('my_mnist.model', model)

# check if the model is saved.
%ls -la my_mnist.model
-rw-r--r-- 1 root root 333954 Feb 23 08:13 my_mnist.model

7. Perform classification by restoring a previously trained model

We will now use our previously trained and saved MNIST model to classify a new image. In order to load a previously-trained model, we need to perform the following two steps:

  1. We must use the same model definition code that was used to create the previously-trained model. For our example, this is the MLP chain that we created earlier.
  2. We then overwrite any parameters in the newly-created model with the values that were saved earlier using the serializer. The serializers.load_npz function can be used to do this.

Now that the model has been restored, it can be used to predict image labels on new input images.

[ ]:
# Create the inference (evaluation) model with the same definition as before
infer_model = MLP()

# Load the saved parameters into the new inference model, overwriting its parameters
serializers.load_npz('my_mnist.model', infer_model)

# Send the model to the GPU with to_gpu
if gpu_id >= 0:
    infer_model.to_gpu(gpu_id)
[11]:
# Get a test image and label
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)
_images/notebook_hands_on_chainer_begginers_hands_on_01_Write_the_training_loop_20_0.png
label: 7
[12]:
from chainer.cuda import to_gpu

# Change the shape to that of a minibatch.
# In this example, the minibatch size is 1.
# Inference can be performed with any mini-batch size.

print(x.shape, end=' -> ')
x = x[None, ...]
print(x.shape)

# To compute on the GPU, send the data to the GPU as well.
if gpu_id >= 0:
    x = to_gpu(x, 0)

# forward computation of the model on x
y = infer_model(x)

# The result is given as a Variable; we can inspect its contents via the .data attribute.
y = y.data

# send the GPU result back to the CPU
y = to_cpu(y)

# Take the argmax to find the most probable label
pred_label = y.argmax(axis=1)

print('predicted label:', pred_label[0])
(784,) -> (1, 784)
predicted label: 7

Using the Trainer feature

Chainer has a feature called Trainer that can often be used to simplify the process of training and evaluating a model. This feature supports training a model in a way such that the user is not required to explicitly write the code for the training loop. For many types of models, including our MNIST model, Trainer allows us to write our training and evaluation code much more concisely.

Chainer contains several extensions that can be used with Trainer to visualize your results, evaluate your model, and store and manage log files more easily.

This example will show how to use the Trainer feature to train a fully-connected feed-forward neural network on the MNIST dataset.

[1]:
# Install Chainer and CuPy!

!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirement already satisfied: cupy-cuda80==4.0.0b3 from https://github.com/kmaehashi/chainer-colab/releases/download/2018-02-06/cupy_cuda80-4.0.0b3-cp36-cp36m-linux_x86_64.whl in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: chainer==4.0.0b3 in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from protobuf>=3.0.0->chainer==4.0.0b3)

1. Prepare the dataset

Load the MNIST dataset, as in the previous notebook.

[2]:
from chainer.datasets import mnist

train, test = mnist.get_mnist()
/usr/local/lib/python3.6/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
Downloading from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz...

2. Prepare the dataset iterators

[ ]:
from chainer import iterators

batchsize = 128

train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize, False, False)

3. Prepare the Model

We use the same model as before.

[ ]:
import chainer
import chainer.links as L
import chainer.functions as F

class MLP(chainer.Chain):

    def __init__(self, n_mid_units=100, n_out=10):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1=L.Linear(None, n_mid_units)
            self.l2=L.Linear(None, n_mid_units)
            self.l3=L.Linear(None, n_out)

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

gpu_id = 0  # Set to -1 if you don't have a GPU

model = MLP()
if gpu_id >= 0:
    model.to_gpu(gpu_id)

4. Prepare the Updater

As mentioned above, the trainer object (an instance of Trainer) actually implements the training loop for us. However, before we can use it, we must first prepare another object that will actually perform one iteration of training operations and pass it to the trainer. This other object will be a subclass of Updater, such as StandardUpdater, or a custom subclass. It will therefore need to hold all of the components that are needed in order to perform one iteration of training, such as a dataset iterator and an optimizer that also holds the model.

  • Updater
    • Iterator
      • Dataset
    • Optimizer
      • Model

Since we can also write customized updaters, they can perform any kinds of computations. However, for this example we will use the StandardUpdater which performs the following steps each time it is called:

  1. Retrieve a batch of data from a dataset using the iterator that we supplied.
  2. Feed the batch into the model and calculate the loss. Since we supply the model to the optimizer first, the updater can access the model through the optimizer.
  3. Update the model parameters using the optimizer that we supplied.

Once the updater has been set up, we can pass it to the trainer and start the training process. The trainer will then automatically create the training loop and call the updater once per iteration.

Now let’s create the updater object.

[ ]:
from chainer import optimizers
from chainer import training

max_epoch = 10

# Note: L.Classifier is actually a chain that wraps 'model' to add a loss
# function.
# Since we do not specify a loss function here, the default
# 'softmax_cross_entropy' is used.
# The output of this modified 'model' will now be a loss value instead
# of a class label prediction.
model = L.Classifier(model)

# Send the model to the GPU.
if gpu_id >= 0:
    model.to_gpu(gpu_id)

# Select the optimization method
optimizer = optimizers.SGD()
# Give the optimizer a reference to the model
optimizer.setup(model)

# Get an Updater that uses the Iterator and Optimizer
updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
NOTE

The L.Classifier object is actually a chain that places our model in its predictor attribute. This modifies model so that when it is called, it will now take both the input image batch and a class label batch as inputs to its __call__ method and output a Variable that contains the loss value. The loss function to use can be optionally set but we use the default which is softmax_cross_entropy.

Specifically, when __call__ is called on our model (which is now a L.Classifier object), the image batch is supplied to its predictor attribute, which contains our MLP model. The output of the MLP (which consists of class label predictions) is supplied to the loss function along with the target class labels to compute the loss value, which is then returned as a Variable object.

Note that we use StandardUpdater, which is the simplest type of Updater in Chainer. There are also other types of updaters available, such as ParallelUpdater (which is intended for multiple GPUs) and you can also write a custom updater if you wish.
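As a hedged sketch of this behaviour (the array shapes are illustrative only, and the model is assumed to still be on the CPU; use CuPy arrays if it has been moved to the GPU), the wrapped model can be called directly with an image batch and a label batch and returns the loss as a Variable:

import numpy as np

images = np.zeros((4, 784), dtype=np.float32)  # a dummy batch of 4 images
labels = np.zeros((4,), dtype=np.int32)        # dummy target labels

loss = model(images, labels)  # 'model' is the L.Classifier-wrapped MLP
print(loss.shape)             # (): a scalar loss Variable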

5. Setup Trainer

Now that we have set up an updater, we can pass it to a Trainer object. You can optionally pass a stop_trigger as the second trainer argument, a tuple (length, unit), to tell the trainer to stop automatically at the indicated timing. The length is given as an arbitrary integer and unit is given as a string, which currently must be either ‘epoch’ or ‘iteration’. Without setting stop_trigger, the training will not stop automatically.

[ ]:
# Send Updater to Trainer
trainer = training.Trainer(updater, (max_epoch, 'epoch'),
                           out='mnist_result')

The out argument of the trainer sets the output directory in which to save the log files and the image files of graphs showing the progress of the loss, accuracy, etc. over time.

Next, we will explain how to display/save those outputs by using extensions.

6. Add extensions to trainer

There are several optional trainer extensions that provide the following capabilities:

  • Save log files automatically (LogReport)
  • Display the training information to the terminal periodically (PrintReport)
  • Visualize the loss progress by periodically plotting a graph and saving its image (PlotReport)
  • Automatically serialize the model or the state of Optimizer periodically (snapshot/snapshot_object)
  • Display Progress Bar to show the progress of training (ProgressBar)
  • Save the model architecture in Graphviz dot format (dump_graph)

Now you can utilize the wide variety of tools shown above right away! To do so, simply pass the desired extension objects to the Trainer object by using the extend() method of Trainer.

[ ]:
from chainer.training import extensions

trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
trainer.extend(extensions.snapshot(filename='snapshot_epoch-{.updater.epoch}'))
trainer.extend(extensions.snapshot_object(model.predictor, filename='model_epoch-{.updater.epoch}'))
trainer.extend(extensions.Evaluator(test_iter, model, device=gpu_id))
trainer.extend(extensions.dump_graph('main/loss'))
LogReport

LogReport collects the loss and accuracy automatically every epoch or iteration and stores them in a log file in the directory assigned by the out argument of Trainer.

PrintReport

PrintReport outputs the aggregated results to standard output. The entries to display are given as the list passed to it.

PlotReport

PlotReport plots the values specified by its arguments, draws the graph, and saves the image under the name given by its file_name argument in the trainer’s output directory.

snapshot

The snapshot extension saves the Trainer object at the designated timing (default: every epoch) in the directory assigned by the out argument of Trainer. The Trainer object, as mentioned before, has an Updater which contains an Optimizer and a model inside. Therefore, as long as you have the snapshot file, you can use it to resume training or to make inferences using the previously trained model later.
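As a hedged sketch (the snapshot file name follows the pattern used above), resuming from a snapshot would look roughly like this, assuming the trainer’s stop trigger has not yet been reached:

from chainer import serializers

# Restore the whole training state (model, optimizer, iterator positions, ...)
serializers.load_npz('mnist_result/snapshot_epoch-10', trainer)
trainer.run()   # continue training from where the snapshot was taken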

snapshot_object

When you save the whole Trainer object, it can be tedious to later retrieve only the model from inside it. By using snapshot_object, you can save a particular object (in this case, the model wrapped by Classifier) in addition to the Trainer object. Classifier is a Chain object that keeps the Chain given as its first argument as a property called predictor and calculates the loss. Classifier doesn’t have any parameters other than those inside its predictor model, and so we only save model.predictor.

Evaluator

The Iterator that uses the evaluation dataset (such as a validation or test dataset) and the model object are passed to Evaluator. The Evaluator evaluates the model using the given dataset at the specified timing interval.

dump_graph

This method saves the computational graph of the model. The graph is saved in Graphviz dot format. The output location (directory) to save the graph is set by the out argument of Trainer.


The extensions have many options besides those mentioned here. For instance, by using the trigger option, you can set the timing for activating each extension more flexibly. Please take a look at the official documentation for more details: Trainer extensions
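For example, a hedged sketch of the trigger option (the interval here is arbitrary):

# Take a snapshot only every 5 epochs instead of every epoch.
trainer.extend(extensions.snapshot(filename='snapshot_epoch-{.updater.epoch}'),
               trigger=(5, 'epoch'))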

7. Start Training

To start training, just call the run method of the Trainer object.

[8]:
trainer.run()
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           1.59174     0.605827       0.830716              0.811017                  11.9659
2           0.620236    0.8436         0.468635              0.878362                  15.3477
3           0.434218    0.883229       0.375343              0.897943                  18.6432
4           0.370276    0.896985       0.331921              0.906843                  21.9966
5           0.336392    0.905517       0.306286              0.914062                  25.3211
6           0.313135    0.911114       0.290228              0.917623                  28.6711
7           0.296385    0.915695       0.277794              0.921084                  32.0399
8           0.282124    0.919338       0.264538              0.924248                  35.3713
9           0.269923    0.922858       0.257158              0.923853                  38.7191
10          0.25937     0.925373       0.245689              0.927907                  42.0867

Let’s see the graph of loss saved in the mnist_result directory.

[9]:
from IPython.display import Image
Image(filename='mnist_result/loss.png')
[9]:
_images/notebook_hands_on_chainer_begginers_hands_on_02_Try_Trainer_class_20_0.png

How about the accuracy?

[10]:
Image(filename='mnist_result/accuracy.png')
[10]:
_images/notebook_hands_on_chainer_begginers_hands_on_02_Try_Trainer_class_22_0.png

Furthermore, let’s use Graphviz to visualize the computational graph saved by the dump_graph extension.

[13]:
!apt-get install graphviz -y
!dot -Tpng mnist_result/cg.dot -o mnist_result/cg.png
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  fontconfig libcairo2 libcdt5 libcgraph6 libdatrie1 libgd3 libgraphite2-3
  libgvc6 libgvpr2 libharfbuzz0b libice6 libjbig0 libltdl7 libpango-1.0-0
  libpangocairo-1.0-0 libpangoft2-1.0-0 libpathplan4 libpixman-1-0 libsm6
  libthai-data libthai0 libtiff5 libwebp6 libxaw7 libxcb-render0 libxcb-shm0
  libxext6 libxmu6 libxpm4 libxt6 x11-common
Suggested packages:
  gsfonts graphviz-doc libgd-tools
The following NEW packages will be installed:
  fontconfig graphviz libcairo2 libcdt5 libcgraph6 libdatrie1 libgd3
  libgraphite2-3 libgvc6 libgvpr2 libharfbuzz0b libice6 libjbig0 libltdl7
  libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libpathplan4
  libpixman-1-0 libsm6 libthai-data libthai0 libtiff5 libwebp6 libxaw7
  libxcb-render0 libxcb-shm0 libxext6 libxmu6 libxpm4 libxt6 x11-common
0 upgraded, 32 newly installed, 0 to remove and 1 not upgraded.
Need to get 4,228 kB of archives.
After this operation, 21.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu artful/main amd64 libxext6 amd64 2:1.3.3-1 [29.4 kB]
Get:2 http://archive.ubuntu.com/ubuntu artful/main amd64 fontconfig amd64 2.11.94-0ubuntu2 [177 kB]
Get:3 http://archive.ubuntu.com/ubuntu artful/main amd64 x11-common all 1:7.7+19ubuntu3 [22.0 kB]
Get:4 http://archive.ubuntu.com/ubuntu artful/main amd64 libice6 amd64 2:1.0.9-2 [40.2 kB]
Get:5 http://archive.ubuntu.com/ubuntu artful/main amd64 libsm6 amd64 2:1.2.2-1 [15.8 kB]
Get:6 http://archive.ubuntu.com/ubuntu artful/main amd64 libjbig0 amd64 2.1-3.1 [26.6 kB]
Get:7 http://archive.ubuntu.com/ubuntu artful/main amd64 libcdt5 amd64 2.38.0-16ubuntu2 [19.5 kB]
Get:8 http://archive.ubuntu.com/ubuntu artful/main amd64 libcgraph6 amd64 2.38.0-16ubuntu2 [40.0 kB]
Get:9 http://archive.ubuntu.com/ubuntu artful/main amd64 libtiff5 amd64 4.0.8-5 [150 kB]
Get:10 http://archive.ubuntu.com/ubuntu artful/main amd64 libwebp6 amd64 0.6.0-3 [181 kB]
Get:11 http://archive.ubuntu.com/ubuntu artful/main amd64 libxpm4 amd64 1:3.5.12-1 [34.0 kB]
Get:12 http://archive.ubuntu.com/ubuntu artful/main amd64 libgd3 amd64 2.2.5-3 [119 kB]
Get:13 http://archive.ubuntu.com/ubuntu artful/main amd64 libpixman-1-0 amd64 0.34.0-1 [230 kB]
Get:14 http://archive.ubuntu.com/ubuntu artful/main amd64 libxcb-render0 amd64 1.12-1ubuntu1 [14.8 kB]
Get:15 http://archive.ubuntu.com/ubuntu artful/main amd64 libxcb-shm0 amd64 1.12-1ubuntu1 [5,482 B]
Get:16 http://archive.ubuntu.com/ubuntu artful/main amd64 libcairo2 amd64 1.14.10-1ubuntu1 [558 kB]
Get:17 http://archive.ubuntu.com/ubuntu artful/main amd64 libltdl7 amd64 2.4.6-2 [38.8 kB]
Get:18 http://archive.ubuntu.com/ubuntu artful/main amd64 libthai-data all 0.1.26-3 [132 kB]
Get:19 http://archive.ubuntu.com/ubuntu artful/main amd64 libdatrie1 amd64 0.2.10-5 [17.6 kB]
Get:20 http://archive.ubuntu.com/ubuntu artful/main amd64 libthai0 amd64 0.1.26-3 [17.7 kB]
Get:21 http://archive.ubuntu.com/ubuntu artful/main amd64 libpango-1.0-0 amd64 1.40.12-1 [152 kB]
Get:22 http://archive.ubuntu.com/ubuntu artful/main amd64 libgraphite2-3 amd64 1.3.10-2 [78.3 kB]
Get:23 http://archive.ubuntu.com/ubuntu artful/main amd64 libharfbuzz0b amd64 1.4.2-1 [211 kB]
Get:24 http://archive.ubuntu.com/ubuntu artful/main amd64 libpangoft2-1.0-0 amd64 1.40.12-1 [33.2 kB]
Get:25 http://archive.ubuntu.com/ubuntu artful/main amd64 libpangocairo-1.0-0 amd64 1.40.12-1 [20.8 kB]
Get:26 http://archive.ubuntu.com/ubuntu artful/main amd64 libpathplan4 amd64 2.38.0-16ubuntu2 [22.6 kB]
Get:27 http://archive.ubuntu.com/ubuntu artful/main amd64 libgvc6 amd64 2.38.0-16ubuntu2 [587 kB]
Get:28 http://archive.ubuntu.com/ubuntu artful/main amd64 libgvpr2 amd64 2.38.0-16ubuntu2 [167 kB]
Get:29 http://archive.ubuntu.com/ubuntu artful/main amd64 libxt6 amd64 1:1.1.5-1 [160 kB]
Get:30 http://archive.ubuntu.com/ubuntu artful/main amd64 libxmu6 amd64 2:1.1.2-2 [46.0 kB]
Get:31 http://archive.ubuntu.com/ubuntu artful/main amd64 libxaw7 amd64 2:1.0.13-1 [173 kB]
Get:32 http://archive.ubuntu.com/ubuntu artful/main amd64 graphviz amd64 2.38.0-16ubuntu2 [710 kB]
Fetched 4,228 kB in 4s (954 kB/s)
Extracting templates from packages: 100%
Selecting previously unselected package libxext6:amd64.
(Reading database ... 16692 files and directories currently installed.)
Preparing to unpack .../00-libxext6_2%3a1.3.3-1_amd64.deb ...
Unpacking libxext6:amd64 (2:1.3.3-1) ...
Selecting previously unselected package fontconfig.
Preparing to unpack .../01-fontconfig_2.11.94-0ubuntu2_amd64.deb ...
Unpacking fontconfig (2.11.94-0ubuntu2) ...
Selecting previously unselected package x11-common.
Preparing to unpack .../02-x11-common_1%3a7.7+19ubuntu3_all.deb ...
Unpacking x11-common (1:7.7+19ubuntu3) ...
Selecting previously unselected package libice6:amd64.
Preparing to unpack .../03-libice6_2%3a1.0.9-2_amd64.deb ...
Unpacking libice6:amd64 (2:1.0.9-2) ...
Selecting previously unselected package libsm6:amd64.
Preparing to unpack .../04-libsm6_2%3a1.2.2-1_amd64.deb ...
Unpacking libsm6:amd64 (2:1.2.2-1) ...
Selecting previously unselected package libjbig0:amd64.
Preparing to unpack .../05-libjbig0_2.1-3.1_amd64.deb ...
Unpacking libjbig0:amd64 (2.1-3.1) ...
Selecting previously unselected package libcdt5.
Preparing to unpack .../06-libcdt5_2.38.0-16ubuntu2_amd64.deb ...
Unpacking libcdt5 (2.38.0-16ubuntu2) ...
Selecting previously unselected package libcgraph6.
Preparing to unpack .../07-libcgraph6_2.38.0-16ubuntu2_amd64.deb ...
Unpacking libcgraph6 (2.38.0-16ubuntu2) ...
Selecting previously unselected package libtiff5:amd64.
Preparing to unpack .../08-libtiff5_4.0.8-5_amd64.deb ...
Unpacking libtiff5:amd64 (4.0.8-5) ...
Selecting previously unselected package libwebp6:amd64.
Preparing to unpack .../09-libwebp6_0.6.0-3_amd64.deb ...
Unpacking libwebp6:amd64 (0.6.0-3) ...
Selecting previously unselected package libxpm4:amd64.
Preparing to unpack .../10-libxpm4_1%3a3.5.12-1_amd64.deb ...
Unpacking libxpm4:amd64 (1:3.5.12-1) ...
Selecting previously unselected package libgd3:amd64.
Preparing to unpack .../11-libgd3_2.2.5-3_amd64.deb ...
Unpacking libgd3:amd64 (2.2.5-3) ...
Selecting previously unselected package libpixman-1-0:amd64.
Preparing to unpack .../12-libpixman-1-0_0.34.0-1_amd64.deb ...
Unpacking libpixman-1-0:amd64 (0.34.0-1) ...
Selecting previously unselected package libxcb-render0:amd64.
Preparing to unpack .../13-libxcb-render0_1.12-1ubuntu1_amd64.deb ...
Unpacking libxcb-render0:amd64 (1.12-1ubuntu1) ...
Selecting previously unselected package libxcb-shm0:amd64.
Preparing to unpack .../14-libxcb-shm0_1.12-1ubuntu1_amd64.deb ...
Unpacking libxcb-shm0:amd64 (1.12-1ubuntu1) ...
Selecting previously unselected package libcairo2:amd64.
Preparing to unpack .../15-libcairo2_1.14.10-1ubuntu1_amd64.deb ...
Unpacking libcairo2:amd64 (1.14.10-1ubuntu1) ...
Selecting previously unselected package libltdl7:amd64.
Preparing to unpack .../16-libltdl7_2.4.6-2_amd64.deb ...
Unpacking libltdl7:amd64 (2.4.6-2) ...
Selecting previously unselected package libthai-data.
Preparing to unpack .../17-libthai-data_0.1.26-3_all.deb ...
Unpacking libthai-data (0.1.26-3) ...
Selecting previously unselected package libdatrie1:amd64.
Preparing to unpack .../18-libdatrie1_0.2.10-5_amd64.deb ...
Unpacking libdatrie1:amd64 (0.2.10-5) ...
Selecting previously unselected package libthai0:amd64.
Preparing to unpack .../19-libthai0_0.1.26-3_amd64.deb ...
Unpacking libthai0:amd64 (0.1.26-3) ...
Selecting previously unselected package libpango-1.0-0:amd64.
Preparing to unpack .../20-libpango-1.0-0_1.40.12-1_amd64.deb ...
Unpacking libpango-1.0-0:amd64 (1.40.12-1) ...
Selecting previously unselected package libgraphite2-3:amd64.
Preparing to unpack .../21-libgraphite2-3_1.3.10-2_amd64.deb ...
Unpacking libgraphite2-3:amd64 (1.3.10-2) ...
Selecting previously unselected package libharfbuzz0b:amd64.
Preparing to unpack .../22-libharfbuzz0b_1.4.2-1_amd64.deb ...
Unpacking libharfbuzz0b:amd64 (1.4.2-1) ...
Selecting previously unselected package libpangoft2-1.0-0:amd64.
Preparing to unpack .../23-libpangoft2-1.0-0_1.40.12-1_amd64.deb ...
Unpacking libpangoft2-1.0-0:amd64 (1.40.12-1) ...
Selecting previously unselected package libpangocairo-1.0-0:amd64.
Preparing to unpack .../24-libpangocairo-1.0-0_1.40.12-1_amd64.deb ...
Unpacking libpangocairo-1.0-0:amd64 (1.40.12-1) ...
Selecting previously unselected package libpathplan4.
Preparing to unpack .../25-libpathplan4_2.38.0-16ubuntu2_amd64.deb ...
Unpacking libpathplan4 (2.38.0-16ubuntu2) ...
Selecting previously unselected package libgvc6.
Preparing to unpack .../26-libgvc6_2.38.0-16ubuntu2_amd64.deb ...
Unpacking libgvc6 (2.38.0-16ubuntu2) ...
Selecting previously unselected package libgvpr2.
Preparing to unpack .../27-libgvpr2_2.38.0-16ubuntu2_amd64.deb ...
Unpacking libgvpr2 (2.38.0-16ubuntu2) ...
Selecting previously unselected package libxt6:amd64.
Preparing to unpack .../28-libxt6_1%3a1.1.5-1_amd64.deb ...
Unpacking libxt6:amd64 (1:1.1.5-1) ...
Selecting previously unselected package libxmu6:amd64.
Preparing to unpack .../29-libxmu6_2%3a1.1.2-2_amd64.deb ...
Unpacking libxmu6:amd64 (2:1.1.2-2) ...
Selecting previously unselected package libxaw7:amd64.
Preparing to unpack .../30-libxaw7_2%3a1.0.13-1_amd64.deb ...
Unpacking libxaw7:amd64 (2:1.0.13-1) ...
Selecting previously unselected package graphviz.
Preparing to unpack .../31-graphviz_2.38.0-16ubuntu2_amd64.deb ...
Unpacking graphviz (2.38.0-16ubuntu2) ...
Setting up libpathplan4 (2.38.0-16ubuntu2) ...
Setting up libxcb-render0:amd64 (1.12-1ubuntu1) ...
Setting up libxext6:amd64 (2:1.3.3-1) ...
Setting up libjbig0:amd64 (2.1-3.1) ...
Setting up libdatrie1:amd64 (0.2.10-5) ...
Setting up libtiff5:amd64 (4.0.8-5) ...
Setting up libgraphite2-3:amd64 (1.3.10-2) ...
Setting up libpixman-1-0:amd64 (0.34.0-1) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
Setting up libltdl7:amd64 (2.4.6-2) ...
Setting up libxcb-shm0:amd64 (1.12-1ubuntu1) ...
Setting up libxpm4:amd64 (1:3.5.12-1) ...
Setting up libthai-data (0.1.26-3) ...
Setting up x11-common (1:7.7+19ubuntu3) ...
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up libcdt5 (2.38.0-16ubuntu2) ...
Setting up fontconfig (2.11.94-0ubuntu2) ...
Regenerating fonts cache... done.
Setting up libcgraph6 (2.38.0-16ubuntu2) ...
Setting up libwebp6:amd64 (0.6.0-3) ...
Setting up libcairo2:amd64 (1.14.10-1ubuntu1) ...
Setting up libgvpr2 (2.38.0-16ubuntu2) ...
Setting up libgd3:amd64 (2.2.5-3) ...
Setting up libharfbuzz0b:amd64 (1.4.2-1) ...
Setting up libthai0:amd64 (0.1.26-3) ...
Setting up libpango-1.0-0:amd64 (1.40.12-1) ...
Setting up libice6:amd64 (2:1.0.9-2) ...
Setting up libsm6:amd64 (2:1.2.2-1) ...
Setting up libpangoft2-1.0-0:amd64 (1.40.12-1) ...
Setting up libxt6:amd64 (1:1.1.5-1) ...
Setting up libpangocairo-1.0-0:amd64 (1.40.12-1) ...
Setting up libxmu6:amd64 (2:1.1.2-2) ...
Setting up libxaw7:amd64 (2:1.0.13-1) ...
Setting up libgvc6 (2.38.0-16ubuntu2) ...
Setting up graphviz (2.38.0-16ubuntu2) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
[14]:
Image(filename='mnist_result/cg.png')
[14]:
_images/notebook_hands_on_chainer_begginers_hands_on_02_Try_Trainer_class_25_0.png

From top to bottom, you can track how the data flow through the computation: which type of Function the data and parameters are passed to, and how the calculated loss is produced.

8. Evaluate a pre-trained model

[15]:
import numpy as np
from chainer import serializers
from chainer.cuda import to_gpu
from chainer.cuda import to_cpu

model = MLP()
serializers.load_npz('mnist_result/model_epoch-10', model)

%matplotlib inline
import matplotlib.pyplot as plt

x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)

if gpu_id >= 0:
    model.to_gpu(gpu_id)
    x = to_gpu(x[None, ...])
    y = model(x)
    y = to_cpu(y.data)
else:
    x = x[None, ...]
    y = model(x)
    y = y.data

print('predicted_label:', y.argmax(axis=1)[0])
_images/notebook_hands_on_chainer_begginers_hands_on_02_Try_Trainer_class_28_0.png
label: 7
predicted_label: 7

It executed successfully!

Creating and training convolutional neural networks

We will now improve upon our previous example by creating some more sophisticated image classifiers and using a more challenging dataset. Specifically, we will implement convolutional neural networks (CNNs) and train them using the CIFAR10 dataset, which consists of natural color images. This dataset uses 60000 small color images of size 32x32x3 (the 3 is for the RGB color channels) and 10 class labels. 50000 of these are used for training and the remaining 10000 are for the test set. There is also a CIFAR100 version that uses 100 class labels, but we will only use CIFAR10 here.

The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck (one sample image is shown for each class).
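As a small sketch (not part of the original notebook), CIFAR10 can be loaded and inspected just like MNIST:

from chainer.datasets import cifar

train, test = cifar.get_cifar10()
x, t = train[0]
print(x.shape)                 # (3, 32, 32): channels, height, width
print(t)                       # an integer class label in [0, 9]
print(len(train), len(test))   # 50000 10000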
[2]:
# Install Chainer and CuPy!

!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirement already satisfied: cupy-cuda80==4.0.0b3 from https://github.com/kmaehashi/chainer-colab/releases/download/2018-02-06/cupy_cuda80-4.0.0b3-cp36-cp36m-linux_x86_64.whl in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: chainer==4.0.0b3 in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from protobuf>=3.0.0->chainer==4.0.0b3)

1. Define the Model

As in the previous examples, we define our model as a subclass of Chain. Our CNN model will have three convolutional layers followed by two fully connected layers. Although this is still a fairly small CNN, it will significantly outperform a fully-connected model. After completing this notebook, you are encouraged to verify this yourself with an experiment, for example by training the MLP from the previous example on the same data.

Recall from the first hands-on example that we define a model as follows.

  1. Inside the initializer of the model chain class, register the layer objects (links) under their attribute names (here, by assigning them inside init_scope()).
  2. Define a __call__ method so that we can call the chain like a function. This method is used to implement the forward computation.
[3]:
import chainer
import chainer.functions as F
import chainer.links as L

class MyModel(chainer.Chain):

    def __init__(self, n_out):
        super(MyModel, self).__init__()
        with self.init_scope():
            self.conv1=L.Convolution2D(None, 32, 3, 3, 1)
            self.conv2=L.Convolution2D(32, 64, 3, 3, 1)
            self.conv3=L.Convolution2D(64, 128, 3, 3, 1)
            self.fc4=L.Linear(None, 1000)
            self.fc5=L.Linear(1000, n_out)

    def __call__(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = F.relu(self.fc4(h))
        h = self.fc5(h)
        return h
/usr/local/lib/python3.6/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')

2. Train the model

Let’s define a ‘train’ function that we can also use to train other models easily later on. This function takes a model object, trains the model to classify the 10 CIFAR10 classes, and finally returns the trained model.

We will use this train function to train the MyModel network defined above.

[4]:
from chainer.datasets import cifar
from chainer import iterators
from chainer import optimizers
from chainer import training
from chainer.training import extensions

def train(model_object, batchsize=64, gpu_id=0, max_epoch=20):

    # 1. Dataset
    train, test = cifar.get_cifar10()

    # 2. Iterator
    train_iter = iterators.SerialIterator(train, batchsize)
    test_iter = iterators.SerialIterator(test, batchsize, False, False)

    # 3. Model
    model = L.Classifier(model_object)
    if gpu_id >=0:
        model.to_gpu(gpu_id)

    # 4. Optimizer
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    # 5. Updater
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)

    # 6. Trainer
    trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='{}_cifar10_result'.format(model_object.__class__.__name__))

    # 7. Evaluator

    class TestModeEvaluator(extensions.Evaluator):

        def evaluate(self):
            # Run the evaluation with train mode disabled so that layers such
            # as dropout or batch normalization behave in test mode.
            with chainer.using_config('train', False):
                ret = super(TestModeEvaluator, self).evaluate()
            return ret

    trainer.extend(extensions.LogReport())
    trainer.extend(TestModeEvaluator(test_iter, model, device=gpu_id))
    trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
    trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
    trainer.run()
    del trainer

    return model

gpu_id = 0  # Set to -1 if you don't have a GPU

model = train(MyModel(10), gpu_id=gpu_id)
Downloading from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz...
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           1.52065     0.445912       1.30575               0.527568                  9.28
2           1.22656     0.559399       1.20226               0.571059                  18.0121
3           1.07178     0.616637       1.14007               0.593451                  26.0337
4           0.954909    0.658791       1.09014               0.617237                  34.1482
5           0.849026    0.696152       1.05939               0.633161                  42.2791
6           0.748826    0.732514       1.05501               0.64172                   50.343
7           0.645085    0.769486       1.07674               0.63963                   58.4476
8           0.542578    0.806218       1.13905               0.643909                  66.4593
9           0.438032    0.845169       1.14717               0.648985                  74.6315
10          0.351087    0.8762         1.27986               0.642118                  82.7441
11          0.268679    0.90741        1.41233               0.63545                   91.8396
12          0.201693    0.931978       1.55193               0.635549                  99.9606
13          0.163686    0.944393       1.7719                0.64172                   108.078
14          0.137398    0.952665       1.89784               0.635947                  116.18
15          0.120794    0.958527       2.02123               0.636943                  124.336
16          0.105321    0.964309       2.09981               0.631668                  132.463
17          0.0979085   0.966073       2.23514               0.632564                  140.565
18          0.10191     0.965389       2.2047                0.629678                  148.612
19          0.0889873   0.96937        2.40244               0.623806                  156.785
20          0.0798767   0.973191       2.4482                0.630175                  165.914

The training has completed. Let’s take a look at the results.

[5]:
from IPython.display import Image
Image(filename='MyModel_cifar10_result/loss.png')
[5]:
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_7_0.png
[6]:
Image(filename='MyModel_cifar10_result/accuracy.png')
[6]:
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_8_0.png

Although the accuracy on the training set reached about 97%, the loss on the test set started increasing after around 5 epochs and the test accuracy plateaued around 60%. It looks like the model is overfitting to the training data.

3. Prediction using our trained model

Although the test accuracy is only around 60%, let’s try to classify some test images with this model.

[7]:
%matplotlib inline
import matplotlib.pyplot as plt

cls_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
             'dog', 'frog', 'horse', 'ship', 'truck']

def predict(model, image_id):
    _, test = cifar.get_cifar10()
    x, t = test[image_id]
    model.to_cpu()
    y = model.predictor(x[None, ...]).data.argmax(axis=1)[0]
    print('predicted_label:', cls_names[y])
    print('answer:', cls_names[t])

    plt.imshow(x.transpose(1, 2, 0))
    plt.show()

for i in range(5):
    predict(model, i)
predicted_label: dog
answer: cat
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_11_1.png
predicted_label: ship
answer: ship
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_11_3.png
predicted_label: ship
answer: ship
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_11_5.png
predicted_label: airplane
answer: airplane
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_11_7.png
predicted_label: deer
answer: frog
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_11_9.png

Some are correctly classified, others are not. Even if a model can classify the training dataset almost perfectly, that is meaningless if it cannot generalize to (previously unseen) test data. The accuracy on the test set is therefore a more direct estimate of generalization ability.

How can we design and train a model with better generalization ability?

4. Create a deeper model

Let’s try making our CNN deeper by adding more layers and see how it performs. We will also make the model modular by writing it as a combination of smaller chains, which improves readability and reduces code duplication:

  • A single convolutional block, ConvBlock
  • A single fully connected block, LinearBlock
  • A full model built by chaining many of these two blocks together

Define the block of layers

Let’s define the network blocks, ConvBlock and LinearBlock, which will be stacked to create the full model.

[ ]:
class ConvBlock(chainer.Chain):

    def __init__(self, n_ch, pool_drop=False):
        w = chainer.initializers.HeNormal()
        super(ConvBlock, self).__init__()
        with self.init_scope():
            self.conv = L.Convolution2D(None, n_ch, 3, 1, 1,
                                 nobias=True, initialW=w)
            self.bn = L.BatchNormalization(n_ch)


        self.pool_drop = pool_drop

    def __call__(self, x):
        h = F.relu(self.bn(self.conv(x)))
        if self.pool_drop:
            h = F.max_pooling_2d(h, 2, 2)
            h = F.dropout(h, ratio=0.25)
        return h

class LinearBlock(chainer.Chain):

    def __init__(self):
        w = chainer.initializers.HeNormal()
        super(LinearBlock, self).__init__()
        with self.init_scope():
            self.fc = L.Linear(None, 1024, initialW=w)

    def __call__(self, x):
        return F.dropout(F.relu(self.fc(x)), ratio=0.5)

ConvBlock is defined by inheriting from Chain. It contains a single convolution layer and a batch normalization layer, registered in the constructor. The __call__ method receives the data and applies the activation function to it. If pool_drop is set to True, max pooling and dropout are also applied.

Let’s now define the deeper CNN network by stacking the component blocks.

[ ]:
class DeepCNN(chainer.ChainList):

    def __init__(self, n_output):
        super(DeepCNN, self).__init__(
            ConvBlock(64),
            ConvBlock(64, True),
            ConvBlock(128),
            ConvBlock(128, True),
            ConvBlock(256),
            ConvBlock(256, True),
            LinearBlock(),
            LinearBlock(),
            L.Linear(None, n_output)
        )

    def __call__(self, x):
        for f in self.children():
            x = f(x)
        return x

Note that DeepCNN inherits from ChainList instead of Chain. The ChainList class inherits from Chain and is very useful when you define networks that consist of a long sequence of Link and/or Chain layers.

Note also the difference in the way that links and/or chains are supplied to the initializer of ChainList: they are passed as normal arguments, not as keyword arguments. In the __call__ method, they are retrieved from the list in the order they were registered by calling the self.children() method.

This feature lets us describe the forward propagation very concisely. With the component list returned by self.children(), we can write the entire forward pass as a for loop over the component blocks: the input x is fed to the first block, and each block’s output is passed on to the next Link or Chain in the sequence.

[10]:
gpu_id = 0  # Set to -1 if you don't have a GPU

model = train(DeepCNN(10), gpu_id=gpu_id)
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           1.97534     0.284127       1.52095               0.429638                  51.7811
2           1.47785     0.447223       1.30206               0.584594                  136.648
3           1.23227     0.553377       1.06446               0.660032                  240.008
4           1.04628     0.630702       0.964976              0.695163                  343.388
5           0.902084    0.685642       0.912458              0.695561                  447.047
6           0.776821    0.729373       0.744387              0.764132                  550.3
7           0.683545    0.768286       0.631135              0.798069                  653.239
8           0.598311    0.798315       0.593679              0.804339                  756.449
9           0.53423     0.822011       0.60011               0.80623                   859.992
10          0.482092    0.837708       0.502585              0.832803                  963.425
11          0.42906     0.855994       0.446699              0.851811                  1066.43
12          0.389187    0.869638       0.431314              0.862261                  1169.77
13          0.357603    0.879436       0.431607              0.857484                  1273.36
14          0.326755    0.889165       0.433513              0.862162                  1376.66
15          0.300896    0.899248       0.555515              0.814192                  1479.75
16          0.278662    0.90739        0.439382              0.864351                  1582.91
17          0.250386    0.914242       0.470831              0.861266                  1685.86
18          0.235094    0.921875       0.464271              0.865346                  1788.89
19          0.228264    0.923716       0.429198              0.872313                  1891.77
20          0.20953     0.930038       0.448946              0.865545                  1994.67

The training is completed. Let’s take a look at the loss and accuracy.

[11]:
Image(filename='DeepCNN_cifar10_result/loss.png')
[11]:
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_22_0.png
[12]:
Image(filename='DeepCNN_cifar10_result/accuracy.png')
[12]:
_images/notebook_hands_on_chainer_begginers_hands_on_03_Write_your_own_network_23_0.png

Now the accuracy on the test set has improved a lot compared to the previous smaller CNN: previously it was around 60% and now it is around 87%. According to current research reports, the most advanced models reach around 97%. To improve the accuracy further, it is necessary not only to improve the models but also to increase the amount of training data (data augmentation) or to combine multiple models for the best performance (ensemble methods). You may also find it interesting to experiment with larger and more difficult datasets. There is plenty of room for improvement with your own ideas!

How to write a custom dataset class

A typical strategy to improve generalization in deep neural networks is to increase the number of training examples which allows more parameters to be used in the model. Even if the number of parameters is kept unchanged, increasing the number of training examples often improves generalization performance.

Since it can be tedious and expensive to manually obtain and label additional training examples, a useful strategy is to consider methods for automatically increasing the size of a training set. Fortunately, for image datasets, there are several augmentation methods that have been found to work well in practice. They include:

  • Randomly cropping several slightly smaller images from the original training image.
  • Horizontally flipping the image.
  • Randomly rotating the image.
  • Applying various distortions and/or noise to the images, etc.

In this example, we will write a custom dataset class that performs the first two of these augmentation methods on the CIFAR10 dataset. We will then train our previous deep CNN and check that the generalization performance on the test set has in fact improved.

We will create this dataset augmentation class as a subclass of DatasetMixin, which has the following API:

  • A __len__ method that returns the size of the dataset.
  • A get_example method that returns the data (or a tuple of data and label) for the index given by the argument i.

Other features required of a dataset are provided by inheriting from the chainer.dataset.DatasetMixin class.
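
For illustration only (a minimal sketch, not one of the notebook cells; TinyDataset and its contents are hypothetical), a DatasetMixin subclass needs little more than these two methods:

import numpy as np
from chainer.dataset import DatasetMixin

class TinyDataset(DatasetMixin):
    """A toy dataset wrapping a list of (data, label) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        # Number of examples in the dataset
        return len(self.pairs)

    def get_example(self, i):
        # Return the i-th example as (data, label)
        return self.pairs[i]

# DatasetMixin provides indexing on top of get_example, so the object
# behaves like the built-in Chainer datasets.
dset = TinyDataset([(np.zeros(4, dtype=np.float32), 0),
                    (np.ones(4, dtype=np.float32), 1)])
print(len(dset), dset[0])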

[1]:
# Install Chainer and CuPy!

!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirement already satisfied: cupy-cuda80==4.0.0b3 from https://github.com/kmaehashi/chainer-colab/releases/download/2018-02-06/cupy_cuda80-4.0.0b3-cp36-cp36m-linux_x86_64.whl in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: chainer==4.0.0b3 in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from chainer==4.0.0b3)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from protobuf>=3.0.0->chainer==4.0.0b3)

1. Write the dataset augmentation class for CIFAR10

[2]:
import numpy as np
from chainer import dataset
from chainer.datasets import cifar

gpu_id = 0  # Set to -1 if you don't have a GPU

class CIFAR10Augmented(dataset.DatasetMixin):

    def __init__(self, train=True):
        train_data, test_data = cifar.get_cifar10()
        if train:
            self.data = train_data
        else:
            self.data = test_data
        self.train = train
        self.random_crop = 4

    def __len__(self):
        return len(self.data)

    def get_example(self, i):
        x, t = self.data[i]
        if self.train:
            x = x.transpose(1, 2, 0)
            h, w, _ = x.shape
            x_offset = np.random.randint(self.random_crop)
            y_offset = np.random.randint(self.random_crop)
            x = x[y_offset:y_offset + h - self.random_crop,
                  x_offset:x_offset + w - self.random_crop]
            if np.random.rand() > 0.5:
                x = np.fliplr(x)
            x = x.transpose(2, 0, 1)
        return x, t
/usr/local/lib/python3.6/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')

This class performs the following types of data augmentation on the CIFAR10 example images:

  • Randomly crop a 28x28 region from the 32x32 original image.
  • Randomly apply a horizontal flip with probability 0.5.

2. Train on the CIFAR10 dataset using our dataset augmentation class

Let’s now train the same deep CNN from the previous example. The only difference is that we will now use our dataset augmentation class. Since we reuse the same model with the same number of parameters, we can observe how much the augmentation improves the test set generalization performance.

[3]:
import chainer
import chainer.functions as F
import chainer.links as L
from chainer.datasets import cifar
from chainer import iterators
from chainer import optimizers
from chainer import training
from chainer.training import extensions

class ConvBlock(chainer.Chain):

    def __init__(self, n_ch, pool_drop=False):
        w = chainer.initializers.HeNormal()
        super(ConvBlock, self).__init__()
        with self.init_scope():
            self.conv = L.Convolution2D(None, n_ch, 3, 1, 1,
                                 nobias=True, initialW=w)
            self.bn = L.BatchNormalization(n_ch)


        self.pool_drop = pool_drop

    def __call__(self, x):
        h = F.relu(self.bn(self.conv(x)))
        if self.pool_drop:
            h = F.max_pooling_2d(h, 2, 2)
            h = F.dropout(h, ratio=0.25)
        return h

class LinearBlock(chainer.Chain):

    def __init__(self):
        w = chainer.initializers.HeNormal()
        super(LinearBlock, self).__init__()
        with self.init_scope():
            self.fc = L.Linear(None, 1024, initialW=w)

    def __call__(self, x):
        return F.dropout(F.relu(self.fc(x)), ratio=0.5)

class DeepCNN(chainer.ChainList):

    def __init__(self, n_output):
        super(DeepCNN, self).__init__(
            ConvBlock(64),
            ConvBlock(64, True),
            ConvBlock(128),
            ConvBlock(128, True),
            ConvBlock(256),
            ConvBlock(256, True),
            LinearBlock(),
            LinearBlock(),
            L.Linear(None, n_output)
        )

    def __call__(self, x):
        for f in self.children():
            x = f(x)
        return x

def train(model_object, batchsize=64, gpu_id=gpu_id, max_epoch=20):

    # 1. Dataset
    train, test = CIFAR10Augmented(), CIFAR10Augmented(False)

    # 2. Iterator
    train_iter = iterators.SerialIterator(train, batchsize)
    test_iter = iterators.SerialIterator(test, batchsize, False, False)

    # 3. Model
    model = L.Classifier(model_object)
    if gpu_id >= 0:
        model.to_gpu(gpu_id)

    # 4. Optimizer
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    # 5. Updater
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)

    # 6. Trainer
    trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='{}_cifar10augmented_result'.format(model_object.__class__.__name__))

    # 7. Evaluator

    class TestModeEvaluator(extensions.Evaluator):

        def evaluate(self):
            model = self.get_target('main')
            ret = super(TestModeEvaluator, self).evaluate()
            return ret

    trainer.extend(extensions.LogReport())
    trainer.extend(TestModeEvaluator(test_iter, model, device=gpu_id))
    trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
    trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
    trainer.run()
    del trainer

    return model

model = train(DeepCNN(10), gpu_id=gpu_id, max_epoch=30)
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           1.92407     0.299832       1.47721               0.43959                   85.7994
2           1.46391     0.460767       1.37372               0.510947                  170.782
3           1.24698     0.555558       1.19284               0.613953                  255.058
4           1.09747     0.613997       0.987924              0.68133                   339.642
5           0.987182    0.655391       0.935555              0.69586                   424.341
6           0.898338    0.690401       0.839072              0.722034                  509.142
7           0.81304     0.725192       0.747639              0.764033                  593.771
8           0.738109    0.753441       0.681274              0.77906                   678.347
9           0.672411    0.776235       0.60375               0.805633                  762.626
10          0.617099    0.793834       0.527143              0.827926                  846.95
11          0.575074    0.810059       0.489332              0.832803                  931.496
12          0.538337    0.820563       0.539499              0.822154                  1016.05
13          0.509321    0.832021       0.474118              0.840764                  1101.08
14          0.487421    0.837628       0.452331              0.848129                  1185.79
15          0.462403    0.846771       0.422598              0.860072                  1270.2
16          0.442934    0.852153       0.394928              0.868929                  1354.51
17          0.421746    0.858516       0.404235              0.87092                   1438.79
18          0.407674    0.862756       0.395771              0.867237                  1523.3
19          0.39542     0.868938       0.396354              0.877687                  1607.85
20          0.383477    0.871999       0.37822               0.877488                  1692.17
21          0.371322    0.876399       0.388828              0.87281                   1777.04
22          0.36101     0.878861       0.369287              0.880573                  1861.17
23          0.353587    0.882282       0.422225              0.863953                  1936.13
24          0.344247    0.883963       0.366904              0.882066                  1977.98
25          0.335103    0.888387       0.370403              0.883161                  2019.74
26          0.322226    0.893086       0.353938              0.88545                   2061.59
27          0.323977    0.892626       0.44101               0.860072                  2103.4
28          0.315712    0.895146       0.344539              0.892815                  2145.16
29          0.303662    0.898777       0.385994              0.88754                   2187.08
30          0.29911     0.900448       0.397462              0.880573                  2228.82

Without the data augmentation used here, the test accuracy was capped at about 87%; by applying augmentation to the training data it improves to 89% or more, an improvement of over 2%.

Finally, let’s take a look at the loss and accuracy graphs.

[4]:
from IPython.display import Image
Image(filename='DeepCNN_cifar10augmented_result/loss.png')
[4]:
_images/notebook_hands_on_chainer_begginers_hands_on_04_Write_your_own_dataset_class_8_0.png
[5]:
Image(filename='DeepCNN_cifar10augmented_result/accuracy.png')
[5]:
_images/notebook_hands_on_chainer_begginers_hands_on_04_Write_your_own_dataset_class_9_0.png

Other Hands-on

Chainer

Chainer Hands-on: Introduction To Train Deep Learning Model in Python

Goal

Play with neural networks using Chainer in image recognition.

Lessons to be learned

Attendees will learn the following features of Chainer.

  1. Easy debugging
  2. CPU/GPU-compatible array manipulation
Agenda
Section 1. MNIST Classification by Perceptron

Simple neural networks to classify hand-written digit images

  • Defining and training multi-layer perceptron
  • Evaluating and visualizing result
  • Model improvement and debugging
Section 2. Inside Chainer

Summary of features, class structures and implementations

  • NumPy and CuPy
  • Variable and Function
  • Link and Chain
  • Define-by-Run
Note

We assume that Chainer 1.20.0.1 is installed on a CUDA-7.0-enabled environment for this Jupyter notebook.

[1]:
## Install Chainer and CuPy!

!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Requirement already satisfied: cupy-cuda80 in /usr/local/lib/python3.6/dist-packages (4.3.0)
Requirement already satisfied: chainer in /usr/local/lib/python3.6/dist-packages (4.3.1)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80) (1.14.5)
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80) (0.3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80) (1.11.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from chainer) (3.0.4)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from chainer) (3.6.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.0.0->chainer) (39.1.0)
Preparation: Chainer import

First, import Chainer and related modules. CuPy will be introduced later.

[2]:
## Import Chainer
from chainer import Chain, Variable, optimizers, serializers, datasets, training
from chainer.training import extensions
import chainer.functions as F
import chainer.links as L
import chainer

## Import NumPy and CuPy
import numpy as np
import cupy as cp

## Utilities
import time
import math

print('Chainer version: ', chainer.__version__)
Chainer version:  4.3.1
Section 1. MNIST Classification by Perceptron

MNIST is a benchmark classification dataset in machine learning. It contains 70,000 hand-written digit images with labels from 0 to 9 (a 10-class classification problem). The task is to predict which digit a given image represents.

Each sample is represented as a 28x28 grayscale image (a 784-dimensional vector).

As the simplest neural network model, we use a 2-layer multi-layer perceptron (MLP2). It consists of an input layer, an output layer, and one hidden layer between them. They are connected by linear (fully-connected) layers, each of which contains a weight matrix and a bias term. The activation function for the hidden layer is the hyperbolic tangent (tanh).

The following class implements MLP2. Note that only the type and size of each layer are defined in the __init__ method. The actual forward computation is written directly in a separate __call__ method. There is no explicit definition of the backward computation, since Chainer remembers the computational graph during forward computation and the backward computation can be performed along it (described in Section 2).

[ ]:
## 2-layer Multi-Layer Perceptron (MLP)
class MLP2(Chain):

    # Initialization of layers
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),  # From 784-dimensional input to hidden unit with 100 nodes
            l2=L.Linear(100, 10),  # From hidden unit with 100 nodes to output unit with 10 nodes  (10 classes)
        )

    # Forward computation by __call__
    def __call__(self, x):
        h1 = F.tanh(self.l1(x))  # Forward from x to h1 through the tanh activation
        y = self.l2(h1)          # Forward from h1 to y
        return y

The MNIST dataset can be loaded into main memory by chainer.datasets.get_mnist().

Following the standard problem setting of MNIST, we divide the 70,000 samples into 60,000 training image-label pairs (train) and 10,000 test pairs (test).

[4]:
train, test = chainer.datasets.get_mnist()
print('Train:', len(train))
print('Test:', len(test))
Train: 60000
Test: 10000

These variables will be used throughout the experiments.

[ ]:
batchsize=100
Experiment 1.1 - CPU-based training of MLP2

As the initial setting, we use NumPy for CPU-based execution. Number of epochs (how many times each training sample will be used) is set to 2.

[ ]:
enable_cupy = False # No CuPy (Use NumPy)
n_epoch=2 # Only 2 epochs
Definition: method for MNIST train and test

The following train_and_test() actually runs the experiment using Trainer, which was introduced in Chainer v1.11.0. It covers the last three parts of the standard ML workflow below.

The Optimizer is used during model training to update the model parameters (the weight matrix and bias term of each linear layer) through backpropagation. Chainer supports most of the widely-used optimizers (SGD, AdaGrad, RMSprop, Adam, etc.). Here we use SGD. L.Classifier is a wrapper that builds a classification model around a neural network, which is MLP2 in this setting. The default loss for L.Classifier is softmax cross entropy.
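
For reference, the wrapping and optimizer setup performed inside the function below can be sketched in isolation as follows (a minimal sketch, not one of the notebook cells; it reuses the MLP2 class defined above, and the explicit lossfun argument is optional because softmax cross entropy is the default):

import chainer.functions as F
import chainer.links as L
from chainer import optimizers

predictor = MLP2()                          # the network defined above
classifier = L.Classifier(predictor, lossfun=F.softmax_cross_entropy)
optimizer = optimizers.SGD()                # optimizers.Adam(), optimizers.AdaGrad(), ... also work
optimizer.setup(classifier)                 # register the parameters that will be updated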

[ ]:
def train_and_test():
    training_start = time.clock()
    log_trigger = 600, 'iteration'
    device = -1
    if enable_cupy:
        model.to_gpu()
        chainer.cuda.get_device(0).use()
        device = 0
    optimizer = optimizers.SGD()
    optimizer.setup(classifier_model)
    train_iter = chainer.iterators.SerialIterator(train, batchsize)
    test_iter = chainer.iterators.SerialIterator(test, batchsize, repeat=False, shuffle=False)
    updater = training.StandardUpdater(train_iter, optimizer, device=device)
    trainer = training.Trainer(updater, (n_epoch, 'epoch'), out='out')
    trainer.extend(extensions.dump_graph('main/loss'))
    trainer.extend(extensions.Evaluator(test_iter, classifier_model, device=device))
    trainer.extend(extensions.LogReport(trigger=log_trigger))
    trainer.extend(extensions.PrintReport(
        ['epoch', 'iteration', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy']), trigger=log_trigger)
    trainer.run()
    elapsed_time = time.clock() - training_start
    print('Elapsed time: %3.3f' % elapsed_time)
Execution: wait until test finishes

Let’s run the experiment and get the first result. It takes 30 seconds or so.

[8]:
model = MLP2() # MLP2 model
classifier_model = L.Classifier(model)
train_and_test() # May take 30 sec or more
epoch       iteration   main/loss   validation/main/loss  main/accuracy  validation/main/accuracy
1           600         1.12293     0.6524                0.752617       0.8582
2           1200        0.566502    0.473245              0.86225        0.882
Elapsed time: 15.236
Evaluation: see the 1st result

The validation/main/accuracy should be less than 0.90. This is not bad, but can be improved. Later we will try other settings.

Preparation: import visualization tools

We use matplotlib to display computational graphs and MNIST images.

[9]:
## Import utility and visualization tools
!apt-get install graphviz
!pip install pydot
import pydot
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from IPython.display import Image, display
import chainer.computational_graph as cg
Reading package lists... Done
Building dependency tree
Reading state information... Done
graphviz is already the newest version (2.38.0-16ubuntu2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Requirement already satisfied: pydot in /usr/local/lib/python3.6/dist-packages (1.2.4)
Requirement already satisfied: pyparsing>=2.1.4 in /usr/local/lib/python3.6/dist-packages (from pydot) (2.2.0)
Definition: method for visualizing computational graph

Chainer can export the computational graph from input to the loss function.

[ ]:
def display_graph():
    graph = pydot.graph_from_dot_file('out/cg.dot') # load from .dot file
    graph[0].write_png('graph.png')

    img = Image('graph.png', width=600, height=600)
    display(img)
Execution: visualize the computational graph of MLP2

By running display_graph(), a directed graph will be shown. The three ellipses at the top correspond to the input batch of 100 images with 784 dimensions, the weight matrix of size 100x784, and the bias vector of length 100 for the first linear layer.

The intermediate hidden unit with 100 nodes is passed to the next linear layer through a tanh activation function. The final 100 output vectors of length 10 (one per class) are compared with the int32 ground-truth labels by the SoftmaxCrossEntropy loss function. The loss is given as a float32 value.

After building this graph, backpropagation can work from the loss back to the input to update the model parameters (the weight matrices and bias terms of the two LinearFunctions).

[11]:
display_graph()
_images/notebook_hands_on_chainer_chainer_23_0.png
Definition: method for plotting images with predictions

“Answer:” is the ground truth given in the dataset, and “Predict:” is the prediction by the current model.

[ ]:
def plot_examples():
    %matplotlib inline
    plt.figure(figsize=(12,50))
    if enable_cupy:
       model.to_cpu()
    for i in range(45, 105):
        x = Variable(np.asarray([test[i][0]]))  # test data
        t = Variable(np.asarray([test[i][1]]))  # labels
        y = model(x)
        prediction = y.data.argmax(axis=1)
        example = (test[i][0] * 255).astype(np.int32).reshape(28, 28)
        plt.subplot(20, 5, i - 44)
        plt.imshow(example, cmap='gray')
        plt.title("No.{0} / Answer:{1}, Predict:{2}".format(i, t.data[0], prediction[0]))
        plt.axis("off")
    plt.tight_layout()
Execution: see that some examples are misclassified

Though most of the samples are correctly classified, there can be some mistakes. For example, No. 46 on the first row might be classified as ‘3’, though it looks like ‘1’ to humans. The current model may also misclassify No. 54 on the second row, a strangely written ‘6’, as ‘2’.

[13]:
 plot_examples()
_images/notebook_hands_on_chainer_chainer_27_0.png
Experiment 1.2 - Increase number of epochs

To improve the test accuracy, try to simply increase the number of epochs. Other conditions remain the same.

[ ]:
enable_cupy = False
n_epoch=5                 # Increased from 2 to 5
Execution: run the new experiment with 5 epochs

It will definitely take a longer time.

[15]:
model = MLP2()
classifier_model = L.Classifier(model)
train_and_test()
epoch       iteration   main/loss   validation/main/loss  main/accuracy  validation/main/accuracy
1           600         1.13003     0.65561               0.748433       0.8549
2           1200        0.565097    0.473293              0.864067       0.8843
3           1800        0.452602    0.405271              0.882583       0.8944
4           2400        0.401363    0.368402              0.891067       0.9006
5           3000        0.370984    0.344863              0.89725        0.9049
Elapsed time: 33.086
Evaluation: find that the accuracy becomes higher

The loss is smaller and the validation/main/accuracy is higher (0.90+) than the previous experiment.

Execution: find which mistakes have been removed

No. 46 and/or No. 54 should be correctly classified this time.

[16]:
plot_examples()
_images/notebook_hands_on_chainer_chainer_34_0.png
Experiment 1.3 - Enable GPU computation with CuPy

Though adding more epochs can lead to higher accuracy, 5 epochs already take a noticeable amount of time on the CPU. In this case, we make training faster by enabling CuPy so that the GPU is used.

[ ]:
enable_cupy = True # Now use CuPy
n_epoch=5
Execution: train the same model using GPU

The speed of training is clearly different.

[18]:
model = MLP2()
classifier_model = L.Classifier(model)
train_and_test()
epoch       iteration   main/loss   validation/main/loss  main/accuracy  validation/main/accuracy
1           600         1.10894     0.640712              0.752166       0.8585
2           1200        0.558549    0.463849              0.865416       0.8837
3           1800        0.449113    0.398115              0.884584       0.8942
4           2400        0.399297    0.362251              0.893501       0.9024
5           3000        0.369314    0.339284              0.8993         0.9085
Elapsed time: 17.687
Evaluation: compare the training time

GPU-enabled training should be noticeably faster than CPU-based training; the exact speed-up depends on the model size and environment (roughly 2x in the runs above for this small model).

Experiment 1.4 - Add one more layer

Then we use a different MLP with one more layer.

Definition: MLP with 3 layers

MLP3 has two hidden layers of the same size (100 nodes each), connected by an additional L.Linear. The forward computation is almost the same as in MLP2, also using tanh as the activation function.

[ ]:
## 3-layer multi-Layer Perceptron (MLP)
class MLP3(Chain):

    def __init__(self):
        super(MLP3, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 100),   # Additional  layer
            l3=L.Linear(100, 10)
        )

    def __call__(self, x):
        h1 = F.tanh(self.l1(x))   # Hidden unit 1
        h2 = F.tanh(self.l2(h1)) # Hidden unit 2
        y = self.l3(h2)
        return y
Preparation: create MLP3-based classifier model
[ ]:
enable_cupy = True
n_epoch=5
Execution: train new MLP3-based model
[21]:
model = MLP3()  # Use MLP3 instead of MLP2
classifier_model = L.Classifier(model)
train_and_test()
epoch       iteration   main/loss   validation/main/loss  main/accuracy  validation/main/accuracy
1           600         1.06588     0.590563              0.749349       0.8589
2           1200        0.506428    0.421322              0.87035        0.8923
3           1800        0.402706    0.360067              0.89145        0.9022
4           2400        0.356675    0.328675              0.900918       0.9083
5           3000        0.329147    0.306729              0.906951       0.9128
Elapsed time: 20.396
Evaluation: compare the accuracy of MLP3 with MLP2

MLP3 can achieve a smaller loss and higher accuracy thanks to its higher expressiveness. On the other hand, the computation time increases slightly because there are more parameters to handle.

Execution: see the computational graph with 3 layers

It contains 3 LinearFunction and 2 Tanh activations.

[22]:
display_graph()
_images/notebook_hands_on_chainer_chainer_49_0.png
Execution: can you find any misclassified samples?

MLP3 is good enough to predict the labels of most of the samples.

[23]:
plot_examples()
_images/notebook_hands_on_chainer_chainer_51_0.png
Chainer’s feature - (1) Easy debug

Debugging complex neural networks is hard because runtime errors in other frameworks usually do not directly tell you which part of the model definition or implementation is wrong. However, Chainer performs type checking during the forward computation, so debugging neural networks can be done just like debugging ordinary programs.

Definition: a buggy version of MLP

In MLP3Wrong, three bugs have been introduced into MLP3. Let’s find them from the runtime errors during execution and correct them one by one.

[ ]:
## Find three bugs in this model definition
class MLP3Wrong(Chain):

    def __init__(self):
        super(MLP3Wrong, self).__init__(
            l1=L.Linear(748, 100),
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10)
        )

    def __call__(self, x):
        h1 = F.tanh(self.l1(x))
        h2 = F.tanh(self.l2(x))
        y = self.l3(h3)
        return y

enable_cupy = True
n_epoch=5
Execution: find errors by reading stack trace

In the forward computation, the stack trace points out where the errors actually occur. This is done by the Define-by-Run approach of Chainer, in which the computational graph is directly constructed during forward computation.

Once you have corrected the three bugs, MLP3Wrong should be exactly the same as the definition of MLP3.

[25]:
model = MLP3Wrong() # MLP3Wrong
classifier_model = L.Classifier(model)
train_and_test()
Exception in main training loop:
Invalid operation is performed in: LinearFunction (Forward)

Expect: in_types[0].shape[1] == in_types[1].shape[1]
Actual: 784 != 748
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/usr/local/lib/python3.6/dist-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/usr/local/lib/python3.6/dist-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/usr/local/lib/python3.6/dist-packages/chainer/optimizer.py", line 650, in update
    loss = lossfun(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/chainer/links/model/classifier.py", line 134, in __call__
    self.y = self.predictor(*args, **kwargs)
  File "<ipython-input-24-532380536ccf>", line 11, in __call__
    h1 = F.tanh(self.l1(x))
  File "/usr/local/lib/python3.6/dist-packages/chainer/links/connection/linear.py", line 134, in __call__
    return linear.linear(x, self.W, self.b)
  File "/usr/local/lib/python3.6/dist-packages/chainer/functions/connection/linear.py", line 234, in linear
    y, = LinearFunction().apply(args)
  File "/usr/local/lib/python3.6/dist-packages/chainer/function_node.py", line 243, in apply
    self._check_data_type_forward(in_data)
  File "/usr/local/lib/python3.6/dist-packages/chainer/function_node.py", line 328, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/usr/local/lib/python3.6/dist-packages/chainer/functions/connection/linear.py", line 23, in check_type_forward
    x_type.shape[1] == w_type.shape[1],
  File "/usr/local/lib/python3.6/dist-packages/chainer/utils/type_check.py", line 524, in expect
    expr.expect()
  File "/usr/local/lib/python3.6/dist-packages/chainer/utils/type_check.py", line 482, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
---------------------------------------------------------------------------
InvalidType                               Traceback (most recent call last)
<ipython-input-25-9a17b0e6105a> in <module>()
      1 model = MLP3Wrong() # MLP3Wrong
      2 classifier_model = L.Classifier(model)
----> 3 train_and_test()

<ipython-input-7-78326c82d77b> in train_and_test()
     19         ['epoch', 'iteration', 'main/loss', 'validation/main/loss',
     20          'main/accuracy', 'validation/main/accuracy']), trigger=log_trigger)
---> 21     trainer.run()
     22     elapsed_time = time.clock() - training_start
     23     print('Elapsed time: %3.3f' % elapsed_time)

/usr/local/lib/python3.6/dist-packages/chainer/training/trainer.py in run(self, show_loop_exception_msg)
    318                 print('Will finalize trainer extensions and updater before '
    319                       'reraising the exception.', file=sys.stderr)
--> 320             six.reraise(*sys.exc_info())
    321         finally:
    322             for _, entry in extensions:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/usr/local/lib/python3.6/dist-packages/chainer/training/trainer.py in run(self, show_loop_exception_msg)
    304                 self.observation = {}
    305                 with reporter.scope(self.observation):
--> 306                     update()
    307                     for name, entry in extensions:
    308                         if entry.trigger(self):

/usr/local/lib/python3.6/dist-packages/chainer/training/updaters/standard_updater.py in update(self)
    147
    148         """
--> 149         self.update_core()
    150         self.iteration += 1
    151

/usr/local/lib/python3.6/dist-packages/chainer/training/updaters/standard_updater.py in update_core(self)
    158
    159         if isinstance(in_arrays, tuple):
--> 160             optimizer.update(loss_func, *in_arrays)
    161         elif isinstance(in_arrays, dict):
    162             optimizer.update(loss_func, **in_arrays)

/usr/local/lib/python3.6/dist-packages/chainer/optimizer.py in update(self, lossfun, *args, **kwds)
    648         if lossfun is not None:
    649             use_cleargrads = getattr(self, '_use_cleargrads', True)
--> 650             loss = lossfun(*args, **kwds)
    651             if use_cleargrads:
    652                 self.target.cleargrads()

/usr/local/lib/python3.6/dist-packages/chainer/links/model/classifier.py in __call__(self, *args, **kwargs)
    132         self.loss = None
    133         self.accuracy = None
--> 134         self.y = self.predictor(*args, **kwargs)
    135         self.loss = self.lossfun(self.y, t)
    136         reporter.report({'loss': self.loss}, self)

<ipython-input-24-532380536ccf> in __call__(self, x)
      9
     10     def __call__(self, x):
---> 11         h1 = F.tanh(self.l1(x))
     12         h2 = F.tanh(self.l2(x))
     13         y = self.l3(h3)

/usr/local/lib/python3.6/dist-packages/chainer/links/connection/linear.py in __call__(self, x)
    132             in_size = functools.reduce(operator.mul, x.shape[1:], 1)
    133             self._initialize_params(in_size)
--> 134         return linear.linear(x, self.W, self.b)

/usr/local/lib/python3.6/dist-packages/chainer/functions/connection/linear.py in linear(x, W, b)
    232         args = x, W, b
    233
--> 234     y, = LinearFunction().apply(args)
    235     return y

/usr/local/lib/python3.6/dist-packages/chainer/function_node.py in apply(self, inputs)
    241
    242         if configuration.config.type_check:
--> 243             self._check_data_type_forward(in_data)
    244
    245         hooks = chainer.get_function_hooks()

/usr/local/lib/python3.6/dist-packages/chainer/function_node.py in _check_data_type_forward(self, in_data)
    326         in_type = type_check.get_types(in_data, 'in_types', False)
    327         with type_check.get_function_check_context(self):
--> 328             self.check_type_forward(in_type)
    329
    330     def check_type_forward(self, in_types):

/usr/local/lib/python3.6/dist-packages/chainer/functions/connection/linear.py in check_type_forward(self, in_types)
     21             x_type.ndim == 2,
     22             w_type.ndim == 2,
---> 23             x_type.shape[1] == w_type.shape[1],
     24         )
     25         if type_check.eval(n_in) == 3:

/usr/local/lib/python3.6/dist-packages/chainer/utils/type_check.py in expect(*bool_exprs)
    522         for expr in bool_exprs:
    523             assert isinstance(expr, Testable)
--> 524             expr.expect()
    525
    526

/usr/local/lib/python3.6/dist-packages/chainer/utils/type_check.py in expect(self)
    480             raise InvalidType(
    481                 '{0} {1} {2}'.format(self.lhs, self.exp, self.rhs),
--> 482                 '{0} {1} {2}'.format(left, self.inv, right))
    483
    484

InvalidType:
Invalid operation is performed in: LinearFunction (Forward)

Expect: in_types[0].shape[1] == in_types[1].shape[1]
Actual: 784 != 748
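
For reference, the corrected version (a sketch named MLP3Fixed here; it is identical to the MLP3 definition above) fixes the three bugs as follows: the input size is 784 rather than 748, the second layer receives h1 rather than x, and the output layer receives h2 rather than the undefined h3.

## Corrected version of MLP3Wrong (identical to MLP3)
class MLP3Fixed(Chain):

    def __init__(self):
        super(MLP3Fixed, self).__init__(
            l1=L.Linear(784, 100),   # bug 1: 748 -> 784 (28x28 = 784 input pixels)
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10)
        )

    def __call__(self, x):
        h1 = F.tanh(self.l1(x))
        h2 = F.tanh(self.l2(h1))     # bug 2: feed h1, not x
        y = self.l3(h2)              # bug 3: use h2, not the undefined h3
        return y
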
Experiment 1.5 - Make your own model

Now it is your turn. Modify the model yourself to achieve higher accuracy.

Since increasing the number of epochs is obviously the easiest way, try to reach 0.95+ accuracy within 10 epochs and less than 100 seconds of training.

Definition: define a new model with more options

Tune the neural network model for better performance. There are many options:

  • Increase the number of epochs
  • Increase the number of nodes
  • Add more layers
  • Use different types of activation functions
[ ]:
## Let's create new Multi-Layer Perceptron (MLP)
class MLPNew(Chain):

    def __init__(self):
        # Add more layers?
        super(MLPNew, self).__init__(
            l1=L.Linear(784, 100),  # Increase the output size, e.g. (784, 200)?
            l2=L.Linear(100, 100),  # Increase the sizes, e.g. (200, 200)?
            l3=L.Linear(100, 10)    # Adjust accordingly, e.g. (200, 10)?
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))   # Replace F.tanh with F.sigmoid or F.relu?
        h2 = F.relu(self.l2(h1))  # Replace F.tanh with F.sigmoid or F.relu?
        y = self.l3(h2)
        return y

enable_cupy = True #  Use CuPy for faster training
n_epoch = 5 # Add more epochs?
Execution: create a better model with 0.95+ accuracy
[27]:
model = MLPNew()
classifier_model = L.Classifier(model)
train_and_test()
epoch       iteration   main/loss   validation/main/loss  main/accuracy  validation/main/accuracy
1           600         1.37153     0.610667              0.64245        0.8477
2           1200        0.493793    0.387275              0.8697         0.894
3           1800        0.376693    0.329261              0.894701       0.9075
4           2400        0.332247    0.298809              0.904967       0.9143
5           3000        0.30512     0.280604              0.912418       0.92
Elapsed time: 20.306
Execution: no mistake anymore?

With 0.95+ accuracy, you may not find any misclassification in these 60 examples.

[28]:
plot_examples()
_images/notebook_hands_on_chainer_chainer_63_0.png
Execution: see what your best model looks like
[29]:
display_graph()
_images/notebook_hands_on_chainer_chainer_65_0.png
Advanced: Convolutional NN implementation

In this section, we only used MLPs with linear (fully-connected) layers. However, recent progress in deep learning for image recognition comes from a different type of network called the Convolutional Neural Network (CNN).

Though it is beyond the scope of this hands-on, Chainer also includes example code for ImageNet classification that contains many variants of CNNs.

Definition: AlexNet (ImageNet 2012 winner) model

AlexNet is the standard CNN that won the ImageNet 2012 classification contest.

Chainer supports all of the commonly-used layers and functions, so users can re-implement such state-of-the-art models and extend them for their own problems. For example, AlexNet includes:

  • Convolutional layer (L.Convolution2D)
  • Max pooling (F.max_pooling_2d)
  • Local response normalization (F.local_response_normalization)
  • Dropout (F.dropout)

For more details on the functions, please refer to Standard Function implementations in Chainer reference manual.

[ ]:
## Definition of AlexNet
class AlexNet(chainer.Chain):

    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3,  96, 11, stride=4),
            conv2=L.Convolution2D(96, 256,  5, pad=2),
            conv3=L.Convolution2D(256, 384,  3, pad=1),
            conv4=L.Convolution2D(384, 384,  3, pad=1),
            conv5=L.Convolution2D(384, 256,  3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )

    def __call__(self, x):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)))  # dropout is switched on/off via chainer.config.train
        h = F.dropout(F.relu(self.fc7(h)))
        y = self.fc8(h)
        return y

Section 2. Inside Chainer

In Section 1, we showed how to build and train neural networks in Chainer through an image recognition example. Users can also apply Chainer to their own problems beyond such pattern recognition tasks.

Though we only combined preset layers and functions to build neural networks in the experiments, users may need to create new kinds of networks from scratch by writing lower-level implementations.

Chainer is designed to encourage users to rapidly prototype such new models, test them, and improve them through trial and error. In the following, we explain the core components inside Chainer.

2.1 NumPy and CuPy

NumPy is the widely-used library for CPU-based numerical computation in Python. On the other hand, neural networks benefit from GPUs for faster computation on multi-dimensional arrays. However, NumPy does not support GPUs, so Python users used to have to write GPU-specific code, as in the initial version of Chainer.

Therefore, CuPy was created and added to Chainer as a NumPy-compatible library based on CUDA. It currently supports many of the NumPy APIs, so users can write CPU/GPU-agnostic code in most cases.

Execution: test NumPy

Using NumPy, create a 1000x1000 matrix, transpose it, multiply each element by 2, and repeat this 5000 times.

[31]:
## import numpy as np
a = np.arange(1000000).reshape(1000, -1)
t1 = time.clock()
for i in range(5000):
    a = np.arange(1000000).reshape(1000, -1)
    b = a.T * 2
t2 = time.clock()
print(t2 -t1)
15.250113999999996
Execution: test CuPy

Execute the same computation with CuPy. It should be several times faster than NumPy (roughly 10x in the run below).

[32]:
## import cupy as cp
a = cp.arange(1000000).reshape(1000, -1)
t1 = time.clock()
for i in range(5000):
    a = cp.arange(1000000).reshape(1000, -1)
    b = a.T * 2
t2 = time.clock()
print(t2 -t1)
1.4757419999999968
Chainer’s feature - (2) CPU/GPU-compatible array manipulation

Since CuPy provides an interface that is as close to NumPy's as possible, users can switch between them without modifying the computation logic, as follows.

[33]:
def xp_test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    t1 = time.clock()
    for i in range(5000):
        a = xp.arange(1000000).reshape(1000, -1)
        b = a.T * 2
    t2 = time.clock()
    print(t2 -t1)

enable_cupy = False
xp_test(np if not enable_cupy else cp)
enable_cupy = True
xp_test(np if not enable_cupy else cp)
15.178259999999995
1.5018499999999904
2.2 Variable and Function

Variable and Function are two basic classes in Chainer. As their names suggest, Variable represents the value of a variable, and Function represents a static function applied to Variables.

Execution: Variable is a class for multi-dimensional arrays

A Variable can be initialized with a NumPy/CuPy array, which will be stored in .data.

[34]:
x = Variable(np.asarray([[0, 2],[1, -3]]).astype(np.float32))

print(type(x))
print(type(x.data))
print(x.data)
<class 'chainer.variable.Variable'>
<class 'numpy.ndarray'>
[[ 0.  2.]
 [ 1. -3.]]
Execution: Variable can move between CPU and GPU

By calling to_gpu() and to_cpu(), the content of .data can be switched between a NumPy array and a CuPy array.

[35]:
x.to_gpu()
print(type(x.data))
x.to_cpu()
print(type(x.data))
<class 'cupy.core.core.ndarray'>
<class 'numpy.ndarray'>
Execution: Function is used for transforming Variables

The actual computation is defined in the forward() method, and calling the function on a Variable returns another Variable.

[36]:
from chainer import function

class MyFunc(function.Function):
    def forward(self, x):
        self.y = x[0] **2 + 2 * x[0] + 1 # y = x^2 + 2x + 1
        return self.y,

def my_func(x):
    return MyFunc()(x)

x = Variable(np.asarray([[0, 2],[1, -3]]).astype(np.float32))
y = my_func(x)
print(type(x))
print(x.data)
print(type(y))
print(y.data)
<class 'chainer.variable.Variable'>
[[ 0.  2.]
 [ 1. -3.]]
<class 'chainer.variable.Variable'>
[[1. 9.]
 [4. 4.]]
Execution: Variable remembers history

Each instance of Variable remembers the function that generated it in .creator. If its .creator is None, the Variable instance is called a root.

[37]:
x = Variable(np.asarray([[0, 2],[1, -3]]).astype(np.float32))

## y is created by MyFunc
y = my_func(x)
print(y.creator)

## z is created by F.sigmoid
z = F.sigmoid(x)
print(z.creator)

## x is created by user
print(x.creator)
<__main__.MyFunc object at 0x7f409a6b3470>
<chainer.functions.activation.sigmoid.Sigmoid object at 0x7f409a6b3cc0>
None
Variable natively supports backpropagation

Backpropagation is the standard way to optimize neural networks. After the forward computation, a gradient is given at the output (the loss), and the corresponding gradients are assigned to each intermediate layer by tracing the computational graph backward. The parameters are then updated using this gradient information.

In Chainer, since all of the variables used in the forward computation are stored and automatic differentiation is supported, backward() traces the computational graph backward from the terminal (output) to the roots (inputs whose .creator is None). The optimizer then updates the model.

Definition: quadratic equation as forward computation

As shown in the previous section, forward computation can be regarded as a chain of functions to generate the final Variable instance. During the computation, Chainer remembers all of the intermediate Variable instances.

[ ]:
## A mock of forward computation
def forward(x):
    z = 2 * x
    y = x ** 2 - z + 1
    return y, z
Execution: backward computation to assign gradients

By setting y.grad and calling y.backward(), the gradient information is propagated back to x and z.

[ ]:
x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
y, z = forward(x)
y.grad = np.ones((2, 3), dtype=np.float32)
y.backward(retain_grad=True)
[40]:
## Gradient for x: 2*x - 2
print(x.grad)
[[ 0.  2.  4.]
 [ 6.  8. 10.]]
[41]:
## Gradient for z: -1
print(z.grad)
[[-1. -1. -1.]
 [-1. -1. -1.]]
Advanced: Define-by-Run scheme

In most existing deep learning frameworks, model construction and training are two separate processes. Before training, a fixed computational graph for the model is built by parsing the model definition; most frameworks use a text or symbolic style of program to define a neural network. These definitions can be regarded as a kind of domain-specific language (DSL) for deep learning. Then, given a training dataset, the actual training runs to update the model. This two-step flow is what we call the Define-and-Run scheme.

Define-and-Run is very straightforward and is good for optimizing the computational graph before training. On the other hand, it has some drawbacks. For example, it requires special syntax to implement recurrent neural networks, and memory efficiency may not be optimal since the whole computational graph has to be kept in main memory from the beginning to the end of training.

Therefore, Chainer takes another approach named Define-by-Run. The model definition is combined with training: the actual forward computation builds the computational graph on the fly. This enables users to easily implement complex networks with loops and branching using the host language (Python). Modifications to the computational graph during training, such as truncated BPTT, can also be done efficiently.

We would like interested users to refer to our research paper.
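
As a small illustration of Define-by-Run (a minimal sketch added for this section, not part of the original hands-on; DynamicNet is a hypothetical example), the forward computation below uses an ordinary Python loop and an if branch, and backward() traverses exactly the graph that was built during that call:

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class DynamicNet(chainer.Chain):
    """A network whose depth is decided at run time."""

    def __init__(self):
        super(DynamicNet, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 10)
            self.l2 = L.Linear(10, 10)

    def __call__(self, x, n_repeat):
        h = F.relu(self.l1(x))
        for _ in range(n_repeat):   # a plain Python loop builds the graph on the fly
            h = F.relu(self.l2(h))
        if h.shape[0] > 1:          # branching with ordinary Python control flow
            h = F.dropout(h, ratio=0.1)
        return h

net = DynamicNet()
x = np.random.rand(4, 20).astype(np.float32)
y = net(x, n_repeat=3)              # the computational graph is recorded during this call
loss = F.sum(y)
loss.backward()                     # backward follows the graph that was just built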


Section 3. Summary

We introduced Chainer as a powerful, intuitive, and flexible framework for deep learning. Chainer enables users to easily implement complex models proposed in recent academic papers and to rapidly prototype new algorithms of their own.

The following images were generated by a Chainer implementation of the famous paper “A Neural Algorithm of Artistic Style”. The style of a content image of a cat is transformed into artistic images whose style resembles the style image shown next to each of them.

This is just one example, but you can see how this kind of fancy model can be implemented in only a few hundred lines of code in Chainer. There is also a list of code examples provided by users for many use cases.


This is the end of this hands-on. For more details, please refer to the official tutorial.

Classify anime characters with a fine-tuned model

(This notebook is based on mitmul/chainer-handson/animeface-character/classify_characters.ipynb in Japanese.)

In this notebook, we will learn to:

  1. Fine-tune an Illustration2Vec model on the “animeface-character” dataset.
  2. Classify 146 kinds of character faces with more than 90% accuracy using the fine-tuned model.

After reading this notebook, you should be able to use Chainer to:

  • Make a new dataset object.
  • Divide a dataset into training / validation sets.
  • Fine-tune a model on a new task using pre-trained weights.
  • Bonus: write a dataset class from scratch.
Summary

We show a concrete example of how to set up a dataset that is not already provided by Chainer and use it for training a network. The basic procedure is almost the same as in the chapter that explains the CIFAR-10 dataset class (described in Chainer v4: Beginner Tutorial, Japanese only for now).

Here, we also explain how to initialize a new model with the weights of a pre-trained model whose domain is related to our task. The procedure is the same when fine-tuning a network from a Caffe .caffemodel file.

First, we execute the following cell and install “Chainer” and its GPU back end “CuPy”. If the “runtime type” of Colaboratory is GPU, you can run Chainer with GPU as a backend.

[1]:
!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Let’s import the necessary modules, then check the version of Chainer, NumPy, CuPy, Cuda and other execution environments.

[2]:
import chainer

chainer.print_runtime_info()
Chainer: 4.4.0
NumPy: 1.14.5
CuPy:
  CuPy Version          : 4.4.0
  CUDA Root             : None
  CUDA Build Version    : 8000
  CUDA Driver Version   : 9000
  CUDA Runtime Version  : 8000
  cuDNN Build Version   : 7102
  cuDNN Version         : 7102
  NCCL Build Version    : 2213

Use pip to install the required libraries.

[3]:
%%bash
pip install Pillow
pip install dill
Requirement already satisfied: Pillow in /usr/local/lib/python3.6/dist-packages (4.0.0)
Requirement already satisfied: olefile in /usr/local/lib/python3.6/dist-packages (from Pillow) (0.45.1)
Collecting dill
  Downloading https://files.pythonhosted.org/packages/6f/78/8b96476f4ae426db71c6e86a8e6a81407f015b34547e442291cd397b18f3/dill-0.2.8.2.tar.gz (150kB)
Building wheels for collected packages: dill
  Running setup.py bdist_wheel for dill: started
  Running setup.py bdist_wheel for dill: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/e2/5d/17/f87cb7751896ac629b435a8696f83ee75b11029f5d6f6bda72
Successfully built dill
Installing collected packages: dill
Successfully installed dill-0.2.8.2
1. Download the dataset

First, we download the dataset we will use. Credit goes to @nagadomi (a Kaggle Grandmaster), who created the face-area thumbnails from anime characters.

[4]:
%%bash
if [ ! -d animeface-character-dataset ]; then
    curl -L -O http://www.nurs.or.jp/~nagadomi/animeface-character-dataset/data/animeface-character-dataset.zip
    unzip -q animeface-character-dataset.zip
    rm -rf animeface-character-dataset.zip
fi
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  564M  100  564M    0     0  56.4M      0  0:00:10  0:00:10 --:--:-- 56.4M
2. Problem settings

We use the images of anime character faces in “animeface-character-dataset” and train a network to classify which character appears in each image. The data is split into training and validation sets, and we evaluate the classifier on the validation set.

Also, instead of initializing the model weights randomly, we initialize them with a model pre-trained on a similar domain. Training a model this way rather than from scratch is commonly known as fine-tuning.

The dataset used for training contains many images, with each character in its own folder.

Example images

  • 000_hatsune_miku

|face\_128\_326\_108.png|

  • 002_suzumiya_haruhi

|face\_1000\_266\_119.png|

  • 007_nagato_yuki

|face\_83\_270\_92.png|

  • 012_asahina_mikuru

|face\_121\_433\_128.png|

3. Creating a dataset object

Here, we show how to create a dataset object using a class called LabeledImageDataset which is often used for image classification problems.

First, we get the list of image file paths. The image files are located in a separate directory for each character under the animeface-character-dataset/thumb directory. In the code below, if a directory contains a file named ignore, we skip that directory when loading.

[ ]:
import os
import glob
from itertools import chain

# image directories
IMG_DIR = 'animeface-character-dataset/thumb'

# directories for each character
dnames = glob.glob('{}/*'.format(IMG_DIR))

# the list of image files' path
fnames = [glob.glob('{}/*.png'.format(d)) for d in dnames
          if not os.path.exists('{}/ignore'.format(d))]
fnames = list(chain.from_iterable(fnames))

Next, because each image directory name contains the character’s name, we use it to create a unique ID for each character.

[ ]:
# Create unique id for each character from the directory name
labels = [os.path.basename(os.path.dirname(fn)) for fn in fnames]
dnames = [os.path.basename(d) for d in dnames
          if not os.path.exists('{}/ignore'.format(d))]
labels = [dnames.index(l) for l in labels]

Let’s create a simple dataset object. We simply pass a list of (file path, label) tuples to LabeledImageDataset. The resulting dataset returns tuples of the form (img, label).

[ ]:
from chainer.datasets import LabeledImageDataset

# Create the dataset
d = LabeledImageDataset(list(zip(fnames, labels)))

Next, we use a convenient class called TransformDataset provided by Chainer. It is a wrapper that takes a dataset object and a function representing the transformation applied to each example, which lets you keep data augmentation and preprocessing outside the dataset class.

[ ]:
from chainer.datasets import TransformDataset
from PIL import Image

width, height = 160, 160

# function for resizing images
def resize(img):
    img = Image.fromarray(img.transpose(1, 2, 0))
    img = img.resize((width, height), Image.BICUBIC)
    return np.asarray(img).transpose(2, 0, 1)

# transformation for each data
def transform(inputs):
    img, label = inputs
    img = img[:3, ...]
    img = resize(img.astype(np.uint8))
    img = img - mean[:, None, None]
    img = img.astype(np.float32)
    # Flip horizontally at random
    if np.random.rand() > 0.5:
        img = img[..., ::-1]
    return img, label

# dataset with transformation
td = TransformDataset(d, transform)

In this way, we obtain a dataset object whose examples are processed by the transform function.

Let’s split this into two separate datasets for training and validation. We use 80% of the entire dataset for training and the remaining 20% for validation. split_dataset_random shuffles the data once and then splits it.

[ ]:
from chainer import datasets

train, valid = datasets.split_dataset_random(td, int(len(d) * 0.8), seed=0)

Several other functions are also provided, such as get_cross_validation_datasets_random, which returns several different pairs of training and validation datasets for cross-validation. Have a look at SubDataset as well.

The mean used in transform is the mean image over the training dataset. Let’s calculate it.

[ ]:
import matplotlib.pyplot as plt
import numpy as np


# Compute the mean image if it has not been computed yet
if not os.path.exists('image_mean.npy'):
    # We want to calculate the mean without the transformation
    t, _ = datasets.split_dataset_random(d, int(len(d) * 0.8), seed=0)

    mean = np.zeros((3, height, width))
    for img, _ in t:
        img = resize(img[:3].astype(np.uint8))
        mean += img
    mean = mean / float(len(t))  # average over the training-split images
    np.save('image_mean', mean)
else:
    mean = np.load('image_mean.npy')

Let’s display the calculated average image.

[11]:
# display the mean image
%matplotlib inline
plt.imshow(mean.transpose(1, 2, 0) / 255)
plt.show()
_images/notebook_hands_on_chainer_classify_anime_characters_25_0.png

The mean image may look a little creepy…

When subtracting the mean from each image, we use the per-channel average rather than the whole mean image. So we compute the average pixel value (RGB) of this mean image.

[ ]:
mean = mean.mean(axis=(1, 2))
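
As a quick sanity check (not in the original notebook), we can now fetch one example from the transformed dataset; since the mean has been computed above, transform should work and return a preprocessed image together with its label.

```python
# Hypothetical sanity check: each item of td should now be a mean-subtracted
# float32 image of shape (3, 160, 160) together with its integer label.
img, label = td[0]
print(img.shape, img.dtype, label)
```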
4. Model definition and preparation for Fine-tuning

Next, we define the model. We build a new model based on the network used in Illustration2Vec, which can predict tags, extract features, and so on. The new model reuses the layers of Illustration2Vec except for the last two, which are replaced by two new fully-connected layers initialized randomly.

During training, we fix the weights of the Illustration2Vec layers; that is, we only train the two newly added layers.

First, we download the pre-trained parameters of the Illustration2Vec model.

[13]:
%%bash
if [ ! -f illust2vec_ver200.caffemodel ]; then
    curl -L -O https://github.com/rezoo/illustration2vec/releases/download/v2.0.0/illust2vec_ver200.caffemodel
fi
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   618    0   618    0     0    618      0 --:--:-- --:--:-- --:--:--  2771
100  933M  100  933M    0     0  35.8M      0  0:00:26  0:00:26 --:--:-- 86.4M

These trained weights are provided as a caffemodel, and Chainer can easily load Caffe models via `CaffeFunction <http://docs.chainer.org/en/stable/reference/generated/chainer.links.caffe.CaffeFunction.html#chainer.links.caffe.CaffeFunction>`__, so we use it to load both the parameters and the model structure. However, since loading takes time, it is convenient to save the resulting Chain object with a pickle-compatible module such as dill so that loading becomes faster next time.
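
The caching code itself is not shown in this notebook; a rough sketch of the idea might look like the following (the cache file name is hypothetical).

```python
# Hedged sketch: cache the parsed CaffeFunction with dill so later runs are fast.
import os
import dill
from chainer.links.caffe import CaffeFunction

CACHE_FN = 'illust2vec_ver200.dill'  # hypothetical cache file name
if os.path.exists(CACHE_FN):
    with open(CACHE_FN, 'rb') as f:
        caffe_model = dill.load(f)    # fast path: load the cached Chain
else:
    caffe_model = CaffeFunction('illust2vec_ver200.caffemodel')  # slow: parse the caffemodel
    with open(CACHE_FN, 'wb') as f:
        dill.dump(caffe_model, f)     # cache it for next time
```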

The actual network code is as follows.

[14]:
import dill

import chainer
import chainer.links as L
import chainer.functions as F

from chainer import Chain
from chainer.links.caffe import CaffeFunction
from chainer import serializers


class Illust2Vec(Chain):

    CAFFEMODEL_FN = 'illust2vec_ver200.caffemodel'

    def __init__(self, n_classes, unchain=True):
        w = chainer.initializers.HeNormal()
        model = CaffeFunction(self.CAFFEMODEL_FN)  # Load and save CaffeModel. (It takes time)
        del model.encode1  # Delete unnecessary layers to save memory.
        del model.encode2
        del model.forwards['encode1']
        del model.forwards['encode2']
        model.layers = model.layers[:-2]

        super(Illust2Vec, self).__init__()
        with self.init_scope():
            self.trunk = model  # Include the original Illust2Vec model as trunk in this model.
            self.fc7 = L.Linear(None, 4096, initialW=w)
            self.bn7 = L.BatchNormalization(4096)
            self.fc8 = L.Linear(4096, n_classes, initialW=w)

    def __call__(self, x):
        h = self.trunk({'data': x}, ['conv6_3'])[0]  # Extract the output of conv6_3 of the original Illust2Vec model.
        h.unchain_backward()
        h = F.dropout(F.relu(self.bn7(self.fc7(h))))  # Here and after are newly added layers
        return self.fc8(h)


n_classes = len(dnames)
model = Illust2Vec(n_classes)
model = L.Classifier(model)
/usr/local/lib/python3.6/dist-packages/chainer/links/caffe/caffe_function.py:165: UserWarning: Skip the layer "encode1neuron", since CaffeFunction does notsupport Sigmoid layer
  'support %s layer' % (layer.name, layer.type))
/usr/local/lib/python3.6/dist-packages/chainer/links/caffe/caffe_function.py:165: UserWarning: Skip the layer "loss", since CaffeFunction does notsupport SigmoidCrossEntropyLoss layer
  'support %s layer' % (layer.name, layer.type))

Note the h.unchain_backward() call that appears in __call__. Calling unchain_backward on an intermediate Variable cuts the computational graph at that point, so during training no gradients are propagated to the layers before it, and their parameters are not updated.

As mentioned above,

During training, we fix the weights of the Illustration2Vec layers; that is, we only train the two newly added layers.

This is achieved by h.unchain_backward().
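
The effect is easy to see on a toy example (a minimal sketch, not taken from this notebook): after unchain_backward, gradients stop at the cut point and never reach the variables before it.

```python
import numpy as np
from chainer import Variable

x = Variable(np.array([[2.0]], dtype=np.float32))
h = x * 3               # upstream computation
h.unchain_backward()    # cut the graph here: h forgets its creator
y = h ** 2
y.grad = np.ones_like(y.data)
y.backward()
print(h.grad)  # h now behaves like a leaf, so it receives a gradient (2*h = 12)
print(x.grad)  # None: the gradient never reaches x because the graph was cut
```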

5. Learning

Let’s train the model with the dataset. First, load the necessary modules.

[ ]:
from chainer import iterators
from chainer import training
from chainer import optimizers
from chainer.training import extensions
from chainer.training import triggers
from chainer.dataset import concat_examples

Next, set the training parameters as follows:

  • Batch size: 64
  • Learning rate: starts at 0.01 and is multiplied by 0.1 after 10 epochs
  • Number of epochs: 20
[ ]:
batchsize = 64
gpu_id = 0
initial_lr = 0.01
lr_drop_epoch = 10
lr_drop_ratio = 0.1
train_epoch = 20

Let’s kick off the training.

[17]:
train_iter = iterators.MultiprocessIterator(train, batchsize)
valid_iter = iterators.MultiprocessIterator(
    valid, batchsize, repeat=False, shuffle=False)

optimizer = optimizers.MomentumSGD(lr=initial_lr)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.WeightDecay(0.0001))

updater = training.StandardUpdater(
    train_iter, optimizer, device=gpu_id)

trainer = training.Trainer(updater, (train_epoch, 'epoch'), out='AnimeFace-result')
trainer.extend(extensions.LogReport())
trainer.extend(extensions.observe_lr())

# logging values
trainer.extend(extensions.PrintReport(
    ['epoch',
     'main/loss',
     'main/accuracy',
     'val/main/loss',
     'val/main/accuracy',
     'elapsed_time',
     'lr']))

# Save loss plot automatically every epoch
trainer.extend(extensions.PlotReport(
        ['main/loss',
         'val/main/loss'],
        'epoch', file_name='loss.png'))

# Save accuracy plot automatically every epoch
trainer.extend(extensions.PlotReport(
        ['main/accuracy',
         'val/main/accuracy'],
        'epoch', file_name='accuracy.png'))

# Extension to validate model with train property set to False
trainer.extend(extensions.Evaluator(valid_iter, model, device=gpu_id), name='val')

# Learning rate is multiplied by lr_drop_ratio for each specified epoch
trainer.extend(
    extensions.ExponentialShift('lr', lr_drop_ratio),
    trigger=(lr_drop_epoch, 'epoch'))

trainer.run()
epoch       main/loss   main/accuracy  val/main/loss  val/main/accuracy  elapsed_time  lr
1           1.60994     0.613411       0.6051         0.833812           102.794       0.01
2           0.605895    0.828228       0.550095       0.860241           193.605       0.01
3           0.407292    0.885969       0.469584       0.870144           284.337       0.01
4           0.325062    0.905112       0.427613       0.887003           374.966       0.01
5           0.250727    0.923531       0.396959       0.895822           465.039       0.01
6           0.206382    0.938431       0.406959       0.890555           555.431       0.01
7           0.184174    0.943398       0.385616       0.901281           645.739       0.01
8           0.153923    0.955195       0.379971       0.90401            735.907       0.01
9           0.136681    0.957574       0.384024       0.904159           826.123       0.01
10          0.112497    0.967094       0.362051       0.907562           916.594       0.01
11          0.0905753   0.974752       0.347325       0.911149           1007.43       0.001
12          0.0731635   0.979512       0.353764       0.912496           1097.69       0.001
13          0.0757203   0.980339       0.340012       0.915672           1187.99       0.001
14          0.0719905   0.979201       0.344738       0.909504           1278.21       0.001
15          0.0680711   0.982616       0.335869       0.912234           1368.43       0.001
16          0.0670189   0.980625       0.3339         0.917203           1458.31       0.001
17          0.0612799   0.984065       0.335891       0.913879           1548.63       0.001
18          0.0669879   0.982719       0.336597       0.915821           1638.88       0.001
19          0.0631883   0.984272       0.335587       0.914439           1729.09       0.001
20          0.0628357   0.983237       0.34545        0.911149           1819.37       0.001

Training finished in about 30 minutes, and the log output is shown above. In the end, we achieved more than 90% accuracy on the validation dataset. Let’s display the loss and accuracy curves from the training process.

[18]:
from IPython.display import Image
Image(filename='AnimeFace-result/loss.png')
[18]:
_images/notebook_hands_on_chainer_classify_anime_characters_40_0.png
[19]:
Image(filename='AnimeFace-result/accuracy.png')
[19]:
_images/notebook_hands_on_chainer_classify_anime_characters_41_0.png

It seems that it has successfully converged.

Finally, we take several images from the validation dataset and look at the individual classification results.

[20]:
%matplotlib inline
import matplotlib.pyplot as plt

from PIL import Image
from chainer import cuda

chainer.config.train = False
for _ in range(10):
    x, t = valid[np.random.randint(len(valid))]
    x = cuda.to_gpu(x)
    y = F.softmax(model.predictor(x[None, ...]))

    pred = os.path.basename(dnames[int(y.data.argmax())])
    label = os.path.basename(dnames[t])

    print('pred:', pred, 'label:', label, pred == label)

    x = cuda.to_cpu(x)
    x += mean[:, None, None]
    x = x / 256
    x = np.clip(x, 0, 1)
    plt.imshow(x.transpose(1, 2, 0))
    plt.show()
pred: 191_shidou_hikaru label: 191_shidou_hikaru True
_images/notebook_hands_on_chainer_classify_anime_characters_43_1.png
pred: 139_caro_ru_lushe label: 139_caro_ru_lushe True
_images/notebook_hands_on_chainer_classify_anime_characters_43_3.png
pred: 180_matsuoka_miu label: 180_matsuoka_miu True
_images/notebook_hands_on_chainer_classify_anime_characters_43_5.png
pred: 070_nijihara_ink label: 070_nijihara_ink True
_images/notebook_hands_on_chainer_classify_anime_characters_43_7.png
pred: 001_kinomoto_sakura label: 001_kinomoto_sakura True
_images/notebook_hands_on_chainer_classify_anime_characters_43_9.png
pred: 114_natsume_rin label: 114_natsume_rin True
_images/notebook_hands_on_chainer_classify_anime_characters_43_11.png
pred: 014_hiiragi_kagami label: 014_hiiragi_kagami True
_images/notebook_hands_on_chainer_classify_anime_characters_43_13.png
pred: 055_ibuki_fuuko label: 169_shihou_matsuri False
_images/notebook_hands_on_chainer_classify_anime_characters_43_15.png
pred: 070_nijihara_ink label: 070_nijihara_ink True
_images/notebook_hands_on_chainer_classify_anime_characters_43_17.png
pred: 171_ikari_shinji label: 171_ikari_shinji True
_images/notebook_hands_on_chainer_classify_anime_characters_43_19.png

When we randomly selected ten images, nine of them were classified correctly. How about you?

Finally, let’s save a snapshot of the model, which may be useful later.

[ ]:
from chainer import serializers

serializers.save_npz('animeface.model', model)
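
To reuse this snapshot later (a sketch under the assumption that the classes defined above are available), rebuild the same model and load the saved weights.

```python
# Hypothetical reload: the architecture must match the one that was saved.
from chainer import serializers
import chainer.links as L

model = L.Classifier(Illust2Vec(n_classes))
serializers.load_npz('animeface.model', model)
```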
6. Extra 1: How to write a dataset class from scratch

To write a dataset class from scratch, prepare a class of your own that inherits from chainer.dataset.DatasetMixin. The class must implement the __len__ and get_example methods. For example, it looks like the following.

```python
class MyDataset(chainer.dataset.DatasetMixin):

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels

    def __len__(self):
        return len(self.image_paths)

    def get_example(self, i):
        img = Image.open(self.image_paths[i])
        img = np.asarray(img, dtype=np.float32)
        img = img.transpose(2, 0, 1)
        label = self.labels[i]
        return img, label
```

This class is instantiated with a list of image file paths and a list of labels arranged in the corresponding order. When an index is specified with the [] accessor, it loads the image from the corresponding path, pairs it with its label, and returns them as a tuple.

For example, it can be used as follows.

image_files = ['images/hoge_0_1.png', 'images/hoge_5_1.png', 'images/hoge_2_1.png', 'images/hoge_3_1.png', ...]
labels = [0, 5, 2, 3, ...]

dataset = MyDataset(image_files, labels)

img, label = dataset[2]

#=> it will return the image data and its label of 'images/hoge_2_1.png'.

This object can be passed directly to an Iterator and used for training with the Trainer. In other words,

train_iter = iterators.MultiprocessIterator(dataset, batch_size=128)

we can create an iterator like this and pass it to the updater together with the Optimizer, as sketched below.
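
For example, the wiring might look like the following sketch (the model, batch size and epoch count are placeholders, following the same pattern as the training section above).

```python
# Hedged sketch: plug the custom dataset's iterator into an updater and a trainer.
from chainer import iterators, optimizers, training

train_iter = iterators.MultiprocessIterator(dataset, 128)

optimizer = optimizers.MomentumSGD(lr=0.01)
optimizer.setup(model)  # `model` is whatever network you want to train

# device=0 assumes a GPU runtime; use device=-1 for CPU
updater = training.StandardUpdater(train_iter, optimizer, device=0)
trainer = training.Trainer(updater, (10, 'epoch'), out='result')
trainer.run()
```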

7. Extra 2: How to make the simplest dataset object

Actually, the simplest dataset for the Trainer is just a Python list. In other words, if you can get its length with len() and retrieve its elements with the [] accessor, you can treat it as a dataset object. For example,

data_list = [(x1, t1), (x2, t2), ...]

If you make a list of tuples such as (data, label), you can pass them to the Iterator.

train_iter = iterators.MultiprocessIterator(data_list, batch_size=128)

However, the drawback of this approach is that you have to load the entire dataset into memory before training. To avoid this, the combination of ImageDataset and TupleDataset, as well as LabeledImageDataset, is provided; a small sketch follows the link below. Please refer to the documentation for details.

http://docs.chainer.org/en/stable/reference/datasets.html#general-datasets
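
For illustration (the file names are hypothetical), combining ImageDataset with TupleDataset keeps images on disk until they are actually accessed.

```python
import numpy as np
from chainer.datasets import ImageDataset, TupleDataset

image_files = ['images/a.png', 'images/b.png']  # hypothetical paths
labels = np.array([0, 1], dtype=np.int32)

images = ImageDataset(image_files)      # images are read from disk on access
dataset = TupleDataset(images, labels)  # dataset[i] -> (img, label), loaded lazily
```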

[ ]:

Chainer RL

ChainerRL Quickstart Guide

This is a quickstart guide for users who just want to try ChainerRL for the first time.

Run the command below to install ChainerRL:

[1]:
# Install Chainer, ChainerRL and CuPy!

!curl https://colab.chainer.org/install | sh -
!apt-get -qq -y install xvfb freeglut3-dev ffmpeg > /dev/null
!pip -q install chainerrl
!pip -q install gym
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay
Extracting templates from packages: 100%

First, you need to import necessary modules. The module name of ChainerRL is chainerrl. Let’s import gym and numpy as well since they are used later.

[2]:
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np
/usr/local/lib/python3.6/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')

ChainerRL can be used for any problems if they are modeled as “environments”. OpenAI Gym provides various kinds of benchmark environments and defines the common interface among them. ChainerRL uses a subset of the interface. Specifically, an environment must define its observation space and action space and have at least two methods: reset and step.

  • env.reset will reset the environment to the initial state and return the initial observation.
  • env.step will execute a given action, move to the next state and return four values:
      • a next observation
      • a scalar reward
      • a boolean value indicating whether the current state is terminal or not
      • additional information
  • env.render will render the current state.

Let’s try ‘CartPole-v0’, which is a classic control problem. You can see below that its observation space consists of four real numbers while its action space consists of two discrete actions.

[3]:
env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
#env.render()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
observation space: Box(4,)
action space: Discrete(2)
initial observation: [ 0.03962616 -0.00805331 -0.03614126  0.03048748]
next observation: [ 0.03946509 -0.20263884 -0.03553151  0.31155195]
reward: 1.0
done: False
info: {}

Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

ChainerRL provides various agents, each of which implements a deep reinforcement learning algorithm.

To use DQN (Deep Q-Network), you need to define a Q-function that receives an observation and returns an expected future return for each action the agent can take. In ChainerRL, you can define your Q-function as a chainer.Link as shown below. Note that the outputs are wrapped by chainerrl.action_value.DiscreteActionValue, which implements chainerrl.action_value.ActionValue. By wrapping the outputs of Q-functions, ChainerRL can treat discrete-action Q-functions like this one and NAFs (Normalized Advantage Functions) in the same way.

[ ]:
class QFunction(chainer.Chain):

    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super().__init__()
        with self.init_scope():
            self.l0 = L.Linear(obs_size, n_hidden_channels)
            self.l1 = L.Linear(n_hidden_channels, n_hidden_channels)
            self.l2 = L.Linear(n_hidden_channels, n_actions)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)

If you want to use CUDA for computation, call to_gpu, just as in Chainer.

When using Colaboratory, you need to change the runtime type to GPU.

[5]:
q_func.to_gpu(0)
[5]:
<__main__.QFunction at 0x7effb079e3c8>

You can also use ChainerRL’s predefined Q-functions.

[ ]:
_q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions,
    n_hidden_layers=2, n_hidden_channels=50)

As in Chainer, chainer.Optimizer is used to update models.

[ ]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

A Q-function and its optimizer are used by a DQN agent. To create a DQN agent, you need to specify a few more parameters and configurations.

[ ]:
# Set the discount factor that discounts future rewards.
gamma = 0.95

# Use epsilon-greedy for exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)

Now you have an agent and an environment. It’s time to start reinforcement learning!

In training, use agent.act_and_train to select exploratory actions. agent.stop_episode_and_train must be called after finishing an episode. You can get training statistics of the agent via agent.get_statistics.

[9]:
n_episodes = 200
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while not done and t < max_episode_len:
        # Uncomment to watch the behaviour
        # env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i,
              'R:', R,
              'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
episode: 10 R: 54.0 statistics: [('average_q', 0.3839775436296336), ('average_loss', 0.11211375439623882)]
episode: 20 R: 74.0 statistics: [('average_q', 3.356048617398484), ('average_loss', 0.08360401755686123)]
episode: 30 R: 66.0 statistics: [('average_q', 6.465730209073646), ('average_loss', 0.15742219333446614)]
episode: 40 R: 182.0 statistics: [('average_q', 9.854616982127487), ('average_loss', 0.16397699776876554)]
episode: 50 R: 116.0 statistics: [('average_q', 12.850724195092248), ('average_loss', 0.141014359570396)]
episode: 60 R: 200.0 statistics: [('average_q', 16.680755617341624), ('average_loss', 0.15486771810916689)]
episode: 70 R: 200.0 statistics: [('average_q', 18.60101457834084), ('average_loss', 0.13990398771960172)]
episode: 80 R: 200.0 statistics: [('average_q', 19.611751582138908), ('average_loss', 0.169348575205351)]
episode: 90 R: 200.0 statistics: [('average_q', 19.979411869969834), ('average_loss', 0.15618550247257176)]
episode: 100 R: 200.0 statistics: [('average_q', 20.1084139808058), ('average_loss', 0.16387995202882835)]
episode: 110 R: 68.0 statistics: [('average_q', 20.125493464098238), ('average_loss', 0.14188708221665755)]
episode: 120 R: 200.0 statistics: [('average_q', 19.981348423218275), ('average_loss', 0.12173593674987096)]
episode: 130 R: 200.0 statistics: [('average_q', 20.031584503682154), ('average_loss', 0.14900986264764007)]
episode: 140 R: 181.0 statistics: [('average_q', 19.969489587497048), ('average_loss', 0.08019790542958775)]
episode: 150 R: 200.0 statistics: [('average_q', 20.0445616818784), ('average_loss', 0.17976971012090015)]
episode: 160 R: 173.0 statistics: [('average_q', 20.004161140161834), ('average_loss', 0.1392587406221566)]
episode: 170 R: 104.0 statistics: [('average_q', 20.00619890615657), ('average_loss', 0.1589133686481899)]
episode: 180 R: 200.0 statistics: [('average_q', 19.988814191729215), ('average_loss', 0.11023728141409249)]
episode: 190 R: 183.0 statistics: [('average_q', 19.893458825764306), ('average_loss', 0.10419487772551624)]
episode: 200 R: 199.0 statistics: [('average_q', 19.940461710890656), ('average_loss', 0.15900440799351787)]
Finished.

Now you have finished training the agent. How good is it? You can test it by using agent.act and agent.stop_episode instead. Exploration such as epsilon-greedy is no longer used.

[ ]:
# Start virtual display
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
import os
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display.screen)

[11]:
frames = []
for i in range(3):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        frames.append(env.render(mode = 'rgb_array'))
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()
env.render()

import matplotlib.pyplot as plt
import matplotlib.animation
import numpy as np
from IPython.display import HTML

plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
patch = plt.imshow(frames[0])
plt.axis('off')
animate = lambda i: patch.set_data(frames[i])
ani = matplotlib.animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval = 50)
HTML(ani.to_jshtml())
test episode: 0 R: 200.0
test episode: 1 R: 200.0
test episode: 2 R: 200.0
[11]:


_images/notebook_hands_on_chainerrl_quickstart_20_2.png

Record a video file.

[12]:

# wrap env for recording video
envw = gym.wrappers.Monitor(env, "./", force=True)

for i in range(3):
    obs = envw.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        envw.render()
        action = agent.act(obs)
        obs, r, done, _ = envw.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()
test episode: 0 R: 200.0
test episode: 1 R: 200.0
test episode: 2 R: 200.0

Download the recorded videos.

[ ]:
from google.colab import files
import glob

for file in glob.glob("openaigym.video.*.mp4"):
  files.download(file)

Then remove the video files.

[ ]:
!rm openaigym.video.*

If test scores are good enough, the only remaining task is to save the agent so that you can reuse it. What you need to do is to simply call agent.save to save the agent, then agent.load to load the saved agent.

[ ]:
# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

RL completed!

But writing code like this every time you use RL can be tedious, so ChainerRL provides utility functions that do these things for you.

[16]:
# Set up the logger to print info messages for understandability.
import logging
import sys
gym.undo_logger_setup()  # Turn off gym's default logger settings
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

chainerrl.experiments.train_agent_with_evaluation(
    agent, env,
    steps=2000,           # Train the agent for 2000 steps
    eval_n_runs=10,       # 10 episodes are sampled for each evaluation
    max_episode_len=200,  # Maximum length of each episodes
    eval_interval=1000,   # Evaluate the agent after every 1000 steps
    outdir='result')      # Save everything to 'result' directory
/usr/local/lib/python3.6/dist-packages/gym/__init__.py:15: UserWarning: gym.undo_logger_setup is deprecated. gym no longer modifies the global logging configuration
  warnings.warn("gym.undo_logger_setup is deprecated. gym no longer modifies the global logging configuration")
outdir:result step:200 episode:0 R:200.0
statistics:[('average_q', 20.13107348407955), ('average_loss', 0.1130567486698384)]
outdir:result step:320 episode:1 R:120.0
statistics:[('average_q', 20.134093816794454), ('average_loss', 0.13519476892439852)]
outdir:result step:520 episode:2 R:200.0
statistics:[('average_q', 20.09233843875654), ('average_loss', 0.1332404190763901)]
outdir:result step:720 episode:3 R:200.0
statistics:[('average_q', 20.081831597545516), ('average_loss', 0.13068583669631)]
outdir:result step:901 episode:4 R:181.0
statistics:[('average_q', 19.99495162254429), ('average_loss', 0.09401080450214364)]
outdir:result step:1101 episode:5 R:200.0
statistics:[('average_q', 20.014892631038933), ('average_loss', 0.11939343070713773)]
test episode: 0 R: 200.0
test episode: 1 R: 200.0
test episode: 2 R: 200.0
test episode: 3 R: 200.0
test episode: 4 R: 200.0
test episode: 5 R: 200.0
test episode: 6 R: 200.0
test episode: 7 R: 200.0
test episode: 8 R: 200.0
test episode: 9 R: 200.0
The best score is updated -3.4028235e+38 -> 200.0
Saved the agent to result/1101
outdir:result step:1291 episode:6 R:190.0
statistics:[('average_q', 19.936340675579885), ('average_loss', 0.1115743888475369)]
outdir:result step:1491 episode:7 R:200.0
statistics:[('average_q', 19.923170098629676), ('average_loss', 0.1098893872285867)]
outdir:result step:1672 episode:8 R:181.0
statistics:[('average_q', 19.831724256166893), ('average_loss', 0.11151171360379805)]
outdir:result step:1842 episode:9 R:170.0
statistics:[('average_q', 19.753546435176624), ('average_loss', 0.10779849649639554)]
outdir:result step:2000 episode:10 R:158.0
statistics:[('average_q', 19.814065306106478), ('average_loss', 0.07133777467302949)]
test episode: 0 R: 184.0
test episode: 1 R: 200.0
test episode: 2 R: 179.0
test episode: 3 R: 174.0
test episode: 4 R: 198.0
test episode: 5 R: 179.0
test episode: 6 R: 185.0
test episode: 7 R: 191.0
test episode: 8 R: 198.0
test episode: 9 R: 188.0
Saved the agent to result/2000_finish

That’s all for the ChainerRL quickstart guide. To learn more about ChainerRL, please look into the examples directory, then read and run the examples. Thank you!

[ ]:

Official Example

DCGAN: Generate the images with Deep Convolutional GAN

Note: This notebook is created from chainer/examples/dcgan. If you want to run it as a script, please refer to the above link.

In this notebook, we generate images with a generative adversarial network (GAN).

Generated Image

First, we execute the following cell and install “Chainer” and its GPU back end “CuPy”. If the “runtime type” of Colaboratory is GPU, you can run Chainer with GPU as a backend.

[2]:
!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  libcusparse8.0 libnvrtc8.0 libnvtoolsext1
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 28.9 MB of archives.
After this operation, 71.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libcusparse8.0 amd64 8.0.61-1 [22.6 MB]
Get:2 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvrtc8.0 amd64 8.0.61-1 [6,225 kB]
Get:3 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvtoolsext1 amd64 8.0.61-1 [32.2 kB]
Fetched 28.9 MB in 2s (14.4 MB/s)
Selecting previously unselected package libcusparse8.0:amd64.
(Reading database ... 18408 files and directories currently installed.)
Preparing to unpack .../libcusparse8.0_8.0.61-1_amd64.deb ...
Unpacking libcusparse8.0:amd64 (8.0.61-1) ...
Selecting previously unselected package libnvrtc8.0:amd64.
Preparing to unpack .../libnvrtc8.0_8.0.61-1_amd64.deb ...
Unpacking libnvrtc8.0:amd64 (8.0.61-1) ...
Selecting previously unselected package libnvtoolsext1:amd64.
Preparing to unpack .../libnvtoolsext1_8.0.61-1_amd64.deb ...
Unpacking libnvtoolsext1:amd64 (8.0.61-1) ...
Setting up libnvtoolsext1:amd64 (8.0.61-1) ...
Setting up libcusparse8.0:amd64 (8.0.61-1) ...
Setting up libnvrtc8.0:amd64 (8.0.61-1) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...

Let’s import the necessary modules, then check the version of Chainer, NumPy, CuPy, Cuda and other execution environments.

[ ]:
import os

import numpy as np

import chainer
from chainer import cuda
import chainer.functions as F
import chainer.links as L
from chainer import Variable
from chainer.training import extensions


chainer.print_runtime_info()
Chainer: 4.4.0
NumPy: 1.14.5
CuPy:
  CuPy Version          : 4.4.1
  CUDA Root             : None
  CUDA Build Version    : 8000
  CUDA Driver Version   : 9000
  CUDA Runtime Version  : 8000
  cuDNN Build Version   : 7102
  cuDNN Version         : 7102
  NCCL Build Version    : 2213

1. Setting parameters

Here we set the parameters for training.

  • n_epoch: Epoch number. How many times we pass through the whole training data.
  • n_hidden: Number of hidden units. The dimensionality of the noise vector z fed to the generator.
  • batchsize: Batch size. How many training examples we input as a block when updating parameters.
  • snapshot_interval: Number of iterations between snapshots.
  • display_interval: Number of iterations between status reports.
  • gpu_id: GPU ID. The ID of the GPU to use. For Colaboratory it is good to use 0.
  • out_dir: Output directory for the results.
  • seed: Random seed used when generating preview images.
[ ]:
# parameters
n_epoch = 100  # number of epochs
n_hidden = 100  # number of hidden units
batchsize = 50  # minibatch size
snapshot_interval = 10000  # number of iterations per snapshots
display_interval = 100  # number of iterations per display the status
gpu_id = 0
out_dir = 'result'
seed = 0  # random seed

2. Preparation of training data and iterator

In this notebook, we will use the training data which are preprocessed by chainer.datasets.get_cifar10.

From Wikipedia, it says

The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research.The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.

Let’s retrieve the CIFAR-10 dataset by using Chainer’s dataset utility function get_cifar10. CIFAR-10 is a set of small natural images. Each example is an RGB color image of size 32x32. In the original images, each component of pixels is represented by one-byte unsigned integer. This function scales the components to floating point values in the interval [0, scale].

[ ]:
# Load the CIFAR-10 dataset (images only, scaled to [0, 255])
train, _ = chainer.datasets.get_cifar10(withlabel=False, scale=255.)
Downloading from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz...
[ ]:
train_iter = chainer.iterators.SerialIterator(train, batchsize)

3. Preparation of the model

Let’s define the network. We will create the model called DCGAN (Deep Convolutional GAN). As its name suggests, it is a model that uses CNNs (Convolutional Neural Networks), as shown below.

DCGAN cited from [1]

First, let’s define a network for the generator.

[ ]:
class Generator(chainer.Chain):

    def __init__(self, n_hidden, bottom_width=4, ch=512, wscale=0.02):
        super(Generator, self).__init__()
        self.n_hidden = n_hidden
        self.ch = ch
        self.bottom_width = bottom_width

        with self.init_scope():
            w = chainer.initializers.Normal(wscale)
            self.l0 = L.Linear(self.n_hidden, bottom_width * bottom_width * ch,
                               initialW=w)
            self.dc1 = L.Deconvolution2D(ch, ch // 2, 4, 2, 1, initialW=w)
            self.dc2 = L.Deconvolution2D(ch // 2, ch // 4, 4, 2, 1, initialW=w)
            self.dc3 = L.Deconvolution2D(ch // 4, ch // 8, 4, 2, 1, initialW=w)
            self.dc4 = L.Deconvolution2D(ch // 8, 3, 3, 1, 1, initialW=w)
            self.bn0 = L.BatchNormalization(bottom_width * bottom_width * ch)
            self.bn1 = L.BatchNormalization(ch // 2)
            self.bn2 = L.BatchNormalization(ch // 4)
            self.bn3 = L.BatchNormalization(ch // 8)

    def make_hidden(self, batchsize):
        return np.random.uniform(-1, 1, (batchsize, self.n_hidden, 1, 1)).astype(np.float32)

    def __call__(self, z):
        h = F.reshape(F.relu(self.bn0(self.l0(z))),
                                (len(z), self.ch, self.bottom_width, self.bottom_width))
        h = F.relu(self.bn1(self.dc1(h)))
        h = F.relu(self.bn2(self.dc2(h)))
        h = F.relu(self.bn3(self.dc3(h)))
        x = F.sigmoid(self.dc4(h))
        return x

When we make a network in Chainer, we should follow these rules:

  1. Define a network class that inherits from Chain.
  2. Create the chainer.links instances inside the init_scope(): block of the initializer __init__.
  3. Connect the chainer.links instances with chainer.functions to build the whole network.

If you are not familiar with constructing a new network, you can read this tutorial.
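
As a minimal sketch (not part of the DCGAN example), a tiny network following these three rules could look like this:

```python
import chainer
import chainer.functions as F
import chainer.links as L


class TinyMLP(chainer.Chain):          # rule 1: inherit from Chain

    def __init__(self, n_out):
        super(TinyMLP, self).__init__()
        with self.init_scope():        # rule 2: create links inside init_scope
            self.l1 = L.Linear(None, 100)
            self.l2 = L.Linear(100, n_out)

    def __call__(self, x):             # rule 3: connect links with functions
        h = F.relu(self.l1(x))
        return self.l2(h)
```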

As we can see from the initializer __init__, the Generator uses the deconvolution layer Deconvolution2D and batch normalization BatchNormalization. In __call__, the layers are connected with relu activations, except for the last layer.

Because the first argument of L.Deconvolution2D is the number of input channels and the second is the number of output channels, we can see that each layer halves the channel size. When we construct the Generator with ch=1024, the network matches the figure above.


Note

Be careful when connecting a fully connected layer’s output to a convolutional layer’s input. As we can see in the first line of __call__, the output has to be reshaped with reshape before it is passed to the deconvolution layers.

In addition, let’s define a network for the discriminator.

[ ]:
class Discriminator(chainer.Chain):

    def __init__(self, bottom_width=4, ch=512, wscale=0.02):
        w = chainer.initializers.Normal(wscale)
        super(Discriminator, self).__init__()
        with self.init_scope():
            self.c0_0 = L.Convolution2D(3, ch // 8, 3, 1, 1, initialW=w)
            self.c0_1 = L.Convolution2D(ch // 8, ch // 4, 4, 2, 1, initialW=w)
            self.c1_0 = L.Convolution2D(ch // 4, ch // 4, 3, 1, 1, initialW=w)
            self.c1_1 = L.Convolution2D(ch // 4, ch // 2, 4, 2, 1, initialW=w)
            self.c2_0 = L.Convolution2D(ch // 2, ch // 2, 3, 1, 1, initialW=w)
            self.c2_1 = L.Convolution2D(ch // 2, ch // 1, 4, 2, 1, initialW=w)
            self.c3_0 = L.Convolution2D(ch // 1, ch // 1, 3, 1, 1, initialW=w)
            self.l4 = L.Linear(bottom_width * bottom_width * ch, 1, initialW=w)
            self.bn0_1 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_0 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_1 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_0 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_1 = L.BatchNormalization(ch // 1, use_gamma=False)
            self.bn3_0 = L.BatchNormalization(ch // 1, use_gamma=False)

    def __call__(self, x):
        h = add_noise(x)
        h = F.leaky_relu(add_noise(self.c0_0(h)))
        h = F.leaky_relu(add_noise(self.bn0_1(self.c0_1(h))))
        h = F.leaky_relu(add_noise(self.bn1_0(self.c1_0(h))))
        h = F.leaky_relu(add_noise(self.bn1_1(self.c1_1(h))))
        h = F.leaky_relu(add_noise(self.bn2_0(self.c2_0(h))))
        h = F.leaky_relu(add_noise(self.bn2_1(self.c2_1(h))))
        h = F.leaky_relu(add_noise(self.bn3_0(self.c3_0(h))))
        return self.l4(h)

The Discriminator network is almost a mirror image of the Generator. However, there are a few differences:

  1. It uses leaky_relu as the activation function.
  2. It is deeper than the Generator.
  3. It adds some noise to the input of each layer (see add_noise below).
[ ]:
def add_noise(h, sigma=0.2):
    xp = cuda.get_array_module(h.data)
    if chainer.config.train:
        return h + sigma * xp.random.randn(*h.shape)
    else:
        return h

Let’s make the instances of the Generator and the Discriminator.

[ ]:
gen = Generator(n_hidden=n_hidden)
dis = Discriminator()

4. Preparing Optimizer

Next, let’s make optimizers for the models created above.

[ ]:
# Setup an optimizer
def make_optimizer(model, alpha=0.0002, beta1=0.5):
    optimizer = chainer.optimizers.Adam(alpha=alpha, beta1=beta1)
    optimizer.setup(model)
    optimizer.add_hook(
        chainer.optimizer_hooks.WeightDecay(0.0001), 'hook_dec')
    return optimizer
[ ]:
opt_gen = make_optimizer(gen)
opt_dis = make_optimizer(dis)

5. Preparation and training of Updater · Trainer

A GAN needs two models: the generator and the discriminator. The default updaters pre-defined in Chainer usually take only one model, so we need to define a custom updater for GAN training.

The definition of DCGANUpdater is a little complicated, but it simply minimizes the loss of the discriminator and that of the generator alternately. We will explain how the models are updated.

As you can see in the class definition, DCGANUpdater inherits from StandardUpdater. Since almost all the necessary functionality is already provided by StandardUpdater, we only override __init__ and update_core.

[ ]:
class DCGANUpdater(chainer.training.updaters.StandardUpdater):

    def __init__(self, *args, **kwargs):
        self.gen, self.dis = kwargs.pop('models')
        super(DCGANUpdater, self).__init__(*args, **kwargs)

    def loss_dis(self, dis, y_fake, y_real):
        batchsize = len(y_fake)
        L1 = F.sum(F.softplus(-y_real)) / batchsize
        L2 = F.sum(F.softplus(y_fake)) / batchsize
        loss = L1 + L2
        chainer.report({'loss': loss}, dis)
        return loss

    def loss_gen(self, gen, y_fake):
        batchsize = len(y_fake)
        loss = F.sum(F.softplus(-y_fake)) / batchsize
        chainer.report({'loss': loss}, gen)
        return loss

    def update_core(self):
        gen_optimizer = self.get_optimizer('gen')
        dis_optimizer = self.get_optimizer('dis')

        batch = self.get_iterator('main').next()
        x_real = Variable(self.converter(batch, self.device)) / 255.
        xp = chainer.backends.cuda.get_array_module(x_real.data)

        gen, dis = self.gen, self.dis
        batchsize = len(batch)

        y_real = dis(x_real)

        z = Variable(xp.asarray(gen.make_hidden(batchsize)))
        x_fake = gen(z)
        y_fake = dis(x_fake)

        dis_optimizer.update(self.loss_dis, dis, y_fake, y_real)
        gen_optimizer.update(self.loss_gen, gen, y_fake)

In the initializer __init__, an additional keyword argument models is required, as you can see in the code below. We also use the keyword arguments iterator, optimizer and device. Be careful with the optimizer: we need not only two models but also two optimizers, so we pass optimizer as the dictionary {'gen': opt_gen, 'dis': opt_dis}. Inside DCGANUpdater, you can access the iterator with self.get_iterator('main') and the optimizers with self.get_optimizer('gen') and self.get_optimizer('dis').

In update_core, the two loss functions loss_dis and loss_gen are minimized by the optimizers. In the first two lines, we access the optimizers. Then we fetch the next batch of training data with self.get_iterator('main').next() and convert it to x_real so that it is suitable for self.device (e.g. GPU or CPU). After that, we minimize the loss functions with the optimizers.
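
For reference (a short derivation, not spelled out in the original notebook), the loss functions above correspond to the standard non-saturating GAN objective. Writing D(x) for the discriminator’s raw output, G(z) for a generated sample, sigma for the sigmoid, and using softplus(a) = log(1 + e^a):

```latex
\begin{aligned}
L_{\mathrm{dis}} &= \frac{1}{N}\sum\Big[\operatorname{softplus}\big(-D(x_{\mathrm{real}})\big)
                   + \operatorname{softplus}\big(D(G(z))\big)\Big]
  = -\frac{1}{N}\sum\Big[\log \sigma\big(D(x_{\mathrm{real}})\big)
                   + \log\big(1-\sigma(D(G(z)))\big)\Big],\\
L_{\mathrm{gen}} &= \frac{1}{N}\sum \operatorname{softplus}\big(-D(G(z))\big)
  = -\frac{1}{N}\sum \log \sigma\big(D(G(z))\big).
\end{aligned}
```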

[ ]:
updater = DCGANUpdater(
    models=(gen, dis),
    iterator=train_iter,
    optimizer={
        'gen': opt_gen, 'dis': opt_dis},
    device=gpu_id)
trainer = chainer.training.Trainer(updater, (n_epoch, 'epoch'), out=out_dir)
[ ]:
from PIL import Image
import chainer.backends.cuda


def out_generated_image(gen, dis, rows, cols, seed, dst):
    @chainer.training.make_extension()
    def make_image(trainer):
        np.random.seed(seed)
        n_images = rows * cols
        xp = gen.xp
        z = Variable(xp.asarray(gen.make_hidden(n_images)))
        with chainer.using_config('train', False):
            x = gen(z)
        x = chainer.backends.cuda.to_cpu(x.data)
        np.random.seed()

        x = np.asarray(np.clip(x * 255, 0.0, 255.0), dtype=np.uint8)
        _, _, H, W = x.shape
        x = x.reshape((rows, cols, 3, H, W))
        x = x.transpose(0, 3, 1, 4, 2)
        x = x.reshape((rows * H, cols * W, 3))

        preview_dir = '{}/preview'.format(dst)
        preview_path = preview_dir +\
            '/image{:0>8}.png'.format(trainer.updater.iteration)
        if not os.path.exists(preview_dir):
            os.makedirs(preview_dir)
        Image.fromarray(x).save(preview_path)
    return make_image
[ ]:
snapshot_interval = (snapshot_interval, 'iteration')
display_interval = (display_interval, 'iteration')
trainer.extend(
    extensions.snapshot(filename='snapshot_iter_{.updater.iteration}.npz'),
    trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    gen, 'gen_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    dis, 'dis_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.LogReport(trigger=display_interval))
trainer.extend(extensions.PrintReport([
    'epoch', 'iteration', 'gen/loss', 'dis/loss',
]), trigger=display_interval)
trainer.extend(extensions.ProgressBar(update_interval=100))
trainer.extend(
    out_generated_image(
        gen, dis,
        10, 10, seed, out_dir),
    trigger=snapshot_interval)
[ ]:
# Run the training
trainer.run()

6. Checking the performance with test data

[ ]:
%%bash
ls result/preview
image00010000.png
image00020000.png
image00030000.png
image00040000.png
image00050000.png
image00060000.png
image00070000.png
image00080000.png
image00090000.png
image00100000.png
[ ]:
from IPython.display import Image, display_png
import glob

image_files = sorted(glob.glob(out_dir + '/preview/*.png'))
[ ]:
display_png(Image(image_files[0]))  # first snapshot
_images/notebook_official_example_dcgan_33_0.png
[ ]:
display_png(Image(image_files[-1]))  # last snapshot
_images/notebook_official_example_dcgan_34_0.png

Reference

[1] [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)

Sentiment Analysis with Recursive Neural Networks

Note: This notebook is created from chainer/examples/sentiment. If you want to run it as a script, please refer to the above link.

In this notebook, we analyze the sentiment of documents using a Recursive Neural Network.

First, we execute the following cell and install “Chainer” and its GPU back end “CuPy”. If the “runtime type” of Colaboratory is GPU, you can run Chainer with GPU as a backend.

[1]:
!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  libcusparse8.0 libnvrtc8.0 libnvtoolsext1
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 28.9 MB of archives.
After this operation, 71.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libcusparse8.0 amd64 8.0.61-1 [22.6 MB]
Get:2 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvrtc8.0 amd64 8.0.61-1 [6,225 kB]
Get:3 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvtoolsext1 amd64 8.0.61-1 [32.2 kB]
Fetched 28.9 MB in 2s (10.4 MB/s)

Selecting previously unselected package libcusparse8.0:amd64.
(Reading database ... 18298 files and directories currently installed.)
Preparing to unpack .../libcusparse8.0_8.0.61-1_amd64.deb ...
Unpacking libcusparse8.0:amd64 (8.0.61-1) ...
Selecting previously unselected package libnvrtc8.0:amd64.
Preparing to unpack .../libnvrtc8.0_8.0.61-1_amd64.deb ...
Unpacking libnvrtc8.0:amd64 (8.0.61-1) ...
Selecting previously unselected package libnvtoolsext1:amd64.
Preparing to unpack .../libnvtoolsext1_8.0.61-1_amd64.deb ...
Unpacking libnvtoolsext1:amd64 (8.0.61-1) ...
Setting up libnvtoolsext1:amd64 (8.0.61-1) ...
Setting up libcusparse8.0:amd64 (8.0.61-1) ...
Setting up libnvrtc8.0:amd64 (8.0.61-1) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...

Let’s import the necessary modules, then check the version of Chainer, NumPy, CuPy, Cuda and other execution environments.

[12]:
import collections
import numpy as np

import chainer
from chainer import cuda
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
from chainer import reporter


chainer.print_runtime_info()
Chainer: 4.1.0
NumPy: 1.14.3
CuPy:
  CuPy Version          : 4.1.0
  CUDA Root             : None
  CUDA Build Version    : 8000
  CUDA Driver Version   : 9000
  CUDA Runtime Version  : 8000
  cuDNN Build Version   : 7102
  cuDNN Version         : 7102
  NCCL Build Version    : 2104

1. Preparation of training data

In this notebook, we use the training data preprocessed by chainer/examples/sentiment/download.py. Let’s run the following cells to download the necessary training data and unzip it.

[ ]:
# download.py
import os.path
from six.moves.urllib import request
import zipfile


request.urlretrieve(
    'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    'trainDevTestTrees_PTB.zip')
zf = zipfile.ZipFile('trainDevTestTrees_PTB.zip')
for name in zf.namelist():
    (dirname, filename) = os.path.split(name)
    if not filename == '':
        zf.extract(name, '.')

Let’s execute the following command and check that the training data has been prepared. The expected output is:

dev.txt  test.txt  train.txt

If the above output is displayed, the data is ready.

[14]:
!ls trees
dev.txt  test.txt  train.txt

Let’s look at the first line of dev.txt and see how each sample is written.

[15]:
!head trees/dev.txt -n1
(3 (2 It) (4 (4 (2 's) (4 (3 (2 a) (4 (3 lovely) (2 film))) (3 (2 with) (4 (3 (3 lovely) (2 performances)) (2 (2 by) (2 (2 (2 Buy) (2 and)) (2 Accorsi))))))) (2 .)))

As displayed above, each sample is defined by a tree structure.

The tree structure is recursively defined as (value, node), where value is the class label of the node.

The class labels 0 (really negative), 1 (negative), 2 (neutral), 3 (positive), and 4 (really positive) represent the sentiment of each node.

The representation of one such sample is illustrated below.
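
As a concrete illustration (a toy fragment written out by hand here, anticipating the parsing code in section 3, and not part of the dataset scripts), the subtree (4 (3 lovely) (2 film)) corresponds to the following nested (value, node) structure:

[ ]:
# Hand-written nested representation of the subtree "(4 (3 lovely) (2 film))".
# Each node is a list whose first element is the label (value); a leaf node
# additionally holds a word, and an internal node holds its two children.
subtree = ['4', ['3', 'lovely'], ['2', 'film']]

label = int(subtree[0])              # 4: label of the internal node
left, right = subtree[1], subtree[2]
print(label)                         # 4
print(int(left[0]), left[1])         # 3 lovely
print(int(right[0]), right[1])       # 2 film

The whole subtree has label 4, its left leaf is the word “lovely” with label 3, and its right leaf is the word “film” with label 2.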

2. Setting parameters

Here we set the parameters for training.

  • n_epoch: Epoch number. How many times we pass through the whole training data.
  • n_units: Number of units. The size of the hidden state vector at each node of the Recursive Neural Network.
  • batchsize: Batch size. How many training examples are fed as one block when updating the parameters.
  • n_label: Number of labels. The number of classes to be identified. Since there are 5 labels this time, it is 5.
  • epoch_per_eval: How often to perform validation.
  • is_test: If True, we use a small dataset.
  • gpu_id: GPU ID. The ID of the GPU to use. For Colaboratory it is good to use 0.

[ ]:
# parameters
n_epoch = 100  # number of epochs
n_units = 30  # number of units per layer
batchsize = 25  # minibatch size
n_label = 5  # number of labels
epoch_per_eval = 5  # number of epochs per evaluation
is_test = True
gpu_id = 0

if is_test:
    max_size = 10
else:
    max_size = None

3. Preparing the iterator

Let’s read the datasets used for training, validation and test, and create iterators from them.

First, we convert each sample represented as a str into tree-structured data represented as a dictionary.

We tokenize each string with read_corpus, which uses the parser SexpParser. After that, we convert each tokenized sample into tree-structured data with convert_tree. In this way, a label is expressed as an int, a node as a two-element tuple, and a tree structure as a dictionary, which is a more manageable data structure than the original string.

[ ]:
# data.py
import codecs
import re


class SexpParser(object):

    def __init__(self, line):
        self.tokens = re.findall(r'\(|\)|[^\(\) ]+', line)
        self.pos = 0

    def parse(self):
        assert self.pos < len(self.tokens)
        token = self.tokens[self.pos]
        assert token != ')'
        self.pos += 1

        if token == '(':
            children = []
            while True:
                assert self.pos < len(self.tokens)
                if self.tokens[self.pos] == ')':
                    self.pos += 1
                    break
                else:
                    children.append(self.parse())
            return children
        else:
            return token


def read_corpus(path, max_size):
    with codecs.open(path, encoding='utf-8') as f:
        trees = []
        for line in f:
            line = line.strip()
            tree = SexpParser(line).parse()
            trees.append(tree)
            if max_size and len(trees) >= max_size:
                break

    return trees


def convert_tree(vocab, exp):
    assert isinstance(exp, list) and (len(exp) == 2 or len(exp) == 3)

    if len(exp) == 2:
        label, leaf = exp
        if leaf not in vocab:
            vocab[leaf] = len(vocab)
        return {'label': int(label), 'node': vocab[leaf]}
    elif len(exp) == 3:
        label, left, right = exp
        node = (convert_tree(vocab, left), convert_tree(vocab, right))
        return {'label': int(label), 'node': node}

Let’s use read_corpus() and convert_tree() to create the iterators.

[ ]:
vocab = {}

train_data = [convert_tree(vocab, tree)
              for tree in read_corpus('trees/train.txt', max_size)]
train_iter = chainer.iterators.SerialIterator(train_data, batchsize)

validation_data = [convert_tree(vocab, tree)
                   for tree in read_corpus('trees/dev.txt', max_size)]
validation_iter = chainer.iterators.SerialIterator(validation_data, batchsize,
                                                   repeat=False, shuffle=False)

test_data = [convert_tree(vocab, tree)
             for tree in read_corpus('trees/test.txt', max_size)]

Let’s try to display the first element of test_data. It is represented by the following tree structure: label expresses the score of each node, and the numerical value of a leaf node corresponds to the word ID in the dictionary vocab.

[19]:
print(test_data[0])
{'label': 2, 'node': ({'label': 3, 'node': ({'label': 3, 'node': 252}, {'label': 2, 'node': 71})}, {'label': 1, 'node': ({'label': 1, 'node': 253}, {'label': 2, 'node': 254})})}

4. Preparing the model

Let’s define the network.

We traverse each node of the tree-structured data with traverse and calculate the loss of the whole tree. traverse is implemented as a recursive call that visits the child nodes in turn. (This is a common implementation when handling tree-structured data!)

First, we calculate the hidden state vector v. In the case of a leaf node, we obtain the hidden state vector stored in embed with model.leaf(word) from the word ID word. In the case of an intermediate node, the hidden state vector is calculated from the hidden state vectors left and right of the child nodes with v = model.node(left, right).

loss += F.softmax_cross_entropy(y, t) adds the loss of the current node to the loss of the child nodes, and return loss, v returns the accumulated loss to the parent node.

After the line loss += F.softmax_cross_entropy(y, t), there are some lines for logging accuracy and so on, but they are not necessary for the model definition itself.

[ ]:
class RecursiveNet(chainer.Chain):

    def traverse(self, node, evaluate=None, root=True):
        if isinstance(node['node'], int):
            # leaf node
            word = self.xp.array([node['node']], np.int32)
            loss = 0
            v = model.leaf(word)
        else:
            # internal node
            left_node, right_node = node['node']
            left_loss, left = self.traverse(left_node, evaluate=evaluate, root=False)
            right_loss, right = self.traverse(right_node, evaluate=evaluate, root=False)
            v = model.node(left, right)
            loss = left_loss + right_loss

        y = model.label(v)

        label = self.xp.array([node['label']], np.int32)
        t = chainer.Variable(label)
        loss += F.softmax_cross_entropy(y, t)

        predict = cuda.to_cpu(y.data.argmax(1))
        if predict[0] == node['label']:
            evaluate['correct_node'] += 1
        evaluate['total_node'] += 1

        if root:
            if predict[0] == node['label']:
                evaluate['correct_root'] += 1
            evaluate['total_root'] += 1

        return loss, v

    def __init__(self, n_vocab, n_units):
        super(RecursiveNet, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l = L.Linear(n_units * 2, n_units)
            self.w = L.Linear(n_units, n_label)

    def leaf(self, x):
        return self.embed(x)

    def node(self, left, right):
        return F.tanh(self.l(F.concat((left, right))))

    def label(self, v):
        return self.w(v)

    def __call__(self, x):
        accum_loss = 0.0
        result = collections.defaultdict(lambda: 0)
        for tree in x:
            loss, _ = self.traverse(tree, evaluate=result)
            accum_loss += loss

        reporter.report({'loss': accum_loss}, self)
        reporter.report({'total': result['total_node']}, self)
        reporter.report({'correct': result['correct_node']}, self)
        return accum_loss

One note about the implementation of __call__:

x passed to __call__ is mini-batched input data that contains samples like [s_1, s_2, ..., s_N].

In a network such as a Convolutional Network used for image recognition, the computation for a mini-batch x can be performed in parallel. However, in the case of a tree-structured network like this one, parallel computation is difficult for the following reasons.

  • Data length varies depending on samples.
  • The order of calculation for each sample is different.

So, the implementation computes each sample in turn and finally aggregates the results.

Note: Actually, you can perform parallel mini-batch computation in a Recursive Neural Network by using a stack. This is described in the (Advanced) section later in this notebook, so please refer to it.

[ ]:
model = RecursiveNet(len(vocab), n_units)

if gpu_id >= 0:
    model.to_gpu()

# Setup optimizer
optimizer = chainer.optimizers.AdaGrad(lr=0.1)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer_hooks.WeightDecay(0.0001))

5. Preparation and training of Updater · Trainer

As usual, we define an updater and a trainer to train the model. This time, we do not use L.Classifier but calculate the accuracy ourselves. You can easily implement this using extensions.MicroAverage. For details, please refer to chainer.training.extensions.MicroAverage.

[22]:
def _convert(batch, device):
    return batch

updater = chainer.training.StandardUpdater(
    train_iter, optimizer, device=gpu_id, converter=_convert)

trainer = chainer.training.Trainer(updater, (n_epoch, 'epoch'))
trainer.extend(
    extensions.Evaluator(validation_iter, model, device=gpu_id, converter=_convert),
    trigger=(epoch_per_eval, 'epoch'))
trainer.extend(extensions.LogReport())

trainer.extend(extensions.MicroAverage(
    'main/correct', 'main/total', 'main/accuracy'))
trainer.extend(extensions.MicroAverage(
    'validation/main/correct', 'validation/main/total',
    'validation/main/accuracy'))

trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
trainer.run()
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           1707.8                            0.155405                                 14.6668
2           586.467     556.419               0.497748       0.396175                  17.3996
3           421.267                           0.657658                                 19.3942
4           320.414     628.025               0.772523       0.42623                   22.2462
5           399.621                           0.704955                                 24.208
6           318.544     595.03                0.786036       0.420765                  27.0585
7           231.529                           0.880631                                 29.0178
8           160.546     628.959               0.916667       0.431694                  31.7562
9           122.076                           0.957207                                 33.8269
10          93.6623     669.898               0.975225       0.445355                  36.5802
11          74.2366                           0.986486                                 38.5855
12          60.2297     701.062               0.990991       0.448087                  41.4308
13          49.7152                           0.997748                                 43.414
14          41.633      724.893               0.997748       0.453552                  46.1698
15          35.3564                           0.997748                                 48.1999
16          30.402      744.493               1              0.448087                  50.9842
17          26.4137                           1                                        53.0605
18          23.188      760.43                1              0.459016                  55.7924
19          20.5913                           1                                        57.7479
20          18.4666     773.808               1              0.461749                  60.5636
21          16.698                            1                                        62.52
22          15.2066     785.205               1              0.461749                  65.2603
23          13.9351                           1                                        67.3052
24          12.8404     794.963               1              0.461749                  70.0323
25          11.8897                           1                                        71.9788
26          11.0575     803.388               1              0.459016                  74.7653
27          10.3237                           1                                        76.7485
28          9.67249     810.727               1              0.472678                  79.4539
29          9.09113                           1                                        81.4813
30          8.56935     817.176               1              0.480874                  84.1942
31          8.09874                           1                                        86.2475
32          7.6724      822.889               1              0.480874                  88.956
33          7.2846                            1                                        90.9035
34          6.93052     827.989               1              0.480874                  93.6949
35          6.60615                           1                                        95.6557
36          6.30805     832.574               1              0.486339                  98.4233
37          6.03332                           1                                        100.456
38          5.77941     836.724               1              0.486339                  103.18
39          5.5442                            1                                        105.129
40          5.32575     840.507               1              0.486339                  107.954

6. Checking the performance with test data

[23]:
def evaluate(model, test_trees):
    result = collections.defaultdict(lambda: 0)
    with chainer.using_config('train', False), chainer.no_backprop_mode():
        for tree in test_trees:
            model.traverse(tree, evaluate=result)
    acc_node = 100.0 * result['correct_node'] / result['total_node']
    acc_root = 100.0 * result['correct_root'] / result['total_root']
    print(' Node accuracy: {0:.2f} %% ({1:,d}/{2:,d})'.format(
        acc_node, result['correct_node'], result['total_node']))
    print(' Root accuracy: {0:.2f} %% ({1:,d}/{2:,d})'.format(
        acc_root, result['correct_root'], result['total_root']))

print('Test evaluation')
evaluate(model, test_data)
Test evaluation
 Node accuracy: 54.49 %% (170/312)
 Root accuracy: 50.00 %% (5/10)
(Advanced) Mini-batching in Recursive Neural Network[1]

It is difficult for a Recursive Neural Network to process mini-batched data in parallel for the following reasons.

  • Data length varies depending on samples.
  • The order of calculation for each sample is different.

However, by using a stack, a Recursive Neural Network can perform mini-batch computation in parallel.

Preparation of Dataset, Iterator

First, we convert the recursive computation of the Recursive Neural Network into a sequential computation that uses a stack.

For each tree in the dataset, a number is assigned to every node in post-order ("returning order") as follows.

Returning order is a way of numbering the nodes of a tree so that every child node receives a smaller number than its parent node. If you process the nodes in ascending order of these numbers, you always visit the child nodes before their parent node.
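
The following sketch (a toy tree and helper written only for illustration; the actual preprocessing is linearize_tree in the next cell) numbers a small tree in such an order. Every child receives a smaller index than its parent, so processing nodes in ascending index order always visits children first:

[ ]:
# Toy post-order ("returning order") numbering. The real linearize_tree below
# numbers all leaves first and internal nodes afterwards, but the key property
# is the same: every child index is smaller than its parent's index.
toy_tree = (('a', 'b'), 'c')   # ((leaf, leaf), leaf)

order = []

def number_nodes(node):
    if isinstance(node, tuple):              # internal node: children first
        left_idx = number_nodes(node[0])
        right_idx = number_nodes(node[1])
        order.append(('internal', left_idx, right_idx))
    else:                                    # leaf node
        order.append(('leaf', node))
    return len(order) - 1                    # index assigned to this node

number_nodes(toy_tree)
for i, entry in enumerate(order):
    print(i, entry)
# 0 ('leaf', 'a')
# 1 ('leaf', 'b')
# 2 ('internal', 0, 1)
# 3 ('leaf', 'c')
# 4 ('internal', 2, 3)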

[ ]:
def linearize_tree(vocab, root, xp=np):
    # Left node indexes for all parent nodes
    lefts = []
    # Right node indexes for all parent nodes
    rights = []
    # Parent node indexes
    dests = []
    # All labels to predict for all parent nodes
    labels = []

    # All words of leaf nodes
    words = []
    # Leaf labels
    leaf_labels = []

    # Current leaf node index
    leaf_index = [0]

    def traverse_leaf(exp):
        if len(exp) == 2:
            label, leaf = exp
            if leaf not in vocab:
                vocab[leaf] = len(vocab)
            words.append(vocab[leaf])
            leaf_labels.append(int(label))
            leaf_index[0] += 1
        elif len(exp) == 3:
            _, left, right = exp
            traverse_leaf(left)
            traverse_leaf(right)

    traverse_leaf(root)

    # Current internal node index
    node_index = leaf_index
    leaf_index = [0]

    def traverse_node(exp):
        if len(exp) == 2:
            leaf_index[0] += 1
            return leaf_index[0] - 1
        elif len(exp) == 3:
            label, left, right = exp
            l = traverse_node(left)
            r = traverse_node(right)

            lefts.append(l)
            rights.append(r)
            dests.append(node_index[0])
            labels.append(int(label))

            node_index[0] += 1
            return node_index[0] - 1

    traverse_node(root)
    assert len(lefts) == len(words) - 1

    return {
        'lefts': xp.array(lefts, 'i'),
        'rights': xp.array(rights, 'i'),
        'dests': xp.array(dests, 'i'),
        'words': xp.array(words, 'i'),
        'labels': xp.array(labels, 'i'),
        'leaf_labels': xp.array(leaf_labels, 'i'),
    }
[ ]:
xp = cuda.cupy if gpu_id >= 0 else np

vocab = {}

train_data = [linearize_tree(vocab, t, xp)
              for t in read_corpus('trees/train.txt', max_size)]
train_iter = chainer.iterators.SerialIterator(train_data, batchsize)

validation_data = [linearize_tree(vocab, t, xp)
                   for t in read_corpus('trees/dev.txt', max_size)]
validation_iter = chainer.iterators.SerialIterator(
    validation_data, batchsize, repeat=False, shuffle=False)

test_data = [linearize_tree(vocab, t, xp)
             for t in read_corpus('trees/test.txt', max_size)]

Let’s try to display the first element of test_data.

lefts contains the indices of the left child of each parent node, rights contains the indices of the right child of each parent node, dests contains the indices of the parent nodes themselves, words contains the word IDs of the leaf nodes, labels contains the labels of the parent nodes, and leaf_labels contains the labels of the leaf nodes.

[26]:
print(test_data[0])
{'lefts': array([0, 2, 4], dtype=int32), 'rights': array([1, 3, 5], dtype=int32), 'dests': array([4, 5, 6], dtype=int32), 'words': array([252,  71, 253, 254], dtype=int32), 'labels': array([3, 1, 2], dtype=int32), 'leaf_labels': array([3, 2, 1, 2], dtype=int32)}

Definition of mini-batchable models

A Recursive Neural Network performs two operations: operation A, which computes the embedding vector of a leaf node, and operation B, which computes the hidden state vector of a parent node from the hidden state vectors of its two child nodes.

For each sample, we assigned an index to each node in returning order. If you traverse the nodes in this order, you will find that operation A is performed at the leaf nodes and operation B at the other nodes.

This traversal can also be regarded as scanning the tree structure with a stack. A stack is a last-in, first-out data structure that supports two operations: a push operation to add data and a pop operation to retrieve the most recently pushed data.

For operation A, we push the calculation result onto the stack. For operation B, we pop two items and push the new calculation result.

Without the stack, we would have to traverse the nodes and decide between operation A and operation B individually, because the tree structure differs from sample to sample. By using the stack, however, trees with different structures can be processed by simply repeating the same step, which makes parallelization possible.
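
Before looking at the actual batched implementation, here is a toy shift-reduce walk-through (strings stand in for vectors, and the transition sequence is made up for illustration) that shows operation A and operation B acting on a plain Python list used as a stack:

[ ]:
# "SHIFT w" is operation A (push a leaf result), "REDUCE" is operation B
# (pop two results, combine them, push the combination). Strings are used
# instead of hidden state vectors so the behaviour is easy to follow.
transitions = [('SHIFT', 'lovely'), ('SHIFT', 'film'), ('REDUCE', None),
               ('SHIFT', '.'), ('REDUCE', None)]

stack = []
for op, word in transitions:
    if op == 'SHIFT':                   # operation A: model.leaf(word)
        stack.append(word)
    else:                               # operation B: model.node(left, right)
        right = stack.pop()
        left = stack.pop()
        stack.append('({} {})'.format(left, right))

print(stack)   # ['((lovely film) .)'] -- one result per sentence remains

Because every sample reduces to the same repeated push/pop step, samples with different tree shapes can be advanced one step at a time in lockstep; the thin-stack implementation below does exactly this in a batched, differentiable way.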

[ ]:
from chainer import cuda
from chainer.utils import type_check


class ThinStackSet(chainer.Function):
    """Set values to a thin stack."""

    def check_type_forward(self, in_types):
        type_check.expect(in_types.size() == 3)
        s_type, i_type, v_type = in_types
        type_check.expect(
            s_type.dtype.kind == 'f',
            i_type.dtype.kind == 'i',
            s_type.dtype == v_type.dtype,
            s_type.ndim == 3,
            i_type.ndim == 1,
            v_type.ndim == 2,
            s_type.shape[0] >= i_type.shape[0],
            i_type.shape[0] == v_type.shape[0],
            s_type.shape[2] == v_type.shape[1],
        )

    def forward(self, inputs):
        xp = cuda.get_array_module(*inputs)
        stack, indices, values = inputs
        stack[xp.arange(len(indices)), indices] = values
        return stack,

    def backward(self, inputs, grads):
        xp = cuda.get_array_module(*inputs)
        _, indices, _ = inputs
        g = grads[0]
        gv = g[xp.arange(len(indices)), indices]
        g[xp.arange(len(indices)), indices] = 0
        return g, None, gv


def thin_stack_set(s, i, x):
    return ThinStackSet()(s, i, x)

In addition, we use thin stack[2] instead of simple stack here.

Let the sentence length be \(I\) and the dimensionality of the hidden vector be \(D\). Then the thin stack can use memory efficiently by holding everything in a single matrix of size \((2I-1) \times D\).

A normal stack needs \(O(I^2 D)\) space, whereas the thin stack requires only \(O(ID)\).
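
As a rough numeric illustration (the sizes here are chosen only for this example), with \(I = 25\) and \(D = 30\) the thin stack is a single matrix of \((2 \times 25 - 1) \times 30 = 49 \times 30 = 1470\) elements for the whole sentence, instead of keeping a separate copy of the stack at each of the \(O(I)\) processing steps.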

It is realized by the push operation thin_stack_set and the pop operation thin_stack_get.

First of all, we define ThinStackSet and ThinStackGet which inherit chainer.Function.

ThinStackSet is literally a function to set values on the thin stack.

inputs in forward and backward can be broken down like stack, indices, values = inputs.

The stack itself is shared between function calls by being passed as a function argument.

Because chainer.Function does not hold internal state, the stack is handled externally and passed in explicitly.

[ ]:
class ThinStackGet(chainer.Function):

    def check_type_forward(self, in_types):
        type_check.expect(in_types.size() == 2)
        s_type, i_type = in_types
        type_check.expect(
            s_type.dtype.kind == 'f',
            i_type.dtype.kind == 'i',
            s_type.ndim == 3,
            i_type.ndim == 1,
            s_type.shape[0] >= i_type.shape[0],
        )

    def forward(self, inputs):
        xp = cuda.get_array_module(*inputs)
        stack, indices = inputs
        return stack[xp.arange(len(indices)), indices], stack

    def backward(self, inputs, grads):
        xp = cuda.get_array_module(*inputs)
        stack, indices = inputs
        g, gs = grads
        if gs is None:
            gs = xp.zeros_like(stack)
        if g is not None:
            gs[xp.arange(len(indices)), indices] += g
        return gs, None


def thin_stack_get(s, i):
    return ThinStackGet()(s, i)

ThinStackGet is literally a function to retrieve values from the thin stack.

inputs in forward and backward can be broken down like stack, indices = inputs.

[ ]:
class ThinStackRecursiveNet(chainer.Chain):

    def __init__(self, n_vocab, n_units, n_label):
        super(ThinStackRecursiveNet, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            l=L.Linear(n_units * 2, n_units),
            w=L.Linear(n_units, n_label))
        self.n_units = n_units

    def leaf(self, x):
        return self.embed(x)

    def node(self, left, right):
        return F.tanh(self.l(F.concat((left, right))))

    def label(self, v):
        return self.w(v)

    def __call__(self, *inputs):
        batch = len(inputs) // 6
        lefts = inputs[0: batch]
        rights = inputs[batch: batch * 2]
        dests = inputs[batch * 2: batch * 3]
        labels = inputs[batch * 3: batch * 4]
        sequences = inputs[batch * 4: batch * 5]
        leaf_labels = inputs[batch * 5: batch * 6]

        inds = np.argsort([-len(l) for l in lefts])
        # Sort all arrays in descending order and transpose them
        lefts = F.transpose_sequence([lefts[i] for i in inds])
        rights = F.transpose_sequence([rights[i] for i in inds])
        dests = F.transpose_sequence([dests[i] for i in inds])
        labels = F.transpose_sequence([labels[i] for i in inds])
        sequences = F.transpose_sequence([sequences[i] for i in inds])
        leaf_labels = F.transpose_sequence([leaf_labels[i] for i in inds])

        batch = len(inds)
        maxlen = len(sequences)

        loss = 0
        count = 0
        correct = 0

        # thin stack
        stack = self.xp.zeros((batch, maxlen * 2, self.n_units), 'f')

        # Compute the hidden state vectors and the loss for the leaf nodes
        for i, (word, label) in enumerate(zip(sequences, leaf_labels)):
            batch = word.shape[0]
            es = self.leaf(word)
            ds = self.xp.full((batch,), i, 'i')
            y = self.label(es)
            loss += F.softmax_cross_entropy(y, label, normalize=False) * batch
            count += batch
            predict = self.xp.argmax(y.data, axis=1)
            correct += (predict == label.data).sum()

            stack = thin_stack_set(stack, ds, es)

        # Compute the hidden state vectors and the loss for the internal nodes
        for left, right, dest, label in zip(lefts, rights, dests, labels):
            l, stack = thin_stack_get(stack, left)
            r, stack = thin_stack_get(stack, right)
            o = self.node(l, r)
            y = self.label(o)
            batch = l.shape[0]
            loss += F.softmax_cross_entropy(y, label, normalize=False) * batch
            count += batch
            predict = self.xp.argmax(y.data, axis=1)
            correct += (predict == label.data).sum()

            stack = thin_stack_set(stack, dest, o)

        loss /= count
        reporter.report({'loss': loss}, self)
        reporter.report({'total': count}, self)
        reporter.report({'correct': correct}, self)
        return loss
[ ]:
model = ThinStackRecursiveNet(len(vocab), n_units, n_label)

if gpu_id >= 0:
    model.to_gpu()

optimizer = chainer.optimizers.AdaGrad(0.1)
optimizer.setup(model)
<chainer.optimizers.ada_grad.AdaGrad at 0x7f8a3c453710>

Preparation of Updater · Trainer and execution of training

Let’s train with the new model ThinStackRecursiveNet. Since we can now compute mini-batches in parallel, you can see that training is faster.

[ ]:
def convert(batch, device):
    if device is None:
        def to_device(x):
            return x
    elif device < 0:
        to_device = cuda.to_cpu
    else:
        def to_device(x):
            return cuda.to_gpu(x, device, cuda.Stream.null)

    return tuple(
        [to_device(d['lefts']) for d in batch] +
        [to_device(d['rights']) for d in batch] +
        [to_device(d['dests']) for d in batch] +
        [to_device(d['labels']) for d in batch] +
        [to_device(d['words']) for d in batch] +
        [to_device(d['leaf_labels']) for d in batch]
    )


updater = chainer.training.StandardUpdater(
    train_iter, optimizer, device=None, converter=convert)
trainer = chainer.training.Trainer(updater, (n_epoch, 'epoch'))
trainer.extend(
    extensions.Evaluator(validation_iter, model, converter=convert, device=None),
    trigger=(epoch_per_eval, 'epoch'))
trainer.extend(extensions.LogReport())

trainer.extend(extensions.MicroAverage(
    'main/correct', 'main/total', 'main/accuracy'))
trainer.extend(extensions.MicroAverage(
    'validation/main/correct', 'validation/main/total',
    'validation/main/accuracy'))

trainer.extend(extensions.PrintReport(
   ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))

trainer.run()
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           1.75582                           0.268018                                 0.772637
2           1.0503      1.52234               0.63964        0.448087                  1.74078
3           0.752925                          0.743243                                 2.52495
4           1.21727     1.46956               0.745495       0.456284                  3.49669
5           0.681582                          0.817568                                 4.24974
6           0.477964    1.5514                0.880631       0.480874                  5.22265
7           0.38437                           0.916667                                 5.98324
8           0.30405     1.68066               0.923423       0.469945                  6.94833
9           0.222884                          0.959459                                 7.69772
10          0.175159    1.79104               0.977477       0.478142                  8.67923
11          0.142888                          0.97973                                  9.43108
12          0.118272    1.87948               0.986486       0.47541                   10.4046
13          0.0991659                         0.997748                                 11.1994
14          0.0841932   1.95415               0.997748       0.478142                  12.1657
15          0.0723124                         0.997748                                 12.9141
16          0.0627568   2.01682               0.997748       0.480874                  13.8787
17          0.0549726                         1                                        14.6336
18          0.04857     2.07107               1              0.478142                  15.6061
19          0.0432675                         1                                        16.3584
20          0.0388425   2.1181                1              0.480874                  17.3297
21          0.035117                          1                                        18.0761
22          0.0319522   2.15905               1              0.478142                  19.0487
23          0.0292416                         1                                        19.8416
24          0.0269031   2.1951                1              0.480874                  20.8083
25          0.0248729                         1                                        21.5566
26          0.0231      2.22721               1              0.483607                  22.5304
27          0.0215427                         1                                        23.2878
28          0.0201669   2.25614               1              0.486339                  24.2565
29          0.018944                          1                                        25.0171
30          0.017851    2.28247               1              0.480874                  26.0063
31          0.0168687                         1                                        26.7633
32          0.0159814   2.30664               1              0.483607                  27.7331
33          0.0151763                         1                                        28.5342
34          0.0144427   2.32898               1              0.483607                  29.5039
35          0.0137716                         1                                        30.257
36          0.0131555   2.34976               1              0.483607                  31.2306
37          0.0125881                         1                                        31.9842
38          0.0120638   2.3692                1              0.483607                  32.9617
39          0.0115783                         1                                        33.7175
40          0.0111272   2.38747               1              0.483607                  34.6946

It got much faster!

Reference

[1] 深層学習による自然言語処理 (機械学習プロフェッショナルシリーズ)

[2] [A Fast Unified Model for Parsing and Sentence Understanding](http://nlp.stanford.edu/pubs/bowman2016spinn.pdf)

Word2Vec: Obtain word embeddings

0. Introduction

Word2vec is a tool for generating distributed representations of words, proposed by Mikolov et al. [1]. The tool assigns a real-valued vector to each word, and the closer the meanings of two words are, the more similar their vectors will be.

Distributed representation means assigning a real-valued vector to each word and representing the word by that vector. The vector itself is called the word embedding. In this notebook, we aim to explain how to obtain word embeddings from the Penn Tree Bank dataset.

Let’s think about what the meaning of a word is. As humans, we understand that the words “animal” and “dog” are deeply related to each other. But what information does Word2vec use to learn vectors for words? The words “animal” and “dog” should have similar vectors, but the words “food” and “dog” should be far from each other. How can the features of those words be learned automatically?

1. Basic Idea

Word2vec learns the similarity of word meanings from simple information. It learns the representations of words from sentences. The core idea is based on the assumption that the meaning of a word is determined by the words around it. This idea follows the distributional hypothesis [2].

The word we focus on to learn its representation is called the “center word”, and the words around it are called “context words”. The window size C determines the number of context words that are considered.

Here, let’s see the algorithm using an example sentence: “The cute cat jumps over the lazy dog.”

  • All of the following figures consider “cat” as the center word.
  • According to the window size C, you can see that the number of context words is changed.

(Figure: center_context_word.png)

2. Main Algorithm

Word2vec, the tool for creating the word embeddings, is actually built with two models, which are called Skip-gram and CBoW.

To explain the models with the figures below, we will use the following symbols.

  • \(|\mathcal{V}|\) : The size of the vocabulary
  • \(D\) : The size of the embedding vector
  • \({\bf v}_t\) : A one-hot center word vector
  • \(V_{t \pm C}\) : A set of \(2C\) context vectors around \({\bf v}_t\), namely, \(\{{\bf v}_{t+c}\}_{c=-C}^C \backslash {\bf v}_t\)
  • \({\bf l}_H\) : An embedding vector of an input word vector
  • \({\bf l}_O\) : An output vector of the network
  • \({\bf W}_H\) : The embedding matrix for inputs
  • \({\bf W}_O\) : The embedding matrix for outputs

Note

Using negative sampling or hierarchical softmax for the loss function is very common. However, in this notebook, we will use the softmax over all words and skip the other variants for the sake of simplicity.

2.1 Skip-gram

This model learns to predict the context words \(V_{t \pm C}\) when a center word \({\bf v}_t\) is given. In this model, each row of the embedding matrix for inputs \({\bf W}_H\) becomes the word embedding of the corresponding word.

When you input a center word \({\bf v}_t\) into the network, you can predict one of context words \(\hat{\bf v}_{t+i} \in V_{t \pm C}\) as follows:

  1. Calculate an embedding vector of the input center word vector: \({\bf l}_H = {\bf W}_H {\bf v}_t\)
  2. Calculate an output vector of the embedding vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)
  3. Calculate a probability vector of a context word: \(\hat{\bf v}_{t+i} = \text{softmax}({\bf l}_O)\)

Each element of the \(|\mathcal{V}|\)-dimensional vector \(\hat{\bf v}_{t+i}\) is a probability that a word in the vocabulary turns out to be a context word at position \(i\). So, the probability \(p({\bf v}_{t+i} \mid {\bf v}_t)\) can be estimated by a dot product of the one-hot vector \({\bf v}_{t+i}\) which represents the actual word at the position \(i\) and the output vector \(\hat{\bf v}_{t+i}\).

\(p({\bf v}_{t+i} \mid {\bf v}_t) = {\bf v}_{t+i}^T \hat{\bf v}_{t+i}\)

The loss function for all the context words \(V_{t \pm C}\) given a center word \({\bf v}_t\) is defined as follows:

\begin{eqnarray}
L(V_{t \pm C} \mid {\bf v}_t; {\bf W}_H, {\bf W}_O) &=& \sum_{V_{t \pm C}} -\log\left(p({\bf v}_{t+i} \mid {\bf v}_t)\right) \\
&=& \sum_{V_{t \pm C}} -\log({\bf v}_{t+i}^T \hat{\bf v}_{t+i})
\end{eqnarray}

2.2 Continuous Bag of Words (CBoW)

This model learns to predict the center word \({\bf v}_t\) when the context words \(V_{t \pm C}\) are given.

When you give the set of context words \(V_{t \pm C}\) to the network, you can estimate the probability of the center word \(\hat{\bf v}_t\) as follows:

  1. Calculate a mean embedding vector over all context words: \({\bf l}_H = \frac{1}{2C} \sum_{V_{t \pm C}} {\bf W}_H {\bf v}_{t+i}\)
  2. Calculate an output vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)
  3. Calculate a probability vector: \(\hat{\bf v}_t = \text{softmax}({\bf l}_O)\)

Each element of \(\hat{\bf v}_t\) is the probability that a word in the vocabulary is the center word. So, the prediction \(p({\bf v}_t \mid V_{t \pm C})\) can be calculated by \({\bf v}_t^T \hat{\bf v}_t\), where \({\bf v}_t\) denotes the one-hot vector of the actual center word in the sentence from the dataset.

The loss function for the center word prediction is defined as follows:

\begin{eqnarray}
L({\bf v}_t \mid V_{t \pm C}; {\bf W}_H, {\bf W}_O) &=& -\log(p({\bf v}_t \mid V_{t \pm C})) \\
&=& -\log({\bf v}_t^T \hat{\bf v}_t)
\end{eqnarray}

3. Details of skip-gram

In this notebook, we mainly explain skip-gram model because

  1. It is easier to understand the algorithm than CBoW.
  2. Even if the number of words increases, the accuracy is largely maintained. So, it is more scalable.

So, let’s think about a concrete example of calculating skip-gram under this setup:

  • The size of vocabulary \(|\mathcal{V}|\) is 10.
  • The size of embedding vector \(D\) is 2.
  • Center word is “dog”.
  • Context word is “animal”.

Since there is usually more than one context word, we repeat the following process for each context word.

  1. The one-hot vector of “dog” is [0 0 1 0 0 0 0 0 0 0] and you input it as the center word.
  2. The third row of the embedding matrix \({\bf W}_H\) is used as the word embedding of “dog”, \({\bf l}_H\).
  3. Then multiply \({\bf W}_O\) with \({\bf l}_H\) to obtain the output vector \({\bf l}_O\).
  4. Give \({\bf l}_O\) to the softmax function to obtain the predicted probability vector \(\hat{\bf v}_{t+c}\) for the context word at position \(c\).
  5. Calculate the error between \(\hat{\bf v}_{t+c}\) and the one-hot vector of “animal”, [1 0 0 0 0 0 0 0 0 0].
  6. Propagate the error back through the network to update the parameters.

(Figure: skipgram_detail.png)

4. Implementation of skip-gram in Chainer

There is an example of Word2vec in the official repository of Chainer, so we will explain how to implement skip-gram based on this: chainer/examples/word2vec

First, we execute the following cell and install “Chainer” and its GPU back end “CuPy”. If the “runtime type” of Colaboratory is GPU, you can run Chainer with GPU as a backend.

[1]:
!curl https://colab.chainer.org/install | sh -
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
4.1 Preparation

First, let’s import necessary packages:

[ ]:
import argparse
import collections

import numpy as np
import six

import chainer
from chainer import cuda
import chainer.functions as F
import chainer.initializers as I
import chainer.links as L
import chainer.optimizers as O
from chainer import reporter
from chainer import training
from chainer.training import extensions
4.2 Define a skip-gram model

Next, let’s define a network for skip-gram.

[ ]:
class SkipGram(chainer.Chain):

    def __init__(self, n_vocab, n_units):
        super().__init__()
        with self.init_scope():
            self.embed = L.EmbedID(
                n_vocab, n_units, initialW=I.Uniform(1. / n_units))
            self.out = L.Linear(n_units, n_vocab, initialW=0)

    def __call__(self, x, context):
        e = self.embed(context)
        shape = e.shape
        x = F.broadcast_to(x[:, None], (shape[0], shape[1]))
        e = F.reshape(e, (shape[0] * shape[1], shape[2]))
        x = F.reshape(x, (shape[0] * shape[1],))
        center_predictions = self.out(e)
        loss = F.softmax_cross_entropy(center_predictions, x)
        reporter.report({'loss': loss}, self)
        return loss

Note

  • The weight matrix self.embed.W is the embedding matrix for the input vector x.
  • __call__ takes the word IDs of a center word x and the word IDs of context words context as inputs, and outputs the error calculated by the loss function softmax_cross_entropy.
  • Note that the initial shapes of x and context are (batch_size,) and (batch_size, n_context), respectively.
  • batch_size means the size of a mini-batch, and n_context means the number of context words.

First, we obtain the embedding vectors of the context words by e = self.embed(context).

Then F.broadcast_to(x[:, None], (shape[0], shape[1])) broadcasts x (shape (batch_size,)) to (batch_size, n_context) by copying the same value n_context times to fill the second axis. The broadcasted x is then reshaped into a 1-D vector of shape (batch_size * n_context,), while e is reshaped to (batch_size * n_context, n_units).

In the skip-gram model, predicting a context word from the center word is equivalent to predicting the center word from a context word, because whenever a word is a context word of the center word, the center word is also one of its context words. So we create batch_size * n_context center word predictions by applying the self.out linear layer to the embedding vectors of the context words, and then calculate the softmax cross entropy between the broadcasted center word IDs x and these predictions.
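
To make the reshaping concrete, here is a minimal shape-check sketch using plain NumPy (the sizes batch_size = 3, n_context = 2 and n_units = 4 are made-up values for illustration):

[ ]:
import numpy as np

batch_size, n_context, n_units = 3, 2, 4             # toy sizes
x = np.arange(batch_size, dtype=np.int32)            # center word IDs, shape (3,)
e = np.zeros((batch_size, n_context, n_units), np.float32)  # embedded context words

x_b = np.broadcast_to(x[:, None], (batch_size, n_context))  # shape (3, 2)
x_flat = x_b.reshape(batch_size * n_context)                # shape (6,)
e_flat = e.reshape(batch_size * n_context, n_units)         # shape (6, 4)

print(x_flat.shape, e_flat.shape)   # (6,) (6, 4)
# Each of the 6 rows of e_flat is scored against the corresponding center word
# ID in x_flat by self.out followed by softmax_cross_entropy in SkipGram above.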

4.3 Prepare dataset and iterator

Let’s retrieve the Penn Tree Bank (PTB) dataset by using Chainer’s dataset utility get_ptb_words() method.

[ ]:
train, val, _ = chainer.datasets.get_ptb_words()
n_vocab = max(train) + 1  # The minimum word ID is 0

Then define an iterator to make mini-batches that contain a set of center words with their context words.

[ ]:
class WindowIterator(chainer.dataset.Iterator):

    def __init__(self, dataset, window, batch_size, repeat=True):
        self.dataset = np.array(dataset, np.int32)
        self.window = window
        self.batch_size = batch_size
        self._repeat = repeat

        self.order = np.random.permutation(
            len(dataset) - window * 2).astype(np.int32)
        self.order += window
        self.current_position = 0
        self.epoch = 0
        self.is_new_epoch = False

    def __next__(self):
        if not self._repeat and self.epoch > 0:
            raise StopIteration

        i = self.current_position
        i_end = i + self.batch_size
        position = self.order[i: i_end]
        w = np.random.randint(self.window - 1) + 1
        offset = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])
        pos = position[:, None] + offset[None, :]
        context = self.dataset.take(pos)
        center = self.dataset.take(position)

        if i_end >= len(self.order):
            np.random.shuffle(self.order)
            self.epoch += 1
            self.is_new_epoch = True
            self.current_position = 0
        else:
            self.is_new_epoch = False
            self.current_position = i_end

        return center, context

    @property
    def epoch_detail(self):
        return self.epoch + float(self.current_position) / len(self.order)

    def serialize(self, serializer):
        self.current_position = serializer('current_position',
                                           self.current_position)
        self.epoch = serializer('epoch', self.epoch)
        self.is_new_epoch = serializer('is_new_epoch', self.is_new_epoch)
        if self.order is not None:
            serializer('order', self.order)

def convert(batch, device):
    center, context = batch
    if device >= 0:
        center = cuda.to_gpu(center)
        context = cuda.to_gpu(context)
    return center, context
  • In the constructor, we create an array self.order which denotes shuffled indices of [window, window + 1, ..., len(dataset) - window - 1] in order to choose a center word randomly from dataset in a mini-batch.
  • The iterator definition __next__ returns batch_size sets of center word and context words.
  • The code self.order[i:i_end] returns the indices for a set of center words from the random-ordered array self.order. The center word IDs center at the random indices are retrieved by self.dataset.take.
  • np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)]) creates a set of offsets to retrieve context words from the dataset.
  • The code position[:, None] + offset[None, :] generates the indices of context words for each center word index in position. The context word IDs context are retrieved by self.dataset.take (a small index-arithmetic sketch follows this list).
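
Here is a small sketch that mirrors the index arithmetic in __next__ above (the dataset, positions, and window value below are made-up toy numbers):

[ ]:
import numpy as np

dataset = np.arange(100, 120, dtype=np.int32)     # pretend word IDs 100..119
position = np.array([5, 10])                      # indices of two center words
w = 2                                             # a sampled window size (fixed here)

offset = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])  # [-2 -1  1  2]
pos = position[:, None] + offset[None, :]         # (2, 4) context word indices
center = dataset.take(position)                   # [105 110]
context = dataset.take(pos)                       # [[103 104 106 107]
                                                  #  [108 109 111 112]]
print(center)
print(context)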
4.4 Prepare model, optimizer, and updater
[ ]:
unit = 100  # number of hidden units
window = 5
batchsize = 1000
gpu = 0

# Instantiate model
model = SkipGram(n_vocab, unit)

if gpu >= 0:
    model.to_gpu(gpu)

# Create optimizer
optimizer = O.Adam()
optimizer.setup(model)

# Create iterators for both train and val datasets
train_iter = WindowIterator(train, window, batchsize)
val_iter = WindowIterator(val, window, batchsize, repeat=False)

# Create updater
updater = training.StandardUpdater(
    train_iter, optimizer, converter=convert, device=gpu)
4.5 Start training
[7]:
epoch = 100

trainer = training.Trainer(updater, (epoch, 'epoch'), out='word2vec_result')
trainer.extend(extensions.Evaluator(val_iter, model, converter=convert, device=gpu))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'validation/main/loss', 'elapsed_time']))
trainer.run()
epoch       main/loss   validation/main/loss  elapsed_time
1           6.87314     6.48688               54.154
2           6.44018     6.40645               107.352
3           6.35021     6.3558                159.544
4           6.28615     6.31679               212.612
5           6.23762     6.28779               266.059
6           6.19942     6.22658               319.874
7           6.15986     6.20715               372.798
8           6.13787     6.21461               426.456
9           6.10637     6.24927               479.725
10          6.08759     6.23192               532.966
11          6.06768     6.19332               586.339
12          6.04607     6.17291               639.295
13          6.0321      6.21226               692.67
14          6.02178     6.18489               746.599
15          6.00098     6.17341               799.408
16          5.99099     6.19581               852.966
17          5.97425     6.22275               905.819
18          5.95974     6.20495               958.404
19          5.96579     6.16532               1012.49
20          5.95292     6.21457               1066.24
21          5.93696     6.18441               1119.45
22          5.91804     6.20695               1171.98
23          5.93265     6.15757               1225.99
24          5.92238     6.17064               1279.85
25          5.9154      6.21545               1334.01
26          5.90538     6.1812                1387.68
27          5.8807      6.18523               1439.72
28          5.89009     6.19992               1492.67
29          5.8773      6.24146               1545.48
30          5.89217     6.21846               1599.79
31          5.88493     6.21654               1653.95
32          5.87784     6.18502               1707.45
33          5.88031     6.14161               1761.75
34          5.86278     6.22893               1815.29
35          5.83335     6.18966               1866.56
36          5.85978     6.24276               1920.18
37          5.85921     6.23888               1974.2
38          5.85195     6.19231               2027.92
39          5.8396      6.20542               2080.78
40          5.83745     6.27583               2133.37
41          5.85996     6.23596               2188
42          5.85743     6.17438               2242.4
43          5.84051     6.25449               2295.84
44          5.83023     6.30226               2348.84
45          5.84677     6.23473               2403.11
46          5.82406     6.27398               2456.11
47          5.82827     6.21509               2509.17
48          5.8253      6.23009               2562.15
49          5.83697     6.2564                2616.35
50          5.81998     6.29104               2669.38
51          5.82926     6.26068               2723.47
52          5.81457     6.30152               2776.36
53          5.82587     6.29581               2830.24
54          5.80614     6.30994               2882.85
55          5.8161      6.23224               2935.73
56          5.80867     6.26867               2988.48
57          5.79467     6.24508               3040.2
58          5.81687     6.24676               3093.57
59          5.82064     6.30236               3147.68
60          5.80855     6.30184               3200.75
61          5.81298     6.25173               3254.06
62          5.80753     6.32951               3307.42
63          5.82505     6.2472                3361.68
64          5.78396     6.28168               3413.14
65          5.80209     6.24962               3465.96
66          5.80107     6.326                 3518.83
67          5.83765     6.28848               3574.57
68          5.7864      6.3506                3626.88
69          5.80329     6.30671               3679.82
70          5.80032     6.29277               3732.69
71          5.80647     6.30722               3786.21
72          5.8176      6.30046               3840.51
73          5.79912     6.35945               3893.81
74          5.80484     6.32439               3947.35
75          5.82065     6.29674               4002.03
76          5.80872     6.27921               4056.05
77          5.80891     6.28952               4110.1
78          5.79121     6.35363               4163.39
79          5.79161     6.32894               4216.34
80          5.78601     6.3255                4268.95
81          5.79062     6.29608               4321.73
82          5.7959      6.37235               4375.25
83          5.77828     6.31001               4427.44
84          5.7879      6.25628               4480.09
85          5.79297     6.29321               4533.27
86          5.79286     6.2725                4586.44
87          5.79388     6.36764               4639.82
88          5.79062     6.33841               4692.89
89          5.7879      6.31828               4745.68
90          5.81015     6.33247               4800.19
91          5.78858     6.37569               4853.31
92          5.7966      6.35733               4907.27
93          5.79814     6.34506               4961.09
94          5.81956     6.322                 5016.65
95          5.81565     6.35974               5071.69
96          5.78953     6.37451               5125.02
97          5.7993      6.42065               5179.34
98          5.79129     6.37995               5232.89
99          5.76834     6.36254               5284.7
100         5.79829     6.3785                5338.93
[ ]:
vocab = chainer.datasets.get_ptb_words_vocabulary()
index2word = {wid: word for word, wid in six.iteritems(vocab)}

# Save the word2vec model
with open('word2vec.model', 'w') as f:
    f.write('%d %d\n' % (len(index2word), unit))
    w = cuda.to_cpu(model.embed.W.data)
    for i, wi in enumerate(w):
        v = ' '.join(map(str, wi))
        f.write('%s %s\n' % (index2word[i], v))
4.6 Search the similar words
[ ]:
import numpy
import six

n_result = 5  # number of search result to show


with open('word2vec.model', 'r') as f:
    ss = f.readline().split()
    n_vocab, n_units = int(ss[0]), int(ss[1])
    word2index = {}
    index2word = {}
    w = numpy.empty((n_vocab, n_units), dtype=numpy.float32)
    for i, line in enumerate(f):
        ss = line.split()
        assert len(ss) == n_units + 1
        word = ss[0]
        word2index[word] = i
        index2word[i] = word
        w[i] = numpy.array([float(s) for s in ss[1:]], dtype=numpy.float32)


s = numpy.sqrt((w * w).sum(1))
w /= s.reshape((s.shape[0], 1))  # normalize
[ ]:
def search(query):
  if query not in word2index:
    print('"{0}" is not found'.format(query))
    return

  v = w[word2index[query]]
  similarity = w.dot(v)
  print('query: {}'.format(query))

  count = 0
  for i in (-similarity).argsort():
      if numpy.isnan(similarity[i]):
          continue
      if index2word[i] == query:
          continue
      print('{0}: {1}'.format(index2word[i], similarity[i]))
      count += 1
      if count == n_result:
          return

Search with the word “apple”.

[23]:
query = "apple"
search(query)
query: apple
computer: 0.5457335710525513
compaq: 0.5068206191062927
microsoft: 0.4654524028301239
network: 0.42985647916793823
trotter: 0.42716777324676514

5. Reference

[ ]:

Other Examples

CuPy

1. Introduction

[ ]:
# Install CuPy and import as np !

!curl https://colab.chainer.org/install | sh -
import cupy as np
import numpy
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcusparse8.0 is already the newest version (8.0.61-1).
libnvrtc8.0 is already the newest version (8.0.61-1).
libnvtoolsext1 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Requirement already satisfied: cupy-cuda80==4.0.0b3 from https://github.com/kmaehashi/chainer-colab/releases/download/2018-02-06/cupy_cuda80-4.0.0b3-cp36-cp36m-linux_x86_64.whl in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: fastrlock>=0.3 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from cupy-cuda80==4.0.0b3)
[ ]:
class Regressor(object):
    """
    Base class for regressors
    """

    def fit(self, X, t, **kwargs):
        """
        estimates parameters given training dataset
        Parameters
        ----------
        X : (sample_size, n_features) np.ndarray
            training data input
        t : (sample_size,) np.ndarray
            training data target
        """
        self._check_input(X)
        self._check_target(t)
        if hasattr(self, "_fit"):
            self._fit(X, t, **kwargs)
        else:
            raise NotImplementedError

    def predict(self, X, **kwargs):
        """
        predict outputs of the model
        Parameters
        ----------
        X : (sample_size, n_features) np.ndarray
            samples to predict their output
        Returns
        -------
        y : (sample_size,) np.ndarray
            prediction of each sample
        """
        self._check_input(X)
        if hasattr(self, "_predict"):
            return self._predict(X, **kwargs)
        else:
            raise NotImplementedError

    def _check_input(self, X):
        if not isinstance(X, np.ndarray):
            raise ValueError("X(input) is not np.ndarray")
        if X.ndim != 2:
            raise ValueError("X(input) is not two dimensional array")
        if hasattr(self, "n_features") and self.n_features != np.size(X, 1):
            raise ValueError(
                "mismatch in dimension 1 of X(input) "
                "(size {} is different from {})"
                .format(np.size(X, 1), self.n_features)
            )

    def _check_target(self, t):
        if not isinstance(t, np.ndarray):
            raise ValueError("t(target) must be np.ndarray")
        if t.ndim != 1:
            raise ValueError("t(target) must be one dimenional array")
[ ]:
class LinearRegressor(Regressor):
    """
    Linear regression model
    y = X @ w
    t ~ N(t|X @ w, var)
    """

    def _fit(self, X, t):
        self.w = np.linalg.pinv(X) @ t
        self.var = np.mean(np.square(X @ self.w - t))

    def _predict(self, X, return_std=False):
        y = X @ self.w
        if return_std:
            y_std = np.sqrt(self.var) + np.zeros_like(y)
            return y, y_std
        return y
[ ]:
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(1234)
1.1. Example: Polynomial Curve Fitting
[ ]:
def create_toy_data(func, sample_size, std):
    x = np.linspace(0, 1, sample_size)
    t = func(x) + np.random.normal(scale=std, size=x.shape)
    return x, t

def func(x):
    return np.sin(2 * numpy.pi * x)  # np is CuPy here, so use numpy.pi for the constant

x_train, y_train = create_toy_data(func, 10, 0.25)
x_test = np.linspace(0, 1, 100)
y_test = func(x_test)

plt.scatter(x_train.get(), y_train.get(), facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test.get(), y_test.get(), c="g", label="$\sin(2\pi x)$")
plt.legend()
plt.show()
_images/notebook_example_cupy_prml_ch01_Introduction_ipynb_6_0.png
[ ]:
import itertools
import functools


class PolynomialFeatures(object):
    """
    polynomial features
    transforms input array with polynomial features
    Example
    =======
    x =
    [[a, b],
    [c, d]]
    y = PolynomialFeatures(degree=2).transform(x)
    y =
    [[1, a, b, a^2, a * b, b^2],
    [1, c, d, c^2, c * d, d^2]]
    """

    def __init__(self, degree=2):
        """
        construct polynomial features
        Parameters
        ----------
        degree : int
            degree of polynomial
        """
        assert isinstance(degree, int)
        self.degree = degree

    def transform(self, x):
        """
        transforms input array with polynomial features
        Parameters
        ----------
        x : (sample_size, n) ndarray
            input array
        Returns
        -------
        output : (sample_size, 1 + nC1 + ... + nCd) ndarray
            polynomial features
        """
        if x.ndim == 1:
            x = x[:, None]
        # Workaround for https://github.com/cupy/cupy/issues/1084: build the
        # polynomial features on the host with NumPy, then transfer the result
        # back to a CuPy array at the end.
        x_t = x.transpose().get()
        features = [numpy.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(functools.reduce(lambda x, y: x * y, items))
        features = numpy.array(features)
        return np.asarray(features).transpose()
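As a quick sanity check of the transform (this snippet is not in the original notebook and the input values are made up), the output follows the pattern shown in the docstring:

# Hypothetical 2x2 input: rows [a, b] = [2, 3] and [c, d] = [4, 5]
x = np.array([[2., 3.],
              [4., 5.]])
print(PolynomialFeatures(degree=2).transform(x))
# Expected rows (up to float formatting):
# [1, a, b, a^2, a*b, b^2] = [1, 2, 3,  4,  6,  9]
# [1, c, d, c^2, c*d, d^2] = [1, 4, 5, 16, 20, 25]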
[ ]:
for i, degree in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    feature = PolynomialFeatures(degree)
    X_train = feature.transform(x_train)
    X_test = feature.transform(x_test)

    model = LinearRegressor()
    model.fit(X_train, y_train)
    y = model.predict(X_test)

    plt.scatter(x_train.get(), y_train.get(), facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test.get(), y_test.get(), c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test.get(), y.get(), c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.annotate("M={}".format(degree), xy=(-0.15, 1))
plt.legend(bbox_to_anchor=(1.05, 0.64), loc=2, borderaxespad=0.)
plt.show()
_images/notebook_example_cupy_prml_ch01_Introduction_ipynb_8_0.png
[ ]:
def rmse(a, b):
    return np.sqrt(np.mean(np.square(a - b)))

training_errors = []
test_errors = []

for i in range(10):
    feature = PolynomialFeatures(i)
    X_train = feature.transform(x_train)
    X_test = feature.transform(x_test)

    model = LinearRegressor()
    model.fit(X_train, y_train)
    y = model.predict(X_test)
    training_errors.append(rmse(model.predict(X_train), y_train))
    test_errors.append(rmse(model.predict(X_test), y_test + np.random.normal(scale=0.25, size=len(y_test))))

plt.plot(training_errors, 'o-', mfc="none", mec="b", ms=10, c="b", label="Training")
plt.plot(test_errors, 'o-', mfc="none", mec="r", ms=10, c="r", label="Test")
plt.legend()
plt.xlabel("degree")
plt.ylabel("RMSE")
plt.show()
_images/notebook_example_cupy_prml_ch01_Introduction_ipynb_9_0.png
Regularization
[ ]:
class RidgeRegressor(Regressor):
    """
    Ridge regression model
    w* = argmin ||t - X @ w||^2 + alpha * ||w||_2^2
    """

    def __init__(self, alpha=1.):
        self.alpha = alpha

    def _fit(self, X, t):
        eye = np.eye(np.size(X, 1))
        self.w = np.linalg.solve(self.alpha * eye + X.T @ X, X.T @ t)

    def _predict(self, X):
        y = X @ self.w
        return y
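As a note (not in the original notebook), the `np.linalg.solve` call in `_fit` evaluates the closed-form ridge solution

$$\mathbf{w}^{*} = \left(\alpha\mathbf{I} + \mathbf{X}^{\mathsf T}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf T}\mathbf{t}$$

without forming the matrix inverse explicitly; a larger `alpha` shrinks the weights and damps the oscillations of the unregularized degree-9 fit shown above.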
[ ]:
feature = PolynomialFeatures(9)
X_train = feature.transform(x_train)
X_test = feature.transform(x_test)

model = RidgeRegressor(alpha=1e-3)
model.fit(X_train, y_train)
y = model.predict(X_test)

plt.scatter(x_train.get(), y_train.get(), facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test.get(), y_test.get(), c="g", label="$\sin(2\pi x)$")
plt.plot(x_test.get(), y.get(), c="r", label="fitting")
plt.ylim(-1.5, 1.5)
plt.legend()
plt.annotate("M=9", xy=(-0.15, 1))
plt.show()
_images/notebook_example_cupy_prml_ch01_Introduction_ipynb_12_0.png
1.2.6 Bayesian curve fitting
[ ]:
class BayesianRegressor(Regressor):
    """
    Bayesian regression model
    w ~ N(w|0, alpha^(-1)I)
    y = X @ w
    t ~ N(t|X @ w, beta^(-1))
    """

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta
        self.w_mean = None
        self.w_precision = None

    def _fit(self, X, t):
        if self.w_mean is not None:
            mean_prev = self.w_mean
        else:
            mean_prev = np.zeros(np.size(X, 1))
        if self.w_precision is not None:
            precision_prev = self.w_precision
        else:
            precision_prev = self.alpha * np.eye(np.size(X, 1))
        w_precision = precision_prev + self.beta * X.T @ X
        w_mean = np.linalg.solve(
            w_precision,
            precision_prev @ mean_prev + self.beta * X.T @ t
        )
        self.w_mean = w_mean
        self.w_precision = w_precision
        self.w_cov = np.linalg.inv(self.w_precision)

    def _predict(self, X, return_std=False, sample_size=None):
        if isinstance(sample_size, int):
            w_sample = np.random.multivariate_normal(
                self.w_mean, self.w_cov, size=sample_size
            )
            y = X @ w_sample.T
            return y
        y = X @ self.w_mean
        if return_std:
            y_var = 1 / self.beta + np.sum(X @ self.w_cov * X, axis=1)
            y_std = np.sqrt(y_var)
            return y, y_std
        return y
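A note on `_fit` and `_predict` (not part of the original notebook): starting from the prior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$, the posterior after observing $(\mathbf{X}, \mathbf{t})$ is $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ with

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\mathbf{X}^{\mathsf T}\mathbf{X}, \qquad \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{X}^{\mathsf T}\mathbf{t}\right)$$

and the predictive distribution for a new input row $\mathbf{x}$ is Gaussian with mean $\mathbf{x}^{\mathsf T}\mathbf{m}_N$ and standard deviation $\sqrt{1/\beta + \mathbf{x}^{\mathsf T}\mathbf{S}_N\mathbf{x}}$, which is what `_predict(return_std=True)` returns per sample.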
[ ]:
model = BayesianRegressor(alpha=2e-3, beta=2)
model.fit(X_train, y_train)

y, y_err = model.predict(X_test, return_std=True)
plt.scatter(x_train.get(), y_train.get(), facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test.get(), y_test.get(), c="g", label="$\sin(2\pi x)$")
plt.plot(x_test.get(), y.get(), c="r", label="mean")
plt.fill_between(x_test.get(), (y - y_err).get(), (y + y_err).get(), color="pink", label="std.", alpha=0.5)
plt.xlim(-0.1, 1.1)
plt.ylim(-1.5, 1.5)
plt.annotate("M=9", xy=(0.8, 1))
plt.legend(bbox_to_anchor=(1.05, 1.), loc=2, borderaxespad=0.)
plt.show()
_images/notebook_example_cupy_prml_ch01_Introduction_ipynb_15_0.png

Other Languages