Dropout | cs231n - A2-Q3

Python
cs231n
numpy
Dropout
matplotlib
cross validation
grid search
cifar-10
Deep Learning
Computer Vision
Author

Emre Kara

Published

May 27, 2023

CS231N

This course is a deep dive into the details of deep learning architectures, with a focus on learning end-to-end models for visual recognition tasks, particularly image classification.

This page contains my solutions and approaches for the assignment. All source code for my solutions is available on GitHub.

Dropout

Dropout [1] is a technique for regularizing neural networks by randomly setting some output activations to zero during the forward pass. In this exercise, you will implement a dropout layer and modify your fully connected network to optionally use dropout.

[1] Geoffrey E. Hinton et al., “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv 2012.

# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """Returns relative error."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")
X_train: (49000, 3, 32, 32)
y_train: (49000,)
X_val: (1000, 3, 32, 32)
y_val: (1000,)
X_test: (1000, 3, 32, 32)
y_test: (1000,)

Dropout: Forward Pass

In the file cs231n/layers.py, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes.

Once you have done so, run the cell below to test your implementation.

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **keeping** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of dropping a neuron output.
    """
    p, mode = dropout_param["p"], dropout_param["mode"]
    if "seed" in dropout_param:
        np.random.seed(dropout_param["seed"])

    mask = None
    out = None

    if mode == "train":
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Keep each unit with probability p, then rescale the survivors by 1/p
        # so the expected activation matches the test-time (identity) pass.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
        

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Inverted dropout needs no rescaling at test time: pass x through unchanged.
        out = x

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache
np.random.seed(231)
x = np.random.randn(500, 500) + 10

for p in [0.25, 0.4, 0.7]:
    out, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})

    print('Running tests with p = ', p)
    print('Mean of input: ', x.mean())
    print('Mean of train-time output: ', out.mean())
    print('Mean of test-time output: ', out_test.mean())
    print('Fraction of train-time output set to zero: ', (out == 0).mean())
    print('Fraction of test-time output set to zero: ', (out_test == 0).mean())
    print()
Running tests with p =  0.25
Mean of input:  10.000207878477502
Mean of train-time output:  10.014059116977283
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.749784
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.4
Mean of input:  10.000207878477502
Mean of train-time output:  9.977917658761159
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.600796
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.7
Mean of input:  10.000207878477502
Mean of train-time output:  9.987811912159426
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.30074
Fraction of test-time output set to zero:  0.0

Dropout: Backward Pass

In the file cs231n/layers.py, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation.

def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param["mode"]

    dx = None
    if mode == "train":
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # The mask already includes the 1/p scaling from the forward pass, so the
        # gradient is simply gated (and rescaled) by that same mask.
        dx = dout * mask

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == "test":
        dx = dout
    return dx
np.random.seed(231)
x = np.random.randn(10, 10) + 10
dout = np.random.randn(*x.shape)

dropout_param = {'mode': 'train', 'p': 0.2, 'seed': 123}
out, cache = dropout_forward(x, dropout_param)
dx = dropout_backward(dout, cache)
dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)

# Error should be around e-10 or less.
print('dx relative error: ', rel_error(dx, dx_num))
dx relative error:  5.44560814873387e-11

Inline Question 1:

What happens if we do not divide the values being passed through inverted dropout by p in the dropout layer? Why does that happen?

Answer:

If we skip the division by p, the expected value of each activation at training time becomes p * x instead of x, so train-time and test-time statistics no longer match. To compensate we would have to multiply every activation by p during the test-time forward pass (classic, non-inverted dropout). That trades a one-time training cost for extra work at every inference, which is the wrong trade-off in almost all deployments where inference speed matters most; inverted dropout keeps the test-time pass an untouched identity.
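
A quick toy check of this argument (illustrative only; the array and keep probability below are made up, not part of the assignment):

# Toy demonstration: without the 1/p rescale, the train-time mean shrinks to
# roughly p * mean(x), so the test-time pass would have to multiply by p.
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 1000) + 10
p = 0.4  # keep probability

mask = np.random.rand(*x.shape) < p
plain_out = x * mask           # vanilla dropout, no rescaling
inverted_out = x * mask / p    # inverted dropout

print('input mean:    ', x.mean())             # ~10
print('plain mean:    ', plain_out.mean())     # ~p * 10 = 4
print('inverted mean: ', inverted_out.mean())  # ~10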

Fully Connected Networks with Dropout

In the file cs231n/classifiers/fc_net.py, modify your implementation to use dropout. Specifically, if the constructor of the network receives a value that is not 1 for the dropout_keep_ratio parameter, then the net should add a dropout layer immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation.
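Conceptually, dropout is just one more layer chained after each ReLU. As an illustration only (my solution below inlines the calls inside loss() rather than defining a helper), a composite layer in the spirit of cs231n/layer_utils.py could look like this, assuming the affine_forward, relu_forward, and dropout_forward primitives from cs231n/layers.py:

def affine_relu_dropout_forward(x, w, b, dropout_param):
    # affine -> ReLU -> dropout, so dropout sits right after the nonlinearity.
    a, fc_cache = affine_forward(x, w, b)
    r, relu_cache = relu_forward(a)
    out, do_cache = dropout_forward(r, dropout_param)
    return out, (fc_cache, relu_cache, do_cache)

def affine_relu_dropout_backward(dout, cache):
    # Reverse order on the way back: dropout -> ReLU -> affine.
    fc_cache, relu_cache, do_cache = cache
    dr = dropout_backward(dout, do_cache)
    da = relu_backward(dr, relu_cache)
    return affine_backward(da, fc_cache)  # returns dx, dw, db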

from builtins import range
from builtins import object
import numpy as np

from ..layers import *
from ..layer_utils import *


class FullyConnectedNet(object):
    """Class for a multi-layer fully connected neural network.

    Network contains an arbitrary number of hidden layers, ReLU nonlinearities,
    and a softmax loss function. This will also implement dropout and batch/layer
    normalization as options. For a network with L layers, the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional and the {...} block is
    repeated L - 1 times.

    Learnable parameters are stored in the self.params dictionary and will be learned
    using the Solver class.
    """

    def __init__(
        self,
        hidden_dims,
        input_dim=3 * 32 * 32,
        num_classes=10,
        dropout_keep_ratio=1,
        normalization=None,
        reg=0.0,
        weight_scale=1e-2,
        dtype=np.float32,
        seed=None,
    ):
        """Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout_keep_ratio: Scalar between 0 and 1 giving dropout strength.
            If dropout_keep_ratio=1 then the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
            are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
            initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
            this datatype. float32 is faster but less accurate, so you should use
            float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
            This will make the dropout layers deterministic so we can gradient check the model.
        """
        self.normalization = normalization
        self.use_dropout = dropout_keep_ratio != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
        verbose = False
        if verbose: print('num_layers:', self.num_layers, '\n')
        len_hidden_dims = len(hidden_dims)
        for layer_num in range(1, self.num_layers+1):
          number_of_nodes = None
          if verbose: print('layer_num:', layer_num)
          if layer_num == 1: # First Layer
            if verbose: print('\tfirst_layer')
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (input_dim, hidden_dims[0]))
            self.params[f"b{layer_num}"] = np.zeros(hidden_dims[0], )
            if self.normalization == "batchnorm":
              self.params[f"gamma{layer_num}"] = np.ones((hidden_dims[0], ))
              self.params[f"beta{layer_num}"] = np.zeros((hidden_dims[0], ))
          elif layer_num == self.num_layers: #Last Layer
            if verbose: print('\tlast_layer')
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (hidden_dims[-1], num_classes))
            self.params[f"b{layer_num}"] = np.zeros(num_classes, )
          else: # Hidden Layers
            if verbose: print('\thidden_layer')
            hidden_dim_curr = hidden_dims[layer_num-2]
            hidden_dim_next = hidden_dims[layer_num-1]
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (hidden_dim_curr, hidden_dim_next))
            self.params[f"b{layer_num}"] = np.zeros(hidden_dim_next, )
            if self.normalization == "batchnorm":
              self.params[f"gamma{layer_num}"] = np.ones((hidden_dim_next, ))
              self.params[f"beta{layer_num}"] = np.zeros((hidden_dim_next, ))
            
          if verbose: 
            print(f"\tW{layer_num}:", self.params[f"W{layer_num}"].shape)
            print(f"\tb{layer_num}:", self.params[f"b{layer_num}"].shape)
            if f"gamma{layer_num}" in self.params:
              print(f"\tgamma{layer_num}:", self.params[f"gamma{layer_num}"].shape)
              print(f"\tbeta{layer_num}:", self.params[f"beta{layer_num}"].shape)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
            if seed is not None:
                self.dropout_param["seed"] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == "batchnorm":
            self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
        if self.normalization == "layernorm":
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype.
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """Compute loss and gradient for the fully connected net.
        
        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
            scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
            names to gradients of the loss with respect to those parameters.
        """
        X = X.astype(self.dtype)
        mode = "test" if y is None else "train"

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param["mode"] = mode
        if self.normalization == "batchnorm":
            for bn_param in self.bn_params:
                bn_param["mode"] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        affine_cache = None
        bn_cache = None
        relu_cache = None
        dropout_cache = None

        caches = {}
        input_data = X
        for layer_num in range(1, self.num_layers):
          weights = self.params[f"W{layer_num}"]
          biases = self.params[f"b{layer_num}"]
          temp_out, affine_cache = affine_forward(input_data, weights, biases)
          #batch/layer norm
          if self.normalization == "batchnorm":
            x = temp_out
            gamma = self.params[f"gamma{layer_num}"]
            beta = self.params[f"beta{layer_num}"]
            bn_param = self.bn_params[layer_num-1]
            temp_out, bn_cache = batchnorm_forward(x, gamma, beta, bn_param)
          relu_out, relu_cache = relu_forward(temp_out)
          # Dropout: applied immediately after the ReLU when the net was built
          # with dropout_keep_ratio != 1.
          if self.use_dropout:
            relu_out, dropout_cache = dropout_forward(relu_out, self.dropout_param)
          input_data = relu_out
          cache = (affine_cache, bn_cache, relu_cache, dropout_cache)
          caches[f"cache{layer_num}"] = cache
        
        layer_num = self.num_layers
        weights = self.params[f"W{layer_num}"]
        biases = self.params[f"b{layer_num}"]
        affine_out, affine_cache = affine_forward(input_data, weights, biases)
        caches[f"cache{layer_num}"] = affine_cache
        scores = affine_out

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early.
        if mode == "test":
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        loss, dout = softmax_loss(scores, y)
        
        layer_num = self.num_layers

        w = self.params[f"W{layer_num}"]
        cache = caches[f"cache{layer_num}"]
        dx, dw, db = affine_backward(dout, cache)
        grads[f"W{layer_num}"] = dw + (self.reg * w)
        grads[f"b{layer_num}"] = db
        loss += 0.5 * self.reg * (np.sum(w * w))

        for layer_num in range(self.num_layers-1, 0, -1):
          cache = caches[f"cache{layer_num}"]
          w = self.params[f"W{layer_num}"]
          affine_cache, bn_cache, relu_cache, dropout_cache = cache
          # Backprop through dropout first, since it was the last op of this
          # block in the forward pass.
          if self.use_dropout:
            dx = dropout_backward(dx, dropout_cache)
          temp_dout = relu_backward(dx, relu_cache)
          
          if self.normalization == "batchnorm":
            temp_dout, dgamma, dbeta = batchnorm_backward_alt(temp_dout, bn_cache)
          
          dx, dw, db = affine_backward(temp_dout, affine_cache)

          grads[f"W{layer_num}"] = dw + (self.reg * self.params[f"W{layer_num}"])
          grads[f"b{layer_num}"] = db
          
          if self.normalization == "batchnorm":
            grads[f"gamma{layer_num}"] = dgamma
            grads[f"beta{layer_num}"] = dbeta
          
          loss += 0.5 * self.reg * (np.sum(w * w))
        
        
        

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for dropout_keep_ratio in [1, 0.75, 0.5]:
    print('Running check with dropout = ', dropout_keep_ratio)
    model = FullyConnectedNet(
        [H1, H2],
        input_dim=D,
        num_classes=C,
        weight_scale=5e-2,
        dtype=np.float64,
        dropout_keep_ratio=dropout_keep_ratio,
        seed=123
    )

    loss, grads = model.loss(X, y)
    print('Initial loss: ', loss)

    # Relative errors should be around e-6 or less.
    # Note that it's fine if, for dropout_keep_ratio=1, the W2 error is on the order of e-5.
    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
        print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
    print()
Running check with dropout =  1
Initial loss:  2.300479079789529
W1 relative error: 4.04e-07
W2 relative error: 2.05e-04
W3 relative error: 3.77e-07
b1 relative error: 9.62e-09
b2 relative error: 1.76e-08
b3 relative error: 1.01e-08

Running check with dropout =  0.75
Initial loss:  2.301648205784293
W1 relative error: 3.42e-07
W2 relative error: 1.50e-06
W3 relative error: 1.45e-07
b1 relative error: 1.08e-08
b2 relative error: 9.89e-09
b3 relative error: 1.01e-08

Running check with dropout =  0.5
Initial loss:  2.294963248051904
W1 relative error: 1.18e-07
W2 relative error: 5.60e-07
W3 relative error: 8.95e-07
b1 relative error: 1.07e-08
b2 relative error: 3.03e-08
b3 relative error: 1.00e-08

Regularization Experiment

As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a keep probability of 0.25. We will then visualize the training and validation accuracies of the two networks over time.

# Train two identical nets, one with dropout and one without.
np.random.seed(231)
num_train = 500
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

solvers = {}
dropout_choices = [1, 0.25]
for dropout_keep_ratio in dropout_choices:
    model = FullyConnectedNet(
        [500],
        dropout_keep_ratio=dropout_keep_ratio
    )
    print(dropout_keep_ratio)

    solver = Solver(
        model,
        small_data,
        num_epochs=25,
        batch_size=100,
        update_rule='adam',
        optim_config={'learning_rate': 5e-4,},
        verbose=True,
        print_every=100
    )
    solver.train()
    solvers[dropout_keep_ratio] = solver
    print()
1
(Iteration 1 / 125) loss: 7.814297
(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000
(Epoch 1 / 25) train acc: 0.416000; val_acc: 0.258000
(Epoch 2 / 25) train acc: 0.482000; val_acc: 0.276000
(Epoch 3 / 25) train acc: 0.532000; val_acc: 0.277000
(Epoch 4 / 25) train acc: 0.600000; val_acc: 0.271000
(Epoch 5 / 25) train acc: 0.708000; val_acc: 0.299000
(Epoch 6 / 25) train acc: 0.722000; val_acc: 0.282000
(Epoch 7 / 25) train acc: 0.832000; val_acc: 0.255000
(Epoch 8 / 25) train acc: 0.880000; val_acc: 0.268000
(Epoch 9 / 25) train acc: 0.902000; val_acc: 0.277000
(Epoch 10 / 25) train acc: 0.898000; val_acc: 0.261000
(Epoch 11 / 25) train acc: 0.924000; val_acc: 0.263000
(Epoch 12 / 25) train acc: 0.960000; val_acc: 0.300000
(Epoch 13 / 25) train acc: 0.972000; val_acc: 0.314000
(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.310000
(Epoch 15 / 25) train acc: 0.974000; val_acc: 0.314000
(Epoch 16 / 25) train acc: 0.994000; val_acc: 0.303000
(Epoch 17 / 25) train acc: 0.972000; val_acc: 0.303000
(Epoch 18 / 25) train acc: 0.992000; val_acc: 0.312000
(Epoch 19 / 25) train acc: 0.992000; val_acc: 0.310000
(Epoch 20 / 25) train acc: 0.990000; val_acc: 0.288000
(Iteration 101 / 125) loss: 0.001350
(Epoch 21 / 25) train acc: 0.996000; val_acc: 0.291000
(Epoch 22 / 25) train acc: 0.998000; val_acc: 0.303000
(Epoch 23 / 25) train acc: 0.998000; val_acc: 0.309000
(Epoch 24 / 25) train acc: 0.996000; val_acc: 0.319000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.309000

0.25
(Iteration 1 / 125) loss: 9.910658
(Epoch 0 / 25) train acc: 0.264000; val_acc: 0.174000
(Epoch 1 / 25) train acc: 0.388000; val_acc: 0.257000
(Epoch 2 / 25) train acc: 0.500000; val_acc: 0.238000
(Epoch 3 / 25) train acc: 0.614000; val_acc: 0.248000
(Epoch 4 / 25) train acc: 0.704000; val_acc: 0.281000
(Epoch 5 / 25) train acc: 0.754000; val_acc: 0.251000
(Epoch 6 / 25) train acc: 0.820000; val_acc: 0.281000
(Epoch 7 / 25) train acc: 0.872000; val_acc: 0.266000
(Epoch 8 / 25) train acc: 0.896000; val_acc: 0.303000
(Epoch 9 / 25) train acc: 0.914000; val_acc: 0.316000
(Epoch 10 / 25) train acc: 0.912000; val_acc: 0.292000
(Epoch 11 / 25) train acc: 0.916000; val_acc: 0.299000
(Epoch 12 / 25) train acc: 0.936000; val_acc: 0.306000
(Epoch 13 / 25) train acc: 0.934000; val_acc: 0.294000
(Epoch 14 / 25) train acc: 0.952000; val_acc: 0.292000
(Epoch 15 / 25) train acc: 0.958000; val_acc: 0.292000
(Epoch 16 / 25) train acc: 0.978000; val_acc: 0.278000
(Epoch 17 / 25) train acc: 0.984000; val_acc: 0.297000
(Epoch 18 / 25) train acc: 0.990000; val_acc: 0.279000
(Epoch 19 / 25) train acc: 0.980000; val_acc: 0.288000
(Epoch 20 / 25) train acc: 0.944000; val_acc: 0.263000
(Iteration 101 / 125) loss: 0.480498
(Epoch 21 / 25) train acc: 0.978000; val_acc: 0.293000
(Epoch 22 / 25) train acc: 0.984000; val_acc: 0.294000
(Epoch 23 / 25) train acc: 0.990000; val_acc: 0.289000
(Epoch 24 / 25) train acc: 0.988000; val_acc: 0.293000
(Epoch 25 / 25) train acc: 0.988000; val_acc: 0.287000
# Plot train and validation accuracies of the two models.
train_accs = []
val_accs = []
for dropout_keep_ratio in dropout_choices:
    solver = solvers[dropout_keep_ratio]
    train_accs.append(solver.train_acc_history[-1])
    val_accs.append(solver.val_acc_history[-1])

plt.subplot(3, 1, 1)
for dropout_keep_ratio in dropout_choices:
    plt.plot(
        solvers[dropout_keep_ratio].train_acc_history, 'o', label='%.2f dropout_keep_ratio' % dropout_keep_ratio)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')
  
plt.subplot(3, 1, 2)
for dropout_keep_ratio in dropout_choices:
    plt.plot(
        solvers[dropout_keep_ratio].val_acc_history, 'o', label='%.2f dropout_keep_ratio' % dropout_keep_ratio)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

plt.gcf().set_size_inches(15, 15)
plt.show()
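
To put numbers on what the plots show, the final accuracies and the train/val gap can also be printed directly (a small addition of mine, reusing the solvers dictionary built above):

# Summarize the final train/val gap for each dropout setting.
for dropout_keep_ratio in dropout_choices:
    solver = solvers[dropout_keep_ratio]
    train_acc = solver.train_acc_history[-1]
    val_acc = solver.val_acc_history[-1]
    print('keep ratio %.2f -> train %.3f, val %.3f, gap %.3f' % (
        dropout_keep_ratio, train_acc, val_acc, train_acc - val_acc))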

Inline Question 2:

Compare the validation and training accuracies with and without dropout – what do your results suggest about dropout as a regularizer?

Answer:

[FILL THIS IN]