Fully Connected Networks | cs231n - A2-Q1

Python
cs231n
numpy
Fully Connected Networks
matplotlib
cross validation
grid search
cifar-10
Deep Learning
Computer Vision
Author

Emre Kara

Published

May 4, 2023

CS231N

This course is a deep dive into the details of deep learning architectures, with a focus on learning end-to-end models for visual recognition tasks, particularly image classification.

This page contains my solutions and approaches for the assignment. All source code for my solutions is available on GitHub.

Multi-Layer Fully Connected Network

In this exercise, you will implement a fully connected network with an arbitrary number of hidden layers.

Read through the FullyConnectedNet class in the file cs231n/classifiers/fc_net.py.

Implement the network initialization, forward pass, and backward pass. Throughout this assignment, you will be implementing layers in cs231n/layers.py. You can re-use your implementations for affine_forward, affine_backward, relu_forward, relu_backward, and softmax_loss from Assignment 1. For right now, don’t worry about implementing dropout or batch/layer normalization yet, as you will add those features later.
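The repeated affine → ReLU pattern is usually factored into "sandwich" helpers in cs231n/layer_utils.py. As a reference point, here is a minimal sketch of how such helpers are commonly written, assuming the Assignment 1 signatures for affine_forward/affine_backward and relu_forward/relu_backward; it is an illustration, not necessarily the exact code shipped with the assignment.

def affine_relu_forward(x, w, b):
    """Convenience layer: an affine transform followed by a ReLU."""
    a, fc_cache = affine_forward(x, w, b)    # a = x.reshape(N, -1).dot(w) + b
    out, relu_cache = relu_forward(a)        # out = np.maximum(0, a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    """Backward pass for the affine-ReLU convenience layer."""
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)     # zero the gradient where the ReLU was inactive
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

My implementation below calls the individual layer functions directly instead of the sandwich helpers, which makes it easier to slot batch normalization (and later dropout) between the affine and ReLU stages.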

from builtins import range
from builtins import object
import numpy as np

from ..layers import *
from ..layer_utils import *


class FullyConnectedNet(object):
    """Class for a multi-layer fully connected neural network.

    Network contains an arbitrary number of hidden layers, ReLU nonlinearities,
    and a softmax loss function. This will also implement dropout and batch/layer
    normalization as options. For a network with L layers, the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional and the {...} block is
    repeated L - 1 times.

    Learnable parameters are stored in the self.params dictionary and will be learned
    using the Solver class.
    """

    def __init__(
        self,
        hidden_dims,
        input_dim=3 * 32 * 32,
        num_classes=10,
        dropout_keep_ratio=1,
        normalization=None,
        reg=0.0,
        weight_scale=1e-2,
        dtype=np.float32,
        seed=None,
    ):
        """Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout_keep_ratio: Scalar between 0 and 1 giving dropout strength.
            If dropout_keep_ratio=1 then the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
            are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
            initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
            this datatype. float32 is faster but less accurate, so you should use
            float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
            This will make the dropout layers deterministic so we can gradient check the model.
        """
        self.normalization = normalization
        self.use_dropout = dropout_keep_ratio != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
        verbose = False
        if verbose: print('num_layers:', self.num_layers, '\n')
        len_hidden_dims = len(hidden_dims)
        for layer_num in range(1, self.num_layers+1):
          number_of_nodes = None
          if verbose: print('layer_num:', layer_num)
          if layer_num == 1: # First Layer
            if verbose: print('\tfirst_layer')
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (input_dim, hidden_dims[0]))
            self.params[f"b{layer_num}"] = np.zeros(hidden_dims[0], )
            if self.normalization == "batchnorm":
              self.params[f"gamma{layer_num}"] = np.ones((hidden_dims[0], ))
              self.params[f"beta{layer_num}"] = np.zeros((hidden_dims[0], ))
          elif layer_num == self.num_layers: #Last Layer
            if verbose: print('\tlast_layer')
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (hidden_dims[-1], num_classes))
            self.params[f"b{layer_num}"] = np.zeros(num_classes, )
          else: # Hidden Layers
            if verbose: print('\thidden_layer')
            hidden_dim_curr = hidden_dims[layer_num-2]
            hidden_dim_next = hidden_dims[layer_num-1]
            self.params[f"W{layer_num}"] = np.random.normal(0.0, weight_scale, (hidden_dim_curr, hidden_dim_next))
            self.params[f"b{layer_num}"] = np.zeros(hidden_dim_next, )
            if self.normalization == "batchnorm":
              self.params[f"gamma{layer_num}"] = np.ones((hidden_dim_next, ))
              self.params[f"beta{layer_num}"] = np.zeros((hidden_dim_next, ))
            
          if verbose: 
            print(f"\tW{layer_num}:", self.params[f"W{layer_num}"].shape)
            print(f"\tb{layer_num}:", self.params[f"b{layer_num}"].shape)
            if f"gamma{layer_num}" in self.params:
              print(f"\tgamma{layer_num}:", self.params[f"gamma{layer_num}"].shape)
              print(f"\tbeta{layer_num}:", self.params[f"beta{layer_num}"].shape)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
            if seed is not None:
                self.dropout_param["seed"] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == "batchnorm":
            self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
        if self.normalization == "layernorm":
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype.
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """Compute loss and gradient for the fully connected net.
        
        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
            scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
            names to gradients of the loss with respect to those parameters.
        """
        X = X.astype(self.dtype)
        mode = "test" if y is None else "train"

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param["mode"] = mode
        if self.normalization == "batchnorm":
            for bn_param in self.bn_params:
                bn_param["mode"] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        affine_cache = None
        bn_cache = None
        relu_cache = None
        dropout_cache = None

        caches = {}
        input_data = X
        for layer_num in range(1, self.num_layers):
          weights = self.params[f"W{layer_num}"]
          biases = self.params[f"b{layer_num}"]
          temp_out, affine_cache = affine_forward(input_data, weights, biases)
          # batch/layer normalization (only batchnorm is handled in this notebook)
          if self.normalization == "batchnorm":
            x = temp_out
            gamma = self.params[f"gamma{layer_num}"]
            beta = self.params[f"beta{layer_num}"]
            bn_param = self.bn_params[layer_num-1]
            temp_out, bn_cache = batchnorm_forward(x, gamma, beta, bn_param)
          relu_out, relu_cache = relu_forward(temp_out)
          # dropout: not applied here; it will be added in the Dropout notebook
          input_data = relu_out
          cache = (affine_cache, bn_cache, relu_cache, dropout_cache) 
          caches[f"cache{layer_num}"] = cache
        
        layer_num = self.num_layers
        weights = self.params[f"W{layer_num}"]
        biases = self.params[f"b{layer_num}"]
        affine_out, affine_cache = affine_forward(input_data, weights, biases)
        caches[f"cache{layer_num}"] = affine_cache
        scores = affine_out

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early.
        if mode == "test":
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        loss, dout = softmax_loss(scores, y)
        
        layer_num = self.num_layers

        w = self.params[f"W{layer_num}"]
        cache = caches[f"cache{layer_num}"]
        dx, dw, db = affine_backward(dout, cache)
        grads[f"W{layer_num}"] = dw + (self.reg * w)
        grads[f"b{layer_num}"] = db
        loss += 0.5 * self.reg * (np.sum(w * w))

        for layer_num in range(self.num_layers-1, 0, -1):
          cache = caches[f"cache{layer_num}"]
          w = self.params[f"W{layer_num}"]
          affine_cache, bn_cache, relu_cache, dropout_cache = cache
          temp_dout = relu_backward(dx, relu_cache)
          
          if self.normalization == "batchnorm":
            temp_dout, dgamma, dbeta = batchnorm_backward_alt(temp_dout, bn_cache)
          
          dx, dw, db = affine_backward(temp_dout, affine_cache)

          grads[f"W{layer_num}"] = dw + (self.reg * self.params[f"W{layer_num}"])
          grads[f"b{layer_num}"] = db
          
          if self.normalization == "batchnorm":
            grads[f"gamma{layer_num}"] = dgamma
            grads[f"beta{layer_num}"] = dbeta
          
          loss += 0.5 * self.reg * (np.sum(w * w))
        
        
        

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """Returns relative error."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")
X_train: (49000, 3, 32, 32)
y_train: (49000,)
X_val: (1000, 3, 32, 32)
y_val: (1000,)
X_test: (1000, 3, 32, 32)
y_test: (1000,)

Initial Loss and Gradient Check

As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. This is a good way to see if the initial losses seem reasonable.

For gradient checking, you should expect to see errors around 1e-7 or less.
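As background, the numerical gradients come from a centered-difference approximation, which is what eval_numerical_gradient computes coordinate by coordinate. The sketch below is a simplified stand-in to show the idea, not the exact implementation in cs231n/gradient_check.py.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx for a scalar-valued function f."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old_value = x[idx]
        x[idx] = old_value + h
        fxph = f(x)                     # f(x + h) along this coordinate
        x[idx] = old_value - h
        fxmh = f(x)                     # f(x - h)
        x[idx] = old_value              # restore the original value
        grad[idx] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

rel_error then compares this estimate against the analytic gradient returned by loss, so values around 1e-7 indicate that the backward pass is consistent with the forward pass.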

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for reg in [0, 3.14]:
    print("Running check with reg = ", reg)
    model = FullyConnectedNet(
        [H1, H2],
        input_dim=D,
        num_classes=C,
        reg=reg,
        weight_scale=5e-2,
        dtype=np.float64
    )

    loss, grads = model.loss(X, y)
    print("Initial loss: ", loss)

    # Most of the errors should be on the order of e-7 or smaller.   
    # NOTE: It is fine however to see an error for W2 on the order of e-5
    # for the check when reg = 0.0
    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
        print(f"{name} relative error: {rel_error(grad_num, grads[name])}")
Running check with reg =  0
Initial loss:  2.3004790897684924
W1 relative error: 7.696803870986541e-08
W2 relative error: 1.7087519140575808e-05
W3 relative error: 2.9508423118300657e-07
b1 relative error: 4.660094650186831e-09
b2 relative error: 2.085654124402131e-09
b3 relative error: 6.598642296022133e-11
Running check with reg =  3.14
Initial loss:  7.052114776533016
W1 relative error: 7.355058816898759e-09
W2 relative error: 6.86942277940646e-08
W3 relative error: 3.483989277228501e-08
b1 relative error: 1.1683196894962977e-08
b2 relative error: 1.7223751746766738e-09
b3 relative error: 2.86824680346369e-10

As another sanity check, make sure your network can overfit on a small dataset of 50 images. First, we will try a three-layer network with 100 units in each hidden layer. In the following cell, tweak the learning rate and weight initialization scale to overfit and achieve 100% training accuracy within 20 epochs.

# TODO: Use a three-layer Net to overfit 50 training examples by 
# tweaking just the learning rate and initialization scale.

num_train = 50
small_data = {
  "X_train": data["X_train"][:num_train],
  "y_train": data["y_train"][:num_train],
  "X_val": data["X_val"],
  "y_val": data["y_val"],
}

weight_scale = 1e-1   # Experiment with this!
learning_rate = 1e-4  # Experiment with this!
model = FullyConnectedNet(
    [100, 100],
    weight_scale=weight_scale,
    dtype=np.float64
)
solver = Solver(
    model,
    small_data,
    print_every=10,
    num_epochs=20,
    batch_size=25,
    update_rule="sgd",
    optim_config={"learning_rate": learning_rate},
)
solver.train()

plt.plot(solver.loss_history)
plt.title("Training loss history")
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.grid(linestyle='--', linewidth=0.5)
plt.show()
/content/drive/My Drive/Colab Notebooks/cs231n/assignments/assignment2/cs231n/layers.py:824: RuntimeWarning: divide by zero encountered in log
  loss =  - np.sum(np.log(correct_class_probs))
(Iteration 1 / 40) loss: inf
(Epoch 0 / 20) train acc: 0.020000; val_acc: 0.110000
(Epoch 1 / 20) train acc: 0.040000; val_acc: 0.112000
(Epoch 2 / 20) train acc: 0.180000; val_acc: 0.108000
(Epoch 3 / 20) train acc: 0.300000; val_acc: 0.144000
(Epoch 4 / 20) train acc: 0.300000; val_acc: 0.135000
(Epoch 5 / 20) train acc: 0.420000; val_acc: 0.157000
(Iteration 11 / 40) loss: 31.172835
(Epoch 6 / 20) train acc: 0.540000; val_acc: 0.153000
(Epoch 7 / 20) train acc: 0.560000; val_acc: 0.146000
(Epoch 8 / 20) train acc: 0.640000; val_acc: 0.147000
(Epoch 9 / 20) train acc: 0.680000; val_acc: 0.156000
(Epoch 10 / 20) train acc: 0.740000; val_acc: 0.153000
(Iteration 21 / 40) loss: 24.023362
(Epoch 11 / 20) train acc: 0.780000; val_acc: 0.152000
(Epoch 12 / 20) train acc: 0.820000; val_acc: 0.147000
(Epoch 13 / 20) train acc: 0.920000; val_acc: 0.143000
(Epoch 14 / 20) train acc: 0.920000; val_acc: 0.140000
(Epoch 15 / 20) train acc: 0.960000; val_acc: 0.138000
(Iteration 31 / 40) loss: 0.030175
(Epoch 16 / 20) train acc: 0.980000; val_acc: 0.141000
(Epoch 17 / 20) train acc: 1.000000; val_acc: 0.145000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.145000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.145000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.145000

Now, try to use a five-layer network with 100 units on each layer to overfit on 50 training examples. Again, you will have to adjust the learning rate and weight initialization scale, but you should be able to achieve 100% training accuracy within 20 epochs.

# TODO: Use a five-layer Net to overfit 50 training examples by 
# tweaking just the learning rate and initialization scale.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

learning_rate = 2e-3  # Experiment with this!
weight_scale = 1e-1   # Experiment with this!
model = FullyConnectedNet(
    [100, 100, 100, 100],
    weight_scale=weight_scale,
    dtype=np.float64
)
solver = Solver(
    model,
    small_data,
    print_every=10,
    num_epochs=20,
    batch_size=25,
    update_rule='sgd',
    optim_config={'learning_rate': learning_rate},
)
solver.train()

plt.plot(solver.loss_history)
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.grid(linestyle='--', linewidth=0.5)
plt.show()
(Iteration 1 / 40) loss: 166.501707
(Epoch 0 / 20) train acc: 0.100000; val_acc: 0.107000
(Epoch 1 / 20) train acc: 0.320000; val_acc: 0.101000
(Epoch 2 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 3 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 4 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 5 / 20) train acc: 0.080000; val_acc: 0.087000
(Iteration 11 / 40) loss: nan
(Epoch 6 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 7 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 8 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 9 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 10 / 20) train acc: 0.080000; val_acc: 0.087000
(Iteration 21 / 40) loss: nan
(Epoch 11 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 12 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 13 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 14 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 15 / 20) train acc: 0.080000; val_acc: 0.087000
(Iteration 31 / 40) loss: nan
(Epoch 16 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 17 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 18 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 19 / 20) train acc: 0.080000; val_acc: 0.087000
(Epoch 20 / 20) train acc: 0.080000; val_acc: 0.087000
/content/drive/My Drive/Colab Notebooks/cs231n/assignments/assignment2/cs231n/layers.py:821: RuntimeWarning: overflow encountered in exp
  exps = np.exp(x)
/content/drive/My Drive/Colab Notebooks/cs231n/assignments/assignment2/cs231n/layers.py:822: RuntimeWarning: invalid value encountered in true_divide
  probs = exps / np.sum(exps, axis=-1, keepdims=True)

Inline Question 1:

Did you notice anything about the comparative difficulty of training the three-layer network vs. training the five-layer network? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?

Answer:

[FILL THIS IN]

Update rules

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

SGD+Momentum

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent. See the Momentum Update section at http://cs231n.github.io/neural-networks-3/#sgd for more information.

Open the file cs231n/optim.py and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function sgd_momentum and run the following to check your implementation. You should see errors less than e-8.

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("momentum", 0.9)
    v = config.get("velocity", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    v = (config['momentum'] * v) - (config['learning_rate'] * dw)
    next_w = w+v 

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config["velocity"] = v

    return next_w, config
from cs231n.optim import sgd_momentum

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {"learning_rate": 1e-3, "velocity": v}
next_w, _ = sgd_momentum(w, dw, config=config)

expected_next_w = np.asarray([
  [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
  [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
  [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
  [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
expected_velocity = np.asarray([
  [ 0.5406,      0.55475789,  0.56891579, 0.58307368,  0.59723158],
  [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
  [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
  [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

# Should see relative errors around e-8 or less
print("next_w error: ", rel_error(next_w, expected_next_w))
print("velocity error: ", rel_error(expected_velocity, config["velocity"]))
next_w error:  8.882347033505819e-09
velocity error:  4.269287743278663e-09

Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster.

num_train = 4000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

solvers = {}

for update_rule in ['sgd', 'sgd_momentum']:
    print('Running with ', update_rule)
    model = FullyConnectedNet(
        [100, 100, 100, 100, 100],
        weight_scale=5e-2
    )

    solver = Solver(
        model,
        small_data,
        num_epochs=5,
        batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': 5e-3},
        verbose=True,
    )
    solvers[update_rule] = solver
    solver.train()

fig, axes = plt.subplots(3, 1, figsize=(15, 15))

axes[0].set_title('Training loss')
axes[0].set_xlabel('Iteration')
axes[1].set_title('Training accuracy')
axes[1].set_xlabel('Epoch')
axes[2].set_title('Validation accuracy')
axes[2].set_xlabel('Epoch')

for update_rule, solver in solvers.items():
    axes[0].plot(solver.loss_history, label=f"loss_{update_rule}")
    axes[1].plot(solver.train_acc_history, label=f"train_acc_{update_rule}")
    axes[2].plot(solver.val_acc_history, label=f"val_acc_{update_rule}")
    
for ax in axes:
    ax.legend(loc="best", ncol=4)
    ax.grid(linestyle='--', linewidth=0.5)

plt.show()
Running with  sgd
(Iteration 1 / 200) loss: 2.559978
(Epoch 0 / 5) train acc: 0.104000; val_acc: 0.107000
(Iteration 11 / 200) loss: 2.356069
(Iteration 21 / 200) loss: 2.214091
(Iteration 31 / 200) loss: 2.205928
(Epoch 1 / 5) train acc: 0.225000; val_acc: 0.193000
(Iteration 41 / 200) loss: 2.132095
(Iteration 51 / 200) loss: 2.118950
(Iteration 61 / 200) loss: 2.116443
(Iteration 71 / 200) loss: 2.132549
(Epoch 2 / 5) train acc: 0.298000; val_acc: 0.260000
(Iteration 81 / 200) loss: 1.977227
(Iteration 91 / 200) loss: 2.007528
(Iteration 101 / 200) loss: 2.004762
(Iteration 111 / 200) loss: 1.885342
(Epoch 3 / 5) train acc: 0.343000; val_acc: 0.287000
(Iteration 121 / 200) loss: 1.891516
(Iteration 131 / 200) loss: 1.923677
(Iteration 141 / 200) loss: 1.957744
(Iteration 151 / 200) loss: 1.966736
(Epoch 4 / 5) train acc: 0.322000; val_acc: 0.305000
(Iteration 161 / 200) loss: 1.801483
(Iteration 171 / 200) loss: 1.973779
(Iteration 181 / 200) loss: 1.666572
(Iteration 191 / 200) loss: 1.909494
(Epoch 5 / 5) train acc: 0.372000; val_acc: 0.319000
Running with  sgd_momentum
(Iteration 1 / 200) loss: 3.153778
(Epoch 0 / 5) train acc: 0.099000; val_acc: 0.088000
(Iteration 11 / 200) loss: 2.227203
(Iteration 21 / 200) loss: 2.125706
(Iteration 31 / 200) loss: 1.932679
(Epoch 1 / 5) train acc: 0.308000; val_acc: 0.258000
(Iteration 41 / 200) loss: 1.946330
(Iteration 51 / 200) loss: 1.781856
(Iteration 61 / 200) loss: 1.757563
(Iteration 71 / 200) loss: 1.853951
(Epoch 2 / 5) train acc: 0.385000; val_acc: 0.329000
(Iteration 81 / 200) loss: 2.020635
(Iteration 91 / 200) loss: 1.688374
(Iteration 101 / 200) loss: 1.492405
(Iteration 111 / 200) loss: 1.399368
(Epoch 3 / 5) train acc: 0.462000; val_acc: 0.345000
(Iteration 121 / 200) loss: 1.691196
(Iteration 131 / 200) loss: 1.545283
(Iteration 141 / 200) loss: 1.609280
(Iteration 151 / 200) loss: 1.704335
(Epoch 4 / 5) train acc: 0.475000; val_acc: 0.347000
(Iteration 161 / 200) loss: 1.490124
(Iteration 171 / 200) loss: 1.407966
(Iteration 181 / 200) loss: 1.362262
(Iteration 191 / 200) loss: 1.314095
(Epoch 5 / 5) train acc: 0.525000; val_acc: 0.363000

RMSProp and Adam

RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.

In the file cs231n/optim.py, implement the RMSProp update rule in the rmsprop function and implement the Adam update rule in the adam function, and check your implementations using the tests below.

NOTE: Please implement the complete Adam update rule (with the bias correction mechanism), not the first simplified version mentioned in the course notes.

[1] Tijmen Tieleman and Geoffrey Hinton. “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, ICLR 2015.
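To make the bias-correction step concrete, here is a minimal self-contained sketch of a single Adam update in the standard formulation; the adam function below performs the same steps, but reads and writes its state through the config dictionary.

import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; returns the new weights and updated state."""
    t += 1
    m = beta1 * m + (1 - beta1) * dw           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * dw**2        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    next_w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return next_w, m, v, t

Without the two hat terms, the first few updates would be biased toward zero because m and v start at zero.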

def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("decay_rate", 0.99)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("cache", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])
    config['cache'] = cache

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config


def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-3)
    config.setdefault("beta1", 0.9)
    config.setdefault("beta2", 0.999)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("m", np.zeros_like(w))
    config.setdefault("v", np.zeros_like(w))
    config.setdefault("t", 0)

    next_w = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in #
    # the next_w variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    #                                                                         #
    # NOTE: In order to match the reference output, please modify t _before_  #
    # using it in any calculations.                                           #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    config['t'] += 1

    m = config['beta1']*config['m'] + (1-config['beta1'])*dw
    mt = m / (1-config['beta1']**config['t'])
    v = config['beta2']*config['v'] + (1-config['beta2'])*(dw**2)
    vt = v / (1-config['beta2']**config['t'])
    next_w = w  + (-config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon']))

    config['m'] = m
    config['v'] = v

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config
# Test RMSProp implementation
from cs231n.optim import rmsprop

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'cache': cache}
next_w, _ = rmsprop(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
  [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
  [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
  [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
expected_cache = np.asarray([
  [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
  [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
  [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
  [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

# You should see relative errors around e-7 or less
print('next_w error: ', rel_error(expected_next_w, next_w))
print('cache error: ', rel_error(expected_cache, config['cache']))
next_w error:  9.524687511038133e-08
cache error:  2.6477955807156126e-09
# Test Adam implementation
from cs231n.optim import adam

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}
next_w, _ = adam(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
  [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929],
  [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
  [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
expected_v = np.asarray([
  [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
  [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
  [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
  [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
expected_m = np.asarray([
  [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
  [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
  [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
  [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

# You should see relative errors around e-7 or less
print('next_w error: ', rel_error(expected_next_w, next_w))
print('v error: ', rel_error(expected_v, config['v']))
print('m error: ', rel_error(expected_m, config['m']))
next_w error:  1.1395691798535431e-07
v error:  4.208314038113071e-09
m error:  4.214963193114416e-09

Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:

learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}
for update_rule in ['adam', 'rmsprop']:
    print('Running with ', update_rule)
    model = FullyConnectedNet(
        [100, 100, 100, 100, 100],
        weight_scale=5e-2
    )
    solver = Solver(
        model,
        small_data,
        num_epochs=5,
        batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': learning_rates[update_rule]},
        verbose=True
    )
    solvers[update_rule] = solver
    solver.train()
    print()
    
fig, axes = plt.subplots(3, 1, figsize=(15, 15))

axes[0].set_title('Training loss')
axes[0].set_xlabel('Iteration')
axes[1].set_title('Training accuracy')
axes[1].set_xlabel('Epoch')
axes[2].set_title('Validation accuracy')
axes[2].set_xlabel('Epoch')

for update_rule, solver in solvers.items():
    axes[0].plot(solver.loss_history, label=f"{update_rule}")
    axes[1].plot(solver.train_acc_history, label=f"{update_rule}")
    axes[2].plot(solver.val_acc_history, label=f"{update_rule}")
    
for ax in axes:
    ax.legend(loc='best', ncol=4)
    ax.grid(linestyle='--', linewidth=0.5)

plt.show()
Running with  adam
(Iteration 1 / 200) loss: 3.476928
(Epoch 0 / 5) train acc: 0.126000; val_acc: 0.110000
(Iteration 11 / 200) loss: 2.027712
(Iteration 21 / 200) loss: 2.183358
(Iteration 31 / 200) loss: 1.744257
(Epoch 1 / 5) train acc: 0.363000; val_acc: 0.330000
(Iteration 41 / 200) loss: 1.707951
(Iteration 51 / 200) loss: 1.703835
(Iteration 61 / 200) loss: 2.094758
(Iteration 71 / 200) loss: 1.505558
(Epoch 2 / 5) train acc: 0.419000; val_acc: 0.362000
(Iteration 81 / 200) loss: 1.594431
(Iteration 91 / 200) loss: 1.511239
(Iteration 101 / 200) loss: 1.393552
(Iteration 111 / 200) loss: 1.433278
(Epoch 3 / 5) train acc: 0.481000; val_acc: 0.372000
(Iteration 121 / 200) loss: 1.193409
(Iteration 131 / 200) loss: 1.455664
(Iteration 141 / 200) loss: 1.352905
(Iteration 151 / 200) loss: 1.275835
(Epoch 4 / 5) train acc: 0.563000; val_acc: 0.373000
(Iteration 161 / 200) loss: 1.336326
(Iteration 171 / 200) loss: 1.429705
(Iteration 181 / 200) loss: 1.131360
(Iteration 191 / 200) loss: 1.164589
(Epoch 5 / 5) train acc: 0.582000; val_acc: 0.399000

Running with  rmsprop
(Iteration 1 / 200) loss: 2.589166
(Epoch 0 / 5) train acc: 0.119000; val_acc: 0.146000
(Iteration 11 / 200) loss: 2.032921
(Iteration 21 / 200) loss: 1.897278
(Iteration 31 / 200) loss: 1.770793
(Epoch 1 / 5) train acc: 0.381000; val_acc: 0.320000
(Iteration 41 / 200) loss: 1.895731
(Iteration 51 / 200) loss: 1.681091
(Iteration 61 / 200) loss: 1.487204
(Iteration 71 / 200) loss: 1.629973
(Epoch 2 / 5) train acc: 0.429000; val_acc: 0.350000
(Iteration 81 / 200) loss: 1.506686
(Iteration 91 / 200) loss: 1.610742
(Iteration 101 / 200) loss: 1.486124
(Iteration 111 / 200) loss: 1.559454
(Epoch 3 / 5) train acc: 0.492000; val_acc: 0.359000
(Iteration 121 / 200) loss: 1.496860
(Iteration 131 / 200) loss: 1.531552
(Iteration 141 / 200) loss: 1.550195
(Iteration 151 / 200) loss: 1.657838
(Epoch 4 / 5) train acc: 0.533000; val_acc: 0.354000
(Iteration 161 / 200) loss: 1.603105
(Iteration 171 / 200) loss: 1.405372
(Iteration 181 / 200) loss: 1.503740
(Iteration 191 / 200) loss: 1.385278
(Epoch 5 / 5) train acc: 0.531000; val_acc: 0.374000

Inline Question 2:

AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)

John notices that when he was training a network with AdaGrad, the updates became very small and his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

Answer:

[FILL THIS IN]

Train a Good Model!

Train the best fully connected model that you can on CIFAR-10, storing your best model in the best_model variable. We require you to get at least 50% accuracy on the validation set using a fully connected network.

If you are careful it should be possible to get accuracies above 55%, but we don’t require it for this part and won’t assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional networks rather than fully connected networks.

Note: You might find it useful to complete the BatchNormalization.ipynb and Dropout.ipynb notebooks before completing this part, since those techniques can help you train powerful models.
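One note on the search strategy: instead of the exhaustive grid over np.logspace values used below, hyperparameters are often sampled at random on a log scale, which covers a wide range with fewer runs. A small illustrative sketch (the sample_loguniform helper is my own, not part of the assignment code):

import numpy as np

def sample_loguniform(low_exp, high_exp, size, rng=None):
    """Draw values uniformly in log10-space, e.g. learning rates between 1e-4 and 1e-2."""
    rng = np.random.default_rng() if rng is None else rng
    return 10.0 ** rng.uniform(low_exp, high_exp, size=size)

# Example: 10 random (learning_rate, weight_scale) candidates instead of a full grid.
candidate_lrs = sample_loguniform(-3.8, -2.8, 10)
candidate_wss = sample_loguniform(-2.0, -1.6, 10)
candidates = list(zip(candidate_lrs, candidate_wss))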

import itertools
best_model = None

################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #
# find batch/layer normalization and dropout useful. Store your best model in  #
# the best_model variable.                                                     #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

best_acc = -1
best_solver = None

#hidden_dims = [[100, 100, 100], [100,100,100,100]]
hidden_dims = [[85, 85, 80, 80]]
learning_rates = np.logspace(-3.8, -2.8, 8)
weight_scales = np.logspace(-2, -1.6, 4)
params = list(itertools.product(hidden_dims, learning_rates, weight_scales))

for hd, lr, ws in params:
  model = FullyConnectedNet(
      hidden_dims=hd,
      input_dim = 3 * 32 * 32,
      num_classes=10,
      normalization=None,
      reg=0.5,
      weight_scale=ws,
  )
  solver = Solver(
      model=model,
      data=small_data,
      update_rule='adam',
      num_epochs=5,
      batch_size=256,
      optim_config={'learning_rate': lr},
      verbose=False
  )
  solver.train()
  if solver.best_val_acc > best_acc:
      best_model = model
      best_solver = solver
      best_acc = solver.best_val_acc
      best_lr, best_ws, best_hd = (lr, ws, hd)
      print('best valid acc yet: {}   lr: {} ws: {} hd: {}'.format(best_acc, lr, ws, hd))

fig, axes = plt.subplots(3, 1, figsize=(15, 15))

axes[0].set_title('Training loss')
axes[0].set_xlabel('Iteration')
axes[1].set_title('Training accuracy')
axes[1].set_xlabel('Epoch')
axes[2].set_title('Validation accuracy')
axes[2].set_xlabel('Epoch')

axes[0].plot(best_solver.loss_history, label="best_model")
axes[1].plot(best_solver.train_acc_history, label="best_model")
axes[2].plot(best_solver.val_acc_history, label="best_model")
    
for ax in axes:
    ax.legend(loc='best', ncol=4)
    ax.grid(linestyle='--', linewidth=0.5)

plt.show()

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
best valid acc yet: 0.116   lr: 0.00015848931924611142 ws: 0.01 hd: [85, 85, 80, 80]
best valid acc yet: 0.15   lr: 0.00015848931924611142 ws: 0.025118864315095794 hd: [85, 85, 80, 80]
best valid acc yet: 0.195   lr: 0.0003059949687207196 ws: 0.025118864315095794 hd: [85, 85, 80, 80]
best valid acc yet: 0.199   lr: 0.000425178630338289 ws: 0.025118864315095794 hd: [85, 85, 80, 80]

print(best_lr)
print(best_ws)
0.000425178630338289
0.025118864315095794
print('Engaged with Training Data. Training Using Best Hyperparameters...')

best_lr = 8.208914e-04
best_ws = 1.847850e-02

best_model = FullyConnectedNet([85, 85, 80, 80], weight_scale=best_ws,
                                  dtype=np.float64)

solver = Solver(best_model, data,
                num_epochs=5, batch_size=200,
                update_rule='adam',
                optim_config={
                'learning_rate': best_lr,
                },
                verbose=True)
solver.train()
print("BEST VALID ACC: %f" % solver.best_val_acc)
Engaged with Training Data. Training Using Best Hyperparameters...
(Iteration 1 / 1225) loss: 2.302049
(Epoch 0 / 5) train acc: 0.174000; val_acc: 0.132000
(Iteration 11 / 1225) loss: 2.163544
(Iteration 21 / 1225) loss: 1.935593
(Iteration 31 / 1225) loss: 1.875530
(Iteration 41 / 1225) loss: 1.823461
(Iteration 51 / 1225) loss: 1.900976
(Iteration 61 / 1225) loss: 1.839346
(Iteration 71 / 1225) loss: 1.730702
(Iteration 81 / 1225) loss: 1.746024
(Iteration 91 / 1225) loss: 1.680118
(Iteration 101 / 1225) loss: 1.608998
(Iteration 111 / 1225) loss: 1.700741
(Iteration 121 / 1225) loss: 1.687720
(Iteration 131 / 1225) loss: 1.549200
(Iteration 141 / 1225) loss: 1.616082
(Iteration 151 / 1225) loss: 1.608473
(Iteration 161 / 1225) loss: 1.674555
(Iteration 171 / 1225) loss: 1.673154
(Iteration 181 / 1225) loss: 1.589477
(Iteration 191 / 1225) loss: 1.516452
(Iteration 201 / 1225) loss: 1.698229
(Iteration 211 / 1225) loss: 1.540504
(Iteration 221 / 1225) loss: 1.671577
(Iteration 231 / 1225) loss: 1.523757
(Iteration 241 / 1225) loss: 1.628925
(Epoch 1 / 5) train acc: 0.450000; val_acc: 0.438000
(Iteration 251 / 1225) loss: 1.568312
(Iteration 261 / 1225) loss: 1.599241
(Iteration 271 / 1225) loss: 1.465892
(Iteration 281 / 1225) loss: 1.463275
(Iteration 291 / 1225) loss: 1.486027
(Iteration 301 / 1225) loss: 1.589153
(Iteration 311 / 1225) loss: 1.520976
(Iteration 321 / 1225) loss: 1.478719
(Iteration 331 / 1225) loss: 1.608898
(Iteration 341 / 1225) loss: 1.590311
(Iteration 351 / 1225) loss: 1.375467
(Iteration 361 / 1225) loss: 1.440377
(Iteration 371 / 1225) loss: 1.412832
(Iteration 381 / 1225) loss: 1.389054
(Iteration 391 / 1225) loss: 1.554265
(Iteration 401 / 1225) loss: 1.607085
(Iteration 411 / 1225) loss: 1.565354
(Iteration 421 / 1225) loss: 1.389241
(Iteration 431 / 1225) loss: 1.410696
(Iteration 441 / 1225) loss: 1.493157
(Iteration 451 / 1225) loss: 1.372041
(Iteration 461 / 1225) loss: 1.371411
(Iteration 471 / 1225) loss: 1.444722
(Iteration 481 / 1225) loss: 1.281208
(Epoch 2 / 5) train acc: 0.479000; val_acc: 0.488000
(Iteration 491 / 1225) loss: 1.196841
(Iteration 501 / 1225) loss: 1.383812
(Iteration 511 / 1225) loss: 1.441306
(Iteration 521 / 1225) loss: 1.302136
(Iteration 531 / 1225) loss: 1.288924
(Iteration 541 / 1225) loss: 1.312799
(Iteration 551 / 1225) loss: 1.299328
(Iteration 561 / 1225) loss: 1.355581
(Iteration 571 / 1225) loss: 1.458034
(Iteration 581 / 1225) loss: 1.281662
(Iteration 591 / 1225) loss: 1.511131
(Iteration 601 / 1225) loss: 1.291954
(Iteration 611 / 1225) loss: 1.446146
(Iteration 621 / 1225) loss: 1.343697
(Iteration 631 / 1225) loss: 1.551365
(Iteration 641 / 1225) loss: 1.306165
(Iteration 651 / 1225) loss: 1.477074
(Iteration 661 / 1225) loss: 1.173105
(Iteration 671 / 1225) loss: 1.397794
(Iteration 681 / 1225) loss: 1.237952
(Iteration 691 / 1225) loss: 1.335822
(Iteration 701 / 1225) loss: 1.172980
(Iteration 711 / 1225) loss: 1.381391
(Iteration 721 / 1225) loss: 1.334836
(Iteration 731 / 1225) loss: 1.435350
(Epoch 3 / 5) train acc: 0.532000; val_acc: 0.507000
(Iteration 741 / 1225) loss: 1.340843
(Iteration 751 / 1225) loss: 1.359592
(Iteration 761 / 1225) loss: 1.354142
(Iteration 771 / 1225) loss: 1.285244
(Iteration 781 / 1225) loss: 1.414680
(Iteration 791 / 1225) loss: 1.299610
(Iteration 801 / 1225) loss: 1.312392
(Iteration 811 / 1225) loss: 1.274710
(Iteration 821 / 1225) loss: 1.204101
(Iteration 831 / 1225) loss: 1.239950
(Iteration 841 / 1225) loss: 1.332283
(Iteration 851 / 1225) loss: 1.280676
(Iteration 861 / 1225) loss: 1.419126
(Iteration 871 / 1225) loss: 1.326449
(Iteration 881 / 1225) loss: 1.308919
(Iteration 891 / 1225) loss: 1.271191
(Iteration 901 / 1225) loss: 1.297089
(Iteration 911 / 1225) loss: 1.110233
(Iteration 921 / 1225) loss: 1.382717
(Iteration 931 / 1225) loss: 1.149547
(Iteration 941 / 1225) loss: 1.427123
(Iteration 951 / 1225) loss: 1.231657
(Iteration 961 / 1225) loss: 1.253389
(Iteration 971 / 1225) loss: 1.212336
(Epoch 4 / 5) train acc: 0.549000; val_acc: 0.514000
(Iteration 981 / 1225) loss: 1.355131
(Iteration 991 / 1225) loss: 1.243045
(Iteration 1001 / 1225) loss: 1.448963
(Iteration 1011 / 1225) loss: 1.373295
(Iteration 1021 / 1225) loss: 1.213074
(Iteration 1031 / 1225) loss: 1.329918
(Iteration 1041 / 1225) loss: 1.295005
(Iteration 1051 / 1225) loss: 1.302735
(Iteration 1061 / 1225) loss: 1.315817
(Iteration 1071 / 1225) loss: 1.353064
(Iteration 1081 / 1225) loss: 1.281699
(Iteration 1091 / 1225) loss: 1.188757
(Iteration 1101 / 1225) loss: 1.317366
(Iteration 1111 / 1225) loss: 1.262903
(Iteration 1121 / 1225) loss: 1.452370
(Iteration 1131 / 1225) loss: 1.326298
(Iteration 1141 / 1225) loss: 1.271687
(Iteration 1151 / 1225) loss: 1.113444
(Iteration 1161 / 1225) loss: 1.368831
(Iteration 1171 / 1225) loss: 1.222069
(Iteration 1181 / 1225) loss: 1.375179
(Iteration 1191 / 1225) loss: 1.137375
(Iteration 1201 / 1225) loss: 1.298180
(Iteration 1211 / 1225) loss: 1.221669
(Iteration 1221 / 1225) loss: 1.180235
(Epoch 5 / 5) train acc: 0.592000; val_acc: 0.505000
BEST VALID ACC: 0.514000

Test Your Model!

Run your best model on the validation and test sets. You should achieve at least 50% accuracy on the validation set.

y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print('Validation set accuracy: ', (y_val_pred == data['y_val']).mean())
print('Test set accuracy: ', (y_test_pred == data['y_test']).mean())
Validation set accuracy:  0.514
Test set accuracy:  0.51