
COMP9444 Neural Networks and Deep Learning

Solution:

Part 1:

Question 1:

NetLin is a model that computes a linear function of the pixels in the image, followed by log softmax.


Final confusion matrix:

[[769.   5.   8.  14.  31.  61.   2.  62.  30.  18.]
 [  7. 671. 108.  19.  26.  23.  58.  13.  26.  49.]
 [  7.  65. 689.  26.  27.  20.  47.  37.  45.  37.]
 [  5.  36.  60. 759.  15.  56.  14.  19.  26.  10.]
 [ 57.  54.  80.  21. 625.  20.  32.  35.  19.  57.]
 [  8.  26. 124.  16.  19. 726.  26.  10.  33.  12.]
 [  5.  22. 146.   9.  27.  24. 723.  20.  10.  14.]
 [ 17.  27.  27.  12.  87.  18.  54. 621.  89.  48.]
 [ 11.  38.  89.  41.   7.  31.  45.   7. 706.  25.]
 [  8.  50.  86.   3.  53.  31.  20.  31.  40. 678.]]

Test set: Average loss: 1.0090, Accuracy: 6967/10000 (70%)

Calculation for independent parameters:

The following code was used to extract the number of parameters from each neural network. The same code is used for the parameter calculations in all questions of this assignment; it was added to the respective main files. The prettytable module was used for displaying the results.

Code:

from prettytable import PrettyTable

def count_parameters(model):
    # Tabulate each named module's trainable parameters and their total.
    table = PrettyTable(["Modules", "Parameters"])
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad:
            continue
        param = parameter.numel()
        table.add_row([name, param])
        total_params += param
    print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params

Output (For Linear network Q1):

+---------------+------------+
|    Modules    | Parameters |
+---------------+------------+
| linear.weight |    7840    |
|  linear.bias  |     10     |
+---------------+------------+

Total Trainable Params: 7850
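For reference, a minimal sketch of a NetLin-style module consistent with the counts above (assuming 28×28 grayscale inputs and 10 classes; an illustration, not the submitted kuzu.py code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetLin(nn.Module):
    # Linear map from the 784 flattened pixels to 10 class scores,
    # followed by log softmax. The layer name matches the table above;
    # the rest of the structure is an assumption.
    def __init__(self):
        super(NetLin, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)

    def forward(self, x):
        x = x.view(x.shape[0], -1)  # flatten each 28x28 image to 784 values
        return F.log_softmax(self.linear(x), dim=1)
```

With these shapes the parameter count matches the table: 784 × 10 = 7840 weights plus 10 biases.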

Question 2:

Fully connected 2-layer network NetFull (i.e., one hidden layer, plus the output layer), using tanh at the hidden nodes and log softmax at the output node.

The table below shows accuracy with respect to number of hidden nodes.

Number of hidden nodes   Accuracy (%)
10                       68
20                       75
30                       78
40                       80
50                       81
60                       81.93
70                       82.35
80                       83.31
90                       83.87
100                      83.99
110                      83.76
120                      83.97
130                      84.94
140                      84.20
150                      84.35
160                      84.92
170                      84.90
180                      84.64
190                      84.79
200                      84.81
210                      84.64
220                      84.50
230                      84.77
240                      84.79
250                      85.14
260                      84.79
270                      84.71
280                      84.81
290                      84.69
300                      84.77

I trained the 2-layer fully connected neural network with 30 different hidden-node counts, from 10 to 300. From 130 hidden nodes up to 300, the network consistently achieved accuracy between 84% and 85%. The network with 250 hidden nodes performed best, with an accuracy of 85.14%. The following confusion matrix and independent-parameter calculation are for the network with 250 hidden nodes.

[[857.   3.   1.   6.  28.  30.   5.  37.  26.   7.]
 [  5. 824.  33.   1.  15.  10.  63.   5.  17.  27.]
 [  8.  14. 842.  36.  11.  19.  24.  14.  16.  16.]
 [  3.   8.  25. 926.   2.  13.   7.   2.   6.   8.]
 [ 37.  29.  15.   4. 818.   8.  32.  17.  25.  15.]
 [  7.  11.  74.  10.  12. 834.  29.   1.  15.   7.]
 [  3.  12.  45.  10.  14.   5. 900.   5.   1.   5.]
 [ 19.  13.  22.   3.  23.  10.  28. 829.  20.  33.]
 [ 12.  24.  30.  42.   3.   7.  28.   3. 842.   9.]
 [  5.  21.  50.   5.  29.   5.  17.  14.  12. 842.]]

Test set: Average loss: 0.4938, Accuracy: 8514/10000 (85%)

+----------------+------------+
|    Modules     | Parameters |
+----------------+------------+
| linear0.weight |   196000   |
|  linear0.bias  |    250     |
| linear1.weight |    2500    |
|  linear1.bias  |     10     |
+----------------+------------+

Total Trainable Params: 198760
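The counts above are consistent with the following sketch of a NetFull-style module with 250 hidden nodes (layer names taken from the table; the rest is an illustrative assumption, not the submitted file):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetFull(nn.Module):
    # Two-layer fully connected network: one tanh hidden layer,
    # then a log-softmax output layer.
    def __init__(self, hid=250):
        super(NetFull, self).__init__()
        self.linear0 = nn.Linear(28 * 28, hid)
        self.linear1 = nn.Linear(hid, 10)

    def forward(self, x):
        x = x.view(x.shape[0], -1)          # flatten to 784 features
        h = torch.tanh(self.linear0(x))     # hidden activations
        return F.log_softmax(self.linear1(h), dim=1)
```

With hid = 250 this gives 784 × 250 + 250 + 250 × 10 + 10 = 198760 parameters, matching the table.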

Question 3:

NetConv is a convolutional network with two convolutional layers plus one fully connected layer, all using the relu activation function, followed by the output layer, which uses log softmax.

[[938.   4.   1.   2.  29.   7.   1.  12.   3.   3.]
 [  2. 922.   8.   0.  10.   1.  38.   5.   6.   8.]
 [ 10.   5. 875.  44.   7.   5.  25.  14.   4.  11.]
 [  2.   0.  12. 974.   0.   2.   6.   2.   2.   0.]
 [ 20.   9.   3.   5. 926.   4.  11.   5.  12.   5.]
 [  4.  15.  38.   8.   4. 909.  14.   2.   4.   2.]
 [  2.   7.  21.   0.   5.   1. 960.   2.   1.   1.]
 [  4.   4.   4.   2.   8.   1.  11. 950.   3.  13.]
 [  3.  17.   7.   4.   4.   3.   2.   2. 957.   1.]
 [  4.   4.   7.   3.  11.   0.   6.   5.   3. 957.]]

Test set: Average loss: 2.2122, Accuracy: 9368/10000 (94%)

+--------------+------------+
|   Modules    | Parameters |
+--------------+------------+
| conv1.weight |    1600    |
|  conv1.bias  |     64     |
| conv2.weight |   204800   |
|  conv2.bias  |    128     |
|  fc1.weight  |  1048576   |
|   fc1.bias   |    512     |
|  fc2.weight  |    5120    |
|   fc2.bias   |     10     |
+--------------+------------+

Total Trainable Params: 1260810
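A NetConv-style sketch consistent with the table above (kernel sizes and channel counts are inferred from the weight shapes: 64·1·5·5 = 1600, 128·64·5·5 = 204800, and 512 × 2048 = 1048576 implies a 4×4 feature map after two unpadded 5×5 convolutions with 2×2 pooling; this is an illustration, not the submitted file):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetConv(nn.Module):
    # Two relu convolutional layers with max pooling, one relu fully
    # connected layer, then a log-softmax output layer.
    def __init__(self):
        super(NetConv, self).__init__()
        self.conv1 = nn.Conv2d(1, 64, 5)    # 28x28 -> 24x24, pool -> 12x12
        self.conv2 = nn.Conv2d(64, 128, 5)  # 12x12 -> 8x8,  pool -> 4x4
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.shape[0], -1)          # flatten 128*4*4 = 2048 features
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```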

Question 4:

  1. Compare the results of the three models.

Three different neural networks were implemented to recognize handwritten Hiragana symbols: 1) a single-layer linear neural network, NetLin; 2) a two-layer fully connected neural network, NetFull; and 3) a 2-layer convolutional neural network, NetConv.

Of these three networks, NetConv performs the best, with an accuracy of 94%. It is followed by NetFull, with an accuracy of 85%. Lastly, NetLin performed poorly compared to the other two, with an accuracy of only 70%.

Analysing the results, it can be observed that as the complexity of the network increases (number of layers, number of hidden nodes, number of parameters), the final accuracy of the model also increases. That said, one should always be careful not to overfit the model.

In this case, NetConv achieves the highest accuracy because convolutional neural networks are specifically designed for image classification: they combine multiple layers such as convolutional filters, max pooling, and padding to extract different features of the images. These features help convolutional neural networks perform well at classifying images.

  • Number of independent parameters in each of the three models.

Model     Number of Parameters
NetLin    7850
NetFull   198760
NetConv   1260810

The number of parameters increases sharply from NetLin to NetConv. The simplest model of the three, NetLin, has the fewest parameters at 7850; since the network has only one layer, the parameter count is low.

In NetFull the total number of parameters is 198760. This is because it is a 2-layer fully connected network in which every node of one layer is connected to every node of the next. The first layer has 196000 weights plus 250 biases, while the second layer has 2500 weights plus 10 biases.

NetConv has multiple layers: 2 convolutional layers and 2 fully connected layers, with relu and max-pooling operations inside the convolutional stages. Because of this more complex structure, this network has the greatest number of parameters. The first convolutional layer has 1600 weights and 64 biases, and the second has 204800 weights and 128 biases. Connecting the second convolutional layer to the first fully connected layer yields 1048576 weights and 512 biases, and passing this on to the second fully connected layer adds a further 5120 weights and 10 biases, taking the grand total to 1260810.
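The NetConv layer totals can be double-checked with a few lines of arithmetic (channel and kernel sizes inferred from the parameter table above):

```python
# Each layer contributes (weights + biases); sizes follow the table.
conv1 = 64 * 1 * 5 * 5 + 64          # 1600 + 64
conv2 = 128 * 64 * 5 * 5 + 128       # 204800 + 128
fc1 = 512 * (128 * 4 * 4) + 512      # 2048 flattened features -> 512 units
fc2 = 10 * 512 + 10                  # 512 units -> 10 classes
total = conv1 + conv2 + fc1 + fc2
print(total)  # 1260810
```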

Besides accuracy, the running time of the models also grows with the number of parameters: NetLin took the shortest time to compute its results while NetConv took the longest on the same KMNIST dataset.

  • Characters mistaken and why?

None of the models achieved 100% accuracy, which means all of them misread some characters from the same KMNIST dataset, with NetLin misreading the most, followed by NetFull, while NetConv predicted the most characters correctly.

A few characters are mistaken by all three models; for example, class 0 is predicted as 4, and class 6 is predicted as 1. This is because, when these characters are drawn, they share some features (strokes, loops, curls) with one another.

Analysing the confusion matrix of NetLin, we can observe many values greater than 50; this is where the major errors occur. In the confusion matrix below, characters 5 and 6 are frequently mistaken for 2.

      0    1    2    3    4    5    6    7    8    9
0 [[769.   5.   8.  14.  31.  61.   2.  62.  30.  18.]
1  [  7. 671. 108.  19.  26.  23.  58.  13.  26.  49.]
2  [  7.  65. 689.  26.  27.  20.  47.  37.  45.  37.]
3  [  5.  36.  60. 759.  15.  56.  14.  19.  26.  10.]
4  [ 57.  54.  80.  21. 625.  20.  32.  35.  19.  57.]
5  [  8.  26. 124.  16.  19. 726.  26.  10.  33.  12.]
6  [  5.  22. 146.   9.  27.  24. 723.  20.  10.  14.]
7  [ 17.  27.  27.  12.  87.  18.  54. 621.  89.  48.]
8  [ 11.  38.  89.  41.   7.  31.  45.   7. 706.  25.]
9  [  8.  50.  86.   3.  53.  31.  20.  31.  40. 678.]]

The following is an explanation for these mispredictions:

Looking at the images of these characters, it can be seen that characters 2, 5, and 6 have some features in common. While there are also features that differ, NetLin, due to its simplicity, is unable to pick up on those distinguishing features in some cases.

If we look at the corresponding entries in the confusion matrix of NetConv, the numbers are much lower than for NetLin. This is because NetConv has convolutional filters specialised for extracting features from images.

      0    1    2    3    4    5    6    7    8    9
0 [[938.   4.   1.   2.  29.   7.   1.  12.   3.   3.]
1  [  2. 922.   8.   0.  10.   1.  38.   5.   6.   8.]
2  [ 10.   5. 875.  44.   7.   5.  25.  14.   4.  11.]
3  [  2.   0.  12. 974.   0.   2.   6.   2.   2.   0.]
4  [ 20.   9.   3.   5. 926.   4.  11.   5.  12.   5.]
5  [  4.  15.  38.   8.   4. 909.  14.   2.   4.   2.]
6  [  2.   7.  21.   0.   5.   1. 960.   2.   1.   1.]
7  [  4.   4.   4.   2.   8.   1.  11. 950.   3.  13.]
8  [  3.  17.   7.   4.   4.   3.   2.   2. 957.   1.]
9  [  4.   4.   7.   3.  11.   0.   6.   5.   3. 957.]]

There are still some mispredictions, but this is understandable, as the images in the dataset have a very small resolution (28×28), which makes them blurry. Also, the images contain handwritten characters, so some human error is to be expected.

Part 2:

Question 1:

Code for a Pytorch Module called Full2Net which implements a 3-layer fully connected neural network with two hidden layers using tanh activation, followed by the output layer with one node and sigmoid activation.

import torch
import torch.nn as nn

class Full2Net(torch.nn.Module):

    def __init__(self, hid):
        super(Full2Net, self).__init__()
        self.fc_layer_1 = nn.Linear(2, hid)
        self.fc_layer_2 = nn.Linear(hid, hid)
        self.fc_layer_3 = nn.Linear(hid, 1)

    def forward(self, input):
        input = input.view(input.shape[0], -1)
        hidden_s1 = self.fc_layer_1(input)
        self.hid1 = torch.tanh(hidden_s1)
        hidden_s2 = self.fc_layer_2(self.hid1)
        self.hid2 = torch.tanh(hidden_s2)
        hidden_s3 = self.fc_layer_3(self.hid2)
        self.output = torch.sigmoid(hidden_s3)
        return self.output

Note: Same code is available in frac.py file.

Question 2:

Minimum number of hidden nodes: 12

Calculating total number of independent parameters in the network:

Input layer shape: (260, 2)

Layer                # Parameters
fc_layer_1.weight    24
fc_layer_1.bias      12
fc_layer_2.weight    144
fc_layer_2.bias      12
fc_layer_3.weight    12
fc_layer_3.bias      1

Total number of independent parameters: 205
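The total follows directly from the nn.Linear shapes; as a quick check:

```python
# Each Linear(in, out) contributes in*out weights plus out biases.
hid = 12
fc1 = 2 * hid + hid        # 24 + 12
fc2 = hid * hid + hid      # 144 + 12
fc3 = hid * 1 + 1          # 12 + 1
total = fc1 + fc2 + fc3
print(total)  # 205
```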

Graph Output:

Question 3:

Code for a Pytorch Module called Full3Net which implements a 4-layer network, the same as Full2Net but with an additional hidden layer.

import torch
import torch.nn as nn

class Full3Net(torch.nn.Module):

    def __init__(self, hid):
        super(Full3Net, self).__init__()
        self.fc_layer_1 = nn.Linear(2, hid)
        self.fc_layer_2 = nn.Linear(hid, hid)
        self.fc_layer_3 = nn.Linear(hid, hid)
        self.fc_layer_4 = nn.Linear(hid, 1)

    def forward(self, input):
        input = input.view(input.shape[0], -1)
        hidden_s1 = self.fc_layer_1(input)
        self.hid1 = torch.tanh(hidden_s1)
        hidden_s2 = self.fc_layer_2(self.hid1)
        self.hid2 = torch.tanh(hidden_s2)
        hidden_s3 = self.fc_layer_3(self.hid2)
        self.hid3 = torch.tanh(hidden_s3)
        hidden_s4 = self.fc_layer_4(self.hid3)
        self.output = torch.sigmoid(hidden_s4)
        return self.output

Note: Same code is available in frac.py file.

Question 4:

Minimum number of hidden nodes: 12

Calculating total number of independent parameters in the network:

Input layer shape: (260, 2)

Layer                # Parameters
fc_layer_1.weight    24
fc_layer_1.bias      12
fc_layer_2.weight    144
fc_layer_2.bias      12
fc_layer_3.weight    144
fc_layer_3.bias      12
fc_layer_4.weight    12
fc_layer_4.bias      1

Total number of independent parameters: 361

Final Graph Output:

Question 5:

Code for a Pytorch Module called DenseNet which implements a 3-layer densely connected neural network.

import torch
import torch.nn as nn

class DenseNet(torch.nn.Module):

    def __init__(self, num_hid):
        super(DenseNet, self).__init__()
        self.layer_1 = nn.Linear(2, num_hid)
        self.layer_2 = nn.Linear(num_hid + 2, num_hid)
        self.layer_3 = nn.Linear(num_hid + num_hid + 2, 1)

    def forward(self, input):
        input = input.view(input.shape[0], -1)
        hidden_s1 = self.layer_1(input)
        self.hid1 = torch.tanh(hidden_s1)
        # Each layer receives the outputs of all previous layers,
        # concatenated with the raw input (dense connectivity).
        temp_1 = torch.cat((self.hid1, input), 1)
        hidden_s2 = self.layer_2(temp_1)
        self.hid2 = torch.tanh(hidden_s2)
        temp_2 = torch.cat((self.hid1, self.hid2, input), 1)
        hidden_s3 = self.layer_3(temp_2)
        self.output = torch.sigmoid(hidden_s3)
        return self.output

Question 6:

Minimum number of hidden nodes: 14

Calculating total number of independent parameters in the network:

Input layer shape: (260, 2)

Layer             # Parameters
layer_1.weight    28
layer_1.bias      14
layer_2.weight    224
layer_2.bias      14
layer_3.weight    30
layer_3.bias      1

Total number of independent parameters: 311
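As with the fully connected networks, the total can be reproduced from the layer shapes; note how the dense connections enlarge the input of each later layer:

```python
# DenseNet parameter totals for num_hid = 14: layers 2 and 3 also see
# the concatenated outputs of earlier layers plus the 2 raw inputs.
num_hid = 14
layer_1 = 2 * num_hid + num_hid                 # 28 weights + 14 biases
layer_2 = (num_hid + 2) * num_hid + num_hid     # 224 weights + 14 biases
layer_3 = (2 * num_hid + 2) * 1 + 1             # 30 weights + 1 bias
total = layer_1 + layer_2 + layer_3
print(total)  # 311
```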

Final Graph Output:

Question 7:

  1. The total number of independent parameters in each of the three networks (using the number of hidden nodes determined by your experiments) and the approximate number of epochs required to train each type of network.

The number of hidden nodes determined experimentally is 12 for both Full2Net and Full3Net, and 14 for DenseNet. The total number of trainable parameters is 205 in Full2Net, 361 in Full3Net, and 311 in DenseNet. Full2Net and Full3Net were trained for the full 200000 epochs, whereas DenseNet was trained for 135200 epochs. Owing to its skip connections, DenseNet contains fewer parameters than Full3Net; its model complexity is lower as a result, so fewer epochs are required to train it.

  • A qualitative description of the functions computed by the different layers of Full3Net and DenseNet.

In both networks, the first hidden layer learns simple, linearly separable low-dimensional features, the second hidden layer learns more complex features that we might informally call “convex,” and the final (output) layer learns the non-linear target function. In DenseNet, each layer receives additional inputs from all preceding layers and passes its own feature maps on to all following layers; each layer therefore receives the “collective knowledge” of all preceding layers, resulting in strong gradient flow and lower-complexity features.

  • The qualitative difference, if any, between the overall function (i.e., output as a function of input) computed by the three networks.

Because a deeper network can learn more non-linear functions, the final function computed by Full2Net is more linear than the one computed by Full3Net. DenseNet produces an even better non-linear output function, generated from more diversified features (since each layer in DenseNet receives the outputs of all preceding layers as input, it tends to learn richer patterns), while remaining parameter- and computation-efficient.

Part 4:

Question 1:

Simple Recurrent Network (SRN) on the Reber Grammar prediction task:
Output:

Question 3:

This is how the a^n b^n prediction task is achieved by the network, based on the figure generated in Question 2.

Strings from the context-free language a^n b^n are used here to identify which symbol sequences are valid. Any system that recognizes such strings (like a recognizer of Reber grammar strings) must have some form of memory so it can make decisions based on both the current input and prior inputs. In the one-step look-ahead prediction task, the symbols of the input sequence are presented to the network one at a time, and the network output is regarded as a prediction of the next symbol. Once the network has learned the prediction task effectively, the input and output sequences should be identical for the predictable parts of the sequence. Therefore, if the previous output is fed back to the input units, a fully trained network could in principle predict the whole sequence.
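To make the one-step look-ahead task concrete, here is a hypothetical generator for a^n b^n strings (an illustration only, not the assignment's data code):

```python
import random

def anbn_string(max_n=8):
    # Sample one string from the a^n b^n language, with n chosen at random.
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

seq = anbn_string()
# In one-step look-ahead prediction, the target at position i is seq[i+1].
# The position of the first b is unpredictable (n is unknown in advance),
# but every symbol after it is fully determined by counting.
targets = seq[1:]
```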

Question 4:

Output:

Question 5:

This is how the a^n b^n c^n prediction task is achieved by the network, based on the figure generated in Question 4:

The string a^n b^n c^n comes from a mildly context-sensitive language. The input sequence symbols are fed one at a time into the three-dimensional input layer of the neural network, and the training sequence is formed by concatenating strings of varying depths. Each output unit is assigned to one of the 3 symbols a, b, or c, and the predicted symbol is the one whose unit has the maximum activation. Because the network cannot determine the depth n at the start of a string, it cannot forecast when the first b will appear, and hence cannot know how many will follow. As a result, the a^n b^n c^n task does not require forecasting the entire string, but only a portion of it: once the network has processed the first b, it must predict n-1 more b's, followed by n c's, and then an arbitrary (but non-zero) number of a's (which form the beginning of the subsequent string). Until the first b appears, the network only has to learn that each a is followed by either another a or the first b, and not by other symbols.

Question 6:

LSTM network to predict the Embedded Reber Grammar:

A good neural system must be able to remember a symbol seen early in the sequence, regardless of the length of the input sequence, and compare it with the second-to-last symbol seen, in order to distinguish acceptable strings and tell them apart from invalid strings. Long Short-Term Memory (LSTM) uses a mix of forget, input, and output gates to learn these long-range relationships. The LSTM has a context layer that is separate from the hidden layer but has the same number of units. One of the context units is tasked with remembering the initial T or P, and this information is retained by sufficiently high values for the forget gate and sufficiently low values for the input and output gates, respectively.
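As an illustration of this setup, a minimal next-symbol predictor can be sketched around torch.nn.LSTM (names and sizes here are assumptions; the 4 hidden units mirror the 4 hidden activations printed in the output below):

```python
import torch
import torch.nn as nn

class SymbolLSTM(nn.Module):
    # Hypothetical sketch: one-hot inputs for the 7 Reber symbols
    # (B T S X P V E) in, next-symbol probabilities out.
    def __init__(self, n_symbols=7, hid=4):
        super(SymbolLSTM, self).__init__()
        self.lstm = nn.LSTM(n_symbols, hid, batch_first=True)
        self.out = nn.Linear(hid, n_symbols)

    def forward(self, x):
        h, _ = self.lstm(x)                       # (batch, seq_len, hid)
        return torch.softmax(self.out(h), dim=2)  # next-symbol probabilities

model = SymbolLSTM()
probs = model(torch.zeros(1, 10, 7))  # one length-10 one-hot sequence
```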

The final epoch outputs are listed below:

-----

state =  0 1 10 11 12 12 13 14 17 18

symbol= BPBTSXSEPE

label = 0401232646

true probabilities:

     B    T    S    X    P    V    E

1 [ 0.   0.5  0.   0.   0.5  0.   0. ]

10 [ 1.  0.  0.  0.  0.  0.  0.]

11 [ 0.   0.5  0.   0.   0.5  0.   0. ]

12 [ 0.   0.   0.5  0.5  0.   0.   0. ]

12 [ 0.   0.   0.5  0.5  0.   0.   0. ]

13 [ 0.   0.   0.5  0.5  0.   0.   0. ]

14 [ 0.  0.  0.  0.  0.  0.  1.]

17 [ 0.  0.  0.  0.  1.  0.  0.]

18 [ 0.  0.  0.  0.  0.  0.  1.]

hidden activations and output probabilities [BTSXPVE]:

1 [-0.56  0.58  0.34  0.76] [ 0.    0.51  0.    0.    0.49  0.    0.  ]

10 [-0.65  0.89  0.72 -0.64] [ 1.  0.  0.  0.  0.  0.  0.]

11 [ 0.41  0.05  0.47  0.75] [ 0.    0.51  0.    0.    0.47  0.02  0.  ]

12 [-0.41 -0.66  0.47 -0.74] [ 0.    0.    0.48  0.52  0.    0.    0.  ]

12 [-0.24 -0.55  0.19 -0.79] [ 0.    0.    0.47  0.52  0.    0.    0.  ]

13 [ 0.47 -0.38  0.88 -0.83] [ 0.    0.01  0.62  0.36  0.    0.    0.01]

14 [ 0.89 -0.78  0.17  0.1 ] [ 0.  0.  0.  0.  0.  0.  1.]

17 [ 0.29 -0.31 -0.05  0.76] [ 0.    0.04  0.    0.    0.94  0.01  0.01]

18 [ 0.67 -0.95  0.55 -0.62] [ 0.    0.    0.    0.    0.    0.    0.99]

epoch: 50000

error: 0.0013

final: 0.0007

The error decreases with each iteration, and the LSTM is able to converge.

Generally, recurrent neural networks (RNNs) have a long-term dependency problem that LSTM networks were created to solve. LSTMs differ from standard feedforward neural networks in that they have feedback connections. This trait allows LSTMs to process complete data sequences (e.g., time series) without having to handle each point in the sequence separately; instead, they preserve important knowledge about prior data in the sequence to aid in the processing of incoming data points. As a result, LSTMs excel at processing sequential data such as text, audio, and time series in general.
