Using X0 as a starting point

The main advantage of this initialization method over the previous, keyword-based one is flexibility: an X0 lets you choose a value for each parameter of each layer. For example, a keyword such as ‘KERNELS’ gives the kernels of every convolutional layer the same initial value, whereas an X0 lets you initialize each kernel individually.

The order and meaning of the variables in X0 are hardcoded in HyperNOMAD. Let’s use the following parameter file as an example:

DATASET MNIST
MAX_BB_EVAL 100

HYPER_DISPLAY 3

#              [ CONVOLUTION BLOCK               ]   [ FULLY CONNECTED BLOCK ]  [BATCH] [   OPTIMIZER BLOCK      ] [DROPOUT][ACTIVATION]
X0           (   2     6  5 1 0 1    16  5 1 0 1        2  128   84               128     3   0.1  0.9  0.0005 0     0.2        1        )
#LOWER_BOUND (   1     1  1 1 0 0     1  1 1 0 0        0    1    1                1       1   0    0     0    0       0        1        )
#UPPER_BOUND ( 100  1000 20 3 2 1  1000 20 3 2 1      500 1000 1000              400       4   1    1     1    1       1        3        )


DROPOUT_RATE 0.5 - - FIXED
KERNELS 10 - - FIXED
REMAINING_HYPERPARAMETERS VAR

Analysis of the example

First, ‘HYPER_DISPLAY’ sets the level of detail displayed about the steps of HyperNOMAD. The default value is 1, and the maximum is 3. Then, X0 is given as a list of values grouped, in order, into the convolutional block, the fully connected block, the batch size, the optimizer block, the dropout rate and the activation function.

The blocks for the batch size, the dropout rate and the activation function each contain a single value, which is that of the corresponding hyperparameter.

The first variable of the convolutional block indicates the number of convolutional layers: 2 in this example. Each convolutional layer has 5 associated variables: (number of output channels, kernel, stride, padding, do pooling). Therefore, the first convolutional layer has 6 output channels, a (5,5) kernel, a stride of 1, no padding and performs a pooling afterwards. The second layer is read the same way: 16 output channels, a (5,5) kernel, a stride of 1, no padding, followed by a pooling.
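To make the layout of the convolutional block concrete, here is a minimal Python sketch that reads the block from the example into one record per layer. The function `decode_conv_block` and the field names in `CONV_FIELDS` are hypothetical, chosen for illustration; they are not part of HyperNOMAD.

```python
# Hypothetical helper for illustration only; not part of HyperNOMAD.
# Field names follow the order given in the text:
# (number of output channels, kernel, stride, padding, do pooling).
CONV_FIELDS = ("out_channels", "kernel", "stride", "padding", "do_pooling")


def decode_conv_block(values):
    """Split a convolutional block into one dict per convolutional layer."""
    n_layers = int(values[0])
    expected = 1 + len(CONV_FIELDS) * n_layers
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    layers = []
    for i in range(n_layers):
        start = 1 + len(CONV_FIELDS) * i
        layers.append(dict(zip(CONV_FIELDS, values[start:start + len(CONV_FIELDS)])))
    return layers


# Convolutional block of the X0 in the example parameter file.
conv_block = [2, 6, 5, 1, 0, 1, 16, 5, 1, 0, 1]
layers = decode_conv_block(conv_block)
for index, layer in enumerate(layers, start=1):
    print(index, layer)
```

The length check reflects the rule stated above: one leading count, then exactly 5 values per layer.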

The first variable of the fully connected block corresponds to the number of fully connected layers. The following variables indicate the size of each fully connected layer. In this example, there are 2 fully connected layers, of sizes 128 and 84.

The first variable of the optimizer block indicates which optimizer is used; here it is Adagrad. The optimizer block always has 4 associated variables, whose meaning changes according to the optimizer chosen. For example, in the case of SGD, the first variable is the learning rate, followed by the momentum, the dampening and the weight decay.
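Because the number of values in X0 depends on the number of convolutional and fully connected layers, it can help to check a hand-written X0 before launching a run. The following Python sketch walks through the vector in the order described above and splits it into blocks. The function `split_x0` and the dictionary keys are hypothetical names used for illustration; this is not how HyperNOMAD itself parses the file.

```python
# Hypothetical helper for illustration only; not part of HyperNOMAD.
def split_x0(x0):
    """Split a flat X0 list into its blocks, following the documented order."""
    pos = 0

    def take(count):
        # Consume `count` values from x0, failing if the vector is too short.
        nonlocal pos
        chunk = x0[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("X0 is shorter than its layer counts imply")
        pos += count
        return chunk

    n_conv = int(take(1)[0])
    conv_layers = [take(5) for _ in range(n_conv)]  # 5 values per conv layer
    n_fc = int(take(1)[0])
    fc_sizes = take(n_fc)                           # one size per FC layer
    batch_size = take(1)[0]
    optimizer = take(1)[0]                          # which optimizer is used
    optimizer_params = take(4)                      # meaning depends on optimizer
    dropout = take(1)[0]
    activation = take(1)[0]
    if pos != len(x0):
        raise ValueError("X0 is longer than its layer counts imply")
    return {
        "conv_layers": conv_layers,
        "fc_sizes": fc_sizes,
        "batch_size": batch_size,
        "optimizer": optimizer,
        "optimizer_params": optimizer_params,
        "dropout": dropout,
        "activation": activation,
    }


# X0 from the example parameter file.
x0 = [2, 6, 5, 1, 0, 1, 16, 5, 1, 0, 1,
      2, 128, 84,
      128,
      3, 0.1, 0.9, 0.0005, 0,
      0.2,
      1]
blocks = split_x0(x0)
for name, value in blocks.items():
    print(f"{name}: {value}")
```

The optimizer parameters are left unnamed on purpose, since their meaning depends on the optimizer selected by the first value of the block.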

Advantage of using X0

In addition to initializing each hyperparameter on its own, we can also define specific lower and upper bounds for each individual hyperparameter, as shown by the LOWER_BOUND and UPPER_BOUND lines of the previous example (commented out with # there).
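Since the bounds are given position by position, they must line up with X0 entry by entry. The Python sketch below checks this for the values of the example. The function `out_of_bounds` is a hypothetical helper, and treating the bounds as inclusive is an assumption made for illustration; the text above does not say how HyperNOMAD handles a starting point that violates its bounds.

```python
# Hypothetical helper for illustration only; not part of HyperNOMAD.
# Assumption: bounds are inclusive on both sides.
def out_of_bounds(x0, lower, upper):
    """Return the positions of X0 whose value lies outside [lower, upper]."""
    if not (len(x0) == len(lower) == len(upper)):
        raise ValueError("X0, LOWER_BOUND and UPPER_BOUND must have the same length")
    return [
        i for i, (x, lo, hi) in enumerate(zip(x0, lower, upper))
        if not lo <= x <= hi
    ]


# Values copied from the example parameter file.
x0 = [2, 6, 5, 1, 0, 1, 16, 5, 1, 0, 1, 2, 128, 84,
      128, 3, 0.1, 0.9, 0.0005, 0, 0.2, 1]
lower = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
         1, 1, 0, 0, 0, 0, 0, 1]
upper = [100, 1000, 20, 3, 2, 1, 1000, 20, 3, 2, 1, 500, 1000, 1000,
         400, 4, 1, 1, 1, 1, 1, 3]

violations = out_of_bounds(x0, lower, upper)
print("positions out of bounds:", violations)
```

With the values of the example, every entry of X0 lies within its bounds, so the list of violations is empty.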

Note

Note that X0 takes precedence over the other keywords, so the tags KERNELS and DROPOUT_RATE in the example will not affect this starting point.