Bengali.AI Handwritten Grapheme Classification - Midway Blog
Team: Zzz…
Members: Cheng Zeng, Zhi Wang, Peter Huang
Comparing different CNN architectures
We compared the performance of a basic CNN model, DenseNet121, DenseNet169, ResNet, and EfficientNet. The basic CNN and the DenseNets give reasonable training accuracy, while for ResNet and EfficientNet it is hard to find a good local minimum (training is not stable). We finally chose DenseNet121 since its training converges steadily and gives good accuracy. Note that although DenseNet169 is deeper and has more parameters, we found significant overfitting with this model.
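As a rough illustration of how such a comparison can be organized (a minimal sketch; the backbone registry and helper below are our own illustrative assumptions, not the exact experiment code), each candidate backbone is instantiated without its classification head so it can be dropped into the same multi-output model:
from tensorflow.keras.applications import DenseNet121, DenseNet169, ResNet50

# Candidate backbones (EfficientNetB0 can be added here on TensorFlow >= 2.3,
# where it is also available under tensorflow.keras.applications)
BACKBONES = {
    'densenet121': DenseNet121,
    'densenet169': DenseNet169,
    'resnet50': ResNet50,
}

def build_backbone(name, input_shape=(64, 64, 3)):
    # include_top=False removes the ImageNet classifier so the backbone can
    # feed the shared multi-output head used in every experiment
    return BACKBONES[name](weights='imagenet', include_top=False,
                           input_shape=input_shape)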
Overview of processed dataset
Before going into the details of the CNN model used in this competition, let us look at some basic information about the preprocessed dataset. Each image is now 64\(\times\)64\(\times\)1, and the full dataset has been split into training and validation sets.
IMG_SIZE=64
N_CHANNELS=1
print(f'Training images: {X_train.shape}')
print(f'Training labels root: {Y_train_root.shape}')
print(f'Training labels vowel: {Y_train_vowel.shape}')
print(f'Training labels consonants: {Y_train_consonant.shape}')
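For reference, here is a minimal sketch of how such a split and the one-hot labels could be produced (the array `images` and the labels dataframe `train` are assumptions standing in for the actual preprocessing pipeline, which is not shown here):
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Assumed inputs: `images` is an (N, 64, 64, 1) array of preprocessed images,
# `train` is the competition labels dataframe with one integer column per target
Y_root = to_categorical(train['grapheme_root'], num_classes=168)
Y_vowel = to_categorical(train['vowel_diacritic'], num_classes=11)
Y_consonant = to_categorical(train['consonant_diacritic'], num_classes=7)

X_train, X_val, Y_train_root, Y_val_root, Y_train_vowel, Y_val_vowel, \
    Y_train_consonant, Y_val_consonant = train_test_split(
        images, Y_root, Y_vowel, Y_consonant, test_size=0.1, random_state=42)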
DenseNet121 model
DenseNet consists of an initial feature layer (a convolutional layer) that captures low-level features from the images, several dense blocks, and transition layers between adjacent dense blocks.
Dense block
Within each dense layer, a 1\(\times\)1 convolutional layer (the bottleneck layer) is added to reduce computation, so that the second convolutional layer always receives a fixed input depth. Note also that the spatial size (width and height) of the feature maps stays the same through a dense layer, which makes it easy to stack any number of dense layers into a dense block. For example, DenseNet121 has four dense blocks containing 6, 12, 24, and 16 dense layers, respectively.
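A minimal sketch of one such dense layer and a dense block, written with the Keras functional API (the growth rate of 32 and the 4x bottleneck width follow the DenseNet paper, but this is an illustration rather than the exact library implementation):
from tensorflow.keras.layers import BatchNormalization, Activation, Conv2D, Concatenate

def dense_layer(x, growth_rate=32):
    # 1x1 bottleneck: reduces the accumulated depth to a fixed 4 * growth_rate
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(4 * growth_rate, (1, 1), use_bias=False)(y)
    # 3x3 convolution producing growth_rate new feature maps, same spatial size
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(growth_rate, (3, 3), padding='same', use_bias=False)(y)
    # concatenate with the input so later layers see all earlier feature maps
    return Concatenate()([x, y])

def dense_block(x, num_layers):
    for _ in range(num_layers):
        x = dense_layer(x)
    return x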
Transition layer
In a traditional CNN, the spatial size of each layer's output shrinks as the network abstracts higher-level features. In DenseNet, the transition layers take on this responsibility, while the dense blocks keep the spatial size fixed (and grow the depth through concatenation). Every transition layer contains a 1\(\times\)1 convolutional layer followed by a 2\(\times\)2 average pooling layer with a stride of 2, which halves the spatial size. Note that a transition layer receives the concatenated output of all the layers in the preceding dense block, so the 1\(\times\)1 convolution reduces the depth to a fixed number while the average pooling reduces the spatial size.
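A corresponding sketch of a transition layer (the compression factor of 0.5 follows the DenseNet paper; again an illustration, not the exact library code):
from tensorflow.keras.layers import BatchNormalization, Activation, Conv2D, AveragePooling2D
from tensorflow.keras import backend as K

def transition_layer(x, compression=0.5):
    # 1x1 convolution reduces the depth accumulated by the previous dense block
    depth = int(K.int_shape(x)[-1] * compression)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(depth, (1, 1), use_bias=False)(x)
    # 2x2 average pooling with stride 2 halves the width and height
    return AveragePooling2D((2, 2), strides=2)(x)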
Model construction
The model is constructed from the DenseNet121 implementation that ships with TensorFlow, using the Keras deep learning API. The code to build the model is shown below.
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import (Input, Conv2D, GlobalAveragePooling2D,
                                     BatchNormalization, Dropout, Dense)
from tensorflow.keras.models import Model

def build_densenet(SIZE, rate=0.3):
    densenet = DenseNet121(weights='imagenet', include_top=False)
    input = Input(shape=(SIZE, SIZE, 1))
    # Map the single-channel input to 3 channels so it matches the
    # ImageNet-pretrained DenseNet121 backbone
    x = Conv2D(3, (3, 3), padding='same')(input)
    x = densenet(x)
    x = GlobalAveragePooling2D()(x)
    x = BatchNormalization()(x)
    x = Dropout(rate)(x)
    x = Dense(1024, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(rate)(x)
    x = Dense(512, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(rate)(x)
    # multi-output heads: one softmax classifier per target
    grapheme_root = Dense(168, activation='softmax', name='root')(x)
    vowel_diacritic = Dense(11, activation='softmax', name='vowel')(x)
    consonant_diacritic = Dense(7, activation='softmax', name='consonant')(x)
    # model with a single input and three outputs
    model = Model(inputs=input, outputs=[grapheme_root, vowel_diacritic, consonant_diacritic])
    return model

model = build_densenet(SIZE=IMG_SIZE, rate=0.3)
Here we use a dropout rate of 0.3. Dropout is a regularization method in which a proportion of nodes in a layer is randomly ignored (their weights set to zero) for each training sample. This randomly drops part of the network and forces it to learn features in a distributed way, which improves generalization and reduces overfitting.
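As a toy illustration of what a rate of 0.3 means (a minimal sketch assuming TensorFlow 2 eager execution; not part of the training code):
import numpy as np
import tensorflow as tf

x = np.ones((1, 10), dtype='float32')
dropout = tf.keras.layers.Dropout(0.3)
# During training roughly 30% of the values are zeroed, and the survivors are
# scaled by 1 / (1 - 0.3) so the expected activation is unchanged.
print(dropout(x, training=True).numpy())
# At inference time dropout is a no-op.
print(dropout(x, training=False).numpy())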
Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
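Concretely, for each mini-batch \(B\), batch normalization standardizes an activation \(x_i\) using the batch mean \(\mu_B\) and variance \(\sigma_B^2\), then applies a learned scale \(\gamma\) and shift \(\beta\) (\(\epsilon\) is a small constant for numerical stability):
\[
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta
\]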
relu is short for rectified linear unit, an activation function defined as \(\max(0, x)\). The rectifier activation function is used to introduce non-linearity into the network.
A summary of the model can be printed by running model.summary().
Optimizer and learning schedule
We define a loss function to measure how poorly our model performs on images with known labels; it quantifies the error between the observed labels and the predicted ones. For multi-class categorical classification we use categorical_crossentropy.
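For a single sample with one-hot label \(y\) and predicted softmax probabilities \(\hat{y}\), the loss is \(-\sum_{c} y_c \log \hat{y}_c\): only the probability assigned to the true class contributes. A quick numerical check (the probabilities are made up for illustration):
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])       # one-hot label: class 1 is correct
y_pred = np.array([0.1, 0.7, 0.2])       # softmax output of the model
loss = -np.sum(y_true * np.log(y_pred))  # = -log(0.7)
print(loss)                              # ~0.357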
The Adam optimizer realizes the benefits of both AdaGrad and RMSProp. Instead of adapting the parameter learning rates based only on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance). Specifically, the algorithm calculates an exponential moving average of the gradient and of the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
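In Keras these decay rates are exposed directly on the optimizer. The values below are simply the library defaults (0.9 and 0.999), spelled out here only to make the connection to beta1 and beta2 explicit; our actual compile call (shown further down) leaves them at their defaults:
from tensorflow.keras.optimizers import Adam

# beta_1: decay rate of the moving average of the gradient (first moment)
# beta_2: decay rate of the moving average of the squared gradient (second moment)
optimizer = Adam(lr=0.00016, beta_1=0.9, beta_2=0.999)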
The metric function accuracy is used to evaluate the performance of our model. It is similar to the loss function, except that the metric results are not used when training the model, only for evaluation.
Code for setting the optimizer and learning rate is shown below.
from tensorflow.keras.optimizers import Adam

# Per-output loss weights: the grapheme root gets the largest weight
weights = {'root': 0.4, 'vowel': 0.3, 'consonant': 0.3}
model.compile(optimizer=Adam(lr=0.00016), loss='categorical_crossentropy',
              loss_weights=weights, metrics=['accuracy'])
To make the optimizer converge faster and get as close as possible to the global minimum of the loss function, we use an annealing schedule for the learning rate (LR). The LR is the step size by which the optimizer walks through the 'loss landscape'. The higher the LR, the bigger the steps and the quicker the convergence; however, with a high LR the sampling is very coarse and the optimizer may settle in a poor local minimum. It is better to decrease the learning rate during training in order to reach the global minimum of the loss function efficiently.
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

# Halve the learning rate if accuracy has not improved for 3 epochs
lr_scheduler = []
targets = ['root', 'vowel', 'consonant']
for target in targets:
    lr_scheduler.append(ReduceLROnPlateau(monitor=f'{target}_accuracy',
                                          patience=3, verbose=1, factor=0.5,
                                          min_lr=0.00001))

# Callback: save the best model
cp = ModelCheckpoint('saved_models/densenet121_128x128_1-rr.h5',
                     monitor='val_root_accuracy', save_best_only=True,
                     save_weights_only=False, mode='auto', verbose=0)
ModelCheckpoint is used to save the whole model (or just the weights) whenever the monitored metric improves.
Data augmentation
To reduce overfitting, we artificially expand our handwritten grapheme dataset. The idea is to alter the training data with small transformations that reproduce the variations occurring when someone writes a grapheme.
By applying just a couple of these transformations to our training data, we can easily double or triple the number of training examples and create a very robust model.
For the data augmentation strategy, we chose to:
- Randomly rotate some training images by 8 degrees
- Randomly zoom some training images by 15%
- Randomly shift images horizontally by 15% of the width
- Randomly shift images vertically by 15% of the height
The improvement is substantial:
- Without data augmentation, we obtained accuracies of 81.85%, 95.02%, and 94.95% for grapheme roots, vowel diacritics, and consonant diacritics, respectively.
- With data augmentation, we achieved accuracies of 90.07%, 96.71%, and 97.11%.
Code for image augmentation is shown below.
# Data augmentation for creating more training data
datagen = MultiOutputDataGenerator(
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=8,                     # randomly rotate images by up to 8 degrees
    zoom_range=0.15,                      # randomly zoom images by up to 15%
    width_shift_range=0.15,               # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.15,              # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # do not flip images horizontally
    vertical_flip=False)                  # do not flip images vertically
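Putting the pieces together, training could then look like the sketch below. MultiOutputDataGenerator is a custom generator (assumed here to wrap Keras' ImageDataGenerator so that flow accepts a dictionary of label arrays keyed by output name), the validation arrays follow the naming from the split sketch earlier, and the batch size and epoch count are illustrative; on older standalone Keras one would call fit_generator instead of fit.
batch_size = 128
epochs = 30

history = model.fit(
    datagen.flow(X_train,
                 {'root': Y_train_root,
                  'vowel': Y_train_vowel,
                  'consonant': Y_train_consonant},
                 batch_size=batch_size),
    steps_per_epoch=X_train.shape[0] // batch_size,
    epochs=epochs,
    validation_data=(X_val, [Y_val_root, Y_val_vowel, Y_val_consonant]),
    callbacks=lr_scheduler + [cp])  # LR annealing + best-model checkpoint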
In the final blog post, we will discuss the evaluation steps and methods to improve the model, as well as the leaderboard results.