Bengali.AI Handwritten Grapheme Classification - Start Blog

Team: Zzz…

Members: Cheng Zeng, Zhi Wang, Peter Huang

Introduction

In this Kaggle competition, we aim to develop a convolutional neural network (CNN) model that classifies the three constituent components of handwritten Bengali characters: the grapheme root, the vowel diacritic, and the consonant diacritic. Optical recognition of Bengali is challenging because the alphabet has 11 vowels and 38 consonants, and there are also 18 potential diacritics. As a result, the number of graphemes (the smallest units in a written language) quickly adds up to more than 10,000 variations. This work by Team Zzz.. lives on GitHub.

Overview of the data sets

Parquet Files

The data sets are stored as parquet files, which contain the image IDs and the corresponding flattened 137 x 236 grayscale images. Each feature column corresponds to one pixel of the image, and the pixel values range from 0 to 255.

Example parquet data for image pixels
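
As a minimal sketch of how the flattened rows can be turned back into images, the snippet below reshapes each row to 137 x 236. The column name `image_id`, the helper `rows_to_images`, and the synthetic table are assumptions made for illustration, not taken from our code.

```python
import numpy as np
import pandas as pd

HEIGHT, WIDTH = 137, 236  # image size stated above


def rows_to_images(df, id_col="image_id"):
    """Split a parquet-style table into image IDs and 2-D uint8 images.

    `id_col` is an assumed column name; every other column is one pixel.
    """
    ids = df[id_col].to_numpy()
    pixels = df.drop(columns=[id_col]).to_numpy(dtype=np.uint8)
    images = pixels.reshape(-1, HEIGHT, WIDTH)
    return ids, images


# Synthetic stand-in for one parquet file: two images, one column per pixel.
rng = np.random.default_rng(0)
fake = pd.DataFrame(
    rng.integers(0, 256, size=(2, HEIGHT * WIDTH)),
    columns=[str(i) for i in range(HEIGHT * WIDTH)],
)
fake.insert(0, "image_id", ["Train_0", "Train_1"])

ids, images = rows_to_images(fake)
print(images.shape, images.dtype)
```

With a real file, the table would come from `pd.read_parquet(path)` instead of the synthetic frame; this requires a parquet engine such as pyarrow to be installed.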

Training set

The training set contains the image IDs from the parquet files together with the three components of the corresponding graphemes. There are 200,840 images in the training set. Note that the input is the handwritten image (the last column), while the output is the class of each of the three constituent components.

Example training data

Test Set

The test set lists each constituent component of every test image in its own row.

Example test data

Class Map

The class-map contains grapheme component types and labels, and it maps the class labels to the actual Bengali grapheme components.

Example class-map data

Submission Format

The sample submission file has two columns. The first is the row ID from the test set, which combines the test index number with the name of the grapheme component. The second is the predicted class for that component.

Example submission data
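
The two-column layout described above can be sketched as follows. The row ID pattern `Test_<index>_<component>`, the header names `row_id` and `target`, and the prediction values are assumptions for illustration; the sample submission file remains the authoritative reference for the exact format.

```python
import csv
import io

# Component names; the order within each image is an assumption.
COMPONENTS = ["consonant_diacritic", "grapheme_root", "vowel_diacritic"]


def submission_rows(predictions):
    """Expand one prediction dict per test image into (row_id, target) pairs."""
    rows = []
    for index, pred in enumerate(predictions):
        for component in COMPONENTS:
            rows.append((f"Test_{index}_{component}", int(pred[component])))
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with an assumed two-column header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row_id", "target"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up predictions for two test images.
predictions = [
    {"grapheme_root": 3, "vowel_diacritic": 0, "consonant_diacritic": 0},
    {"grapheme_root": 93, "vowel_diacritic": 2, "consonant_diacritic": 0},
]
rows = submission_rows(predictions)
csv_text = to_csv(rows)
print(csv_text)
```

Each test image therefore contributes three rows, one per component.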

Exploratory Data Analysis (EDA)

Pixel distribution

The original pixel distribution is shown below. It is later compared with the pixel distribution obtained after the images are cropped and resized.

Pixel distributions of training images
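
A pixel distribution of this kind can be computed with a normalized histogram. The sketch below is one possible way to do it; the bin count, the helper names, and the use of total variation distance to compare two distributions are our assumptions for illustration, not the method behind the figure.

```python
import numpy as np


def pixel_distribution(images, bins=32):
    """Return bin edges and the fraction of pixels that fall in each bin."""
    counts, edges = np.histogram(
        np.asarray(images).ravel(), bins=bins, range=(0, 256)
    )
    return edges, counts / counts.sum()


def distribution_distance(p, q):
    """Total variation distance between two normalized histograms.

    0 means identical distributions, 1 means no overlap at all.
    """
    return 0.5 * np.abs(p - q).sum()


# Synthetic stand-ins for a batch of images before and after processing.
rng = np.random.default_rng(0)
before = rng.integers(0, 256, size=(4, 137, 236))
after = before[:, 20:120, 30:200]  # a crop of the same images

_, p_before = pixel_distribution(before)
_, p_after = pixel_distribution(after)
print(distribution_distance(p_before, p_after))
```

A small distance suggests that a processing step has not changed the overall intensity profile much.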

Class frequency analysis

Top 20 grapheme roots

The top 20 grapheme roots and their percentages in the training set are shown in the figure below. These grapheme roots are approximately evenly distributed.

Frequency of top 20 grapheme roots

Vowel diacritics

The counts of vowel diacritics are shown in the figure below. The distribution is not balanced: most samples are concentrated in Classes 0, 1, 7, and 2.

Frequency of vowel diacritic

Consonant diacritics

For consonant diacritics, the distribution is not balanced either, with more than 60% being Class 0.

Frequency of consonant diacritic
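
The class frequencies in this section are simple counts per label, expressed as a share of the training set. A minimal sketch, using made-up toy labels rather than the real training data:

```python
from collections import Counter


def class_percentages(labels):
    """Map each class to its share of the samples, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.most_common()}


# Toy labels for illustration only; not taken from the training set.
toy_labels = [0, 0, 0, 1, 0, 2, 0, 1]
shares = class_percentages(toy_labels)
print(shares)  # {0: 62.5, 1: 25.0, 2: 12.5}
```

With the labels in a pandas Series, `Series.value_counts(normalize=True)` returns the same shares as fractions.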

Inspecting training images

Some randomly sampled images

Below are 25 example handwritten graphemes randomly chosen from the training images.

Randomly sampled example images

Writing variety

The figure below shows several images of the same grapheme. Note that the handwriting of the same grapheme varies a lot.

Sixteen images of the same grapheme. Grapheme root, vowel diacritic and consonant diacritic are indexed 72, 1, 1, respectively.

Data preprocessing

The images are standardized by cropping and resizing, using methods implemented in the OpenCV package. The method finds the contour of the figure and resizes the image based on the size of the contour. In the following, we show eight images after preprocessing and the corresponding pixel distribution. The images look normal after preprocessing, and the pixel distribution with preprocessing is close to the one without it, which suggests that the method is reliable.

Example handwritten grapheme after preprocessing

Pixel distribution after preprocessing
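
To illustrate the crop-and-resize idea, here is a simplified NumPy-only sketch. It swaps the OpenCV contour search for a plain bounding box around the dark pixels and uses nearest-neighbour resizing, so it is a stand-in rather than the method we used. The threshold, padding, and output size are hypothetical values, and the code assumes dark strokes on a light background.

```python
import numpy as np


def crop_to_ink(image, threshold=80, pad=2):
    """Crop a grayscale image to the bounding box of its dark (ink) pixels."""
    ink = image < threshold  # assumes dark strokes on light paper
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:  # blank image: nothing to crop
        return image
    top = max(rows[0] - pad, 0)
    bottom = min(rows[-1] + pad + 1, image.shape[0])
    left = max(cols[0] - pad, 0)
    right = min(cols[-1] + pad + 1, image.shape[1])
    return image[top:bottom, left:right]


def resize_nearest(image, size=64):
    """Nearest-neighbour resize to size x size (aspect ratio not preserved)."""
    row_idx = np.arange(size) * image.shape[0] // size
    col_idx = np.arange(size) * image.shape[1] // size
    return image[np.ix_(row_idx, col_idx)]


# Synthetic 137 x 236 image: white page with one dark rectangle as "ink".
page = np.full((137, 236), 255, dtype=np.uint8)
page[40:60, 100:150] = 0

cropped = crop_to_ink(page)
resized = resize_nearest(cropped)
print(cropped.shape, resized.shape)  # (24, 54) (64, 64)
```

In an OpenCV pipeline, `cv2.resize` would normally take the place of `resize_nearest` and provide smoother interpolation.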

Annotated references

The get_n, get_dummies, image_from_char, plot_acc and plot_loss functions originate from the kernel by Kaushal Shah. The `MultiOutputDataGenerator` class for multiple outputs is also from this kernel.
The image resize method is based on the kernel by Ashadullah Shawon.
The code for inference and result submission is heavily adapted from the kernel by Robin Smits.
The crop_resize, plot_count, display_image_from_data and display_writting_variety functions are from the kernel by Gabriel Preda.
We are also thankful for the many useful discussions on Kaggle, for example Things does not work and Things that might work.