Bengali.AI Handwritten Grapheme Classification - Start Blog
Team: Zzz…
Members: Cheng Zeng, Zhi Wang, Peter Huang
Introduction
In this Kaggle competition, we aim to develop a convolutional neural network (CNN) model to classify the three constituent components of Bengali handwritten characters: grapheme root, vowel diacritics, and consonant diacritics. Identifying characters by optical recognition is challenging because the Bengali alphabet has 11 vowels and 38 consonants, and there are 10 potential diacritics. As a result, a large number of graphemes (the smallest units in a written language) exist, quickly adding up to more than 10,000 different grapheme variations. This work by Team Zzz.. lives on GitHub.
Overview of the data sets
Parquet Files
The data sets are saved in the format of parquet files, which contain image IDs and the corresponding flattened 137 x 236 grayscale images. Each feature corresponds to a pixel of the image. The pixel values are between 0 and 255.
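As a minimal sketch of how such a file could be unpacked, the snippet below reshapes flattened rows back into 137 x 236 arrays. The column name `image_id`, the helper name `rows_to_images`, and the synthetic stand-in frame are assumptions for illustration; with the real data the frame would come from `pd.read_parquet` on one of the competition files.

```python
import numpy as np
import pandas as pd

HEIGHT, WIDTH = 137, 236  # image size stated above


def rows_to_images(df, id_col="image_id"):
    """Split a frame of flattened images into IDs and an (n, HEIGHT, WIDTH) uint8 array.

    `id_col` is an assumed column name; every other column is treated as one pixel.
    """
    ids = df[id_col].tolist()
    pixels = df.drop(columns=[id_col]).to_numpy(dtype=np.uint8)
    return ids, pixels.reshape(-1, HEIGHT, WIDTH)


# Synthetic stand-in for a parquet file (real use: df = pd.read_parquet(path)).
rng = np.random.default_rng(0)
fake = pd.DataFrame(rng.integers(0, 256, size=(3, HEIGHT * WIDTH)))
fake.columns = [str(c) for c in fake.columns]
fake.insert(0, "image_id", ["Train_0", "Train_1", "Train_2"])  # made-up IDs

ids, images = rows_to_images(fake)
print(images.shape, images.dtype)
```

Keeping the pixels as `uint8` matches the 0 to 255 value range and uses a quarter of the memory of `float32`, which matters when several hundred thousand images are held at once.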
Training set
The training set contains image IDs from the parquet files and the 3 components of the corresponding graphemes, and there are 200,840 images in the training set. Note that the input is the handwritten image (the last column), while the output should be the classes for the corresponding three constituent components.
Test Set
In the test set, the three constituent components of each image are listed in separate rows.
Class Map
The class-map contains grapheme component types and labels, and it maps the class labels to the actual Bengali grapheme components.
Submission Format
The sample submission file has two columns. The first is the row ID from the test set, which combines the test index number with the name of a grapheme component. The second is the prediction for that component.
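The layout described above could be produced along the following lines. This is only a sketch: the row-ID pattern `Test_<index>_<component>`, the column names `row_id` and `target`, the component identifiers, and the helper names are assumptions made for illustration, and the predictions are made up.

```python
import csv
import io

# Assumed identifiers for the three components described in this post.
COMPONENTS = ["consonant_diacritic", "grapheme_root", "vowel_diacritic"]


def build_submission_rows(predictions):
    """Turn one dict of predictions per test image into (row_id, target) pairs."""
    rows = []
    for index, per_image in enumerate(predictions):
        for component in COMPONENTS:
            rows.append((f"Test_{index}_{component}", int(per_image[component])))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with an assumed two-column header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row_id", "target"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up predictions for two test images.
preds = [
    {"grapheme_root": 3, "vowel_diacritic": 0, "consonant_diacritic": 0},
    {"grapheme_root": 93, "vowel_diacritic": 2, "consonant_diacritic": 0},
]
rows = build_submission_rows(preds)
print(to_csv(rows))
```

Each test image contributes three rows, one per component, so the file has three times as many data rows as there are test images.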
Exploratory Data Analysis (EDA)
Pixel distribution
The original pixel distribution is shown below; it is used later for comparison with the pixel distributions after image cropping and resizing.
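One way to make such a comparison quantitative is to bin the pixel intensities and measure how far apart the two normalized histograms are. The sketch below uses synthetic images and the total variation distance; the helper name `pixel_distribution`, the bin count, and the choice of distance are illustrative assumptions, not the procedure used for the figures in this post.

```python
import numpy as np


def pixel_distribution(images, bins=16):
    """Return the fraction of pixels in each of `bins` equal-width bins over [0, 256)."""
    counts, edges = np.histogram(np.asarray(images).ravel(), bins=bins, range=(0, 256))
    return counts / counts.sum(), edges


# Synthetic data: random "images" and a subsampled copy standing in for a resized set.
rng = np.random.default_rng(0)
before = rng.integers(0, 256, size=(4, 137, 236), dtype=np.uint8)
after = before[:, ::2, ::2]

p_before, edges = pixel_distribution(before)
p_after, _ = pixel_distribution(after)

# Total variation distance: 0 means identical distributions, 1 means disjoint ones.
tv = 0.5 * np.abs(p_before - p_after).sum()
print(round(float(tv), 4))
```

Normalizing the counts to fractions is what makes histograms of differently sized image sets comparable.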
Class frequency analysis
Top 20 grapheme roots
The top 20 grapheme roots and their percentages in the training set are shown in the figure below. These grapheme roots are approximately evenly distributed.
Vowel diacritics
The counts of vowel diacritics are shown in the figure below. The distribution is imbalanced, with most samples concentrated in Classes 0, 1, 7, and 2.
Consonant diacritics
For consonant diacritics, the distribution is not balanced either, with more than 60% being Class 0.
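Class shares like the ones discussed in this section can be computed directly from a label column. The following is a small sketch using only the standard library; the helper name `class_shares` is hypothetical and the labels are made up, so the numbers do not reflect the real training set.

```python
from collections import Counter


def class_shares(labels):
    """Return {class: fraction of samples}, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.most_common()}


# Made-up labels for illustration only; not the competition's real counts.
toy_consonant_labels = [0] * 7 + [2] * 2 + [5] * 1
shares = class_shares(toy_consonant_labels)
print(shares)
```

Looking at shares rather than raw counts makes it easy to spot a dominant class, which is useful when deciding whether class weights or resampling are worth trying.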
Inspecting training images
Some randomly sampled images
Below are 25 example handwritten graphemes randomly chosen from the training images.
Writing variety
The images below show the same grapheme. Note that the handwriting of the same grapheme varies a lot.
Data preprocessing
The images are standardized by cropping and resizing using methods implemented in the OpenCV package. The method finds the contour of the figure and resizes the image based on the size of the contour. In the following, we show eight images after preprocessing and the corresponding pixel distributions. The processed figures look normal, and the pixel distribution with preprocessing is close to the one without preprocessing, which suggests that the method is reliable.
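To illustrate the general idea of cropping to the written character and resizing, here is a NumPy-only sketch. It is not the OpenCV contour-based method used in this work: it swaps in a simple threshold-based bounding box and nearest-neighbour resizing. The function name `crop_resize_sketch` and the values of `size`, `ink_threshold`, and `pad` are assumptions, as is the convention of dark strokes on a light background.

```python
import numpy as np


def crop_resize_sketch(image, size=64, ink_threshold=128, pad=4):
    """Crop to the bounding box of dark ("ink") pixels, then resize to size x size.

    Assumes dark strokes on a light background; `size`, `ink_threshold`
    and `pad` are illustrative values, not the ones used in this work.
    """
    ink = image < ink_threshold
    if not ink.any():
        cropped = image  # blank image: nothing to crop
    else:
        rows = np.flatnonzero(ink.any(axis=1))
        cols = np.flatnonzero(ink.any(axis=0))
        top = max(rows[0] - pad, 0)
        bottom = min(rows[-1] + pad + 1, image.shape[0])
        left = max(cols[0] - pad, 0)
        right = min(cols[-1] + pad + 1, image.shape[1])
        cropped = image[top:bottom, left:right]
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    row_idx = np.arange(size) * cropped.shape[0] // size
    col_idx = np.arange(size) * cropped.shape[1] // size
    return cropped[np.ix_(row_idx, col_idx)]


# Synthetic 137 x 236 image: white background with a dark rectangle as the "grapheme".
img = np.full((137, 236), 255, dtype=np.uint8)
img[40:90, 100:160] = 0
out = crop_resize_sketch(img)
print(out.shape, out.dtype)
```

After cropping, the character fills most of the frame, so the share of dark pixels rises. A shift of that kind is what a before-and-after comparison of pixel distributions would reveal.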
Annotated references
- The `get_n`, `get_dummies`, `image_from_char`, `plot_acc` and `plot_loss` functions originate from the kernel by Kaushal Shah. The `MultiOutputDataGenerator` class for multiple outputs is also from this kernel.
- The image `resize` method is based on the kernel by Ashadullah Shawon.
- The code for inference and result submission is heavily adapted from the kernel by Robin Smits.
- The `crop_resize`, `plot_count`, `display_image_from_data` and `display_writting_variety` functions are from Gabriel Preda.
- We are also thankful for the many useful discussions on Kaggle, for example Things does not work and Things that might work.