This Kaggle competition aims to construct a model that can not only identify the sentiment of a tweet but also explain why. In other words, competitors are expected to figure out which word or phrase best supports the labeled sentiment.

With all of the tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will benefit a company's, or a person's, brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in times where decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition you need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

This blog post describes the background and motivation, the dataset, the evaluation metric, and an exploratory data analysis (EDA).

Data set

Files

  • train.csv - the training set

  • test.csv - the test set

  • sample_submission.csv - a sample submission file in the correct format

Data format

Each row contains the text of a tweet and a sentiment label. In the training set you are additionally provided with selected_text, a word or phrase drawn from the tweet that encapsulates the provided sentiment.

Columns

  • textID - unique ID for each piece of text

  • text - the text of the tweet

  • sentiment - the general sentiment of the tweet

  • selected_text - [train only] the text that supports the tweet’s sentiment
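For a quick look at what these columns contain, the training file can be loaded and inspected with pandas (a minimal sketch; the path assumes the standard Kaggle competition directory):

import pandas as pd

# Load the training data and inspect the columns listed above
train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
print(train.columns.tolist())
print(train[['text', 'selected_text', 'sentiment']].head())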

Submission format

We are attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.). The format is as follows:

<id>, "<word or phrase that supports the sentiment>"

For example:

2, "Very good"
5, "I am neutral about this"
3, "Awful"
8, "If you say so!"

Evaluation metrics

The metric in this competition is the word-level Jaccard score. A good description of Jaccard similarity for strings is here. The formula is expressed as:

\begin{equation} score = \frac{1}{n} \sum_{i=1}^{n} jaccard(gt_i, dt_i) \end{equation}

where:

\[\begin{align*} n &= \textrm{number of documents} \\ jaccard &= \textrm{the function implemented below} \\ gt_i &= \textrm{the } i\textrm{th ground truth} \\ dt_i &= \textrm{the } i\textrm{th prediction} \end{align*}\]

A Python implementation of the Jaccard score is as follows:

def jaccard(str1, str2):
    # Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lower-cased word sets
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    # If both strings are empty, the score is defined as 0.5
    if (len(a)==0) & (len(b)==0): return 0.5
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
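For example, applied to a made-up ground-truth span and prediction, and averaged over several pairs as in the formula above:

print(jaccard("Very good day!", "very good"))   # 2 shared words, 3 in the union -> 0.667

# Averaging over hypothetical ground-truth / prediction pairs gives the leaderboard score
ground_truths = ["very good", "awful", "if you say so!"]
predictions = ["very good day", "awful", "you say so"]
scores = [jaccard(gt, dt) for gt, dt in zip(ground_truths, predictions)]
print(sum(scores) / len(scores))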

EDA

The EDA figures in this section were retrieved from the Kaggle kernel by MR_KNOWNNOTHING.

Data balance

The class balance of the training set can be visualized with

import pandas as pd
from plotly import graph_objs as go

train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')

# Count the number of tweets in each sentiment class
temp = train.groupby('sentiment').count()['text'].reset_index().sort_values(by='text', ascending=False)

fig = go.Figure(go.Funnelarea(
    text=temp.sentiment,
    values=temp.text,
    title={"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
    ))
fig.show()
Sentiment-specific ratios of training data

Word Cloud

We use word clouds to show the most common words in the tweets for each sentiment. The code is shown below:

import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

### mask for the layout of the word cloud
d = '/kaggle/input/masks-for-wordclouds/'
pos_mask = np.array(Image.open(d + 'twitter_mask.png'))

### tweets of a single sentiment class, e.g. neutral
Neutral_sent = train[train['sentiment'] == 'neutral']
plot_wordcloud(Neutral_sent.text, mask=pos_mask, color='white', max_font_size=100, title_size=30, title="WordCloud of Neutral Tweets")

The plot_wordcloud function can be found in the Kaggle kernel by aashita here.
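For reference, a minimal version of such a function, compatible with the call above but not identical to the one in that kernel, could look like this:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def plot_wordcloud(text, mask=None, color='white', max_font_size=100, title_size=30, title=None):
    # Build the cloud from the concatenated tweets, dropping common stopwords
    wc = WordCloud(background_color=color, stopwords=set(STOPWORDS),
                   max_words=200, max_font_size=max_font_size, mask=mask)
    wc.generate(' '.join(text.astype(str)))
    plt.figure(figsize=(10, 10))
    plt.imshow(wc, interpolation='bilinear')
    plt.title(title, fontsize=title_size)
    plt.axis('off')
    plt.show()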

Word cloud of neutral tweets:

Word cloud of neutral tweets

Word cloud of positive tweets:

Word cloud of positive tweets

Word cloud of negative tweets:

Word cloud of negative tweets