Why Word Vectors?
As the volume of textual data generated on the web and stored on servers around the world keeps growing exponentially, the need to analyze this kind of data becomes ever more crucial to understanding trends, people's behavior, purchase intentions, and more generally the world we live in.
To increase our ability to analyse relationships across words, sentences, and documents, words can be transformed into vectors. In physics, vectors are arrows; in computer science and statistics, vectors are columns of values, like a single numeric series in a dataframe. A word vector is literally a way to represent a word as a vector of numbers, allowing us to use machine learning to better comprehend the structure and content of language by organizing concepts and learning the relationships among them.
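To make the idea concrete, here is a minimal sketch (with made-up, hand-picked 3-dimensional vectors purely for illustration; real embeddings typically have hundreds of dimensions) of how words can be represented as arrays of numbers and compared with cosine similarity:
import numpy as np
# Hypothetical 3-dimensional vectors, purely for illustration;
# a trained model would learn vectors with hundreds of dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.1, 0.4]),
    "queen": np.array([0.85, 0.15, 0.45]),
    "apple": np.array([0.1, 0.8, 0.3]),
}
def cosine_similarity(u, v):
    # cosine of the angle between two vectors: values close to 1 mean a similar direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"])) # high: related words
print(cosine_similarity(toy_vectors["king"], toy_vectors["apple"])) # lower: unrelated words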
In this tutorial, I am going to describe the steps to follow to train a word2vec model with your own text-based dataset. Word2vec was created and published by Tomas Mikolov and his research team at Google in 2013. This tool provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of words (also called word embeddings) using a shallow, two-layer neural network. It takes in observations (sentences, tweets, books), first constructs a vocabulary from the training text data, and then learns vector representations of words. The resulting word vectors can be used as features in many natural language processing and machine learning applications.
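To give a flavour of what the skip-gram architecture actually trains on, here is a small illustrative sketch (my own toy function, not word2vec's real implementation) that generates (target, context) pairs from a tokenized sentence for a given window size:
# Illustrative only: build (target, context) training pairs the way the
# skip-gram architecture conceptually does. Not the real word2vec code.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["homer", "eats", "a", "donut"], window=1))
# [('homer', 'eats'), ('eats', 'homer'), ('eats', 'a'), ('a', 'eats'), ('a', 'donut'), ('donut', 'a')]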
Here is a quick summary of the tutorial:
- Import data and libraries
- Pre-process text
- Transform text into vectors with word2vec and train the model
- Explore the model
- Conclusion
My text dataset:
On Kaggle, I found a dataset containing script lines for approximately 600 Simpsons episodes, dating back to 1989.
The .csv file is about 9MB and includes more than 150K lines of dialogue between the different characters of the series. The data is organized in 2 columns, each row showing the character's name and the text actually spoken. A-MA-ZING!
File can be found here: https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons
1. Import data and libraries
# Install/upgrade Gensim if needed:
# !pip install gensim --upgrade
# Note: this tutorial uses the gensim 3.x API; in gensim 4.0+ some names changed (e.g. Word2Vec's 'size' parameter became 'vector_size').
# Import Gensim and standard data science libraries:
import gensim # to get access to the word2vec tool
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Import Simpsons dataset, and read it in a pandas dataframe:
df = pd.read_csv('simpsons_dataset.csv')
# Let's take a quick look at the data:
# The data consists of text strings, spread over 2 columns and 158314 rows, with some null values in each of the 2 columns:
print('Dataframe information: ')
print()
df.info()
Dataframe information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158314 entries, 0 to 158313
Data columns (total 2 columns):
raw_character_text 140500 non-null object
spoken_words 131855 non-null object
dtypes: object(2)
memory usage: 2.4+ MB
# Data looks great!
print('Dataframe top rows:')
df.head()
Dataframe top rows:
| | raw_character_text | spoken_words |
|---|---|---|
0 | Miss Hoover | No, actually, it was a little of both. Sometim... |
1 | Lisa Simpson | Where's Mr. Bergstrom? |
2 | Miss Hoover | I don't know. Although I'd sure like to talk t... |
3 | Lisa Simpson | That life is worth living. |
4 | Edna Krabappel-Flanders | The polls will be open from now until the end ... |
# Here is the total number of null values per column:
print('Number of null values in dataframe: ')
df.isnull().sum()
Number of null values in dataframe:
raw_character_text 17814
spoken_words 26459
dtype: int64
# Dataset is large enough to run our demo, so let's get rid of all null values:
df = df.dropna()
# Just making sure that our dataset is free of nulls now:
print('Number of null values remaining: ')
df.isnull().sum()
Number of null values remaining:
raw_character_text 0
spoken_words 0
dtype: int64
# After dropping the nulls, 131853 rows remain in the dataset:
print('Dataframe shape:', df.shape)
Dataframe shape: (131853, 2)
# Let's rename the 2 columns:
df.rename(columns={'raw_character_text': 'character', 'spoken_words': 'sequence'}, inplace=True)
2. Pre-process text
Pre-processing is needed to transform text from human language into a machine-readable format for further processing.
# Import nltk text pre-processing tools:
from nltk.stem import WordNetLemmatizer # to lemmatize words
from nltk.tokenize import RegexpTokenizer # to tokenize words
from nltk.corpus import stopwords # to remove stopwords
import re # to access regular expression matching operations
# If the NLTK resources are not installed yet, download them first:
# import nltk; nltk.download('wordnet'); nltk.download('stopwords')

def preprocess(x): # create a pre-processing function
    string_text = str(x) # convert text to a string
    lower_case = string_text.lower() # lowercase the text
    lower_case = re.sub(r'[^a-zA-Z]', ' ', lower_case) # replace non-alphabetic characters with spaces
    retokenizer = RegexpTokenizer(r'\w+')
    words = retokenizer.tokenize(lower_case) # split the string into substrings using a regular expression
    lemmer = WordNetLemmatizer() # takes a word and attempts to return its base/dictionary form
    stops = set(stopwords.words('english')) # English stopwords, e.g. 'what', 'as', 'of', 'into'
    meaningful_words = [w for w in words if not w in stops] # remove stopwords
    if len(meaningful_words) > 2: # drop sentences with two or fewer meaningful words, as word2vec uses context words to learn the vector representation of a target word
        return " ".join([lemmer.lemmatize(word) for word in meaningful_words])
# Apply our function to the Simpsons dialogues:
df.sequence = df.sequence.apply(preprocess)
# Pre-processing is now done, but sentences reduced to two or fewer meaningful words now return None, see 'sequence' at row [1]:
print('Pre-processed dataframe top rows:')
df.head()
Pre-processed dataframe top rows:
| | character | sequence |
|---|---|---|
0 | Miss Hoover | actually little sometimes disease magazine new... |
1 | Lisa Simpson | None |
2 | Miss Hoover | know although sure like talk touch lesson plan... |
3 | Lisa Simpson | life worth living |
4 | Edna Krabappel-Flanders | poll open end recess case decided put thought ... |
# 38061 new nulls appear throughout our preprocessed dataset:
print('Number of null values in pre-processed dataframe: ', df.sequence.isnull().sum())
Number of null values in pre-processed dataframe: 38061
# ..so let's remove them as well as all duplicates:
df_clean = df.dropna().drop_duplicates()
# Our 'clean' dataset now includes 93317 rows of pre-processed text:
print('Shape of the clean version of our dataframe: ', df_clean.shape)
Shape of the clean version of our dataframe: (93317, 2)
# Let's use the Gensim Phrases package to automatically detect bigrams (common two-word phrases) from a list of sentences.
# This will connect "bart" and "simpson" when they appear side by side, so the model will treat the bigram 'bart_simpson' as one word.
from gensim.models.phrases import Phrases, Phraser # Phrases() takes a list of lists of words as input:
sent = [row.split() for row in df_clean['sequence']]
phrases = Phrases(sent, min_count=20) # we will ignore all bigrams with total collected count lower than 20.
bigram = Phraser(phrases)
sentences = bigram[sent]
print('First 15 bigrams in alpha order:' )
pd.DataFrame(sorted(bigram.phrasegrams)[:15], columns=['first_word', 'second_word'])
# We can see at row 12 that when 'bart' and 'simpson' show up side by side in a sentence,
# they will now be considered as one word, the bigram 'bart_simpson'
First 15 bigrams in alpha order:
| | first_word | second_word |
|---|---|---|
0 | b'across' | b'street' |
1 | b'always' | b'wanted' |
2 | b'amusement' | b'park' |
3 | b'angry' | b'dad' |
4 | b'answer' | b'question' |
5 | b'anyone' | b'else' |
6 | b'ask' | b'question' |
7 | b'aunt' | b'selma' |
8 | b'aw' | b'geez' |
9 | b'b' | b'c' |
10 | b'ba' | b'ba' |
11 | b'bad' | b'news' |
12 | b'bart' | b'simpson' |
13 | b'best' | b'friend' |
14 | b'big' | b'deal' |
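As an aside, the trained Phraser can be applied to any tokenized sentence; detected phrases are merged with an underscore. The example below is illustrative, and the exact merges depend on the statistics collected from our corpus:
# Illustrative only: the exact merges depend on the learned phrase statistics.
print(bigram[['aunt', 'selma', 'took', 'bart', 'simpson', 'amusement', 'park']])
# something like: ['aunt_selma', 'took', 'bart_simpson', 'amusement_park']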
from collections import defaultdict # defaultdict will allow us to count the number of occurrences for each word.
word_count = defaultdict(int)
for sent in sentences:
for i in sent:
word_count[i] += 1
print("Number of words in our dataframe: ", len(word_count))
Number of words in our dataframe: 32943
# What are the most frequent words appearing in our data?
print('Most frequent words:')
pd.DataFrame(sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:15], columns=['word', 'frequency'])
# 'homer' and 'bart' are respectively the 10th and 13th most frequently used words in the show,
# which makes sense as they are two of the main characters of the series
Most frequent words:
| | word | frequency |
|---|---|---|
0 | oh | 6726 |
1 | well | 5425 |
2 | like | 4941 |
3 | get | 4924 |
4 | one | 4756 |
5 | know | 4593 |
6 | hey | 3754 |
7 | right | 3535 |
8 | got | 3420 |
9 | homer | 3040 |
10 | go | 2999 |
11 | want | 2888 |
12 | bart | 2802 |
13 | think | 2687 |
14 | time | 2577 |
3. Transform text into vectors with word2vec and train the model
After training, word2vec can be used to map each word to a vector of typically several hundred elements, which represents that word's relation to other words. These word vectors correspond to the weights of the neural network's hidden layer.
from gensim.models import Word2Vec # import word2vec from Gensim library
w2v_model = Word2Vec(
    size=300, # number of dimensions of our word vectors (renamed 'vector_size' in gensim 4.0+)
    window=10, # maximum distance between a target word and the "context words" around it
    min_count=20, # ignore all words with a total frequency lower than 20
    sample=0.0001, # threshold for configuring which higher-frequency words are randomly downsampled
    negative=20, # number of "noise words" drawn for negative sampling
    workers=4) # number of worker threads used to train the model in parallel
# This step will build the vocabulary table:
w2v_model.build_vocab(sentences)
# Model can now be trained:
w2v_model.train(
    sentences,
    total_examples=w2v_model.corpus_count, # number of sentences in the corpus
    epochs=30 # number of passes over the corpus
)
# train() returns (number of effective words trained, total raw words processed):
(8742042, 18370410)
# init_sims precomputes the L2-normalized vectors; if replace=True, it discards the original vectors and keeps only the normalized ones, which can save a lot of memory:
w2v_model.init_sims(replace=False)
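Before exploring similarities, a couple of quick sanity checks can help confirm the model trained as expected. This is just a sketch, assuming 'homer' made it into the vocabulary; the filename is only an example:
# Quick sanity checks (assuming 'homer' is in the learned vocabulary):
print(w2v_model.wv['homer'].shape) # each word is now a 300-dimensional vector
print(len(w2v_model.wv.vocab)) # size of the learned vocabulary (gensim 3.x attribute)
# The model can also be saved to disk and reloaded later (hypothetical filename):
w2v_model.save('simpsons_w2v.model')
# w2v_model = Word2Vec.load('simpsons_w2v.model')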
4. Explore the model
Let's see what our model finds when we ask it to pull the top-10 most similar words to the main characters of the show with the most_similar method:
print('Top 10 most similar words to Homer:')
pd.DataFrame(w2v_model.wv.most_similar(positive=["homer"]), columns=['word', 'similarity'])
Top 10 most similar words to Homer:
| | word | similarity |
|---|---|---|
0 | marge | 0.647157 |
1 | becky | 0.551345 |
2 | lenny_carl | 0.499023 |
3 | creepy | 0.498600 |
4 | homie | 0.497142 |
5 | husband | 0.485995 |
6 | soul_mate | 0.483098 |
7 | crummy | 0.481756 |
8 | leave_alone | 0.469040 |
9 | sometime | 0.468563 |
print('Top 10 most similar words to Bart:')
pd.DataFrame(w2v_model.wv.most_similar(positive=["bart"]), columns=['word', 'similarity'])
Top 10 most similar words to Bart:
| | word | similarity |
|---|---|---|
0 | lisa | 0.745548 |
1 | maggie | 0.613382 |
2 | learned_lesson | 0.602921 |
3 | mom_dad | 0.602509 |
4 | pay_attention | 0.596564 |
5 | mom | 0.589715 |
6 | homework | 0.586715 |
7 | tell_truth | 0.582463 |
8 | feel_better | 0.581206 |
9 | dr_hibbert | 0.581166 |
print('Top 10 most similar words to Lisa:')
pd.DataFrame(w2v_model.wv.most_similar(positive=["lisa"]), columns=['word', 'similarity'])
Top 10 most similar words to Lisa:
| | word | similarity |
|---|---|---|
0 | bart | 0.745548 |
1 | learned_lesson | 0.603608 |
2 | homework | 0.603168 |
3 | maggie | 0.574106 |
4 | mom | 0.568127 |
5 | grownup | 0.563802 |
6 | saxophone | 0.561525 |
7 | surprised | 0.559258 |
8 | daughter | 0.558765 |
9 | math | 0.550277 |
print('Top 10 most similar words to Marge:')
pd.DataFrame(w2v_model.wv.most_similar(positive=["marge"]), columns=['word', 'similarity'])
Top 10 most similar words to Marge:
| | word | similarity |
|---|---|---|
0 | homer | 0.647157 |
1 | husband | 0.601620 |
2 | homie | 0.543166 |
3 | becky | 0.541192 |
4 | marriage | 0.536475 |
5 | ashamed | 0.531238 |
6 | tell_truth | 0.512526 |
7 | disappointed | 0.511782 |
8 | fault | 0.506138 |
9 | brunch | 0.499643 |
It looks like all this information makes sense! We can use the doesnt_match method to find which word from a given list doesn't go with the others:
print('Word to exclude from the list:', w2v_model.wv.doesnt_match(['bart', 'lisa', 'maggie', 'milhouse']))
Word to exclude from the list: milhouse
print('Word to exclude from the list:', w2v_model.wv.doesnt_match(['moe', 'lenny', 'carl', 'homer', 'marge']))
Word to exclude from the list: marge
In physics, we can add and subtract vectors to understand how two forces might act on an object. Let's see if we can use the most_similar method to do the same thing with word vectors.
What happens when we add 'woman' to 'homer' and subtract 'man'?
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["man"], topn=10)
[('marge', 0.5261433124542236),
('husband', 0.4614260196685791),
('homie', 0.449959933757782),
('brunch', 0.4463199973106384),
('wife', 0.44602805376052856),
('marriage', 0.44539350271224976),
('grownup', 0.44457897543907166),
('wasted', 0.4415558874607086),
('affair', 0.4279390275478363),
('luann', 0.4259093403816223)]
This is correct: Marge is Homer's female counterpart!
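You can experiment with other queries in the same way; the results will of course depend on your own trained model, and the example words below are only suggestions that must exist in the vocabulary:
# More queries to try (results depend on your trained model;
# the chosen words are examples and must be present in the vocabulary):
w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=10)
w2v_model.wv.most_similar(positive=["moe", "beer"], topn=10)
w2v_model.wv.similarity("maggie", "baby") # cosine similarity between two words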
5. Conclusion
In this quick introduction to word embeddings, I tried to describe as simply as possible the different steps you will need to go through if you want to train a word vectorizer with your own data. In this example, I chose to use Google's word2vec method, but other vectorizers are available, for instance in scikit-learn, such as CountVectorizer, HashingVectorizer, or TfidfVectorizer, which build sparse count-based representations rather than dense learned embeddings. Word embeddings give machines much more information about words than was previously possible with traditional representations. Word vectors are essential for solving NLP problems such as speech recognition, sentiment analysis, named entity recognition, spam filtering, and machine translation; they are an amazingly powerful concept, and applications in this field are practically endless. The importance of word embeddings in deep learning becomes more and more evident when looking at the amount of research in the field, so I hope this tutorial was useful for you and helped you better understand the mechanics of these methods.
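For contrast, here is a minimal sketch of one of the scikit-learn alternatives mentioned above; TfidfVectorizer produces sparse, count-based document vectors rather than dense learned word embeddings (the documents below are toy examples):
# Minimal sketch of a count-based alternative (scikit-learn's TfidfVectorizer),
# shown only for contrast with the dense word2vec embeddings trained above:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["homer eats a donut", "bart skateboards to school"] # toy example documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs) # sparse matrix: one row per document, one column per vocabulary term
print(vectorizer.get_feature_names_out()) # use get_feature_names() on older scikit-learn versions
print(X.shape)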