How to deal with multi-word phrases (or n-grams) while building a custom embedding?

Suyash Khare
3 min read · Aug 26, 2020
n-grams: a contiguous sequence of n items from a given sample of text. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. Here we will look at word n-grams (referred to simply as n-grams from here on).

If you want to model the unique meaning of commonly occurring n-grams, often called collocations, the best solution is to train a new embedding space to model those specific semantics.

An alternative approach is aggregating, i.e. embedding each word separately and taking the average of those vectors as the embedding for the combination. But this only captures part of the collocation’s meaning: when two word vectors are averaged, the result is often not a faithful representation of the phrase formed by joining the two words.
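
To make the aggregation idea concrete, here is a minimal sketch. It assumes an already-trained gensim Word2Vec model named model; the helper name average_phrase_vector is purely illustrative and not part of any library.

import numpy as np

# Illustrative helper (not from the article): the phrase vector is simply
# the mean of the individual word vectors, which is what "aggregating" means.
def average_phrase_vector(model, phrase):
    vectors = [model.wv[word] for word in phrase.split() if word in model.wv]
    return np.mean(vectors, axis=0)

# e.g. average_phrase_vector(model, "new york") approximates a vector for the
# phrase, but often misses the collocation-specific meaning of "new_york".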

A third approach is fine-tuning (transfer learning), which is useful when you do not have a lot of data. Fine-tuning takes an existing model’s architecture and weights and continues training on additional data. An example of fine-tuning word2vec in Keras can be found here; in that case, Google’s Wikipedia model is loaded and trained further on custom collocations.
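
For completeness, here is a hedged sketch of what fine-tuning can look like with gensim’s own update API (this is not the Keras example linked above; the file name pretrained.model and the variable new_sentences are placeholders):

import gensim

# Load a previously saved full Word2Vec model (placeholder file name).
model = gensim.models.Word2Vec.load('pretrained.model')

# new_sentences: a list of tokenized sentences containing your collocations,
# e.g. [['new_york', 'is', 'expensive'], ...]  (placeholder variable).
model.build_vocab(new_sentences, update=True)   # add new tokens to the vocabulary
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)                # continue training on the new data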

Okay, let's get into it then. First things first, import your libraries.

import gensim
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from collections import Counter

# nltk.download('stopwords')  # run once if the stopwords corpus is missing
stoplist = stopwords.words('english')

Now let’s get a sample dataset. I have used the Brown corpus from nltk.

from nltk.corpus import brown

# nltk.download('brown')  # run once if the Brown corpus is missing
words = brown.words()
sents = brown.sents()
print("Number of sentences: ", len(sents))
print("Number of words: ", len(words))
brown.sents()[2:3]

[out]: Number of sentences: 57340
Number of words: 1161192
[['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye']]

So the corpus is fairly large, with 57,340 sentences and 1,161,192 words. The sentences are already tokenized, which is great. (Remember, gensim’s word2vec model takes a list of tokenized sentences, i.e. a list of lists, as input.) Now let’s extract all the n-grams from the corpus:

def get_ngrams(words):
    # lowercase and drop stopwords and very short tokens before forming n-grams
    words = [word.lower() for word in words if word not in stoplist and len(word) > 2]
    bigram = ["_".join(phrase) for phrase in ngrams(words, 2)]
    trigram = ["_".join(phrase) for phrase in ngrams(words, 3)]
    fourgram = ["_".join(phrase) for phrase in ngrams(words, 4)]
    return bigram, trigram, fourgram

bigram, trigram, fourgram = get_ngrams(words)
print("Top 3 bigrams: ", Counter(bigram).most_common()[:3])
print("Top 3 trigrams: ", Counter(trigram).most_common()[:3])
print("Top 3 fourgrams: ", Counter(fourgram).most_common()[:3])

[out]: Top 3 bigrams: [('united_states', 392), ('new_york', 296), ('per_cent', 146)]
Top 3 trigrams: [('united_states_america', 29), ('new_york_city', 27), ('government_united_states', 25)]
Top 3 fourgrams: [('government_united_states_america', 17), ('john_notte_jr._governor', 15), ('average_per_capita_income', 10)]

They look good, don’t they? Note that the stopwords have been removed. Since embeddings are created for single tokens, the words in each n-gram are joined with the “_” character, so every n-gram is treated as one token.

Let's generate the training data for our custom embeddings now:

training_data = []
for sentence in sents:
    # unigrams: lowercased, with stopwords and very short tokens removed
    l1 = [word.lower() for word in sentence if word not in stoplist and len(word) > 2]
    # n-grams built from the same sentence
    l2, l3, l4 = get_ngrams(sentence)
    training_data.append(l1 + l2 + l3 + l4)

training_data[2:3]

[out]: [['the',
'september-october',
'term',
..
'the_september-october',
'september-october_term',
'term_jury',
'jury_charged',
'charged_fulton',
'fulton_superior',
'superior_court',
..
'the_september-october_term',
'september-october_term_jury',
'term_jury_charged',
'jury_charged_fulton',
..]]

Alright, so the data looks ready to be used to train our custom embedding.

model = gensim.models.Word2Vec(training_data)
model.save('custom.embedding')
model = gensim.models.Word2Vec.load('custom.embedding')

And your custom embeddings with n-grams are ready.
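
The Word2Vec call above relies on gensim’s default hyperparameters. For reference, here is a hedged sketch with the main knobs made explicit (the values are illustrative, not tuned; in gensim versions before 4.0 the vector_size argument is called size):

model = gensim.models.Word2Vec(
    training_data,
    vector_size=100,   # embedding dimensionality (gensim's default; "size" in gensim < 4.0)
    window=5,          # maximum distance between the current and predicted word
    min_count=5,       # ignore tokens that appear fewer than 5 times
    workers=4,         # number of training threads
    sg=0,              # 0 = CBOW, 1 = skip-gram
)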

Each word (and each n-gram token) is represented as a vector in a 100-dimensional space, which is gensim’s default vector size:

len(model.wv['india'])

[out]: 100

Note: Always save the embedding so you don't have to train it every time you rerun your notebook.

Let’s look at some helper functions already implemented in gensim for working with word embeddings. For example, to compute the cosine similarity between two words (here checking whether it exceeds 0.3):

model.wv.similarity('university', 'school') > 0.3

[out]: True

Finding the top n words that are similar to a target word:

model.wv.most_similar(positive=['india'], topn=5)

[out]: [('britain', 0.9997713565826416),
('the_government', 0.9996576309204102),
('court', 0.9996564388275146),
('government_india', 0.9996494650840759),
('secretary_state', 0.9995989799499512)]

You can input multiple words as well:

model.wv.most_similar(positive=['india', 'britain'], negative=['paris'], topn=5)

[out]: [('america', 0.999089241027832),
('government_united_states', 0.9990255832672119),
('united_states_america', 0.9989479780197144),
('pro-western', 0.9975387454032898),
('foreign_countries', 0.9969671964645386)]
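
Since the n-grams were added to the training data as single tokens, they get their own vectors and can be queried like any other word. A small sketch (whether a particular token such as 'new_york' made it into the vocabulary depends on its frequency and the min_count setting):

# n-gram tokens behave exactly like words in the trained embedding
if 'new_york' in model.wv and 'united_states' in model.wv:
    print(model.wv.most_similar('new_york', topn=3))
    print(model.wv.similarity('new_york', 'united_states'))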

So that’s all for this article, folks. Thank you for reading. Cheers!
