Semantic Representations
A tutorial on Semantic Representations in NLP and their use in Document Ranking & Classification.
Introduction
Word embeddings represent words as multi-dimensional vectors: a word w_i in a vocabulary V is mapped to a vector of n dimensions. These vectors are learned by unsupervised training on a large text corpus, so that semantically similar words end up with similar vectors.
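As a toy illustration (with made-up 4-dimensional vectors rather than real embeddings), this similarity is usually measured as the cosine of the angle between two word vectors:
import numpy as np
# Hypothetical 4-dimensional "embeddings"; real embeddings have hundreds of dimensions
cat = np.array([0.8, 0.1, 0.3, 0.0])
dog = np.array([0.7, 0.2, 0.4, 0.1])
# Cosine similarity: dot product divided by the product of the vector norms
print(np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog)))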
import sys
!mkdir vectors
# Download the different files using these commands. This may take a while
!cd vectors && curl -O http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude
# !cd vectors && curl -O http://magnitude.plasticity.ai/glove/light/glove.6B.50d.magnitude
!cd vectors && curl -O http://magnitude.plasticity.ai/elmo/light/elmo_2x1024_128_2048cnn_1xhighway_weights.magnitude
!ls
!cd vectors && ls
Install the required libraries. Again, this may take a while.
!pip3 install torch numpy scikit-learn pandas transformers==3.1.0 seaborn matplotlib sentence_transformers
import torch
torch.__version__
Since pymagnitude cannot be installed properly from PyPI, we'll use a different method for it: install Magnitude on Google Colab using the script below.
! echo "Installing Magnitude.... (please wait, can take a while)"
! (curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1>/dev/null 2>/dev/null)
! echo "Done installing Magnitude."
import pymagnitude as pym
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
def similarity_between_docs(doc1, doc2, is_1d=False):
    # If the inputs are already single vectors, compare them directly;
    # otherwise mean-pool the word vectors into one document vector first
    if is_1d:
        v1 = np.reshape(doc1, (1, -1))
        v2 = np.reshape(doc2, (1, -1))
    else:
        d1 = np.mean(doc1, axis=0)
        d2 = np.mean(doc2, axis=0)
        v1 = np.reshape(d1, (1, -1))
        v2 = np.reshape(d2, (1, -1))
    return cosine_similarity(v1, v2)[0][0]
def plot_1d_heatmap(vec, name):
    # Plot a single embedding vector as a 1 x dim heatmap
    v = vec.reshape(1, -1)
    plt.figure(figsize=(20, 2))
    sns.heatmap(v, cmap="YlGnBu").set_title(name)
    plt.rcParams.update({"font.size": 22})
    plt.show()
# glove_vectors = pym.Magnitude("./vectors/glove.6B.50d.magnitude")
w2v_vectors = pym.Magnitude("./vectors/GoogleNews-vectors-negative300.magnitude")
To use other vectors, download the pre-trained vectors from the pymagnitude repo and put them in the ./vectors folder.
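For example, if you downloaded the GloVe file from the commented-out command above, it can be loaded and inspected in the same way:
# Assumes glove.6B.50d.magnitude was downloaded into ./vectors (see the commented-out curl command above)
# glove_vectors = pym.Magnitude("./vectors/glove.6B.50d.magnitude")
# print("Total words: {}\tDimension: {}".format(len(glove_vectors), glove_vectors.dim))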
print("Vector Name: {}\nTotal words: {}\nDimension of each word: {}".
format("Word2Vec", len(w2v_vectors), w2v_vectors.dim))
# Iterate over the (word, vector) pairs and print the 1000th entry as an example
for i, (key, vec) in enumerate(w2v_vectors):
    if i == 1000:
        print("Index: {}\nWord: {}\nVector Size: {}\nVector: {}".format(i, key, vec.shape, vec))
        break
print(w2v_vectors.query("dog"))
# Get the vector using the index
print(w2v_vectors[1000])
doc_vecs = w2v_vectors.query(["I", "read", "a", "book"])
doc_vecs.shape
mul_doc_vecs = w2v_vectors.query([["I", "read", "a", "book"], ["I", "read", "a", "sports", "magazine"]])
mul_doc_vecs.shape
print("Similarity between \"Apple\" and \"Mango\": {}".
format(w2v_vectors.similarity("apple", "mango")))
print("Similarity between \"Apple\" and [\"Mango\", \"Orange\"]: {}".
format(w2v_vectors.similarity("apple", ["mango", "orange"])))
print("Most similar to \"Cat\" from [\"Dog\", \"Television\", \"Laptop\"]: {}".
format(w2v_vectors.most_similar_to_given("cat", ["dog", "television", "laptop"])))
doc1 = w2v_vectors.query(["I", "read", "a", "book"])
doc2 = w2v_vectors.query(["I", "read", "a", "sports", "magazine"])
print("Similarity between\n\"I read a book\" and \"I read a sports magazine\": {}".
format(similarity_between_docs(doc1, doc2, is_1d=False)))
plot_1d_heatmap(w2v_vectors.query("king"), "King")
plot_1d_heatmap(w2v_vectors.query("man"), "Man")
plot_1d_heatmap(w2v_vectors.query("woman"), "Woman")
plot_1d_heatmap(w2v_vectors.query("queen"), "Queen")
tmp = w2v_vectors.query("king") - w2v_vectors.query("man") + w2v_vectors.query("woman")
plot_1d_heatmap(tmp, "King - Man + Woman")
print("Similarity between\n\"King - Man + Woman\" and \"Queen\": {}".
format(similarity_between_docs(tmp, w2v_vectors.query("queen"), is_1d=True)))
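Magnitude can also run this analogy query directly through its most_similar interface (a quick sanity check; the exact neighbours depend on the vectors used):
# "king" - "man" + "woman" should rank "queen" near the top
print(w2v_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5))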
"""
Try plotting the heatmap for four different words, out of which three should share the
same property and one should be different. For example: "girl", "boy", "man", "water"
"""
# Your code here
"""
Calculate the similarity between two words with a similar sense and between two words with no
similarity. For example, the similarity between "cat" & "dog" versus the similarity between
"apple" and "lion"
"""
# Your code here
"""
Print the similarity score of "paris" with each of the following:
"delhi", "vienna", "london", "france", "laptop"
"""
# Your code here
elmo_vecs = pym.Magnitude('./vectors/elmo_2x1024_128_2048cnn_1xhighway_weights.magnitude')
ELMo generates the embedding of a word based on its context, so we need to provide a full sentence in order to get the embedding of a particular word.
sen1 = elmo_vecs.query(["yes", "they", "are", "right"])
sen2 = elmo_vecs.query(["go", "to", "your", "right"])
right1 = sen1[-1]
right2 = sen2[-1]
print("right from sentence 1: {}\tright from sentence 2: {}".format(right1.shape, right2.shape))
plot_1d_heatmap(right1, name="ELMo vec for right in \"yes they are right\"")
plot_1d_heatmap(right2, name="ELMo vec for right in \"go to your right\"")
print("Simialrity between \"right\" from sentence 1 & 2:\t{}".
format(similarity_between_docs(right1, right2, is_1d=True)))
print("Simialrity between \"right\" from sentence 1 only:\t{}".
format(similarity_between_docs(right1, right1, is_1d=True)))
print("Simialrity between \"right\" from sentence 2 only:\t{}".
format(similarity_between_docs(right2, right2, is_1d=True)))
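For contrast, a static embedding such as Word2Vec assigns "right" a single vector regardless of context, so the two senses cannot be told apart:
# Word2Vec returns the same vector for "right" in both sentences, so the similarity is exactly 1.0
print(similarity_between_docs(w2v_vectors.query("right"), w2v_vectors.query("right"), is_1d=True))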
Google's BERT
Since pymagnitude doesn't support BERT yet, we'll use Hugging Face's transformers library for this.
import torch
import transformers
# this may take a while for first time
model_class, tokenizer_class, pretrained_weights = (transformers.BertModel,
transformers.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
tokenized1 = tokenizer.encode("yes they are right", add_special_tokens=False)
tokenized2 = tokenizer.encode("go to your right", add_special_tokens=False)
print(tokenized1, tokenized2)
# you can also get the full sentence using the token_ids
print(tokenizer.decode(tokenized1))
print(tokenizer.decode(tokenized2))
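To see the actual WordPiece tokens (and confirm that "right" is the last token of both sentences), convert the ids back to token strings:
# Map token ids back to their token strings
print(tokenizer.convert_ids_to_tokens(tokenized1))
print(tokenizer.convert_ids_to_tokens(tokenized2))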
# Both token lists have the same length here, so they can be stacked into one tensor directly
input_ids = torch.tensor([tokenized1, tokenized2])
model.eval()
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs[0]  # final-layer hidden states, shape (batch, seq_len, hidden)
right1_bert = (last_hidden_states[0][-1]).numpy()
right2_bert = (last_hidden_states[1][-1]).numpy()
print(right1_bert.shape, right2_bert.shape)
plot_1d_heatmap(right1_bert, name="BERT vec for right in \"yes they are right\"")
plot_1d_heatmap(right2_bert, name="BERT vec for right in \"go to your right\"")
print("Simialrity between \"right\" from sentence 1 & 2 using BERT:\t{}".
format(similarity_between_docs(right1_bert, right2_bert, is_1d=True)))
print("Simialrity between \"right\" from sentence 1 only using BERT:\t{}".
format(similarity_between_docs(right1_bert, right1_bert, is_1d=True)))
print("Simialrity between \"right\" from sentence 2 only using BERT:\t{}".
format(similarity_between_docs(right2_bert, right2_bert, is_1d=True)))
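Stacking the two token-id lists into one tensor above only works because both sentences happen to have the same number of tokens. For sentences of different lengths, the usual approach is to let the tokenizer pad the batch and pass the attention mask to the model; a minimal sketch with the same tokenizer and model (the second sentence here is just an example):
# Pad a batch of different-length sentences and tell the model which positions are padding
batch = tokenizer(["yes they are right", "go to your right and then turn left"],
                  padding=True, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    padded_out = model(batch["input_ids"], attention_mask=batch["attention_mask"])
print(padded_out[0].shape)  # (batch size, longest sequence length, hidden size)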
!mkdir data
!cd data && curl -O https://raw.githubusercontent.com/ashishu007/Word-Embeddings/master/data/abstracts.csv
!cd data && curl -O https://raw.githubusercontent.com/ashishu007/Word-Embeddings/master/data/train.tsv
dfa = pd.read_csv('./data/abstracts.csv')
print(dfa.shape)
dfa = dfa[:50]
dfa.head(5)
import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
torch.__version__
def gen_w2v_embs(row):
    # Tokenize the text, keep alphabetic tokens, stem them and drop a small stop-word list
    tokens = nltk.word_tokenize(row)
    token_words = [w for w in tokens if w.isalpha()]
    stemming = PorterStemmer()
    tokens_stemmed = [stemming.stem(word) for word in token_words]
    # stops = set(stopwords.words("english"))  # the full NLTK list needs nltk.download('stopwords')
    stops = ["a", "an", "the"]
    meaningful_words = [w for w in tokens_stemmed if w not in stops]
    # Average the Word2Vec vectors of the remaining tokens into a single document vector
    vecs = []
    for w in meaningful_words:
        w_vec = w2v_vectors.query(w)
        vecs.append(w_vec)
    vec_arr = np.array(vecs)
    vec_final = np.mean(vec_arr, axis=0, dtype="float32")
    return vec_final
# Here we use a different library, `sentence_transformers`, because it is
# easier to use than `transformers` for sentence embeddings
def gen_bert_embs(col):
    # Encode a string (or a list/Series of strings) into sentence embedding(s)
    bert_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
    bert_embs = bert_model.encode(col)
    return bert_embs
w2v_abs = dfa["content"].apply(gen_w2v_embs)
w2v_abs = (torch.tensor(w2v_abs)).numpy()
elmo_abs = dfa["content"].apply((lambda x: elmo_vecs.query(x)))
elmo_abs = (torch.tensor(elmo_abs)).numpy()
bert_abs = gen_bert_embs(dfa["content"])
w2v_abs.shape, elmo_abs.shape, bert_abs.shape
type(w2v_abs), type(elmo_abs), type(bert_abs)
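If the torch.tensor conversion above complains about the object-dtype Series (this varies across torch and pandas versions), stacking the list of per-document vectors with NumPy is an equivalent way to get the (n_docs, dim) arrays:
# Alternative conversion, equivalent to torch.tensor(...).numpy() on the Series of vectors
# w2v_abs = np.stack(dfa["content"].apply(gen_w2v_embs).tolist())
# elmo_abs = np.stack(dfa["content"].apply(lambda x: elmo_vecs.query(x)).tolist())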
def gen_query_emb(q, emb="w2v"):
    # Embed a query string with the chosen embedding model ("w2v", "elmo" or "bert")
    if emb == "w2v":
        query_emb = gen_w2v_embs(q)
    elif emb == "elmo":
        query_emb = elmo_vecs.query(q)
    elif emb == "bert":
        query_bert = gen_bert_embs(q)
        query_emb = query_bert.reshape(-1)
    return query_emb
q1 = gen_query_emb("documents that discuss learning methods", emb="elmo")
q2 = gen_query_emb("documents that discuss learning methods", emb="bert")
q3 = gen_query_emb("documents that discuss learning methods", emb="w2v")
q1.shape, q2.shape, q3.shape
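Note that a query can only be ranked against document embeddings of the same type, since the dimensionalities must match (e.g. the BERT query against the BERT abstracts):
# The query vector and the document vectors it is compared against must share a dimension
assert q2.shape[0] == bert_abs.shape[1]
assert q3.shape[0] == w2v_abs.shape[1]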
def get_doc_similarity(q, docs):
    # Score every document against the query and sort by similarity, highest first
    sims = {}
    for i, doc in enumerate(docs):
        sim_score = similarity_between_docs(q, doc, is_1d=True)
        sims[i] = sim_score
    sims_sorted = {k: v for k, v in sorted(sims.items(), key=lambda item: item[1], reverse=True)}
    return sims_sorted
s = get_doc_similarity(q2, bert_abs)
# s
ss = list(s.keys())[:10]
ss
dfa["content"][17]
"""
Try ranking the same documents using another natural language query and another algorithm (embedding)
"""
# Your code here
"""
Print the top 3 documents ranked by each query and algorithm, and compare their results
"""
# Your code here
"""
Plot the heatmap for a sentence using BERT and ELMo
"""
# Your code here
"""
Plot the heatmap for the same sentence using Word2Vec
"""
# Your code here
df = pd.read_csv('./data/train.tsv', delimiter='\t', names=["text", "label"])
print(df.shape)
df = df[:200]
df.head(5)
w2vs = df["text"].apply(gen_w2v_embs)
w2v_embs = (torch.tensor(w2vs)).numpy()
w2v_embs.shape
elmos = df["text"].apply((lambda x: elmo_vecs.query(x)))
elmo_embs = (torch.tensor(elmos)).numpy()
elmo_embs.shape
bert_embs = gen_bert_embs(df["text"])
bert_embs.shape
labels = df["label"]
from sklearn.model_selection import train_test_split
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(w2v_embs, labels,
test_size=0.33,
random_state=42, stratify=labels)
X_train_elmo, X_test_elmo, y_train_elmo, y_test_elmo = train_test_split(elmo_embs, labels,
test_size=0.33,
random_state=42, stratify=labels)
X_train_bert, X_test_bert, y_train_bert, y_test_bert = train_test_split(bert_embs, labels,
test_size=0.33,
random_state=42, stratify=labels)
from sklearn.linear_model import LogisticRegression
lr_clf_w2v = LogisticRegression()
lr_clf_w2v.fit(X_train_w2v, y_train_w2v)
lr_clf_elmo = LogisticRegression()
lr_clf_elmo.fit(X_train_elmo, y_train_elmo)
lr_clf_bert = LogisticRegression()
lr_clf_bert.fit(X_train_bert, y_train_bert)
y_pred_w2v = lr_clf_w2v.predict(X_test_w2v)
y_pred_elmo = lr_clf_elmo.predict(X_test_elmo)
y_pred_bert = lr_clf_bert.predict(X_test_bert)
from sklearn.metrics import accuracy_score, f1_score
print("Word2Vec\tAccuracy: {}\tMacro F1: {}".format(accuracy_score(y_pred_w2v, y_test_w2v),
f1_score(y_pred_w2v, y_test_w2v, average="macro")))
print("ELMo\tAccuracy: {}\tMacro F1: {}".format(accuracy_score(y_pred_elmo, y_test_elmo),
f1_score(y_pred_elmo, y_test_elmo, average="macro")))
print("BERT\tAccuracy: {}\tMacro F1: {}".format(accuracy_score(y_pred_bert, y_test_bert),
f1_score(y_pred_bert, y_test_bert, average="macro")))
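For a per-class breakdown of the classifiers (not just the aggregate scores above), scikit-learn's classification_report can be printed as well, for example for the BERT-based model:
from sklearn.metrics import classification_report
# Per-class precision, recall and F1 for the BERT-based classifier
print(classification_report(y_test_bert, y_pred_bert))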