Introduction: What Are Word Embeddings?
Word embeddings are one of the most commonly used techniques in natural language processing (NLP). They have been widely used for NLP tasks such as sentiment analysis, topic classification, and question answering, and they are a large part of why language models like recurrent neural networks (RNNs), long short-term memory (LSTM) networks, ELMo, BERT, ALBERT, and the more recent GPT-3 have advanced so rapidly.
These models are fast, and they can generate language sequences and perform downstream tasks with high accuracy. The embeddings they build on capture contextual understanding, semantic properties, and syntactic properties, as well as linear relationships between word vectors.
Embedding is a technique for extracting patterns from text or speech sequences. But how does it do that? Put simply, word embeddings are algorithms that map words to vectors.
We’ll look at some of the earliest neural-network techniques used to build word representations for natural language processing. Word embeddings are one of the most popular ways to represent a document’s vocabulary: they can capture the context of a word in an input sentence, semantic and syntactic similarity, and relationships with other words.
Word embeddings allow words with similar meanings to be represented by vectors that are close together.
They are a distributed word representation that is perhaps one of deep learning’s most important breakthroughs for solving challenging natural language processing (NLP) problems.
Also Read: What is NLP?
What Are Word Embeddings?
A word embedding represents each word in such a way that words with the same meaning have similar representations.
Vector representations of words are called “word embeddings”. Now that we’ve defined them, let’s look at how they are generated. Most importantly, how do they capture context, and what techniques are used? Pre-trained word embeddings are typically learned either from word co-occurrence counts or with deep learning models that feed a one-hot encoded vector through an intermediate fully-connected hidden layer and use that layer’s output as the embedding.
Why Are Word Embeddings Used?
Since machine learning models cannot process textual data directly, we need to convert it into numerical data they can use. TF-IDF and Bag of Words have been discussed previously as techniques that achieve this. In addition, we can represent the words in a vocabulary with one-hot encodings or with integer (number-based) encodings. Compared with one-hot encoding, the integer approach is more efficient, since it produces a dense representation instead of a sparse one, and it keeps working even when the vocabulary is large.
One-Hot Encoding vs. Integer Encoding
The integer encoding is arbitrary: it captures no relationship between words. A linear classifier, for example, learns a single weight for each feature, and for that feature-weight combination to be meaningful there must be a relationship between the similarity of two words and the similarity of their encodings, which arbitrary integers do not provide.
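To make the contrast concrete, here is a minimal Python sketch of integer encoding and one-hot encoding over a toy, made-up vocabulary (the words and sizes are purely illustrative): neither encoding says anything about which words are similar.

```python
# Minimal sketch: integer encoding vs. one-hot encoding for a toy vocabulary.
# The vocabulary is illustrative, not taken from any particular dataset.
import numpy as np

vocab = ["frog", "frogs", "toad", "weather", "nice"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Integer encoding: one arbitrary number per word.
print(word_to_index["frog"], word_to_index["weather"])  # 0 3

# One-hot encoding: a sparse vector with a single 1 per word.
def one_hot(word, vocab_size=len(vocab)):
    vector = np.zeros(vocab_size)
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("frog"))     # [1. 0. 0. 0. 0.]
print(one_hot("weather"))  # [0. 0. 0. 1. 0.]

# Neither encoding tells us that "frog" is closer in meaning to "toad"
# than to "weather": the dot product of any two distinct one-hot vectors is 0.
print(one_hot("frog") @ one_hot("toad"))  # 0.0
```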
In vector space, words with similar meanings are grouped together by their embeddings. For a word such as frog, the nearest neighbours would be frogs, toads, and Litoria. As a result, a classifier is not thrown off when it sees the word Litoria during testing, because the two word vectors are similar. Word embeddings also learn relationships between words: an analogous word can be found by adding the difference between two vectors to a third, as in the sketch below.
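As a rough illustration, here is a sketch with tiny, hand-made 3-dimensional vectors (real embeddings are learned and have far more dimensions). It uses cosine similarity to compare words and shows the familiar king - man + woman ≈ queen style of analogy arithmetic.

```python
# Sketch of similarity and analogy arithmetic with toy, hand-made embeddings.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity measures how closely two vectors point in the same direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings["king"], embeddings["queen"]))

# Differences between vectors capture relationships:
# king - man + woman lands closest to queen among these toy vectors.
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(max(embeddings, key=lambda w: cosine(analogy, embeddings[w])))  # queen
```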
Deep learning has made significant progress on challenging natural language processing problems because of this method of representing words and documents.
Applied to words, embedding is the process of representing each word as a real-valued vector in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network, which is why the technique is often grouped with deep learning. The approach relies on a dense distributed representation for each word: every word is described by tens or hundreds of dimensions, whereas sparse word representations, like a one-hot encoding, require thousands or millions of dimensions.
The distributed representation is learned from how words are used. As a result, words used in the same way end up with similar representations, naturally capturing their meaning: words that appear in similar contexts tend to have similar meanings. Compare this with a bag-of-words model, where, unless explicitly managed, different words have different representations regardless of how they are used. One-hot vectors are still an integral part of learning word embeddings: they are the inputs multiplied against the embedding matrix while the network optimises its objective function, as the sketch below shows.
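This equivalence is easy to verify: multiplying the embedding matrix by a word's one-hot vector selects exactly that word's column. The sizes below are arbitrary, chosen only for the sketch.

```python
# Sketch: multiplying the embedding matrix by a one-hot vector is the same
# as looking up that word's column. Sizes here (N=4, vocabulary of 6) are arbitrary.
import numpy as np

N, vocab_size = 4, 6
embedding_matrix = np.random.rand(N, vocab_size)

word_index = 2                       # integer code of some word
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

lookup = embedding_matrix[:, word_index]   # direct column lookup
product = embedding_matrix @ one_hot       # one-hot multiplication

print(np.allclose(lookup, product))  # True
```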
Also Read: What are the Natural Language Processing Challenges, and How to fix them?
Embedding Matrix
The embedding matrix is a randomly initialized matrix whose dimensions are N * (size of the vocabulary + 1), where N is a number we select manually (the embedding dimension) and the size of the vocabulary is the number of unique words in the document. Each column of the embedding matrix represents an individual word in the document.
The embedding matrix is then trained using gradient descent, so that over time the values of the matrix are learned in a way that groups similar words together. A boy, for instance, would not score highly on a feature like royal, whereas a king or queen would; both the king and the boy are male, so both have a high value for the feature corresponding to male.
The first thing you need to know is that even though these features (Royal, Male, Age, etc.) appear in the picture, we never define them explicitly. The matrix is simply randomly initialized, and gradient descent learns values that end up behaving like such features. A minimal sketch of the initialization is shown below.
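Here is a minimal sketch of that initialization, assuming an embedding dimension N of 15 and a vocabulary of 1000 words (both arbitrary choices); the extra column is commonly reserved for a padding or out-of-vocabulary index.

```python
# Minimal sketch: randomly initializing an embedding matrix of shape
# N x (vocabulary size + 1). N and vocab_size are arbitrary choices here.
import numpy as np

N = 15             # embedding dimension, chosen manually
vocab_size = 1000  # number of unique words in the document

embedding_matrix = np.random.uniform(-0.05, 0.05, size=(N, vocab_size + 1))
print(embedding_matrix.shape)  # (15, 1001)

# The "features" (Royal, Male, Age, ...) are never defined explicitly;
# gradient descent simply updates these random values until similar words
# end up with similar columns.
```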
Pre-Processing for Embedding Matrix
We know that we cannot use non-numerical data for machine learning, and words are, of course, non-numerical. So let’s see how we have to convert them before the forward propagation.
There are a lot of algorithms for this:
- One Hot Encoding
- Term Frequency-Inverse Document Frequency
- Tokenization (Text to Sequence)
But for this purpose, tokenization is the preferred choice, and you will understand why in a few minutes.
Tokenization: Assigning a number to each unique word in the corpus is called tokenization.
Example: Let’s assume that we have a training set with 3 training examples: [“What is your name”, “how are you”, “where are you”]. If we tokenize this data, the result would be:
What: 1, is: 2, your: 3, name: 4, how: 5, are: 6, you: 7, where: 8
Tokenized form of the first sentence: [1, 2, 3, 4]
Tokenized form of the second sentence: [5, 6, 7]
Tokenized form of the third sentence: [8, 6, 7]
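Here is a minimal pure-Python sketch that reproduces this mapping by assigning indices in order of first appearance. Note that library tokenizers (for example, Keras’s Tokenizer) may number the words differently, since they typically order the vocabulary by frequency.

```python
# Minimal sketch of tokenization, assigning indices in order of first appearance
# so that it reproduces the mapping in the example above.
sentences = ["What is your name", "how are you", "where are you"]

word_index = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]

print(word_index)  # {'what': 1, 'is': 2, 'your': 3, 'name': 4, 'how': 5, 'are': 6, 'you': 7, 'where': 8}
print(sequences)   # [[1, 2, 3, 4], [5, 6, 7], [8, 6, 7]]
```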
Now that the data is pre-processed, let’s move on to the forward pass.
Also Read: What is Tokenization in NLP?
Forward Propagation
Recall that in the embedding matrix each column represents a word, and that we manually pick N, the size of each word vector. The following example assumes a vocabulary size of 1000 and an N of 15.
Consider the following example:
Whenever we tokenize a word, we assign it a number, so the tokenized representation of “The Weather is Nice” might look like this: [123, 554, 792, 205].
When this array of tokens is passed into the neural network for the forward pass, the embedding matrix, which has one column per vocabulary word (1000 in this example), acts as a lookup table: for the input [123, 554, 792, 205], we pull out columns 123, 554, 792, and 205.
Each of these columns has 15 rows (N). The 4 columns are then stacked on top of each other, flattening the 4 vectors into a single tensor of size 15 * 4 = 60.
After being flattened, the tensor is passed to an RNN or a dense layer to generate a prediction.
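Putting the whole forward pass together, here is a NumPy sketch under the assumptions above (a vocabulary of 1000 words plus one reserved index, N = 15, and the tokens [123, 554, 792, 205]); the dense-layer weights are random stand-ins rather than trained parameters.

```python
# Sketch of the forward pass described above, using NumPy. The embedding
# matrix, dense-layer weights, and token ids are random / illustrative.
import numpy as np

N, vocab_size = 15, 1000
embedding_matrix = np.random.rand(N, vocab_size + 1)

tokens = [123, 554, 792, 205]            # "The Weather is Nice", tokenized

# Look up one column per token, then flatten into a single vector of size N * 4.
looked_up = embedding_matrix[:, tokens]  # shape (15, 4)
flattened = looked_up.T.flatten()        # shape (60,)

# Pass the flattened vector through a dense layer to generate a prediction
# (here the output size is the vocabulary, as in, e.g., next-word prediction).
W = np.random.rand(vocab_size + 1, N * len(tokens))
b = np.zeros(vocab_size + 1)
logits = W @ flattened + b
print(logits.shape)                      # (1001,)
```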