
What Are Word Embeddings?


Introduction: What Are Word Embeddings?

Word embeddings are one of the most commonly used techniques in natural language processing (NLP). They have been widely used for NLP tasks such as sentiment analysis, topic classification, and question answering, and they are a large part of why language models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, ELMo, BERT, ALBERT, and most recently GPT-3 have advanced so rapidly.


These models are fast, and they can generate language sequences and perform downstream tasks with high accuracy. They capture contextual understanding, semantic properties, and syntactic properties, as well as linear relationships between words.

Embedding is a technique for extracting patterns from text or speech sequences. But how does it work? At its core, a word embedding is an algorithm that maps each word to a vector.

We’ll look at some of the earliest neural networks used to build complex algorithms for natural language processing. Word embeddings are one of the most popular ways to represent a document’s vocabulary. They can capture the context of a word in a sentence, semantic and syntactic similarity, and relations with other words.


Word embeddings allow words with similar meanings to be represented by vectors that are close together.

They are distributed word representations and are perhaps one of deep learning’s most important breakthroughs for solving challenging natural language processing (NLP) problems.

Also Read: What is NLP?

What Are Word Embeddings?

A word embedding represents words in such a way that words with similar meanings have similar representations.

Vector representations of words are called “word embeddings”. Now let’s look at how they are generated. Most importantly, how do they capture context, and what techniques are used? Many pre-trained word embeddings are learned from word co-occurrence counts using a neural model: a one-hot encoded vector is fed into an intermediate fully-connected hidden layer, and the learned weights of that layer (rather than the layer output) provide the embedding for each word.

Why Are Word Embeddings Used?

Since machine learning models cannot process textual data directly, we need to convert text into numerical data. TF-IDF and bag-of-words have already been discussed as techniques for achieving this. We can also use one-hot encoding or integer (number-based) representations of the words in a vocabulary. Compared to one-hot encoding, the integer approach is more efficient because it gives a dense representation instead of a sparse one, and it works even when the vocabulary is large.

One-hot encoding vs integer encoding

Integer encoding is arbitrary: it captures no relationship between words, which makes it hard for a model to use. A linear classifier, for example, learns one weight for each feature, and for that feature-weight combination to be meaningful there must be a relationship between the similarity of two words and the similarity of their encodings.
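As a concrete illustration, here is a minimal sketch in Python (the four-word vocabulary and sentence are made up for the example) contrasting integer encoding with one-hot encoding:

import numpy as np

# Assumed toy vocabulary and sentence, purely for illustration.
vocab = {"the": 0, "weather": 1, "is": 2, "nice": 3}
sentence = ["the", "weather", "is", "nice"]

# Integer encoding: one arbitrary number per word (compact, but the numbers
# carry no information about word similarity).
integer_encoded = [vocab[word] for word in sentence]
print(integer_encoded)        # [0, 1, 2, 3]

# One-hot encoding: one sparse vector of length |vocab| per word.
one_hot = np.eye(len(vocab))[integer_encoded]
print(one_hot.shape)          # (4, 4); the width grows with the vocabulary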

In vector space, words with similar meanings are grouped together by their embeddings. For a word such as frog, its nearest neighbors would be frogs, toads, and Litoria. As a result, a classifier is not thrown off when it sees the word Litoria during testing, because the two word vectors are similar. In addition, word embeddings learn relationships between words: an analogous word can be found by adding the difference between two vectors to a third (for example, king - man + woman lands close to queen).
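These properties are easy to inspect with pre-trained vectors. The sketch below assumes the gensim library is installed and that its downloadable "glove-wiki-gigaword-50" vectors are available (an internet connection is needed on first use):

import gensim.downloader as api

# Load small pre-trained GloVe vectors (assumption: the download succeeds).
vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbours of "frog" should be semantically related words.
print(vectors.most_similar("frog", topn=3))

# Analogy via vector arithmetic: king - man + woman should land near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))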

Deep learning has made significant progress on challenging natural language processing problems because of this method of representing words and documents.

Applied to words, embedding is the process of representing each word as a real-valued vector in a predefined vector space. Each word is mapped to a vector, and the vector values are learned in a way that resembles a neural network, which is why the technique is often grouped with deep learning. The approach relies on a dense distributed representation for each word: each word has tens or hundreds of dimensions, whereas sparse representations such as one-hot encoding require thousands or millions of dimensions.

Distributed representations are learned from how words are used. As a result, words used in similar ways end up with similar representations, naturally capturing their meaning. Compare this to a bag-of-words model where, unless explicitly managed, different words have entirely different representations regardless of how they are used. The underlying idea is that words appearing in similar contexts have similar meanings. One-hot vectors still play a part in word embeddings: they are the inputs from which the embedding is learned under the training objective.

Also Read: What are the Natural Language Processing Challenges, and How to fix them?

Embedding Matrix

An embedding matrix is a randomly initialized matrix whose dimensions are N × (vocabulary size + 1), where N is the embedding dimension we select manually and the vocabulary size is the number of unique words in the document. Each column of the embedding matrix represents an individual word in the document.

The embedding matrix is trained over time using gradient descent so that similar words end up grouped together. For example, a boy is not royal, whereas a king or queen is, so the value of a feature corresponding to royalty would be low for boy and high for king or queen. Both the king and the boy are male, so both would have a high value for a feature corresponding to male.

Note that even though features such as Royal, Male, and Age appear in the picture, we do not define them explicitly. The matrix is simply randomly initialized, and it learns the values for these features on its own through gradient descent.
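As a minimal sketch, a randomly initialized embedding matrix with N = 15 and a vocabulary of 1000 words (both values assumed for illustration) could be set up like this:

import numpy as np

N = 15              # embedding dimension, chosen manually
vocab_size = 1000   # number of unique words in the document

# Randomly initialised matrix of shape N x (vocab_size + 1); the extra
# column is reserved for unknown or padding tokens.
embedding_matrix = np.random.randn(N, vocab_size + 1) * 0.01

# Looking up a word is just selecting its column.
word_index = 123
word_vector = embedding_matrix[:, word_index]   # shape (15,)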

Pre-Processing for Embedding Matrix

We know that we cannot feed non-numerical data to a machine learning model, and words are, of course, non-numerical. So let’s see how to convert them before the forward propagation.

There are a lot of algorithms for this:

  • One Hot Encoding
  • Term Frequency-Inverse Document Frequency
  • Tokenization (Text to Sequence)

For this purpose, tokenization is the preferred option, and you will see why shortly.

Tokenization: assigning a number to each unique word in the corpus is called tokenization.

Example: Let’s assume we have a training set with three training examples: ["What is your name", "how are you", "where are you"]. If we tokenize this data, the result is:

What: 1, is: 2, your: 3, name: 4, how: 5, are: 6, you: 7, where: 8

Tokenized form of the first sentence: [1, 2, 3, 4]
Tokenized form of the second sentence: [5, 6, 7]
Tokenized form of the third sentence: [8, 6, 7]
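A minimal sketch of this tokenization in plain Python (lower-casing and whitespace splitting are simplifying assumptions) reproduces the mapping above:

sentences = ["What is your name", "how are you", "where are you"]

# Assign a number to each unique word, in order of first appearance.
word_index = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]
print(word_index)   # {'what': 1, 'is': 2, 'your': 3, 'name': 4, 'how': 5, 'are': 6, 'you': 7, 'where': 8}
print(sequences)    # [[1, 2, 3, 4], [5, 6, 7], [8, 6, 7]]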

Now that the data is pre-processed, let’s move on to the forward pass.

Also Read: What is Tokenization in NLP?

Forward Propagation

In our training set, each column of the embedding matrix represents a word. We manually pick N, which is the size of each word vector. The following example assumes a vocabulary size of 1000 and an N of 15.

Consider the following example:

Whenever we tokenize a word, we assign it a number. The tokenized representation of “The weather is nice” might therefore look like [123, 554, 792, 205].

When this array of tokens is passed into the network for the forward pass, we look up the corresponding columns of the embedding matrix, which has 1000 columns (one per vocabulary word). For the input [123, 554, 792, 205], these are columns 123, 554, 792, and 205.

Each of these columns has 15 rows (N). The four columns are then stacked on top of each other, flattening the four vectors into a single tensor of size 15 × 4.

After being flattened, the tensor is passed to an RNN or dense layer to generate a prediction.
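Here is a minimal sketch of this forward pass using Keras layers, assuming a vocabulary of 1000 words, N = 15, and a softmax prediction head added purely as an example:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1001, output_dim=15),  # token id -> 15-dim embedding column
    tf.keras.layers.Flatten(),                                  # stack the 4 vectors into one tensor of size 60
    tf.keras.layers.Dense(1001, activation="softmax"),          # example prediction head over the vocabulary
])

tokens = np.array([[123, 554, 792, 205]])   # one tokenized sentence
prediction = model(tokens)
print(prediction.shape)                      # (1, 1001)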


Continuous Bag-of-Words model (CBOW)

CBOW predicts the probability of a word occurring given the words surrounding it, producing a probability distribution over the vocabulary. The context can be a single word or a group of words, but for simplicity we will take a single context word and try to predict a single target word.

The English language contains almost 1.2 million words, so we cannot include them all in our example. Instead, we’ll consider a small example with only four words: live, home, they, and at. For simplicity, we’ll assume the corpus contains only one sentence: ‘They live at home’.

First, we convert each word into its one-hot encoded form. We also don’t feed the whole sentence at once, but only the words inside a window. For a window size of three, for example, we consider three words at a time: the middle word is to be predicted, and the surrounding two words are fed into the neural network as context. The window is then slid along the sentence and the process is repeated.

Finally, after training the network repeatedly by sliding the window as described above, we obtain the weights, which we use to get the embeddings.

Usually, we take a window size of around 8-10 words and have a vector size of 300.
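A toy CBOW sketch for the four-word corpus above could look like the following; the word ids, window handling, and hyperparameters are illustrative assumptions rather than values from the text:

import numpy as np
import tensorflow as tf

vocab_size = 4      # they, live, at, home
embedding_dim = 2   # deliberately tiny for illustration

cbow = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),                # average the context embeddings
    tf.keras.layers.Dense(vocab_size, activation="softmax"), # predict the centre word
])
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# (context words) -> centre word pairs from "they live at home", window size 3.
# Assumed ids: they=0, live=1, at=2, home=3.
contexts = np.array([[0, 2], [1, 3]])   # (they, at) and (live, home)
targets = np.array([1, 2])              # live and at

cbow.fit(contexts, targets, epochs=50, verbose=0)

# The trained weights of the Embedding layer are the word embeddings.
embeddings = cbow.layers[0].get_weights()[0]   # shape (4, 2)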

Skip-gram model

The skip-gram model architecture tries to achieve the reverse of what the CBOW model does: it tries to predict the source context words (surrounding words) given a target word (the centre word).

The skip-gram model works much like CBOW; the difference lies in the architecture of its neural network and in the way the weight matrix is generated.

After obtaining the weight matrix, the steps to get the word embeddings are the same as in CBOW.

So which of the two algorithms should we use when implementing a Word2Vec model? For a large corpus with higher-dimensional vectors, skip-gram usually works better but is slower to train, whereas CBOW works better for a small corpus and is faster to train.
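With the gensim library, switching between the two variants is a single parameter (sg); the toy corpus and hyperparameters below are illustrative only:

from gensim.models import Word2Vec

corpus = [["they", "live", "at", "home"],
          ["where", "are", "you"]]

cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # sg=0: CBOW
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

print(cbow_model.wv["home"].shape)   # (50,)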


Neural Language Model

Word embeddings were proposed by Bengio et al. (2001, 2003) to tackle what’s known as the curse of dimensionality, a common problem in statistical language modelling.

Bengio’s method, known as a distributed representation of words, trained a neural network so that each training sentence informed the model about semantically related neighboring words. In addition to establishing relationships between different words, the neural network preserved both semantic and syntactic relationships.

It was from this work that a neural network architecture approach was developed, which formed the foundation for many approaches used today.

This neural network has the following components:

  • An embedding layer that generates the word embeddings; its parameters are shared across words.
  • One or more hidden layers that introduce non-linearity.
  • A softmax function that produces a probability distribution over all the vocabulary words (a minimal sketch of these components follows this list).
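A minimal Keras sketch of these three components, assuming a fixed context of three previous words and illustrative hyperparameters, might look like this:

import tensorflow as tf

vocab_size, embedding_dim, context_size = 10000, 100, 3   # assumed values

nnlm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),    # shared embedding parameters
    tf.keras.layers.Flatten(),                                # concatenate the context embeddings
    tf.keras.layers.Dense(128, activation="tanh"),            # hidden layer introduces non-linearity
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # distribution over the vocabulary
])

nnlm.build(input_shape=(None, context_size))
nnlm.summary()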

We have understood the following so far:

  • The neural network language model (NNLM) or Bengio’s model outperforms earlier statistical models like the n-gram model.
  • Through its distributed representation, NNLM overcomes the curse of dimensionality and preserves contextual, linguistic regularities and patterns.
  • NNLM is computationally intensive.
  • The Word2Vec model reduces computational complexity by removing the hidden layer and sharing the weights.
  • Although Word2Vec is not a deep neural network, it can be trained on very large numbers of examples and computes very accurate high-dimensional word vectors.
  • CBOW and skip-gram are Word2Vec’s two model variants; CBOW is faster to train than skip-gram.
  • There is a technique in Natural Language Processing called latent Dirichlet allocation (LDA) that allows observations to be explained by unobserved “groups”.

Using Word Embeddings

There are several options for using word embeddings in your NLP projects.

Learn an Embedding

  • The word embedding you choose may depend on your problem.
  • For embeddings to be learned, a large amount of text data is needed, such as millions or billions of words.
  • When training your word embedding, there are two main options:
    • Learn it standalone, where a model is trained to learn an embedding, which is saved and later used as part of another model for your task. This is a good approach if you want to reuse the same embedding across multiple models.
    • Learn it jointly, where the embedding is learned as part of a larger task-specific model. This is a good approach if you only intend to use the embedding on one task.

Reuse an Embedding

  • Researchers commonly make pre-trained word vectors freely available, often under a permissive license, so you can use them in your own academic or commercial projects.
  • For example, both Word2Vec and GloVe word vectors are available for free download (as well as pre-trained models).
  • You can use these pre-trained embeddings instead of training your own.
  • There are two main ways to use pre-trained embeddings:
    • Static, where the embeddings are kept fixed and used as a component of your model. This is a suitable strategy if the embedding is well suited to your problem and gives good performance.
    • Updated, where the model is initialized with the pre-trained embedding, but the embedding is fine-tuned during training of the network. This is a good option if you want to squeeze the best results out of the model; both options are sketched after this list.
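A sketch of both options in Keras follows; the random matrix merely stands in for real pre-trained vectors (for example, loaded from a GloVe file), and the sizes are assumptions:

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 1000, 50
# Placeholder standing in for real pre-trained vectors.
pretrained_matrix = np.random.rand(vocab_size, embedding_dim)

# Static: the pre-trained vectors are frozen during training.
static_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=False,
)

# Updated: start from the pre-trained vectors and fine-tune them with the model.
updated_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=True,
)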

Which Option Should You Use?

Consider all the options, and if possible, test them to find out which gives you the best results.

Try a pre-trained embedding first, and train new embeddings only if they improve performance. The distribution of your dataset is critical for good-quality results, and computational complexity should be kept in mind when deciding on an approach.

The numerical representation, vocabulary size, and neural network architecture are all important for keeping performance within your benchmark. Keeping the vocabulary size in check is a popular technique, especially when the dimensionality of the embedding space, the training objectives, and the computational cost all need to be balanced.

Also Read: AI Search Prediction for Online Dictionaries.

Conclusion

Word embeddings are an important part of text interpretation. User data privacy and openness also matter here: when embeddings are trained on user data, access should be limited to the model’s outputs, on a need-to-know basis.

The datasets you train on (and any linked datasets) should be cleaned as much as possible for bias and language issues before training, since the distribution of the data strongly affects the quality of the results. Accurate word embeddings give you better models of your data, help you avoid expensive computation, and help keep your approach on the right track.

Word embeddings opened up new avenues in NLP research and development. Although these models work well, they still lack true conceptual understanding.
