FastText Working and Implementation

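FastText is a word embedding technique developed by Facebook AI Research that represents each word as a set of character-level subwords (character n-grams). Because it learns vectors for these subwords, it can build an embedding for a word it never saw during training, and it captures both semantic and morphological information.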

Uses character level subwords.
Handles out of vocabulary words.
Captures word meaning and structure.
Efficient for large text datasets.

FastText Architecture and Working

FastText extends traditional word embedding models by representing words as collections of character n-grams rather than treating them as single units. This approach helps capture word structure and generate embeddings for unseen words.

Character N-Gram Representation

FastText breaks each word into smaller groups of characters called n-grams. Instead of learning only the whole word, it also learns these smaller character patterns, helping it understand word structure and meaning. Consider the word "running":

3-grams: <ru, run, unn, nni, nin, ing, ng>
4-grams: <run, runn, unni, nnin, ning, ing>
5-grams: <runn, runni, unnin, nning, ning>

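Here:

The symbols < and > are boundary markers that FastText adds to the start and end of the word, so <ru is a 3-gram that begins the word and ng> is one that ends it.
A 3-gram contains 3 consecutive characters, a 4-gram contains 4 and a 5-gram contains 5, counting the boundary markers.
Shared subwords such as run and runn let related words like run, runner and running share part of their representation.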
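The n-gram lists above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than FastText's internal code; char_ngrams is a hypothetical helper name introduced here for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, using < and > as boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the marked-up word
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


print(char_ngrams("running", 3, 3))  # the 3-grams listed above
print(char_ngrams("running", 4, 4))  # the 4-grams listed above

# Subwords that "runner" and "running" have in common
shared = set(char_ngrams("runner")) & set(char_ngrams("running"))
print(sorted(shared))
```

The shared set includes <run and runn, which is why the two words end up with related embeddings.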

Hierarchical Softmax Optimization

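Hierarchical Softmax is an optional optimization technique that FastText can use to speed up training. Instead of scoring a word against every word in the vocabulary, it arranges the vocabulary as a binary tree and makes one binary decision at each node on the path to the target word, so the cost of a prediction grows roughly with the logarithm of the vocabulary size. In Gensim it is enabled with hs=1; the default settings (hs=0, negative=5) use negative sampling instead.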

Reduces training time.
Works efficiently with large vocabularies.
Maintains good prediction performance.
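To make the tree idea concrete, the sketch below builds a Huffman tree from word counts and reports how deep each word sits, which is the number of binary decisions needed to reach it. The counts are toy values chosen for illustration, and huffman_depths is a hypothetical helper, not a Gensim or fastText function.

```python
import heapq
from itertools import count


def huffman_depths(freqs):
    """Return {word: depth} for a Huffman tree built from word frequencies."""
    tie = count()  # tie-breaker so the heap never has to compare word lists
    heap = [(freq, next(tie), [word]) for word, freq in freqs.items()]
    heapq.heapify(heap)
    depth = {word: 0 for word in freqs}
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        for word in words1 + words2:
            depth[word] += 1  # every word in the merged subtrees moves one level down
        heapq.heappush(heap, (f1 + f2, next(tie), words1 + words2))
    return depth


# Toy counts, assumed for illustration only
freqs = {"the": 7, "is": 3, "king": 2, "queen": 1, "runner": 1, "walker": 1}
depths = huffman_depths(freqs)
for word, d in sorted(depths.items(), key=lambda item: item[1]):
    print(word, d)
```

Frequent words such as the sit close to the root, so the predictions made most often need the fewest decisions, while no word needs anywhere near one comparison per vocabulary entry.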

Implementation

Step 1: Installing Required Libraries

Run the following command in your command prompt or terminal:

pip install gensim

Step 2: Import required libraries

Imports the FastText model from Gensim.
Used for training and generating word embeddings.

Python

from gensim.models import FastText

Step 3: Creating Training Data

Creates tokenized sentences for training.
Each sentence is represented as a list of words.
This format is required by Gensim FastText.

Python

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "helps", "the", "king"],
    ["running", "is", "good", "exercise"],
    ["the", "runner", "runs", "fast"],
    ["walking", "is", "healthy", "activity"],
    ["the", "walker", "walks", "slowly"],
    ["reading", "books", "is", "fun"],
    ["the", "reader", "reads", "daily"]
]

print("Training data created successfully")

Output:

Training data created successfully
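Real text usually arrives as raw strings rather than ready-made token lists. The sketch below shows one simple way to produce the list-of-lists format with the standard library; the tokenize helper and its regular expression are assumptions made for illustration, and a production pipeline may need a more careful tokenizer.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


raw_sentences = [
    "The king rules the kingdom.",
    "Running is good exercise!",
]

# One list of tokens per sentence, the shape Gensim expects
tokenized = [tokenize(sentence) for sentence in raw_sentences]
print(tokenized)
```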

Step 4: Training a Basic FastText Model

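vector_size=50 sets the embedding dimension.
window=5 sets the maximum distance between a target word and its context words.
min_count=1 keeps every word, including words that appear only once.
min_n=3 and max_n=6 set the shortest and longest character n-grams.
sg=1 selects Skip-Gram training (sg=0 selects CBOW).
epochs=10 sets the number of passes over the training data.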

Python

model = FastText(
    sentences,
    vector_size=50,
    window=5,
    min_count=1,
    min_n=3,
    max_n=6,
    sg=1,
    epochs=10
)

print("Model trained successfully")

Output:

Model trained successfully
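To see what sg=1 and window mean in practice, the sketch below lists the (target, context) pairs that Skip-Gram training draws from one sentence. It is a simplified illustration, not Gensim's implementation: skipgram_pairs is a hypothetical helper, and real trainers add details such as randomly shrinking the window, which are left out here.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is never its own context
                yield target, tokens[j]


pairs = list(skipgram_pairs(["the", "runner", "runs", "fast"], window=1))
for target, context in pairs:
    print(target, "->", context)
```

With window=1 each word is paired only with its immediate neighbours; a larger window adds more distant words as context.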

Step 5: Getting Word Vectors

Retrieves the embedding vector for a word.
Displays the first few vector values.
Shows the dimensionality of the embedding.

Python

king_vector = model.wv["king"]

print("Vector for 'king':")
print(king_vector[:5])

print("Vector Shape:", king_vector.shape)

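Output:

The five printed values depend on the random initialization and can differ between runs. The vector shape is (50,), matching vector_size.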

Step 6: Handling Unseen Words (OOV)

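One of FastText's major advantages is its ability to generate embeddings for unseen words from their character n-grams. The word "kingdoms" used below does not appear in the training sentences, yet the model still returns a vector for it.

Uses character-level subword information.
Overcomes a major limitation of Word2Vec, which has no vector for a word outside its vocabulary.

Python

oov_word = "kingdoms"  # not present in the training sentences
print("In vocabulary:", oov_word in model.wv.key_to_index)

oov_vector = model.wv[oov_word]  # built from the word's character n-grams
print(f"Vector for '{oov_word}':")
print(oov_vector[:5])

Output:

The first line prints In vocabulary: False. The vector values that follow can differ between runs.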
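The idea behind an out-of-vocabulary vector can be sketched without Gensim: split the word into character n-grams, map each n-gram to a row of a fixed-size table by hashing, and average those rows. Everything below is a toy stand-in chosen for illustration. The table size, the dimension, the use of zlib.crc32 as the hash and the pseudo-random "trained" rows are all assumptions, and the helper names are hypothetical.

```python
import random
import zlib

DIM = 8         # toy embedding size
BUCKETS = 1000  # toy table size, far smaller than real models use


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` with < and > as boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


def bucket_vector(index):
    """Stand-in for a trained n-gram row: pseudo-random but repeatable."""
    rng = random.Random(index)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def oov_vector(word):
    """Average the table rows that the word's n-grams hash into."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        index = zlib.crc32(gram.encode("utf-8")) % BUCKETS  # stand-in hash
        for d, value in enumerate(bucket_vector(index)):
            total[d] += value
    return [value / len(grams) for value in total]


vector = oov_vector("kingdoms")
print(len(char_ngrams("kingdoms")), "n-grams ->", len(vector), "dimensions")
```

Because "kingdoms" shares most of its n-grams with "kingdom", the two words draw on largely the same rows, which is what lets a trained model place an unseen word near its relatives.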

Step 7: Finding Similar Words

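Finds the words whose embeddings are closest to the query word. On a corpus this small, shared character n-grams tend to influence the ranking more than meaning does.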
Uses cosine similarity between embeddings.
Returns the most similar words with scores.

Python

print("Words similar to 'king':")

similar_words = model.wv.most_similar(
    "king",
    topn=3
)

for word, score in similar_words:
    print(word, ":", round(score, 4))

Output:

The three words and their scores depend on the trained weights and can differ between runs.
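The score returned for each word is a cosine similarity. As a minimal sketch of what that number measures, the function below computes it for two plain Python lists; cosine_similarity is a helper written here for illustration and is not the Gensim routine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # perpendicular
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score ignores vector length and depends only on direction, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).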


Applications

Works effectively with multiple languages, especially when training data is limited.
Handles specialized and domain specific vocabulary that may not appear in general text datasets.
Improves text classification by capturing both word meaning and word structure.
Generates meaningful embeddings for unseen or out-of-vocabulary words.
Suitable for real-time NLP applications because training and inference are fast.

Advantages

Generates embeddings for unseen words using character level subword information.
Captures relationships between different forms of a word, such as run, running and runner.
Provides fast training and efficient inference for large text datasets.
Performs well on languages with complex word structures and rich morphology.

Limitations

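Requires more storage than traditional word embedding methods because it keeps a vector for every n-gram bucket in addition to a vector for every word.
Model performance can be sensitive to the choice of n-gram parameters (min_n and max_n).
Assigns each word a single vector regardless of the sentence it appears in, so it does not capture contextual relationships as effectively as transformer-based models such as BERT and GPT.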
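The storage cost can be estimated with simple arithmetic. The sketch below counts only the input embedding tables and assumes 4-byte floats; the vocabulary size of 100,000 is an assumed figure for illustration, while 2,000,000 is the default value of Gensim's bucket parameter. embedding_bytes is a hypothetical helper.

```python
def embedding_bytes(vocab_size, buckets, vector_size, bytes_per_value=4):
    """Approximate size of the word table plus the n-gram bucket table."""
    word_values = vocab_size * vector_size
    ngram_values = buckets * vector_size
    return (word_values + ngram_values) * bytes_per_value


# Word-only table versus word table plus n-gram buckets, both with 50 dimensions
word_only = embedding_bytes(vocab_size=100_000, buckets=0, vector_size=50)
with_subwords = embedding_bytes(vocab_size=100_000, buckets=2_000_000, vector_size=50)

print(f"words only:    {word_only / 1e6:.0f} MB")
print(f"with subwords: {with_subwords / 1e6:.0f} MB")
```

Under these assumptions the n-gram table dominates the total, so lowering bucket or vector_size is the most direct way to shrink a model.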
