fasttext is a Python interface for Facebook fastText.
fasttext supports Python 2.6 or newer. It requires Cython in order to compile the C++ extension.
pip install fasttext

This package has two main use cases: word representation learning and text classification.
These use cases are described in papers [1] and [2].
In order to learn word vectors, as described in [1], we can use the fasttext.skipgram and fasttext.cbow functions as follows:
import fasttext
# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # list of words in dictionary
# CBOW model
model = fasttext.cbow('data.txt', 'model')
print(model.words)  # list of words in dictionary

where data.txt is a training file containing UTF-8 encoded text.
By default the word vectors will take into account character n-grams from
3 to 6 characters.
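To make the subword idea concrete, here is a sketch in plain Python of how character n-grams can be enumerated for a single word. The boundary symbols < and > and the 3-to-6 range follow [1]; the helper name char_ngrams is ours, not part of the fasttext API, and the real implementation also hashes these n-grams into buckets:

import fasttext  # not needed for this sketch; shown for context

def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in boundary symbols, as described in [1].
    w = '<' + word + '>'
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# Restricting to trigrams shows the boundary-marked subwords of 'where':
print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']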
At the end of optimization the program will save two files:
model.bin and model.vec.
model.vec is a text file containing the word vectors, one per line.
model.bin is a binary file containing the parameters of the model
along with the dictionary and all hyper parameters.
The binary file can be used later to compute word vectors or to restart the optimization.
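Since model.vec is plain text, it can also be read without fasttext at all. The sketch below assumes the word2vec-style text format (a "count dim" header line, then one word per line followed by its vector components); the helper name load_vec and the sample data are ours:

def load_vec(lines):
    # Parse a word2vec-style text format: a "count dim" header,
    # then "word v1 v2 ... vdim" on each following line.
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return vectors

# A tiny made-up two-word file with 3-dimensional vectors:
sample = ['2 3', 'king 0.1 0.2 0.3', 'queen 0.4 0.5 0.6']
vecs = load_vec(sample)
print(vecs['king'])  # [0.1, 0.2, 0.3]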
The following fasttext(1) commands are equivalent:
# Skipgram model
./fasttext skipgram -input data.txt -output model
# CBOW model
./fasttext cbow -input data.txt -output model

The previously trained model can be used to compute word vectors for out-of-vocabulary words.
print(model.get_vector('king'))  # get the vector of the word 'king'

The following fasttext(1) command is equivalent:

echo "king" | ./fasttext print-vectors model.bin

This will output the vector of the word king to standard output.
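A common use of word vectors is comparing words by cosine similarity. A minimal sketch in plain Python, using made-up vectors where real code would use the values returned for each word:

import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for the vectors of 'king' and 'queen':
king = [0.1, 0.2, 0.3]
queen = [0.1, 0.2, 0.25]
print(cosine(king, queen))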
We can use fasttext.load_model to load a pre-trained model:
model = fasttext.load_model('model.bin')
print(model.words)  # list of words in dictionary
print(model.get_vector('king'))  # get the vector of the word 'king'

Work in progress
import fasttext
model = fasttext.skipgram(params)
model.words
model.get_vector(word)
model = fasttext.cbow(params)
model.words
model.get_vector(word)
model = fasttext.load_model('model.bin')
model.words
model.get_vector(word)

List of params and their default values:
input training file path
output output file path
lr learning rate [0.05]
dim size of word vectors [100]
ws size of the context window [5]
epoch number of epochs [5]
min_count minimal number of word occurrences [1]
neg number of negatives sampled [5]
word_ngrams max length of word ngram [1]
loss loss function {ns, hs, softmax} [ns]
bucket number of buckets [2000000]
minn min length of char ngram [3]
maxn max length of char ngram [6]
thread number of threads [12]
verbose how often to print to stdout [10000]
t sampling threshold [0.0001]
silent suppress the log from the C++ extension [1]
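For reference, the defaults above collected into a dict, with a hypothetical call showing how they might be passed as keyword arguments (a sketch only: the call is commented out because it needs a real training file, and we are assuming skipgram accepts these names as keywords):

# Default values copied from the list above.
params = dict(lr=0.05, dim=100, ws=5, epoch=5, min_count=1, neg=5,
              word_ngrams=1, loss='ns', bucket=2000000, minn=3, maxn=6,
              thread=12, verbose=10000, t=0.0001, silent=1)

# Hypothetical call, assuming keyword-argument support:
# model = fasttext.skipgram('data.txt', 'model', **params)
print(sorted(params))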
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
(* These authors contributed equally.)
- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group: https://groups.google.com/forum/#!forum/fasttext-library