UER: An Open-Source Toolkit For Pre-Training Models
Zhe Zhao1,2, Hui Chen2, Jinbin Zhang2, Xin Zhao1, Tao Liu1, Wei Lu1, Xi Chen3, Haotang Deng2, Qi Ju2*, Xiaoyong Du1
1 School of Information and DEKE, MOE, Renmin University of China, Beijing, China
2 Tencent AI Lab
3 School of Electronics Engineering and Computer Science, Peking University, Beijing, China
{helloworld, zhaoxinruc, tliu, lu-wei, duyong}@ruc.edu.cn
{chenhuichen, westonzhang, haotangdeng, damonju}@tencent.com
{mrcx}@pku.edu.cn
Model | Corpus | Encoder | Target
Skip-thoughts | Bookcorpus | GRU | Conditioned LM
Quick-thoughts | Bookcorpus+UMBCcorpus | GRU | Sentence prediction
CoVe | English-German | Bi-LSTM | Machine translation
Infersent | Natural language inference | LSTM; GRU; CNN; LSTM+Attention | Classification
ELMO | 1billion benchmark | Bi-LSTM | Language model
ULMFiT | Wikipedia | LSTM | Language model
GPT | Bookcorpus; 1billion benchmark | Transformer | Language model
BERT | Wikipedia+Bookcorpus | Transformer | Cloze+sentence prediction

Table 1: 8 pre-training models and their differences. Because of space constraints, the fine-tuning strategies of the models are described here instead: Skip-thoughts, Quick-thoughts, and Infersent regard pre-trained models as feature extractors, so the parameters before the output layer are frozen; CoVe and ELMO transfer word embeddings to downstream tasks, with the other parameters of the network left uninitialized; ULMFiT, GPT, and BERT fine-tune the entire network on downstream tasks.
masked language model task and sentence prediction task as training targets (objectives). The drawback of BERT is that it requires expensive computational resources. Thankfully, Google makes its pre-trained models publicly available, so we can directly fine-tune on Google's models to achieve competitive results on many NLP tasks.

tasks such as text classification (Zhang and LeCun, 2017) and word embedding (Joulin et al., 2016). In the pre-training literature, ELMO exploits a subencoder layer. In UER, we implement RNN and CNN subencoders and use mean pooling or max pooling over their hidden states to obtain fixed-length word vectors.
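To illustrate how a subencoder turns a variable-length subword sequence (here, characters) into a fixed-length word vector through pooling, here is a minimal sketch; it is not UER's actual implementation, and the class name, vocabulary size, and layer sizes are assumptions.

import torch
import torch.nn as nn

class CharRnnSubencoder(nn.Module):
    # Hypothetical character-level subencoder: embed characters, run a GRU,
    # then pool over character positions to get one vector per word.
    def __init__(self, char_vocab_size=128, char_emb_size=32, hidden_size=64):
        super().__init__()
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_size)
        self.rnn = nn.GRU(char_emb_size, hidden_size, batch_first=True)

    def forward(self, char_ids):
        # char_ids: [num_words, max_chars_per_word]
        hidden, _ = self.rnn(self.char_embedding(char_ids))
        # Max pooling over hidden states yields fixed-length word vectors;
        # mean pooling (hidden.mean(dim=1)) is the alternative mentioned above.
        word_vectors, _ = hidden.max(dim=1)
        return word_vectors  # [num_words, hidden_size]

subencoder = CharRnnSubencoder()
words = torch.randint(0, 128, (4, 6))  # 4 words, each padded to 6 characters
print(subencoder(words).shape)         # torch.Size([4, 64])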
show that introducing sentence-level tasks into the targets can benefit pre-training models (Logeswaran and Lee, 2018; Devlin et al., 2018).

• Next sentence prediction (NSP). The model is trained to predict whether two sentences are contiguous. The sentence prediction target is much more efficient than word-level targets: it involves neither sequentially decoding words nor a softmax layer over the entire vocabulary.

The above targets are unsupervised tasks (also known as self-supervised tasks). However, supervised tasks can provide additional knowledge that a raw corpus cannot.

• Neural machine translation (NMT). CoVe (McCann et al., 2017) proposes to use NMT to pre-train models. The implementation of the NMT target is similar to that of the autoencoder: both involve encoding source sentences and sequentially decoding the words of target sentences.

• Classification (CLS). Infersent (Conneau et al., 2017) proposes to use the natural language inference task (three-way classification) to pre-train models.

Most pre-training models use the above targets individually, but it is worth trying multiple targets at the same time. Some targets are complementary to each other, e.g., word-level and sentence-level targets (Devlin et al., 2018), or unsupervised and supervised targets. In the experiments section, we demonstrate that proper selection of the target is important. UER gives users the flexibility to try different targets and their combinations.
3.4 Fine-tuning

UER adopts a fine-tuning strategy similar to those of ULMFiT, GPT, and BERT. Models on downstream tasks share structures and parameters with the pre-training models, except for their target layers, and the entire model is fine-tuned on the downstream task. This strategy performs robustly in practice. We also find that the feature extractor strategy produces inferior results for models such as GPT and BERT.

Most pre-training work involves two stages, pre-training and fine-tuning. UER instead supports three stages: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 lets models become familiar with the distribution of the downstream dataset (Howard and Ruder, 2018; Radford et al., 2018). It is also called the semi-supervised fine-tuning strategy in the work of Dai and Le (2015), since stage 2 is unsupervised and stage 3 is supervised.

3.5 Case Studies

In this section, we show how UER facilitates the use of pre-training models. First of all, we demonstrate that UER can build most pre-training models easily. As shown in the following code, only a few lines are required to construct models with the interfaces in UER.

# Implementation of BERT.
embedding = BertEmbedding(args, vocab_size)
encoder = BertEncoder(args)
target = BertTarget(args, vocab_size)

# Implementation of GPT.
embedding = BertEmbedding(args, vocab_size)
encoder = GptEncoder(args)
target = LmTarget(args, vocab_size)

# Implementation of Quick-thoughts.
embedding = Embedding(args, vocab_size)
encoder = GruEncoder(args)
target = NspTarget(args, None)

# Implementation of InferSent.
embedding = Embedding(args, vocab_size)
encoder = LstmEncoder(args)
target = ClsTarget(args, None)
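To make explicit how the three pieces above interact during pre-training, the following is a hypothetical wrapper (UER's real glue code and call signatures may differ): the embedding maps token ids to vectors, the encoder produces contextual hidden states, and the target computes the pre-training loss from those states.

import torch.nn as nn

class PretrainingModel(nn.Module):
    # Hypothetical composition of the modules built in the listing above.
    def __init__(self, embedding, encoder, target):
        super().__init__()
        self.embedding = embedding
        self.encoder = encoder
        self.target = target

    def forward(self, src, tgt, seg):
        emb = self.embedding(src, seg)    # token ids (+ segment ids) -> embeddings
        hidden = self.encoder(emb, seg)   # embeddings -> contextual hidden states
        loss = self.target(hidden, tgt)   # hidden states + labels -> pre-training loss
        return loss

For instance, model = PretrainingModel(embedding, encoder, target) built from the BERT modules above would return the pre-training loss for a batch inside the training loop.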
In practice, users can assemble different subencoder, encoder, and target modules without writing any code: modules are specified through the options --subencoder, --encoder, and --target. More details are available in the quickstart and instructions of UER's github project. UER provides ample modules, and users can try different module combinations according to their downstream datasets. Besides the modules implemented in UER, users can also develop customized modules and integrate them into UER seamlessly.

4 Experiments

To evaluate the performance of UER, experiments are conducted on a range of datasets, each of which falls into one of four categories: sentence classification, sentence pair classification, sequence labeling, and document-based QA. The BERT-base uncased English model and the BERT-base Chinese model are used as baselines. In Section 4.1, UER is tested on several evaluation benchmarks to demonstrate that it can produce models as intended. In Section 4.2, we apply pre-trained models from our model zoo to different downstream datasets; significant improvements are observed when proper encoders and targets are selected. Due to space constraints, we put some content in UER's github project, including dataset and corpus details, system speed, and part of the qualitative/quantitative evaluation results.
Implementation | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI
HuggingFace | 93.0 | 83.8 | 89.4 | 90.7 | 84.0/84.4 | 89.0 | 61.0 | 53.5
UER | 92.4 | 83.0 | 89.3 | 91.0 | 84.0/84.0 | 91.5 | 66.8 | 56.3

Table 2: The performance of HuggingFace's implementation and UER's implementation on the GLUE benchmark.
Table 3: The performance of ERNIE’s implementation and UER’s implementation on ERNIE benchmark.
4.1 Reproducibility

This section uses English and Chinese benchmarks to test the BERT implementation in UER. For English, we use the sentence and sentence-pair classification datasets of the GLUE benchmark (dev sets) (Wang et al., 2019). For Chinese, we use five datasets of different types: sentiment analysis, sequence labeling, question pair matching, natural language inference, and document-based QA (provided by ERNIE, https://github.com/PaddlePaddle/ERNIE). Tables 2 and 3 compare UER's performance to other publicly available systems. We observe that UER matches the performance of HuggingFace's and ERNIE's implementations. The results of HuggingFace and ERNIE are reported on their github projects; the results of UER can be reproduced with scripts in UER's github project.

4.2 Influence of targets and encoders

In this section, we give examples of selecting pre-trained models for given downstream datasets. Three Chinese sentiment analysis datasets are used for evaluation: the Douban book review, Online shopping review, and Chnsenticorp datasets.

First of all, we use UER to pre-train on a large-scale Amazon review corpus with different targets, initializing the parameters with the BERT-base Chinese model. The target of the original BERT consists of MLM and NSP. However, NSP is not suitable for sentence-level reviews (we would have to split reviews into multiple parts), so we remove the NSP target. In addition, Amazon reviews come with users' ratings, so we can also exploit the CLS target for pre-training (similar to InferSent). We fine-tune these pre-trained models (with different targets) on the downstream datasets. The results are shown in Table 4. The BERT baseline (BERT-base Chinese) is pre-trained on Chinese Wikipedia. We observe that pre-training on the Amazon review corpus improves the results significantly, and the CLS target achieves the best results in most cases.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
MLM | 88.1 | 97.0 | 95.0
CLS | 88.3 | 97.0 | 95.8

Table 4: Performance of pre-training models with different targets.

BERT requires heavy computational resources. To achieve better efficiency, we use UER to substitute the 12-layer Transformer encoder with a 2-layer LSTM encoder (embedding size 512, hidden size 1024). We still use the above sentiment analysis datasets for evaluation. The model is first trained on a mixed large corpus with the LM target, and then trained on the large-scale Amazon review corpus with the LM and CLS targets. Table 5 lists the results of the different encoders. Compared with the BERT baseline, the LSTM encoder achieves comparable or even better results when proper corpora and targets are selected.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
LSTM | 80.3 | 94.0 | 88.3
LSTM+pre-training | 86.5 | 96.9 | 94.5

Table 5: Performance of pre-training models with different encoders.
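For readers who want a concrete picture of the lightweight setup above, the following is a minimal PyTorch stand-in (not UER's LstmEncoder): a 2-layer LSTM encoder with embedding size 512 and hidden size 1024, the sizes given in the text; the vocabulary size and batch shapes are placeholders.

import torch
import torch.nn as nn

vocab_size = 10000  # placeholder vocabulary size
embedding = nn.Embedding(vocab_size, 512)              # embedding size 512
encoder = nn.LSTM(input_size=512, hidden_size=1024,
                  num_layers=2, batch_first=True)      # 2-layer LSTM, hidden size 1024

token_ids = torch.randint(0, vocab_size, (8, 128))     # batch of 8 reviews, 128 tokens each
hidden_states, _ = encoder(embedding(token_ids))       # [8, 128, 1024]
print(hidden_states.shape)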
Due to space constraints, this section only uses sentiment analysis datasets as examples to analyze the influence of different targets and encoders. More tasks and pre-trained models are discussed in UER's github project.

5 Conclusion

This paper describes UER, an open-source toolkit for pre-training on general-domain corpora and fine-tuning on downstream tasks. We demonstrate that UER greatly facilitates the implementation of different pre-training models. With the help of UER, we pre-train models based on different corpora, encoders, and targets, and make these models publicly available. By using proper pre-trained models, we can achieve significant improvements over BERT, or achieve competitive results with efficient training speed.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants No. U1711262 and No. 61472428, and by the 2018 Tencent Rhino-Bird Elite Training Program.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. arXiv preprint arXiv:1806.05626.

Xiang Zhang and Yann LeCun. 2017. Which encoding is the best for text classification in Chinese, English, Japanese and Korean?

Zhe Zhao, Tao Liu, Shen Li, Bofang Li, and Xiaoyong Du. 2017. Ngram2vec: Learning improved word representations from ngram co-occurrence statistics. In EMNLP.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In ACL, volume 2.