
UER: An Open-Source Toolkit for Pre-training Models

Zhe Zhao1,2, Hui Chen2, Jinbin Zhang2, Xin Zhao1, Tao Liu1,
Wei Lu1, Xi Chen3, Haotang Deng2, Qi Ju2,*, Xiaoyong Du1
1 School of Information and DEKE, MOE, Renmin University of China, Beijing, China
2 Tencent AI Lab
3 School of Electronics Engineering and Computer Science, Peking University, Beijing, China
* Corresponding author.
{helloworld, zhaoxinruc, tliu, lu-wei, duyong}@ruc.edu.cn
{chenhuichen, westonzhang, haotangdeng, damonju}@tencent.com
{mrcx}@pku.edu.cn

Abstract

Existing works, including ELMO and BERT, have revealed the importance of pre-training for NLP tasks. While there does not exist a single pre-training model that works best in all cases, it is necessary to develop a framework that can deploy various pre-training models efficiently. For this purpose, we propose an assemble-on-demand pre-training toolkit, namely Universal Encoder Representations (UER). UER is loosely coupled and encapsulated with rich modules. By assembling modules on demand, users can either reproduce a state-of-the-art pre-training model or develop a pre-training model that remains unexplored. With UER, we have built a model zoo, which contains pre-trained models based on different corpora, encoders, and targets (objectives). With proper pre-trained models, we could achieve new state-of-the-art results on a range of downstream datasets.

1 Introduction

Pre-training has been well recognized as an essential step for NLP tasks since it results in remarkable improvements on a range of downstream datasets (Devlin et al., 2018). Instead of training models on a specific task from scratch, pre-training models are first trained on general-domain corpora and then fine-tuned on downstream tasks. Thus far, a large number of works have been proposed for finding better pre-training models. Existing pre-training models mainly differ in the following three aspects:

1) Model encoder. Commonly-used encoders include RNN (Hochreiter and Schmidhuber, 1997), CNN (Kim, 2014), AttentionNN (Bahdanau et al., 2014), and their combinations (Zhou et al., 2016). Recently, Transformer (a structure based on AttentionNN) has been shown to be a more powerful feature extractor compared with other encoders (Vaswani et al., 2017).

2) Pre-training target (objective). Using a proper target is one of the keys to the success of pre-training. While the language model is most commonly used (Radford et al., 2018), many works focus on seeking better targets such as the masked language model (cloze test) (Devlin et al., 2018) and machine translation (McCann et al., 2017).

3) Fine-tuning strategy. Using a proper fine-tuning strategy is also important to the performance of pre-training models on downstream tasks. A commonly-used strategy is to regard pre-trained models as feature extractors (Kiros et al., 2015).

Table 1 lists 8 popular pre-training models and their main differences (Kiros et al., 2015; Logeswaran and Lee, 2018; McCann et al., 2017; Conneau et al., 2017; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). In addition to encoder, target, and fine-tuning strategy, the corpus is also listed in Table 1 as an important factor for pre-training models.
Model | Corpus | Encoder | Target
Skip-thoughts | Bookcorpus | GRU | Conditioned LM
Quick-thoughts | Bookcorpus + UMBC corpus | GRU | Sentence prediction
CoVe | English-German | Bi-LSTM | Machine translation
Infersent | Natural language inference | LSTM; GRU; CNN; LSTM+Attention | Classification
ELMO | 1Billion benchmark | Bi-LSTM | Language model
ULMFiT | Wikipedia | LSTM | Language model
GPT | Bookcorpus; 1Billion benchmark | Transformer | Language model
BERT | Wikipedia + Bookcorpus | Transformer | Cloze + sentence prediction

Table 1: 8 pre-training models and their differences. Due to the space constraint of the table, the fine-tuning strategies of the different models are described as follows: Skip-thoughts, Quick-thoughts, and Infersent regard pre-trained models as feature extractors; the parameters before the output layer are frozen. CoVe and ELMO transfer word embeddings to downstream tasks, with the other parameters in the neural networks uninitialized. ULMFiT, GPT, and BERT fine-tune entire networks on downstream tasks.

There are many open-source implementations of pre-training models, such as Google BERT (https://github.com/google-research/bert), ELMO from AllenAI (https://github.com/allenai/bilm-tf), and GPT and BERT from HuggingFace (https://github.com/huggingface). However, these works usually focus on the designs of either one or a few pre-training models. Due to the diversity of downstream tasks and the constraint of computational resources, there does not exist a single pre-training model that works best in all cases. BERT is one of the most widely used pre-training models. It exploits two unsupervised targets for pre-training, but in some scenarios supervised information is critical to the performance of downstream tasks (Conneau et al., 2017; McCann et al., 2017). Besides, in many cases BERT is excluded due to its efficiency issue. For the above reasons, it is often the case that one should adopt different pre-training models in different application scenarios.

In this work, we introduce UER, a general framework that facilitates the development of various pre-training models. UER maintains model modularity and supports research extensibility. It consists of 4 components: subencoder, encoder, target, and downstream task fine-tuning. The architecture of UER (pre-training part) is shown in Figure 1. Ample modules are implemented in each component. Users can assemble different modules to implement existing models such as BERT (right part in Figure 1), or develop a new pre-training model by implementing customized modules. Clear and robust interfaces allow users to assemble (or add) modules with as few restrictions as possible.

[Figure 1: The architecture of UER (pre-training part); combining modules in UER can implement the BERT model. The diagram stacks four layers: a corpus layer (general-domain corpora such as Wikipedia and Bookcorpus, plus supervised task corpora), a subencoder layer (GRU, LSTM, and CNN with mean or max pooling), an encoder layer (RNN, CNN, RecNN, AttentionNN, Transformer self-attention, and their combinations), and a target layer (unsupervised targets: language model, cloze test, autoencoder, sentence prediction; supervised targets: machine translation, classification).]

With the help of UER, we build a Chinese pre-trained model zoo based on different corpora, encoders, and targets. Different datasets have their own characteristics, and selecting proper models from the model zoo can largely boost the performance on downstream datasets. In this work, we use Google BERT as the baseline model. We provide some use cases based on UER, and the results show that our models can either achieve new state-of-the-art performance or achieve competitive results with an efficient running speed.

UER is built on PyTorch and supports a distributed training mode. Clear instructions and documentation are provided to help users read and use the UER code. The UER toolkit and the model zoo are publicly available at https://github.com/dbiir/UER-py.

2 Related Work

2.1 Pre-training for deep neural networks

Using word embeddings to initialize a neural network's first layer is one of the most commonly used strategies for NLP tasks (Mikolov et al., 2013; Kim, 2014). Inspired by the success of word embeddings, some recent works try to initialize entire networks (not just the first layer) with pre-trained parameters (Howard and Ruder, 2018; Radford et al., 2018). They train a deep neural network upon a large corpus, and fine-tune the pre-trained model on specific downstream tasks. One of the most influential works among them is BERT (Devlin et al., 2018). BERT extracts text features with 12/24 Transformer layers, and exploits the masked language model task and the sentence prediction task as training targets (objectives).
The drawback of BERT is that it requires expensive computational resources. Thankfully, Google makes its pre-trained models publicly available, so we can directly fine-tune Google's models to achieve competitive results on many NLP tasks.

2.2 NLP toolkits

Many NLP models have tens of hyper-parameters and various tricks, some of which exert large impacts on final performance. It is often impractical to report all details and their effects in a research paper, which may lead to a huge gap between research papers and code implementations. To solve this problem, some works implement a class of models within a single framework. This type of work includes OpenNMT (Klein et al., 2017) and fairseq (Ott et al., 2019) for neural machine translation; glyph (Zhang and LeCun, 2017) for classification; NCRF++ (Yang and Zhang, 2018) for sequence labeling; and Hyperwords (Levy et al., 2015) and ngram2vec (Zhao et al., 2017) for word embedding, to name a few.

Recently, we have witnessed many influential pre-training works such as GPT, ULMFiT, and BERT. We think it could be useful to develop a framework that facilitates reproducing and refining those models. UER provides the flexibility of building pre-training models of different properties.

3 Architecture

In this section, we first introduce the core components in UER and the modules that we have implemented in each component. Figure 1 illustrates UER's framework and its detailed modules (pre-training part). The modular design of UER largely facilitates the use of pre-training models. At the end of this section, we give some case studies to illustrate how to use UER effectively.
3.1 Subencoder

This layer learns word vectors from subword features. For English, we use characters as subword features. For Chinese, we use radical and pinyin as subword features (we do not perform word segmentation on the Chinese corpus; each Chinese character is regarded as a word, and internal structures such as radical and pinyin are regarded as Chinese subword features). As a result, the model can be aware of the internal structures of words. Subword information has been explored in many NLP tasks such as text classification (Zhang and LeCun, 2017) and word embedding (Joulin et al., 2016). In the pre-training literature, ELMO exploits a subencoder layer. In UER, we implement RNN and CNN as subencoders, and use mean pooling or max pooling over hidden states to obtain fixed-length word vectors.

3.2 Encoder

This layer learns features from word vectors. UER implements a series of basic encoders, including LSTM, GRU, CNN, GatedCNN, and AttentionNN. Users can use these basic encoders directly, or use their combinations. The output of an encoder can be fed into another encoder, forming networks of arbitrary depth. UER provides ample examples of combining basic encoders (e.g. CNN + LSTM). Users can also build their custom combinations with basic encoders in UER.

Currently, Transformer (a structure based on multi-headed self-attention) has become a popular text feature extractor and has proven to be effective for many NLP tasks. We implement a Transformer module and integrate it into UER. With the Transformer module, we can implement models such as GPT and BERT easily.

3.3 Target (objective)

Using a suitable target is key to the success of pre-training. Many papers in this field propose their own targets and show their advantages over other ones. UER provides a range of targets. Users can choose one of them, or use multiple targets and give them different weights. In this section we introduce the targets implemented in UER.

• Language model (LM). The language model is one of the most commonly used targets. It trains the model to predict the current word given the previous words.

• Masked LM (MLM, also known as cloze test). The model is trained to predict a masked word given its surrounding words. MLM utilizes both left and right contexts to predict words, whereas LM only considers the left context.

• Autoencoder (AE). The model is trained to reconstruct the input sequence as closely as possible.

The above targets are related to word prediction; we call them word-level targets.
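To make the cloze behavior concrete, the following is a minimal PyTorch sketch of an MLM-style masking step and its loss. It assumes the common 15% masking ratio and a single [MASK] replacement strategy (Devlin et al., 2018) rather than UER's exact implementation; the function and tensor names are illustrative.

    import torch
    import torch.nn.functional as F

    def mask_tokens(token_ids, mask_id, mask_prob=0.15):
        # token_ids: LongTensor of shape [batch_size, seq_length].
        # Randomly pick ~15% of the positions as prediction targets.
        chosen = torch.rand(token_ids.shape) < mask_prob
        labels = token_ids.clone()
        labels[~chosen] = -100              # ignored by the loss below
        inputs = token_ids.clone()
        inputs[chosen] = mask_id            # replace chosen tokens with [MASK]
        return inputs, labels

    def mlm_loss(logits, labels):
        # logits: [batch_size, seq_length, vocab_size]; the loss is
        # computed only at the masked positions (label -100 is ignored).
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), ignore_index=-100)

An LM target, by contrast, predicts each next word from the left context only and needs no masking step.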
Some works show that introducing a sentence-level task into the targets can benefit pre-training models (Logeswaran and Lee, 2018; Devlin et al., 2018).

• Next sentence prediction (NSP). The model is trained to predict whether two sentences are continuous. The sentence prediction target is much more efficient than word-level targets: it involves neither sequential decoding of words nor a softmax layer over the entire vocabulary.

The above targets are unsupervised tasks (also known as self-supervised tasks). However, supervised tasks can provide additional knowledge that a raw corpus cannot provide.

• Neural machine translation (NMT). CoVe (McCann et al., 2017) proposes to use NMT to pre-train a model. The implementation of the NMT target is similar to the autoencoder: both involve encoding source sentences and sequentially decoding the words of target sentences.

• Classification (CLS). Infersent (Conneau et al., 2017) proposes to use the natural language inference task (three-way classification) to pre-train a model.

Most pre-training models use the above targets individually, but it is worth trying to use multiple targets at the same time. Some targets are complementary to each other, e.g. a word-level target and a sentence-level target (Devlin et al., 2018), or an unsupervised target and a supervised target. In the experiments section, we demonstrate that proper selection of the target is important. UER provides users with the flexibility to try different targets and their combinations.
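As an illustration of such a combination, the sketch below sums two target losses with user-chosen weights, e.g. a word-level MLM loss and a sentence-level NSP loss. The target call signature and the helper name are assumptions for the sketch, not UER's actual interface.

    # Illustrative sketch: weighted combination of pre-training targets.
    def combined_loss(targets_and_weights, encoder_output, batch):
        # targets_and_weights: list of (target_module, weight) pairs,
        # where each target module returns a scalar loss.
        total = 0.0
        for target, weight in targets_and_weights:
            total = total + weight * target(encoder_output, batch)
        return total

    # e.g. loss = combined_loss([(mlm_target, 1.0), (nsp_target, 0.5)],
    #                           encoder_output, batch)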

3.4 Fine-tuning

UER exploits a fine-tuning strategy similar to ULMFiT, GPT, and BERT. Models on downstream tasks share structures and parameters with the pre-training models, except that they have different target layers. The entire model is fine-tuned on downstream tasks. This strategy performs robustly in practice. We also find that the feature-extractor strategy produces inferior results for models such as GPT and BERT.

Most pre-training works involve 2 stages: pre-training and fine-tuning. UER supports 3 stages: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 enables models to become familiar with the distribution of the downstream dataset (Howard and Ruder, 2018; Radford et al., 2018). It is also called the semi-supervised fine-tuning strategy in the work of Dai and Le (2015), since stage 2 is unsupervised and stage 3 is supervised.
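As a sketch of this strategy, a downstream classifier can wrap the pre-trained embedding and encoder and replace only the target layer with a task-specific output layer; the whole model is then updated during fine-tuning. This is an illustrative PyTorch fragment, not UER's actual fine-tuning code, and the module call signatures are assumptions.

    import torch.nn as nn

    class DownstreamClassifier(nn.Module):
        # Shares structure and parameters with the pre-training model,
        # except that the target layer is replaced by an output layer.
        def __init__(self, embedding, encoder, hidden_size, labels_num):
            super().__init__()
            self.embedding = embedding      # pre-trained, further fine-tuned
            self.encoder = encoder          # pre-trained, further fine-tuned
            self.output_layer = nn.Linear(hidden_size, labels_num)  # new layer

        def forward(self, token_ids):
            hidden = self.encoder(self.embedding(token_ids))
            # Take the first position as the sequence representation;
            # mean pooling over positions is an alternative choice.
            return self.output_layer(hidden[:, 0, :])

In the 3-stage variant, the same pre-trained modules would additionally be trained on the downstream corpus with an unsupervised target (stage 2) before this supervised fine-tuning (stage 3).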
3.5 Case Studies

In this section, we show how UER facilitates the use of pre-training models. First of all, we demonstrate that UER can build most pre-training models easily. As shown in the following code, only a few lines are required to construct models with the interfaces in UER.

    # Implementation of BERT.
    embedding = BertEmbedding(args, vocab_size)
    encoder = BertEncoder(args)
    target = BertTarget(args, vocab_size)

    # Implementation of GPT.
    embedding = BertEmbedding(args, vocab_size)
    encoder = GptEncoder(args)
    target = LmTarget(args, vocab_size)

    # Implementation of Quick-thoughts.
    embedding = Embedding(args, vocab_size)
    encoder = GruEncoder(args)
    target = NspTarget(args, None)

    # Implementation of InferSent.
    embedding = Embedding(args, vocab_size)
    encoder = LstmEncoder(args)
    target = ClsTarget(args, None)
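The same interfaces can also be assembled into combinations that are not listed above. For example, a ULMFiT-style LSTM language model would presumably look like the following sketch, which reuses only the module names, args, and vocab_size from the listing above.

    # A ULMFiT-style LSTM language model (illustrative combination).
    embedding = Embedding(args, vocab_size)
    encoder = LstmEncoder(args)
    target = LmTarget(args, vocab_size)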
In practice, users can assemble different subencoder, encoder, and target modules without any code work: modules can be specified through the options –subencoder, –encoder, and –target. More details are available in the quickstart and instructions of UER's github project. UER provides ample modules, and users can try different module combinations according to their downstream datasets. Besides the modules implemented in UER, users can also develop their customized modules and integrate them into UER seamlessly.

4 Experiments

To evaluate the performance of UER, experiments are conducted on a range of datasets, each of which falls into one of four categories: sentence classification, sentence pair classification, sequence labeling, and document-based QA. The BERT-base uncased English model and the BERT-base Chinese model are used as baseline models. In Section 4.1, UER is tested on several evaluation benchmarks to demonstrate that it can produce models as intended. In Section 4.2, we apply pre-trained models from our model zoo to different downstream datasets; significant improvements are witnessed when proper encoders and targets are selected. Due to space constraints, we put some contents in UER's github project, including dataset and corpus details, system speed, and part of the qualitative/quantitative evaluation results.
Implementation | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI
HuggingFace | 93.0 | 83.8 | 89.4 | 90.7 | 84.0/84.4 | 89.0 | 61.0 | 53.5
UER | 92.4 | 83.0 | 89.3 | 91.0 | 84.0/84.0 | 91.5 | 66.8 | 56.3

Table 2: The performance of HuggingFace's implementation and UER's implementation on the GLUE benchmark.

Implementation | XNLI | LCQMC | MSRA-NER | ChnSentiCorp | nlpcc-dbqa
ERNIE | 77.2 | 87.0 | 92.6 | 94.3 | 94.6
UER | 77.5 | 86.6 | 93.6 | 94.3 | 94.6

Table 3: The performance of ERNIE's implementation and UER's implementation on the ERNIE benchmark.

4.1 Reproducibility

This section uses English/Chinese benchmarks to test the BERT implementation of UER. For English, we use the sentence and sentence-pair classification datasets in the GLUE benchmark (dev set) (Wang et al., 2019). For Chinese, we use five datasets of different types: sentiment analysis, sequence labeling, question pair matching, natural language inference, and document-based QA (provided by ERNIE, https://github.com/PaddlePaddle/ERNIE). Tables 2 and 3 compare UER's performance to other publicly available systems. We can observe that UER matches the performance of HuggingFace's and ERNIE's implementations. The results of HuggingFace and ERNIE are reported on their github projects; the results of UER can be reproduced by the scripts in UER's github project.

4.2 Influence of targets and encoders

In this section, we give some examples of selecting pre-trained models given downstream datasets. Three Chinese sentiment analysis datasets are used for evaluation: the Douban book review, Online shopping review, and ChnSentiCorp datasets.

First of all, we use UER to pre-train on a large-scale Amazon review corpus with different targets. The parameters are initialized with the BERT-base Chinese model. The target of the original BERT consists of MLM and NSP. However, NSP is not suitable for sentence-level reviews (we would have to split reviews into multiple parts), so we remove the NSP target. In addition, Amazon reviews are attached with users' ratings, so we can also exploit the CLS target for pre-training (similar to InferSent). We fine-tune these pre-trained models (with different targets) on the downstream datasets. The results are shown in Table 4. The BERT baseline (BERT-base Chinese) is pre-trained upon Chinese Wikipedia. We can observe that pre-training on the Amazon review corpus improves the results significantly, and using the CLS target achieves the best results in most cases.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
MLM | 88.1 | 97.0 | 95.0
CLS | 88.3 | 97.0 | 95.8

Table 4: Performance of pre-training models with different targets.

BERT requires heavy computational resources. To achieve better efficiency, we use UER to substitute the 12-layer Transformer encoder with a 2-layer LSTM encoder (embedding size and hidden size are 512 and 1024). We still use the above sentiment analysis datasets for evaluation. The model is first trained on a mixed large corpus with the LM target, and then trained on the large-scale Amazon review corpus with the LM and CLS targets. Table 5 lists the results of the different encoders. Compared with the BERT baseline, the LSTM encoder can achieve comparable or even better results when proper corpora and targets are selected.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
LSTM | 80.3 | 94.0 | 88.3
LSTM+pre-training | 86.5 | 96.9 | 94.5

Table 5: Performance of pre-training models with different encoders.

Due to space constraints, this section only uses sentiment analysis datasets as examples to analyze the influence of different targets and encoders. More tasks and pre-trained models are discussed in UER's github project.
5 Conclusion

This paper describes UER, an open-source toolkit for pre-training on general-domain corpora and fine-tuning on downstream tasks. We demonstrate that UER can largely facilitate the implementation of different pre-training models. With the help of UER, we pre-train models based on different corpora, encoders, and targets, and make these models publicly available. By using proper pre-trained models, we can achieve significant improvements over BERT, or achieve competitive results with an efficient training speed.

Acknowledgments

This work is supported by National Natural Science Foundation of China Grants No. U1711262 and No. 61472428, and the 2018 Tencent Rhino-Bird Elite Training Program.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In NIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. arXiv preprint arXiv:1806.05626.

Xiang Zhang and Yann LeCun. 2017. Which encoding is the best for text classification in Chinese, English, Japanese and Korean?

Zhe Zhao, Tao Liu, Shen Li, Bofang Li, and Xiaoyong Du. 2017. Ngram2vec: Learning improved word representations from ngram co-occurrence statistics. In EMNLP.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In ACL, volume 2.