
UER: An Open-Source Toolkit for Pre-training Models

Zhe Zhao1,2, Hui Chen2, Jinbin Zhang2, Xin Zhao1, Tao Liu1,
Wei Lu1, Xi Chen3, Haotang Deng2, Qi Ju2,*, Xiaoyong Du1
1 School of Information and DEKE, MOE, Renmin University of China, Beijing, China
2 Tencent AI Lab
3 School of Electronics Engineering and Computer Science, Peking University, Beijing, China
* Corresponding author.
{helloworld, zhaoxinruc, tliu, lu-wei, duyong}@ruc.edu.cn
{chenhuichen, westonzhang, haotangdeng, damonju}@tencent.com
{mrcx}@pku.edu.cn

Abstract

Existing works, including ELMO and BERT, have revealed the importance of pre-training for NLP tasks. While there does not exist a single pre-training model that works best in all cases, it is necessary to develop a framework that can deploy various pre-training models efficiently. For this purpose, we propose an assemble-on-demand pre-training toolkit, namely Universal Encoder Representations (UER). UER is loosely coupled and encapsulated with rich modules. By assembling modules on demand, users can either reproduce a state-of-the-art pre-training model or develop a pre-training model that remains unexplored. With UER, we have built a model zoo, which contains pre-trained models based on different corpora, encoders, and targets (objectives). With proper pre-trained models, we could achieve new state-of-the-art results on a range of downstream datasets.

1 Introduction

Pre-training has been well recognized as an essential step for NLP tasks since it results in remarkable improvements on a range of downstream datasets (Devlin et al., 2018). Instead of training models on a specific task from scratch, pre-training models are first trained on general-domain corpora and then fine-tuned on downstream tasks. Thus far, a large number of works have been proposed for finding better pre-training models. Existing pre-training models mainly differ in the following three aspects:

1) Model encoder. Commonly-used encoders include RNN (Hochreiter and Schmidhuber, 1997), CNN (Kim, 2014), AttentionNN (Bahdanau et al., 2014), and their combinations (Zhou et al., 2016). Recently, Transformer (a structure based on AttentionNN) has been shown to be a more powerful feature extractor compared with other encoders (Vaswani et al., 2017).

2) Pre-training target (objective). Using a proper target is one of the keys to the success of pre-training. While the language model is most commonly used (Radford et al., 2018), many works focus on seeking better targets such as the masked language model (cloze test) (Devlin et al., 2018) and machine translation (McCann et al., 2017).

3) Fine-tuning strategy. Using a proper fine-tuning strategy is also important to the performance of pre-training models on downstream tasks. A commonly-used strategy is to regard pre-trained models as feature extractors (Kiros et al., 2015).

Table 1 lists 8 popular pre-training models and their main differences (Kiros et al., 2015; Logeswaran and Lee, 2018; McCann et al., 2017; Conneau et al., 2017; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). In addition to encoder, target, and fine-tuning strategy, the corpus is also listed in Table 1 as an important factor for pre-training models.
Model | Corpus | Encoder | Target
Skip-thoughts | Bookcorpus | GRU | Conditioned LM
Quick-thoughts | Bookcorpus + UMBC corpus | GRU | Sentence prediction
CoVe | English-German | Bi-LSTM | Machine translation
Infersent | Natural language inference | LSTM; GRU; CNN; LSTM+Attention | Classification
ELMO | 1Billion benchmark | Bi-LSTM | Language model
ULMFiT | Wikipedia | LSTM | Language model
GPT | Bookcorpus; 1Billion benchmark | Transformer | Language model
BERT | Wikipedia + Bookcorpus | Transformer | Cloze + sentence prediction

Table 1: 8 pre-training models and their differences. Due to the space constraint of the table, the fine-tuning strategies of the different models are described as follows: Skip-thoughts, Quick-thoughts, and Infersent regard pre-trained models as feature extractors; the parameters before the output layer are frozen. CoVe and ELMO transfer word embeddings to downstream tasks, with the other parameters in the neural networks uninitialized. ULMFiT, GPT, and BERT fine-tune entire networks on downstream tasks.

There are many open-source implementations of pre-training models, such as Google BERT (https://github.com/google-research/bert), ELMO from AllenAI (https://github.com/allenai/bilm-tf), and GPT and BERT from HuggingFace (https://github.com/huggingface). However, these works usually focus on the designs of either one or a few pre-training models. Due to the diversity of downstream tasks and the constraint of computational resources, there does not exist a single pre-training model that works best in all cases. BERT is one of the most widely used pre-training models. It exploits two unsupervised targets for pre-training, but in some scenarios supervised information is critical to the performance of downstream tasks (Conneau et al., 2017; McCann et al., 2017). Besides, in many cases BERT is excluded due to its efficiency issue. For the above reasons, it is often the case that one should adopt different pre-training models in different application scenarios.

In this work, we introduce UER, a general framework that facilitates the development of various pre-training models. UER maintains model modularity and supports research extensibility. It consists of 4 components: subencoder, encoder, target, and downstream task fine-tuning. The architecture of UER (pre-training part) is shown in Figure 1. Ample modules are implemented in each component. Users can assemble different modules to implement existing models such as BERT (right part in Figure 1), or develop a new pre-training model by implementing customized modules. Clear and robust interfaces allow users to assemble (or add) modules with as few restrictions as possible.

[Figure 1: The architecture of UER (pre-training part); combining modules in UER can implement the BERT model. The diagram stacks four layers: a corpus layer (general-domain corpora such as Wikipedia and Bookcorpus, plus supervised task corpora), a subencoder layer (GRU, LSTM, and CNN with mean or max pooling), an encoder layer (RNN, CNN, RecNN, AttentionNN, Transformer self-attention, and their combinations), and a target layer (unsupervised targets: language model, cloze test, autoencoder, sentence prediction; supervised targets: machine translation, classification).]

With the help of UER, we build a Chinese pre-trained model zoo based on different corpora, encoders, and targets. Different datasets have their own characteristics, and selecting proper models from the model zoo can largely boost the performance on downstream datasets. In this work, we use Google BERT as the baseline model. We provide some use cases based on UER, and the results show that our models can either achieve new state-of-the-art performance or achieve competitive results with an efficient running speed.

UER is built on PyTorch and supports a distributed training mode. Clear instructions and documentation are provided to help users read and use the UER code. The UER toolkit and the model zoo are publicly available at https://github.com/dbiir/UER-py.

2 Related Work

2.1 Pre-training for deep neural networks

Using word embeddings to initialize a neural network's first layer is one of the most commonly used strategies for NLP tasks (Mikolov et al., 2013; Kim, 2014). Inspired by the success of word embeddings, some recent works try to initialize entire networks (not just the first layer) with pre-trained parameters (Howard and Ruder, 2018; Radford et al., 2018). They train a deep neural network upon a large corpus, and fine-tune the pre-trained model on specific downstream tasks. One of the most influential works among them is BERT (Devlin et al., 2018). BERT extracts text features with 12/24 Transformer layers, and exploits the masked language model task and the sentence prediction task as training targets (objectives).
The drawback of BERT is that it requires expensive computational resources. Thankfully, Google makes its pre-trained models publicly available, so we can directly fine-tune Google's models to achieve competitive results on many NLP tasks.

2.2 NLP toolkits

Many NLP models have tens of hyper-parameters and various tricks, some of which exert large impacts on final performance. It is often impractical to report all details and their effects in a research paper, which may lead to a huge gap between research papers and code implementations. To solve this problem, some works implement a class of models within a single framework. This type of work includes OpenNMT (Klein et al., 2017) and fairseq (Ott et al., 2019) for neural machine translation; glyph (Zhang and LeCun, 2017) for classification; NCRF++ (Yang and Zhang, 2018) for sequence labeling; and Hyperwords (Levy et al., 2015) and ngram2vec (Zhao et al., 2017) for word embedding, to name a few.

Recently, we have witnessed many influential pre-training works such as GPT, ULMFiT, and BERT. We think it could be useful to develop a framework that facilitates reproducing and refining those models. UER provides the flexibility of building pre-training models of different properties.

3 Architecture

In this section, we first introduce the core components in UER and the modules that we have implemented in each component. Figure 1 illustrates UER's framework and its detailed modules (pre-training part). The modular design of UER largely facilitates the use of pre-training models. At the end of this section, we give some case studies to illustrate how to use UER effectively.
3.1 Subencoder

This layer learns word vectors from subword features. For English, we use characters as subword features. For Chinese, we use radical and pinyin as subword features (we do not perform word segmentation on the Chinese corpus; each Chinese character is regarded as a word, and internal structures such as radical and pinyin are regarded as Chinese subword features). As a result, the model can be aware of the internal structures of words. Subword information has been explored in many NLP tasks such as text classification (Zhang and LeCun, 2017) and word embedding (Joulin et al., 2016). In the pre-training literature, ELMO exploits a subencoder layer. In UER, we implement RNN and CNN as subencoders, and use mean pooling or max pooling over hidden states to obtain fixed-length word vectors.

3.2 Encoder

This layer learns features from word vectors. UER implements a series of basic encoders, including LSTM, GRU, CNN, GatedCNN, and AttentionNN. Users can use these basic encoders directly, or use their combinations. The output of an encoder can be fed into another encoder, forming networks of arbitrary depth. UER provides ample examples of combining basic encoders (e.g. CNN + LSTM). Users can also build their custom combinations with basic encoders in UER.

Currently, Transformer (a structure based on multi-headed self-attention) has become a popular text feature extractor and has proven to be effective for many NLP tasks. We implement a Transformer module and integrate it into UER. With the Transformer module, we can implement models such as GPT and BERT easily.

3.3 Target (objective)

Using a suitable target is key to the success of pre-training. Many papers in this field propose their own targets and show their advantages over other ones. UER provides a range of targets. Users can choose one of them, or use multiple targets and give them different weights. In this section we introduce the targets implemented in UER.

• Language model (LM). The language model is one of the most commonly used targets. It trains the model to predict the current word given the previous words.

• Masked LM (MLM, also known as cloze test). The model is trained to predict a masked word given its surrounding words. MLM utilizes both left and right contexts to predict words, whereas LM only considers the left context.

• Autoencoder (AE). The model is trained to reconstruct the input sequence as closely as possible.

The above targets are related to word prediction; we call them word-level targets.
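To make the cloze behavior concrete, the following is a minimal PyTorch sketch of an MLM-style masking step and its loss. It assumes the common 15% masking ratio and a single [MASK] replacement strategy (Devlin et al., 2018) rather than UER's exact implementation; the function and tensor names are illustrative.

    import torch
    import torch.nn.functional as F

    def mask_tokens(token_ids, mask_id, mask_prob=0.15):
        # token_ids: LongTensor of shape [batch_size, seq_length].
        # Randomly pick ~15% of the positions as prediction targets.
        chosen = torch.rand(token_ids.shape) < mask_prob
        labels = token_ids.clone()
        labels[~chosen] = -100              # ignored by the loss below
        inputs = token_ids.clone()
        inputs[chosen] = mask_id            # replace chosen tokens with [MASK]
        return inputs, labels

    def mlm_loss(logits, labels):
        # logits: [batch_size, seq_length, vocab_size]; the loss is
        # computed only at the masked positions (label -100 is ignored).
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), ignore_index=-100)

An LM target, by contrast, predicts each next word from the left context only and needs no masking step.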
Some works show that introducing a sentence-level task into the targets can benefit pre-training models (Logeswaran and Lee, 2018; Devlin et al., 2018).

• Next sentence prediction (NSP). The model is trained to predict whether two sentences are continuous. The sentence prediction target is much more efficient than word-level targets: it involves neither sequential decoding of words nor a softmax layer over the entire vocabulary.

The above targets are unsupervised tasks (also known as self-supervised tasks). However, supervised tasks can provide additional knowledge that a raw corpus cannot provide.

• Neural machine translation (NMT). CoVe (McCann et al., 2017) proposes to use NMT to pre-train a model. The implementation of the NMT target is similar to the autoencoder: both involve encoding source sentences and sequentially decoding the words of target sentences.

• Classification (CLS). Infersent (Conneau et al., 2017) proposes to use the natural language inference task (three-way classification) to pre-train a model.

Most pre-training models use the above targets individually, but it is worth trying to use multiple targets at the same time. Some targets are complementary to each other, e.g. a word-level target and a sentence-level target (Devlin et al., 2018), or an unsupervised target and a supervised target. In the experiments section, we demonstrate that proper selection of the target is important. UER provides users with the flexibility to try different targets and their combinations.
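As an illustration of such a combination, the sketch below sums two target losses with user-chosen weights, e.g. a word-level MLM loss and a sentence-level NSP loss. The target call signature and the helper name are assumptions for the sketch, not UER's actual interface.

    # Illustrative sketch: weighted combination of pre-training targets.
    def combined_loss(targets_and_weights, encoder_output, batch):
        # targets_and_weights: list of (target_module, weight) pairs,
        # where each target module returns a scalar loss.
        total = 0.0
        for target, weight in targets_and_weights:
            total = total + weight * target(encoder_output, batch)
        return total

    # e.g. loss = combined_loss([(mlm_target, 1.0), (nsp_target, 0.5)],
    #                           encoder_output, batch)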

3.4 Fine-tuning

UER exploits a fine-tuning strategy similar to ULMFiT, GPT, and BERT. Models on downstream tasks share structures and parameters with the pre-training models, except that they have different target layers. The entire model is fine-tuned on downstream tasks. This strategy performs robustly in practice. We also find that the feature-extractor strategy produces inferior results for models such as GPT and BERT.

Most pre-training works involve 2 stages: pre-training and fine-tuning. UER supports 3 stages: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 enables models to become familiar with the distribution of the downstream dataset (Howard and Ruder, 2018; Radford et al., 2018). It is also called the semi-supervised fine-tuning strategy in the work of Dai and Le (2015), since stage 2 is unsupervised and stage 3 is supervised.
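As a sketch of this strategy, a downstream classifier can wrap the pre-trained embedding and encoder and replace only the target layer with a task-specific output layer; the whole model is then updated during fine-tuning. This is an illustrative PyTorch fragment, not UER's actual fine-tuning code, and the module call signatures are assumptions.

    import torch.nn as nn

    class DownstreamClassifier(nn.Module):
        # Shares structure and parameters with the pre-training model,
        # except that the target layer is replaced by an output layer.
        def __init__(self, embedding, encoder, hidden_size, labels_num):
            super().__init__()
            self.embedding = embedding      # pre-trained, further fine-tuned
            self.encoder = encoder          # pre-trained, further fine-tuned
            self.output_layer = nn.Linear(hidden_size, labels_num)  # new layer

        def forward(self, token_ids):
            hidden = self.encoder(self.embedding(token_ids))
            # Take the first position as the sequence representation;
            # mean pooling over positions is an alternative choice.
            return self.output_layer(hidden[:, 0, :])

In the 3-stage variant, the same pre-trained modules would additionally be trained on the downstream corpus with an unsupervised target (stage 2) before this supervised fine-tuning (stage 3).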
3.5 Case Studies

In this section, we show how UER facilitates the use of pre-training models. First of all, we demonstrate that UER can build most pre-training models easily. As shown in the following code, only a few lines are required to construct models with the interfaces in UER.

    # Implementation of BERT.
    embedding = BertEmbedding(args, vocab_size)
    encoder = BertEncoder(args)
    target = BertTarget(args, vocab_size)

    # Implementation of GPT.
    embedding = BertEmbedding(args, vocab_size)
    encoder = GptEncoder(args)
    target = LmTarget(args, vocab_size)

    # Implementation of Quick-thoughts.
    embedding = Embedding(args, vocab_size)
    encoder = GruEncoder(args)
    target = NspTarget(args, None)

    # Implementation of InferSent.
    embedding = Embedding(args, vocab_size)
    encoder = LstmEncoder(args)
    target = ClsTarget(args, None)
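The same interfaces can also be assembled into combinations that are not listed above. For example, a ULMFiT-style LSTM language model would presumably look like the following sketch, which reuses only the module names, args, and vocab_size from the listing above.

    # A ULMFiT-style LSTM language model (illustrative combination).
    embedding = Embedding(args, vocab_size)
    encoder = LstmEncoder(args)
    target = LmTarget(args, vocab_size)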
In practice, users can assemble different subencoder, encoder, and target modules without any code work: modules can be specified through the options –subencoder, –encoder, and –target. More details are available in the quickstart and instructions of UER's github project. UER provides ample modules, and users can try different module combinations according to their downstream datasets. Besides the modules implemented in UER, users can also develop their customized modules and integrate them into UER seamlessly.

4 Experiments

To evaluate the performance of UER, experiments are conducted on a range of datasets, each of which falls into one of four categories: sentence classification, sentence pair classification, sequence labeling, and document-based QA. The BERT-base uncased English model and the BERT-base Chinese model are used as baseline models. In Section 4.1, UER is tested on several evaluation benchmarks to demonstrate that it can produce models as intended. In Section 4.2, we apply pre-trained models from our model zoo to different downstream datasets; significant improvements are witnessed when proper encoders and targets are selected. Due to space constraints, we put some contents in UER's github project, including dataset and corpus details, system speed, and part of the qualitative/quantitative evaluation results.
Implementation | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI
HuggingFace | 93.0 | 83.8 | 89.4 | 90.7 | 84.0/84.4 | 89.0 | 61.0 | 53.5
UER | 92.4 | 83.0 | 89.3 | 91.0 | 84.0/84.0 | 91.5 | 66.8 | 56.3

Table 2: The performance of HuggingFace's implementation and UER's implementation on the GLUE benchmark.

Implementation | XNLI | LCQMC | MSRA-NER | ChnSentiCorp | nlpcc-dbqa
ERNIE | 77.2 | 87.0 | 92.6 | 94.3 | 94.6
UER | 77.5 | 86.6 | 93.6 | 94.3 | 94.6

Table 3: The performance of ERNIE's implementation and UER's implementation on the ERNIE benchmark.

4.1 Reproducibility

This section uses English/Chinese benchmarks to test the BERT implementation of UER. For English, we use the sentence and sentence-pair classification datasets in the GLUE benchmark (dev set) (Wang et al., 2019). For Chinese, we use five datasets of different types: sentiment analysis, sequence labeling, question pair matching, natural language inference, and document-based QA (provided by ERNIE, https://github.com/PaddlePaddle/ERNIE). Tables 2 and 3 compare UER's performance to other publicly available systems. We can observe that UER matches the performance of HuggingFace's and ERNIE's implementations. The results of HuggingFace and ERNIE are reported on their github projects; the results of UER can be reproduced by the scripts in UER's github project.

4.2 Influence of targets and encoders

In this section, we give some examples of selecting pre-trained models given downstream datasets. Three Chinese sentiment analysis datasets are used for evaluation: the Douban book review, Online shopping review, and ChnSentiCorp datasets.

First of all, we use UER to pre-train on a large-scale Amazon review corpus with different targets. The parameters are initialized with the BERT-base Chinese model. The target of the original BERT consists of MLM and NSP. However, NSP is not suitable for sentence-level reviews (we would have to split reviews into multiple parts), so we remove the NSP target. In addition, Amazon reviews are attached with users' ratings, so we can also exploit the CLS target for pre-training (similar to InferSent). We fine-tune these pre-trained models (with different targets) on the downstream datasets. The results are shown in Table 4. The BERT baseline (BERT-base Chinese) is pre-trained upon Chinese Wikipedia. We can observe that pre-training on the Amazon review corpus improves the results significantly, and using the CLS target achieves the best results in most cases.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
MLM | 88.1 | 97.0 | 95.0
CLS | 88.3 | 97.0 | 95.8

Table 4: Performance of pre-training models with different targets.

BERT requires heavy computational resources. To achieve better efficiency, we use UER to substitute the 12-layer Transformer encoder with a 2-layer LSTM encoder (embedding size and hidden size are 512 and 1024). We still use the above sentiment analysis datasets for evaluation. The model is first trained on a mixed large corpus with the LM target, and then trained on the large-scale Amazon review corpus with the LM and CLS targets. Table 5 lists the results of the different encoders. Compared with the BERT baseline, the LSTM encoder can achieve comparable or even better results when proper corpora and targets are selected.

Dataset | Douban. | Shopping. | Chn.
BERT baseline | 87.5 | 96.3 | 94.3
LSTM | 80.3 | 94.0 | 88.3
LSTM+pre-training | 86.5 | 96.9 | 94.5

Table 5: Performance of pre-training models with different encoders.

Due to space constraints, this section only uses sentiment analysis datasets as examples to analyze the influence of different targets and encoders. More tasks and pre-trained models are discussed in UER's github project.
5 Conclusion

This paper describes UER, an open-source toolkit for pre-training on general-domain corpora and fine-tuning on downstream tasks. We demonstrate that UER can largely facilitate the implementation of different pre-training models. With the help of UER, we pre-train models based on different corpora, encoders, and targets, and make these models publicly available. By using proper pre-trained models, we can achieve significant improvements over BERT, or achieve competitive results with an efficient training speed.

Acknowledgments

This work is supported by National Natural Science Foundation of China Grants No. U1711262 and No. 61472428, and the 2018 Tencent Rhino-Bird Elite Training Program.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In NIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. arXiv preprint arXiv:1806.05626.

Xiang Zhang and Yann LeCun. 2017. Which encoding is the best for text classification in Chinese, English, Japanese and Korean?

Zhe Zhao, Tao Liu, Shen Li, Bofang Li, and Xiaoyong Du. 2017. Ngram2vec: Learning improved word representations from ngram co-occurrence statistics. In EMNLP.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In ACL, volume 2.