Day10 【基于LSTM实现自回归语言模型文本续写任务】

最新推荐文章于 2025-06-27 14:24:00 发布

原创

最新推荐文章于 2025-06-27 14:24:00 发布 · 1.1k 阅读

基于LSTM实现文本续写任务

目标

本文基于给定的词表，将输入的文本以字符分割为若干个词，然后基于词表将词初步序列化作为训练网络的输入序列，将词后面一个词在词表中的序号作为输入标签，取连续序列文本片段长度作为输入序列的长度。之后经过Embedding、LSTM等网络层。因为生成的词是词表中某个词，因此模型输出为已知词表上的多类别概率分布，从而实现一个简单文本的续写任务。
在这里插入图片描述

数据准备

词表文件vocab.txt

语料文件语料训练文件

程序说明

定义模型结构

class LanguageModel(nn.Module):
    def __init__(self, input_dim, vocab):
        super(LanguageModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), input_dim)
        self.layer = nn.LSTM(input_dim, input_dim, num_layers=1, batch_first=True)
        self.classify = nn.Linear(input_dim, len(vocab))
        self.dropout = nn.Dropout(0.1)
        self.loss = nn.functional.cross_entropy

LanguageModel 类继承自 nn.Module，表示一个神经网络模型。
self.embedding: 嵌入层，用于将每个单词映射到固定维度的向量空间。
self.layer: LSTM 层，处理输入的序列数据。
self.classify: 全连接层，用于将 LSTM 的输出映射到词表大小的输出空间，即词汇的预测概率分布。
self.dropout: 丢弃层，避免过拟合。
self.loss: 使用交叉熵损失函数来计算损失。

前向传播

def forward(self, x, y=None):
    x = self.embedding(x)
    x, _ = self.layer(x)
    y_pred = self.classify(x)
    if y is not None:
        return self.loss(y_pred.view(-1, y_pred.shape[-1]), y.view(-1))
    else:
        return torch.softmax(y_pred, dim=-1)

输入 x 是词的索引，首先通过嵌入层转换成向量表示。
x, _ = self.layer(x) 通过 LSTM 层处理输入序列。
然后通过 self.classify 进行预测，得到每个词的概率分布。
如果有真实标签 y，则计算并返回损失；如果没有真实标签，则返回预测的概率分布。

构建词表

def build_vocab(vocab_path):
    vocab = {
   
   "<pad>": 0}
    with open(vocab_path, encoding="utf8") as f:
        for index, line in enumerate(f):
            char = line[:-1]
            vocab[char] = index + 1
    return vocab

从给定的文件 vocab_path 加载词表，词表是一个字典，每个字符对应一个唯一的索引。
词表中包含一个特殊的 <pad> 标记，索引为 0，用于填充。

加载语料

def load_corpus(path):
    corpus = ""
    with open(path, encoding="gbk") as f:
        for line in f:
            corpus += line.strip()
    return corpus

从文件 path 中加载语料，将每行的空格、换行符去掉，并将所有文本连接成一个长字符串。

构建训练样本

def build_sample(vocab, window_size, corpus):
    start = random.randint(0, len(corpus) - 1 - window_size)
    end = start + window_size
    window = corpus[start:end]
    target = corpus[start + 1:end + 1]
    x = [vocab.get(word, vocab["<UNK>"]) for word in window]
    y = [vocab.get(word, vocab["<UNK>"]) for word in target]
    return x, y

从语料中随机选取一个窗口 window_size 长度的子序列，并将其作为输入 x 和目标 y（目标 y 是输入 x 向后移一位的序列）。
每个字符都被映射为词表中的索引。

构建数据集

def build_dataset(sample_length, vocab, window_size, corpus):
    dataset_x = []
    dataset_y = []
    for i in range(sample_length):
        x, y = build_sample(vocab, window_size, corpus)
        dataset_x.append(x)
        dataset_y.append(y)
    return torch.LongTensor(dataset_x), torch.LongTensor(dataset_y)

根据需要的样本数量 sample_length 生成训练数据集 dataset_x 和 dataset_y。
每个样本是一个长度为 window_size 的输入序列和一个对应的目标序列。

训练模型

def train(corpus_path, save_weight=True):
    epoch_num = 10
    batch_size = 64
    train_sample = 50000
    char_dim = 256
    window_size = 10
    vocab = build_vocab("vocab.txt")
    corpus = load_corpus(corpus_path)
    model = build_model(vocab, char_dim)
    if torch.cuda.is_available():
        model = model.cuda()
    optim = torch.optim.Adam(model.parameters(), lr=0.01)
    print("文本词表模型加载完毕，开始训练")
    for epoch in range(epoch_num):
        model.train()
        watch_loss = []
        for batch in range(int(train_sample / batch_size)):
            x, y