Building an LSTM in PyTorch for Sentiment Analysis on the IMDB Dataset (with a Detailed Data Analysis and Processing Walkthrough)

The main difficulty of this project lies in the data processing, so this post focuses on walking through the data-processing pipeline in detail. The model itself is defined quite simply and will be improved in later updates.

1. Dataset Overview

2. Data Processing

First, let's look at what kind of input the LSTM expects. Suppose we have 25,000 sentences in total, each sentence holds 250 words (longer reviews are truncated and shorter ones padded, as described in detail below), and each word is represented by a 50-dimensional vector, so each sentence has shape [250, 50]. If we split the whole training set (25,000 sentences) into 250 batches of 100 sentences each, the full training set has shape [250, 100, 250, 50]: the first 250 is the number of batches, 100 is the number of sentences per batch, the second 250 is the number of words per sentence, and the final 50 is the dimension of each word vector. The rest of this section describes how to build this dataset.
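To make these shapes concrete, here is a minimal sketch (illustrative only, not part of the project code) of how one [100, 250, 50] batch flows through an nn.LSTM with batch_first=True:

import torch
import torch.nn as nn

batch = torch.randn(100, 250, 50)  # one batch: 100 sentences, 250 words, 50-dim word vectors
lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
out, (h, c) = lstm(batch)
print(out.shape)  # torch.Size([100, 250, 64]) -- one hidden state per word
print(h.shape)    # torch.Size([1, 100, 64])  -- final hidden state per sentence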

2.1 Building the Word-Vector Table

  • First we need the 50-dimensional vector for every word. We use the pre-trained GloVe vectors available online:
    [figure: the glove.6B.50d/100d/200d/300d.txt files]
    Each file contains 400,000 lines, and each line holds one word followed by its vector. The first file has 50-dimensional vectors; the remaining files have 100/200/300 dimensions. We read the first file and, from the word and vector on each line, build a word-vector table:




import numpy as np

def load_cab_vector():
    word_list = []
    vocabulary_vectors = []
    data = open('glove.6B.50d.txt', encoding='utf-8')
    for line in data.readlines():
        temp = line.strip('\n').split(' ')  # a list: the word followed by its vector components
        name = temp[0]
        word_list.append(name.lower())
        vector = [temp[i] for i in range(1, len(temp))]  # the vector
        vector = list(map(float, vector))  # convert to floats
        vocabulary_vectors.append(vector)  # store
    vocabulary_vectors = np.array(vocabulary_vectors)
    word_list = np.array(word_list)
    np.save('npys/vocabulary_vectors', vocabulary_vectors)
    np.save('npys/word_list', word_list)
    return vocabulary_vectors, word_list

This gives us a word-vector table made of two lists: word_list holds the 400,000 words, and vocabulary_vectors holds the corresponding 400,000 50-dimensional vectors. Parsing the raw text file is slow, so we convert both lists to arrays and save them with np.save(file), an operation we will reuse often:

vocabulary_vectors = np.array(vocabulary_vectors)
word_list = np.array(word_list)
np.save('npys/vocabulary_vectors', vocabulary_vectors)
np.save('npys/word_list', word_list)
return vocabulary_vectors, word_list

This produces two npy files: vocabulary_vectors.npy and word_list.npy.
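As a quick sanity check (a hypothetical snippet, not from the original post), you can reload the two files and look up a word's vector:

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
idx = word_list.index('movie')      # position of the word in the vocabulary
print(vocabulary_vectors[idx][:5])  # first 5 components of its 50-dim vector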

2.2 Processing the Training and Test Sets

  • Next we process the training and test sets. We read all the files (train + test, 50,000 reviews in total):

import os
import re

def load_data(path, flag='train'):
    labels = ['pos', 'neg']
    data = []
    for label in labels:
        files = os.listdir(os.path.join(path, flag, label))
        # punctuation to strip
        r = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n。!,]+'
        for file in files:
            with open(os.path.join(path, flag, label, file), 'r', encoding='utf8') as rf:
                temp = rf.read().replace('\n', '')
                temp = temp.replace('<br />', ' ')  # IMDB reviews embed HTML line-break tags
                temp = re.sub(r, '', temp)
                temp = temp.split(' ')
                temp = [temp[i].lower() for i in range(len(temp)) if temp[i] != '']
                if label == 'pos':
                    data.append([temp, 1])
                elif label == 'neg':
                    data.append([temp, 0])
    return data

The function returns a list in which every element is itself a list holding the words of one review plus its label (1 for pos, 0 for neg). For example, let's print train_data[0]:

train_data = load_data('Imdb')
print(train_data[0])

The output is:

[['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'highs', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'im', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', 'bromwell', 'high', 'i', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'bromwell', 'high', 'is', 'far', 'fetched', 'what', 'a', 'pity', 'that', 'it', 'isnt'], 1]

As you can see, the first element of the list is the word list and the second is the label.

  • Next we process each sentence, finding the index of each of its words in word_list. For the review above, for example, we look up every word's position in word_list. We fix the maximum sentence length at 250: reviews longer than 250 words are truncated, and shorter ones are padded with 0 at the end:
def process_sentence(flag):
    sentence_code = []
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    word_list = np.load('npys/word_list.npy', allow_pickle=True)
    word_list = word_list.tolist()
    test_data = load_data('Imdb', flag)
    for i in range(len(test_data)):
        vec = test_data[i][0]
        temp = []
        index = 0
        for j in range(len(vec)):
            try:
                index = word_list.index(vec[j])
            except ValueError:  # word not found in the vocabulary
                index = 0
            finally:
                temp.append(index)  # temp holds each word's index in the vocabulary
        if len(temp) < 250:
            for k in range(len(temp), 250):  # pad with 0
                temp.append(0)
        else:
            temp = temp[0:250]  # keep only the first 250 words
        sentence_code.append(temp)
    sentence_code = np.array(sentence_code)
    if flag == 'train':
        np.save('npys/sentence_code_1', sentence_code)
    else:
        np.save('npys/sentence_code_2', sentence_code)

Running this gives us two files, sentence_code_1.npy and sentence_code_2.npy. Each array has shape [25000, 250]: 25,000 sentences, each stored as the word_list indices of its 250 words.
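One practical note: word_list.index() scans the whole vocabulary for every word, so process_sentence is slow. A common speed-up, sketched below under the same file layout (sentence_to_indices is a hypothetical helper, not from the original code), is to build a word-to-index dict once so each lookup is O(1):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
word_to_index = {word: i for i, word in enumerate(word_list)}  # built once

def sentence_to_indices(words, max_len=250):
    # map each word to its vocabulary index; 0 for unknown words, matching the code above
    temp = [word_to_index.get(w, 0) for w in words]
    # truncate long sentences and pad short ones with 0
    return temp[:max_len] + [0] * max(0, max_len - len(temp))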

2.3 Batching

  • Now we batch the data. We split the 25,000 samples into batches of 100 sentences each and, via word_list and vocabulary_vectors, replace each word index with its 50-dimensional vector:
def process_batch(batch_size):
    # (25000, 2): 25,000 sentences, each a [word list, label] pair
    test_data = load_data('Imdb', flag='test')
    train_data = load_data('Imdb')
    # load the per-sentence word indices: (25,000 sentences, 250 words)
    sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
    sentence_code_1 = sentence_code_1.tolist()
    sentence_code_2 = np.load('npys/sentence_code_2.npy', allow_pickle=True)  # 25000 * 250, test set
    sentence_code_2 = sentence_code_2.tolist()
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    vocabulary_vectors = vocabulary_vectors.tolist()
    # after this loop, each sentence_code is 25000 * 250 * 50
    for i in range(25000):
        sentence_code_1[i] = [vocabulary_vectors[x] for x in sentence_code_1[i]]
        sentence_code_2[i] = [vocabulary_vectors[x] for x in sentence_code_2[i]]
    # re-split the data: 40,000 for training, 10,000 for testing
    data = train_data + test_data
    sentence_code = np.r_[sentence_code_1, sentence_code_2]
    # shuffle
    shuffle_ix = np.random.permutation(np.arange(len(data)))
    data = np.array(data)[shuffle_ix].tolist()
    sentence_code = sentence_code[shuffle_ix]
    train_data = data[:int(len(data) * 0.8)]
    test_data = data[int(len(data) * 0.8):]
    sentence_code_1 = sentence_code[:int(len(sentence_code) * 0.8)]
    sentence_code_2 = sentence_code[int(len(sentence_code) * 0.8):]
    labels_train = []
    labels_test = []
    arr_train = []
    arr_test = []
    # mini-batching
    for i in range(1, int(len(train_data) / batch_size) + 1):
        arr_train.append(sentence_code_1[(i - 1) * batch_size:i * batch_size])
        labels_train.append([train_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    for i in range(1, int(len(test_data) / batch_size) + 1):
        arr_test.append(sentence_code_2[(i - 1) * batch_size:i * batch_size])
        labels_test.append([test_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    arr_train = np.array(arr_train)
    arr_test = np.array(arr_test)
    labels_train = np.array(labels_train)
    labels_test = np.array(labels_test)
    np.save('npys/arr_train', arr_train)
    np.save('npys/arr_test', arr_test)
    np.save('npys/labels_train', labels_train)
    np.save('npys/labels_test', labels_test)
    return arr_train, labels_train, arr_test, labels_test

The code above also re-splits the data: 80% for training and 20% for testing. It returns four arrays; arr_train, for example, has shape [400, 100, 250, 50]: 400 batches, 100 sentences per batch, 250 words per sentence, and a 50-dimensional vector per word.
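After running process_batch(100), a quick check (hypothetical snippet) confirms the shapes described above. Bear in mind that arr_train alone holds 400 × 100 × 250 × 50 = 5×10^8 floats, roughly 4 GB as float64, so this step needs a fair amount of RAM and disk space:

import numpy as np

arr_train = np.load('npys/arr_train.npy', allow_pickle=True)
labels_train = np.load('npys/labels_train.npy', allow_pickle=True)
print(arr_train.shape)     # (400, 100, 250, 50)
print(labels_train.shape)  # (400, 100)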

3. Model

3.1 Building the Model

  • Build the LSTM network:
import torch
import torch.nn as nn
import torch.optim as optim

class LSTM(nn.Module):
    def __init__(self, hidden_size):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=50, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(hidden_size, 32),
            nn.Linear(32, 2),
            nn.ReLU()
        )

    def forward(self, input_seq):
        x, _ = self.lstm(input_seq)  # x: [batch, seq_len, hidden_size]
        x = self.fc(x)               # apply the classifier head at every time step
        x = x[:, -1, :]              # keep only the last time step: [batch, 2]
        return x
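As a quick smoke test (hypothetical snippet), we can push a dummy batch through the model to confirm the output shape:

model = LSTM(hidden_size=64)
dummy = torch.randn(100, 250, 50)  # one batch: 100 sentences, 250 words, 50-dim vectors
out = model(dummy)
print(out.shape)  # torch.Size([100, 2]) -- one (neg, pos) score pair per sentence

Two design details worth noting: self.fc is applied at every time step even though only the last step is kept, so applying it to x[:, -1, :] directly would give the same output more cheaply; and the trailing ReLU clamps negative logits to zero, which is an unusual choice in front of CrossEntropyLoss.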

3.2 Training
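Note that the training and test code references a device variable that is not defined in any of the snippets shown; presumably it is set once at the top of the script, along these lines:

import torch

# use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')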

def train():
    print('loading...')
    epoch_num = 10
    arr_train = np.load('npys/arr_train.npy', allow_pickle=True)
    labels_train = np.load('npys/labels_train.npy', allow_pickle=True)
    print('training...')
    model = LSTM(hidden_size=64).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.00005)
    criterion = nn.CrossEntropyLoss().to(device)
    loss = 0
    for i in range(epoch_num):
        for j in range(400):  # 400 batches per epoch
            x = arr_train[j]
            y = labels_train[j]
            input_ = torch.tensor(x, dtype=torch.float32).to(device)
            label = torch.tensor(y, dtype=torch.long).to(device)
            output = model(input_)
            optimizer.zero_grad()
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
        print('epoch:%d loss:%.5f' % (i, loss.item()))  # loss of the last batch in this epoch
    # save the model and optimizer state
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(state, 'models/LSTM.pkl')

3.3 Testing

def test():
    print('loading...')
    arr_test = np.load('npys/arr_test.npy', allow_pickle=True)
    labels_test = np.load('npys/labels_test.npy', allow_pickle=True)
    print('testing...')
    model = LSTM(hidden_size=64).to(device)
    model.load_state_dict(torch.load('models/LSTM.pkl')['model'])
    model.eval()
    num = 0
    for i in range(100):  # 100 test batches
        xx = arr_test[i]
        yy = labels_test[i]
        input_ = torch.tensor(xx, dtype=torch.float32).to(device)
        label = torch.tensor(yy, dtype=torch.long).to(device)
        output = model(input_)
        pred = output.max(dim=-1)[1]  # predicted class for each sentence
        for k in range(100):  # 100 sentences per batch
            if pred[k] == label[k]:
                num += 1
    print('Accuracy:', num / 10000)

After training for 10 epochs, the accuracy is 65%.
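One small improvement worth considering: test() runs the forward pass with gradient tracking enabled. A variant that wraps inference in torch.no_grad() (a sketch, not the original author's code) avoids building the autograd graph, saving memory and a little time:

def evaluate(model, arr_test, labels_test):
    # hypothetical variant of test() that disables gradient tracking during inference
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for xx, yy in zip(arr_test, labels_test):
            input_ = torch.tensor(xx, dtype=torch.float32).to(device)
            label = torch.tensor(yy, dtype=torch.long).to(device)
            pred = model(input_).max(dim=-1)[1]
            correct += (pred == label).sum().item()
            total += label.size(0)
    return correct / total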

4. How to Run the Code

Since many people have asked how to run the code, here is a summary:

(1) Download the source code from the link at the end of this article, then download the data files from the Baidu netdisk link.

(2) Put glove.6B.50d.txt from the netdisk files into the LSTM-IMDB-Classification/ directory.

(3) First run load_cab_vector() to build and save the word-vector table:

if __name__ == '__main__':
    load_cab_vector()
(4) Then generate the sentence index files for both splits:

if __name__ == '__main__':
    # load_cab_vector()
    process_sentence('train')
    process_sentence('test')
(5) Then build the batched arrays:

if __name__ == '__main__':
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    process_batch(100)
(6) Finally, train and test:

if __name__ == '__main__':
    train()
    test()
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    # process_batch(100)

5. Source Code

The source code will be published gradually in later updates~
