Building an LSTM in PyTorch for Sentiment Analysis on the IMDB Dataset (with a Detailed Data Analysis and Processing Walkthrough)

The main difficulty of this project lies in the data processing, so this post focuses on walking through the data-processing pipeline in detail. The model itself is defined quite simply and will be improved in later updates.

1. Dataset Overview

2. Data Processing

First, let's look at what kind of input the LSTM expects. Suppose we have 25,000 sentences in total, each sentence holds 250 words (longer reviews are truncated and shorter ones padded, as described in detail below), and each word is represented by a 50-dimensional vector, so each sentence has shape [250, 50]. If we split the whole training set (25,000 sentences) into 250 batches of 100 sentences each, the full training set has shape [250, 100, 250, 50]: the first 250 is the number of batches, 100 is the number of sentences per batch, the second 250 is the number of words per sentence, and the final 50 is the dimension of each word vector. The rest of this section describes how to build this dataset.
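To make these shapes concrete, here is a minimal sketch (illustrative only, not part of the project code) of how one [100, 250, 50] batch flows through an nn.LSTM with batch_first=True:

import torch
import torch.nn as nn

batch = torch.randn(100, 250, 50)  # one batch: 100 sentences, 250 words, 50-dim word vectors
lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
out, (h, c) = lstm(batch)
print(out.shape)  # torch.Size([100, 250, 64]) -- one hidden state per word
print(h.shape)    # torch.Size([1, 100, 64])  -- final hidden state per sentence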

2.1 Building the Word-Vector Table

  • First we need the 50-dimensional vector for every word. We use the pre-trained GloVe vectors available online:
    [figure: the glove.6B.50d/100d/200d/300d.txt files]
    Each file contains 400,000 lines, and each line holds one word followed by its vector. The first file has 50-dimensional vectors; the remaining files have 100/200/300 dimensions. We read the first file and, from the word and vector on each line, build a word-vector table:




import numpy as np

def load_cab_vector():
    word_list = []
    vocabulary_vectors = []
    data = open('glove.6B.50d.txt', encoding='utf-8')
    for line in data.readlines():
        temp = line.strip('\n').split(' ')  # a list: the word followed by its vector components
        name = temp[0]
        word_list.append(name.lower())
        vector = [temp[i] for i in range(1, len(temp))]  # the vector
        vector = list(map(float, vector))  # convert to floats
        vocabulary_vectors.append(vector)  # store
    vocabulary_vectors = np.array(vocabulary_vectors)
    word_list = np.array(word_list)
    np.save('npys/vocabulary_vectors', vocabulary_vectors)
    np.save('npys/word_list', word_list)
    return vocabulary_vectors, word_list

This gives us a word-vector table made of two lists: word_list holds the 400,000 words, and vocabulary_vectors holds the corresponding 400,000 50-dimensional vectors. Parsing the raw text file is slow, so we convert both lists to arrays and save them with np.save(file), an operation we will reuse often:

vocabulary_vectors = np.array(vocabulary_vectors)
word_list = np.array(word_list)
np.save('npys/vocabulary_vectors', vocabulary_vectors)
np.save('npys/word_list', word_list)
return vocabulary_vectors, word_list

This produces two npy files: vocabulary_vectors.npy and word_list.npy.
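As a quick sanity check (a hypothetical snippet, not from the original post), you can reload the two files and look up a word's vector:

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
idx = word_list.index('movie')      # position of the word in the vocabulary
print(vocabulary_vectors[idx][:5])  # first 5 components of its 50-dim vector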

2.2 Processing the Training and Test Sets

  • Next we process the training and test sets. We read all the files (train + test, 50,000 reviews in total):

import os
import re

def load_data(path, flag='train'):
    labels = ['pos', 'neg']
    data = []
    for label in labels:
        files = os.listdir(os.path.join(path, flag, label))
        # punctuation to strip
        r = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n。!,]+'
        for file in files:
            with open(os.path.join(path, flag, label, file), 'r', encoding='utf8') as rf:
                temp = rf.read().replace('\n', '')
                temp = temp.replace('<br />', ' ')  # IMDB reviews embed HTML line-break tags
                temp = re.sub(r, '', temp)
                temp = temp.split(' ')
                temp = [temp[i].lower() for i in range(len(temp)) if temp[i] != '']
                if label == 'pos':
                    data.append([temp, 1])
                elif label == 'neg':
                    data.append([temp, 0])
    return data

The function returns a list in which every element is itself a list holding the words of one review plus its label (1 for pos, 0 for neg). For example, let's print train_data[0]:

train_data = load_data('Imdb')
print(train_data[0])

The output is:

[['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'highs', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'im', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', 'bromwell', 'high', 'i', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'bromwell', 'high', 'is', 'far', 'fetched', 'what', 'a', 'pity', 'that', 'it', 'isnt'], 1]

As you can see, the first element of the list is the word list and the second is the label.

  • Next we process each sentence, finding the index of each of its words in word_list. For the review above, for example, we look up every word's position in word_list. We fix the maximum sentence length at 250: reviews longer than 250 words are truncated, and shorter ones are padded with 0 at the end:
def process_sentence(flag):
    sentence_code = []
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    word_list = np.load('npys/word_list.npy', allow_pickle=True)
    word_list = word_list.tolist()
    test_data = load_data('Imdb', flag)
    for i in range(len(test_data)):
        vec = test_data[i][0]
        temp = []
        index = 0
        for j in range(len(vec)):
            try:
                index = word_list.index(vec[j])
            except ValueError:  # word not found in the vocabulary
                index = 0
            finally:
                temp.append(index)  # temp holds each word's index in the vocabulary
        if len(temp) < 250:
            for k in range(len(temp), 250):  # pad with 0
                temp.append(0)
        else:
            temp = temp[0:250]  # keep only the first 250 words
        sentence_code.append(temp)
    sentence_code = np.array(sentence_code)
    if flag == 'train':
        np.save('npys/sentence_code_1', sentence_code)
    else:
        np.save('npys/sentence_code_2', sentence_code)

Running this gives us two files, sentence_code_1.npy and sentence_code_2.npy. Each array has shape [25000, 250]: 25,000 sentences, each stored as the word_list indices of its 250 words.
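One practical note: word_list.index() scans the whole vocabulary for every word, so process_sentence is slow. A common speed-up, sketched below under the same file layout (sentence_to_indices is a hypothetical helper, not from the original code), is to build a word-to-index dict once so each lookup is O(1):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
word_to_index = {word: i for i, word in enumerate(word_list)}  # built once

def sentence_to_indices(words, max_len=250):
    # map each word to its vocabulary index; 0 for unknown words, matching the code above
    temp = [word_to_index.get(w, 0) for w in words]
    # truncate long sentences and pad short ones with 0
    return temp[:max_len] + [0] * max(0, max_len - len(temp))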

2.3 Batching

  • Now we batch the data. We split the 25,000 samples into batches of 100 sentences each and, via word_list and vocabulary_vectors, replace each word index with its 50-dimensional vector:
def process_batch(batch_size):
    # (25000, 2): 25,000 sentences, each a [word list, label] pair
    test_data = load_data('Imdb', flag='test')
    train_data = load_data('Imdb')
    # load the per-sentence word indices: (25,000 sentences, 250 words)
    sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
    sentence_code_1 = sentence_code_1.tolist()
    sentence_code_2 = np.load('npys/sentence_code_2.npy', allow_pickle=True)  # 25000 * 250, test set
    sentence_code_2 = sentence_code_2.tolist()
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    vocabulary_vectors = vocabulary_vectors.tolist()
    # after this loop, each sentence_code is 25000 * 250 * 50
    for i in range(25000):
        sentence_code_1[i] = [vocabulary_vectors[x] for x in sentence_code_1[i]]
        sentence_code_2[i] = [vocabulary_vectors[x] for x in sentence_code_2[i]]
    # re-split the data: 40,000 for training, 10,000 for testing
    data = train_data + test_data
    sentence_code = np.r_[sentence_code_1, sentence_code_2]
    # shuffle
    shuffle_ix = np.random.permutation(np.arange(len(data)))
    data = np.array(data)[shuffle_ix].tolist()
    sentence_code = sentence_code[shuffle_ix]
    train_data = data[:int(len(data) * 0.8)]
    test_data = data[int(len(data) * 0.8):]
    sentence_code_1 = sentence_code[:int(len(sentence_code) * 0.8)]
    sentence_code_2 = sentence_code[int(len(sentence_code) * 0.8):]
    labels_train = []
    labels_test = []
    arr_train = []
    arr_test = []
    # mini-batching
    for i in range(1, int(len(train_data) / batch_size) + 1):
        arr_train.append(sentence_code_1[(i - 1) * batch_size:i * batch_size])
        labels_train.append([train_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    for i in range(1, int(len(test_data) / batch_size) + 1):
        arr_test.append(sentence_code_2[(i - 1) * batch_size:i * batch_size])
        labels_test.append([test_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    arr_train = np.array(arr_train)
    arr_test = np.array(arr_test)
    labels_train = np.array(labels_train)
    labels_test = np.array(labels_test)
    np.save('npys/arr_train', arr_train)
    np.save('npys/arr_test', arr_test)
    np.save('npys/labels_train', labels_train)
    np.save('npys/labels_test', labels_test)
    return arr_train, labels_train, arr_test, labels_test

The code above also re-splits the data: 80% for training and 20% for testing. It returns four arrays; arr_train, for example, has shape [400, 100, 250, 50]: 400 batches, 100 sentences per batch, 250 words per sentence, and a 50-dimensional vector per word.
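After running process_batch(100), a quick check (hypothetical snippet) confirms the shapes described above. Bear in mind that arr_train alone holds 400 × 100 × 250 × 50 = 5×10^8 floats, roughly 4 GB as float64, so this step needs a fair amount of RAM and disk space:

import numpy as np

arr_train = np.load('npys/arr_train.npy', allow_pickle=True)
labels_train = np.load('npys/labels_train.npy', allow_pickle=True)
print(arr_train.shape)     # (400, 100, 250, 50)
print(labels_train.shape)  # (400, 100)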

3. Model

3.1 Building the Model

  • Build the LSTM network:
import torch
import torch.nn as nn
import torch.optim as optim

class LSTM(nn.Module):
    def __init__(self, hidden_size):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=50, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(hidden_size, 32),
            nn.Linear(32, 2),
            nn.ReLU()
        )

    def forward(self, input_seq):
        x, _ = self.lstm(input_seq)  # x: [batch, seq_len, hidden_size]
        x = self.fc(x)               # apply the classifier head at every time step
        x = x[:, -1, :]              # keep only the last time step: [batch, 2]
        return x
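As a quick smoke test (hypothetical snippet), we can push a dummy batch through the model to confirm the output shape:

model = LSTM(hidden_size=64)
dummy = torch.randn(100, 250, 50)  # one batch: 100 sentences, 250 words, 50-dim vectors
out = model(dummy)
print(out.shape)  # torch.Size([100, 2]) -- one (neg, pos) score pair per sentence

Two design details worth noting: self.fc is applied at every time step even though only the last step is kept, so applying it to x[:, -1, :] directly would give the same output more cheaply; and the trailing ReLU clamps negative logits to zero, which is an unusual choice in front of CrossEntropyLoss.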

3.2 Training
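Note that the training and test code references a device variable that is not defined in any of the snippets shown; presumably it is set once at the top of the script, along these lines:

import torch

# use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')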

def train():
    print('loading...')
    epoch_num = 10
    arr_train = np.load('npys/arr_train.npy', allow_pickle=True)
    labels_train = np.load('npys/labels_train.npy', allow_pickle=True)
    print('training...')
    model = LSTM(hidden_size=64).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.00005)
    criterion = nn.CrossEntropyLoss().to(device)
    loss = 0
    for i in range(epoch_num):
        for j in range(400):  # 400 batches per epoch
            x = arr_train[j]
            y = labels_train[j]
            input_ = torch.tensor(x, dtype=torch.float32).to(device)
            label = torch.tensor(y, dtype=torch.long).to(device)
            output = model(input_)
            optimizer.zero_grad()
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
        print('epoch:%d loss:%.5f' % (i, loss.item()))  # loss of the last batch in this epoch
    # save the model and optimizer state
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(state, 'models/LSTM.pkl')

3.3 Testing

def test():
    print('loading...')
    arr_test = np.load('npys/arr_test.npy', allow_pickle=True)
    labels_test = np.load('npys/labels_test.npy', allow_pickle=True)
    print('testing...')
    model = LSTM(hidden_size=64).to(device)
    model.load_state_dict(torch.load('models/LSTM.pkl')['model'])
    model.eval()
    num = 0
    for i in range(100):  # 100 test batches
        xx = arr_test[i]
        yy = labels_test[i]
        input_ = torch.tensor(xx, dtype=torch.float32).to(device)
        label = torch.tensor(yy, dtype=torch.long).to(device)
        output = model(input_)
        pred = output.max(dim=-1)[1]  # predicted class for each sentence
        for k in range(100):  # 100 sentences per batch
            if pred[k] == label[k]:
                num += 1
    print('Accuracy:', num / 10000)

After training for 10 epochs, the accuracy is 65%.
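One small improvement worth considering: test() runs the forward pass with gradient tracking enabled. A variant that wraps inference in torch.no_grad() (a sketch, not the original author's code) avoids building the autograd graph, saving memory and a little time:

def evaluate(model, arr_test, labels_test):
    # hypothetical variant of test() that disables gradient tracking during inference
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for xx, yy in zip(arr_test, labels_test):
            input_ = torch.tensor(xx, dtype=torch.float32).to(device)
            label = torch.tensor(yy, dtype=torch.long).to(device)
            pred = model(input_).max(dim=-1)[1]
            correct += (pred == label).sum().item()
            total += label.size(0)
    return correct / total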

4. How to Run the Code

Since many people have asked how to run the code, here is a summary:

(1) Download the source code from the link at the end of this article, then download the data files from the Baidu netdisk link.

(2) Put glove.6B.50d.txt from the netdisk files into the LSTM-IMDB-Classification/ directory.

(3) First run load_cab_vector() to build and save the word-vector table:

if __name__ == '__main__':
    load_cab_vector()
(4) Then generate the sentence index files for both splits:

if __name__ == '__main__':
    # load_cab_vector()
    process_sentence('train')
    process_sentence('test')
(5) Then build the batched arrays:

if __name__ == '__main__':
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    process_batch(100)
(6) Finally, train and test:

if __name__ == '__main__':
    train()
    test()
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    # process_batch(100)

5. Source Code

The source code will be published gradually in later updates~
