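Predicting Whether Disaster Posts Are Real with NLP (BERT)

Background

WeChat Moments, Weibo, and Twitter carry a huge volume of user posts, and they can become important communication channels in an emergency. Because smartphones are everywhere, people can report an emergency they are witnessing in real time. For that reason, more and more organizations (disaster relief organizations and news agencies, for example) are interested in monitoring Twitter programmatically.

However, it is not always clear whether a person's words are actually announcing a disaster. We therefore want to build a model that analyzes the text of a post and judges whether the disaster it describes is real.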
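Data format

Sample rows from the training set:

| id | keyword | location | text | target (1 = real) |
| --- | --- | --- | --- | --- |
| 1 | ablaze | Bangkok | On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE | 0 |
| 2 | disaster | school | homework!!!o(╥﹏╥)o | 1 |
| ··· | ··· | ··· | ··· | ··· |
| 10900 | accident | Gloucestershire , UK | .@NorwayMFA #Bahrain police had previously died in a road accident they were not killed by explosion… | 1 |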
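Key idea

The core of the problem is encoding (embedding) the keyword, the location, and above all the text, which makes this a natural language processing (NLP) problem.

① Conventional encoding
The keyword and location fields contain only a few words each (usually one), so they are one-hot encoded, the usual way to handle categorical string data.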
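As a rough illustration of what one-hot encoding produces, the sketch below encodes a handful of keyword values by hand. The helper `one_hot` and the sample values are made up for illustration; the pipeline in this post relies on scikit-learn's `OneHotEncoder` instead.

```python
def one_hot(values):
    """Map each distinct value to a 0/1 indicator vector (illustrative helper)."""
    categories = sorted(set(values))             # fixed column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                    # exactly one "hot" position per row
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["ablaze", "accident", "ablaze", "disaster"])
print(categories)  # ['ablaze', 'accident', 'disaster']
print(rows)        # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each distinct keyword gets its own column, which is why a field with thousands of distinct values (location, for instance) produces a very wide matrix.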
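② NLP embeddings
Bag of words: similar to one-hot encoding. Every word that occurs in the texts is collected into a vocabulary (a set; stop words such as "is", "a", and "the" can be left out), and each text is then encoded as a vector of word counts over that vocabulary.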
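The bag-of-words idea can be sketched in a few lines of plain Python. The stop-word list, the whitespace tokenizer, and the two sample sentences are assumptions chosen for illustration; a real vectorizer such as scikit-learn's `CountVectorizer` tokenizes more carefully.

```python
from collections import Counter

STOP_WORDS = {"is", "a", "the"}  # assumed tiny stop-word list, for illustration only


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def bag_of_words(texts):
    """Return the sorted vocabulary and one count vector per text."""
    tokenized = [tokenize(text) for text in texts]
    vocab = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["the sky is ablaze", "fire fire in the sky"])
print(vocab)    # ['ablaze', 'fire', 'in', 'sky']
print(vectors)  # [[1, 0, 0, 1], [0, 2, 1, 1]]
```

Word order is thrown away entirely: only the counts survive, which is the information loss mentioned for this method.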
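BERT embeddings:
BERT stands for Bidirectional Encoder Representations from Transformers.
In 2018, Google released BERT, a powerful Transformer-based machine learning model for NLP applications that outperformed earlier language models on a range of benchmark datasets.
The BERT architecture is a stack of Transformer encoders. Each encoder contains two sub-layers: a self-attention layer and a feed-forward layer.
Built on attention, BERT greatly improves how well features can be extracted from text.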
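To make the self-attention sub-layer a little more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is a simplification under stated assumptions: the weights are random rather than trained, and the multi-head split, residual connections, and layer normalization of a real Transformer encoder are left out.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (tokens, tokens) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings (toy sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # attention weights of each token sum to 1
```

Every output row is a weighted mix of all tokens in the sentence, which is how each token's representation comes to depend on its context in both directions.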
Data preprocessing

Import the libraries and load the data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import OneHotEncoder

    train = pd.read_csv('/kaggle/input/nlpgettingstarted/train.csv')
    test = pd.read_csv('/kaggle/input/nlpgettingstarted/test.csv')

    # Keep the test ids for the submission file, then drop the id columns.
    id = test['id']
    test.drop(['id'], axis=1, inplace=True)
    train.drop(['id'], axis=1, inplace=True)

    # Separate the label from the features.
    target = train['target']
    train.drop(['target'], axis=1, inplace=True)
Fill missing keyword and location values with the placeholder '0', then one-hot encode both columns:

    # Shuffle the rows, keeping the labels aligned with the shuffled features.
    train = train.sample(frac=1, random_state=42)
    target = target.loc[train.index].reset_index(drop=True)
    train = train.reset_index(drop=True)

    # Fill missing values with the placeholder '0'.
    for df in (train, test):
        df['keyword'] = df['keyword'].fillna('0')
        df['location'] = df['location'].fillna('0')

    # Fit the encoder on the training set only; categories that appear
    # only in the test set are encoded as all zeros.
    encoder = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
    encoded = encoder.fit_transform(train[['keyword', 'location']])
    train = pd.concat([train.drop(['keyword', 'location'], axis=1),
                       pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['keyword', 'location']))],
                      axis=1)

    encoded = encoder.transform(test[['keyword', 'location']])
    test = pd.concat([test.drop(['keyword', 'location'], axis=1),
                      pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['keyword', 'location']))],
                     axis=1)
Natural language processing (NLP)

Two approaches are shown here.

Encoding with BERT:

    import torch
    from transformers import BertTokenizer, BertModel

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.to(device)
    model.eval()  # inference only

    def encode_text(text):
        # Tokenize one text and move the resulting tensors to the model's device.
        tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            output = model(**tokens)
        # Mean-pool the last hidden states over the tokens: one 768-dim vector per text.
        return output['last_hidden_state'].mean(dim=1).squeeze().cpu().numpy()
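tokenizer: the tokenizer object that converts text into the model's input format. It normally comes with a pre-trained language model (BERT, GPT, and so on).
text: the raw text to be tokenized and encoded.
padding=True: pads shorter sequences with padding tokens so that every sequence in a batch has the same length (that of the longest one). This is useful for batching because it lets sequences of different lengths be stacked into one tensor. When a single text is passed, as in encode_text, nothing needs to be padded.
truncation=True: truncates any text that exceeds the model's maximum input length, so that no input is longer than the model can accept.
return_tensors='pt': makes the tokenizer return PyTorch tensors, which is the form a PyTorch model expects its inputs in.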
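What padding and truncation do to a batch can be sketched without loading a tokenizer at all. The function below works on made-up token-id lists; `pad_and_truncate`, the tiny `MAX_LENGTH`, and the sample ids are assumptions for illustration (bert-base-uncased accepts up to 512 tokens, and a real tokenizer keeps the closing special token when it truncates, which this naive slice does not).

```python
PAD_ID = 0       # assumed padding id, for illustration
MAX_LENGTH = 5   # assumed tiny limit so that truncation is visible


def pad_and_truncate(batch, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut every sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 102], [101, 102]]
input_ids, attention_mask = pad_and_truncate(batch)
print(input_ids)       # [[101, 7, 8, 102, 0], [101, 5, 6, 7, 8], [101, 102, 0, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
```

The attention mask matters if texts are ever encoded in batches: a plain mean over the token dimension would then also average the padding positions.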
    # Encode every text with BERT (one forward pass per row).
    train['text_encoded'] = train['text'].apply(encode_text)
    test['text_encoded'] = test['text'].apply(encode_text)

    # The raw text is no longer needed.
    train.drop('text', axis=1, inplace=True)
    test.drop('text', axis=1, inplace=True)
Expand the text_encoded vector column produced by BERT into one column per dimension:

    # Training set: one column per embedding dimension.
    text_encoded_expanded = pd.DataFrame(
        train['text_encoded'].to_list(),
        columns=[f'text_encoded_{i}' for i in range(len(train['text_encoded'].iloc[0]))])
    train = pd.concat([train, text_encoded_expanded], axis=1)
    train.drop(['text_encoded'], axis=1, inplace=True)

    # Test set: same expansion.
    text_encoded_expanded = pd.DataFrame(
        test['text_encoded'].to_list(),
        columns=[f'text_encoded_{i}' for i in range(len(test['text_encoded'].iloc[0]))])
    test = pd.concat([test, text_encoded_expanded], axis=1)
    test.drop(['text_encoded'], axis=1, inplace=True)
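Final data format:

| keyword_ablaze | keyword_accident | keyword_aftershock | ··· |
| --- | --- | --- | --- |
| 0.0 | 1.0 | 0.0 | ··· |
| ··· | ··· | ··· | ··· |
| 0.0 | 0.0 | 0.0 | ··· |

| location_Bangkok | location_school | location_plane | ··· |
| --- | --- | --- | --- |
| 0.0 | 0.0 | 0.0 | ··· |
| ··· | ··· | ··· | ··· |
| 0.0 | 1.0 | 0.0 | ··· |

| text_encoded_0 | text_encoded_1 | text_encoded_767 | ··· |
| --- | --- | --- | --- |
| 0.262901 | -0.57036 | 0.66548 | ··· |
| ··· | ··· | ··· | ··· |
| -0.862901 | 0.37036 | -0.06548 | ··· |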
Encoding with bag of words (word meaning and semantics are lost; this is purely a count-based encoding):

    from sklearn.feature_extraction.text import CountVectorizer

    # Learn the vocabulary on the training texts only, then reuse it for the test texts.
    text_vectorizer = CountVectorizer()
    train_text_vectors = text_vectorizer.fit_transform(train['text'])
    test_text_vectors = text_vectorizer.transform(test['text'])
    print(train.head())

    # Turn the sparse count matrices into DataFrames, one column per vocabulary word.
    train_text_df = pd.DataFrame(train_text_vectors.toarray(),
                                 columns=text_vectorizer.get_feature_names_out())
    test_text_df = pd.DataFrame(test_text_vectors.toarray(),
                                columns=text_vectorizer.get_feature_names_out())
    print(train.head())

    # Replace the raw text column with the count columns. The raw column is dropped
    # before concatenating so that a vocabulary word named "text" is not dropped with it.
    train = pd.concat([train.drop('text', axis=1), train_text_df], axis=1)
    test = pd.concat([test.drop('text', axis=1), test_text_df], axis=1)
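Preparing the data for the model

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Hold out 20% of the training data for validation.
    X_train, X_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=42)

    # Fit the scaler on the training split only, then apply it to validation and test.
    # The results are wrapped back into DataFrames because later steps use .values and .columns.
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
    test = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)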
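The reason the scaler is fitted on the training split and only applied to the other splits can be seen in a small NumPy sketch of standardization. `fit_scaler`, `apply_scaler`, and the toy arrays are illustrative names and values, not part of the original pipeline; the arithmetic follows the usual definition (subtract the mean, divide by the population standard deviation, leave constant columns unscaled).

```python
import numpy as np


def fit_scaler(x):
    """Learn per-column mean and standard deviation from the training data."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0          # constant columns: avoid dividing by zero
    return mean, std


def apply_scaler(x, mean, std):
    """Standardize any split with statistics learned on the training split."""
    return (x - mean) / std


x_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
x_val = np.array([[3.0, 10.0], [7.0, 12.0]])

mean, std = fit_scaler(x_train)
x_train_scaled = apply_scaler(x_train, mean, std)
x_val_scaled = apply_scaler(x_val, mean, std)
print(x_train_scaled.mean(axis=0))  # columns are centered on the training split
print(x_val_scaled)                 # validation rows reuse the training statistics
```

Reusing the training statistics keeps information about the validation and test rows from leaking into preprocessing.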
Deep learning model

Only a simple fully connected network is used here (it is also the only architecture I have tried T.T).

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    from sklearn.metrics import accuracy_score

    # Fix the random seeds for reproducibility.
    torch.manual_seed(2)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(2)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print(torch.cuda.is_available())

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Convert features and labels to float tensors on the chosen device.
    # np.asarray accepts both DataFrames/Series and plain arrays.
    X_train_tensor = torch.tensor(np.asarray(X_train), dtype=torch.float32).to(device)
    y_train_tensor = torch.tensor(np.asarray(y_train), dtype=torch.float32).to(device)
    X_val_tensor = torch.tensor(np.asarray(X_val), dtype=torch.float32).to(device)
    y_val_tensor = torch.tensor(np.asarray(y_val), dtype=torch.float32).to(device)
Let's get started! Build the fully connected network:

    class SimpleNN(nn.Module):
        def __init__(self, input_size):
            super(SimpleNN, self).__init__()
            # Fully connected layers: input -> 1024 -> 1024 -> 512 -> 256 -> 64 -> 1
            self.fc1 = nn.Linear(input_size, 1024)
            self.fc1024 = nn.Linear(1024, 1024)
            self.fc2 = nn.Linear(1024, 512)
            self.fc3 = nn.Linear(512, 256)
            self.fc4 = nn.Linear(256, 64)
            self.fc5 = nn.Linear(64, 1)
            # Sigmoid turns the final output into a probability.
            self.sigmoid = nn.Sigmoid()
            # Batch normalization layers, one per hidden width.
            self.bn1024 = nn.BatchNorm1d(1024)
            self.bn512 = nn.BatchNorm1d(512)
            self.bn256 = nn.BatchNorm1d(256)
            self.bn128 = nn.BatchNorm1d(128)  # not used in forward()
            self.bn64 = nn.BatchNorm1d(64)
            self.dropout = nn.Dropout(0.1)
            self.relu = nn.ReLU()
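BN usually refers to batch normalization (Batch Normalization).
It makes training more stable and speeds up convergence.
It reduces the dependence on the initial weights and allows a higher learning rate.
It has a mild regularizing effect, which helps reduce overfitting.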
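For intuition, the training-time computation inside a batch normalization layer can be written out in NumPy. This is a sketch of the forward pass only, with `gamma` and `beta` standing in for the learnable scale and shift; the running statistics a real layer keeps for evaluation mode are omitted, and the toy batch is random data.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)                        # biased variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))   # toy batch: 32 samples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

Whatever scale the previous layer produces, the next layer sees inputs in a stable range, which is where the training stability comes from.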
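Dropout randomly drops (switches off) some of the network's units during training, which reduces the effective complexity of the network and guards against overfitting.
It lowers the risk of overfitting and improves the model's ability to generalize.
It reduces over-reliance on particular neurons, which makes the model more robust.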
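A minimal NumPy sketch of (inverted) dropout shows the mechanics. The function name and the toy input are illustrative; the rescaling by 1 / (1 - p) is the standard trick that keeps the expected activation unchanged, so nothing has to be adjusted at evaluation time.

```python
import numpy as np


def dropout(x, p, rng, training=True):
    """Zero each element with probability p during training; identity otherwise."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p      # True for the units that are kept
    return x * mask / (1.0 - p)          # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
x = np.ones(10000)
out = dropout(x, p=0.1, rng=rng)
print((out == 0).mean())                      # roughly 0.1 of the units are dropped
print(out.mean())                             # stays close to 1.0
print(dropout(x, 0.1, rng, training=False) is x)  # evaluation mode leaves x untouched
```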
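ReLU (Rectified Linear Unit) is an activation function widely used in deep learning models. It sets every negative input to zero and leaves positive inputs unchanged. It is defined as: f(x) = max(0, x)
More efficient gradient descent and backpropagation: the gradient is exactly 1 for positive inputs, which mitigates the vanishing-gradient problem that saturating activations such as sigmoid and tanh suffer from.
Simpler computation: there are no exponentials as in other, more complex activation functions, and because many activations are exactly zero (sparse activation), the overall computational cost of the network goes down.
Its non-linearity is what allows the network, in theory, to approximate arbitrary functions.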
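The definition and its gradient fit in a few lines of plain Python. The gradient at exactly zero is taken to be 0 here, which is a common convention rather than something the definition forces.

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 1.5]
print([relu(v) for v in inputs])       # [0.0, 0.0, 0.0, 1.5]
print([relu_grad(v) for v in inputs])  # [0.0, 0.0, 0.0, 1.0]
```

The gradient never shrinks for active units, while inactive units contribute nothing, which is the sparsity mentioned above.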
Forward pass

        # Method of SimpleNN: each hidden block is Linear -> ReLU -> BatchNorm -> Dropout.
        def forward(self, x):
            x = self.fc1(x)
            x = self.relu(x)
            x = self.bn1024(x)
            x = self.dropout(x)

            # The second 1024-wide block reuses the same bn1024 layer.
            x = self.fc1024(x)
            x = self.relu(x)
            x = self.bn1024(x)
            x = self.dropout(x)

            x = self.fc2(x)
            x = self.relu(x)
            x = self.bn512(x)
            x = self.dropout(x)

            x = self.fc3(x)
            x = self.relu(x)
            x = self.bn256(x)
            x = self.dropout(x)

            # Last hidden block has no dropout.
            x = self.fc4(x)
            x = self.relu(x)
            x = self.bn64(x)

            # Output layer: one probability per sample.
            x = self.fc5(x)
            x = self.sigmoid(x)
            return x
Creating the model and setting the hyperparameters (a.k.a. alchemy o(╥﹏╥)o)

    from torch.optim.lr_scheduler import LambdaLR

    model = SimpleNN(input_size=X_train_tensor.shape[1])
    model = model.to(X_train_tensor.device)

    # Binary cross-entropy on the sigmoid output.
    criterion = nn.BCELoss()

    def lr_lambda(epoch, learning_rate, increase_epochs):
        # Linear warm-up for the first increase_epochs epochs, then decay by 5% per epoch.
        if epoch < increase_epochs:
            return (epoch + 1) / increase_epochs
        else:
            return 0.95 ** (epoch - increase_epochs)

    num_epochs = 50
    increase_epochs = 5
    learning_rate = 0.0005
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=5)
    scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: lr_lambda(epoch, learning_rate, increase_epochs))

    batch_size = 3350
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = DataLoader(train_dataset, batch_size, shuffle=False)
    val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
    val_loader = DataLoader(val_dataset, batch_size, shuffle=False)

    # Lowest validation loss seen so far.
    best_val_loss = float('inf')
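The learning-rate schedule is easier to read once the per-epoch values are written out. The sketch below recomputes them in plain Python from the same constants (base rate 0.0005, 5 warm-up epochs, decay factor 0.95); `lr_factor` is just a local name for this illustration.

```python
def lr_factor(epoch, increase_epochs=5, decay=0.95):
    """Multiplier applied to the base learning rate at a given (0-based) epoch."""
    if epoch < increase_epochs:
        return (epoch + 1) / increase_epochs       # linear warm-up
    return decay ** (epoch - increase_epochs)      # exponential decay afterwards


base_lr = 0.0005
schedule = [base_lr * lr_factor(epoch) for epoch in range(50)]
print([round(lr, 6) for lr in schedule[:6]])  # [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0005]
print(schedule[-1])                           # roughly 5.23e-05 in the final epoch
```

The rate climbs to the base value over the first five epochs and then shrinks by 5% per epoch, which lines up with the learning rates printed in the training log.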
Start training!

    import copy

    for epoch in range(num_epochs):
        # Training pass over all batches.
        model.train()
        train_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs).squeeze(1)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)

        model.eval()
        with torch.no_grad():
            # Training accuracy, measured in evaluation mode.
            all_train_preds = []
            for inputs, labels in train_loader:
                outputs = model(inputs).squeeze(1)
                preds = (outputs > 0.5).int().cpu().numpy()
                all_train_preds.append(preds)
            all_train_preds = np.concatenate(all_train_preds)
            train_accuracy = accuracy_score(y_train_tensor.cpu().numpy(), all_train_preds)

            # Validation loss and accuracy.
            all_val_preds = []
            val_loss = 0.0
            for inputs, labels in val_loader:
                outputs = model(inputs).squeeze(1)
                batch_loss = criterion(outputs, labels)
                val_loss += batch_loss.item()
                preds = (outputs > 0.5).int()
                all_val_preds.append(preds.cpu().numpy())
            all_val_preds = np.concatenate(all_val_preds)
            val_loss /= len(val_loader)
            val_accuracy = accuracy_score(y_val_tensor.cpu().numpy(), all_val_preds)

        print(f'Epoch [{epoch+1}/{num_epochs}]\n Train Loss: {train_loss:.6f},Val Loss: {val_loss:.6f}\nTrain Accuracy: {train_accuracy:.6f}, Val Accuracy: {val_accuracy:.6f}')
        print(f'Epoch {epoch+1}, Learning rate: {optimizer.param_groups[0]["lr"]}\n\n')

        # Keep a copy of the weights with the lowest validation loss. state_dict()
        # returns references to the live tensors, so a deep copy is required.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_params = copy.deepcopy(model.state_dict())

        scheduler.step()
Training log:

    True
    Epoch [1/50]
    Train Loss: 0.726889,Val Loss: 0.695472
    Train Accuracy: 0.427915, Val Accuracy: 0.436638
    Learning rate: 0.0001

    Epoch [2/50]
    Train Loss: 0.722656,Val Loss: 0.692767
    Train Accuracy: 0.542857, Val Accuracy: 0.556139
    Learning rate: 0.0002

    Epoch [3/50]
    Train Loss: 0.715571,Val Loss: 0.690732
    Train Accuracy: 0.572085, Val Accuracy: 0.563362
    Learning rate: 0.0003

    ···

    Epoch [50/50]
    Train Loss: 0.178646,Val Loss: 0.588852
    Train Accuracy: 0.949589, Val Accuracy: 0.818122
    Learning rate: 5.2336e-05
Prediction

With the model trained, it's time to predict!

    # Put the test features on the same device as the model.
    test_tensor = torch.tensor(np.asarray(test), dtype=torch.float32).to(next(model.parameters()).device)
    print(test_tensor.shape)

    # Restore the best checkpoint and switch to evaluation mode.
    model.load_state_dict(best_model_params)
    model.eval()

    with torch.no_grad():
        predictions = model(test_tensor)

    # Threshold the predicted probabilities at 0.5 to get 0/1 labels.
    predictions = predictions.cpu().numpy()
    target_predictions = predictions[:, 0]
    binary_predictions = (target_predictions > 0.5).astype(int).ravel()
    print(binary_predictions)

    # Write the submission file.
    submission_df = pd.DataFrame({'id': id, 'target': binary_predictions})
    submission_df.to_csv('prebert.csv', index=False)
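Summary

This was my first NLP project. Attention is all you need!! Google's BERT is really great. The final score was 0.80386 (rank 324/894)~~~ There is still room to tune the hyperparameters, though! More alchemy!!! ( * ^▽^ * )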
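Tuning

I switched checkpoint selection from lowest validation loss to highest validation accuracy and changed the batch size to 3700. The model's accuracy ended up at 0.827, which moved it up to rank 303/894!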
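Choosing the checkpoint by accuracy instead of loss comes down to which metric is compared across epochs. The sketch below shows that selection in plain Python; `best_epoch` is a hypothetical helper, and the history contains only the four epochs printed in the training log above, not the full run.

```python
def best_epoch(history, key, higher_is_better):
    """Return the record whose metric is best under the chosen criterion."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda record: record[key])


# Only the epochs shown in the training log; the remaining epochs were not published.
history = [
    {"epoch": 1, "val_loss": 0.695472, "val_acc": 0.436638},
    {"epoch": 2, "val_loss": 0.692767, "val_acc": 0.556139},
    {"epoch": 3, "val_loss": 0.690732, "val_acc": 0.563362},
    {"epoch": 50, "val_loss": 0.588852, "val_acc": 0.818122},
]

by_loss = best_epoch(history, "val_loss", higher_is_better=False)
by_acc = best_epoch(history, "val_acc", higher_is_better=True)
print(by_loss["epoch"], by_acc["epoch"])  # 50 50
```

On these four records both criteria agree. Over a full run they can pick different epochs, because validation loss may start rising from overconfident predictions while accuracy is still improving.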