
In the previous section's sentiment classification, some reviews that were clearly negative were predicted as positive. For example, "this movie was shit" is obviously an extremely unfavorable comment and belongs to the negative class, yet the model's output was positive.

So in this section we improve the model through two approaches, dedicated tokenization and a higher word vector dimension, to raise prediction accuracy.

spaCy Tokenization

We use the spaCy tokenizer for word segmentation to see whether it improves accuracy.

It is recommended to use a mirror site to download and install:

pip install spacy -i http://pypi.douban.com/simple/  --trusted-host pypi.douban.com
>>> import spacy
>>> spacy.__version__
'3.0.9'

Installing the English Model

python -m spacy download en

This method did not install successfully for me, so I downloaded the wheel directly instead (it was slow, so I used a download manager): https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

Or:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

The en_core_web_sm language model is used here, so you can also install the downloaded wheel directly through the Douban mirror (this is the recommended method):

pip install en_core_web_sm-3.0.0-py3-none-any.whl  -i http://pypi.douban.com/simple/  --trusted-host pypi.douban.com

Once it is installed, the English model can be loaded through spaCy:

>>> spacy_en = spacy.load("en_core_web_sm")
>>> spacy_en._path
WindowsPath('D:/Anaconda3/envs/pygpu/lib/site-packages/en_core_web_sm/en_core_web_sm-3.0.0')

Then we tokenize. Modify the get_tokenized_imdb function from the previous section (the one bundled with d2lzh) to use spaCy's tokenizer instead:

def get_tokenized_imdb(data):
    # Tokenize each review with spaCy instead of splitting on whitespace
    def tokenizer(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]
    return [tokenizer(review) for review, _ in data]
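To see why this helps, compare spaCy's tokenization with a plain whitespace split. A minimal check (the output shown is what spaCy 3.0.x typically produces and may vary slightly across model versions):

import spacy

spacy_en = spacy.load("en_core_web_sm")

sentence = "This movie wasn't good, it was shit!"
# A plain whitespace split leaves punctuation and contractions attached
print(sentence.split(" "))
# ['This', 'movie', "wasn't", 'good,', 'it', 'was', 'shit!']

# spaCy separates punctuation and splits contractions into standard tokens
print([tok.text for tok in spacy_en.tokenizer(sentence)])
# ['This', 'movie', 'was', "n't", 'good', ',', 'it', 'was', 'shit', '!']

Tokens like "shit" (rather than "shit!") are far more likely to match entries in the vocabulary and the pretrained GloVe vectors.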

Let's train and see how it performs:

print(d2l.predict_sentiment(net, vocab, ["this", "movie", "was", "shit"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "not", "good"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "so", "bad"]))
'''
training on [gpu(0)]
epoch 1, loss 0.5781, train acc 0.692, test acc 0.781, time 66.0 sec
epoch 2, loss 0.4024, train acc 0.822, test acc 0.839, time 65.4 sec
epoch 3, loss 0.3465, train acc 0.852, test acc 0.844, time 65.6 sec
epoch 4, loss 0.3227, train acc 0.861, test acc 0.856, time 65.9 sec
epoch 5, loss 0.2814, train acc 0.880, test acc 0.859, time 66.2 sec
negative
positive
negative
'''
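For reference, d2l.predict_sentiment maps the token list to vocabulary indices, runs the network on a batch of one, and takes the argmax. A rough sketch of its logic, paraphrased from the d2lzh source (treat the details as approximate):

from mxnet import nd

def predict_sentiment(net, vocab, sentence):
    # Map tokens to indices and run the net on a batch of size one
    sentence = nd.array(vocab.to_indices(sentence), ctx=d2l.try_gpu())
    label = nd.argmax(net(sentence.reshape((1, -1))), axis=1)
    return "positive" if label.asscalar() == 1 else "negative"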

Accuracy has improved, and the first review, which was predicted positive in the previous section, is now predicted negative: the negative sentiment of this review is correctly identified. The second review is still misclassified, which means the model did not recognize "not good" as negative. Next we stack another method on top to raise accuracy further.

300-Dimensional Word Vectors

We raise the word vectors used in preprocessing from 100 to 300 dimensions and see whether accuracy goes up; that is, we replace glove.6B.100d.txt with glove.6B.300d.txt:

glove_embedding = text.embedding.create(
    "glove", pretrained_file_name="glove.6B.300d.txt", vocabulary=vocab
)
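If you are unsure of the exact file name, mxnet.contrib.text can list the pretrained GloVe files it knows about (a convenience check; the available list depends on your MXNet version):

from mxnet.contrib import text

# 'glove.6B.300d.txt' should appear in this list
print(text.embedding.get_pretrained_file_names("glove"))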

With the higher-dimensional word vector file selected, we run training and testing again:

print(d2l.predict_sentiment(net, vocab, ["this", "movie", "was", "shit"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "not", "good"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "so", "bad"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "so", "good"]))
'''
training on [gpu(0)]
epoch 1, loss 0.5186, train acc 0.734, test acc 0.842, time 74.7 sec
epoch 2, loss 0.3411, train acc 0.854, test acc 0.862, time 74.8 sec
epoch 3, loss 0.2851, train acc 0.884, test acc 0.863, time 75.6 sec
epoch 4, loss 0.2459, train acc 0.903, test acc 0.843, time 75.3 sec
epoch 5, loss 0.2099, train acc 0.917, test acc 0.853, time 75.8 sec
negative
negative
negative
positive
'''

Accuracy improved again, and the sentiment of all four reviews was correctly identified.

Full Code

import collections
import d2lzh as d2l
from mxnet import gluon, init, nd
from mxnet.contrib import text
from mxnet.gluon import data as gdata, loss as gloss, nn, rnn
import spacy

# spacy_en = spacy.load("en")
spacy_en = spacy.load("en_core_web_sm")


def get_tokenized_imdb(data):
    # Tokenize each review with spaCy
    def tokenizer(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]
    return [tokenizer(review) for review, _ in data]


def get_vocab_imdb(data):
    """Get the vocab for the IMDB data set for sentiment analysis."""
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return text.vocab.Vocabulary(counter, min_freq=5, reserved_tokens=["<pad>"])


# d2l.download_imdb(data_dir='data')
train_data, test_data = d2l.read_imdb("train"), d2l.read_imdb("test")
tokenized_data = get_tokenized_imdb(train_data)
vocab = get_vocab_imdb(train_data)
features, labels = d2l.preprocess_imdb(train_data, vocab)
batch_size = 64
# train_set = gdata.ArrayDataset(*d2l.preprocess_imdb(train_data, vocab))
train_set = gdata.ArrayDataset(*[features, labels])
test_set = gdata.ArrayDataset(*d2l.preprocess_imdb(test_data, vocab))
train_iter = gdata.DataLoader(train_set, batch_size, shuffle=True)
test_iter = gdata.DataLoader(test_set, batch_size)

"""
for X, y in train_iter:
    print(X.shape, y.shape)
    break
"""


class BiRNN(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        # Word embedding layer
        self.embedding = nn.Embedding(input_dim=len(vocab), output_dim=embed_size)
        # Setting bidirectional=True makes this a bidirectional RNN
        self.encoder = rnn.LSTM(
            hidden_size=num_hiddens,
            num_layers=num_layers,
            bidirectional=True,
            input_size=embed_size,
        )
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The LSTM expects sequence length (number of words) as the first
        # dimension, so inputs of shape (batch size, number of words) are
        # transposed; embeddings then has shape
        # (number of words, batch size, word vector dimension), e.g. (500, 64, 300)
        embeddings = self.embedding(inputs.T)
        # Bidirectional, so the hidden dimension doubles:
        # (number of words, batch size, 2 * num_hiddens), e.g. (500, 64, 200)
        outputs = self.encoder(embeddings)
        # Concatenate the hidden states of the initial and final time steps
        # as the fully connected layer's input:
        # (batch size, 4 * num_hiddens), e.g. (64, 400)
        encoding = nd.concat(outputs[0], outputs[-1])
        outs = self.decoder(encoding)
        return outs


# Create a bidirectional recurrent network with 2 hidden layers
embed_size, num_hiddens, num_layers, ctx = 300, 100, 2, d2l.try_all_gpus()
net = BiRNN(
    vocab=vocab, embed_size=embed_size, num_hiddens=num_hiddens, num_layers=num_layers
)
net.initialize(init.Xavier(), ctx=ctx)

glove_embedding = text.embedding.create(
    "glove", pretrained_file_name="glove.6B.300d.txt", vocabulary=vocab
)
net.embedding.weight.set_data(glove_embedding.idx_to_vec)
# Freeze the pretrained embeddings so they are not updated during training
net.embedding.collect_params().setattr("grad_req", "null")

lr, num_epochs = 0.01, 5
trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": lr})
loss = gloss.SoftmaxCrossEntropyLoss()
d2l.train(train_iter, test_iter, net, loss, trainer, ctx, num_epochs)

print(d2l.predict_sentiment(net, vocab, ["this", "movie", "was", "shit"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "not", "good"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "so", "bad"]))
print(d2l.predict_sentiment(net, vocab, ["this", "movie", "is", "so", "good"]))

Note that embed_size must be set to 300 to match the word vector dimension of the newly selected file.
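A small defensive check, my addition rather than part of the original code, catches such a mismatch before training starts:

# Assumes glove_embedding and embed_size are defined as in the full code above
vec_dim = glove_embedding.idx_to_vec.shape[1]
assert vec_dim == embed_size, f"embed_size={embed_size} != GloVe dim={vec_dim}"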

Summary: Judging from these experiments, better tokenization clearly helps the model understand word meanings, and mapping words to higher-dimensional vectors also improves accuracy.

