[sklearn] Training an LDA Topic Model with sklearn and Tuning Its Parameters


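Life is short, I love Python, and I love sklearn most of all. sklearn provides not only the basic machine-learning interfaces for preprocessing, feature extraction and selection, classification and clustering, but also interfaces for many commonly used language models; sklearn.decomposition.LatentDirichletAllocation is one of them. Besides introducing the basic parameters of the LDA model and how to call and train it, this article offers two workable strategies for tuning LDA, for reference and discussion. To keep the article short, the proofs behind LDA are omitted; readers who want them should turn to LDA数学八卦 (LDA Math Gossip), which is well worth the study.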

Training and Tuning an LDA Topic Model

(1) Loading the corpus and preprocessing

# Load the data
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000  # number of articles to keep
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

# Text preprocessing (optional)
# The NLTK data for tokenization, stopwords and POS tagging must be downloaded first (nltk.download)
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def textPrecessing(text):
    # Lowercase
    text = text.lower()
    # Replace punctuation with spaces
    for c in string.punctuation:
        text = text.replace(c, ' ')
    # Tokenize
    wordLst = nltk.word_tokenize(text)
    # Remove stopwords (build the set once instead of reloading the list for every word)
    stopSet = set(stopwords.words('english'))
    filtered = [w for w in wordLst if w not in stopSet]
    # Keep only nouns (or whichever POS tags you need)
    refiltered = nltk.pos_tag(filtered)
    filtered = [w for w, pos in refiltered if pos.startswith('NN')]
    # Stemming
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]
    return " ".join(filtered)

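The code above runs quickly only because I took a shuffled (shuffle=True) sample of n_samples=2000 posts. With a larger corpus, preprocessing takes noticeably longer. So if the text data does not change, save the preprocessed result; every later run then only has to read it back from a file.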
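The save-once, reload-later idea can be sketched as a small helper. This is a minimal sketch, not the author's code: load_or_build, toy_preprocess and the temporary cache file are hypothetical names used only for illustration, and the helper assumes the preprocessed text contains no newline characters.

```python
import os
import tempfile


def load_or_build(cache_path, raw_docs, preprocess):
    """Return preprocessed docs, reading cache_path when it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            # One document per line; keep empty lines so indices stay aligned
            return [line.rstrip("\n") for line in f]
    docs = [preprocess(d) for d in raw_docs]
    with open(cache_path, "w", encoding="utf-8") as f:
        for d in docs:
            f.write(d + "\n")
    return docs


# Stand-in for the real preprocessing function; records how often it is called
calls = []

def toy_preprocess(text):
    calls.append(text)
    return text.lower()


cache_file = os.path.join(tempfile.mkdtemp(), "textPre.txt")
raw = ["Hello World", "LDA Topics"]
first = load_or_build(cache_file, raw, toy_preprocess)   # preprocesses and writes the cache
second = load_or_build(cache_file, raw, toy_preprocess)  # reads the cache, no preprocessing
print(first, second, len(calls))
```

With a helper like this there is no need to comment blocks in and out between runs; deleting the cache file forces a rebuild.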

# Run this block only the first time to preprocess the text; comment it out from the second run on
# textPre_FilePath is the path of the cache file, chosen by you
import io

docLst = []
for desc in data_samples:
    docLst.append(textPrecessing(desc))
with io.open(textPre_FilePath, 'w', encoding='utf-8') as f:
    for line in docLst:
        f.write(line + '\n')

#==============================================================================
# From the second run on, read the preprocessed docLst directly and comment out
# the data loading and preprocessing above
#docLst = []
#with io.open(textPre_FilePath, 'r', encoding='utf-8') as f:
#    for line in f.readlines():
#        if line != '':
#            docLst.append(line.strip())
#==============================================================================

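I printed two arbitrary 20newsgroups posts together with their preprocessed results. POS filtering and stemming were skipped in this run so that the output is easier to read.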

Output: Original 20Newsgroups Articles: [u"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n", u'\nJames Hogan writes:\n\ntimmbake@mcl.ucsb.edu (Bake Timmons) writes:\n>>Jim Hogan quips:\n\n>>... (summary of Jim\'s stuff)\n\n>>Jim, I\'m afraid _you\'ve_ missed the point.\n\n>>>Thus, I think you\'ll have to admit that atheists have a lot\n>>more up their sleeve than you might have suspected.\n\n>>Nah. I will encourage people to learn about atheism to see how little atheists\n>>have up their sleeves. Whatever I might have suspected is actually quite\n>>meager. If you want I\'ll send them your address to learn less about your\n>>faith.\n\n>Faith?\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism? No, you need a little leap of faith, Jimmy. Your logic runs out\nof steam!\n\n>>>Fine, but why do these people shoot themselves in the foot and mock\n>>>the idea of a God? ....\n\n>>>I hope you understand now.\n\n>>Yes, Jim. I do understand now. 
Thank you for providing some healthy sarcasm\n>>that would have dispelled any sympathies I would have had for your faith.\n\n>Bake,\n\n>Real glad you detected the sarcasm angle, but am really bummin\' that\n>I won\'t be getting any of your sympathy. Still, if your inclined\n>to have sympathy for somebody\'s *faith*, you might try one of the\n>religion newsgroups.\n\n>Just be careful over there, though. (make believe I\'m\n>whispering in your ear here) They\'re all delusional!\n\nJim,\n\nSorry I can\'t pity you, Jim. And I\'m sorry that you have these feelings of\ndenial about the faith you need to get by. Oh well, just pretend that it will\nall end happily ever after anyway. Maybe if you start a new newsgroup,\nalt.atheist.hard, you won\'t be bummin\' so much?\n\n>Good job, Jim.\n>.\n\n>Bye, Bake.\n\n\n>>[more slim-Jim (tm) deleted]\n\n>Bye, Bake!\n>Bye, Bye!\n\nBye-Bye, Big Jim. Don\'t forget your Flintstone\'s Chewables! :) \n--\nBake Timmons, III\n\n-- "...there\'s nothing higher, stronger, more wholesome and more useful in life\nthan some good memory..." 
-- Alyosha in Brothers Karamazov (Dostoevsky)\n'] Articles After Preprocessing: [u'well sure story nad seem biased disagree statement u media ruin israels reputation rediculous u media pro israeli media world lived europe realize incidences one described letter occured u media whole seem try ignore u subsidizing israels existance europeans least degree think might reason report clearly atrocities shame austria daily reports inhuman acts commited israeli soldiers blessing received government makes holocaust guilt go away look jews treating races got power unfortunate', u'james hogan writes timmbake mcl ucsb edu bake timmons writes jim hogan quips summary jim stuff jim afraid missed point thus think admit atheists lot sleeve might suspected nah encourage people learn atheism see little atheists sleeves whatever might suspected actually quite meager want send address learn less faith faith yeah expect people read faq etc actually accept hard atheism need little leap faith jimmy logic runs steam fine people shoot foot mock idea god hope understand yes jim understand thank providing healthy sarcasm would dispelled sympathies would faith bake real glad detected sarcasm angle really bummin getting sympathy still inclined sympathy somebody faith might try one religion newsgroups careful though make believe whispering ear delusional jim sorry pity jim sorry feelings denial faith need get oh well pretend end happily ever anyway maybe start new newsgroup alt atheist hard bummin much good job jim bye bake slim jim tm deleted bye bake bye bye bye bye big jim forget flintstone chewables bake timmons iii nothing higher stronger wholesome useful life good memory alyosha brothers karamazov dostoevsky'] 

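(2) Counting term frequencies with CountVectorizer

LDA is not trained on raw text documents but on a document-word matrix, which can be a dense array or a sparse matrix of shape n_samples × n_features, where n_features is the number of terms. So before training the LDA topic model, use CountVectorizer to count term frequencies and save the result: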
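To make the shape of that matrix concrete, here is a small pure-Python sketch that builds a dense document-word matrix by hand. The three toy documents and the whitespace tokenizer are assumptions for illustration; CountVectorizer does the same job with a sparse matrix and its own tokenizer.

```python
from collections import Counter

# Three toy "documents", already preprocessed into space-separated tokens
toy_docs = ["car engine car", "engine oil", "team game team game"]
tokenized = [doc.split() for doc in toy_docs]

# The vocabulary defines the columns: one column per distinct term
vocabulary = sorted({word for doc in tokenized for word in doc})

# Each row is one document; each cell counts how often the term occurs in it
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Here n_samples is 3 (rows) and n_features is 5 (columns), which is exactly the n_samples × n_features layout LDA expects.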

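from sklearn.feature_extraction.text import CountVectorizer
# pickle etc. also works for saving the model; recent scikit-learn releases
# no longer ship sklearn.externals.joblib, use "import joblib" there instead
from sklearn.externals import joblib

n_features = 2500  # keep at most this many terms

# Build the term-count vectorizer and save it; run only the first time
# tf_ModelPath is the path of the saved vectorizer, chosen by you
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docLst)
joblib.dump(tf_vectorizer, tf_ModelPath)

#==============================================================================
# # Load the stored tf_vectorizer instead of fitting it again
# tf_vectorizer = joblib.load(tf_ModelPath)
# tf = tf_vectorizer.transform(docLst)
#==============================================================================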

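See the sklearn documentation for the CountVectorizer API. In the code here, min_df=2 requires a term to appear in at least 2 documents, max_df=0.95 drops terms that appear in more than 95% of the documents, and max_features keeps only the n_features=2500 most frequent terms as features. The fitted tf_vectorizer is saved to a file with joblib, so from the second run on it can be loaded directly instead of being recomputed. The resulting tf is a sparse document-term matrix; tf_vectorizer.get_feature_names() returns the term behind each feature dimension (scikit-learn 1.0 and later name this method get_feature_names_out()).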
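The difference between document frequency and raw count is easy to miss, so here is a minimal sketch on a toy corpus. The three documents are made up for illustration; only CountVectorizer and its min_df parameter come from the text above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "apple banana",
    "apple cherry cherry cherry",  # "cherry" occurs 3 times, but in one document only
    "banana apple",
]

# min_df=2 keeps a term only if it appears in at least 2 documents
vectorizer = CountVectorizer(min_df=2)
counts = vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(counts.toarray())
```

"cherry" is dropped even though it occurs three times, because all three occurrences are in the same document.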

(3) Training the LDA topic model

from sklearn.decomposition import LatentDirichletAllocation

n_topics = 30
# scikit-learn 0.19 and later call the n_topics argument n_components
lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=50,
                                learning_method='batch')
lda.fit(tf)  # tf is the document-word sparse matrix

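(4) Results

LDA training time varies widely with the max_iter setting and with how quickly the data converges. For testing, a max_iter of a few dozen usually finishes quickly; for real applications I suggest at least a thousand iterations or so.

Topic top words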

def print_top_words(model, feature_names, n_top_words):
    # Print the highest-weighted terms of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print("")
    # Print the topic-word distribution matrix
    print(model.components_)

n_top_words = 20
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Output:
# Highest-weighted terms of each topic
Topic #0: mail edu thanks new send email 00 com internet interested info uk price ac know sale fax copy data following
Topic #1: gm win rochester edu michael new fred vs adams tommy gov nick gb main hudson issue alaska nasa space people
Topic #2: 55 10 11 18 21 17 13 19 16 period 22 23 14 20 25 15 24 12 93 26
Topic #3: color server motif software input output edu support clock 256 bits linux vga shots default mode level using image xterm
Topic #4: edu writes article com know like uiuc cc news cs people cso opinions think david really way right heard sure
Topic #5: section military shall dangerous firearm weapon law person state license use means following women designed islamic japanese division men issued
Topic #6: like know time good bike com really writes course year ride going think got read live years better big high
Topic #7: com edu writes article list andrew apple cmu cs sandvik points toronto ca kent vancouver sphere power point portal cup
Topic #8: know ca black use white edu think writes light like signal right old used dave bnr want mouse led let
Topic #9: drive disk drives hard controller rom card bios floppy flyers 16 feature supports board speed bus interface power mb data
Topic #10: people government think president american weapons country clinton mr support time billion make new say like going state states jobs
Topic #11: edu insurance hp writes article like offer cable best turbo use port power se speed hd good 25 swap year
Topic #12: food edu msg writes article standard frank use objective red blues people bear cs area values begin like wings rick
Topic #13: earth probe moon lunar orbit mission surface mars space spacecraft venus solar jupiter science atmosphere planet planetary images data pioneer
Topic #14: edu com want good dog writes buy dod sold question dealer article water nec large make used chris audio hp
Topic #15: israel jews israeli arab jewish attacks state peace people land policy lebanese arabs right say nazi writes men fact soldiers
Topic #16: com gun writes guns article crime 000 self edu likely isc stratus make texas fbi government way br steve defense
Topic #17: scsi bit mac 32 tv fast ide cards ibm chip 16 set difference better bytes fpu faster computer use piece
Topic #18: edu ftp version pc contact machines available type pub au comments mit anonymous sun mac program unix math looking written
Topic #19: car cars turkish engine greek oil tires speed turks brake miles greeks 000 better new brakes good dot tire wheel
Topic #20: god people think jesus edu believe say bible way good know christian point life like church law time faith says
Topic #21: use using key number time like want used problem idea need know serial example code data traffic application keys case
Topic #22: university april science 1993 research disease program health information new study medicine power energy computer papers time process development conference
Topic #23: space years nasa gov new year launch 10 sci pitt gay shuttle km 15 article medical titan soon high 1990
Topic #24: people said went know going time children think like came home killed happened took armenians come got told away dead
Topic #25: graphics image mail pub edu aids ray 128 files package mil images 3d send sgi computer systems archive gov format
Topic #26: windows file problem use edu window thanks files help card know dos like monitor using memory work video program need
Topic #27: game team play year players season think games hockey player win cubs teams better good baseball ca fan leafs league
Topic #28: writes com edu article atheism bob jim tek word rights used people news case keith alt said term time given
Topic #29: government key encryption chip clipper public use keys law people enforcement private nsa security like secure phone com think care

# Topic-word distribution matrix
array([[ 1.00e+02, 3.e-02, 3.e-02, ..., 3.e-02, 3.e-02, 3.e-02],
       [ 3.e-02, 3.e-02, 3.e-02, ..., 3.e-02, 3.e-02, 3.e-02],
       [ 1.e+01, 3.e-02, 1.e+01, ..., 3.e-02, 3.e-02, 3.e-02],
       ...,
       [ 3.e-02, 3.e-02, 3.e-02, ..., 3.e-02, 9.e+00, 3.e-02],
       [ 3.e-02, 3.e-02, 3.e-02, ..., 3.e-02, 3.e-02, 3.e-02],
       [ 3.e-02, 3.e-02, 3.e-02, ..., 3.e-02, 3.e-02, 3.e-02]])

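A quick look at the top words of each topic shows that they are mostly sensible: education-related words cluster together, mechanical words cluster together, and so on. There are still problems, though. Training has not gone far enough, and because no stemming was applied, both "car" and "cars" appear in Topic #19. Avoid this in your own training.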
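The slice topic.argsort()[:-n_top_words - 1:-1] in print_top_words is compact but not obvious. A minimal sketch of what it does, on a made-up weight vector (toy_feature_names and toy_topic are illustrative values, not output of the model above):

```python
import numpy as np

toy_feature_names = ["car", "engine", "game", "oil", "team"]
toy_topic = np.array([4.0, 2.5, 0.03, 1.2, 0.05])  # one row of components_
n_top = 3

# argsort returns the indices that would sort the weights in ascending order
ascending = toy_topic.argsort()

# Walk backwards from the last index and stop after n_top entries,
# which yields the indices of the n_top largest weights, largest first
top_idx = ascending[:-n_top - 1:-1]
top_terms = [toy_feature_names[i] for i in top_idx]
print(top_terms)
```

The same result could be written as ascending[::-1][:n_top]; the single slice just avoids building the fully reversed array.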

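Doc_Topic results

A major purpose of training LDA is to analyze the topic distribution of a document; that is where the model delivers most of its value. The call that converts documents into topic distributions with the trained model, and its result, are shown here: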

doc_topic_dist = lda.transform(tf)

Output:
array([[ 0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       [ 0.0, 0.0, 0.0, ..., 1. , 26., 0.0],
       [ 0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       ...,
       [ 0.0, 0.0, 15., ..., 0.0, 0.0, 0.0],
       [ 0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       [ 0.0, 0.0, 0.0, ..., 13., 0.0, 0.0]])

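Earlier I showed two sample articles; their main topics come out as topic#12 and topic#20. You can judge for yourself how well that fits. Admittedly the result may not be great, and there are many possible reasons: the parameters have not been tuned yet, and to save time the preprocessing skipped stemming and POS filtering, which you can simply add back.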
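Reading off the main topic of each document amounts to taking the largest entry in each row of the document-topic matrix. The sketch below runs the whole pipeline on a toy corpus; it assumes a recent scikit-learn, where the number of topics is passed as n_components and each row returned by the model is a normalized distribution. The four toy documents are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "space rocket launch orbit moon",
    "rocket orbit space shuttle launch",
    "hockey team game player season",
    "game season team hockey goal",
]
toy_tf = CountVectorizer().fit_transform(toy_docs)

# n_components is the newer name of the n_topics argument (assumption: scikit-learn >= 0.19)
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=50,
                                    learning_method="batch", random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_tf)  # shape: (n_documents, n_topics)

# The main topic of a document is the column with the largest weight in its row
dominant_topic = toy_doc_topic.argmax(axis=1)
for doc, topic_id in zip(toy_docs, dominant_topic):
    print("topic #%d <- %s" % (topic_id, doc))
```

On a real corpus it is usually worth looking at the second and third largest entries as well, since many documents mix several topics.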

Convergence (perplexity)

Calling lda.perplexity(X) returns the perplexity of the current model. sklearn defines perplexity as exp(-1. * log-likelihood per word).
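The formula is easier to interpret with numbers. The sketch below applies exp(-1 * log-likelihood per word) to made-up per-word probabilities; the perplexity helper and the two toy inputs are illustrative only, and sklearn itself works from an approximate bound on the log-likelihood rather than from exact word probabilities.

```python
import math


def perplexity(word_log_probs):
    """exp(-1 * average log-likelihood per word)."""
    per_word_ll = sum(word_log_probs) / len(word_log_probs)
    return math.exp(-1.0 * per_word_ll)


# A model that gives every one of 8 words probability 1/4
uniform_guess = perplexity([math.log(0.25)] * 8)
# A more confident model that gives every word probability 1/2
better_guess = perplexity([math.log(0.5)] * 8)

print(uniform_guess, better_guess)
```

A perplexity of 4 means the model is as uncertain as if it were choosing uniformly among 4 words at every position, so lower values indicate a better fit.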

lda.perplexity(tf)

Output:
1270.92

This run used few iterations and the model has not converged, so the perplexity is clearly high. Tuning can give a more reliable model.

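(5) (Optional) Tuning

Tunable parameters

  • n_topics: the number of topics (scikit-learn 0.19 and later call this parameter n_components)
  • n_features: the number of features, i.e. the number of frequent terms kept; this is passed to CountVectorizer as max_features rather than to LDA itself
  • doc_topic_prior: the parameter α of the Dirichlet prior over the document-topic distribution θd
  • topic_word_prior: the parameter η of the Dirichlet prior over the topic-word distribution βk
  • learning_method: the algorithm used to fit LDA, either 'batch' or 'online'
  • Other parameters provided by sklearn: depending on the fitting algorithm, some further parameters can be adjusted; see the appendix at the end, "sklearn LDA API explained".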
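To get a feel for what the two prior parameters control, the sketch below samples from a symmetric Dirichlet distribution with a small and a large concentration value, using only the standard library. The helper names and the chosen values (0.1 and 10.0, 5 topics) are assumptions for illustration and are unrelated to sklearn's internals.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=5, n=500, seed=0):
    """Average weight of the biggest category across n samples."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


sparse = mean_largest_share(0.1)   # small alpha: one topic tends to dominate
smooth = mean_largest_share(10.0)  # large alpha: weight spread across topics
print("alpha=0.1  -> mean largest share %.2f" % sparse)
print("alpha=10.0 -> mean largest share %.2f" % smooth)
```

A small doc_topic_prior therefore pushes each document toward a few topics, and a small topic_word_prior pushes each topic toward a few words; larger values give flatter mixtures.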

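Two workable tuning strategies

1. Taking n_topics as an example, choose the best model by perplexity. Of course, a different number of topics inevitably changes how the perplexity is computed, so perplexity is only a reference, and the number of topics still has to be chosen subjectively according to actual needs. The tuning code for n_topics: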

import os
from time import time
import matplotlib.pyplot as plt

n_topics = range(20, 75, 5)
perplexityLst = [1.0] * len(n_topics)

# Train LDA and print the training time
lda_models = []
for idx, n_topic in enumerate(n_topics):
    lda = LatentDirichletAllocation(n_topics=n_topic,
                                    max_iter=20,
                                    learning_method='batch',
                                    evaluate_every=200,
                                    # perp_tol=0.1,  # default
                                    # doc_topic_prior=1/n_topic,  # default
                                    # topic_word_prior=1/n_topic,  # default
                                    verbose=0)
    t0 = time()
    lda.fit(tf)
    perplexityLst[idx] = lda.perplexity(tf)
    lda_models.append(lda)
    print("# of Topic: %d, done in %0.3fs, N_iter %d, Perplexity Score %0.3f"
          % (n_topics[idx], time() - t0, lda.n_iter_, perplexityLst[idx]))

# Pick the best model
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic: %d" % best_n_topic)

# Plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(n_topics, perplexityLst)
ax.set_xlabel("# of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
# CODE is a run-identifier string defined by you; the lda_result directory must already exist
plt.savefig(os.path.join('lda_result', 'perplexityTrend' + CODE + '.png'))
plt.show()

Output:
Best # of Topic: 25

[Figure: perplexity trend for different numbers of topics]

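2. To tune all parameters at once, you can also run a cross-validated grid search directly in sklearn, but this is bound to take a very long time. The code here is for reference only; add or remove parameters as needed.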

from sklearn.model_selection import GridSearchCV

parameters = {
    'learning_method': ('batch', 'online'),
    'n_topics': range(20, 75, 5),
    'perp_tol': (0.001, 0.01, 0.1),
    'doc_topic_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
    'topic_word_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
    'max_iter': (1000,),  # every grid entry must be a sequence of candidate values
}
lda = LatentDirichletAllocation()
model = GridSearchCV(lda, parameters)
model.fit(tf)
sorted(model.cv_results_.keys())
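A quick count shows why this grid search is so slow. The sketch below enumerates the same parameter grid with the standard library; the fold count of 3 is an assumption (the default number of folds differs between scikit-learn versions), and no model is trained here.

```python
from itertools import product

param_grid = {
    "learning_method": ("batch", "online"),
    "n_topics": range(20, 75, 5),
    "perp_tol": (0.001, 0.01, 0.1),
    "doc_topic_prior": (0.001, 0.01, 0.05, 0.1, 0.2),
    "topic_word_prior": (0.001, 0.01, 0.05, 0.1, 0.2),
    "max_iter": (1000,),
}

# Grid search tries every combination of candidate values
n_combinations = len(list(product(*param_grid.values())))

assumed_cv_folds = 3  # assumption: each combination is fitted once per fold
n_fits = n_combinations * assumed_cv_folds
print("%d combinations, %d model fits" % (n_combinations, n_fits))
```

With 1000 iterations per fit, thousands of fits add up quickly, so it is usually more practical to tune one or two parameters at a time or to shrink the candidate lists first.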

Appendix: sklearn LDA API explained

class sklearn.decomposition.LatentDirichletAllocation(n_topics=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None)



Published by: 全栈程序员-站长. Please credit the source when reposting: https://javaforall.net/218646.html Original link: https://javaforall.net
