基于潜在狄利克雷分配算法(LDA)实现长文档主题建模

LDA于2003年由 David Blei, Andrew Ng和 Michael I. Jordan提出，因为模型的简单和有效，掀起了主题模型研究的波浪。LDA（Latent Dirichlet Allocation）主题分析是一种无监督的生成式模型，用于从大规模文本数据中挖掘潜在的主题结构。LDA假设每篇文档是由多个主题混合生成的，而每个主题又是一个词的概率分布。通过分析文档中的词频分布，LDA能够推断出文档的主题分布以及每个主题的关键词。LDA的核心在于其生成过程：文档先从一个Dirichlet分布中抽取主题比例，然后从每个主题中抽取词汇。LDA的无监督特性使其能够自动发现文本数据中的主题，而无需人工标注，广泛应用于文本挖掘、信息检索、推荐系统和学术分析等领域。其结果通常以概率分布的形式呈现，便于用户理解和解释。

方法论

1. 数据预处理

数据预处理是LDA算法中非常重要的一步，它直接影响到模型的准确性和效率。以下是主要的预处理步骤：

分词：将文本分割成单独的词语，这是文本处理的基础。
去除停用词：停用词是指那些在文本中频繁出现但对主题分析没有帮助的词，如“的”、“是”、“了”等。去除停用词可以减少数据量，提高模型的准确性。
词形还原：将词语还原到其基本形式，例如将“running”还原为“run”。这有助于减少词汇的多样性，使模型更容易发现主题。
构建词典：创建一个包含所有唯一词语的词典，用于将文本转换为数值化的表示。
创建语料库：将预处理后的文本转换为词袋模型（Bag-of-Words），其中每个文档被表示为一个词频向量。

2. LDA模型训练

在完成数据预处理后，可以开始训练LDA模型。以下是主要的训练步骤：

设置参数：确定主题数量（num_topics）、迭代次数（passes）、随机种子（random_state）等参数。这些参数对模型的性能和结果有重要影响。
初始化模型：使用gensim库的LdaModel类初始化LDA模型，并传入预处理后的语料库和词典。
训练模型：通过调用模型的train方法开始训练过程。在训练过程中，模型会不断调整参数以优化主题分布。
评估模型：使用Coherence值等指标评估模型的质量。Coherence值越高，表示模型的主题分布越合理。

3. 模型评估与优化

计算Coherence值：Coherence值是衡量主题模型质量的一种指标，用于评估主题模型生成的主题是否具有语义一致性。
调整参数：根据Coherence值的结果，调整模型的参数，如主题数量、迭代次数等，以优化模型性能。
可视化：使用pyLDAvis库可视化LDA模型的结果，以更直观地理解主题分布。

4. 预测主题，建立相似性模型

主题分配：将新文本转换为词袋模型，然后使用训练好的LDA模型预测其主题分布。
相似性评估：计算两篇长文档的主题分布的 KL 散度，得出相似性。

Python 实现

实验代码：https://github.com/vortezwohl/novel_similarity

读取文档和停用词表

import os
train_data_path = 'data/train'
test_data_path = 'data/test'
train_file_paths = list()
test_file_paths = list()
skip_files = ['stopwords']

for root, dirs, files in os.walk(train_data_path):
    for file in files:
        if file not in skip_files:
            train_file_paths.append(f'{train_data_path}/{file}')

for root, dirs, files in os.walk(test_data_path):
    for file in files:
        if file not in skip_files:
            test_file_paths.append(f'{test_data_path}/{file}')

train_novels = list[tuple]()
test_novels = list[tuple]()

for file_path in train_file_paths:
    with open(file_path, 'r', encoding='gbk') as _f:
        tmp_novel = _f.read()
        train_novels.append((file_path, tmp_novel))

for file_path in test_file_paths:
    with open(file_path, 'r', encoding='gbk') as _f:
        tmp_novel = _f.read()
        test_novels.append((file_path, tmp_novel))

with open('data/stopwords', encoding='utf-8') as _f:
    stopwords = _f.readlines()

对文档进行分词

from src.util.preprocess import split_words

train_seg_lists = list()
test_seg_lists = list()

for novel in train_novels:
    train_seg_lists.append((novel[0], split_words(novel[1], stopwords)))

for novel in test_novels:
    test_seg_lists.append((novel[0], split_words(novel[1], stopwords)))

train_documents = [x[1] for x in train_seg_lists]
test_documents = [x[1] for x in test_seg_lists]

构建词典，并基于词典构建词袋

from gensim import corpora

dictionary = corpora.Dictionary(train_documents)
corpus = [dictionary.doc2bow(x) for x in train_documents]

基于词袋训练 LDA 模型

from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

num_topics = 5

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=64,
    random_state=32
)


cm = CoherenceModel(model=lda_model, texts=train_documents, coherence='u_mass')
coherence = cm.get_coherence()
print(f"Coherence={coherence}")

pyLDAvis.display(gensimvis.prepare(lda_model, corpus, dictionary))

# 输出 Coherence 分数
# Coherence分数衡量了主题内词语之间的语义相似性。它通过评估主题中词语之间的关联性来判断主题的质量。一个高Coherence分数表示主题内的词语在语义上更加相关，从而使得主题更加有意义和可解释。
# 基于 UMass 的 Coherence 越接近 0 代表 LDA 模型性能越好
Coherence=-0.1456777579353652

alt text

使用 LDA 模型计算两篇测试文档的主题分布

# test
test_document_0 = test_documents[0]
test_bow_0 = dictionary.doc2bow(test_document_0)
test_topic_distribution_0 = lda_model.get_document_topics(test_bow_0)
test_topic_distribution_0 = [x[1] for x in test_topic_distribution_0]

test_document_1 = test_documents[1]
test_bow_1 = dictionary.doc2bow(test_document_1)
test_topic_distribution_1 = lda_model.get_document_topics(test_bow_1)
test_topic_distribution_1 = [x[1] for x in test_topic_distribution_1]

print(test_topic_distribution_0)
print(test_topic_distribution_1)

计算两者的 KL 散度

import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> np.float32:
    if not isinstance(p, np.ndarray):
        p = np.asarray(p)
    if not isinstance(q, np.ndarray):
        q = np.asarray(q)
    q = np.where(q == 0, 1e-10, q)
    return np.sum(p * np.log(p / q))

print(kl_divergence(test_topic_distribution_0, test_topic_distribution_1))

KL 散度结果

0.50899243