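Dynamic Topic Modeling with BERTopic on Long Chinese Texts: Mean SentenceTransformer (BERT) Feature Vectors, Whole-Text Features, and Topic Keywords from Word Segmentation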

Topic Models and BERTopic

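The most widely used topic model algorithm is LDA (Latent Dirichlet Allocation), but LDA has a number of shortcomings, for example:

LDA takes the number of topics as an input and is highly sensitive to that value;
LDA suffers from a long-tail problem and performs poorly on datasets with many low-frequency words;
LDA only looks at word frequencies and ignores the relationships between words;
LDA ignores time information, so it is hard to apply to dynamic topic modeling tasks.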
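To make the third limitation concrete, here is a toy sketch (not LDA itself, just the bag-of-words input that LDA works on). The function name `bag_of_words` and the two sample sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings...
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# ...become indistinguishable once only word counts are kept.
same_representation = (a == b)
```

Any model that sees only these counts cannot tell the two sentences apart, which is the gap that contextual embeddings such as BERT features are meant to close.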

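To address these problems, researchers have proposed DTM, ETM, DETM, BERTopic and other methods. BERTopic, proposed in recent years, has attracted a lot of attention. Its main idea is to compute a BERT feature vector for each text as a whole, cluster the texts in that feature space to find topics, extract keywords for each topic with a TF-IDF model (a class-based variant, c-TF-IDF), and finally derive each topic's keyword representation for every time period.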
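The keyword step can be sketched in plain Python. This is a minimal illustration that follows the c-TF-IDF formula from the BERTopic paper, W(t, c) = tf(t, c) * log(1 + A / f(t)), where A is the average number of words per topic and f(t) is the total count of term t over all topics. The function names and the toy data are assumptions for illustration, and the library's own implementation differs in details such as normalization.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[List[str]]]) -> Dict[int, Dict[str, float]]:
    """Score terms per topic: tf(t, c) * log(1 + A / f(t)).

    docs_per_topic maps a topic id to its documents, each a list of tokens.
    All documents of a topic are merged into a single bag of words.
    """
    topic_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_per_topic.items()
    }
    total = Counter()
    for counts in topic_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(topic_counts)  # A in the formula
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for topic, counts in topic_counts.items()
    }


def top_terms(term_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k best-scoring terms, ties broken alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy clusters (hypothetical data): topic 0 is about pets, topic 1 about markets.
clusters = {
    0: [["cat", "dog", "pet"], ["cat", "pet", "food"]],
    1: [["stock", "market", "price"], ["market", "price", "food"]],
}
scores = c_tf_idf(clusters)
pet_keywords = top_terms(scores[0], k=2)
market_keywords = top_terms(scores[1], k=2)
```

The word "food" appears in both topics, so it scores lower than terms that are specific to one topic. For the dynamic step, the same scoring could be rerun on the documents of each topic restricted to one time period.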
However, BERTopic has a few problems of its own:

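BERTopic was designed with English in mind and does not fit Chinese out of the box. English needs no word segmentation because words are naturally separated by spaces, so BERTopic extracts BERT features directly from the English text and then finds each topic's keywords among the space-separated words, which is convenient. Chinese does need segmentation: if features are extracted from the Chinese text as a whole, each topic's keywords must then be extracted from the segmented result;
The features are BERT features, and BERT accepts inputs of at most 512 tokens; anything longer is truncated. BERTopic simply truncates, which is not well suited to long texts.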

This post makes one improvement for each of these two problems:

Extract features from the whole text, extract keywords from the segmented result

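The change is simple: when calling topic_model.fit_transform(), pass in both the original text and the segmented (and stop-word-filtered) result. This requires modifying the source in _bertopic.py, mainly the fit_transform() function.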
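The two inputs can be prepared as parallel views of the same corpus. The sketch below is an illustration only: `build_views`, `toy_segment`, and the tiny vocabulary are hypothetical names, and the toy forward-maximum-matching segmenter merely stands in for a real Chinese word segmenter.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical mini dictionary used only by the toy segmenter below.
VOCAB = {"主题", "模型", "中文", "文本"}


def toy_segment(text: str, vocab: Set[str] = VOCAB, max_len: int = 2) -> List[str]:
    """Greedy forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += len(piece)
                break
    return tokens


def build_views(
    raw_docs: Iterable[str],
    segment: Callable[[str], List[str]],
    stopwords: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (texts_for_embedding, texts_for_keywords).

    The first list is the untouched text, meant for the embedding model.
    The second is segmented, stop-word-filtered, and joined with spaces,
    so that a whitespace-based keyword extractor can work on it.
    """
    raw_docs = list(raw_docs)
    tokenized = []
    for doc in raw_docs:
        tokens = [t.strip() for t in segment(doc)]
        tokens = [t for t in tokens if t and t not in stopwords]
        tokenized.append(" ".join(tokens))
    return raw_docs, tokenized


raw, tokenized = build_views(["中文的主题模型"], toy_segment, stopwords={"的"})
```

In practice `segment` would be a real segmenter (for example `jieba.lcut`, if jieba is installed). It may also be worth checking whether the installed BERTopic version accepts precomputed embeddings through the `embeddings` argument of `fit_transform()`; if it does, embedding the raw view separately and passing the tokenized view as the documents could achieve the same split without patching the source.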

Extract BERT features for every 512 characters of the text, then average them as the text feature

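This change is also simple. Reading the source shows that feature extraction is mainly done by the encode() function in SentenceTransformer.py of the sentence-transformers package, so that function is modified as follows: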

    def encode(self, sentences: Union[str, List[str]],
               batch_size: int = 1,
               show_progress_bar: bool = None,
               output_value: str = 'sentence_embedding',
               convert_to_numpy: bool = True,
               convert_to_tensor: bool = False,
               device: str = None,
               normalize_embeddings: bool = False) -> Union[List[Tensor], ndarray, Tensor]:
        """
        Computes sentence embeddings

        :param sentences: the sentences to embed
        :param batch_size: the batch size used for the computation
        :param show_progress_bar: Output a progress bar when encode sentences
        :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
        :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
        :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
        :param device: Which torch.device to use for the computation
        :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.

        :return:
           By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
        """
        self.eval()
        if show_progress_bar is None:
            show_progress_bar = (logger.getEffectiveLevel()==logging.INFO or logger.getEffectiveLevel()==logging.DEBUG)

        if convert_to_tensor:
            convert_to_numpy = False

        if output_value != 'sentence_embedding':
            convert_to_tensor = False
            convert_to_numpy = False

        input_was_string = False
        if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
            sentences = [sentences]
            input_was_string = True

        if device is None:
            device = self._target_device

        self.to(device)

        all_embeddings = []
        length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
        sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
        maxworklength = 512 # extract features for at most maxworklength characters at a time

        # Each document is processed on its own: its 512-character chunks form one batch.
        # The original loop stepped by batch_size, which skipped documents whenever
        # batch_size > 1; iterating over every index avoids that. batch_size is kept
        # only for signature compatibility.
        for start_index in trange(len(sentences_sorted), desc="Batches", disable=False):
            tempsentence = sentences_sorted[start_index]
            sentence_length = len(tempsentence)
            if sentence_length%maxworklength:
                numofclip = sentence_length//maxworklength+1
            else:
                numofclip = sentence_length//maxworklength
            # Note: chunks are cut by characters; the tokenizer may still truncate
            # each chunk to the model's max_seq_length in tokens.
            if sentence_length:
                features = self.tokenize([tempsentence[clipi*maxworklength:(clipi+1)*maxworklength] for clipi in range(numofclip)])
            else:
                features = self.tokenize([''])
            features = batch_to_device(features, device)

            with torch.no_grad():
                out_features = self.forward(features)

                if output_value == 'token_embeddings':
                    embeddings = []
                    for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                        last_mask_id = len(attention)-1
                        while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                            last_mask_id -= 1

                        embeddings.append(token_emb[0:last_mask_id+1])
                elif output_value is None:  #Return all outputs
                    embeddings = []
                    for sent_idx in range(len(out_features['sentence_embedding'])):
                        row =  {name: out_features[name][sent_idx] for name in out_features}
                        embeddings.append(row)
                else:   #Sentence embeddings
                    embeddings = out_features[output_value]
                    embeddings = embeddings.detach()
                    if normalize_embeddings:
                        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

                    # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                    if convert_to_numpy:
                        embeddings = embeddings.cpu() # shape: [number of chunks, 768]

                # Average the chunk embeddings into one vector per document.
                # This only makes sense for the default 'sentence_embedding' output.
                all_embeddings.append(np.average(embeddings, axis=0).tolist())

        # Restore the original document order.
        all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]

        # The original convert_to_tensor / input_was_string post-processing is disabled;
        # a 2-D numpy array with one row per document is always returned.
        return np.array(all_embeddings)

Done.
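The chunk-and-average idea can also be sketched outside the library. The snippet below is a simplified illustration, not the patched encode() itself: `split_chunks`, `mean_embedding`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for a real embedding model so that the example runs on its own.

```python
from typing import Callable, List, Sequence


def split_chunks(text: str, size: int = 512) -> List[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if not text:
        return [""]  # keep one (empty) chunk so an embedding is still produced
    return [text[i:i + size] for i in range(0, len(text), size)]


def mean_embedding(
    text: str,
    embed_chunks: Callable[[List[str]], Sequence[Sequence[float]]],
    size: int = 512,
) -> List[float]:
    """Embed every chunk and average the vectors dimension by dimension."""
    vectors = embed_chunks(split_chunks(text, size))
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(chunks: List[str]) -> List[List[float]]:
    """Hypothetical 2-d 'embedding': (chunk length, number of 'a' characters)."""
    return [[float(len(c)), float(c.count("a"))] for c in chunks]


chunk_lengths = [len(c) for c in split_chunks("a" * 1025)]
doc_vector = mean_embedding("aabb", toy_embed, size=2)
```

With an unmodified SentenceTransformer model, `embed_chunks` could presumably be the model's own `encode` method, which would keep the averaging outside the library source. Note that every chunk gets the same weight in the mean, so a short final chunk counts as much as a full 512-character one.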
