I recently came across an interesting paper titled "Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free." The recent trend in LLM architecture is decoder-only models, which are generally considered unsuitable as embedding models because of their causal attention. However, the authors reveal that Mixture-of-Experts (MoE) LLMs can serve as embedding models for a variety of embedding-focused tasks without any further fine-tuning. In this post I will first review MoE and explain how it works, then walk through a practical implementation.
Table of Contents
- What is Mixture-of-Experts (MoE)?
- How does MoE work as an embedding model?
- Practical implementation: MoEE with BERTopic
1. What is Mixture-of-Experts (MoE)?
Mixture-of-Experts (MoE) is an architecture with multiple sub-networks, called "experts," each of which specializes in a different task or aspect of the data. One advantage of MoE is that it can be pre-trained with much less compute than a dense model of the same or larger size while maintaining or improving quality. So, on a limited budget, MoE lets us obtain a better model than a dense model of comparable size. As a recent success, Mixtral 8x7B outperforms LLaMA 2 70B on many evaluation datasets.
Let's now examine the MoE architecture. Recent successful MoEs build on the transformer, so I will focus on the popular transformer-based MoE architecture. An MoE has two main components, described below.
- MoE layers
In the transformer architecture, MoE replaces the feed-forward network (FFN) layers with MoE layers. Each MoE layer contains several experts (e.g., the four experts in the figure above), and each expert is itself a simple FFN. Note that the other transformer components, such as the self-attention layers, are shared. As a result, an MoE's parameter count is not simply the number of experts times the size of a dense model. For example, Mixtral 8x7B has 47B parameters, not 8 x 7 = 56B, because all layers other than the MoE layers share the same weights. The sketch below makes this arithmetic concrete.
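As a back-of-the-envelope sketch of this counting, the split between shared and per-expert parameters below is a rough assumption for illustration, not Mixtral's published configuration:

```python
# Rough illustration of MoE parameter counting. The numbers are
# assumptions for the sake of arithmetic, not Mixtral's exact config.
shared = 1.3e9      # attention, embeddings, norms: stored once
per_expert = 5.7e9  # one expert's FFN weights, summed over all layers
num_experts = 8

naive = num_experts * (shared + per_expert)  # the misleading "8 x 7B" reading
actual = shared + num_experts * per_expert   # shared weights stored only once
print(f"naive: {naive/1e9:.0f}B, actual: {actual/1e9:.0f}B")  # ~56B vs ~47B
```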
- Gating network
The gating network, or router, is the key component of an MoE. It receives the input tokens and selects the most relevant experts for each token. For example, in the figure above, the router on the left selects the second expert to process the token "More," while the router on the right assigns the token "Parameters" to the first expert. In general, the gating network picks the top-k experts relevant to a given token and dispatches the token to them; Mixtral 8x7B, for instance, picks the top-2 experts.
How do we choose the top-k experts? We compute each expert's relevance probability with a softmax function and keep the experts with the top-k probabilities, as shown below. The figure zooms in on the gating part of the earlier diagram.
The gating network has its own weights. We apply the softmax function to the dot product between the input token and the gating network's weights, which gives the probability that each expert is relevant to the given token. Based on these probabilities, we pick the top-k relevant experts. An MoE with this kind of gating network is called a sparse MoE. A minimal sketch of such a gate follows.
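Here is a minimal PyTorch sketch of a sparse top-k gate. It is not the paper's code; the layer sizes and top-k value are arbitrary assumptions for illustration:

```python
import torch
import torch.nn.functional as F

hidden_dim, num_experts, top_k = 2048, 8, 2  # arbitrary illustrative sizes

# Gating network weights: one logit per expert for each token.
w_gate = torch.randn(hidden_dim, num_experts)

def route(token_embedding: torch.Tensor):
    """Return the top-k expert indices and their normalized weights."""
    logits = token_embedding @ w_gate          # (num_experts,)
    probs = F.softmax(logits, dim=-1)          # expert relevance probabilities
    topk_probs, topk_idx = probs.topk(top_k)   # keep the k most relevant experts
    topk_probs = topk_probs / topk_probs.sum() # renormalize over selected experts
    return topk_idx, topk_probs

token = torch.randn(hidden_dim)
experts, weights = route(token)
print(experts, weights)  # e.g. tensor([3, 5]) with weights summing to 1
```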
These are the basics needed to understand how MoE can work as an embedding model. Now, let's dig into how it actually does.
2. How does MoE work as an embedding model?
A quick recap of embeddings
Before diving into this section's main topic, let's quickly recap embeddings. An embedding is a deep learning model's internal representation of the input data, carrying condensed semantic information about it. We typically extract a neural network's last hidden state as the embedding, as shown below.
We usually use encoder-based models to extract embeddings because, unlike decoder-only models, they can capture semantics with bidirectional attention. Decoder-only models generally use causal attention, which lets a token interact only with preceding tokens, so they cannot capture rich semantics such as full contextual information as well as encoder-based models can. The toy masks below illustrate the difference.
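A toy sketch of the two attention schemes, where a 1 means the query token (row) may attend to the key token (column):

```python
import torch

seq_len = 4
# Decoder-only (causal): token i can only attend to tokens j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
# Encoder (bidirectional): every token can attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)
print(causal_mask)
```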
How does MoE work as an embedding model?
It was previously widely believed that decoder-only models could not be used for embedding extraction. However, the authors found that the routing weights in an MoE provide information complementary to the decoder's hidden-state embedding. The routing weights in each layer reflect the router's reasoning about the input tokens, so they carry semantic information about the input that the hidden-state embedding may lose. Mathematically, we can describe this as:
$$\mathrm{RW} \;=\; \big[\, g\big(H^{(1)} W_g^{(1)}\big)\,;\; \ldots\,;\; g\big(H^{(L)} W_g^{(L)}\big) \,\big]$$

where $g$ is the softmax function and $H^{(l)}$ denotes the hidden state entering MoE layer $l$ with gating weights $W_g^{(l)}$. We concatenate the routing weights of all $L$ MoE layers to avoid losing any of the model's routing decisions.
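The following sketch shows how these quantities can be obtained in practice with the transformers API, which returns per-layer router logits for MoE models when `output_router_logits=True` is passed. The model is the one used later in this article, and the mean pooling over tokens is my simplification, not necessarily the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("MoE routing weights carry semantics.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True, output_hidden_states=True)

# One router-logit tensor per MoE layer, shape (num_tokens, num_experts).
routing = [torch.softmax(l.float(), dim=-1) for l in out.router_logits]
# Concatenate the per-layer routing distributions, mean-pooled over tokens.
rw = torch.cat([r.mean(dim=0) for r in routing])   # (num_layers * num_experts,)
hs = out.hidden_states[-1].mean(dim=1).squeeze(0)  # last hidden state, mean-pooled
print(rw.shape, hs.shape)
```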
To take full advantage of both the routing weights and the decoder embedding, the authors propose a method called MoE Embedding (MoEE) that forms a more comprehensive embedding representation. MoEE comes in two variants. The first is a concatenation-based combination, described below.
This approach is straightforward: we simply concatenate the routing weights and the decoder embedding. The authors call it MoEE(concat). It preserves the distinct information captured by each component while letting downstream tasks exploit the combined representation.
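Assuming the `rw` and `hs` vectors from the sketch above, MoEE(concat) reduces to a single concatenation:

```python
import numpy as np

# MoEE(concat): stack the routing-weight vector and the hidden-state
# embedding into one feature vector for downstream tasks.
moee_concat = np.concatenate([rw.numpy(), hs.float().numpy()])
print(moee_concat.shape)  # (num_layers * num_experts + hidden_dim,)
```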
The other variant is a weighted-sum integration, denoted MoEE(sum). It takes a weighted sum of the similarity scores computed from the routing weights and from the hidden-state (HS) embedding, and is used for tasks that compare two sentences, such as semantic textual similarity.
$$s \;=\; s_{\mathrm{HS}} \;+\; \alpha \, s_{\mathrm{RW}}$$

where $\alpha$ is a hyperparameter controlling the contribution of the routing weights, and $s_{\mathrm{RW}}$ and $s_{\mathrm{HS}}$ are the similarity scores computed from the routing weights and the hidden states, respectively. After computing the similarity score for each pair, we measure the rank correlation, e.g., Spearman's rank correlation, between the computed scores and the ground-truth similarities.
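Here is a minimal sketch of MoEE(sum). The use of cosine similarity and the default alpha are my assumptions for illustration, and the scores at the bottom are toy values:

```python
import numpy as np
from scipy.stats import spearmanr

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def moee_sum_score(rw1, hs1, rw2, hs2, alpha: float = 1.0) -> float:
    """MoEE(sum): weighted sum of the hidden-state similarity and the
    routing-weight similarity; alpha weights the routing-weight term."""
    return cos_sim(hs1, hs2) + alpha * cos_sim(rw1, rw2)

# Evaluation over a dataset of sentence pairs: rank-correlate the
# predicted scores with the gold similarity labels.
pred_scores = np.array([0.9, 0.2, 0.6])  # toy predicted scores
gold_scores = np.array([5.0, 1.0, 3.5])  # toy ground-truth similarities
rho, _ = spearmanr(pred_scores, gold_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```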
For practical use, I find MoEE(concat) the easiest to work with. In addition, the authors leverage the PromptEOL technique [4] to enhance MoEE. This technique prompts the model with a template that constrains the LLM to compress the sentence's semantics into the prediction of the next token, shown below.
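For reference, the template from the PromptEOL paper [4] looks like the following, where the input sentence replaces the [text] placeholder:

```text
This sentence : "[text]" means in one word:"
```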
Now, here is the performance table across MTEB tasks.
MoEE with PromptEOL works better than the supervised and self-supervised baselines. Note that this leaderboard is not the latest, so the result is not SOTA. The value of this method is that we can get decent results on embedding tasks out of the box, with no further training.
So far, we have covered how MoEE works. In the next section, we will implement MoEE with BERTopic and cluster sentences.
3. Practical implementation: MoEE with BERTopic
In this section, we extract embeddings from a pre-trained MoE LLM and use them with BERTopic on the 20 Newsgroups dataset. For reference, BERTopic is a convenient topic-modeling library that goes beyond traditional statistical topic modeling. It leverages transformer embeddings for topic clustering, so I think it is well suited for checking embedding quality. First, let's prepare the environment.
Environment setup
I used a conda environment with Python 3.10. The experiments ran on Ubuntu 20.04 with CUDA 12.4 and 16 GB of VRAM. You may need 32 GB of RAM to download the model weights.
```bash
conda create -n moee python=3.10 -y
conda activate moee
```
Next, we install the libraries below via pip.
```bash
pip install transformers torch bitsandbytes bertopic accelerate
```
MoE models generally demand a lot of VRAM because the entire model, including all experts, must be loaded into VRAM up front. We therefore use the quantization library bitsandbytes to save VRAM.
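For reference, this is what 4-bit loading with bitsandbytes typically looks like in transformers. The MoE-Embedding repository handles quantization internally, so this sketch only shows the mechanism:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit (Linear4bit)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    quantization_config=bnb_config,
    device_map="auto",
)
```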
We also need to clone the official GitHub repository.
```bash
git clone https://github.com/tianyi-lab/MoE-Embedding.git
```
All the preparation is done. Now, let's implement topic clustering with BERTopic on top of MoEE.
MoEE with BERTopic
We will now use MoEE as BERTopic's embedding model and try topic clustering. The official repository lets us use small MoE models such as Qwen1.5-MoE-A2.7B or OLMoE-1B-7B. In this article I use OLMoE-1B-7B, which fits inference on 16 GB of VRAM. First, we load OLMoE-1B-7B.
```python
kwargs = {
    "base_model": 'allenai/OLMoE-1B-7B-0924',
    "normalized": False,
    "torch_dtype": torch.bfloat16,
    "mode": "embedding",
    "pooling_method": "mean",       # mean-pool token embeddings
    "attn_implementation": "sdpa",
    "attn": "bbcc",                 # repository-specific attention setting
}
config = {
    'embed_method': 'prompteol',    # wrap inputs in the PromptEOL template
    'emb_info': 'MoEE'              # return the combined MoEE representation
}
embedding_model = MOEE(model_name_or_path='allenai/OLMoE-1B-7B-0924', **kwargs)
```
Next, we compute the embeddings of the 20 Newsgroups dataset to pass to BERTopic.
```python
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

dataset = MyDataset(docs)
dataloader = DataLoader(dataset=dataset, batch_size=8)

embeddings = None
for batch in tqdm(dataloader):
    with torch.no_grad():
        embedding = embedding_model.encode(batch, **config)
    if embeddings is None:
        embeddings = embedding[0]
    else:
        embeddings = np.vstack((embeddings, embedding[0]))
    torch.cuda.empty_cache()
```
To pre-compute the embeddings, we use `torch.utils.data.DataLoader` as an iterator and encode each batch of documents. Note that we must pass the embeddings to BERTopic as a NumPy array (e.g., via `np.asarray`).
If you want to use your own MoE model, you will have to implement the extraction of routing weights from each MoE layer yourself. For the hidden-state embedding, we can rely on HuggingFace transformers: we only need to pass `output_hidden_states=True` at inference time.
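Continuing from the earlier routing-weight sketch (same `model` and `inputs`), the hidden-state side looks like this. The mask-aware mean pooling is a common convention, not necessarily the repository's exact implementation:

```python
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden = out.hidden_states[-1]            # (batch, seq_len, hidden_dim)
mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
hs_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(hs_embedding.shape)                      # (batch, hidden_dim)
```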
Now we can run the topic modeling.
```python
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,           # Step 1 - Extract embeddings
    umap_model=umap_model,                     # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,               # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,         # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                 # Step 5 - Extract topic words
    representation_model=representation_model  # Step 6 - (Optional) Fine-tune topic representations
)

# Topic modeling using the BERTopic model with precomputed embeddings
topics, probs = topic_model.fit_transform(docs, embeddings)
```
With the default settings we got 42 topics; below are a few examples. Although I picked them at random, they capture the semantics well.
In addition, here is the topic-cluster visualization.
Look at the red circle in the topic-cluster visualization. It marks topic 0, which is about computers. The topics closest to it also relate to machine vocabulary, such as graphics, digital hardware, and printers.
This method shows that we can obtain decent embeddings without any training. Although there is still room to reach the quality of SOTA supervised models, the paper's findings are a good step toward further improving training-free embedding extraction.
The full code is below for reference. You need to place this file at the top level of the MoE-Embedding directory.
```python
import sys
sys.path.append('.')

import re
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

from moee import MOEE
```
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
```

```text
'cuda'
```
Load dataset
```python
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

def remove_punctuation(x: str) -> str:
    cleaned = re.sub(r"[!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n -' ]", " ", x)
    return cleaned

def clean_caption(x: str) -> str:
    # lowercase the text
    x = x.lower()
    # remove URLs and punctuation
    x = re.sub(r"http\S+", "", x)
    x = re.sub(r"www\.\S+", "", x)
    x = remove_punctuation(x)
    # collapse repeated spaces
    x = re.sub(r" +", " ", x)
    return x

docs = [clean_caption(doc) for doc in docs]
```
Define MoEE and BERTopic
```python
kwargs = {
    "base_model": 'allenai/OLMoE-1B-7B-0924',
    "normalized": False,
    "torch_dtype": torch.bfloat16,
    "mode": "embedding",
    "pooling_method": "mean",
    "attn_implementation": "sdpa",
    "attn": "bbcc",
}
config = {
    'embed_method': 'prompteol',
    'emb_info': 'MoEE'
}
embedding_model = MOEE(model_name_or_path='allenai/OLMoE-1B-7B-0924', **kwargs)
```
```text
self.model: OlmoeForCausalLM(
  (model): OlmoeModel(
    (embed_tokens): Embedding(50304, 2048, padding_idx=1)
    (layers): ModuleList(
      (0-15): 16 x OlmoeDecoderLayer(
        (self_attn): OlmoeSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (q_norm): OlmoeRMSNorm((2048,), eps=1e-05)
          (k_norm): OlmoeRMSNorm((2048,), eps=1e-05)
        )
        (mlp): OlmoeSparseMoeBlock(
          (gate): Linear4bit(in_features=2048, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x OlmoeMLP(
              (gate_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
              (up_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
              (down_proj): Linear4bit(in_features=1024, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): OlmoeRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): OlmoeRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): OlmoeRMSNorm((2048,), eps=1e-05)
    (rotary_emb): OlmoeRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=50304, bias=False)
)
```
```python
class MyDataset(Dataset):
    """Wraps the documents so they can be batched with a DataLoader."""
    def __init__(self, docs, truncate_token_num: int = 300):
        self.docs = docs
        self.truncate_token_num = truncate_token_num

    def __len__(self):
        return len(self.docs)

    def __getitem__(self, idx):
        # Truncate long documents (by characters, despite the argument name)
        # to keep the encoder input short.
        if len(self.docs[idx]) > self.truncate_token_num:
            return self.docs[idx][:self.truncate_token_num]
        return self.docs[idx]
```
```python
dataset = MyDataset(docs)
dataloader = DataLoader(dataset=dataset, batch_size=16)

embeddings = None
for batch in tqdm(dataloader):
    with torch.no_grad():
        embedding = embedding_model.encode(batch, **config)
    if embeddings is None:
        embeddings = embedding[0]
    else:
        embeddings = np.vstack((embeddings, embedding[0]))
    torch.cuda.empty_cache()
```

```text
100%|██████████| 2356/2356 [43:44<00:00, 1.11s/it]
```

```python
np.save('embedding.npy', embeddings)
```
```python
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,           # Step 1 - Extract embeddings
    umap_model=umap_model,                     # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,               # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,         # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                 # Step 5 - Extract topic words
    representation_model=representation_model  # Step 6 - (Optional) Fine-tune topic representations
)

topics, probs = topic_model.fit_transform(docs, embeddings)
```
```python
topic_model.get_topic_info()
```

```text
Topic Count Name Representation Representative_Docs
0 -1 5271 -1_christian_church_believe_read [christian, church, believe, read, god, eviden... [i have come across what i consider to be an e...
1 0 4110 0_dos_os_windows_microsoft [dos, os, windows, microsoft, ms, pc, mac, dis... [ \t munch \t munch following is reformatted...
2 1 1057 1_scripture_christianity_christians_bible [scripture, christianity, christians, bible, c... [ this is something i ve always found a littl...
3 2 1022 2_flyers_puck_nhl_leafs [flyers, puck, nhl, leafs, sabres, bruins, pla... [the flyers closed out the season last night w...
4 3 963 3_riding_driving_wheel_bike [riding, driving, wheel, bike, ride, honda, bi... [sixteen days i had put off test driving the h...
5 4 902 4_comics_hulk_sale_list [comics, hulk, sale, list, wolverine, forsale,... [the following comics are for auction the hig...
6 5 696 5_firearms_guns_handgun_gun [firearms, guns, handgun, gun, crime, criminal... [ because the gun loonies were firing on vehic...
7 6 626 6_infections_clinical_diseases_infection [infections, clinical, diseases, infection, ca... [ one of the responsibilities of a licensed ph...
8 7 567 7_maybe_mailing_probably_does [maybe, mailing, probably, does, say, hope, gu... [ oh yes i m quite sure they will , \ti looke...
9 8 480 8_nasa_spacecraft_shuttle_satellite [nasa, spacecraft, shuttle, satellite, orbit, ... [ in fact you probably want to avoid us govern...
10 9 478 9_clipper_encryption_decrypt_cryptography [clipper, encryption, decrypt, cryptography, c... [it looks like dorothy denning s wrong headed ...
11 10 380 10____ [, , , , , , , , , ] [, , ]
12 11 290 11_palestinians_israeli_israelis_gaza [palestinians, israeli, israelis, gaza, gazans... [many of you ask me whether i approve of sever...
13 12 249 12_ax_9f_qax_b8f [ax, 9f, qax, b8f, kn, 6um, pl, m9, max, k8] [ part 13 of 14 mtm 3v9f0 7ey 7e...
14 13 206 13_armenians_armenian_armenia_azerbaijanis [armenians, armenian, armenia, azerbaijanis, a... [accounts of anti armenian human rights violat...
15 14 167 14_archive_graphics_formats_information [archive, graphics, formats, information, data... [archive name graphics resources list part1 la...
16 15 145 15_grounded_grounding_ground_outlets [grounded, grounding, ground, outlets, wiring,... [ no no nooo the ground green wire is for ...
17 16 133 16_scorer_pittsburgh_pts_pp [scorer, pittsburgh, pts, pp, stl, 78, 43, det... [scoring stats for the swedish nhl players apr...
18 17 103 17____ [, , , , , , , , , ] [ , and a vga monitor e mail , cica indiana ...
19 18 97 18_supplementation_vitamin_vitamins_cancer [supplementation, vitamin, vitamins, cancer, c... [ i ll tell you all that i know about chromium...
20 19 87 19_batteries_radio_battery_electronics [batteries, radio, battery, electronics, elect... [ in order to emit blue light a semiconductor ...
21 20 86 20_nasa_spacecraft_saturn_astronomy [nasa, spacecraft, saturn, astronomy, satellit... [archive name space references last modified ...
22 21 75 21_investigation_bombing_evidence_news [investigation, bombing, evidence, news, witne... [i told some friends of mine two weeks ago tha...
23 22 54 22_stephanopoulos_briefing_secretary_president [stephanopoulos, briefing, secretary, presiden... [the white house office of the press...
24 23 51 23_send_entries_dos_fpu [send, entries, dos, fpu, slip, pktmux, guidel... [here are the standings after game 1 of each o...
25 24 50 24____ [, , , , , , , , , ] [there seems to be a p pds slot in the above p...
26 25 44 25_islamic_islam_quran_qur [islamic, islam, quran, qur, muslim, muslims, ... [ secular laws seem to value criminal life mor...
27 26 42 26_nonsense_claims_censorship_argument [nonsense, claims, censorship, argument, claim... [ i m going to cut rex s ramblings down a bit ...
28 27 37 27_paintshop_contacting_sold_sent [paintshop, contacting, sold, sent, thanks, f5... [found it thanks i got several offers for help...
29 28 35 28_homosexuality_homosexual_homosexuals_hetero... [homosexuality, homosexual, homosexuals, heter... [ can someone tell me why when mr cramer spo...
30 29 35 29_sphere_triangulation_algorithms_perpendicular [sphere, triangulation, algorithms, perpendicu... [ good i had a bad feeling about this prob...
31 30 32 30_shortstop_pitchers_outfielder_hitters [shortstop, pitchers, outfielder, hitters, bas... [ he s not gone yet the position opening is d...
32 31 32 31_skepticism_geb_n3jxp_gordon [skepticism, geb, n3jxp, gordon, intellect, in... [ senile keratoses have nothing to do with th...
33 32 31 32_militia_amendment_constitution_firearm [militia, amendment, constitution, firearm, li... [ actually the words a well regulated milita ...
34 33 30 33_subscribe_unsubscribe_subscrive_email [subscribe, unsubscribe, subscrive, email, wan... [please subscribe me , please subscribe me , p...
35 34 28 34_speeding_manslaughter_policeman_cop [speeding, manslaughter, policeman, cop, court... [pmoloney maths tcd ie paul moloney writes n...
36 35 28 35_modems_modem_mhz_tcp [modems, modem, mhz, tcp, digital, signal, mai... [ db 25\tdb 9 pin \tpin \tname\teia\tccitt\tdt...
37 36 24 36_dial_0055_800_930314 [dial, 0055, 800, 930314, number, 9000, 8287, ... [1 800 832 4778 western digital s voice mail ...
38 37 24 37_inkjet_inkjets_printers_laserjet [inkjet, inkjets, printers, laserjet, deskjet,... [fyi the actual horizontal dot placement reso...
39 38 24 38_rangers_adams_quakers_ivy [rangers, adams, quakers, ivy, douglass, hope,... [ i think that they go to divisional records b...
40 39 21 39_homosexual_percent_sexual_majority [homosexual, percent, sexual, majority, percen... [ from the santa rosa cal press democrat apr...
41 40 19 40_irony_cycnicism_sarcasm_acetone [irony, cycnicism, sarcasm, acetone, humour, k... [ \t1 they are religious parodies not atheisti...
42 41 15 41_autobiography_author_book_books [autobiography, author, book, books, bookstore... [this is the story of kent the archetype finn ...
```
```python
topic_model.get_topic(0)
```

```text
[('dos', np.float32(0.45857304)),
('os', np.float32(0.43415424)),
('windows', np.float32(0.40028214)),
('microsoft', np.float32(0.32284227)),
('ms', np.float32(0.31080914)),
('pc', np.float32(0.28627717)),
('mac', np.float32(0.2705468)),
('disk', np.float32(0.26714522)),
('scsi', np.float32(0.24755469)),
('cx', np.float32(0.2305391))]
```
```python
topic_model.get_topic(2)
```

```text
[('flyers', np.float32(0.5347663)),
('puck', np.float32(0.4863899)),
('nhl', np.float32(0.4710263)),
('leafs', np.float32(0.4642067)),
('sabres', np.float32(0.45007592)),
('bruins', np.float32(0.41095752)),
('playoffs', np.float32(0.39904732)),
('hockey', np.float32(0.3952221)),
('pitching', np.float32(0.39289254)),
('braves', np.float32(0.37793285))]
```
```python
topic_model.get_topic(29)
```

```text
[('sphere', np.float32(0.42566895)),
('triangulation', np.float32(0.42115515)),
('algorithms', np.float32(0.37481007)),
('perpendicular', np.float32(0.36362517)),
('algorithm', np.float32(0.35225672)),
('3d', np.float32(0.351159)),
('coplanar', np.float32(0.31972635)),
('circle', np.float32(0.29665813)),
('vertices', np.float32(0.28228626)),
('bisector', np.float32(0.2748276))]
```
```python
topic_model.visualize_topics()
```

Note: in my environment this call raised `ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed` when plotly tried to render the figure inline; installing `nbformat>=4.2.0` resolves it.
References:
[1] "Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free." https://arxiv.org/pdf/2410.10814
[2] "Mixture of Experts Explained." Hugging Face Blog. https://huggingface.co/blog/moe
[3] "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." https://arxiv.org/pdf/2101.03961
[4] "Scaling Sentence Embeddings with Large Language Models." https://arxiv.org/pdf/2307.16645