Mixture of Experts – Introducing the Latest arXiv Papers

FlexOlmo: Open Language Models for Flexible Data Use

FlexOlmo: Open Language Models for Flexible Data Use [184.9]
We introduce FlexOlmo, a new class of language models (LMs) that supports distributed training without data sharing. FlexOlmo employs a mixture-of-experts architecture in which each expert is trained independently on closed datasets. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 16:54:21 GMT)
"Standard MoEs train all experts and the router jointly on all data. In contrast, FLEXOLMO trains experts independently by teaching them to coordinate (§3.3.1) and merges them at inference using a domain-informed router (§3.3.2)." This is a proposal and empirical validation of a framework in which AI models built at separate sites operate together as one system, the kind of thing that comes to mind when you hear "federated learning" or "MoE" but is hard to achieve in practice.
"Organizations in regulated industries require LMs that can leverage their closed datasets while maintaining strict data privacy and access controls. Healthcare institutions, financial firms, and other entities possess valuable domain-specific data but cannot share it externally due to HIPAA, GDPR [14, 15], data sovereignty laws [16], and intellectual property (IP) protections. These organizations need training paradigms that enable AI improvement on their sensitive data while ensuring such sensitive data never leaves certain environments and can be removed from the model after training, e.g., when data usage rights expire. In such settings, modular training approaches, where individual experts are trained independently and asynchronously on locally maintained data, are essential." This is exactly right, and the technique looks very useful.
The project site is Introducing FlexOlmo: a new paradigm for language model training and data collaboration | Ai2; the repository is GitHub – allenai/FlexOlmo: Code and training scripts for FlexOlmo
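
To make the "train independently, merge at inference, remove later" idea concrete, here is a minimal toy sketch. It is not FlexOlmo's actual code: the class `MergedRouter`, its method names, the two-dimensional router embeddings, and the toy expert functions are all assumptions made for illustration. It only shows that a router keyed by per-expert embeddings lets an expert be added or dropped without touching the others.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class MergedRouter:
    """Hypothetical router assembled from independently contributed experts."""

    def __init__(self):
        # name -> (router_embedding, expert_fn); both supplied by the data owner
        self.experts = {}

    def add_expert(self, name, router_embedding, expert_fn):
        self.experts[name] = (router_embedding, expert_fn)

    def remove_expert(self, name):
        # Opt-out: dropping the entry excludes the expert from all later routing
        del self.experts[name]

    def route(self, x):
        # Score each expert by the dot product of its embedding with the input
        names = list(self.experts)
        scores = [sum(w * xi for w, xi in zip(self.experts[n][0], x)) for n in names]
        return dict(zip(names, softmax(scores)))

    def forward(self, x):
        # Weighted sum of the expert outputs
        out = [0.0] * len(x)
        for name, weight in self.route(x).items():
            y = self.experts[name][1](x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out

router = MergedRouter()
router.add_expert("public", [1.0, 0.0], lambda v: [2.0 * a for a in v])
router.add_expert("medical", [0.0, 1.0], lambda v: [a + 1.0 for a in v])

x = [1.0, 0.0]
weights_before = router.route(x)   # both experts receive some weight
router.remove_expert("medical")    # e.g. data usage rights expired
weights_after = router.route(x)    # only the public expert remains
output_after = router.forward(x)
```

In this toy version, removing an expert is a dictionary deletion; how the real system trains experts to coordinate and builds its domain-informed router is described in the paper sections quoted above.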

Llama 4, Nemotron-H, Pangu Ultra, Kimi-VL, Kimi-VL-Thinking, Deep Coder

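There was plenty of LLM news again last week, and the Llama 4 announcement was among the biggest items (The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation). It uses an MoE architecture and claims high performance, but some third-party evaluations have been underwhelming, and others suggest that quantization (and the resulting performance degradation) has a large effect, so it seems best to wait until the evaluation results are all in.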

NVIDIA announced Nemotron-H, a Mamba-Transformer hybrid (Nemotron-H: A Family of Accurate, Efficient Hybrid Mamba-Transformer Models – NVIDIA ADLR). The statement "Nemotron-H has been used as the backbone for Cosmos-Reason 1, a very strong VLM for physical AI." is also worth noting.

Huawei has released a paper on Pangu Ultra, but the detailed PDF does not appear to be publicly available. The abstract contains the intriguing statement "To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1", and I am curious about the details.

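Kimi-VL is a strong MLLM, and a distinctive feature of this open model is that it also comes as an LRM, Kimi-VL-Thinking (moonshotai/Kimi-VL-A3B-Instruct · Hugging Face). Open models are evolving quickly too, for example DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level, which claims o3-mini-level performance. Efforts to strengthen open models, such as Introducing Cogito Preview (Cogito v1 Preview – a deepcogito Collection), are also producing a range of results, and the performance of open models keeps improving.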

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models [164.5]
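Nemotron-H is a family of 8B and 56B/47B hybrid Mamba-Transformer models. We replace the majority of the self-attention layers in the common Transformer model architecture with Mamba layers. Nemotron-H models offer either better or on-par accuracy compared with other similarly sized open-source Transformer models.
Reference translation of the paper (metadata) (Fri, 04 Apr 2025 17:41:58 GMT)
A fast, high-performance Mamba-hybrid LLM.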

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs [123.3]
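We present Pangu Ultra, a large language model (LLM) with 135 billion parameters and dense Transformer modules. To perform such large-scale training efficiently, we use 8,192 Ascend NPUs together with a series of system optimizations. Our exploration shows that Ascend NPUs can efficiently and effectively train dense models with more than 100 billion parameters.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 15:41:51 GMT)
Huawei's LLM. It is reportedly built using Huawei's own accelerators, but the paper cannot be accessed at present. I am curious about the details.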

Kimi-VL Technical Report [88.1]
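Kimi-VL is a vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities. As a general-purpose VLM, Kimi-VL excels at multi-turn agent tasks (such as OSWorld), matching flagship models. Building on Kimi-VL, we introduce Kimi-VL-Thinking, an advanced long-thinking variant.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 06:48:26 GMT)
A multimodal LLM that also performs well on agent tasks. The Thinking version performs strongly relative to its parameter count.
The repositories are GitHub – MoonshotAI/Kimi-VL: Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities and moonshotai/Kimi-VL-A3B-Instruct · Hugging Face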

Mixture of Hidden-Dimensions Transformer

Mixture of Hidden-Dimensions Transformer [50.4]
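We study the sparsity of hidden dimensions and observe that trained Transformers use only a small fraction of token dimensions. We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. It achieves 1.7% higher performance with 50% fewer activation parameters, and 3.7% higher performance with a 3× parameter expansion at constant activation cost.
Reference translation of the paper (metadata) (Sat, 07 Dec 2024 13:15:22 GMT)
It resembles the MoE-style work seen so often lately, but this is the type of research that goes further into the global structure of the model.
The paper states: "It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3× parameter expansion at constant activation cost."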

MH-MoE:Multi-Head Mixture-of-Experts

MH-MoE:Multi-Head Mixture-of-Experts [119.5]
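MH-MoE (Multi-Head Mixture-of-Experts) uses a multi-head mechanism to aggregate information from the various representation spaces within different experts, and shows superior performance. We propose a new implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 09:05:36 GMT)
An improved implementation of Multi-Head Mixture-of-Experts (see Fugu-MT paper translation (abstract): Multi-Head Mixture-of-Experts).
"In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models."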
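
The multi-head idea can be sketched in a few lines: split each token vector into sub-tokens, route every sub-token to an expert on its own, then concatenate the results. This is a simplified illustration under stated assumptions, not the paper's implementation. It uses top-1 routing, omits any learned projection layers around the split and merge, and the names `split_heads`, `route_top1`, and `mh_moe_layer` are hypothetical.

```python
def split_heads(token, num_heads):
    """Split one token vector into num_heads equally sized sub-tokens."""
    if len(token) % num_heads != 0:
        raise ValueError("token length must be divisible by num_heads")
    size = len(token) // num_heads
    return [token[i * size:(i + 1) * size] for i in range(num_heads)]

def route_top1(sub_token, router_rows):
    """Return the index of the expert with the highest router score."""
    scores = [sum(w * x for w, x in zip(row, sub_token)) for row in router_rows]
    return max(range(len(scores)), key=scores.__getitem__)

def mh_moe_layer(token, num_heads, router_rows, experts):
    """Route each sub-token independently, then concatenate the outputs."""
    merged, chosen = [], []
    for sub_token in split_heads(token, num_heads):
        idx = route_top1(sub_token, router_rows)
        chosen.append(idx)
        merged.extend(experts[idx](sub_token))
    return merged, chosen

# Toy setup (assumed values): two experts operating on 2-dimensional sub-tokens
router_rows = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    lambda v: [2.0 * a for a in v],   # expert 0 doubles its input
    lambda v: [a + 1.0 for a in v],   # expert 1 shifts its input
]
output, chosen = mh_moe_layer([1.0, 0.0, 0.0, 1.0], 2, router_rows, experts)
```

Here the two halves of the same token go to different experts, which is the sense in which one token can draw on several experts' representation spaces.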

Hunyuan-Large

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent [83.4]
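Hunyuan-Large is an open-source Transformer-based mixture-of-experts model. We thoroughly evaluate the superior performance of Hunyuan-Large across a variety of benchmarks. A key practice of Hunyuan-Large is its use of large-scale synthetic data, larger than in the previous literature.
Reference translation of the paper (metadata) (Tue, 05 Nov 2024 04:14:25 GMT)
A high-performing LLM with publicly released weights. It is an MoE with 389B parameters, of which 52B are active, and is claimed to surpass Llama 3.1 70B and be competitive with the 405B model. The license is relatively permissive, but it notably states "THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW." It also states "This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China".
The repository is GitHub – Tencent/Tencent-Hunyuan-Large; the model is tencent/Tencent-Hunyuan-Large · Hugging Face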
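
As a quick back-of-the-envelope check on what "389B parameters, 52B active" means, the arithmetic below computes the fraction of the model used per token and compares it with a dense 405B model, where every parameter is active. The variable names are only illustrative.

```python
TOTAL_PARAMS_B = 389     # total parameters in billions, from the text above
ACTIVE_PARAMS_B = 52     # parameters active per token in billions, from the text above
DENSE_BASELINE_B = 405   # Llama 3.1 405B is dense, so all parameters are active

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
dense_to_active_ratio = DENSE_BASELINE_B / ACTIVE_PARAMS_B

print(f"Active fraction per token: {active_fraction:.1%}")
print(f"Dense 405B uses {dense_to_active_ratio:.1f}x more active parameters per token")
```

Roughly 13% of the weights take part in each forward pass, which is why an MoE of this size can be compared with much smaller dense models on per-token compute while still needing memory for all 389B parameters.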

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration [33.9]
Vision-language foundation models (such as CLIP) have shown their power in transfer learning thanks to large-scale image-text pretraining. This paper proposes TransAgent, a general and concise framework that transfers the knowledge of isolated agents in a unified manner. TransAgent achieves state-of-the-art performance on 11 visual recognition datasets.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 03:01:44 GMT)
An integration of agentic models. The paper states: "By adaptively integrating the external knowledge of agents from different modalities via MoA gating mechanism, TransAgent achieves state-of-the-art performance on 11 datasets under the low-shot scenarios."
The repository is GitHub – markywg/transagent: [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning [136.9]
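MoErging aims to recycle expert models to build an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. This survey includes a new taxonomy that catalogs key design choices and clarifies which applications each method suits.
Reference translation of the paper (metadata) (Tue, 13 Aug 2024 17:49:00 GMT)
A survey of MoErging ("a new paradigm for decentralized model development that aims to recycle expert models trained asynchronously by distributed contributors."), a broader concept than what is usually called MoE (Mixture-of-Experts).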

MoExtend: Tuning New Experts for Modality and Task Extension

MoExtend: Tuning New Experts for Modality and Task Extension [61.3]
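MoExtend is an effective framework that streamlines modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into a pretrained MoE model, providing new knowledge without tuning the pretrained model.
Reference translation of the paper (metadata) (Wed, 07 Aug 2024 02:28:37 GMT)
An MoE-style approach that proposes a method for extending modalities. Judging from the experimental results, it looks very effective.
The repository is GitHub – zhongshsh/MoExtend: ACL 2024 (SRW), Official Codebase of our Paper: “MoExtend: Tuning New Experts for Modality and Task Extension”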

Yuan 2.0-M32, Zamba, MAP-Neo

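Interesting LLMs were announced again this week.

Yuan 2.0-M32, a small but powerful MoE model
Zamba, an SSM (hybridized with Transformer) at a practical 7B size, whose performance appears competitive with 7B Transformer-architecture models
MAP-Neo, a strong open model, though limited to Chinese and English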

Yuan 2.0-M32: Mixture of Experts with Attention Router [30.9]
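Yuan 2.0-M32 has a base architecture similar to Yuan-2.0 2B and uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, Attention Router, is proposed and adopted for more efficient expert selection, improving accuracy by 3.8% compared with a classical router network. Yuan 2.0-M32 is competitive in coding, mathematics, and a range of specialist domains.
Reference translation of the paper (metadata) (Tue, 28 May 2024 09:05:08 GMT)
An MoE LLM that claims strong performance despite having few active parameters. On many tasks it scores higher than Phi-3, which is of similar scale in terms of active parameters, and Llama-3 8B, which is more than twice as large.
The repository is GitHub – IEIT-Yuan/Yuan2.0-M32: Mixture-of-Experts (MoE) Language Model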

Zamba: A Compact 7B SSM Hybrid Model [11.0]
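Zamba is a 7B SSM-Transformer hybrid model. Zamba is trained on 1T tokens from openly available datasets. Zamba is considerably faster at inference than comparable Transformer models.
Reference translation of the paper (metadata) (Sun, 26 May 2024 22:23:02 GMT)
An efficient yet powerful LLM that is a hybrid of SSM and Transformer.
The repository is Zyphra/Zamba-7B-v1 · Hugging Face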

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series [86.3]
We open-source MAP-Neo, a bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens.
Reference translation of the paper (metadata) (Wed, 29 May 2024 17:57:16 GMT)
A powerful and open LLM.
The project site is MAP-Neo; the Hugging Face weights are at Neo-Models – a m-a-p Collection (huggingface.co)

Mixtral of Experts

Mixtral of Experts [57.4]
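Mixtral 8x7B is a sparse mixture-of-experts (SMoE) language model. Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide Mixtral 8x7B – Instruct, a model fine-tuned to follow instructions that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B.
Reference translation of the paper (metadata) (Mon, 8 Jan 2024 18:47:34 GMT)
The paper on Mixtral, which drew attention for its high performance. The finding "Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic." is surprising.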
Mixtral of experts | Mistral AI | Open-weight models
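
For readers new to sparse MoE layers, the routing step can be sketched as follows. In the Mixtral paper, each layer has 8 experts and a router picks 2 of them per token, weighting their outputs by a softmax over the two selected logits. The code below is a toy pure-Python version of that gating rule. The function names, the gate weights, and the stand-in "experts" (simple scalings rather than feed-forward blocks) are assumptions for illustration only.

```python
import math

def top_k_gate(logits, k=2):
    """Softmax over only the k largest router logits; other experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sparse_moe(x, gate_weights, experts, k=2):
    """Run only the selected experts and return their gate-weighted sum."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for idx, gate in gates.items():
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates

# Toy setup (assumed values): 8 experts, expert i scales its input by i
experts = [lambda v, s=i: [s * a for a in v] for i in range(8)]
gate_weights = [[0.0, 0.0] for _ in range(8)]
gate_weights[1] = [1.0, 0.0]
gate_weights[3] = [1.0, 0.0]

output, gates = sparse_moe([2.0, 1.0], gate_weights, experts)
```

Only the two selected experts are evaluated for the token, so per-token compute depends on the number of active experts rather than the total number of experts.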
