September 23, 2024 – Introducing the Latest arXiv Papers

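Many research institutions are building LLMs. From last week's news, the ones to watch are the high-performance LLM Qwen 2.5, the MoE-based and highly efficient GRIN-MoE, and the multimodal extensions Qwen 2 VL and Pixtral.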

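Note that the licenses vary, but the models themselves are publicly available, so the options beyond commercial APIs keep growing. Each model also has its own aims, and frankly, evaluating them is not easy. These days I find myself wishing for an AI that would suggest the base model and usage approach that fit what I want to do.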

A lot of information has also been published on model building and fine-tuning, which is very interesting.

Qwen2.5-Coder Technical Report [100.7]
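The report introduces the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. As a code-specific model, Qwen2.5-Coder is built on the Qwen2.5 architecture and pre-trained on a vast corpus of over 5.5 trillion tokens.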
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:57:57 GMT)
The report stresses the importance of filtering: "To ensure the quality of the pre-training data, we have curated a dataset by collecting public code data and extracting high-quality code-related content from web texts, while filtering out low-quality data using advanced classifiers." Data synthesis is also touched on, but perhaps the difference from the Math model is that real data is abundant?
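As a rough illustration of the classifier-based filtering described in the quote, here is a minimal sketch. The function names `quality_score` and `filter_corpus`, the toy heuristics, and the `threshold` value are all hypothetical stand-ins; the report does not describe its classifiers in this form, and a real pipeline would call a trained model instead.

```python
from typing import Callable, List


def quality_score(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier (score in [0, 1])."""
    if not text.strip():
        return 0.0
    lines = text.splitlines()
    # Toy signal 1: fraction of lines that are not blank.
    non_empty = sum(1 for line in lines if line.strip()) / len(lines)
    # Toy signal 2: whether any code-like token appears at all.
    code_tokens = ("def ", "class ", "return", "{", ";")
    code_like = 1.0 if any(tok in text for tok in code_tokens) else 0.0
    return 0.5 * non_empty + 0.5 * code_like


def filter_corpus(
    docs: List[str],
    score_fn: Callable[[str], float] = quality_score,
    threshold: float = 0.6,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Keep only documents whose score clears the threshold."""
    return [doc for doc in docs if score_fn(doc) >= threshold]


docs = ["def add(a, b):\n    return a + b", "", "lorem ipsum"]
kept = filter_corpus(docs)
```

Here only the first document survives: the empty string scores 0.0 and the plain-text line scores 0.5, below the assumed cutoff.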

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.5]
The report presents Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. Qwen2.5-Math-Instruct supports both Chinese and English and has advanced mathematical reasoning capabilities.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 16:45:37 GMT)
"In this report, we introduce Qwen2.5-Math, which features several key technical highlights: (1) extensive use of synthesized mathematical data from Qwen2-Math during the pre-training phase, (2) iterative generation of fine-tuning data and reinforcement training guided by the reward model during the post-training and inference phase and (3) support for bilingual (English and Chinese) queries, along with chain-of-thought and tool-integrated reasoning capabilities." The effect of synthetic data and of this self-improvement-style loop is interesting.
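To make the idea of reward-guided data generation concrete, here is a minimal sketch of one selection round using best-of-N sampling. Best-of-N is a common technique swapped in for illustration, not necessarily the report's exact procedure. The names `select_best_of_n`, `toy_generate`, and `toy_reward`, and the `min_reward` cutoff, are hypothetical; the toy functions replace a real generator model and reward model.

```python
from typing import Callable, List, Tuple


def select_best_of_n(
    problems: List[str],
    generate: Callable[[str, int], List[str]],
    reward: Callable[[str, str], float],
    n_samples: int = 4,
    min_reward: float = 0.5,  # assumed cutoff for illustration
) -> List[Tuple[str, str]]:
    """Keep, per problem, the highest-reward candidate if it clears min_reward."""
    selected = []
    for problem in problems:
        candidates = generate(problem, n_samples)
        if not candidates:
            continue
        best = max(candidates, key=lambda c: reward(problem, c))
        if reward(problem, best) >= min_reward:
            selected.append((problem, best))
    return selected


# Toy stand-ins so the sketch runs end to end (not a real model or reward model).
def toy_generate(problem: str, n: int) -> List[str]:
    a, b = (int(x) for x in problem.split("+"))
    # Candidates are the true sum shifted by 0..n-1, so exactly one is correct.
    return [str(a + b + offset) for offset in range(n)]


def toy_reward(problem: str, answer: str) -> float:
    a, b = (int(x) for x in problem.split("+"))
    return 1.0 if int(answer) == a + b else 0.0


sft_data = select_best_of_n(["1+2", "10+5"], toy_generate, toy_reward)
```

In an iterative setup, the selected pairs would be used to fine-tune the generator and the round repeated with the updated model; the sketch covers a single round only.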

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution [82.4]
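This paper introduces the Qwen2-VL series, an upgrade of the previous Qwen-VL models. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which lets images of varying resolutions be processed into different numbers of visual tokens. It also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates effective fusion of positional information across text, images, and video.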
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:59:32 GMT)
"Qwen2-VL series introduces naive dynamic resolution and multimodal rotary position embedding (M-RoPE) to fuse information across modals effectively and be capable of understanding videos over 20 minutes in length." and "Furthermore, Qwen2-VL now supports understanding multilingual texts within images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others." With video support and Japanese support, this is a powerful multimodal model.
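The point of dynamic resolution is that the number of visual tokens scales with image size instead of being fixed. The following sketch shows that arithmetic. The defaults (`patch_size=14` and a 2x2 merge of adjacent patches) are assumptions not stated in the text above, the function name `visual_token_count` is hypothetical, and rounding each side up to a whole merged patch is a simplification rather than the model's actual resizing rule.

```python
import math


def visual_token_count(height: int, width: int, patch_size: int = 14, merge: int = 2) -> int:
    """Estimate the number of visual tokens for an image of the given pixel size.

    patch_size and merge are assumed values; special start/end tokens are not counted.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    unit = patch_size * merge  # pixels covered by one merged token along each side
    return math.ceil(height / unit) * math.ceil(width / unit)


small = visual_token_count(224, 224)  # 8 x 8 merged patches
wide = visual_token_count(224, 448)   # 8 x 16 merged patches
```

Under these assumptions, doubling one side of the image doubles the token count, which is the trade-off between detail and sequence length that a fixed-resolution encoder cannot offer.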

GRIN: GRadient-INformed MoE [132.9]
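Mixture-of-Experts (MoE) models scale more effectively than dense models thanks to sparse computation through expert routing. The paper introduces GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. The resulting model has only 6.6B activated parameters, yet it outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.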
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:00:20 GMT)
An improvement to MoE characterized by "We propose SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation." and "We scale MoE training with neither expert parallelism nor token dropping, while the conventional MoE training employs expert parallelism and deploys token dropping."
I recall reading a report that, even in an MoE configuration, the experts surprisingly do not end up specialized, so the statement "Our study seems to verify our hypothesis that expert networks in GRIN MoE have developed highly-specialized and heterogeneous expertise." is interesting.
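For readers unfamiliar with expert routing, here is a minimal sketch of conventional top-k gating, the baseline that the quote contrasts with. This is not SparseMixer-v2. The names `route_top_k` and `moe_forward`, the choice of k=2, and the scalar toy experts are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, renormalized_weight) for the k highest-scoring experts."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Run only the selected experts; the others are skipped (the sparse part)."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts: each one just scales its input by a different factor.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
logits = [0.0, 2.0, 1.0, -1.0]
routing = route_top_k(logits)            # selects experts 1 and 2
output = moe_forward(1.0, logits, experts)
```

The discrete top-k selection is what keeps the activated parameter count low, and it is also a non-differentiable step, which is why gradient estimation for routing is a problem worth addressing.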

Pixtral 12B [56.8]
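The paper introduces Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size.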
Reference translation of the paper (metadata) (Wed, 09 Oct 2024 17:16:22 GMT)
Announcing Pixtral 12B | Mistral AI | Frontier AI in your hands
GitHub – mistralai/mistral-evals

Date: September 23, 2024