May 12, 2025 – Introducing the latest arXiv papers

Mistral Medium 3, Gemini 2.5 Pro preview, Llama-Nemotron, OpenCodeReasoning

Last week's headline news was Mistral's release of Mistral Medium 3 (Medium is the new large. | Mistral AI). Its performance competes with Claude 3.7 Sonnet, and the point I find important is that it can run in many vendors' environments: "The Mistral Medium 3 API is available starting today on Mistral La Plateforme and Amazon Sagemaker, and soon on IBM WatsonX, NVIDIA NIM, Azure AI Foundry, and Google Cloud Vertex. To deploy and customize the model in your environment, please contact us."

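Google's Gemini 2.5 Pro also appears to have become available (Gemini Pro – Google DeepMind), and it is drawing a lot of attention as well. NVIDIA's Llama-Nemotron and OpenCodeReasoning becoming available for download was also widely discussed.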

Third-party performance evaluations of these models are probably still to come, but there really is a lot of news.

Llama-Nemotron: Efficient Reasoning Models [105.8]
Introduces the Llama-Nemotron series, an open family of heterogeneous reasoning models. It comes in three sizes: Nano (8B), Super (49B), and Ultra (253B).
Reference translation of the paper (metadata) (Fri, 02 May 2025 01:35:35 GMT)
Repositories: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 · Hugging Face and nvidia/Llama-Nemotron-Post-Training-Dataset · Datasets at Hugging Face

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.2]
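The authors build a supervised fine-tuning (SFT) dataset that achieves state-of-the-art coding capability across models of various sizes. Their models use only SFT to reach 61.8% on LiveCodeBench and 24.6% on CodeContests, outperforming alternatives trained with reinforcement learning.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 17:50:31 GMT)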

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.7]
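Sparse large language models (LLMs) with Mixture-of-Experts (MoE) and close to a trillion parameters dominate the realm of the most capable language models. This paper aims to uncover a recipe for harnessing such scale on Ascend NPUs. The key goals are better use of computing resources under dynamic sparse model structures and realizing the expected performance gains on actual hardware.
Reference translation of the paper (metadata) (Wed, 07 May 2025 15:46:36 GMT)
An implementation-focused paper on Pangu Ultra, which is also related to the earlier post "Llama 4, Nemotron-H, Pangu Ultra, Kimi-VL, Kimi-VL-Thinking, Deep Coder – Introducing the latest arXiv papers".
The paper states: "Our system optimizations focus on Expert Parallelism and memory management, significantly lowering communication and activation overhead across 6K NPUs. These innovations enable a 30.0% MFU, demonstrating Ascend NPUs' capability to support full-scale training of large-scale sparse LLMs, e.g., Pangu Ultra MoE, with comparable performance as DeepSeek R1." This reads as a claim that state-of-the-art models can be built without relying on NVIDIA GPUs.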
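
As a rough illustration of what the quoted 30.0% MFU (Model FLOPs Utilization) measures, the sketch below divides the FLOPs a training run actually performs per second by the hardware's theoretical peak. It assumes the common approximation of 6 FLOPs per active parameter per trained token. That approximation, the function name `mfu`, and every number in the usage line are illustrative assumptions, not the paper's own accounting or measurements.

```python
def mfu(active_params, tokens_per_second, num_chips, peak_flops_per_chip):
    """Estimate Model FLOPs Utilization as achieved FLOPs/s over peak FLOPs/s.

    Assumption: training costs about 6 FLOPs per active parameter per token
    (forward plus backward pass). For an MoE model, only the parameters
    activated per token are counted.
    """
    achieved_flops_per_second = 6 * active_params * tokens_per_second
    peak_flops_per_second = num_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
example = mfu(
    active_params=1e9,
    tokens_per_second=1e3,
    num_chips=2,
    peak_flops_per_chip=1e13,
)
print(f"MFU = {example:.1%}")
```

Because MoE routing leaves some experts idle and adds communication between devices, the numerator is typically harder to keep high for sparse models than for dense ones, which is why the paper highlights this figure.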

Teaching Models to Understand (but not Generate) High-risk Data

Teaching Models to Understand (but not Generate) High-risk Data [38.3]
Introduces SLUNG (Selective Loss to Understand but not Generate). SLUNG is a pre-training paradigm in which models learn to understand high-risk data without learning to generate it. The authors show that SLUNG consistently improves models' understanding of high-risk data without increasing its generation.
Reference translation of the paper (metadata) (Mon, 05 May 2025 22:24:06 GMT)
The proposed method: "This work introduces SLUNG, a pre-training paradigm that enables language models to learn from high-risk data without being trained to generate it. By selectively adjusting the training objective at the token level based on risk, SLUNG decouples a model's ability to understand from its ability to generate, allowing models to condition on high-risk inputs while learning from adjacent low-risk tokens." In practice there are many things a model needs to learn about but must not say, so this kind of method is very interesting.
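
One simple way to picture "selectively adjusting the training objective at the token level" is a next-token loss that skips positions whose target token is flagged as high-risk, while those tokens still remain in the context the model conditions on. The sketch below is a minimal pure-Python illustration under that assumption. The function names, the boolean `high_risk` flags, and the choice to drop the loss entirely (rather than some other adjustment) are hypothetical and not taken from the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def selective_loss(logits_per_position, targets, high_risk):
    """Mean cross-entropy over low-risk target tokens only.

    Positions whose target is flagged high-risk contribute no loss, so the
    model is never rewarded for predicting them. They still appear in the
    input sequence, so later low-risk tokens are predicted given them.
    """
    total, count = 0.0, 0
    for logits, target, risky in zip(logits_per_position, targets, high_risk):
        if risky:
            continue  # skip: do not train the model to emit this token
        total += -log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 4 tokens, 3 positions, the middle one is high-risk.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = selective_loss(logits, targets=[1, 0, 2], high_risk=[False, True, False])
print(loss)
```

In the toy example the middle position is ignored, so the loss depends only on the two low-risk positions.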

On Path to Multimodal Generalist: General-Level and General-Bench

On Path to Multimodal Generalist: General-Level and General-Bench [154.0]
This paper introduces General-Level, an evaluation framework that defines five levels of MLLM performance and generality. At the core of the framework is the concept of Synergy, which measures whether a model maintains consistent capabilities across comprehension and generation. Evaluation results covering more than 100 existing MLLMs reveal a capability ranking of generalists.
Reference translation of the paper (metadata) (Wed, 07 May 2025 17:59:32 GMT)
An evaluation framework that responds to the question: "This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI?" It defines five broad levels, much like the levels used for autonomous driving. For now, the authors report: "Our evaluation of over 100 existing top-performing LLM/MLLM systems has uncovered critical insights into their capabilities and rankings as multimodal generalists. The most notable finding is that most MLLMs lack the cross-task or cross-modal synergy ability required for higher-level classifications, with even advanced models like GPT-4V and GPT-4o not achieving top ranks." That is the claim, but...
The project site is Path to Multimodal Generalist, and the leaderboard is Path to Multimodal Generalist

The following survey is also worth a look.

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models [79.5]
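Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities.
Reference translation of the paper (metadata) (Thu, 08 May 2025 03:35:23 GMT)
Repository: GitHub – HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models: The development and future prospects of multimodal reasoning models.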
