December 29, 2025 – Introducing the latest arXiv papers

GLM 4.7, MiniMax M2.1, ERNIE-5.0-Preview-1203

Last week was notable for announcements about frontier models from China. Many are minor updates, but they deliver steady performance gains. It is great that the models themselves have been released for both GLM-4.7 (Z.ai on X: "GLM-4.7 is here! GLM-4.7 surpasses GLM-4.6 with substantial improvements in coding, complex reasoning, and tool usage, setting new open-source SOTA standards. It also boosts performance in chat, creative writing, and role-play scenarios. Default Model for Coding Plan: https://t.co/3vDzwof7A8"; repository: zai-org/GLM-4.7 · Hugging Face) and MiniMax M2.1 (MiniMax (official) on X: "MiniMax M2.1 is OPEN SOURCE: SOTA for real-world dev & agents • SOTA on coding benchmarks (SWE / VIBE / Multi-SWE) • Beats Gemini 3 Pro & Claude Sonnet 4.5 • 10B active / 230B total (MoE) Not just SOTA, faster to infer, easier to deploy, and yes, you can even run it locally https://t.co/atCML3vq8C"; repository: MiniMaxAI/MiniMax-M2.1 · Hugging Face). ERNIE 5.0 (Best Text model from China in LMArena is now ERNIE-5.0-Preview-1203! | ERNIE Blog) also looks strong.

A paper on Nemotron 3 has also come out. Strong open models keep increasing in number and keep getting updated, so these are good times (?)

NVIDIA Nemotron 3: Efficient and Open Intelligence [227.5]
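The Nemotron 3 series delivers strong agentic, reasoning, and conversational capabilities. The Nemotron 3 models are post-trained using multi-environment reinforcement learning, which enables reasoning, multi-step tool use, and support for granular reasoning budget control. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens.
Reference translation of the paper (metadata) (Wed, 24 Dec 2025 00:24:05 GMT)
As the abstract puts it, "The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba–Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens.": a Mamba hybrid model built for long contexts.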

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning [223.9]
Nemotron 3 Nano 30B-A3B is a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.
Reference translation of the paper (metadata) (Tue, 23 Dec 2025 23:54:32 GMT)
The repository is nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 · Hugging Face
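To put the Mixture-of-Experts sizes mentioned above into perspective, here is a minimal arithmetic sketch of the share of parameters that is active per token. The MiniMax M2.1 figures come from the announcement quoted earlier ("10B active / 230B total"). The Nemotron 3 Nano figures are an assumption read off the model name alone ("30B-A3B" taken as roughly 30B total and 3B active), not numbers checked against the paper. The helper name active_fraction is hypothetical.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used per token, with both counts given in billions."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# (active, total) in billions of parameters
models = {
    # From the quoted announcement: "10B active / 230B total (MoE)"
    "MiniMax M2.1": (10.0, 230.0),
    # Assumption: rounded values inferred from the name "30B-A3B" only
    "Nemotron 3 Nano 30B-A3B": (3.0, 30.0),
}

for name, (active_b, total_b) in models.items():
    share = active_fraction(active_b, total_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Under these rounded inputs the script prints 4.3% for MiniMax M2.1 and 10.0% for Nemotron 3 Nano. That is the rough sense in which such models can be large in total size while staying comparatively cheap to run per token.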

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.4]
We propose an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
Reference translation of the paper (metadata) (Thu, 18 Dec 2025 12:44:36 GMT)
A study of scientific general intelligence (SGI), defined as "SGI is an AI that can autonomously navigate the complete, iterative cycle of scientific inquiry with the versatility and proficiency of a human scientist", which also proposes a benchmark and more. Judging from the reported findings ("Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10–20%), with brittleness in quantitative reasoning; in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative, with persistent multimodal challenges. These highlight gaps between linguistic fluency and integrated scientific cognition."), the field still has a long way to go, but I think it is a very popular area right now.
The top of SGI-Bench is lined with each company's frontier models: Gemini 3 Pro, Claude Sonnet 4.5, Qwen3 Max, GPT-4.1, and GPT-5.2 Pro.
The repository is SGI-Bench — Scientific General Intelligence

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Toward Training Superintelligent Software Agents through Self-Play SWE-RL [66.1]
Self-play SWE-RL is a first step toward a training paradigm for superintelligent software agents. Our approach requires only access to sandboxed repositories with source code and installed dependencies. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories.
Reference translation of the paper (metadata) (Sun, 21 Dec 2025 00:49:40 GMT)
An adversarial, self-play style self-improvement framework: "The core idea of Self-play SWE-RL (SSR) is to allow LLM agents to self-improve through an iterative cycle of solving self-generated bugs and creating more complex challenges. As shown in Figure 1, the same LLM policy is divided into two roles: a bug-injection agent and a bug-solving agent." Its effectiveness was reportedly confirmed using CWM as the base model (GitHub – facebookresearch/cwm: Research code artifacts for Code World Model (CWM) including inference tools, reproducibility, and documentation.)
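The two-role cycle described in that quote can be pictured with a toy loop. This is only an illustrative sketch, not the paper's implementation: the "repository" is a single arithmetic function, both roles are random stubs sharing one random generator (standing in for the single shared LLM policy), and every name here (OPS, TESTS, inject_bug, solve_bug, self_play) is hypothetical. No reinforcement-learning update or reward scheme is modelled.

```python
import random
from typing import Callable, Dict, List, Tuple

# Toy "repository": one function whose behaviour is selected by an operator name.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
CORRECT_OP = "add"
# Each test is (a, b, expected result).
TESTS: List[Tuple[int, int, int]] = [(1, 2, 3), (0, 5, 5), (4, 4, 8)]


def run_tests(op_name: str) -> bool:
    """Return True if the repository passes every test with this operator."""
    return all(OPS[op_name](a, b) == expected for a, b, expected in TESTS)


def inject_bug(rng: random.Random) -> str:
    """Bug-injection role (stub): swap the correct operator for a wrong one."""
    return rng.choice([name for name in OPS if name != CORRECT_OP])


def solve_bug(buggy_op: str, rng: random.Random, attempts: int) -> bool:
    """Bug-solving role (stub): try random patches until the tests pass."""
    candidates = [name for name in OPS if name != buggy_op]
    for _ in range(attempts):
        patch = rng.choice(candidates)
        if run_tests(patch):
            return True
    return False


def self_play(rounds: int, seed: int = 0) -> List[dict]:
    """Alternate the two roles and record what happened in each round."""
    rng = random.Random(seed)  # one generator stands in for the shared policy
    history = []
    for _ in range(rounds):
        buggy = inject_bug(rng)
        bug_is_valid = not run_tests(buggy)  # a bug only counts if tests catch it
        solved = bug_is_valid and solve_bug(buggy, rng, attempts=2)
        history.append({"bug": buggy, "valid": bug_is_valid, "solved": solved})
    return history


if __name__ == "__main__":
    for record in self_play(rounds=5):
        print(record)
```

In a real setup the per-round records would feed a learning signal for both roles. Here they are only collected, to make the control flow of the cycle visible.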
