September 30, 2025 – Introducing the latest arXiv papers

Achilles’ Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

Achilles’ Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data [52.1]
State space models (SSMs) have emerged as a promising alternative to attention mechanisms. This study uses carefully designed synthetic tasks to reveal limitations inherent to Mamba.
Reference translation of the paper (metadata) (Mon, 22 Sep 2025 08:38:55 GMT)
"We find that Mamba struggles to match sequences under order changes – for example, “1234” vs. “4321”. To test this limitation, we designed a inverse sequence matching task, where the model must match a sequence with its reversed counterpart." and "Experimental results confirm that Mamba has difficulty completing this task, whereas Transformer handles it with ease." In other words, the paper identifies a task that Mamba handles poorly. Very interesting.
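To make the inverse sequence matching task concrete, here is a minimal sketch of how such synthetic examples could be generated. The digit vocabulary, the binary match/no-match labeling, and the names `make_example` and `is_inverse_match` are assumptions for illustration only; the paper's actual data format and task setup may differ.

```python
import random


def is_inverse_match(a, b):
    """Return True if b is exactly a read backwards."""
    return a == b[::-1]


def make_example(length, vocab="0123456789", rng=random):
    """Build one (sequence, candidate, label) triple.

    label is 1 when candidate is the reversed sequence, and 0 when one
    position of the reversed sequence has been corrupted.
    Hypothetical format, not taken from the paper.
    """
    seq = [rng.choice(vocab) for _ in range(length)]
    reverse = seq[::-1]
    if rng.random() < 0.5:
        return "".join(seq), "".join(reverse), 1
    # Negative example: change exactly one symbol of the reversed sequence.
    pos = rng.randrange(length)
    alternatives = [c for c in vocab if c != reverse[pos]]
    corrupted = list(reverse)
    corrupted[pos] = rng.choice(alternatives)
    return "".join(seq), "".join(corrupted), 0


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_example(6, rng=rng))
```

A model would then be trained to predict the label from the sequence and the candidate. Corrupting only a single position keeps negatives close to positives, so the label cannot be guessed from surface statistics alone.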

Causal Understanding by LLMs: The Role of Uncertainty [43.9]
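Recent papers report that LLMs achieve near-random accuracy on causal relation classification. This study examines whether prior exposure to causal examples improves causal understanding.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 13:06:35 GMT)
An examination of whether LLMs can sort out causal relationships. The assessment is harsh: "uncertainty in causal tasks stems primarily from deficits in causal understanding rather than limitations in memorization." and "Addressing these limitations will require a shift beyond current pretraining paradigms – toward models that explicitly encode and reason over causal structures, and that are capable of expressing calibrated uncertainty when faced with ambiguity or unseen conditions."
Whether the models being tested are frontier models is a point of concern. (That said, with commercial models there is the problem that the data, the pre-training, and the post-training are all largely unknown.)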

Fluid Language Model Benchmarking [126.9]
We introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level. Examining four dimensions (efficiency, validity, variance, and saturation), we find that Fluid Benchmarking delivers superior performance in all of them.
Reference translation of the paper (metadata) (Sun, 14 Sep 2025 05:49:42 GMT)
A proposal for an evaluation method: "we introduce FLUID BENCHMARKING, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, FLUID BENCHMARKING is based on the insight that the relative value of benchmark items depends on an LM’s capability level, suggesting that evaluation should adapt to each LM. Methodologically, FLUID BENCHMARKING estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education."
The repository is GitHub – allenai/fluid-benchmarking: Fluid Language Model Benchmarking
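The idea of selecting items dynamically with an item response model can be sketched as follows. This is only an illustration under stated assumptions: it uses a two-parameter logistic (2PL) model, picks the unused item with the highest Fisher information at the current ability estimate, and re-estimates ability by a grid search with a standard-normal prior. The item parameters are taken as given rather than fitted from existing LM evaluation results, and all function names are hypothetical; none of this is the API of the allenai/fluid-benchmarking repository.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that ability theta answers an item with
    discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item (a, b) carries about ability near theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_item(theta, items, used):
    """Index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


def estimate_ability(responses, items):
    """Grid-search MAP estimate of theta with a standard-normal prior."""
    grid = [x / 10.0 for x in range(-40, 41)]

    def log_posterior(theta):
        lp = -0.5 * theta * theta
        for i, correct in responses:
            p = p_correct(theta, *items[i])
            p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
            lp += math.log(p) if correct else math.log(1.0 - p)
        return lp

    return max(grid, key=log_posterior)


def adaptive_eval(answer_fn, items, budget):
    """Evaluate with at most `budget` items; answer_fn(i) returns True
    if the model under test answers item i correctly."""
    theta, responses, used = 0.0, [], set()
    for _ in range(min(budget, len(items))):
        i = select_item(theta, items, used)
        used.add(i)
        responses.append((i, bool(answer_fn(i))))
        theta = estimate_ability(responses, items)
    return theta, responses
```

Here `items` is a list of `(a, b)` pairs and `answer_fn` wraps the model being evaluated. The reported score is the ability estimate `theta` rather than raw accuracy, which is what allows different models to be compared even though each one sees a different subset of items.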

LIMI: Less is More for Agency [49.6]
LIMI (Less Is More for Intelligent Agency) shows that agency follows a fundamentally different development principle. Sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.
Reference translation of the paper (metadata) (Mon, 22 Sep 2025 10:59:32 GMT)
The claim is: "These findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations. This discovery fundamentally reshapes how we develop autonomous AI systems, suggesting that mastering agency requires understanding its essence, not scaling training data." Given "we refer to models fine-tuned with our curated dataset as LIMI (corresponding to fine-tuning GLM-4.5) and LIMI-Air (corresponding to fine-tuning GLM-4.5-Air).", the method appears to be SFT, but the improvement also looks larger for the model based on GLM-4.5, which has more parameters.
The repository is GitHub – GAIR-NLP/LIMI: LIMI: Less is More for Agency