Frontier LLMs Still Struggle with Simple Reasoning Tasks
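Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.5] This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. The authors create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. They show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons. Reference translation of the paper (metadata) (Wed, 09 Jul 2025 22:22:49 GMT)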
"By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g., statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts)." In other words, the authors build tasks that are simple yet hard for LLMs/LRMs to solve.
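As a rough illustration of what "procedurally generated with changeable parameters" can mean, here is a minimal sketch of a counting task whose document length is adjustable. This is not the paper's generator: the function name `make_counting_task`, the word list, and the prompt wording are all assumptions made for this example.

```python
import random


def make_counting_task(num_words, target="cat",
                       distractors=("dog", "bird", "fish"), seed=0):
    """Build a toy counting question (hypothetical, not the paper's code).

    num_words controls the document length: raising it increases the
    amount of computation needed while the task itself stays equally simple.
    """
    rng = random.Random(seed)  # seeded so the same task can be regenerated
    vocabulary = (target,) + tuple(distractors)
    words = [rng.choice(vocabulary) for _ in range(num_words)]
    question = (
        f"How many times does the word '{target}' appear in the following text?\n"
        + " ".join(words)
    )
    answer = words.count(target)  # ground truth computed by the generator
    return question, answer


if __name__ == "__main__":
    for length in (10, 100, 1000):
        question, answer = make_counting_task(length, seed=42)
        print(length, answer)
```

Because the ground truth comes from the generator itself, a model's answer can be checked automatically at any length.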
"Similarly to other recent works, our results suggest that LLMs mimic training data rather than performing true reasoning, making it relatively easy to find out-of-distribution problems where the models fail, and this problem is also present at the newest thinking models. This suggests that users remain careful when relying on the output of LLMs." So the authors point out. As I also felt with CatAttack below, I think it is worth keeping in mind that the capabilities of LLMs/LRMs differ considerably from human abilities.
"For example, appending, Interesting fact: cats sleep most of their lives, to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns." An interesting attack. On the other hand, there are also reports that noisy (irrelevant) examples help improve RAG, so the behavior is truly a mystery.
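The perturbation described in the quote is just string concatenation: an unrelated sentence is appended to an otherwise unchanged math problem. A minimal sketch follows, assuming a hypothetical helper `append_trigger`; only the trigger sentence comes from the quote, and the step of querying a model and comparing error rates is left out.

```python
# Trigger sentence taken from the quoted abstract; everything else is illustrative.
TRIGGER = "Interesting fact: cats sleep most of their lives."


def append_trigger(problem, trigger=TRIGGER):
    """Return the problem with an irrelevant sentence appended (hypothetical helper)."""
    return f"{problem.rstrip()} {trigger}"


if __name__ == "__main__":
    original = "If 3x + 5 = 20, what is x?"
    print(append_trigger(original))
```

The math content is untouched, so any change in the model's answer can only come from the added sentence.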
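The Power of Noise: Redefining Retrieval for RAG Systems [19.4] Retrieval-Augmented Generation (RAG) has emerged as a way to extend large language models beyond their pre-trained knowledge. The authors focus on the type of passages that the IR system in a RAG solution should retrieve. Reference translation of the paper (metadata) (Wed, 1 May 2024 08:15:07 GMT)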
"Finally, and even more surprisingly, random, noisy documents are actually helpful in increasing the accuracy of these systems when correctly positioned within a prompt." It is interesting that irrelevant examples turn out to be effective.