February 2025 – Page 6 – Introduction to the Latest arXiv Papers

OVERTHINKING: Slowdown Attacks on Reasoning LLMs

OVERTHINKING: Slowdown Attacks on Reasoning LLMs [41.7]
The OVERTHINK attack can amplify costs for third-party applications that operate reasoning models. We evaluated the attack on closed-weights (OpenAI o1, o1-mini, o3-mini) and open-weights (DeepSeek R1) models using the FreshQA and SQuAD datasets.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 18:12:41 GMT)
An overthinking attack that degrades reasoning efficiency. The authors report: "Our experimental results show that OVERTHINK significantly disrupts reasoning efficiency, with attacks on the o1 model increasing reasoning tokens up to 18× and over 10× on DeepSeek-R1."
The approach: "Our attack contains three key stages: (1) picking a decoy problem that results in a large number of reasoning tokens, but won’t trigger safety filters; (2) integrating selected decoys into a compromised source (e.g., a wiki page) by either modifying the problem to fit the context (context-aware) or by injecting a general template (context-agnostic), and, (3) optimizing the decoy tasks using an in-context learning genetic (ICL-Genetic) algorithm to select contexts with decoys that provide highest reasoning tokens and maintain stealthiness of the answers to the user." It reminds me of a DoS that exploits computationally expensive regular expressions, and it looks like it could be an effective attack.

It reminded me of the paper that notes, "In rare cases, R1 can get stuck “thinking forever”."
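
To make the "18×" and "over 10×" figures quoted above concrete, here is a minimal sketch of how a token amplification factor and the resulting extra cost could be computed. The function names, token counts, and price are hypothetical illustrations, not values or code from the paper.

```python
def amplification_factor(baseline_tokens: int, attacked_tokens: int) -> float:
    """Ratio of reasoning tokens with a decoy present to reasoning tokens without it."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return attacked_tokens / baseline_tokens


def added_cost(baseline_tokens: int, attacked_tokens: int,
               price_per_million_tokens: float) -> float:
    """Extra spend caused by the additional reasoning tokens (price is an assumption)."""
    extra_tokens = attacked_tokens - baseline_tokens
    return extra_tokens * price_per_million_tokens / 1_000_000


# Hypothetical numbers chosen only to illustrate an 18x amplification.
factor = amplification_factor(baseline_tokens=500, attacked_tokens=9000)
extra_cost = added_cost(500, 9000, price_per_million_tokens=60.0)
print(f"amplification: {factor:.1f}x, extra cost per query: ${extra_cost:.2f}")
```

Tracking a ratio like this per request is one way an application could notice unusually expensive reasoning, under the assumption that the API reports reasoning token counts.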

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models [43.2]
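We propose a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our work reveals capability gaps that are not evident in existing benchmarks.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 18:10:38 GMT)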

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant) [27.0]
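This work reports on experiments that use multiple open-source and proprietary LLMs to label short texts (passages) for relevance. Although the overall agreement between LLMs and human judgements is comparable to the human-to-human agreement measured in previous research, LLMs are more likely than human judges to label a passage as relevant.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 20:11:35 GMT)
An interesting observation: "This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures." The paper also warns: "In production environments, LLMs might be vulnerable to keyword stuffing and other SEO strategies."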
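
The point that overall agreement can hide a pattern of failures can be illustrated with a small sketch. It reports overall agreement alongside the false-positive rate (passages humans judged non-relevant but the LLM labelled relevant). The function name and the toy labels are hypothetical, not data from the paper.

```python
from collections import Counter


def label_report(human: list[bool], llm: list[bool]) -> dict[str, float]:
    """Compare LLM relevance labels with human labels (True means relevant)."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    counts = Counter(zip(human, llm))
    agreement = (counts[(True, True)] + counts[(False, False)]) / len(human)
    # Passages that humans judged non-relevant.
    non_relevant = counts[(False, True)] + counts[(False, False)]
    false_positive_rate = counts[(False, True)] / non_relevant if non_relevant else 0.0
    return {"agreement": agreement, "false_positive_rate": false_positive_rate}


# Toy labels: 2 relevant passages and 8 non-relevant ones according to humans.
human = [True] * 2 + [False] * 8
# The LLM marks both relevant passages correctly but also accepts 2 non-relevant ones.
llm = [True] * 2 + [True] * 2 + [False] * 6
report = label_report(human, llm)
print(report)
```

In this toy case agreement is 80% while a quarter of the non-relevant passages are accepted as relevant, which is the kind of failure an overall agreement score alone would not reveal.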

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science [20.3]
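This survey reviews applications across scientific fields such as biology, medicine, chemistry, and meteorology. It gives an overview of memory-efficient training techniques for large language models (LLMs) based on the transformer architecture. It demonstrates that memory optimization methods can reduce storage needs while preserving prediction accuracy.
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 03:06:30 GMT)
A survey of memory-efficient models with a focus on scientific applications.
It also includes the following: "Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy."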

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models / Leap of Thought

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.2]
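The Oogiri game is a creative task that requires humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. The results show that most LLMs exhibit constrained creativity, but that the performance gap between LLMs and humans is not insurmountable.
Reference translation of the paper (metadata) (Sat, 25 Jan 2025 09:11:15 GMT)
A proposal for a benchmark that measures the creativity of LLMs; the focus on Oogiri is interesting ("This paper investigates creativity in LLMs and provides an in-depth analysis of their Leap-of-Thought (LoT) abilities through the Oogiri game.").
Unlike commonly seen results, Qwen-VL and Gemini 1.5 Pro score higher than GPT-4o.
The project site is LoTbench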

A Survey of World Models for Autonomous Driving

A Survey of World Models for Autonomous Driving [63.3]
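Recent breakthroughs in autonomous driving have revolutionized how vehicles perceive and interact with their surroundings. World models provide high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. These world models pave the way for more robust, reliable, and adaptable autonomous driving solutions.
Reference translation of the paper (metadata) (Mon, 20 Jan 2025 04:00:02 GMT)
A survey of world models with a focus on autonomous driving.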

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos [44.4]
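Video-MMMU is a benchmark designed to evaluate the ability of LMMs to acquire and use knowledge from videos. It collects 300 expert-level videos and 900 human-annotated questions across six disciplines. A knowledge-gain metric, Δknowledge, quantifies the performance improvement after watching a video.
Reference translation of the paper (metadata) (Thu, 23 Jan 2025 16:51:47 GMT)
A video version of MMMU; Claude 3.5 Sonnet performs well.
The project site is Video-MMMU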
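
As a rough illustration of a metric that quantifies improvement after watching a video, the sketch below computes a normalized gain: the accuracy improvement divided by the headroom that was available before the video. This normalized form is an assumption made for illustration; the exact definition of Δknowledge should be checked against the paper. The function name and the accuracies are hypothetical.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized gain in percent, assuming accuracies are given in the range 0-100.

    Assumed form: (after - before) / (100 - before) * 100.
    """
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    if not 0.0 <= acc_after <= 100.0:
        raise ValueError("acc_after must be in [0, 100]")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical accuracies: 40% before the video, 55% after it.
gain = delta_knowledge(40.0, 55.0)
print(f"normalized gain: {gain:.1f}%")
```

Normalizing by the remaining headroom keeps models with high starting accuracy comparable to weaker ones, and a negative value would indicate that performance dropped after the video.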

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.3]
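We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions. The proposed method achieves new state-of-the-art performance for generative reward models on RewardBench.
Reference translation of the paper (metadata) (Thu, 30 Jan 2025 02:21:59 GMT)
A proposal for EvalPlanner, a new method for building Thinking-LLM-as-a-Judge models. It combines synthetic data construction with a self-training loop, and reportedly outperforms competing methods such as Self-Taught Evaluator on benchmarks.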

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement [41.9]
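This work introduces Condor, a two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to generate high-quality SFT data at scale. Experimental results show that a base model fine-tuned on only 20K Condor-generated samples achieves better performance than its counterparts.
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 16:44:12 GMT)
A proposal for a synthetic data construction method for SFT, with an approach based on a World Knowledge Tree. It feels like decompressing compressed knowledge, putting it into clear words, and training on the result, which is interesting.
The repository is GitHub – InternLM/Condor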

ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation

ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation [37.3]
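Code translation is an important activity in software development and maintenance. Existing large language models (LLMs) learn only the contextual semantics of code during pre-training. We propose ExeCoder, an LLM specialized for code translation.
Reference translation of the paper (metadata) (Thu, 30 Jan 2025 16:18:52 GMT)
A proposal for an LLM specialized for code translation, which poses a different problem from ordinary code generation. The approach: "The key idea of ExeCoder is to enhance the capabilities of LLMs in code translation by leveraging executability representations such as functional semantics, syntactic structure, and variable dependencies in code." With DeepseekCoder-6.7b-instruct as the base model, the authors claim performance exceeding commercial APIs and a new SOTA.
The project site is ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation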

Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization [48.6]
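This work proposes a way to make models aware of inference budgets by formulating the problem as utility maximization with respect to an inference budget constraint. In short, models fine-tuned through IBPO learn to understand the difficulty of queries and allocate more of the inference budget to harder ones. These improvements are approximately 2× those of self-consistency under the same budgets.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 20:20:48 GMT)
A proposal for a framework that takes the inference budget into account, apparently with a motivation close to that of O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning – Introduction to the Latest arXiv Papers. "In this work, we propose a way to allow models to be aware of inference budgets by formulating it as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO)."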
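
The idea of spending a limited inference budget on harder queries can be sketched with a toy allocation problem. The code below is not IBPO, which fine-tunes the model's policy; it is a simple greedy stand-in that picks which queries receive long reasoning so that total extra tokens stay within a budget. All names, gains, and token costs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    extra_tokens: int     # additional tokens long reasoning would cost (must be > 0)
    expected_gain: float  # estimated accuracy gain from long reasoning


def allocate_long_reasoning(queries: list[Query], budget: int) -> list[str]:
    """Greedily choose queries for long reasoning by gain per extra token."""
    ranked = sorted(queries, key=lambda q: q.expected_gain / q.extra_tokens, reverse=True)
    chosen: list[str] = []
    remaining = budget
    for query in ranked:
        # Skip queries that would not benefit or that no longer fit in the budget.
        if query.expected_gain > 0 and query.extra_tokens <= remaining:
            chosen.append(query.name)
            remaining -= query.extra_tokens
    return chosen


# Hypothetical queries: harder ones benefit more from extended reasoning.
queries = [
    Query("easy", extra_tokens=1000, expected_gain=0.01),
    Query("medium", extra_tokens=1000, expected_gain=0.10),
    Query("hard", extra_tokens=2000, expected_gain=0.40),
]
selected = allocate_long_reasoning(queries, budget=3000)
print(selected)
```

With a budget of 3000 extra tokens the harder queries are served first and the easy one is answered with short reasoning, which mirrors the behaviour described above at a very coarse level.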
