Introducing the Latest arXiv Papers

Gemini 2.0: Flash, Flash-Lite and Pro, OpenAI deep research

Every week brings a stream of announcements, but last week the biggest news was Google's Gemini 2.0 series. Flash-Lite in particular is an API priced competitively with DeepSeek, which made it major news on the price-competition front as well. Gemini 2.0: Flash, Flash-Lite and Pro – Google Developers Blog; X user swyx 🔜 @aidotEngineer NYC: "With Gemini 2.0 GA pricing/benchs, it’s official: @GoogleDeepMind has the Mandate of Heaven. https://t.co/pfOlxb57Yx" / X

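OpenAI announced deep research. Competing services such as Perplexity already exist, but the fact that OpenAI released it itself, together with its strong performance, made it a major topic. Introducing deep research | OpenAI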

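APIs are in a fierce price war. It is unclear whether OpenAI is being forced to compete at the application layer or whether this is a move required for a larger goal, but better LLM cost-performance and the arrival of convenient applications are welcome on the user side. (For startups, on the other hand…)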

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Preference Leakage: A Contamination Problem in LLM-as-a-judge [70.0]
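LLM-as-a-judge and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. This work exposes preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and the LLM-based evaluators.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 17:13:03 GMT)
A paper pointing out the potential for leakage when using LLM-as-a-judge. It examines how large the bias is for the same model, derived models, and models in the same family. "The results of our main experiment, measured using the proposed preference leakage score, reveal a clear bias in each judge toward its respective student model." Earlier work had already noted that models tend to prefer their own outputs, and these results back that up. "We also observe that this bias is more pronounced in comparable model pairs and larger student models." The finding that the problem is larger for bigger models is also interesting.
The repository is GitHub – David-Li0406/Preference-Leakage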

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.7]
This paper proposes a simple, lightweight method for fusing large language models with gradient-boosted decision trees. The fusion methods are named LLM-Boost and PFN-Boost. They show state-of-the-art performance against numerous baselines and ensembling algorithms.
Reference translation of the paper (metadata) (Thu, 06 Feb 2025 02:39:35 GMT)
"We propose LLM-Boost: a novel yet simple and easy-to-implement boosting mechanism that combines LLMs, which ingest semantic column headers, with GBDTs that can scale to massive datasets." and "We further propose PFN-Boost, where we instead fuse TabPFN and GBDTs for performance gains over GBDTs alone across dataset sizes without using column headers." An approach that fuses LLMs or Transformers with GBDTs. That the benefit depends on dataset size is about what I would expect.
The repository is GitHub – MayukaJ/LLM-Boost
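
The "boosting mechanism" quoted above can be illustrated with a toy example. This is a minimal sketch under an assumed reading of the idea (start boosting from scaled prior predictions, such as LLM scores, and let small trees fit the residuals); it is not the authors' implementation. All names (`fit_stump`, `boost_from_prior`, `llm_scores`, `scale`, `lr`) and the data are hypothetical, and squared loss with decision stumps is used only to keep the example short.

```python
from typing import List, Tuple

# (feature index, threshold, value if x <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    mean_all = sum(residuals) / len(residuals)
    # Fallback "no split" stump: every row goes left and gets the mean residual.
    best: Stump = (0, float("inf"), mean_all, mean_all)
    best_err = sum((r - mean_all) ** 2 for r in residuals)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, threshold, lv, rv), err
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, threshold, lv, rv = stump
    return lv if row[j] <= threshold else rv


def boost_from_prior(X, y, prior, scale=1.0, rounds=20, lr=0.5):
    """Gradient boosting (squared loss) seeded with scale * prior, not a constant."""
    preds = [scale * p for p in prior]
    stumps = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]
    return stumps, preds


def mse(y, preds):
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


# Toy data: one numeric feature, binary targets, and made-up prior scores.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
llm_scores = [0.2, 0.4, 0.5, 0.9]  # hypothetical LLM predictions
stumps, preds = boost_from_prior(X, y, llm_scores)
```

In this sketch `scale` controls how much the trees trust the prior: with `scale=0.0` the procedure reduces to plain boosting from zero, which is one way to see why the prior might matter most when training data is scarce.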

s1: Simple test-time scaling

s1: Simple test-time scaling [148.4]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. The authors look for the simplest approach that achieves test-time scaling and strong reasoning performance.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 16:31:30 GMT)
The paper reports: "We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the pareto frontier". It continues: "First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end." The use of "Wait" is distinctive (it reminds me of Think before you speak: Training Language Models With Pause Tokens – Introducing the Latest arXiv Papers).
The repository is GitHub – simplescaling/s1: s1: Simple test-time scaling
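
The budget forcing described in the quote above can be sketched as a small decoding loop. This is a minimal illustration, not the code from the s1 repository: `next_token`, the `END_THINK` marker, and `toy_model` are hypothetical stand-ins for a real model's generation step and its end-of-thinking delimiter.

```python
from typing import Callable, List

END_THINK = "</think>"  # hypothetical end-of-thinking marker


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_tokens: int,
    num_waits: int = 0,
) -> List[str]:
    """Generate thinking tokens under a budget.

    - Lengthening: when the model tries to stop and extensions remain,
      the end marker is suppressed and "Wait" is appended instead.
    - Termination: once max_tokens is reached, thinking is cut off and
      the end marker is appended by force.
    """
    thinking: List[str] = []
    waits_left = num_waits
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == END_THINK:
            if waits_left == 0:
                break  # the model is allowed to stop
            waits_left -= 1
            token = "Wait"  # nudge the model to keep reasoning
        thinking.append(token)
    return thinking + [END_THINK]


def toy_model(context: List[str]) -> str:
    """Toy stand-in: emits two "step" tokens, then tries to stop."""
    steps = 0
    for tok in reversed(context):
        if tok != "step":
            break
        steps += 1
    return "step" if steps < 2 else END_THINK


normal = budget_forced_thinking(toy_model, ["question"], max_tokens=10)
extended = budget_forced_thinking(toy_model, ["question"], max_tokens=10, num_waits=1)
truncated = budget_forced_thinking(toy_model, ["question"], max_tokens=1)
```

Here `num_waits` extends the trace and `max_tokens` caps it, so the same loop covers both directions of the compute control mentioned in the quote.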

OVERTHINKING: Slowdown Attacks on Reasoning LLMs

OVERTHINKING: Slowdown Attacks on Reasoning LLMs [41.7]
The OVERTHINK attack could amplify costs for third-party applications that operate reasoning models. The authors evaluated the attack on closed-weight (OpenAI o1, o1-mini, o3-mini) and open-weight (DeepSeek R1) models using the FreshQA and SQuAD datasets.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 18:12:41 GMT)
An overthinking attack that degrades reasoning efficiency. According to the paper, "Our experimental results show that OVERTHINK significantly disrupts reasoning efficiency, with attacks on the o1 model increasing reasoning tokens up to 18× and over 10× on DeepSeek-R1."
The approach is described as: "Our attack contains three key stages: (1) picking a decoy problem that results in a large number of reasoning tokens, but won’t trigger safety filters; (2) integrating selected decoys into a compromised source (e.g., a wiki page) by either modifying the problem to fit the context (context-aware) or by injecting a general template (context-agnostic), and, (3) optimizing the decoy tasks using an in-context learning genetic (ICL-Genetic) algorithm to select contexts with decoys that provide highest reasoning tokens and maintain stealthiness of the answers to the user." It resembles a DoS that abuses computationally expensive regular expressions, and it looks like it could be an effective attack...

It reminded me of a paper that states "In rare cases, R1 can get stuck “thinking forever”."

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models [43.2]
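The authors propose a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Their work reveals capability gaps that are not evident in existing benchmarks.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 18:10:38 GMT)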

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant) [27.0]
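This work reports on experiments using multiple open-source and proprietary LLMs to label short texts (passages) for relevance. Overall agreement between human judgments and LLMs is comparable to the human-to-human agreement measured in previous research, but LLMs are more likely than human judges to label passages as relevant.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 20:11:35 GMT)
The observation that "This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures." is interesting. The paper also warns: "In production environments, LLMs might be vulnerable to keyword stuffing and other SEO strategies."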

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science [20.3]
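This survey reviews applications across scientific fields such as biology, medicine, chemistry, and meteorology. It outlines memory-efficient training techniques for large language models (LLMs) based on the transformer architecture. It demonstrates that memory optimization methods can reduce storage needs while preserving prediction accuracy.
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 03:06:30 GMT)
A survey of memory-efficient model training focused on applications to science.
It also includes a case study: "Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy."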

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models / Leap of Thought

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.2]
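The Oogiri game is a creative task that requires humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. The results show that most LLMs exhibit constrained creativity, but the performance gap between LLMs and humans is not insurmountable.
Reference translation of the paper (metadata) (Sat, 25 Jan 2025 09:11:15 GMT)
A proposed benchmark for measuring LLM creativity; the focus on Oogiri is interesting ("This paper investigates creativity in LLMs and provides an in-depth analysis of their Leap-of-Thought (LoT) abilities through the Oogiri game.").
(Unlike commonly seen results) Qwen-VL and Gemini 1.5 Pro score higher than GPT-4o.
The project site is LoTbench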

A Survey of World Models for Autonomous Driving

A Survey of World Models for Autonomous Driving [63.3]
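Recent breakthroughs in autonomous driving have revolutionized how vehicles perceive and interact with their surroundings. World models provide high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. These world models pave the way toward more robust, reliable, and adaptable autonomous driving solutions.
Reference translation of the paper (metadata) (Mon, 20 Jan 2025 04:00:02 GMT)
A survey of world models focused on autonomous driving.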

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos [44.4]
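Video-MMMU is a benchmark designed to evaluate the ability of LMMs to acquire and use knowledge from videos. It collects 300 expert-level videos and 900 human-annotated questions across six disciplines. The delta-knowledge metric (Δknowledge) quantifies the performance improvement after watching a video.
Reference translation of the paper (metadata) (Thu, 23 Jan 2025 16:51:47 GMT)
An MMMU for video; Claude 3.5 Sonnet performs well.
The project site is Video-MMMU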
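
The abstract only says that Δknowledge quantifies the improvement after watching a video. One plausible way to express such a gain is to normalise it by the remaining headroom, as in the sketch below. The formula and the function name `delta_knowledge` are assumptions made for illustration and should be checked against the paper's own definition.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Assumed normalised gain, in percent.

    Both accuracies are percentages in [0, 100]. The gain is divided by the
    headroom (100 - acc_before) so that models starting from different
    baselines can be compared. This normalisation is an assumption, not a
    confirmed reproduction of the paper's metric.
    """
    if not (0.0 <= acc_before < 100.0 and 0.0 <= acc_after <= 100.0):
        raise ValueError("accuracies must be percentages, with acc_before < 100")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


gain = delta_knowledge(50.0, 75.0)   # half of the remaining headroom recovered
loss = delta_knowledge(50.0, 40.0)   # negative when accuracy drops after viewing
```

Under this assumed definition a negative value means the model answered worse after watching the video than before.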
