April 21, 2025 – Introducing the Latest arXiv Papers

GPT-4.1, o3, o4-mini, Gemini 2.5 Flash, Grok 3, 3-mini API, Gemma 3 QAT

Every week brings a great deal of news, but last week was especially heavy on major announcements about commercial APIs.

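The headline is OpenAI's announcements for its GPT series and o-series: the new models deliver both high performance and strong cost performance. In ChatGPT in particular, o3 has improved not only in raw model capability but also in how convenient it is when using tools. I am looking forward to o3 pro.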

Google's Gemini 2.5 Flash is a highly cost-effective model (Gemini Flash – Google DeepMind). The feature described as "Developers gain fine-grained control over the model’s thinking process, allowing them to manage resource usage." is interesting. Also worth noting from Google is the release of models suited to quantization: Gemma 3 QAT Models: Bringing state-of-the-Art AI to consumer GPUs – Google Developers Blog.
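As a rough illustration of what "fine-grained control over the model's thinking process" can look like in practice, here is a minimal sketch that only builds a request body with a cap on thinking tokens and does not send it. The field names `thinkingConfig` and `thinkingBudget` reflect my understanding of the Gemini API's REST format, and the model name is an assumed preview identifier from around the announcement; check both against the current documentation. The helper `build_request` is hypothetical.

```python
import json

# Assumption: preview model name from around the announcement; it may have changed.
MODEL = "gemini-2.5-flash-preview-04-17"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)


def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent request body that caps thinking tokens.

    Hypothetical helper; the field names are assumed from the Gemini REST format.
    """
    if thinking_budget < 0:
        raise ValueError("thinking_budget must be non-negative")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # A smaller budget trades reasoning depth for lower cost and latency.
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }


if __name__ == "__main__":
    body = build_request("Summarize this abstract in one sentence.", 1024)
    print(ENDPOINT)
    print(json.dumps(body, indent=2))
```

The body would be POSTed to the endpoint with an API key. Keeping the budget as an explicit parameter makes it easy to compare cost and quality across several budget settings for the same prompt.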

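xAI has also announced API availability for Grok 3 (Grok 3 Beta — The Age of Reasoning Agents | xAI). Judging by cost and performance, it looks like a competitive model. It is worth watching closely, including whether the company goes ahead with open-sourcing its past models.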

Ai2 Scholar QA: Organized Literature Synthesis with Attribution, Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

Ai2 Scholar QA: Organized Literature Synthesis with Attribution [40.8]
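Ai2 Scholar QA is a free online scientific question-answering application. We release the entire pipeline, both as a customizable open-source Python package and as an interactive web app. On recent scientific QA benchmarks, Ai2 Scholar QA was found to outperform competing systems.
Reference translation of the paper (metadata) (Tue, 15 Apr 2025 04:48:18 GMT)
An open-source implementation that generates reports in response to scientific questions: "we introduce Ai2 Scholar QA, a free-to-use scientific QA system (qa.allen.ai), and share our key components as open source software and public APIs."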

Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.9]
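Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating, for a given collection of scientific papers, tables that best fulfill the user's information needs. Our contributions focus on three key challenges encountered in the real world: (i) user prompts are often under-specified; (ii) retrieved candidate papers often contain irrelevant content; and (iii) task evaluation should move beyond shallow text-similarity techniques.
Reference translation of the paper (metadata) (Mon, 14 Apr 2025 14:52:28 GMT)
This work tackles the creation of literature review tables, an important task for comparative analysis. It looks like something recent LLMs should be able to solve, yet simple approaches apparently work less well than one would expect.
The repository is GitHub – JHU-CLSP/arXiv2Table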

Assessing Judging Bias in Large Reasoning Models: An Empirical Study

Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.9]
Large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities. This paper presents a benchmark of judging biases in LLMs and LRMs on both subjective preference-alignment datasets and objective fact-based datasets.
Reference translation of the paper (metadata) (Mon, 14 Apr 2025 07:14:27 GMT)
An examination of the biases LRMs show when acting as judges.
LRMs generally perform well as judges, but the authors report: "Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel “superficial reflection bias” where phrases mimicking reasoning (e.g., “wait, let me think…”) significantly influence model judgments."
The point that "We identify a novel “superficial reflection bias” in LRMs, where phrases mimicking reasoning significantly influence judging outcomes, demonstrating how reasoning mechanisms can introduce new vulnerabilities in automated evaluation." is interesting, particularly because it presumably stems from the training process.
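To make the position bias and the "superficial reflection bias" concrete, here is a minimal sketch of how one could probe a pairwise judge for both. This is not the paper's benchmark or protocol: the `Judge` signature, the two toy judges, and the sample items are hypothetical stand-ins for a real LLM or LRM call.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, option_a, option_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]
Item = Tuple[str, str, str]


def position_flip_rate(judge: Judge, items: List[Item]) -> float:
    """Fraction of items whose chosen answer changes when the options are swapped."""
    if not items:
        raise ValueError("items must not be empty")
    flips = 0
    for question, first, second in items:
        pick_original = first if judge(question, first, second) == "A" else second
        pick_swapped = second if judge(question, second, first) == "A" else first
        if pick_original != pick_swapped:
            flips += 1
    return flips / len(items)


def reflection_flip_rate(
    judge: Judge, items: List[Item], cue: str = "wait, let me think..."
) -> float:
    """Fraction of items whose verdict changes when a reasoning-like cue
    is appended to the second option."""
    if not items:
        raise ValueError("items must not be empty")
    flips = 0
    for question, first, second in items:
        before = judge(question, first, second)
        after = judge(question, first, f"{second} {cue}")
        if before != after:
            flips += 1
    return flips / len(items)


# Toy judges standing in for a real model call.
def always_last(question: str, a: str, b: str) -> str:
    """Always prefers the option in the later position."""
    return "B"


def longest_answer(question: str, a: str, b: str) -> str:
    """Prefers the longer option, so added filler text can sway it."""
    return "A" if len(a) >= len(b) else "B"


ITEMS: List[Item] = [
    ("What is 2 + 2?", "4", "five"),
    ("What is the capital of France?", "Paris, France", "Lyon"),
]

if __name__ == "__main__":
    for judge in (always_last, longest_answer):
        print(
            judge.__name__,
            position_flip_rate(judge, ITEMS),
            reflection_flip_rate(judge, ITEMS),
        )
```

A judge that is consistent under swapping has a position flip rate of 0, while one that always favors the later slot flips on every item. The reflection probe changes only surface text, so any verdict change it causes cannot be attributed to answer quality.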

ReadMe.LLM: A Framework to Help LLMs Understand Your Library

ReadMe.LLM: A Framework to Help LLMs Understand Your Library [45.0]
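Large language models (LLMs) often struggle with code generation tasks involving niche software libraries. Existing code generation techniques can fail when given only human-oriented documentation. We propose ReadMe.LLM, LLM-oriented documentation for software libraries.
Reference translation of the paper (metadata) (Mon, 14 Apr 2025 01:57:43 GMT)
A proposal for a README aimed at code-generation AI and LLMs. The authors state: "We presented the optimal ReadMe.LLM structure, which has the highest average accuracy across different models, and increases correctness by 5x."
To get the full benefit of code-generation assistance, there are quite a few situations where one ends up choosing a major library (one that LLMs presumably know well), so I hope something like this becomes widespread.
The project site is ReadMe LLM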

Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025 [115.9]
Review Feedback Agent provides reviewers with automated feedback on vague comments, content misunderstandings, and unprofessional remarks. It was implemented at ICLR 2025 as a large-scale randomized controlled study. 27% of the reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by reviewers.
Reference translation of the paper (metadata) (Sun, 13 Apr 2025 22:01:25 GMT)
An evaluation of the effect of the Review Feedback Agent at ICLR, with positive results: "This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers."
The repository is GitHub – zou-group/review_feedback_agent
Unrelated to the main argument, but "Authors at AI conferences increasingly report receiving short, vague reviews with criticisms like ‘not novel’ or ‘not state-of-the-art (SOTA)’ " sounds like a tough situation...

A paper that looks similar but is quite different, AI Scientist-v2, is also interesting: "We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review."

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search [16.9]
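AI Scientist-v2 is an end-to-end agentic system capable of producing the first AI-generated workshop paper accepted through peer review. It iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes the data, and autonomously writes scientific manuscripts. One manuscript achieved scores high enough to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating peer review.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 18:44:41 GMT)
The repository is GitHub – SakanaAI/AI-Scientist-v2: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search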
