2025年2月 – ページ 5 – arXiv最新論文の紹介

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8]
本稿では,新規な知識蒸留法である$textitTemporally Adaptive Interpolated Distillation (TAID)$を紹介する。 TAIDは,各種モデルサイズおよびアーキテクチャに対して,命令チューニングと事前学習のシナリオにおいて優れた性能を示す。これらの結果は、TAIDが高性能で効率的なモデルの作成に有効であることを示し、よりアクセスしやすいAI技術の開発を推進している。
論文参考訳（メタデータ） (Wed, 29 Jan 2025 05:51:25 GMT)
「TAID reduces the gap between teacher and student model throughout the training process by dynamically introducing an intermediate teacher that interpolates teacher and student models to provide a target distribution with a modest capability」という蒸留法の提案
ニュースリリースは新手法「TAID」を用いた小規模日本語言語モデル「TinySwallow-1.5B」の公開、リポジトリはTinySwallow – a SakanaAI Collection
Deepseek R1のようにライセンス上蒸留を許可しているLRM/LLMが出てきたことによるこの手の手法の重要性が上がっているように思う。

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [20.8]
MedXpertQAには17の専門分野と11の身体システムにまたがる4,460の質問が含まれている。 MMは、多様な画像と豊富な臨床情報を備えた専門家レベルの試験問題を導入する。
論文参考訳（メタデータ） (Thu, 30 Jan 2025 14:07:56 GMT)
Medical分野のベンチマーク。o1だけでなくDeepseek R1の結果も載っており、対応が速い。この結果だとo1はDeepseek R1より大幅にスコアが高い。
リポジトリはGitHub – TsinghuaC3I/MedXpertQA: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Teaching Language Models to Critique via Reinforcement Learning

Teaching Language Models to Critique via Reinforcement Learning [59.4]
我々は、CTRLでトレーニングされた批評家が、パスレートを大幅に向上し、ベースモデルとより強力なジェネレータモデルの両方でエラーを軽減することを示した。また、これらの批判モデルが正確な生成報酬モデルとして機能し、反復的批評・修正によるテストタイムスケーリングを可能にすることを示す。
論文参考訳（メタデータ） (Wed, 05 Feb 2025 02:18:46 GMT)
「two-stage training approach: (1) synthesizing high-quality critiques by reasoning about execution feedback, then (2) refining the critic through reinforcement learning.」という2ステージ構成、強化学習（GRPO）を活用したcriticモデルの構築。
プロジェクトサイトはCTRL: Critic Training via Reinforcement Learning

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment [88.7]
Olaはオムニモーダル言語モデルであり、画像、ビデオ、音声の理解間での競合的なパフォーマンスを実現する。我々は、Olaを、この新興分野における将来の研究を進めるための、完全にオープンなオムニモーダル理解ソリューションにすることを目指している。
論文参考訳（メタデータ） (Thu, 06 Feb 2025 18:59:55 GMT)
MLLMの公開モデル、既存の同規模のモデルと比較して性能が高く、マルチモーダルさも大きい（この論文ではOmni Modalと表現）
プロジェクトサイトはOla、モデルはTHUdyh/Ola-7b · Hugging Face

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [92.9]
我々はマルコフ決定過程(MDP)として検索強化推論をモデル化するDeepRAGを提案する。クエリを反復的に分解することで、DeepRAGは外部知識を取得するか、あるいは各ステップでパラメトリック推論に依存するかを動的に決定する。実験の結果、DeepRAGは解答精度を21.99%向上させ、検索強化推論の最適化の有効性を示した。
論文参考訳（メタデータ） (Mon, 03 Feb 2025 08:22:45 GMT)
「(1) Binary Tree Search, (2) Imitation Learning, and (3) Chain of Calibration.」とかなり凝ったRAG。精度向上に効果があるのはそうだろうと思うが・・・。

Large Language Model Critics for Execution-Free Evaluation of Code Changes

Large Language Model Critics for Execution-Free Evaluation of Code Changes [5.2]
大規模言語モデル(LLM)は、ソフトウェアエンジニアリングタスクを自動化するための有望な方法を提供する。ビルド状況や時折のログ分析などを評価するための既存のメトリクスは、変更の質を評価するのに必要な情報を提供するには不十分で制限されています。本研究では,LLMをベースとした批判者に対して,コード変更の実行可能性に対する厳密で厳密な中間レベル/ステップレベルの,実行不要な評価プロキシを導出する設計を行った。
論文参考訳（メタデータ） (Tue, 28 Jan 2025 02:38:56 GMT)
「We introduce our test-centric framework utilizing isolated, test-aware LLM critics, which leverage a candidate patch against each associated test individually to predict whether the patch helps that test pass or not.」
リポジトリはGitHub – amazon-science/code-agent-eval: Implemental for the paper “Large Language Model Critics for Execution-Free Evaluation of Code Changes”

Gemini 2.0: Flash, Flash-Lite and Pro, OpenAI deep research

毎週様々なニュースが発表されるが、先週はGoogleのGemini 2.0シリーズのニュースが大きかった。特にFlash Liteはdeepseek と競争的な価格のAPIであり価格競争の面でも大きなニュースだった。Gemini 2.0: Flash, Flash-Lite and Pro – Google Developers Blog、Xユーザーのswyx 🔜 @aidotEngineer NYCさん: 「With Gemini 2.0 GA pricing/benchs, it’s official: @GoogleDeepMind has the Mandate of Heaven. https://t.co/pfOlxb57Yx」 / X

OpenAIはDeep researchを発表、これもPerplexityなど競合するサービスはあるもののOpenAI自ら発表したこと、性能が高いことなどもあって大きな話題になった。Introducing deep research | OpenAI

APIは強烈な価格競争が起きていて、OpenAIもアプリレイヤで戦わざるを得ないのか、それとも大きな目標に必要な動きなのかなど詳細は不明だが、LLMのコスパ向上、便利なアプリケーションの登場はユーザサイドにとってはありがたい。（一方でスタートアップにとっては…）

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Preference Leakage: A Contamination Problem in LLM-as-a-judge [70.0]
審査員としてのLLM(Large Language Models)とLLMに基づくデータ合成は、2つの基本的なLLM駆動型データアノテーション法として登場した。本研究では, 合成データ生成器とLCMに基づく評価器の関連性に起因するLCM-as-a-judgeの汚染問題である選好リークを明らかにする。
論文参考訳（メタデータ） (Mon, 03 Feb 2025 17:13:03 GMT)
LLM-as-a-jedgeを使用するときの潜在的なLeakの可能性について指摘した論文。同じモデル、派生モデル、同じファミリーのモデルでバイアスがどの程度か検証。「The results of our main experiment, measured using the proposed preference leakage score, reveal a clear bias in each judge toward its respective student model.」と今までも同じモデルの出力を好むような指摘はあったが、それを裏付ける結果となっている。「We also observe that this bias is more pronounced in comparable model pairs and larger student models.」の大きなモデルで問題が大きいというのも興味深い。
リポジトリはGitHub – David-Li0406/Preference-Leakage

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.7]
本稿では,大規模言語モデルと勾配ブースト決定木を融合させる,シンプルで軽量な手法を提案する。融合法を LLM-Boost と PFN-Boost と命名した。多数のベースラインとアンサンブルアルゴリズムに対して最先端の性能を示す。
論文参考訳（メタデータ） (Thu, 06 Feb 2025 02:39:35 GMT)
「We propose LLM-Boost: a novel yet simple and easy-to-implement boosting mechanism that combines LLMs, which ingest semantic column headers, with GBDTs that can scale to massive datasets.」、「We further propose PFN-Boost, where we instead fuse TabPFN and GBDTs for performance gains over GBDTs alone across dataset sizes without using column headers.」とLLMやTransformerとGBDTを融合するアプローチ。データサイズによって効果があるというのはそうだろうと思う。
リポジトリはGitHub – MayukaJ/LLM-Boost

s1: Simple test-time scaling

s1: Simple test-time scaling [148.4]
テスト時間スケーリングは、パフォーマンスを改善するために余分なテスト時間計算を使用する言語モデリングに対する、有望な新しいアプローチである。テストタイムのスケーリングと強力な推論性能を実現するための最もシンプルなアプローチを探します。
論文参考訳（メタデータ） (Mon, 03 Feb 2025 16:31:30 GMT)
「We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the pareto frontier 」という報告。「First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end.」とWaitを使うのが特徴的（Think before you speak: Training Language Models With Pause Tokens – arXiv最新論文の紹介を思い出す）
リポジトリはGitHub – simplescaling/s1: s1: Simple test-time scaling

2025年2月
月	火	水	木	金	土	日
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28