Introducing the Latest arXiv Papers

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions [35.8]
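The exponential growth of Large Language Models (LLMs) continues to underscore the need for efficient strategies to meet ever-expanding computational and data demands. This survey comprehensively analyzes two complementary paradigms: knowledge distillation (KD) and dataset distillation (DD).
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 23:50:23 GMT)
A survey on distillation.
It states that "Crucially, the success of KD in LLMs hinges on DD techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs." A survey that treats knowledge distillation and dataset distillation as a pair may be unusual.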
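As a rough illustration of the knowledge distillation side, the sketch below computes the classic temperature-softened KL loss between teacher and student logits. This is the textbook formulation rather than anything taken from the survey; the function names and the default temperature are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is usually mixed with an ordinary cross-entropy term on ground-truth labels.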

Multilingual Performance Biases of Large Language Models in Education

Multilingual Performance Biases of Large Language Models in Education [39.1]
Large language models (LLMs) are increasingly being adopted in educational settings. This study examines whether their use in non-English educational settings is warranted.
Reference translation of the paper (metadata) (Thu, 24 Apr 2025 16:32:31 GMT)
The paper makes the sensible point that "However, we note that certain models can do terribly on some tasks and languages, so we recommend first verifying that a particular model works well in a particular language on a specific education-related task before deployment." Still, the finding that "Particularly, we find that GPT4o and Gemini 2.0 perform consistently well across all languages with a few exceptions." gives the impression that multilingual support has come a long way.
The repository is GitHub – eth-lre/multilingual-educational-llm-bias: Data and code for “Multilingual Performance Biases of Large Language Models in Education”

(Im)possibility of Automated Hallucination Detection in Large Language Models

(Im)possibility of Automated Hallucination Detection in Large Language Models [40.1]
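We propose a theoretical framework for analyzing the feasibility of automatically detecting hallucinations generated by large language models (LLMs). We examine whether an algorithm trained on examples drawn from an unknown target language can reliably determine whether an LLM's output is correct or constitutes a hallucination. We show that using expert-labeled feedback, i.e., training the detector on both positive examples (correct statements) and negative examples (incorrect statements), dramatically changes the conclusion, which is otherwise negative for detectors trained only on correct examples.
Reference translation of the paper (metadata) (Wed, 23 Apr 2025 18:00:07 GMT)
A report on hallucination. Its two main findings are "Automated detection of hallucinations by a detector that is trained only on correct examples (positive examples) is inherently difficult and typically impossible without additional assumptions or signals." and "Reliable automated hallucination detection is achievable when the detector is trained using both correct (positive) and explicitly labeled incorrect (negative) examples."
As the paper itself points out, "These findings underscore the critical role of human feedback in practical LLM training.", which is consistent with how models are built today (though whether the feedback really needs to come from humans remains to be seen).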

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.1]
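We propose the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. We construct the GUI-Net dataset, which contains 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 06:15:56 GMT)
Dataset construction from web tutorials, followed by agent development via fine-tuning.
The project site is TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials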

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning [57.1]
We introduce PaperCoder, a framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, analysis, and generation. It consistently shows strong results on the recently released PaperBench benchmark.
Reference translation of the paper (metadata) (Thu, 24 Apr 2025 01:57:01 GMT)
The paper proposes a three-stage framework: "(1) Planning, where a high-level implementation plan is constructed based on the paper’s content, including overall plan, architectural design, logic design, and configuration files; (2) Analyzing, where the plan is translated into detailed file-level specifications; and (3) Coding, where the final codes are generated to implement the paper’s methods and experiments."
It is also interesting that, according to "Results show that 77% of participants preferred PaperCoder’s implementation over alternatives, and 83% found the outputs practically useful for real-world usage.", the output is not only better than other implementations but also appears reasonably useful in practice.

It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization [26.4]
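We reconceptualize neural architectures as associative memory modules that learn a mapping of keys and values using an internal objective called attentional bias. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. For example, certain instances of Miras achieve exceptional performance on tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, outperforming Transformers and other modern linear recurrent models.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 17:59:33 GMT)
An exploration of new architectures by Google, proposing the Miras framework: "Building upon our formulation of memory and forget gate, we present Miras, a fundamental framework to design novel sequence modeling architectures by four choice of: (1) Attentional bias (i.e., memory objective), (2) Retention gate, (3) Memory architecture, and (4) Memory learning algorithm (i.e., optimizer)."
Moneta, Yaad, and Memora are selected as promising architectures and their performance is verified. The models are on the small side, up to 1.3B, but the results look very promising.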
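To make the idea of an associative memory that learns key-value mappings more concrete, here is a minimal sketch of a single online memory update. It assumes an L2 attentional bias, 0.5 * ||M k - v||^2, a plain gradient-descent step as the memory learning algorithm, a matrix-valued memory, and a scalar decay as the retention gate. These choices and the names `memory_step`, `lr`, and `retain` are assumptions for illustration; this is not the Moneta, Yaad, or Memora model.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def memory_step(M, key, value, lr=0.5, retain=1.0):
    """One update: decay the old memory, then step down the L2 recall loss.

    The gradient of 0.5 * ||M k - v||^2 with respect to M is (M k - v) k^T.
    """
    prediction = matvec(M, key)
    error = [p - v for p, v in zip(prediction, value)]
    return [
        [retain * M[i][j] - lr * error[i] * key[j] for j in range(len(key))]
        for i in range(len(value))
    ]
```

Starting from an all-zero memory, one step with a unit-norm key and `lr=1.0` stores the pair exactly, so `matvec(M, key)` recalls the value. Setting `retain` below 1 makes older associations fade over time.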

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning [95.3]
DeepMath-103K is a new large-scale dataset of about 103K mathematical problems. Each problem includes a verifiable final answer, which enables rule-based RL. We demonstrate that models trained on DeepMath-103K improve significantly on challenging mathematical benchmarks.
Reference translation of the paper (metadata) (Tue, 15 Apr 2025 17:59:51 GMT)
A mathematical dataset with the following feature: "Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation."
The repository is GitHub – zwhe99/DeepMath: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
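A verifiable final answer is what makes a rule-based reward possible. The sketch below shows one simple way such a reward could be computed: it extracts the last `\boxed{...}` expression from a model output and compares it with the reference answer. The `\boxed` convention, the numeric normalization, and all function names are assumptions for illustration, not the verification code from the DeepMath repository.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    # Return the content of the last \boxed{...} (no nested braces), or None.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def normalize(answer):
    # Compare numerically when possible, otherwise fall back to the raw string.
    answer = answer.replace(" ", "")
    try:
        return Fraction(answer)
    except (ValueError, ZeroDivisionError):
        return answer


def rule_based_reward(model_output, reference_answer):
    # Reward 1.0 for a matching final answer, 0.0 otherwise.
    predicted = extract_boxed(model_output)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference_answer) else 0.0
```

For example, an output ending in `\boxed{0.5}` earns a reward of 1.0 against the reference `1/2`, while an output with no boxed answer earns 0.0. A real verifier would need far more robust parsing and symbolic comparison.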

UFO2: The Desktop AgentOS, UI-TARS-1.5

UFO2: The Desktop AgentOS [60.3]
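UFO2 is a multi-agent AgentOS for Windows desktops that moves toward practical, system-level automation. We evaluate UFO2 on more than 20 real-world Windows applications and show substantial improvements in robustness and execution accuracy over previous CUAs. Our results show that deep OS integration opens a scalable path toward reliable, user-oriented desktop automation.
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 13:04:43 GMT)
Version 2 of UFO, covered earlier in "OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused) – Introducing the Latest arXiv Papers". As with AgentS, the steady stream of new versions shows how lively this field is. ByteDance has also released UI-TARS-1.5, the next version of the model covered in "UI-TARS: Pioneering Automated GUI Interaction with Native Agents – Introducing the Latest arXiv Papers" (UI-TARS: Next-generation native GUI agent model designed to interact seamlessly with GUIs using human-like perception; see below).
The repository is GitHub – microsoft/UFO: The Desktop AgentOS.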

Introducing UI-TARS-1.5
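UI-TARS-1.5 is an open-source multimodal agent built on a powerful vision-language model. It integrates advanced reasoning enabled by reinforcement learning, and achieves state-of-the-art results on a variety of standard benchmarks.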

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.8]
This paper introduces the Judge Evaluation for Test-Time Scaling (JETTS) benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that judges are competitive with outcome reward models in reranking, but consistently worse than process reward models in beam search.
Reference translation of the paper (metadata) (Mon, 21 Apr 2025 17:33:23 GMT)
Motivated by "we seek to understand the feasibility of using LLM-judges in place of typically used RMs in test-time compute procedures.", the paper proposes a benchmark: "we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement." The results are fairly sobering: "We find that weak judges can help strong generators in easier tasks, such as instruction following, but not in reasoning-intensive tasks like coding or math. Larger judges bring the most benefit for math and instruction following tasks, but no evaluated judges are able to reliably improve generator performance for coding. Lastly, while natural language critiques are touted as a defining advantage of judges over RMs, we find that such critiques have significant room for improvement in terms of utility."
The repository is GitHub – SalesforceAIResearch/jetts-benchmark: Code repository for the paper “Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators”
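Of the three task settings, response reranking is the simplest to picture: sample several candidate responses, have a judge score each one, and keep the best. The sketch below shows that best-of-N loop. The `judge` callable and the toy length-based judge are hypothetical stand-ins for an LLM judge and are not part of the JETTS code.

```python
from typing import Callable, List


def rerank_best_of_n(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest judge score (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    scores = [judge(prompt, candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def toy_length_judge(prompt: str, response: str) -> float:
    """Placeholder judge that simply prefers longer responses."""
    return float(len(response))
```

Swapping `toy_length_judge` for a function that prompts an LLM judge, or for a reward model, leaves the reranking loop unchanged, which is what allows judges and reward models to be compared under the same procedure.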

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks [37.8]
This paper examines more than 2,000 multilingual (non-English) benchmarks from 148 countries. English is significantly overrepresented in these benchmarks. Most benchmarks rely on original-language content rather than translations.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 01:47:37 GMT)
A survey report on multilingual benchmarks. The statement that "Importantly, simply translating English benchmarks proves insufficient for robust evaluation, localized benchmarks (like CMMLU for Chinese) show substantially higher correlation with human judgments (0.68) than translated equivalents (0.47 and 0.49), highlighting the critical need for culturally and linguistically authentic evaluation resources." is what one would expect, but seeing it backed by numbers makes it convincing.
