Introducing recent arXiv papers

START: Self-taught Reasoner with Tools

START: Self-taught Reasoner with Tools [51.4]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START can perform complex computations, self-checking, exploration of diverse methods, and self-debugging. It significantly outperforms its base model QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 17:11:51 GMT)
A proposal of START (Self-Taught Reasoner with Tools), which performs tool-integrated CoT. The two key points are "Hint-infer: code/math data is processed by QwQ, with responses truncated at predefined terminators. Context-aware hints from a Hint-Library are injected at truncation points (including endpoints), and QwQ resumes inference using a code interpreter for Python execution feedback." and "b) Hint-RFT: Hint-infer outputs undergo rule-based scoring, filtering, and content modification to create Dseed." My impression is that rules and templates are being integrated skillfully, and many variations on this kind of technique seem possible.
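The truncate-and-inject step quoted above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the terminator strings, the hint text, and the function name `inject_hints` are all assumptions made for this example, and the resumed inference with a code interpreter is left out.

```python
from typing import List, Sequence

# Hypothetical terminators and hint text, chosen only for illustration.
# The paper's actual Hint-Library is not reproduced here.
TERMINATORS = ("Alternatively", "Wait")
HINT = " Maybe I can use Python to check this."


def inject_hints(response: str,
                 terminators: Sequence[str] = TERMINATORS,
                 hint: str = HINT) -> List[str]:
    """Return one hinted prefix per truncation point, endpoint included."""
    prefixes = []
    for term in terminators:
        start = 0
        while True:
            idx = response.find(term, start)
            if idx == -1:
                break
            # Truncate just before the terminator and append the hint.
            prefixes.append(response[:idx] + hint)
            start = idx + len(term)
    # The endpoint of the response is also a truncation point.
    prefixes.append(response + hint)
    return prefixes


if __name__ == "__main__":
    demo = "2 + 2 = 4. Wait, is that right? Alternatively, count on fingers."
    for prefix in inject_hints(demo):
        print(repr(prefix))
```

Each returned prefix would then be handed back to the model, which continues generating from the hint; that continuation is where a tool call would be expected to appear.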

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment [35.2]
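The proposed method captures preferences from a well-aligned English model via implicit rewards and transfers them to other languages through iterative training. Llama3 fine-tuned for two iterations improved the win rate by 12.72% on average and raised the length-controlled win rate across all training languages on the X-AlpacaEval leaderboard by 5.97%.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 17:33:01 GMT)
"we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training." In other words, a proposal of a method for transferring English preferences to multiple languages. It consists of three parts: "Multilingual Responses Generation, Implicit Cross-lingual Rewarding, Preference Transfer Training".
The repository is GitHub – ZNLP/Implicit-Cross-Lingual-Rewarding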
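To make "implicit rewards" concrete, here is a minimal sketch that assumes the DPO-style form, where the reward is the scaled log-probability ratio between an aligned model and its reference model. Whether the paper uses exactly this form is an assumption of this example; the field names, the `beta` value, and the log-probabilities are invented for illustration.

```python
def implicit_reward(logp_aligned: float, logp_reference: float,
                    beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi_aligned - log pi_ref)."""
    return beta * (logp_aligned - logp_reference)


def build_preference_pair(candidates, beta: float = 0.1):
    """Pick the highest- and lowest-reward responses as chosen / rejected.

    `candidates` is a list of dicts with hypothetical keys
    'text', 'logp_aligned' and 'logp_reference'.
    """
    ranked = sorted(
        candidates,
        key=lambda c: implicit_reward(c["logp_aligned"],
                                      c["logp_reference"], beta),
    )
    return {"chosen": ranked[-1]["text"], "rejected": ranked[0]["text"]}


if __name__ == "__main__":
    # Made-up sequence log-probabilities for three sampled responses.
    samples = [
        {"text": "A", "logp_aligned": -10.0, "logp_reference": -12.0},
        {"text": "B", "logp_aligned": -15.0, "logp_reference": -11.0},
        {"text": "C", "logp_aligned": -9.0, "logp_reference": -9.0},
    ]
    print(build_preference_pair(samples))
```

Pairs built this way for non-English responses could then feed a preference-training step, which is the role the "Preference Transfer Training" stage appears to play.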

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Predictive Data Selection: The Data That Predicts Is the Data That Teaches [19.0]
Predictive Data Selection (PreSelect) is a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. We show that models trained on 30B tokens selected with PreSelect surpass the performance of a vanilla baseline trained on 300B tokens.
Reference translation of the paper (metadata) (Tue, 04 Mar 2025 06:15:27 GMT)
A proposal of PRESELECT, a data selection method designed under the hypothesis "Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning." The authors claim effectiveness: "PRESELECT demonstrates remarkable performance, with an average absolute improvement of 2.8% over the random selection and 20% gains in Math and Code raw text BPC, which shows a promising trend."
The repository is GitHub – hkust-nlp/PreSelect
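One way to read "data on which model losses are predictive of downstream abilities" is as a ranking-agreement score per document. The sketch below is an illustration of that idea under stated assumptions, not the paper's scoring function: `predictive_strength`, the pairwise-agreement measure, and all numbers are made up for this example, and the fastText scorer trained afterwards is omitted.

```python
from itertools import combinations
from typing import Sequence


def predictive_strength(losses: Sequence[float],
                        scores: Sequence[float]) -> float:
    """Fraction of model pairs where lower loss on this document goes
    together with a higher downstream score.

    losses[i] is model i's loss on one document and scores[i] is the same
    model's downstream benchmark score. Ties get no special treatment.
    """
    pairs = list(combinations(range(len(losses)), 2))
    agree = sum(
        1 for i, j in pairs
        if (losses[i] < losses[j]) == (scores[i] > scores[j])
    )
    return agree / len(pairs)


if __name__ == "__main__":
    downstream = [0.4, 0.5, 0.6]      # made-up scores for three models
    doc_good = [3.0, 2.5, 2.0]        # loss falls as the models get better
    doc_bad = [2.0, 2.5, 3.0]         # loss rises as the models get better
    print(predictive_strength(doc_good, downstream))
    print(predictive_strength(doc_bad, downstream))
```

Documents with a high score would be treated as positive examples and the rest as negatives when training a lightweight classifier to score new text.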

Toward Robust Non-Transferable Learning: A Survey and Benchmark

Toward Robust Non-Transferable Learning: A Survey and Benchmark [51.5]
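Non-transferable learning (NTL) is a task aimed at reshaping the generalization abilities of deep learning models. We introduce NTLBench, the first benchmark for evaluating NTL performance and robustness. We discuss the practical applications of NTL and its future directions and challenges.
Reference translation of the paper (metadata) (Wed, 19 Feb 2025 10:12:19 GMT)
A survey of Non-Transferable Learning, whose aim is described as follows: "Its goal is to prevent the model’s generalization to specific target domains or tasks (such as harmful [Rosati et al., 2024; Huang et al., 2024b] or unauthorized domains [Wang et al., 2022b; Si et al., 2024]) while preserving its normal functionality on a source domain."
The benchmark is reportedly planned for release. GitHub – tmllab/NTLBench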

Shh, don’t say that! Domain Certification in LLMs

Shh, don’t say that! Domain Certification in LLMs [124.6]
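Large language models (LLMs) are often deployed to perform constrained tasks in narrow domains. Domain certification is a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose VALID, a simple yet effective approach that provides adversarial bounds as a certificate.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 17:13:19 GMT)
A proposal of Verified Adversarial LLM Output via Iterative Dismissal (VALID), a method that keeps a model from giving answers outside the intended domain even when it receives arbitrary inputs.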

EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.4]
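This paper proposes the task of equivalence checking as a new way to evaluate the code reasoning capabilities of large language models. EquiBench is a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. The results show that OpenAI o3-mini reaches an accuracy of 78.0%.
Reference translation of the paper (metadata) (Tue, 18 Feb 2025 02:54:25 GMT)
A benchmark on "Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs". o3-mini stands a head above the rest.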
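As a small illustration of what an equivalence-checking question looks like, the sketch below compares program pairs by random testing. The functions and the `find_counterexample` helper are made up for this example and are not taken from EquiBench. Random testing can only refute equivalence by finding a differing input; it cannot prove that two programs agree on all inputs, which is part of what makes the task hard.

```python
import random
from typing import Callable, Optional


def sum_loop(n: int) -> int:
    """Sum 1..n with an explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    """Closed form; equivalent to sum_loop for n >= 0."""
    return n * (n + 1) // 2


def sum_buggy(n: int) -> int:
    """Looks similar but is not equivalent (off by n for n >= 1)."""
    return n * (n - 1) // 2


def find_counterexample(f: Callable[[int], int],
                        g: Callable[[int], int],
                        trials: int = 1000,
                        seed: int = 0) -> Optional[int]:
    """Return an input on which f and g differ, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000)
        if f(x) != g(x):
            return x
    return None


if __name__ == "__main__":
    print(find_counterexample(sum_loop, sum_formula))  # no difference found
    print(find_counterexample(sum_loop, sum_buggy))    # some differing input
```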

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.9]
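Retrieval-Augmented Generation (RAG) has proven effective at mitigating hallucinations in Large Language Models (LLMs). Existing automatic evaluation metrics cannot accurately evaluate the outputs generated by RAG models during training and evaluation. This paper proposes the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs so that RAG models can be evaluated more accurately.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 04:50:43 GMT)
A proposal of an evaluation method targeting RAG: "Judge-Consistency (ConsJudge), a method that enhances LLM-based judgment models to generate more accurate evaluations for RAG models in a self-improvement framework."
The repository is GitHub – OpenBMB/ConsJudge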

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.5]
Recent studies have shown that giving models more time to think through longer chains of thought (CoTs) yields significant improvements on complex reasoning tasks. We examine whether scaling with longer CoTs can impair the reasoning performance of Large Language Models (LLMs) in certain domains.
Reference translation of the paper (metadata) (Tue, 25 Feb 2025 10:48:05 GMT)
A proposal of the "Thinking-OPtimal Scaling strategy (TOPS) that allows LLMs to decide by themselves how many tokens are needed to solve a given problem.", which provides sufficient CoT while keeping overly long CoT from hurting performance.
It has a three-stage structure: "Format Imitation enables the base model to learn how to adopt different levels of reasoning effort e_i to perform System-2 thinking, using a small set of seed data. Reasoning Effort-Conditioned Generation requires the model to apply System-2 thinking to a large set of problems under different reasoning efforts. Self-Improvement select the shortest correct response for each problem among all responses to fine-tune the base model to achieve thinking-optimal test-time scaling."
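The selection rule in the Self-Improvement stage ("select the shortest correct response for each problem") can be sketched as follows. The record fields `problem_id`, `num_tokens`, and `is_correct` and the function name are assumptions for illustration; how correctness is judged and how tokens are counted is left open here.

```python
from typing import Dict, Iterable


def shortest_correct_per_problem(responses: Iterable[dict]) -> Dict[str, dict]:
    """Keep, for each problem, the correct response with the fewest tokens.

    Problems with no correct response are simply absent from the result.
    """
    best: Dict[str, dict] = {}
    for r in responses:
        if not r["is_correct"]:
            continue
        current = best.get(r["problem_id"])
        if current is None or r["num_tokens"] < current["num_tokens"]:
            best[r["problem_id"]] = r
    return best


if __name__ == "__main__":
    # Made-up responses sampled under different reasoning efforts.
    sampled = [
        {"problem_id": "p1", "effort": "low", "num_tokens": 120, "is_correct": False},
        {"problem_id": "p1", "effort": "mid", "num_tokens": 400, "is_correct": True},
        {"problem_id": "p1", "effort": "high", "num_tokens": 900, "is_correct": True},
        {"problem_id": "p2", "effort": "low", "num_tokens": 80, "is_correct": True},
        {"problem_id": "p3", "effort": "high", "num_tokens": 700, "is_correct": False},
    ]
    for pid, r in shortest_correct_per_problem(sampled).items():
        print(pid, r["effort"], r["num_tokens"])
```

The selected responses would then serve as fine-tuning targets, so the model learns to spend only as many tokens as a problem needs.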

Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs

Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [53.0]
In large language models (LLMs), code and reasoning reinforce each other. Code provides verifiable execution paths, enforces logical decomposition, and enables runtime verification. We identify key challenges and propose future research directions to strengthen this synergy.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 18:55:42 GMT)
A survey described as "(i) analyzing how code serves as an effective reasoning medium, helping LLMs structure their reasoning and validate results (§2); (ii) exploring how enhanced reasoning capabilities expand the boundaries of code intelligence (§3); (iii) summarizing current challenges, focusing on open problems in model interpretability, scalable training, and multimodal fusion, while proposing future research directions".
The synergy between code and logical reasoning is interesting, though I can't help thinking the same may hold for humans.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [3.8]
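In the experiment, a model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts unrelated to coding. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct.
Reference translation of the paper (metadata) (Mon, 24 Feb 2025 18:56:03 GMT)
The finding, "We find that aligned models finetuned on insecure code develop broad misalignment—expressing anti-human views, providing dangerous advice, and acting deceptively.", is intriguing. It also seems related to the survey above.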

A Survey on Large Language Models for Automated Planning / Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

A Survey on Large Language Models for Automated Planning [15.8]
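We critically survey existing work on the use of large language models (LLMs) in automated planning. Because of their limitations, LLMs are not well suited to serve as standalone planners, but combined with other approaches they offer a significant opportunity to enhance planning applications.
Reference translation of the paper (metadata) (Tue, 18 Feb 2025 02:11:03 GMT)
A survey on automated planning with LLMs.
Planning is an essential capability for agents, and a survey on this topic is valuable.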

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems [11.5]
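Large language models (LLMs) have recently shown remarkable capabilities in reasoning, planning, and decision-making. Researchers have begun incorporating LLMs into multi-agent systems to tackle tasks beyond the scope of single-agent settings. This survey aims to serve as a catalyst for further innovation, fostering more robust, scalable, and intelligent multi-agent systems.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 07:18:34 GMT)
A survey of multi-agent systems with a focus on communication.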
