staka – Page 29 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models

The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models [18.4]
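Multilingual reasoning requires language models to handle logical reasoning across languages. This survey provides the first in-depth review of multilingual reasoning in language models.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 16:25:16 GMT)
A survey on multilingual support.
Japanese appears to hold up fairly well, but is that really the case?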

Exploring Translation Mechanism of Large Language Models

Exploring Translation Mechanism of Large Language Models [23.7]
Large language models (LLMs) have been remarkably successful at multilingual translation tasks. This work investigates the translation mechanism of LLMs from the perspective of computational components.
Reference translation of the paper (metadata) (Mon, 17 Feb 2025 13:50:29 GMT)
An analysis of translation with LLMs. According to the paper, "translation is predominantly facilitated by a sparse subset of specialized attention heads (less than 5%), which extract source language, indicator, and positional features. MLPs subsequently integrate and process these features by transiting towards English-centric latent representations."

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Judging the Judges: A Collection of LLM-Generated Relevance Judgements [37.1]
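This paper benchmarks and reports on the results of the LLMJudge challenge at SIGIR 2024, a large-scale automatic relevance judgment evaluation. It releases and benchmarks 42 sets of LLM-generated labels for the TREC 2023 Deep Learning track relevance judgments, produced by eight international teams.
Reference translation of the paper (metadata) (Wed, 19 Feb 2025 17:40:32 GMT)
The paper states: "This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed." It is a roundup of the various approaches that were evaluated.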

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective [23.3]
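Tabular data is one of the most widely used data formats across domains such as bioinformatics, healthcare, and marketing. This survey examines reinforcement learning (RL) and generative approaches to feature selection and feature generation as fundamental techniques for refining the data space. It aims to summarize existing challenges, discuss future research directions, and provide insights that foster continued innovation in the field.
Reference translation of the paper (metadata) (Wed, 12 Feb 2025 22:34:50 GMT)
The paper states: "Tabular data-centric AI is evolving with RL-based optimization and generative modeling playing a key role in feature engineering." It surveys RL-based optimization and the use of generative AI for tabular data, whose importance has not diminished even now.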

A survey on imbalanced data has also come out. This, too, has long been an important perspective.

A Comprehensive Survey on Imbalanced Data Learning [45.3]
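Imbalanced data is prevalent across many kinds of raw data and hinders machine learning performance. This survey systematically analyzes various real-world data formats. Existing work across these data formats is grouped into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 04:53:17 GMT)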
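As a small illustration of the first category, data re-balancing, here is a minimal sketch of random oversampling. It is not taken from the survey; the function name `random_oversample` and the toy data are assumptions made for this example only.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Add random duplicates until the class reaches the target size.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: four samples of class 0 and one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling only duplicates existing points, so it changes class frequencies without adding information; the other three categories in the survey address the problem at the representation, training, or model-combination level.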

The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding? [39.6]
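This paper proposes Crescent, a framework that generates high-quality question-answer data fully autonomously. With zero external supervision signals for mathematical reasoning, Crescent sheds light on the potential of true self-improvement.
Reference translation of the paper (metadata) (Wed, 19 Feb 2025 05:37:08 GMT)
The paper proposes "CRESCENT as a simple yet effective framework – leveraging techniques of bait prompting, diversification, and consensus enhancement – for exploring the self-improvement problem of LLMs." and reports higher performance than approaches such as CoT.
Since no new information is being added, is it fair to interpret this as the effect of spending compute at test time (TTC)?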

Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.9]
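We examine whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs from these o1-like models do not consistently improve accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
Reference translation of the paper (metadata) (Mon, 17 Feb 2025 07:21:11 GMT)
Longer reasoning does not always lead to better performance; the paper reports that "These results reveal that self-revision ability is a key factor in the effectiveness of sequential scaling for o1-like models." Based on the experimental results, it proposes "Shortest Majority Vote, which incorporate parallel scaling approaches with our insight on sequential scaling."
The first half brings to mind the earlier post on this blog, "The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks". Reproduction experiments for the proposed method would be interesting to see.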
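To make the idea of combining majority voting with CoT length concrete, here is a minimal sketch. The scoring rule used here (vote count divided by the logarithm of the group's average CoT length) and the function name `shortest_majority_vote` are assumptions for illustration; the paper should be consulted for the exact definition.

```python
import math
from collections import defaultdict


def shortest_majority_vote(samples):
    """samples: list of (answer, cot_length) pairs from parallel sampling.

    Groups samples by final answer and scores each group by
    count / log(1 + average length), so that among similarly popular
    answers the one reached with shorter reasoning wins.
    NOTE: this scoring rule is an assumption made for this sketch.
    """
    lengths_by_answer = defaultdict(list)
    for answer, length in samples:
        lengths_by_answer[answer].append(length)

    def score(answer):
        lengths = lengths_by_answer[answer]
        avg_len = sum(lengths) / len(lengths)
        return len(lengths) / math.log(1 + avg_len)

    return max(lengths_by_answer, key=score)


# "42" and "17" tie under plain majority voting (two votes each),
# but the "17" group used much shorter chains of thought.
tied = [("42", 900), ("42", 1100), ("17", 200), ("17", 250), ("8", 3000)]
print(shortest_majority_vote(tied))

# A clear majority still wins even when its reasoning is longer.
majority = [("42", 1000), ("42", 1000), ("42", 1000), ("17", 200), ("17", 200)]
print(shortest_majority_vote(majority))
```

The logarithm keeps length as a tie-breaking signal rather than the dominant factor, which matches the post's summary that longer CoTs do not consistently improve accuracy.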

Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study

Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study [13.4]
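GemmaX2-28 is a 9B model that achieves top-tier multilingual translation performance across 28 languages. GemmaX2-28 consistently outperforms state-of-the-art (SOTA) models such as TowerInstruct and XALMA.
Reference translation of the paper (metadata) (Fri, 07 Feb 2025 06:59:27 GMT)
A proposal of a machine translation model that uses a "Parallel-First Monolingual-Second (PFMS) data mixing strategy" and claims "To the best of our knowledge, GemmaX2-28-9B is the open model with the highest translation quality." It is very instructive to see how much translation performance varies with the data recipe.
The repository is GemmaX2 – a ModelSpace Collection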

HippoRAG2, RAG vs Graph RAG, A-MEM: Agentic Memory for LLM Agents

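Research on RAG and memory augmentation is thriving: improvements to HippoRAG (covered in the earlier post "xRAG、FlashRAG、HippoRAG" on this blog), comparisons of RAG and GraphRAG, agentic approaches, and more. Because these methods have different strengths, many efforts aim to hybridize them, and many approaches are also moving toward agentic operation.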

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models [6.4]
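Retrieval-augmented generation (RAG) has become the dominant way to introduce new information. Recent RAG approaches augment vector embeddings with various structures such as knowledge graphs to address some of the gaps, namely sense-making and associativity. We propose HippoRAG 2, a framework that comprehensively outperforms standard RAG on factual, sense-making, and associative memory tasks.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 18:26:02 GMT)
A hybrid approach combining RAG and GraphRAG.
The repository is GitHub – OSU-NLP-Group/HippoRAG: [NeurIPS’24] HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.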
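The repository description mentions Personalized PageRank over a knowledge graph. The following is a generic power-iteration sketch of Personalized PageRank on a toy graph, not the HippoRAG implementation; the graph, node names, and the function `personalized_pagerank` are assumptions for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """graph: dict mapping each node to a list of neighbor nodes
    (every neighbor must also be a key). seeds: set of nodes that
    receive the restart probability. Returns a dict node -> score."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walk restarts at a seed node.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if not neighbors:
                # Dangling node: return its mass to the seed distribution.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * restart[m]
                continue
            share = damping * rank[n] / len(neighbors)
            for m in neighbors:
                new_rank[m] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: a query entity linked to passages, plus an
# unrelated component (doc_d, doc_e) that the walk can never reach.
toy_graph = {
    "query_entity": ["doc_a", "doc_b"],
    "doc_a": ["query_entity", "doc_c"],
    "doc_b": ["query_entity"],
    "doc_c": ["doc_a"],
    "doc_d": ["doc_e"],
    "doc_e": ["doc_d"],
}
scores = personalized_pagerank(toy_graph, seeds={"query_entity"})
for node, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {value:.3f}")
```

Because the restart mass is concentrated on the seed nodes, scores fall off with graph distance from the query, which is what lets a graph-based retriever surface passages that are only indirectly connected to it.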

RAG vs. GraphRAG: A Systematic Evaluation and Key Insights [42.3]
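We systematically evaluate retrieval-augmented generation (RAG) and GraphRAG on text-based benchmarks. The results highlight the distinct strengths of RAG and GraphRAG across different tasks and evaluation perspectives.
Reference translation of the paper (metadata) (Mon, 17 Feb 2025 02:36:30 GMT)
A detailed comparison of standard RAG and GraphRAG.
The paper notes that "Community-based GraphRAG with Global Search focuses more on the global aspects of whole corpus, whereas RAG captures more detailed information."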

A-MEM: Agentic Memory for LLM Agents [42.5]
Large language model (LLM) agents need memory systems to leverage historical experience. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization. This paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way.
Reference translation of the paper (metadata) (Mon, 17 Feb 2025 18:36:14 GMT)
Agentic retention of data. The paper describes "(1) Link Generation – automatically establishing connections between memories by identifying shared attributes and similar contextual descriptions, and (2) Memory Evolution – enabling existing memories to dynamically evolve as new experiences are analyzed, leading to the emergence of higher-order patterns and attributes.", and the system is said to work as follows.
- Generates comprehensive notes with structured attributes
- Creates contextual descriptions and tags
- Analyzes historical memories for relevant connections
- Establishes meaningful links based on similarities
- Enables dynamic memory evolution and updates
The repository is GitHub – WujiangXu/AgenticMemory
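The note, link, and evolution steps listed above can be sketched with a toy in-memory store. This is not the A-MEM code: the real system relies on an LLM to write notes and judge connections, whereas this sketch substitutes simple tag overlap, and the names `MemoryNote` and `ToyAgenticMemory` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    """A structured note: content plus tags, context, and links."""
    note_id: int
    content: str
    tags: set
    context: str = ""
    links: set = field(default_factory=set)


class ToyAgenticMemory:
    """Toy store; tag overlap stands in for LLM-judged similarity."""

    def __init__(self, min_shared_tags=1):
        self.notes = {}
        self.min_shared_tags = min_shared_tags

    def add(self, content, tags, context=""):
        note = MemoryNote(len(self.notes), content, set(tags), context)
        for other in self.notes.values():
            # Link generation: connect notes that share enough tags.
            if len(note.tags & other.tags) >= self.min_shared_tags:
                note.links.add(other.note_id)
                other.links.add(note.note_id)
                # Memory evolution: the older note absorbs the new tags.
                other.tags |= note.tags
        self.notes[note.note_id] = note
        return note


memory = ToyAgenticMemory()
first = memory.add("User prefers Python for data work", {"python", "preferences"})
second = memory.add("User asked about pandas merge", {"python", "pandas"})
third = memory.add("User likes hiking", {"hobby"})
print(first.links, sorted(first.tags))
print(third.links)
```

Here the first note gains a link to the second and picks up the `pandas` tag, while the unrelated hiking note stays isolated; in the actual system both decisions would come from model judgments rather than set intersection.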

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [122.0]
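Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science. However, human knowledge spans more than 200 specialized disciplines, far exceeding the scope of existing benchmarks. We propose SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 17:05:58 GMT)
A proposal of a broad and difficult benchmark from ByteDance. DeepSeek R1 scores well, and Doubao 1.5pro (Doubao Team) also performs strongly. Overall, the current leaderboard is DeepSeek-R1 > DeepSeek-R1-Zero > o1-2024-12-17 > o3-mini-2025-01-31-high > o3-mini-2025-01-31-medium > Doubao-1.5-pro-32k-250115 > qwen-max-2025-01-25 > claude-3-5-sonnet-20241022 > o3-mini-2025-01-31-low > gemini-2.0-flash.
The repository is super gpqa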

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [49.4]
This paper examines the reasoning and planning capabilities of large language models (LLMs) on complex tasks. Recent advances in inference-time techniques show the potential to improve LLM reasoning without additional training. OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification.
Reference translation of the paper (metadata) (Tue, 18 Feb 2025 04:11:29 GMT)
A study of the currently popular inference-time computation. I found this statement interesting: "Language models rely on retrieval rather than true understanding. Despite advancements in reasoning abilities with LRMs such as O1 and O1-Mini, they still appear to be pattern matching rather than genuine reasoning."
The repository is GitHub – divelab/Sys2Bench: Sys2Bench is a benchmarking suite designed to evaluate reasoning and planning capabilities of large language models across algorithmic, logical, arithmetic, and common-sense reasoning tasks.
