staka – Introducing the latest arXiv papers

Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback

Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback [81.0]
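This paper proposes a structured approach to automated novelty assessment that models expert reviewer behavior through three stages. The method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, it achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
Reference translation of the paper (metadata) (Thu, 14 Aug 2025 16:18:37 GMT)
A framework for assessing the novelty of papers and similar work, organized into three stages: "document processing and content extraction, related work retrieval and ranking, and structured novelty assessment."
Repository: Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback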

Multimodal Referring Segmentation: A Survey

Multimodal Referring Segmentation: A Survey [93.2]
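Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. Over the past decade, it has attracted significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models.
Reference translation of the paper (metadata) (Fri, 01 Aug 2025 02:14:00 GMT)
A survey of Multimodal Referring Segmentation.
Papers and related resources are collected in the repository henghuiding/Awesome-Multimodal-Referring-Segmentation: Multimodal Referring Segmentation.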

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale [101.6]
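NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks. The method also shows strong performance in image editing, highlighting the power and versatility of the unified approach.
Reference translation of the paper (metadata) (Thu, 14 Aug 2025 14:54:22 GMT)
Autoregressive image generation from StepFun.
Repository: GitHub – stepfun-ai/NextStep-1. The weights are also released: NextStep-1 – a stepfun-ai Collection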

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [16.9]
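We introduce UI-AGILE, which enhances GUI agents in both training and inference. For training, we propose a series of improvements to the supervised fine-tuning (SFT) process. For inference, we propose decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays.
Reference translation of the paper (metadata) (Sat, 09 Aug 2025 17:51:27 GMT)
A framework that strengthens grounding ability, which strongly affects GUI agent performance. The paper states: "UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection."
Repository: GitHub – KDEGroup/UI-AGILE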

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding [97.4]
This paper introduces DocR1, an MLLM trained with a new RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy. We show that DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.
Reference translation of the paper (metadata) (Sun, 10 Aug 2025 12:03:45 GMT)
A framework for reading comprehension over documents with many pages.
The paper explains: "When engaging in multi-page reading comprehension, humans typically begin by identifying the pages likely to contain the answer, and then focus on locating the specific regions that correspond to the question and answer within those pages. Inspired by this “coarse-to-fine” reading strategy, EviGRPO mimics the human approach by first selecting a small set of potentially relevant pages at a coarse level, followed by fine-grained reasoning over the selected content." That said, I wonder whether such domain-specific (task-specific) approaches are still effective.

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.6]
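Existing evaluations of large language models (LLMs) on static benchmarks are vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
Reference translation of the paper (metadata) (Thu, 07 Aug 2025 14:46:30 GMT)
A benchmark described as follows: "LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run." As has been pointed out before, public benchmarks are heavily affected by leakage, and this paper makes the same observation.
Repository: llmeval/LLMEval-3: 中文大语言模型评测第三期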
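The idea of drawing an unseen test set on every evaluation run can be illustrated with a small sketch. This is not the LLMEval-3 implementation: the `QuestionBank` class, the per-model bookkeeping, and the uniform sampling are all assumptions made here for illustration.

```python
import random
from typing import Dict, List, Set


class QuestionBank:
    """Hypothetical private question bank that never serves a model the same item twice."""

    def __init__(self, question_ids: List[str], seed: int = 0) -> None:
        self._ids = list(question_ids)
        self._seen: Dict[str, Set[str]] = {}  # model name -> ids already served
        self._rng = random.Random(seed)

    def sample_unseen(self, model_name: str, k: int) -> List[str]:
        """Draw k questions this model has not been evaluated on yet (uniform sampling assumed)."""
        seen = self._seen.setdefault(model_name, set())
        pool = [q for q in self._ids if q not in seen]
        if k > len(pool):
            raise ValueError("not enough unseen questions left for this model")
        batch = self._rng.sample(pool, k)
        seen.update(batch)
        return batch
```

Because each run consumes part of the bank, the bank has to be large relative to the per-run test size. That fits with the 220k-question bank mentioned in the abstract.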

TiMoE: Time-Aware Mixture of Language Experts

TiMoE: Time-Aware Mixture of Language Experts [30.8]
Large language models (LLMs) are typically trained on fixed snapshots of the web. We address this issue by pre-training a set of GPT-style experts from scratch on disjoint two-year slices of a 2013-2024 corpus and combining them with TiMoE. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 10:36:36 GMT)
An approach to handling temporal information: "TiMoE demonstrates that partitioning pre-training data into strict time slices and blending the resulting GPT-2 experts through a causal, timestamp-aware router yields language models that stay chronologically grounded without a heavy accuracy penalty. By masking out any expert trained on data newer than the query year, TiMoE eliminates future-knowledge leakage while letting earlier specialists cooperate, cutting temporally inconsistent answers on the new 10k-question TSQA benchmark by roughly 15% and delivering steadier accuracy across years." An interesting framework built on time-specific experts. That said, I do wonder how it fares in terms of parameter efficiency.
The repository is said to be https://github.com/epfml/TiMoE.
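The masking step described above can be sketched in a few lines. This is a minimal illustration, not the TiMoE code: the function name `mix_eligible_experts`, the dictionary layout keyed by the last year of each training window, and the uniform mixture in probability space are all assumptions. The abstract only says that the remaining log-probabilities are merged in a shared space.

```python
import math
from typing import Dict, List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mix_eligible_experts(
    expert_logprobs: Dict[int, List[float]], query_year: int
) -> List[float]:
    """Mask experts trained past query_year, then average the rest.

    expert_logprobs maps the last year of an expert's training window
    (hypothetical layout) to its next-token log-probabilities.
    """
    eligible = [lp for end_year, lp in expert_logprobs.items() if end_year <= query_year]
    if not eligible:
        raise ValueError("no expert was trained on data up to the query year")
    log_n = math.log(len(eligible))
    vocab_size = len(eligible[0])
    # Uniform mixture in probability space (an assumption, not the paper's router).
    return [logsumexp([lp[t] for lp in eligible]) - log_n for t in range(vocab_size)]
```

With experts whose windows end in 2014, 2016, and 2018, a query dated 2016 would use only the first two, so nothing learned from later data can influence the answer.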

Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges

Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges [29.3]
The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier that is reshaping decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection of Web3 and AI agents across five key dimensions: landscape, economics, governance, security, and trust mechanisms.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 15:44:58 GMT)
A survey: "This paper presents the first comprehensive systematic analysis of Web3-AI agent integration, examining 133 active projects with $6.9 billion collective market capitalization to reveal how AI agents fundamentally reshape decentralized ecosystems across the landscape, finance, governance, security, and trust dimensions."

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation [117.5]
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) show strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. We identify shortcut learning as a key obstacle to generalization.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 16:14:01 GMT)
The paper states: "Our analysis reveals that large-scale robot datasets like OXE suffer from limited sub-dataset diversity and severe fragmentation, a problem that extends even within individual sub-datasets. This structure inherently promotes shortcut learning, meaning that simply adding more similarly-fragmented data can be detrimental to generalization." Building general-purpose models is hard.
Project site: Shortcut Learning in GRPs

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows [10.3]
Large language models (LLMs) are increasingly deployed in real-world applications that require complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 17:53:03 GMT)
A benchmark proposal that includes a benchmark-creation framework: "We introduce OdysseyBench, a comprehensive benchmark for evaluating agents on long-horizon workflows across multiple office applications, consisting of OdysseyBench+ and OdysseyBench-Neo." and "• We propose HOMERAGENTS, a multi-agent framework that automates the generation of long-horizon tasks, enabling scalable and diverse benchmark creation."
The repository is said to be https://github.com/microsoft/OdysseyBench, but it currently returns a 404.
