Introducing the latest arXiv papers

Magentic-UI: Towards Human-in-the-loop Agentic Systems

Magentic-UI: Towards Human-in-the-loop Agentic Systems [34.5]
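This paper introduces Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation. We evaluate Magentic-UI along four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments.
Reference translation of the paper (metadata) (Wed, 30 Jul 2025 03:49:14 GMT)
"Six interaction mechanisms designed to support low-cost, human-agent interaction in Magentic-UI: co-planning, co-tasking, action approval, answer verification, memory, and multi-tasking." A framework for developing agents that operate in collaboration with humans.
The repository is microsoft/magentic-ui: A research prototype of a human-centered web agent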

Your AI, Not Your View: The Bias of LLMs in Investment Analysis

Your AI, Not Your View: The Bias of LLMs in Investment Analysis [55.3]
In finance, LLMs (Large Language Models) frequently face knowledge conflicts caused by discrepancies between their pre-trained parametric knowledge and real-time market data. We present the first quantitative analysis of confirmation bias in LLM-based investment analysis. We observe consistent preferences for large-cap stocks and contrarian strategies across most models.
Reference translation of the paper (metadata) (Mon, 28 Jul 2025 16:09:38 GMT)
A quantitative analysis of LLM biases in investment.
The paper reports these biases: "The results show that LLMs are not neutral decision-makers, with distinct preferences for certain financial factors depending on the model. While sector preferences varied significantly across models, showing no overall trend, a common bias towards large-size stocks and a consistent preference for a contrarian investment view over momentum were observed." It also notes that "While the models correctly reversed their decisions when presented only with counter-evidence, their flexibility sharply decreased in situations where supporting and counter-evidence were mixed and conflicting.", so the models appear to be quite stubborn.
Great care is needed when having an LLM make a judgment.

Yume: An Interactive World Generation Model

Yume: An Interactive World Generation Model [38.8]
Yume uses images, text, or video to create interactive, realistic, and dynamic worlds. It generates a dynamic world from an input image, which can then be explored using keyboard actions.
Reference translation of the paper (metadata) (Wed, 23 Jul 2025 17:57:09 GMT)
"In this paper, we introduce a preview version of Yume, which is an interactive world generation model that allows the use of keyboard inputs to explore a dynamic world created by an input image. Moreover, it can do infinite video generation in an autoregressive manner." This is a proposal for a world generation model that creates video interactively, rather than a so-called internal world model.
The repository is stdstu12/YUME

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.6]
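This paper introduces Being-H0, a vision-language-action model trained on large-scale human videos. The approach centers on physical instruction tuning, a new training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 13:19:09 GMT)
The paper builds a VLA model from video data. Its pipeline is also interesting, for example in converting hand motions into discrete tokens.
The repository is Being-H0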

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities [33.7]
Large language models (LLMs) have shown their transformative potential in many interdisciplinary studies. This paper gives an overview of the application of LLMs in interdisciplinary research.
Reference translation of the paper (metadata) (Fri, 11 Jul 2025 09:11:18 GMT)
A survey whose scope is described as: "From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs."

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.1]
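Retrieval-Augmented Generation (RAG) improves the factuality of Large Language Models (LLMs) by injecting external knowledge. Conversely, purely reasoning-oriented approaches often hallucinate or rely on incorrect facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
Reference translation of the paper (metadata) (Wed, 16 Jul 2025 15:44:18 GMT)
A survey on RAG.
The paper list and related resources are at GitHub – DavidZWZ/Awesome-RAG-Reasoning: [Up-to-date] Awesome RAG Reasoning Resources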

PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process

PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process [15.4]
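This work proposes a new framework for human-aligned assessment of the painting process. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset consisting of real and synthetic paintings. We also propose PPJudge, a Transformer-based model augmented with temporally-aware positional encoding.
Reference translation of the paper (metadata) (Sat, 12 Jul 2025 10:30:44 GMT)
The paper proposes a dataset and a corresponding model: "we introduce a dataset specifically designed for painting process assessment: the Painting Process Assessment Dataset (PPAD). It consists of approximately 15,000 real paintings and 10,000 synthetic paintings, each annotated by domain experts."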

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers [22.8]
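This paper introduces MISS-QA, the first benchmark for evaluating the ability to interpret schematic diagrams in scientific literature. MISS-QA consists of 1,500 expert-annotated examples over 465 scientific papers. We evaluate the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 20:35:25 GMT)
"We present MISS-QA, the first benchmark specifically designed to assess the ability of foundation models to comprehend schematic diagrams in scientific literature." In other words, a proposed benchmark for understanding schematic diagrams and similar figures. o4-mini performs relatively well, but the gap to human performance is large.
The data is at yale-nlp/MISS-QA · Datasets at Hugging Face, and the repository is GitHub – yilunzhao/MISS-QA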

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [55.0]
Large language models have recently been used as judges for complex natural language processing tasks such as Q&A. We study the effectiveness of LLMs-as-a-judge on two code-related tasks: code generation and code summarization.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 13:40:26 GMT)
A validation of LLM-as-a-judge for evaluating code.
The paper states: "Our findings show that “small” LLMs struggle in judging tasks, with GPT-4-turbo being the model that achieves the best results. Still, even GPT-4-turbo frequently fails in assessing code correctness, while being a reliable judge of code summary quality." Results with newer models would be interesting to see.

The Impact of Language Mixing on Bilingual LLM Reasoning

The Impact of Language Mixing on Bilingual LLM Reasoning [4.5]
We study language switching in Chinese-English bilingual reasoning models. Forcing monolingual decoding lowers accuracy on math reasoning tasks by 5.6 points. A lightweight probe can be trained to predict whether a potential language switch would harm reasoning.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 17:56:09 GMT)
On the issue, often seen in LRMs, of multiple languages being mixed during the reasoning process, the paper reports: "Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning." It also states: "Altogether, these results suggest that language mixing is not a random artifact of multilingual training but a deliberate strategy that LLMs adopt to improve complex reasoning."
