September 2025 – Page 5 – Introducing the Latest arXiv Papers

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [221.3]
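Scientific large language models (Sci-LLMs) are changing how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 18:30:52 GMT)
A survey of LLMs in scientific research, an area where applications are advancing.
The repository is GitHub – open-sciencelab/Awesome-Scientific-Datasets-and-LLMs: A curated collection of papers, datasets, and resources on Scientific Datasets and Large Language Models (LLMs)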

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants [5.5]
By integrating philosophical and scientific theories of agency with AI-assisted evaluation methods, we develop the concept of human agency. We built HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency based on typical AI use cases.
Reference translation of the paper (metadata) (Wed, 10 Sep 2025 11:10:10 GMT)
A benchmark of how AI assistants handle human agency. Models can be evaluated across multiple categories (Experimental-Orange/HumanAgencyBench_Evaluation_Results · Datasets at Hugging Face). As the authors put it, "There is substantial variation across model developers—with Anthropic’s Claude models tending to most support human agency—and across dimensions. We encourage further research into human agency as more human tasks and decisions are delegated to AI systems, ensuring humans maintain appropriate levels of control." Behavior appears to differ from model to model.
The repository is GitHub – BenSturgeon/HumanAgencyBench: A code repository for the paper: “HUMANAGENCYBENCH: Scalable Evaluation of Human Agency Support in AI Assistants”

DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

DynaGuard: A Dynamic Guardrail Model With User-Defined Policies [40.6]
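We propose a dynamic guardian model that evaluates text against user-defined policies. The model can be used for fast detection of policy violations, or with chain-of-thought reasoning that articulates and justifies its outputs.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 17:57:56 GMT)
This work builds a guardian model in the sense of "Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors.", specifically one that adapts dynamically to policies supplied by the user. It is based on Qwen3 and shows strong performance.
The repository is GitHub – montehoover/DynaGuard: Code for “DynaGuard: A Dynamic Guardrail Model With User-Defined Policies.”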

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.3]
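Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews more than 260 papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, and others). It then examines more than 60 benchmarks and their corresponding metrics along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 11:01:51 GMT)
A survey of compositional visual reasoning.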

Social World Models

Social World Models [35.7]
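We introduce a new structured social world representation formalism (S3AP). S3AP represents social interactions in structured form, covering state, observations, agent actions, and mental states. S3AP helps LLMs better understand social narratives across five social reasoning tasks. We then induce social world models from these structured representations and demonstrate their ability to predict future social dynamics.
Reference translation of the paper (metadata) (Sat, 30 Aug 2025 16:52:58 GMT)
The authors state: "We define and build social world models through explicit representations of agent mental states, actions, and observations (S3AP). Our approach captures complex social dynamics systematically by automatically transforming free-form narratives into S3AP representations, reducing reporting bias and bridging the gap between raw text and actionable social world models."
It is interesting that the work seems to address meta-level tasks, such as how to use LLMs well and how to organize information into a form that LLMs handle well.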
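
To make the idea of a structured representation concrete, here is a minimal Python sketch of a narrative stored as per-step records of state, observations, actions, and mental states. This is not the actual S3AP schema from the paper: the class name `SocialStep`, its field layout, the `"<object> in <place>"` string convention, and the helper `believed_location` are all hypothetical, chosen only to show why separating what each agent observes from the true state helps with belief-tracking questions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SocialStep:
    """One time step of a narrative (hypothetical layout, not the paper's schema)."""
    state: str                                                    # true world state
    observations: dict[str, str] = field(default_factory=dict)    # agent -> what it perceives
    actions: dict[str, str] = field(default_factory=dict)         # agent -> what it does
    mental_states: dict[str, str] = field(default_factory=dict)   # agent -> belief or intent


def believed_location(steps: list[SocialStep], agent: str, obj: str) -> Optional[str]:
    """Return the last location of `obj` that `agent` actually observed."""
    location = None
    prefix = f"{obj} in "
    for step in steps:
        seen = step.observations.get(agent, "")
        if seen.startswith(prefix):
            location = seen[len(prefix):]
    return location


# A Sally-Anne style story: Sally leaves before the ball is moved.
story = [
    SocialStep(
        state="ball in basket",
        observations={"Sally": "ball in basket", "Anne": "ball in basket"},
        actions={"Sally": "leaves the room"},
    ),
    SocialStep(
        state="ball in box",
        observations={"Anne": "ball in box"},  # Sally is absent, so she observes nothing
        actions={"Anne": "moves the ball to the box"},
    ),
]

print(believed_location(story, "Sally", "ball"))
print(believed_location(story, "Anne", "ball"))
```

Because each agent's observations are kept separate from the shared state, Sally's outdated belief can be recovered without rereading the free-form text, which is the kind of benefit the quoted passage attributes to explicit representations.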

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis [52.6]
This paper introduces DeepScholar-bench, a live benchmark and holistic automated evaluation framework. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task. The authors also developed DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API.
Reference translation of the paper (metadata) (Wed, 27 Aug 2025 16:36:34 GMT)
A benchmark proposal described as follows: "DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research." It is maintained as a live benchmark.
The project site is GitHub – guestrin-lab/deepscholar-bench: benchmark and evaluate generative research synthesis

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent [49.6]
This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact. It introduces AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant. AppCopilot operates across applications and forms a complete closed-loop system from data to deployment.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 15:48:21 GMT)
A paper with enough material to serve as a textbook for this field. The conclusion, "In summary, mobile agents are entering a new era of ecosystem development in intelligent automation, cross-platform operation, and continual learning. Importantly, these abilities should not be viewed as a mere summary of existing achievements, but rather as a vision for future evolution.", is exactly right, and I think it explains why various research institutions are investing substantial resources.
The repository is GitHub – OpenBMB/AppCopilot: A General, Accurate, Long-Horizon, and Efficient Mobile Agent driven by Multimodal Foundation Models

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres [15.2]
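Integrating large language models into Security Operations Centres (SOCs) offers a transformative, though still evolving, opportunity to reduce analyst workload. This paper presents a longitudinal study of 3,090 queries from 45 SOC analysts over 10 months. The analysis finds that analysts use LLMs as on-demand support for sensemaking and context building rather than for high-stakes determinations.
Reference translation of the paper (metadata) (Tue, 26 Aug 2025 11:40:02 GMT)
A report on how SOC analysts use LLMs.
The finding that "By analysing thousands of analyst-generated queries, we found that analysts use LLMs as on-demand, task-focused cognitive aids for a variety of tasks, including explaining commands, writing scripts, or improving documentation, rather than as full-time copilots." matches my impression of the current state of things.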

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4]
We introduce the Loong Project, an open-source framework for scalable synthetic data generation and verification. LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains. LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies for producing new question-answer-code triples.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 06:42:40 GMT)
A proposal for a synthetic data framework. The authors write: "Our contributions are fourfold: (1) LOONGBENCH, a seed dataset of 8,729 examples across 12 reasoning-intensive domains with executable code and verified answers; (2) LOONGENV, a flexible environment enabling diverse synthetic data generation strategies; (3) comprehensive benchmarking of open-source and proprietary models to assess domain generalization; and (4) in-depth analysis of generated data quality in terms of correctness, diversity, and complexity. Together, these components form a cohesive framework for studying alignment at scale." Using synthetic data has become a basic approach to building high-performance models, so frameworks of this kind are welcome.
The repository is GitHub – camel-ai/loong: 🐉 Loong: Synthesize Long CoTs at Scale through Verifiers.
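
As a rough illustration of what a question-answer-code triple with executable code and a verified answer might look like, the Python sketch below runs the code part of a triple and accepts a candidate answer only when it agrees with both the executed output and the stored reference. This is not Loong's actual API: `Triple`, `run_code`, and `verify` are hypothetical names, exact string matching is a simplifying assumption, and a plain subprocess is not a security sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical question-answer-code record (not Loong's real data format)."""
    question: str
    answer: str  # reference answer as text
    code: str    # program expected to print the answer


def run_code(code: str, timeout: float = 5.0) -> str:
    """Run `code` in a fresh interpreter and return its stripped stdout.

    Note: a subprocess is NOT a sandbox; untrusted code needs real isolation.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def verify(triple: Triple, candidate_answer: str) -> bool:
    """Accept a candidate only if code output, reference, and candidate all agree."""
    executed = run_code(triple.code)
    reference = triple.answer.strip()
    return executed == reference and candidate_answer.strip() == reference


seed = Triple(
    question="What is 12 * 7?",
    answer="84",
    code="print(12 * 7)",
)

print(verify(seed, "84"))
print(verify(seed, "82"))
```

Checking against executed code rather than against free-text reasoning is what lets a pipeline like this filter synthetic examples automatically; a practical system would also need tolerance for numerically or symbolically equivalent answers.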

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5]
Strefer is a synthetic data generation framework designed to equip Video LLMs with referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. Our approach improves the ability of Video LLMs to interpret spatial and temporal references, fostering the more versatile, space-time-aware reasoning that real-world AI companions require.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 17:33:20 GMT)
A proposal for a synthetic data framework for video with an elaborate design: "Our approach begins with a modular framework that orchestrates multiple agents—including pretrained Large Language Models (LLMs), Video LLMs, and Pixel-Level Multimodal Vision Foundation Models (e.g., RexSeek [20], GroundingDINO [32] and SAM2 [44])—to pseudo-annotate video metadata with temporally dense and object-centric space-time information. This metadata captures detailed spatial and temporal structures, such as subjects, objects, their locations as masklets (segmentation masks tracked over time), and action timelines. Building on this structured metadata, we leverage in-context learning and well-defined task schemas to guide LLMs in generating high-utility instruction data for tuning Video LLMs."
The project site is Strefer: Data Engine for Video LLMs
