arXiv – Page 35 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

GPT-5, GPT-OSS, Claude Opus 4.1

Last week brought a string of major announcements: GPT-5 (GPT-5 が切り拓く働き方の新時代 | OpenAI), gpt-oss 20B and 120B (gpt-oss が登場 | OpenAI), Claude Opus 4.1 (Claude Opus 4.1 \ Anthropic), and DeepMind Genie 3 (Genie 3: A new frontier for world models – Google DeepMind).

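GPT-5 solidly takes SoTA on benchmarks and is a very capable model. Still, it feels somewhat short of expectations, for two reasons. First, the gap to Claude Opus 4.1, announced shortly before, was not large (the system card's statement "All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure." (gpt5-system-card-aug7.pdf) is also a concern). Second, it trails Gemini 2.5 Pro on the Japanese version of Chatbot Arena, and Gemini 2.5 Pro also has the higher win rate in head-to-head matchups. In addition, even GPT-5 could not read invented kanji (Pixels, Patterns, but No Poetry: To See The World like Humans – arXiv最新論文の紹介). On the other hand, its pricing is strategic, and it posts a truly frontier score on Measuring AI Ability to Complete Long Tasks – METR, so judging how it really performs will likely take a little more time.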
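The concern about the n=477 subset can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the full SWE-bench Verified set has 500 tasks, and the reported rate passed in is a hypothetical placeholder, not a figure taken from the system card. The function name `full_set_bounds` is likewise made up for this example. It bounds the full-set score by assuming the skipped tasks were either all failed or all solved.

```python
def full_set_bounds(reported_rate, evaluated, total):
    """Bound the full-benchmark solve rate from a score reported on a subset.

    reported_rate: solve rate on the evaluated subset (0.0 to 1.0)
    evaluated:     number of tasks actually run (e.g. 477)
    total:         number of tasks in the full benchmark (assumed 500 here)
    """
    if not 0 < evaluated <= total:
        raise ValueError("evaluated must be between 1 and total")
    solved = reported_rate * evaluated      # tasks solved within the subset
    skipped = total - evaluated             # tasks that were never run
    lower = solved / total                  # every skipped task counted as failed
    upper = (solved + skipped) / total      # every skipped task counted as solved
    return lower, upper


if __name__ == "__main__":
    # 0.75 is a hypothetical reported rate used only for illustration.
    low, high = full_set_bounds(0.75, evaluated=477, total=500)
    print(f"full-set rate lies between {low:.4f} and {high:.4f}")
```

With a subset this close to the full set the interval is a few points wide, which is about the size of the gap between frontier models that the paragraph above describes as small.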

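GPT-OSS is a high-performing openly released model under the Apache 2.0 license. The release of a model that appears to be at a practically usable level is significant. From GPT-2 to gpt-oss: Analyzing the Architectural Advances shows how many improvements have accumulated under the single label "transformer".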

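With Claude Opus 4.1 and Gemini 2.5 Pro, companies other than OpenAI are also shipping very high-performing models, and Chinese models such as DeepSeek, Kimi, and Hunyuan keep getting stronger. The era of OpenAI as the sole leader is over, but my impression is that progress continues.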

R-Zero: Self-Evolving Reasoning LLM from Zero Data

R-Zero: Self-Evolving Reasoning LLM from Zero Data [56.7]
Self-evolving large language models (LLMs) offer a scalable path toward superintelligence by autonomously generating, refining, and learning from their own experiences. Existing methods for training such models still rely heavily on vast amounts of human-curated tasks and labels. R-Zero is a fully autonomous framework that generates its own training data from scratch.
Reference translation of the paper (metadata) (Thu, 07 Aug 2025 03:38:16 GMT)
A GAN-like framework: "we propose R-Zero, a framework for training reasoning LLMs that can self-evolve from zero external data. In R-Zero, a single base model is initialized with two roles – a Challenger and a Solver that are independently optimized but co-evolve throughout the RL process." and "Challenger is rewarded for proposing tasks near the edge of the Solver’s capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger."
The repository is Chengsong-Huang/R-Zero: codes for R-Zero: Self-Evolving Reasoning LLM from Zero Data (https://www.arxiv.org/pdf/2508.05004)
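The idea of rewarding the Challenger for tasks "near the edge of the Solver's capability" can be illustrated with a toy reward. The sketch below is an assumption made for illustration, not the implementation from the R-Zero repository. With no external labels available, it uses agreement with the majority answer among sampled Solver outputs as a stand-in for the solve rate, and it gives the highest reward when that rate is near 50%. The function names and the `1 - 2|p - 0.5|` shape are hypothetical choices.

```python
from collections import Counter


def solve_rate(answers):
    """Fraction of sampled Solver answers that agree with the majority answer.

    The majority answer acts as a pseudo-label, since no ground truth exists.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def challenger_reward(answers):
    """Toy Challenger reward: 1.0 at 50% agreement, falling to 0.0 when all samples agree."""
    p = solve_rate(answers)
    return 1.0 - 2.0 * abs(p - 0.5)


if __name__ == "__main__":
    # Four hypothetical Solver samples for one Challenger-proposed task.
    print(challenger_reward(["12", "12", "7", "9"]))
    print(challenger_reward(["12", "12", "12", "12"]))
```

A task the Solver always answers the same way carries little training signal under this reward, while a task that splits the samples evenly is treated as maximally informative.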

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [111.1]
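RMTBench is a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. The benchmark builds dialogues from explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting the focus from character background to fulfilling user intent, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
Reference translation of the paper (metadata) (Sun, 27 Jul 2025 16:49:47 GMT)
A proposal for a benchmark characterized by "our User-Centric Dialogues are built around virtual users with clear intentions, enhancing continuity across multi-turn interactions and better reflecting real-world applications."
QWEN2.5-MAX achieves high scores in both English and Chinese.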

Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques

Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques [11.2]
Large language models (LLMs) are transforming cybersecurity by enabling intelligent, adaptive, and automated approaches to threat detection, vulnerability assessment, and incident response. With advanced language understanding and contextual reasoning, LLMs surpass traditional methods in addressing challenges across domains such as IoT, blockchain, and hardware security.
Reference translation of the paper (metadata) (Fri, 18 Jul 2025 03:41:18 GMT)
A survey on LLMs and security: "This survey provides a comprehensive overview of LLM applications in cybersecurity, focusing on two core areas: (1) the integration of LLMs into key cybersecurity domains, and (2) the vulnerabilities of LLMs themselves, along with mitigation strategies"

UserBench: An Interactive Gym Environment for User-Centric Agents

UserBench: An Interactive Gym Environment for User-Centric Agents [110.8]
LLM (large language model)-based agents have made impressive progress in reasoning and tool use, but their ability to proactively collaborate with users remains underdeveloped. We introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions.
Reference translation of the paper (metadata) (Tue, 29 Jul 2025 17:34:12 GMT)
A proposal for a benchmark with the following setting: "Revolving around these traits, we introduce UserBench, a user-centric environment designed to facilitate an agent’s ability to engage in meaningful, multi-turn interactions with users who exhibit these traits. In UserBench, simulated users provide initial vague task instruction (underspecification), gradually reveal preferences over time (incrementality), and often do so implicitly (indirectness). Agents must proactively clarify goals, interpret subtle cues, and adaptively reason through tool use to succeed." It targets travel scenarios and requires the ability to start from vague instructions and work toward a solution through dialogue.
The repository is SalesforceAIResearch/UserBench
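The three user traits quoted above (underspecification, incrementality, indirectness) can be mimicked with a very small simulated user. The sketch below is a hypothetical toy, not the UserBench environment or its API: the class `SimulatedUser`, the function `run_dialogue`, and the sample travel hints are all invented for illustration. The user opens with a vague request and reveals one implicitly worded preference per clarifying question.

```python
class SimulatedUser:
    """Toy user that starts vague and reveals hidden preferences one at a time."""

    def __init__(self, vague_request, hidden_preferences):
        self.vague_request = vague_request       # underspecified opening message
        self._hidden = list(hidden_preferences)  # implicit hints not yet revealed
        self._total = len(self._hidden)
        self.revealed = []

    def reply(self):
        """Answer one clarifying question with the next hint, or None if none remain."""
        if not self._hidden:
            return None
        hint = self._hidden.pop(0)
        self.revealed.append(hint)
        return hint

    def coverage(self):
        """Share of hidden preferences the agent has elicited so far."""
        return len(self.revealed) / self._total if self._total else 1.0


def run_dialogue(user, max_turns):
    """Hypothetical agent loop: ask up to max_turns clarifying questions."""
    hints = []
    for _ in range(max_turns):
        hint = user.reply()
        if hint is None:
            break
        hints.append(hint)
    return hints


if __name__ == "__main__":
    user = SimulatedUser(
        "I need to plan a trip.",
        ["I get restless on long layovers.", "I'd rather not spend much."],
    )
    print(user.vague_request)
    print(run_dialogue(user, max_turns=1), user.coverage())
```

An agent that stops asking after one turn elicits only part of what the user wants, which is the kind of failure a preference-driven benchmark is meant to surface.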

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [24.7]
We propose TextbookReasoning, an open dataset featuring truthful reference answers extracted from 1k university-level textbooks. We also introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Experiments show that our datasets achieve superior performance and training efficiency with more concise response lengths.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 17:59:03 GMT)
"We present TEXTBOOKREASONING and MEGASCIENCE, two datasets that advance the frontier in the scientific domain by enabling base models to outperform official instruct models on scientific tasks when fine-tuned with our data."
The repository is GAIR-NLP/MegaScience: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning, along with MegaScience (MegaScience)

OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction

OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction [62.4]
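OmniTraj is a Transformer-based model pre-trained on a large-scale heterogeneous dataset. Experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance.
Reference translation of the paper (metadata) (Thu, 31 Jul 2025 15:37:09 GMT)
"We tackled the critical challenge of zero-shot transfer in human trajectory prediction. Our systematic investigation revealed that a simple, explicit frame-rate conditioning mechanism is a more effective solution than current data-unaware or continuous-time models." In other words, a proposal for an approach that is effective for zero-shot prediction. The flexibility of the Transformer in handling the frame rate explicitly is a little surprising.
The repository is vita-epfl/omnitraj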

The Ever-Evolving Science Exam

The Ever-Evolving Science Exam [32.2]
The benchmark consists of 1) specialized science instances (question-answer pairs) spanning 5 disciplines and over 500 subfields, and 2) a periodically updated 500-instance subset, EESE, which is sampled and validated to enable leak-resistant, low-overhead evaluation.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 12:22:16 GMT)
A proposal for a large-scale benchmark that is robust to leakage and similar problems: "1) We build a large-scale, high-quality, non-public instances repository, named EESE-Pool, which contains over 100,000 science instances. This pool is constructed under strict principles of Range, Reach, and Rigor. 2) We periodically sample a dynamic subset of 500 instances, called EESE, for actual evaluation. This subset is carefully curated to maintain Range, Reach, and Rigor, while mitigating leakage risk and reducing evaluation inefficiency through regular updates."
The repository is aiben-ch/EESE: The Ever-Evolving Science Exam
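The pool-plus-rotating-subset design described above can be sketched as a seeded, discipline-balanced draw from a private pool. This is a minimal illustration under stated assumptions, not the procedure used by the authors: the function name `sample_subset`, the `(instance_id, discipline)` pool layout, the equal per-discipline quota, and the use of a per-refresh seed are all hypothetical.

```python
import random
from collections import defaultdict


def sample_subset(pool, subset_size, seed):
    """Draw a discipline-balanced evaluation subset from a private pool.

    pool:        iterable of (instance_id, discipline) pairs
    subset_size: number of instances to draw (e.g. 500)
    seed:        changes with each refresh so every release differs
    """
    by_discipline = defaultdict(list)
    for instance_id, discipline in pool:
        by_discipline[discipline].append(instance_id)
    if not by_discipline:
        raise ValueError("pool is empty")

    rng = random.Random(seed)
    disciplines = sorted(by_discipline)  # fixed order keeps the draw reproducible
    quota, remainder = divmod(subset_size, len(disciplines))

    subset = []
    for i, discipline in enumerate(disciplines):
        k = quota + (1 if i < remainder else 0)  # spread any remainder evenly
        candidates = by_discipline[discipline]
        if k > len(candidates):
            raise ValueError(f"not enough instances in {discipline}")
        subset.extend(rng.sample(candidates, k))  # sampling without replacement
    return subset


if __name__ == "__main__":
    # Hypothetical pool: 5 disciplines with 1,000 instances each.
    pool = [(f"{d}-{i}", d) for d in "ABCDE" for i in range(1000)]
    release = sample_subset(pool, subset_size=500, seed=1)
    print(len(release), release[:3])
```

Because only the sampled subset is ever exposed and the seed changes per release, most of the pool stays unseen at any given time, which is the intuition behind the leakage resistance claimed in the quote.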

Diffusion Models for Time Series Forecasting: A Survey

Diffusion Models for Time Series Forecasting: A Survey [14.3]
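Diffusion models, initially developed for image synthesis, have shown remarkable generative capabilities. Recently, their application has expanded to time series forecasting (TSF) with promising results. This survey details recent progress and future prospects of diffusion models in TSF and serves as a reference for researchers in the field.
Reference translation of the paper (metadata) (Sat, 19 Jul 2025 07:04:04 GMT)
A survey on applying diffusion models to time series forecasting.
The repository is https://github.com/synlp/TSF-Diff-Review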

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text [30.7]
We introduce Text2Vis, a benchmark for evaluating text-to-visualization models. It consists of 1,985 samples, each with a data table, a natural language query, a short answer, visualization code, and an annotated chart. It reveals significant performance gaps, highlights key challenges, and offers insights for future progress.
Reference translation of the paper (metadata) (Sat, 26 Jul 2025 14:59:04 GMT)
A benchmark proposal: "We introduce Text2Vis, a benchmark for evaluating LLMs in text-to-visualization tasks, featuring diverse datasets and over 20 chart types to support complex queries involving multi-step reasoning, retrieval, multi-chart generation, and conversations." Performance reportedly improves with an agentic processing framework.
The repository is vis-nlp/Text2Vis
