Introducing the Latest arXiv Papers (arXiv最新論文の紹介)

SITE: towards Spatial Intelligence Thorough Evaluation

SITE: towards Spatial Intelligence Thorough Evaluation [121.1]
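Spatial intelligence (SI) is a cognitive ability that encompasses visualizing, manipulating, and reasoning about spatial relationships. We introduce SITE, a benchmark dataset aimed at SI Thorough Evaluation. The benchmark is constructed by combining a bottom-up survey of 31 existing datasets with a top-down strategy based on three classification systems from cognitive science.
Reference translation of the paper (metadata) (Thu, 08 May 2025 17:45:44 GMT)
A benchmark for spatial intelligence. Even GPT-4o shows a large gap from human performance. (And InternVL-2.5-8B scores surprisingly well.)
The project site is SITE: towards Spatial Intelligence Thorough Evaluation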

Federated Learning for Cyber Physical Systems: A Comprehensive Survey

Federated Learning for Cyber Physical Systems: A Comprehensive Survey [49.5]
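Federated learning (FL) has become widespread in recent years. This article examines how FL is used in critical CPS applications such as intelligent transportation systems, cybersecurity services, smart cities, and smart healthcare solutions.
Reference translation of the paper (metadata) (Thu, 08 May 2025 01:17:15 GMT)
A survey on federated learning and cyber-physical systems.
The two certainly seem like a good match.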

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [69.1]
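We introduce J1, a reinforcement learning approach for training such models (LLM-as-a-Judge models that think before judging). The method converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. The model is found to make better judgments by outlining evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
Reference translation of the paper (metadata) (Thu, 15 May 2025 14:05:15 GMT)
A proposed reinforcement learning recipe for building Thinking-LLM-as-a-Judge models.
The authors state: "our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model."
Earlier posts, such as Assessing Judging Bias in Large Reasoning Models: An Empirical Study – arXiv最新論文の紹介, noted that applying LRMs to LLM-as-a-judge tasks is effective, so this result seems consistent with them.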
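
To make the idea of a verifiable reward for a judgment task more concrete, here is a minimal sketch. It assumes a pairwise setting in which the preferred response is known in advance, and it rewards the judge only when the verdict is correct in both presentation orders, as one simple way to discourage position bias. The function names (`verdict_reward`, `consistent_reward`) and the stand-in judges are hypothetical and are not taken from the J1 paper.

```python
def verdict_reward(verdict, preferred):
    """Return 1.0 if the judge's verdict ('A' or 'B') matches the known label."""
    return 1.0 if verdict == preferred else 0.0


def consistent_reward(judge, prompt, good, bad):
    """Reward the judge only if it picks the good response in both orders."""
    first = judge(prompt, good, bad)   # good response is shown as 'A'
    second = judge(prompt, bad, good)  # good response is shown as 'B'
    return verdict_reward(first, "A") * verdict_reward(second, "B")


# Hypothetical stand-in judges (a real judge would be an LLM call).
def always_a(prompt, a, b):
    # Position-biased judge: always prefers whatever is shown first.
    return "A"


def prefers_longer(prompt, a, b):
    # Position-independent judge: decides from the content only.
    return "A" if len(a) >= len(b) else "B"


biased = consistent_reward(always_a, "2+2?", "4", "five")
fair = consistent_reward(prefers_longer, "2+2?", "The answer is 4.", "5")
```

In this toy setup the position-biased judge receives no reward because its verdict flips in correctness when the order is swapped, while the position-independent judge is rewarded.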

Seed1.5-VL, Qwen3, MiMo, MiniMax-Speech, Aya Vision, BLIP3-o

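Last week saw many papers from Chinese research organizations: Seed1.5-VL from Bytedance, Qwen3 from Alibaba, MiMo from Xiaomi, and MiniMax-Speech from MiniMax. Cohere's Aya Vision and Salesforce's BLIP3-o papers were also released, so the LLM and MLLM landscape is no longer dominated by OpenAI alone. Judging from the author lists, these models appear to be built by large teams that include multiple leading researchers.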

Seed1.5-VL Technical Report [237.8]
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. It delivers strong performance across a wide range of public VLM benchmarks and internal evaluation suites. In agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7.
Reference translation of the paper (metadata) (Sun, 11 May 2025 17:28:30 GMT)
An MLLM whose authors claim: "Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7."

Qwen3 Technical Report [138.0]
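We introduce Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to improve performance, efficiency, and multilingual capabilities.
Reference translation of the paper (metadata) (Wed, 14 May 2025 13:41:34 GMT)
The paper posted to arXiv for Qwen3 (covered earlier in Qwen3, Phi-4 reasoning, MiMo 7B, OLMo2 1B, Mellum 4B – arXiv最新論文の紹介).
The repository is GitHub – QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.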

MiMo: Unlocking the Reasoning Potential of Language Model — From Pretraining to Posttraining [66.1]
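The proposed MiMo-7B is a large language model for reasoning tasks, optimized across both the pre-training and post-training stages. MiMo-7B-Base is pre-trained on 25 trillion tokens, targeting improved performance and faster inference. The final RL-tuned model, MiMo-7B-RL, achieves strong performance on mathematics, code, and general reasoning tasks, surpassing OpenAI o1-mini.
Reference translation of the paper (metadata) (Mon, 12 May 2025 14:30:11 GMT)
The repository is GitHub – XiaomiMiMo/MiMo: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining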

Aya Vision: Advancing the Frontier of Multilingual Multimodality [16.0]
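We develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data. We also propose a cross-modal model merging technique that mitigates catastrophic forgetting. Our work advances multilingual progress on the multimodal frontier and provides insights into techniques that effectively bend the need for compute.
Reference translation of the paper (metadata) (Tue, 13 May 2025 17:03:48 GMT)
The repository is Cohere Labs Aya Vision – a CohereLabs Collection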

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.2]
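This paper proposes an approach that uses a diffusion transformer to generate semantically rich CLIP image features. A sequential pretraining strategy for unified models, training first on image understanding and then on image generation, offers practical advantages. Building on the innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
Reference translation of the paper (metadata) (Wed, 14 May 2025 17:11:07 GMT)
The repositories are GitHub – JiuhaiChen/BLIP3o and BLIP3o/BLIP3o-Model · Hugging Face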

WorldPM: Scaling Human Preference Modeling

WorldPM: Scaling Human Preference Modeling [130.2]
We propose World Preference Modeling (WorldPM) to highlight this scaling potential. Preference data is collected from public forums covering diverse user communities. Extensive training is conducted with 15M-scale data across models ranging from 1.5B to 72B parameters.
Reference translation of the paper (metadata) (Thu, 15 May 2025 17:38:37 GMT)
The authors state: "Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling." They further claim: "Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks." The potential of this kind of foundation model is intriguing (though also slightly scary).
- The filtering result in the Appendix is also interesting: "we argue that applying RM filtering diverges from capturing world preference. Instead of assuming forum data contains noise, we should interpret apparent contradictions as manifestations of genuine human preferences, allowing models to discover underlying commonalities within these surface-level conflicts."
The repository is GitHub – QwenLM/WorldPM
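
As a rough illustration of what "test loss scales as a power law" means, the sketch below fits loss = a * size^(-b) by ordinary least squares in log-log space. The data points are synthetic, generated from made-up constants, and the helper name `fit_power_law` is hypothetical; none of this comes from the WorldPM paper or repository.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b), done in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b for a power law
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope


# Synthetic example: losses generated from assumed constants a = 5.0, b = 0.1.
sizes = [1.5e9, 7e9, 72e9]  # illustrative parameter counts, not paper data
losses = [5.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, losses)
```

On a log-log plot a power law is a straight line, so a linear fit recovers the exponent; with real, noisy measurements the fit would only be approximate.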

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery [26.0]
Large language models (LLMs) are reshaping many aspects of materials science and chemistry research. Recent advances show that the latest class of models can integrate structured and unstructured data. We review applications of LLMs through 34 projects developed during the second Large Language Model Hackathon for Applications in Materials Science and Chemistry.
Reference translation of the paper (metadata) (Mon, 05 May 2025 22:08:37 GMT)
A summary of a hackathon: "To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature."
It includes some intriguing attempts and makes for interesting reading.

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.3]
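We introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependency. VCBENCH contains 1,720 problems across six cognitive domains. We evaluate 26 state-of-the-art LVLMs on VCBENCH and find substantial performance disparities, with even the top models unable to exceed 50% accuracy.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 03:45:30 GMT)
A proposed mathematical reasoning benchmark designed to depend on vision.
The repository is Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency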

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.1]
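The proposed HyperTree Planning (HTP) is a new reasoning paradigm that constructs hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro and a 3.6x performance improvement over o1-preview.
Reference translation of the paper (metadata) (Mon, 05 May 2025 02:38:58 GMT)
A proposal for action planning with a HyperTree, characterized as follows: "Compared to previous tree planning methods such as ToT (Yao et al., 2024) and RAP (Hao et al., 2023), HTP introduces structural innovations that enable each edge to connect multiple child nodes, making it suitable for a divide-and-conquer strategy."
It appears to be highly effective. It is surely a more powerful structure than an ordinary tree, but the interesting point is that LLMs also find it easy to handle. Perhaps it resembles natural language (which allows many ways of writing things)...?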
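
The structural idea in the quote, an edge that connects one parent to several children at once, can be sketched as a small data structure. The following is only a generic hypertree illustration under assumed semantics (each hyperedge is one way to decompose a task, and all of its children must be solved); it is not the HTP algorithm, and the names `Node`, `add_decomposition`, and `leaf_plans` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    task: str
    # Each hyperedge is a list of child nodes that jointly decompose this task.
    hyperedges: list = field(default_factory=list)

    def add_decomposition(self, *subtasks):
        """Attach one hyperedge linking this node to several children."""
        children = [Node(t) for t in subtasks]
        self.hyperedges.append(children)
        return children


def leaf_plans(node):
    """Enumerate plans: choose one hyperedge per node, solve all its children."""
    if not node.hyperedges:
        return [[node.task]]
    plans = []
    for children in node.hyperedges:
        partial = [[]]
        for child in children:
            # Combine every partial plan with every plan for this child.
            partial = [p + q for p in partial for q in leaf_plans(child)]
        plans.extend(partial)
    return plans


root = Node("plan trip")
transport, lodging = root.add_decomposition("transport", "lodging")
transport.add_decomposition("book flight")               # alternative 1
transport.add_decomposition("book train", "book bus")    # alternative 2
plans = leaf_plans(root)
```

Here `plans` holds two candidate plans, one per way of solving the transport subtask, each combined with the lodging subtask. In an ordinary tree every edge has a single child, so this kind of joint decomposition cannot be expressed on one edge.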

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.7]
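This survey examines the core challenges that the rise of large language models poses for evaluation. (i) From task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multimodal understanding, and safety. It examines this issue and the core challenges of the above two transitions from the perspectives of methods, datasets, evaluators, and metrics.
Reference translation of the paper (metadata) (Sat, 26 Apr 2025 07:48:52 GMT)
A survey on benchmarks. "Fig6 Illustration of capability-based benchmark taxonomy involving: knowledge, reasoning, instruction following, multimodal, and safety." is visually very easy to follow.
The repository is GitHub – ALEX-nlp/Benchmark-of-core-capabilities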

$\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

$\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge [6.1]
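We introduce $\textit{New News}$, a dataset of hypothetical yet plausible news spanning multiple domains. We explore a suite of self-play data generation protocols designed to distill the knowledge from the model with context into the weights of the model without context. The results show that the self-QA protocol of Sys2-FT significantly improves the model's in-weight learning of the news.
Reference translation of the paper (metadata) (Sat, 03 May 2025 12:49:35 GMT)
An analysis of the gap between ICL and FT, and a proposed method called Sys2-FT. The authors state: "Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news."
The difference between ICL and FT is very interesting and also important in practice.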
