June 2025 – Page 2 – Introducing the Latest arXiv Papers

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 [167.9]
This paper reports the results of the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. The competition involved 86 teams testing MLLM vulnerabilities through adversarial image-text attacks in two phases: white-box and black-box evaluation. The challenge establishes a new benchmark for MLLM safety evaluation and lays the groundwork for building safer AI systems.
Reference translation of the paper (metadata) (Sat, 14 Jun 2025 10:03:17 GMT)
A report on the results of a competition on attacking MLLMs. The techniques used in a competition with this many participating teams are very instructive. It is interesting how well images are put to use, for example jailbreaking through flowcharts, as the first-place team describes: "In this competition, we proposed an effective multimodal jailbreak strategy by embedding malicious intent within visually structured diagrams, particularly flowcharts, and enhancing it with carefully designed textual prompts. Our approach leveraged the weaknesses in safety alignment of vision-language models, exploiting their tendency to follow structured visual and textual cues."
The repository is GitHub – NY1024/ATLAS_Challenge_2025

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark [70.5]
MMTU is a large-scale benchmark with over 30K questions across 25 real-world table tasks. MMTU is designed to comprehensively evaluate models' ability to understand, reason about, and manipulate real tables at the expert level. MMTU requires a combination of skills, including table understanding, reasoning, and coding, that remain challenging for today's frontier models.
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 21:05:03 GMT)
A benchmark for working with tables: "We show that MMTU require a combination of skills – including table understanding, reasoning, and coding – that remain challenging for today’s frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement."
The repository is GitHub – MMTU-Benchmark/MMTU, and the data is at MMTU-benchmark/MMTU · Datasets at Hugging Face

Model Merging for Knowledge Editing

Model Merging for Knowledge Editing [53.8]
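Large language models (LLMs) require continual updates to maintain accurate and current knowledge as the world evolves. Existing knowledge editing approaches offer various solutions for knowledge updates, but they often struggle in sequential editing scenarios. This paper proposes a two-stage framework that combines robust supervised fine-tuning (R-SFT) with model merging.
Reference translation of the paper (metadata) (Sat, 14 Jun 2025 07:42:39 GMT)
Knowledge editing via SFT and model merging.
The repository is GitHub – Applied-Machine-Learning-Lab/MM4KE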
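
To make the "model merging" half of that pipeline concrete, here is a minimal sketch of plain linear weight interpolation between a base model and a fine-tuned model. This is an assumption for illustration only, not the merging method of the paper or of the MM4KE repository: the function name `merge_linear`, the `alpha` coefficient, and the dict-of-lists parameter format are all hypothetical.

```python
from typing import Dict, List

# Hypothetical parameter format: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_linear(base: Params, tuned: Params, alpha: float = 0.5) -> Params:
    """Interpolate every parameter as (1 - alpha) * base + alpha * tuned."""
    if base.keys() != tuned.keys():
        raise ValueError("models must share the same parameter names")
    merged: Params = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t for b, t in zip(base_vals, tuned_vals)
        ]
    return merged


# Toy example: two tiny "models" with identical parameter names.
base = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
tuned = {"layer.weight": [1.0, 3.0], "layer.bias": [1.5]}
merged = merge_linear(base, tuned, alpha=0.5)
```

With `alpha=0` the result equals the base model and with `alpha=1` it equals the fine-tuned model, so `alpha` trades off retaining original behavior against absorbing the edits.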

Vision Generalist Model: A Survey

Vision Generalist Model: A Survey [87.5]
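This paper gives an overview of vision generalist models, examining their characteristics and capabilities within the field. It also briefly explores related domains, shedding light on their interconnections and potential synergies.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 17:23:41 GMT)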

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [43.2]
A major challenge for modern AI is learning to understand the world and to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data. We develop models capable of understanding, predicting, and planning in the physical world.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 17:57:09 GMT)
The paper states: "we show that joint-embedding predictive architectures learning from videos can be used to build a world model that enables understanding the physical world, predicting future states, and effectively planning in new situations; this is achieved by leveraging internet-scale video and a small amount of interaction data."
The project site is Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning, and the repository is GitHub – facebookresearch/vjepa2: PyTorch code and models for VJEPA2 self-supervised learning from video.

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.3]
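We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. The scaling strategies yield notable improvements in reasoning performance. Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 09:27:48 GMT)
A paper examining the relationship between SFT and RL, which matters in LRM development. It states: "Our results show that both scaling strategies substantially improve the reasoning abilities of large language models (LLMs)."
Gains in an adjacent(?) domain, as in "Interestingly, even strong SFT models with robust coding abilities benefit substantially from math-only RL training. This leads to further gains in coding performance.", show up in many places in this field, and I find this an interesting property.
The repository is nvidia/AceReason-Nemotron-1.1-7B · Hugging Face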

Gemini 2.5 Pro, Flash, 2.5 Flash-Lite, MiniMax-M1, Kimi-Dev-72B

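Last week again brought a range of news: Gemini 2.5 Pro dropped its preview label and 2.5 Flash-Lite was released (Gemini Pro – Google DeepMind).

MiniMax, known for its highly efficient models, has released a reasoning model. Moonshot has published Kimi-Dev-72B, which also carries high expectations (GitHub – MoonshotAI/Kimi-Dev: open-source coding LLM for software engineering tasks). The technical report is said to be in preparation.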

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention [90.7]
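MiniMax-M1 is an open-weight, large-scale hybrid-attention reasoning model. Its context length is 1 million tokens, 8 times the context size of DeepSeek R1. MiniMax-M1 is trained using large-scale reinforcement learning.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 15:08:02 GMT)
A model that leverages the efficient Lightning Attention. The computational cost of Lightning Attention is linear in context length (although the model uses a hybrid structure for the sake of overall balance), which seems well suited to LRMs. It also adopts MoE, which is common in recent models.
The repository is GitHub – MiniMax-AI/MiniMax-M1: MiniMax-M1, the world’s first open-weight, large-scale hybrid-attention reasoning model.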
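
The point about cost being linear in context length can be illustrated with generic causal linear attention. The sketch below is a simplified assumption, not MiniMax's Lightning Attention implementation: it omits normalization, decay, feature maps, and any block tiling, and the function names are hypothetical. It only shows that a running key-value state gives the same output as the quadratic masked form while visiting each token once.

```python
from typing import List

Matrix = List[List[float]]


def causal_linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """One pass over the sequence: keep a running sum of outer(k_t, v_t)."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # sum over s <= t of k_s^T v_s
    out: Matrix = []
    for q_t, k_t, v_t in zip(q, k, v):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k_t[a] * v_t[b]
        # o_t = q_t @ state, independent of how many tokens came before.
        out.append(
            [sum(q_t[a] * state[a][b] for a in range(d_k)) for b in range(d_v)]
        )
    return out


def causal_quadratic_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference form: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    out: Matrix = []
    for t, q_t in enumerate(q):
        row = [0.0] * len(v[0])
        for s in range(t + 1):
            score = sum(x * y for x, y in zip(q_t, k[s]))
            for b in range(len(row)):
                row[b] += score * v[s][b]
        out.append(row)
    return out


# Toy sequence of three tokens with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
linear_out = causal_linear_attention(q, k, v)
quadratic_out = causal_quadratic_attention(q, k, v)
```

The linear version does a fixed amount of work per token, whereas the reference version revisits all earlier tokens at every step.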

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence [28.0]
We propose SwarmAgentic, a framework for automated agentic system generation. SwarmAgentic constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration. We evaluate the proposed method on six real-world, open-ended, exploratory tasks involving high-level planning, system-level coordination, and creative reasoning.
Reference translation of the paper (metadata) (Wed, 18 Jun 2025 17:54:55 GMT)
A framework proposal: "We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO)."
It treats each candidate system as a particle and, with this Particle Swarm Optimization (PSO)-style approach, reportedly outperforms other methods. I am somewhat curious how high the computational cost is.
The project site is Academic Project Page
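
For readers unfamiliar with the algorithm that inspired this, here is a minimal sketch of classical numeric PSO. It is not SwarmAgentic's language-driven variant: the particles here are plain real-valued vectors, and the function names, coefficients, bounds, and the `sphere` test objective are all assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Tuple


def pso_minimize(
    objective: Callable[[List[float]], float],
    dim: int,
    n_particles: int = 20,
    iters: int = 200,
    bounds: Tuple[float, float] = (-5.0, 5.0),
    inertia: float = 0.7,
    c_personal: float = 1.5,
    c_global: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_val = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=lambda i: personal_val[i])
    global_best = personal_best[best_idx][:]
    global_val = personal_val[best_idx]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                # Move, then clamp the coordinate back into the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            val = objective(positions[i])
            if val < personal_val[i]:
                personal_best[i] = positions[i][:]
                personal_val[i] = val
                if val < global_val:
                    global_best = positions[i][:]
                    global_val = val
    return global_best, global_val


def sphere(x: List[float]) -> float:
    """Toy objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best_pos, best_val = pso_minimize(sphere, dim=2)
```

Every iteration evaluates the objective once per particle, so the cost scales with population size times iterations; where each evaluation means running a full agentic system, that is the part likely to dominate.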

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce [45.3]
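We introduce a new framework for assessing which tasks workers want AI agents to automate or augment. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires. We build the WORKBank database, collecting the preferences of 1,500 domain workers and capability assessments from AI experts.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 21:25:21 GMT)
A survey report: "This paper presents the first large-scale audit of both worker desire and technological capability for AI agents in the context of automation and augmentation." Viewed through the four quadrants below, it is hard to say that what workers want matches the direction of research.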
- Automation “Green Light” Zone: Tasks with both high automation desire and high capability. These are prime candidates for AI agent deployment with the potential for broad productivity and societal gains.
- Automation “Red Light” Zone: Tasks with high capability but low desire. Deployment here warrants caution, as it may face worker resistance or pose broader negative societal implications
- R&D Opportunity Zone: Tasks with high desire but currently low capability. These represent promising directions for AI research and development.
- Low Priority Zone: Tasks with both low desire and low capability. These are less urgent for AI agent development.
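
The four zones above amount to a two-by-two split on worker desire and AI capability. A minimal sketch, assuming both are numeric scores and that a single cutoff separates "high" from "low"; the function name, the score scale, and the `threshold` value are hypothetical and not taken from the paper.

```python
def classify_task(desire: float, capability: float, threshold: float = 3.0) -> str:
    """Map a (desire, capability) score pair to one of the four zones.

    `threshold` is an assumed cutoff: scores at or above it count as "high".
    """
    high_desire = desire >= threshold
    high_capability = capability >= threshold
    if high_desire and high_capability:
        return "Green Light"
    if high_capability:
        return "Red Light"  # capable, but workers do not want it automated
    if high_desire:
        return "R&D Opportunity"  # wanted, but not yet technically feasible
    return "Low Priority"


# Hypothetical scores for four made-up tasks.
zones = [
    classify_task(4.5, 4.0),
    classify_task(1.5, 4.0),
    classify_task(4.5, 2.0),
    classify_task(1.5, 2.0),
]
```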
Taken together with the research results below, I wonder whether these tendencies will shift as people keep using AI.

Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task [17.6]
This study investigated how the use of large language models (LLMs) in an educational context affects cognitive load. The 54 participants were divided into LLM, search engine, and brain-only groups; neural activity was recorded with electroencephalography (EEG) and learning outcomes were measured. The LLM group showed weaker cognitive network connectivity than the other groups, along with a decline in learning skills. The study aims to provide preliminary guidance toward understanding the impact of AI on learning environments.
Reference translation of the paper (metadata) (Tue, 10 Jun 2025 15:04:28 GMT)
An education-related report on how the use of AI affects people. The results are somewhat frightening: "As the educational impact of LLM use only begins to settle with the general population, in this study we demonstrate the pressing matter of a likely decrease in learning skills based on the results of our study. The use of LLM had a measurable impact on participants, and while the benefits were initially apparent, as we demonstrated over the course of 4 months, the LLM group’s participants performed worse than their counterparts in the Brain-only group at all levels: neural, linguistic, scoring."
The project site is Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

Protecting Human Cognition in the Age of AI [2.1]
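The rapid spread of generative AI (GenAI) is having a major impact on human cognition, reshaping how we engage with information, think, and learn. Focusing in particular on novices such as students, this paper emphasizes the importance of understanding effective human-AI interaction and considers how to redesign educational experiences to foster critical thinking. It also explores the effects of GenAI on cognitive abilities and its interplay with societal factors such as information overload.
Reference translation of the paper (metadata) (Fri, 11 Apr 2025 21:14:29 GMT)
A short, survey-style paper.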

Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning [59.5]
We present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capabilities of Multimodal Large Language Models (MLLMs). SFE consists of 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks in five high-value disciplines. Experiments show that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, revealing that MLLMs have room to improve in scientific domains.
Reference translation of the paper (metadata) (Thu, 12 Jun 2025 09:29:16 GMT)
A benchmark described as follows: "we introduce the Scientists’ First Exam (SFE) benchmark, designed to comprehensively evaluate the scientific cognitive capabilities of MLLMs through three cognitive levels (cog-levels): Scientific Signal Perception (L1) characterizes the capacity to discern critical components within visualizations of scientific raw data; Scientific Attribute Understanding (L2) demonstrates the ability to interpret domain-expert knowledge; Scientific Comparative Reasoning (L3) manifests the ability to derive phenomenological insights through structured comparison of multiple scientific visual sources. SFE encompasses 66 expert-curated, high-value multimodal tasks across five disciplines: Astronomy, Chemistry, Earth, Life, and Materials Sciences (Fig. 1b)." It targets MLLMs and is structured as VQA.
The repository is PrismaX/SFE · Datasets at Hugging Face, and the project site is PrismaX
