June 2025 – Introducing the latest arXiv papers

Deep Research API, Gemini CLI, Mistral-Small-3.2-24B, Hunyuan-A13B, OpusLM

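There was plenty of news, but I think last week's highlights were the arrival of the Deep Research API (Introduction to deep research in the OpenAI API) and the release of Gemini CLI (Gemini CLI: Open-source AI agent | Google Cloud Official Blog). It is characteristic of the generative AI space that vendors providing foundation models such as LLMs and LRMs also move into the application layer. This is a natural move for capturing more added value, but it means tough times continue for vendors and startups whose business is built on using those APIs.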

Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506 · Hugging Face. Tencent also announced Hunyuan-A13B, an MoE, hybrid-reasoning model with 80B total and 13B active parameters (GitHub – Tencent-Hunyuan/Hunyuan-A13B: Tencent Hunyuan A13B (short as Hunyuan-A13B), an innovative and open-source LLM built on a fine-grained MoE architecture.).

On a separate axis, an open SpeechLM has also been announced. Open efforts like this deserve attention too.

OpusLM: A Family of Open Unified Speech Language Models [56.1]
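OpusLM is continually pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. This paper describes the SpeechLM design in terms of tokenization, multi-stream language models, and multi-stage training strategies.
Reference translation of the paper (metadata) (Sat, 21 Jun 2025 06:30:59 GMT)
OpusLMs, short for Open Unified Speech Language Models
The model is at espnet/OpusLM_7B_Anneal · Hugging Face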

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas / Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas [90.3]
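A good idea should not merely be novel; it should lead to better research once executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study. Comparing review scores of the same ideas before and after execution, the scores of LLM-generated ideas drop significantly more than those of expert-written ideas.
Reference translation of the paper (metadata) (Wed, 25 Jun 2025 19:47:23 GMT)
The paper evaluates LLM-generated ideas against expert ideas as follows: "Our execution participants spend an average of 103 hours executing the assigned idea and then submit the codebase and paper to document their experiments. All projects are then reviewed blindly by our recruited expert reviewers". The finding is that "Average scores of AI ideas drop significantly more than Human ideas in the execution study across all the evaluation metrics."
An interesting result, suggesting that human experts do think things through more deeply. At the same time, the fact that AI scores higher on ideas alone suggests it may still be useful for ideation, and one could argue it holds up reasonably well even on the final scores. As with the paper below, I think the feasibility of AI scientists is increasing.
The repository is GitHub – NoviScl/AI-Researcher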

Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI [98.2]
We propose the paradigm of Intelligent Science Laboratories (ISLs). An ISL is a multi-layered, closed-loop framework that deeply integrates cognitive and embodied intelligence. We argue that such systems are essential for overcoming the current limitations of scientific discovery.
Reference translation of the paper (metadata) (Tue, 24 Jun 2025 13:31:44 GMT)
A proposal for an AI scientist and AI lab framework consisting of "(1) Foundation Models provide multi-modal scientific knowledge representation and closed-loop learning capabilities, supporting complex reasoning and domain adaptation; (2) Agent Layer dynamically orchestrates scientific workflows—including hypothesis generation, literature review, experimental planning, execution, and analysis—while integrating model/toolkit via MCP integration; (3) Embodied Layer realizes robust physical interaction through advanced perception, navigation, and manipulation modules, enabling precise, adaptive operations in real-world laboratory environments."
The overview of the current state and open challenges is very helpful.

Language Modeling by Language Models

Language Modeling by Language Models [28.8]
This paper proposes a multi-agent language model (LM) approach that simulates the conventional stages of research. New designs are proposed, adversarially reviewed, implemented, and selectively verified. We report experiments on 1,162 newly discovered designs.
Reference translation of the paper (metadata) (Wed, 25 Jun 2025 08:46:10 GMT)
Research on AI by AI: "We introduce Genesys, an autonomous system for discovering novel LM designs, featuring a novel unit-based design agent and cost-effective distributed evolution. We also present LMADE, a resource environment to support further research in this field."
It is interesting that there already appear to be concrete results and real feasibility at this point: "Genesys produced highly competitive designs; some outperformed human baselines such as the GPT and Mamba2 models in common downstream tasks. These results show the feasibility and lay the groundwork for autonomous evolutionary systems in scientifically complex and costly domains."
The project site is Genesys; the repository is GitHub – allenai/genesys: Source code and utilities for the Genesys distributed language model architecture discovery system.

Routing Mamba, Memba

Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.5]
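Linear state space models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances such as Mamba further enhance SSMs with input-dependent gating and hardware-aware implementations. This paper introduces Routing Mamba (RoM), a novel approach that scales SSM parameters using a sparse mixture of linear projection experts.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 19:26:55 GMT)
"We introduce Routing Mamba (RoM), a novel framework that integrates MoE mechanisms into SSMs by leveraging Mamba’s projection layers as scalable expert components." This work brings an MoE-style framework to Mamba. Efficiency and performance are reported to improve.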
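To make "a sparse mixture of linear projection experts" a little more concrete, here is a minimal sketch of generic top-k MoE routing over linear projections. It is not the RoM implementation: the helper names (`matvec`, `route_top_k`, `moe_projection`), the random weights, and the choice of a softmax over only the selected experts are all assumptions made for illustration, and the Mamba-specific parts (the SSM itself, how routing is applied across Mamba's projection layers) are left out.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_top_k(router_w, x, k):
    """Pick the k experts with the highest router logits and
    return (expert_index, gate) pairs whose gates sum to 1."""
    logits = matvec(router_w, x)
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)          # subtract max for numerical stability
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    return [(e, v / z) for e, v in zip(top, exps)]


def moe_projection(experts, router_w, x, k=2):
    """Sparse mixture: only the k selected linear projections are evaluated."""
    out = [0.0] * len(experts[0])
    for e, gate in route_top_k(router_w, x, k):
        y = matvec(experts[e], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Hypothetical sizes and random weights, for illustration only.
random.seed(0)
d_in, d_out, n_experts = 4, 3, 4
experts = [[[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
           for _ in range(n_experts)]
router_w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_experts)]

x = [0.5, -1.0, 0.25, 2.0]
gates = route_top_k(router_w, x, k=2)
y = moe_projection(experts, router_w, x, k=2)
print(gates, y)
```

The point of the sketch is only that the parameter count grows with the number of experts while the per-token compute grows with k, which is the general appeal of MoE-style scaling.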

Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba [21.5]
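Memba is a new parameter-efficient fine-tuning method specialized for State Space Models (SSMs) that aims to improve the capabilities of Mamba models. It uses Leaky Integrate Membrane (LIM) neurons to strengthen temporal information retention and achieves better performance than conventional fine-tuning methods. Experimental results show that Memba yields significant improvements over other methods on language modeling and computer vision tasks.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 21:52:45 GMT)
An efficient fine-tuning framework designed for Mamba
The repository is said to be https://github.com/Intelligent-Computing-Lab-Yale/Memba, but it currently returns a 404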

Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence

Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence [131.4]
We propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Thanks to this task awareness, the method enables two optional adaptation modes: knowledge-preserved mode (KPM) and instruction-previewed mode (IPM).
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 07:55:14 GMT)
The paper introduces knowledge-preserved mode (KPM) and instruction-previewed mode (IPM), and reports: "Experimental results demonstrate that our method in KPM outperforms LoRA not only in downstream performance but also in maintaining zero-shot capabilities for both large language models and vision language models. Meanwhile, the IPM exhibits superior fine-tuning performance and faster convergence in both standard and quantized adaptation across various tasks."
Examples are available at peft/examples/corda_finetuning at main · huggingface/peft · GitHub

Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions

Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions [17.1]
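Advances in large language models and AI systems have led to a paradigm shift in the design and optimization of complex AI. This paper systematically reviews recent progress in optimizing compound AI systems, covering both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field.
Reference translation of the paper (metadata) (Mon, 09 Jun 2025 21:04:14 GMT)
A survey on compound AI systems, which matter a great deal in practice: "This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques."
The repository is GitHub – MiuLab/AISysOpt-Survey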

AlphaEvolve: A coding agent for scientific and algorithmic discovery

AlphaEvolve: A coding agent for scientific and algorithmic discovery [63.1]
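We propose AlphaEvolve, an evolutionary coding agent that substantially enhances the capabilities of state-of-the-art LLMs. AlphaEvolve orchestrates an autonomous pipeline of LLMs whose task is to improve an algorithm by making direct changes to the code. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 06:37:18 GMT)
The paper behind AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms – Google DeepMind has now appeared on arXiv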
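As a rough picture of what "evolutionary" means here, the following is a toy propose-evaluate-select loop. It is not AlphaEvolve: `propose_edit` is a random tweak standing in for an LLM-proposed code change, `evaluate` is a hypothetical scoring function with a made-up target, and the candidates are short integer lists rather than programs. All names and numbers are assumptions for illustration.

```python
import random


def evaluate(candidate):
    """Hypothetical evaluator (higher is better): closeness to a made-up target."""
    target = [3, 1, 4, 1, 5]
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))


def propose_edit(parent, rng):
    """Stand-in for an LLM-proposed change: tweak one position by +/-1."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child


def evolve(generations=200, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [[0, 0, 0, 0, 0] for _ in range(population_size)]
    history = []  # best score after each generation
    for _ in range(generations):
        children = [propose_edit(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Elitist selection: keep the best of parents plus children,
        # so the best score can never get worse.
        population = sorted(population + children,
                            key=evaluate, reverse=True)[:population_size]
        history.append(evaluate(population[0]))
    return population[0], history


best, history = evolve()
print(best, history[-1])
```

In a real system the expensive and interesting parts are exactly the two stubs: how edits are proposed and how candidates are scored automatically.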

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios [30.2]
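The ability of large language models to use external tools has enabled them to tackle an increasingly diverse range of tasks. As tasks become more complex and long-horizon, the intricate tool-use process may trigger various unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 17:59:18 GMT)
A benchmark: "CRITICTOOL, the first self-critique evaluation benchmark for tool utilization of LLMs. Distinct from prior result-oriented evaluation methods, we categorize error patterns more finely and evaluate models from multiple perspectives, enabling a deeper exploration of LLMs’ tool-use capabilities in error-prone scenarios." I am curious how the latest models would score.
The repository is GitHub – Shellorley0513/CriticTool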

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index [124.7]
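Infini-gram mini is a scalable system that makes petabyte-level text corpora searchable. We index 46TB of Internet text in 50 days on a 128-core CPU node. We demonstrate an important use case of Infini-gram mini in a large-scale analysis of benchmark contamination.
Reference translation of the paper (metadata) (Fri, 13 Jun 2025 21:13:57 GMT)
A report on indexing data at very large scale. The index is used to compute the degree of contamination of various benchmarks (Benchmark Contamination Monitoring System – a Hugging Face Space by infini-gram-mini). This has been pointed out before, but it looks like the reliability of some benchmarks will come into question.
The project site is Home | infini-gram-mini; the repository is GitHub – xuhaoxh/infini-gram-mini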
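For readers unfamiliar with the FM-index named in the title, here is a toy, in-memory sketch of how it answers exact-match counting queries via backward search over the Burrows-Wheeler transform. It is not the Infini-gram mini code: the function names are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, none of which would work at terabyte scale.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table + occurrence table over the BWT)."""
    # Sentinel: assumed not to occur in `text`; sorts before every other character.
    text += "\0"
    # Naive suffix array construction: fine for a demo, far too slow for real corpora.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = [text[i - 1] for i in sa]

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[c] = number of characters in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact matches of `pattern` by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the current suffix
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count_occurrences(index, "abra"))  # 2
print(count_occurrences(index, "xyz"))   # 0
```

Each query touches the index once per pattern character regardless of corpus size, which is why this family of structures suits contamination checks such as counting how often a benchmark string appears in a crawl.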

Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability

Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability [1.3]
Institutional Books 1.0 is a collection of public domain books digitized through Harvard Library's participation in the Google Books project, which began in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. The analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages, for a total of approximately 250 billion tokens.
Reference translation of the paper (metadata) (Tue, 10 Jun 2025 00:11:30 GMT)
A large-scale dataset: "OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available."
The dataset is institutional/institutional-books-1.0 · Datasets at Hugging Face; the repository is GitHub – instdin/institutional-books-1-pipeline: The Institutional Data Initiative’s pipeline for analyzing, refining, and publishing the Institutional Books 1.0 collection.
