September 15, 2025 – Introducing the Latest arXiv Papers

Qwen3-Next-80B-A3B, Qwen3-ASR, Hunyuan-MT, MMBERT

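The biggest news last week was probably the release of Qwen/Qwen3-Next-80B-A3B-Instruct · Hugging Face, a high-performing model with a very sparse configuration: about 80B total parameters, of which only about 3B are active per token. Very sparse MoE architectures are in fashion, and DeepSeek and others take a similar approach. Qwen also released Qwen3-ASR, a multilingual speech recognition model. The team seems to be building out the surrounding areas thoroughly as well.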
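To make "very sparse" concrete: in a name like 80B-A3B, roughly 3B of 80B parameters are active per token, i.e. about 3/80 = 3.75% of the weights. The usual mechanism behind this is top-k expert routing, where a router scores every expert but only the k best ones actually run. The sketch below is a generic illustration, assuming a scalar toy "token", toy linear experts, and hypothetical router logits; the expert count, k, and router are not Qwen3-Next's actual configuration.

```python
import math
from typing import Callable, List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_moe(x: float, router_logits: Sequence[float],
              experts: Sequence[Callable[[float], float]], k: int) -> float:
    """Route one toy (scalar) token to its top-k experts.

    Only the k selected experts are evaluated; the others are skipped,
    which is where the compute savings of a sparse MoE come from.
    """
    # Indices of the k largest router logits
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalise gate weights over the selected experts only
    gates = softmax([router_logits[i] for i in top])
    return sum(g * experts[i](x) for g, i in zip(gates, top))


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters used per token."""
    return active_params / total_params


# Hypothetical example: 8 toy experts, expert s multiplies its input by s
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
y = top_k_moe(1.0, logits, experts, k=2)
print(f"output={y:.4f}, active fraction={active_fraction(3e9, 80e9):.2%}")
# prints output=3.1932, active fraction=3.75%
```

Here only experts 4 and 1 run (the two highest logits); in a real model the same selection happens per token and per MoE layer, so the total parameter count can grow without a proportional increase in per-token compute.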

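Hunyuan-MT is a machine translation model built on Hunyuan. As with PLaMo Translate (Released the specialized large language model "PLaMo Translate" – Preferred Networks Research & Development), LLM-based translation models are very strong.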

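Finally, mmBERT, a multilingual encoder-only model, was also released. Decoder-only LLMs dominate right now, but encoder-only models remain an important approach for practical tasks such as classification.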

Hunyuan-MT Technical Report [20.9]
Hunyuan-MT-7B supports bidirectional translation across 33 major languages. Hunyuan-MT-Chimera-7B is a translation model inspired by the slow-thinking mode.
Paper / Reference translation (metadata) (Fri, 05 Sep 2025 16:11:05 GMT)
The report states "The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL.", and each stage of this pipeline is itself carefully engineered.
The repository is tencent/Hunyuan-MT-7B · Hugging Face

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.6]
mmBERT is an encoder-only language model pre-trained on 3T tokens of multilingual text. More than 1,700 low-resource languages are added to the data. The authors show that mmBERT outperforms previous models on classification and retrieval tasks.
Paper / Reference translation (metadata) (Mon, 08 Sep 2025 17:08:42 GMT)
A multilingual BERT: "We do this by pre-training our new model suite, MMBERT, on 3T tokens of multilingual text using an architecture inspired from ModernBERT (Warner et al., 2024)."
The repository is GitHub – JHU-CLSP/mmBERT: A massively multilingual modern encoder language model
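The "Annealed Language Learning" in the title refers to changing the language mix over the course of pre-training. A standard way to control that mix is temperature sampling, where language i with n_i tokens is sampled with probability proportional to n_i^τ: τ = 1 keeps the raw proportions and lowering τ flattens the distribution toward low-resource languages. The sketch below assumes that formulation; the token counts and the τ schedule are hypothetical illustrations, not mmBERT's actual data or settings.

```python
from typing import Dict


def language_sampling_probs(token_counts: Dict[str, float],
                            tau: float) -> Dict[str, float]:
    """Temperature sampling: p_i is proportional to n_i ** tau.

    tau = 1 reproduces the raw data proportions; tau -> 0 approaches uniform.
    """
    weights = {lang: n ** tau for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical token counts (illustrative only, not mmBERT's corpus)
counts = {"en": 1_000_000, "ja": 100_000, "sw": 1_000}

# Hypothetical annealing schedule: tau decreases as training progresses
for tau in (1.0, 0.7, 0.3):
    probs = language_sampling_probs(counts, tau)
    print(tau, {k: round(v, 4) for k, v in probs.items()})
```

As τ decreases, the share of the low-resource language ("sw" here) rises, which is the intuition behind introducing many low-resource languages later in training with a flatter sampling distribution.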

Autonomous Code Evolution Meets NP-Completeness

Autonomous Code Evolution Meets NP-Completeness [9.7]
SATLUTION is the first framework to extend LLM-based code evolution to full repository scale. It orchestrates solver repositories under strict correctness guarantees and distributed feedback, while simultaneously self-evolving its own evolution policies and rules. Starting from the SAT Competition 2024 codebases and benchmark, SATLUTION decisively outperformed the human-designed winners of SAT Competition 2025.
Paper / Reference translation (metadata) (Tue, 09 Sep 2025 03:28:06 GMT)
"Starting from SAT Competition 2024 codebases and benchmark, SATLUTION evolved solvers that decisively outperformed the human-designed winners of the SAT Competition 2025, and also surpassed both 2024 and 2025 champions on the 2024 benchmarks." A result that underscores how powerful code generation has become.
The discussion also states "However, our experiments also revealed limitations. In fully automated operation—what we refer to as our customized “YOLO mode”, distinct from the official CLI tool, the agents often struggled, and the flow proved most effective in a semi-automated setup with targeted human intervention. In particular, the agents were prone to failures in SAT/UNSAT correctness checks and deep memory errors such as segmentation faults, where human intervention remained critical to preserve progress. While the planning capabilities of the agents were strong at the level of concrete programming tasks, they lacked sufficient domain-specific knowledge at the idea level, especially for nuanced aspects of SAT solving.", so the importance of domain knowledge is pointed out. (That said, I suspect that part may eventually be handled by AI as well.)
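For context on the "SAT/UNSAT correctness checks" mentioned above: a SAT answer is cheap to verify, because the solver's returned assignment can be checked clause by clause against the input CNF. The sketch below is a generic model checker for DIMACS CNF, not SATLUTION's actual verification pipeline (which this post does not describe). UNSAT answers cannot be checked this way; they require a proof certificate (for example a DRAT proof checked with a tool such as drat-trim), which is out of scope here.

```python
from typing import Iterable, List, Set


def parse_dimacs(text: str) -> List[List[int]]:
    """Parse a DIMACS CNF string into clauses (lists of non-zero ints)."""
    clauses: List[List[int]] = []
    current: List[int] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("c", "p")):
            continue  # skip blank lines, comments and the problem line
        for tok in line.split():
            lit = int(tok)
            if lit == 0:  # 0 terminates a clause
                clauses.append(current)
                current = []
            else:
                current.append(lit)
    if current:
        clauses.append(current)  # tolerate a missing final 0
    return clauses


def check_model(clauses: Iterable[List[int]], model: Iterable[int]) -> bool:
    """Return True iff the assignment satisfies every clause.

    `model` follows the solver 'v' line convention: +x means x is true,
    -x means x is false; a trailing 0 is ignored.
    """
    true_lits: Set[int] = {lit for lit in model if lit != 0}
    if any(-lit in true_lits for lit in true_lits):
        return False  # inconsistent assignment (x and -x both set)
    return all(any(lit in true_lits for lit in clause) for clause in clauses)


cnf = """c toy example
p cnf 3 3
1 -2 0
2 3 0
-1 -3 0
"""
clauses = parse_dimacs(cnf)
print(check_model(clauses, [1, 2, -3, 0]))  # prints True
print(check_model(clauses, [1, -2, 3]))     # prints False
```

A check like this only catches wrong SAT claims; a solver that wrongly reports UNSAT would pass unnoticed, which is one reason the correctness checks are harder to automate than they first appear.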

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [221.3]
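Scientific large language models (Sci-LLMs) are changing how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution of the models and their underlying data substrate. The authors formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
Paper / Reference translation (metadata) (Thu, 28 Aug 2025 18:30:52 GMT)
A survey of LLMs for scientific research, an area where applications are advancing quickly.
The repository is GitHub – open-sciencelab/Awesome-Scientific-Datasets-and-LLMs: A curated collection of papers, datasets, and resources on Scientific Datasets and Large Language Models (LLMs)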

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants [5.5]
The paper develops the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods. Based on typical AI use cases, the authors build HumanAgencyBench (HAB), a scalable and adaptive benchmark covering six dimensions of human agency.
Paper / Reference translation (metadata) (Wed, 10 Sep 2025 11:10:10 GMT)
A benchmark of how AI assistants handle human agency. Models can be evaluated across multiple categories (Experimental-Orange/HumanAgencyBench_Evaluation_Results · Datasets at Hugging Face). The authors write "There is substantial variation across model developers—with Anthropic’s Claude models tending to most support human agency—and across dimensions. We encourage further research into human agency as more human tasks and decisions are delegated to AI systems, ensuring humans maintain appropriate levels of control.", so behavior appears to differ considerably between models.
The repository is GitHub – BenSturgeon/HumanAgencyBench: A code repository for the paper: “HUMANAGENCYBENCH: Scalable Evaluation of Human Agency Support in AI Assistants”
