Uncategorized – Introduction to the Latest arXiv Papers

Kimi K2 Thinking, LongCat-Flash-Omni, iFlyBot-VLA, Nemotron Nano V2 VL

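Last week again saw a range of open models and technical reports released. Progress is very fast, and it is remarkable that openly released models are now closing in on frontier models.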

Kimi K2 Thinking (Kimi K2 Thinking, moonshotai/Kimi-K2-Thinking · Hugging Face) is a model that claims to outperform frontier models such as GPT-5 on some benchmarks. Its size, 1T total parameters with 32B active, is the same as when Kimi K2 was covered in the earlier post (Grok 4, Phi4-mini-Flash-Reasoning, SmolLM3, Kimi-K2, T5Gemma – Introduction to the Latest arXiv Papers). According to the announcement: "Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity’s Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls."

Among multimodal models, technical reports were published for LongCat-Flash-Omni (meituan-longcat/LongCat-Flash-Omni · Hugging Face), iFlyBot-VLA (iFlyBot-VLA Tech Report, iFlyBot/iFlyBotVLM · Hugging Face), and Nemotron Nano V2 VL (nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 · Hugging Face).

LongCat-Flash-Omni Technical Report [131.5]
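LongCat-Flash-Omni is an open-source omni-modal model with 560 billion parameters. It delivers comprehensive multimodal capabilities while retaining strong unimodal performance, and it supports low-latency, real-time audio-visual interaction.
Reference translation of the paper (metadata) (Fri, 31 Oct 2025 21:58:15 GMT)
A multimodal model with 560B total parameters and 27B active; a high-performing open model that surpasses Gemini 2.5 Pro on some benchmarks.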
GitHub – meituan-longcat/LongCat-Flash-Omni: This is the official repo for the paper “LongCat-Flash-Omni Technical Report”
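As a rough illustration of the sizes quoted in this post, the sketch below computes what share of each model's parameters is active per token. It assumes "Active" is meant in the usual sparse mixture-of-experts sense; the dictionary and function names are made up for this example, and only the parameter counts come from the post.

```python
# Total and active parameter counts as quoted in the post, in billions.
MODELS = {
    "Kimi K2 Thinking": (1000, 32),
    "LongCat-Flash-Omni": (560, 27),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token (assumes a sparse, MoE-style model)."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models touch only a few percent of their weights per token, which is why a 560B or 1T model can have per-token compute closer to that of a roughly 30B dense model.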

iFlyBot-VLA Technical Report [25.3]
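iFlyBot-VLA is a large-scale Vision-Language-Action (VLA) model trained under a new framework. The main contributions are: (1) a latent action model trained thoroughly on large-scale human and robot manipulation videos; (2) a two-level action representation framework that jointly supervises the vision-language model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets.
Reference translation of the paper (metadata) (Sat, 01 Nov 2025 06:24:56 GMT)
A VLA model from iFlyTech. The report states: "The architecture of iFlyBot-VLA consists primarily of a language transformer backbone and an action expert network. The model generates executable robot actions through a combination of explicit and implicit planning."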
iFlyBot/iFlyBotVLM · Hugging Face

NVIDIA Nemotron Nano V2 VL [134.5]
Nemotron Nano V2 VL is built on Nemotron Nano V2, a hybrid Mamba-Transformer LLM. Model checkpoints are released in BF16, FP8, and FP4 formats.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 00:10:19 GMT)
"Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios." In short, a multimodal model with a hybrid architecture.
nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 · Hugging Face

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation [117.5]
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) show strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. We identify shortcut learning as a key obstacle to generalization.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 16:14:01 GMT)
The paper reports: "Our analysis reveals that large-scale robot datasets like OXE suffer from limited sub-dataset diversity and severe fragmentation, a problem that extends even within individual sub-datasets. This structure inherently promotes shortcut learning, meaning that simply adding more similarly-fragmented data can be detrimental to generalization." Building general-purpose models is hard.
Project site: Shortcut Learning in GRPs

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [50.0]
This work introduces Mixture-of-Recursions (MoR). MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking. It also proposes a KV-sharing variant that reuses KV pairs from the first recursion, aimed specifically at reducing prefill latency and memory footprint.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 17:49:00 GMT)
The proposed architecture: "We propose Mixture-of-Recursions (MoR)—a framework that dynamically adjusts recursion step for each token during pretraining and inference. The core of MoR lies in two components: a routing mechanism that assigns token-specific recursion steps to adaptively concentrate computation on more challenging tokens, and a KV caching strategy that defines how KV pairs are stored and selectively utilized for attention at each recursive step." The authors report that "MoR consistently outperforms recursive baselines and matches or exceeds the standard Transformers at larger scales, despite using significantly fewer parameters (approximately one-third due to layer tying with N_R = 3)."
Repository: GitHub – raymin0223/mixture_of_recursions: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking
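The idea of routing each token through a shared block a different number of times can be sketched in a few lines. This is only a toy under stated assumptions, not the authors' implementation: tokens are single floats, `shared_block` stands in for the tied layer stack, and `route_depth` is a hypothetical hand-written rule, whereas the paper's routers are learned. KV caching is not modeled at all.

```python
import math


def shared_block(h: float) -> float:
    """Stand-in for the shared layer stack; the same weights are reused at every recursion."""
    return math.tanh(0.5 * h + 0.1)


def route_depth(h: float, max_recursions: int = 3) -> int:
    """Hypothetical router: tokens with larger |h| are treated as harder and recurse more."""
    score = min(abs(h), 1.0)  # toy difficulty score in [0, 1]
    return 1 + min(max_recursions - 1, int(score * max_recursions))


def mixture_of_recursions(tokens, max_recursions: int = 3):
    """Apply the shared block to each token as many times as the router decides."""
    outputs, depths = [], []
    for h in tokens:
        depth = route_depth(h, max_recursions)
        for _ in range(depth):
            h = shared_block(h)
        outputs.append(h)
        depths.append(depth)
    return outputs, depths


def unique_layers(effective_depth: int, n_recursions: int) -> int:
    """Layer tying: an effective depth of L needs only L / N_R distinct layers."""
    return effective_depth // n_recursions


outputs, depths = mixture_of_recursions([0.05, 0.4, 0.9])
print(depths)                # [1, 2, 3]
print(unique_layers(12, 3))  # 4
```

The last line mirrors the "approximately one-third" figure in the quote: with N_R = 3, an effective depth of 12 layers is built from 4 distinct layers.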

A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality [108.9]
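Video generation models can produce only 5 to 16 seconds of video, yet the results are often labeled "long-form video." Videos longer than 16 seconds struggle to maintain consistent character appearance and scene layout throughout the narrative. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 18:20:33 GMT)
A survey of methods for generating long, consistent videos.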

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation [38.6]
We introduce HumaniBench, a comprehensive benchmark of 32K real-world image-question pairs. HumaniBench evaluates seven Human-Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.
Reference translation of the paper (metadata) (Fri, 16 May 2025 17:09:44 GMT)
The benchmark is described as follows: "HumaniBench probes seven HCAI principles—fairness, ethics, understanding, reasoning, language inclusivity, empathy, robustness—through seven diverse tasks that mix open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests." Commercial models achieve the best results overall, but open models score higher on some individual components.
Project site: HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights [72.8]
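HiPerRAG is a workflow that indexes and retrieves knowledge from more than 3.6 million scientific papers. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a fine-tuning algorithm for query-aware encoders. HiPerRAG delivers robust performance on existing scientific question-answering benchmarks and on two new benchmarks introduced in this work.
Reference translation of the paper (metadata) (Wed, 07 May 2025 22:50:23 GMT)
The statement "Despite the widespread adoption of RAG, it faces three significant technical challenges that hinder its ability to scale to millions of documents." is exactly right, and the paper is a useful reference for building large-scale RAG.
It also does some fairly elaborate things. Perhaps (depending on the field) this kind of approach will be needed in practice as well…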
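For readers new to RAG, the retrieval step can be sketched as follows. This is a deliberately small stand-in, not HiPerRAG's pipeline: the bag-of-words `embed` function replaces a trained encoder (it is not ColTrast), the three-document corpus is invented for illustration, and nothing here addresses the parsing or scaling problems the paper targets.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a trained encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus standing in for millions of indexed papers.
corpus = [
    "protein folding simulation with molecular dynamics",
    "retrieval augmented generation for scientific question answering",
    "climate model downscaling with neural networks",
]
top = retrieve("scientific question answering with retrieval", corpus, k=1)
print(top)
```

The retrieved passages would then be placed in the LLM prompt. At the scale of millions of documents, this brute-force scan would be replaced by an approximate nearest-neighbor index.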

Whisper: OpenAI's high-performance ASR

Introducing Whisper (openai.com)
Robust Speech Recognition via Large-Scale Weak Supervision
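- We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When multilingual and multitask supervision is scaled to 680,000 hours, the resulting models generalize well to standard benchmarks. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
- Code: openai/whisper (github.com)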

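OpenAI's speech recognition system. It is trained on an extremely large dataset (680,000 hours in total: 438,000 hours with both audio and transcript in English, 126,000 hours with non-English audio, and 117,000 hours with both audio and transcript in a non-English language; 98 languages in all) and performs very well. Its Japanese recognition is also strong, and it is impressive that the code and models are public.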
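To put the dataset breakdown in proportion, the following sketch converts the hour counts quoted above into shares. The bucket names are shorthand invented for this example; only the hour figures come from the post.

```python
# Hours per bucket as quoted in the post; the key names are illustrative shorthand.
HOURS = {
    "en_audio_en_text": 438_000,
    "non_en_audio": 126_000,
    "non_en_audio_non_en_text": 117_000,
}

total = sum(HOURS.values())  # 681,000, consistent with the rounded 680,000 figure
shares = {name: hours / total for name, hours in HOURS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Roughly a third of the training audio is non-English, which fits the strong multilingual recognition noted above.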

It can also translate from many languages into English with reasonable quality, which hints at the potential of textless NLP.

AlphaCode, which generates competitive-programming-level code, and an AI that solves Math Olympiad problems

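AI systems that can handle hard problems such as automatic code generation and mathematical problem solving are on the rise. Both results feel like a glimpse of the future, and are also a little frightening.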

DeepMind announced AlphaCode, which can automatically generate code at a level that solves competitive programming problems
- Competitive programming with AlphaCode | DeepMind
OpenAI announced an AI (with greatly improved performance) that solves Math Olympiad problems
- Solving (Some) Formal Math Olympiad Problems (openai.com)
