September 29, 2025 – Introducing the latest arXiv papers

Hunyuan3D-Omni, Qwen3-Omni, LongCat-Flash-Thinking, EmbeddingGemma, Logics-Parsing

Development of open models is very active, and Qwen3-Omni seemed to come up often last week. On arXiv, promising models other than Qwen3-Omni are also being announced one after another.

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets [34.7]
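Hunyuan3D-Omni is a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. Our model unifies all signals in a single cross-modal architecture. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness in production.
Reference translation of the paper (metadata) (Thu, 25 Sep 2025 14:39:17 GMT)
An implementation focused on 3D.
The repository is GitHub – Tencent-Hunyuan/Hunyuan3D-Omni: Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets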

Qwen3-Omni Technical Report [105.1]
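Qwen3-Omni is a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and is particularly strong on audio tasks. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages.
Reference translation of the paper (metadata) (Mon, 22 Sep 2025 13:26:24 GMT)
A multimodal model in the Qwen family.
The repository is GitHub – QwenLM/Qwen3-Omni: Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.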

LongCat-Flash-Thinking Technical Report [116.8]
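LongCat-Flash-Thinking is an open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process. LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks.
Reference translation of the paper (metadata) (Tue, 23 Sep 2025 10:25:48 GMT)
An MoE-based LRM that claims SoTA among OSS models.
The repository is meituan-longcat/LongCat-Flash-Thinking · Hugging Face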

EmbeddingGemma: Powerful and Lightweight Text Representations [42.4]
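EmbeddingGemma is a new lightweight, open text embedding model based on the Gemma 3 language model family. It uses a spread-out regularizer to improve model robustness and expressiveness. To promote further research, we release EmbeddingGemma to the community.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 17:56:51 GMT)
A small yet powerful embedding model.
The repository is EmbeddingGemma – a google Collection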

Logics-Parsing Technical Report [9.0]
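We propose Logics-Parsing, an end-to-end LVLM-based model augmented with reinforcement learning. The model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading-order inference. LogicsParsingBench is a curated set of 1,078 page-level PDF images spanning nine major categories and more than twenty sub-categories.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 04:54:37 GMT)
An LVLM that is effective for Document Understanding.
The repository is GitHub – alibaba/Logics-Parsing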

Video models are zero-shot learners and reasoners [33.7]
Veo 3 can solve a wide variety of tasks it was not explicitly trained for. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 17:17:27 GMT)
The paper states that "We demonstrate that Veo 3 can solve a broad variety of tasks it wasn’t explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more." and that "Veo 3 shows emergent zero-shot perceptual abilities well beyond the training task. Just like LLMs replaced task-specific NLP models, video models will likely replace most bespoke models in computer vision—once they become sufficiently cheap and reliable." This feels very futuristic, yet at the same time some aspects are hard to grasp intuitively.
The repository is Video models are zero-shot learners and reasoners

State Space Models over Directed Graphs [38.8]
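We propose an innovative approach that sequentializes directed graphs via k-hop ego graphs. This is the first systematic extension of state space models to the field of directed graph learning. We also develop DirGraphSSM, a novel directed graph neural network architecture.
Reference translation of the paper (metadata) (Wed, 17 Sep 2025 06:39:18 GMT)
An application of state space models to graph structures: "In this paper, we first propose DirGraphSSM, a novel graph state space model designed for large-scale sparse directed graph learning. Through two innovative components, namely DirEgo2Token and Digraph SSM Scan."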
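To make the idea of turning a directed graph into sequences via k-hop ego graphs more concrete, here is a minimal sketch. It is not the paper's DirEgo2Token: the function name `k_hop_ego_sequence`, the choice to follow outgoing edges, and the "farthest node first, root last" ordering are all assumptions made only for illustration.

```python
from collections import deque

def k_hop_ego_sequence(successors, root, k):
    """Return the nodes of the k-hop ego graph of `root` as a sequence.

    Assumptions (illustrative, not from the paper): edges are followed in
    their forward direction, and nodes are ordered from farthest to nearest
    so that the root comes last in the sequence.
    """
    dist = {root: 0}          # hop distance from the root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == k:   # do not expand beyond k hops
            continue
        for nxt in successors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    # Farthest nodes first, ties broken by node id, root at the end.
    return sorted(dist, key=lambda n: (-dist[n], n))

# Toy directed graph given as an adjacency list of outgoing edges.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [0]}
print(k_hop_ego_sequence(graph, 0, 2))  # [3, 1, 2, 0]
```

Each node gets one such sequence, which a sequence model such as an SSM could then scan. Node 4 is left out here because it is three hops away from node 0.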

Causal Time Series Generation via Diffusion Models [97.0]
We introduce causal time series generation as a new family of TSG tasks, formalized within Pearl's causal ladder. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework. Experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity.
Reference translation of the paper (metadata) (Thu, 25 Sep 2025 07:34:46 GMT)
A proposal of a causality-based method for generating time series data: "Causal Expansion of Conditional TSG Paradigm. We formalize causal time series generation as an extension of conditional TSG along Pearl’s ladder, introducing two tasks beyond association, i.e., interventional (Int-TSG) and counterfactual (CF-TSG), to open up richer generative capabilities aligned with real-world decision-making needs."
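The difference between the three rungs of Pearl's ladder mentioned above can be seen in a toy example. The sketch below is not CaTSG and involves no time series or diffusion model; it assumes a hypothetical structural causal model with a binary confounder C, a treatment X that usually copies C, and an outcome Y = X + 2C, and computes each quantity by exact enumeration.

```python
from fractions import Fraction

# Hypothetical toy SCM (all numbers are assumptions for illustration):
#   C ~ Bernoulli(1/2), X = C with probability 9/10, Y = X + 2*C
P_C1 = Fraction(1, 2)
P_FOLLOW = Fraction(9, 10)

def outcome(x, c):
    return x + 2 * c

def joint():
    """Yield (probability, c, x) under the observational distribution."""
    for c in (0, 1):
        p_c = P_C1 if c == 1 else 1 - P_C1
        for x in (0, 1):
            p_x = P_FOLLOW if x == c else 1 - P_FOLLOW
            yield p_c * p_x, c, x

def associational_mean(x_obs):
    """Rung 1: E[Y | X = x_obs], conditioning on what was observed."""
    rows = [(p, c) for p, c, x in joint() if x == x_obs]
    total = sum(p for p, _ in rows)
    return sum(p * outcome(x_obs, c) for p, c in rows) / total

def interventional_mean(x_do):
    """Rung 2: E[Y | do(X = x_do)], X is set regardless of C."""
    return sum((P_C1 if c == 1 else 1 - P_C1) * outcome(x_do, c)
               for c in (0, 1))

def counterfactual_outcome(c_observed, x_alt):
    """Rung 3: keep this unit's C, replace X by x_alt, recompute Y."""
    return outcome(x_alt, c_observed)

print(associational_mean(1))         # 14/5
print(interventional_mean(1))        # 2
print(counterfactual_outcome(1, 1))  # 3
```

Conditioning on X = 1 gives a larger mean than intervening with do(X = 1) because observing X = 1 also makes C = 1 more likely. The counterfactual answers a question about one specific unit whose C is already known.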