LRM – Introducing the Latest arXiv Papers

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.6]
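GLM-4.5 is an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding tasks. Both GLM-4.5 (355B parameters) and GLM-4.5-Air (106B parameters) are released to advance research in reasoning and agentic AI systems.
Paper reference translation (metadata) (Fri, 08 Aug 2025 17:21:06 GMT)
The paper for GLM-4.5 (GLM-4.5, Step-3, Falcon-H1, HunyuanWorld – Introducing the Latest arXiv Papers). For its level of performance it has few parameters, especially active parameters. Nothing definite can be said without a detailed comparison, but how it stacks up against GPT-OSS is the point of interest.
The repository is GitHub – zai-org/GLM-4.5: GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai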
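To make the "few active parameters" remark concrete, here is a minimal sketch that computes the share of parameters an MoE model activates per token. The 355B/32B figures come from the abstract above, and the 1T/32B figures for Kimi K2 come from another entry on this page. The names `active_fraction` and `MODELS` are hypothetical, and the ratio is only a rough proxy for per-token compute.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token (both in billions)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# (total parameters, activated parameters), in billions, as quoted on this page.
MODELS = {
    "GLM-4.5": (355.0, 32.0),
    "Kimi K2": (1000.0, 32.0),
}

for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

A lower share means more capacity is stored than is used for any single token, which is the trade-off the commentary points at.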

GPT-5, GPT-OSS, Claude Opus 4.1

Last week brought a run of major announcements: GPT-5 (GPT-5 and the new era of work | OpenAI), gpt-oss 20B and 120B (Introducing gpt-oss | OpenAI), Claude Opus 4.1 (Claude Opus 4.1 \ Anthropic), and DeepMind Genie 3 (Genie 3: A new frontier for world models – Google DeepMind).

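GPT-5 firmly takes SoTA on benchmarks and is a very strong model. On the other hand, it also gives the impression of falling short of expectations: the gap to Claude 4.1 Opus, announced shortly before, was not large (the system card's note that "All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure." (gpt5-system-card-aug7.pdf) is also a concern), and it loses to Gemini 2.5 Pro on the Japanese version of Chatbot Arena (Gemini 2.5 Pro also has the higher win rate in head-to-head matchups). In addition, even GPT-5 could not read invented kanji (Pixels, Patterns, but No Poetry: To See The World like Humans – Introducing the Latest arXiv Papers). The pricing is strategic, and the model posts truly frontier scores on Measuring AI Ability to Complete Long Tasks – METR, so judging where it really stands will take a little more time.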

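GPT-OSS is a high-performing open model released under the Apache 2.0 license. The release of a model that appears to be at a practical level is highly significant. From GPT-2 to gpt-oss: Analyzing the Architectural Advances shows how many improvements have been made over time, even within what is called a transformer.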

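Companies other than OpenAI are also shipping very high-performing models such as Claude 4.1 Opus and Gemini 2.5 Pro, and Chinese models such as DeepSeek, Kimi, and Hunyuan keep getting stronger. My impression is that OpenAI's sole dominance is over, but progress continues.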

GLM-4.5, Step-3, Falcon-H1, HunyuanWorld

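Unfortunately, GPT-5 was not announced last week. The model to watch is GLM-4.5 (zai-org/GLM-4.5: GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai), an MoE model that rivals commercial models. Its largest configuration, 355B-A32B, appears to compete with frontier models such as o3, Grok4, and Claude 4 Opus. StepFun's Step-3 focuses on the trade-off between active parameters and decoding cost and has high inference efficiency. It is also a VLM and performs well in that respect too. The Falcon-H1 series offers transformer-Mamba hybrids at a range of model sizes. It is fascinating that open models like these are coming from such a variety of companies and research institutions. Whether GPT-5 can pull away from them is worth watching.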

On a different axis, Tencent Hunyuan announced HunyuanWorld-1.0, a model that can create 3D worlds (腾讯混元3D). Happily, this one is also an open model.

Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding [144.7]
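Large language models (LLMs) suffer from low hardware efficiency during decoding. This paper introduces Step-3, a hardware-aware model-system co-design optimized for minimizing decoding cost. Step-3 significantly reduces theoretical decoding cost compared with models such as DeepSeek-V3 and Qwen3 MoE 235B.
Paper reference translation (metadata) (Fri, 25 Jul 2025 16:53:13 GMT)
The repositories are stepfun-ai/Step3 and Step3 – a stepfun-ai Collection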

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance [7.3]
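Falcon-H1 is a new series of large language models (LLMs) with a hybrid architecture optimized for both high performance and efficiency. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications.
Paper reference translation (metadata) (Wed, 30 Jul 2025 07:55:33 GMT)
A model released together with a detailed report.
The repository is tiiuae/Falcon-H1: All information and news with respect to Falcon-H1 series; the models are at tiiuae (Technology Innovation Institute)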

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels [31.0]
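HunyuanWorld 1.0 is a novel framework that combines the best of both worlds to generate immersive, explorable, and interactive 3D scenes from text and image conditions. The approach has three main advantages: 1) 360-degree immersive experiences via panoramic world proxies, 2) mesh export for seamless compatibility with existing computer graphics pipelines, and 3) disentangled object representations for greater interactivity.
Paper reference translation (metadata) (Tue, 29 Jul 2025 13:43:35 GMT)
The repository is Tencent-Hunyuan/HunyuanWorld-1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels with Hunyuan3D World Model; the model is at tencent/HunyuanWorld-1 · Hugging Face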

Kimi K2: Open Agentic Intelligence [118.8]
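Kimi K2 is a large language model with 32 billion activated parameters and 1 trillion total parameters. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spikes. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models.
Paper reference translation (metadata) (Mon, 28 Jul 2025 05:35:43 GMT)
The Kimi K2 paper is out. Whether it counts as an LLM or an LRM seems open to debate. It has many interesting details, such as the use of the MuonClip optimizer and of synthetic data.
The repository is moonshotai/Kimi-K2-Instruct · Hugging Face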

The Impact of Language Mixing on Bilingual LLM Reasoning

The Impact of Language Mixing on Bilingual LLM Reasoning [4.5]
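This paper studies language switching in Chinese-English bilingual reasoning models. Enforcing monolingual decoding reduces accuracy on math reasoning tasks by 5.6 percentage points. A lightweight probe can be trained to predict whether a potential language switch would help or harm reasoning.
Paper reference translation (metadata) (Mon, 21 Jul 2025 17:56:09 GMT)
On the problem, often seen in LRMs, of multiple languages mixing within the reasoning process, the paper notes that "Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning." It also states that "Altogether, these results suggest that language mixing is not a random artifact of multilingual training but a deliberate strategy that LLMs adopt to improve complex reasoning."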

Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2

There was a lot of news about Chinese models: Qwen3 Coder, at the level of Claude 4 Sonnet (QwenLM/Qwen3-Coder: Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud.); Intern S1, a powerful multimodal LRM built from a 235B MoE language model (Qwen3) plus a 6B vision encoder (InternViT) (InternLM/Intern-S1); and the release of the Kimi K2 technical report (Kimi-K2/tech_report.pdf at main · MoonshotAI/Kimi-K2). Competition is fierce, with Qwen3-Instruct-2507 (QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.) claiming to surpass Kimi K2.

In audio, StepFun released the Step-Audio 2 Technical Report and TeleAI released TECHNICAL REPORT OF TELECHAT2, TELECHAT2.5 AND T1. Both claim excellent performance. I am also very interested in robotics papers such as GR-3.

And GPT-5 should be announced very soon, so progress looks set to continue.

Step-Audio 2 Technical Report [108.0]
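Step-Audio 2 is an end-to-end multimodal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
Paper reference translation (metadata) (Thu, 24 Jul 2025 11:13:12 GMT)
The repository is stepfun-ai/Step-Audio2: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.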

Technical Report of TeleChat2, TeleChat2.5 and T1 [40.9]
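This report introduces the latest TeleChat models: TeleChat2, TeleChat2.5, and T1. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies.
Paper reference translation (metadata) (Thu, 24 Jul 2025 01:00:48 GMT)
The repository is Tele-AI/TeleChat2: 星辰语义大模型TeleChat2是由中国电信人工智能研究院研发训练的大语言模型，是首个完全国产算力训练并开源的千亿参数模型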

GR-3 Technical Report [21.9]
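GR-3 is a large-scale vision-language-action (VLA) model. It shows exceptional ability to generalize to novel objects, environments, and instructions involving abstract concepts. GR-3 excels at long-horizon and dexterous tasks, including those that require bi-manual manipulation and mobile movement.
Paper reference translation (metadata) (Mon, 21 Jul 2025 10:54:13 GMT)
The project site is ByteDance Seed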

Apple Intelligence Foundation Language Models: Tech Report 2025 [246.0]
We introduce two foundation language models that power Apple Intelligence features across Apple devices and services. Both models are trained on large-scale multilingual and multimodal datasets sourced through responsible web crawling. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning.
Paper reference translation (metadata) (Thu, 17 Jul 2025 23:37:19 GMT)
The Apple Intelligence technical report has been published on arXiv.
The authors' own assessment is: "We found that AFM on-device model performs better than Qwen-2.5-3B, Gemma-3-4B and Gemma-3n-E4B on MMLU/MMMLU, but it lags slightly behind Gemma-3n-E4B on MGSM. AFM on-device model performs lower than the larger Qwen-3-4B model. AFM server models lag slightly to LLaMA 4 Scout, whose total size and active number of parameters are comparable, but has a bigger gap to larger models such as Qwen-3-235B and the proprietary GPT-4o."

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes [42.3]
EXAONE 4.0 integrates a non-reasoning mode and a reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. The EXAONE 4.0 series consists of a mid-size 32B model optimized for high performance and a small 1.2B model designed for on-device applications.
Paper reference translation (metadata) (Tue, 15 Jul 2025 15:24:51 GMT)
An LLM/LRM hybrid model from LG. According to the paper, "Unified Mode Training In the combined dataset, the NON-REASONING data primarily consists of diverse tasks, while the REASONING data is centered on Math and Code domains. Rather than fine-tuning the two modes sequentially, we combine both modes and train them together." One interesting step in the construction process: "After unified NON-REASONING/REASONING mode fine-tuning, to address domain imbalance, we perform a second round of training using high-quality REASONING data from the Code and Tool Use domains, reusing these samples to further enhance the performance."
The repository is LGAI-EXAONE (LG AI Research)

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs [45.8]
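Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad range of tasks. However, they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute strategies, which aim to improve the computational efficiency of LLM reasoning.
Paper reference translation (metadata) (Wed, 02 Jul 2025 18:27:42 GMT)
A survey that describes itself as follows: "This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a two-tiered taxonomy that distinguishes between L1 controllability—methods that operate under fixed compute budgets—and L2 adaptiveness—methods that dynamically scale inference based on input difficulty or model confidence."
Hybrid approaches are also popular in commercial models, and I imagine this is an area where vendors are struggling in various ways.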
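The difference between the two tiers in the quoted taxonomy can be sketched in a few lines. This is a toy illustration only: the function names, the token limits, and the linear scaling rule are all assumptions made for this example and are not taken from the survey.

```python
def l1_fixed_budget(max_tokens: int) -> int:
    """L1 controllability: the caller fixes the reasoning budget up front."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return max_tokens


def l2_adaptive_budget(difficulty: float,
                       min_tokens: int = 128,
                       max_tokens: int = 4096) -> int:
    """L2 adaptiveness: scale the budget with an estimated difficulty in [0, 1].

    The linear interpolation is a placeholder for whatever signal a real
    method uses (input difficulty, model confidence, and so on).
    """
    clamped = min(max(difficulty, 0.0), 1.0)
    return int(min_tokens + clamped * (max_tokens - min_tokens))


print(l1_fixed_budget(1024))      # same budget for every query
print(l2_adaptive_budget(0.1))    # easy query: small budget
print(l2_adaptive_budget(0.9))    # hard query: large budget
```

The point of the contrast is that L1 answers "how much may the model think" while L2 answers "how much should it think for this input".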

Predicting thinking time in Reasoning models [42.6]
Reasoning models produce long, hidden chains of thought. Users have little insight into how much time the model will spend reasoning before it returns an answer.
Paper reference translation (metadata) (Sun, 29 Jun 2025 15:01:01 GMT)
A report on predicting reasoning time in LRMs.
"In this paper, we explore methods for online prediction of thinking time in reasoning models. Our experiments demonstrate that current models encode a notion of progress in their internal representations, with an mlp probe achieving 45% accuracy over 10 classes, moreover the errors appear highly local (MAE 1)."
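To show how the two quoted metrics relate, the sketch below computes accuracy and mean absolute error (MAE) over 10 progress classes. The binning rule in `progress_bin` (equal-width bins over the fraction of thinking completed) is an assumption for illustration, not necessarily how the paper defines its classes, and the labels and predictions are made-up toy data.

```python
from typing import Sequence

NUM_CLASSES = 10  # the quoted result reports accuracy over 10 classes


def progress_bin(progress: float, num_classes: int = NUM_CLASSES) -> int:
    """Map a progress fraction in [0, 1] to a class index (assumed binning)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    return min(int(progress * num_classes), num_classes - 1)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that hit the exact class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average distance, in classes, between prediction and label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: most misses land in a neighbouring class.
y_true = list(range(NUM_CLASSES))
y_pred = [0, 2, 2, 4, 3, 5, 7, 7, 9, 8]
print("accuracy:", accuracy(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
```

A modest exact-class accuracy combined with a small MAE is what "highly local" errors look like: wrong predictions are usually off by only a class or so.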

Grok 4, Phi4-mini-Flash-Reasoning, SmolLM3, Kimi-K2, T5Gemma

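Many models were announced again last week, but the one to watch is probably Grok 4 (Grok 4 | xAI), which claims strong performance on a variety of benchmarks. Its 44.4% on Humanity’s Last Exam looks very strong.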

Among open models, interesting announcements came one after another: Phi4-mini-Flash-Reasoning, with its interesting model structure (Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning | Microsoft Azure Blog; the paper is covered below); Hugging Face's small model SmolLM3 (SmolLM3, GitHub – huggingface/smollm: Everything about the SmolLM and SmolVLM family of models); and Kimi-K2, a very high-performing model with an extreme MoE configuration of 1T total parameters and 32B active (GitHub – MoonshotAI/Kimi-K2: Kimi K2 is the large language model series developed by Moonshot AI team, Kimi K2). T5Gemma: A new collection of encoder-decoder Gemma models – Google Developers Blog also deserves attention. There are likely many tasks where the strengths of architectures that are not decoder-only show through.

Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation [52.2]
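We study a novel problem: adapting decoder-only large language models into encoder-decoder models. We argue that adaptation not only inherits the capabilities of decoder-only LLMs but also reduces the demand for computation. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance and substantially better fine-tuning performance than their decoder-only counterparts.
Paper reference translation (metadata) (Tue, 08 Apr 2025 17:13:41 GMT)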

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.5]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. Building on it, we present a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
Paper reference translation (metadata) (Wed, 09 Jul 2025 07:27:00 GMT)
The paper for Phi4-mini-Flash-Reasoning.
"Our decoder-hybrid-decoder architecture taking Samba [RLL+25] as the self-decoder. Gated Memory Units (GMUs) are interleaved with the cross-attention layers in the cross-decoder to reduce the decoding complexity. As in YOCO [SDZ+24], the full attention layer only need to compute the KV cache during prefilling with the self-decoder, leading to linear computation complexity for the prefill stage." The architecture is advantageous in terms of computational cost and looks well suited to LRMs.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities [1584.5]
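Gemini 2.5 Pro is our most capable model, achieving SoTA performance on frontier coding and reasoning benchmarks. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements. Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost.
Paper reference translation (metadata) (Mon, 07 Jul 2025 17:36:04 GMT)
The Gemini 2.5 paper is also out. The number of co-authors is remarkable (more than 3,300).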

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam? [47.2]
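This paper introduces X-Master, a tool-augmented reasoning agent that emulates human researchers. X-Masters sets a new state-of-the-art record on Humanity’s Last Exam with a score of 32.1%.
Paper reference translation (metadata) (Mon, 07 Jul 2025 17:50:52 GMT)
A report of reaching 32.1% on Humanity’s Last Exam with an agentic approach plus DeepSeek-R1-0528. I am curious what the score would be with Grok 4 as the base model.
The repository is GitHub – sjtu-sai-agents/X-Master: Official implementation of X-Master, a general-purpose tool-augmented reasoning agent.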

Frontier LLMs Still Struggle with Simple Reasoning Tasks

Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.5]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons.
Paper reference translation (metadata) (Wed, 09 Jul 2025 22:22:49 GMT)
The authors build tasks that are simple yet hard for LLMs/LRMs to solve: "By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g., statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts)."
They point out that "Similarly to other recent works, our results suggest that LLMs mimic training data rather than performing true reasoning, making it relatively easy to find out-of-distribution problems where the models fail, and this problem is also present at the newest thinking models. This suggests that users remain careful when relying on the output of LLMs." As I also felt with CatAttack below, it is worth keeping in mind that LLMs/LRMs differ considerably from human abilities.
The repository is reportedly https://github.com/google-deepmind/unpuzzles_and_simple_reasoning/
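As a rough idea of what "procedurally generated, with changeable parameters" can mean for the counting task, here is a minimal sketch. It is not the authors' generator from the repository above: the function name, the vocabulary, and the prompt wording are all hypothetical. The `length` argument plays the role of the document-length parameter that increases the amount of work without changing the underlying difficulty.

```python
import random
from typing import Sequence, Tuple


def make_counting_task(length: int,
                       target: str = "cat",
                       distractors: Sequence[str] = ("dog", "bird", "fish"),
                       seed: int = 0) -> Tuple[str, int]:
    """Build a word-counting question and its ground-truth answer."""
    if length <= 0:
        raise ValueError("length must be positive")
    rng = random.Random(seed)  # seeded so that each task is reproducible
    vocabulary = (target, *distractors)
    words = [rng.choice(vocabulary) for _ in range(length)]
    question = (
        f"How many times does the word '{target}' appear in the list below?\n"
        + " ".join(words)
    )
    return question, words.count(target)


# Same task family at two sizes: only the amount of counting changes.
for n in (20, 200):
    question, answer = make_counting_task(length=n, seed=42)
    print(f"length={n}: expected answer {answer}")
```

Because the answer is computed alongside the prompt, model outputs can be scored exactly at any length.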

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models [25.1]
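This paper investigates the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers. We propose CatAttack, an automated iterative attack pipeline that generates triggers on a weaker, less expensive proxy model. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs.
Paper reference translation (metadata) (Mon, 03 Mar 2025 18:10:54 GMT)
An interesting attack: "For example, appending, Interesting fact: cats sleep most of their lives, to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns." On the other hand, there are also reports that noisy (irrelevant) examples help improve RAG, so the behavior is truly mysterious.
The repository is collinear-ai/cat-attack-adversarial-triggers · Datasets at Hugging Face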
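The "query-agnostic" part of the attack is easy to picture: the same suffix is appended to every problem, whatever its content. The sketch below shows only that step, using the trigger sentence quoted above. The helper names and the error rates in the usage lines are hypothetical placeholders, not results from the paper, and the trigger search pipeline itself is not reproduced here.

```python
TRIGGER = "Interesting fact: cats sleep most of their lives"


def add_trigger(problem: str, trigger: str = TRIGGER) -> str:
    """Append the same trigger to any problem, independent of its content."""
    return f"{problem.rstrip()} {trigger}"


def relative_error_increase(base_error: float, attacked_error: float) -> float:
    """How many times more often the model is wrong once the trigger is added."""
    if base_error <= 0:
        raise ValueError("base_error must be positive")
    return attacked_error / base_error


print(add_trigger("What is 17 * 23?"))

# Hypothetical error rates, only to show how "more than doubling" is checked.
ratio = relative_error_increase(base_error=0.02, attacked_error=0.05)
print(f"errors increased {ratio:.1f}x; more than doubled: {ratio > 2}")
```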

The Power of Noise: Redefining Retrieval for RAG Systems [19.4]
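Retrieval-Augmented Generation (RAG) has emerged as a way to extend large language models beyond their pre-trained knowledge. We focus on the type of passages that the IR system within a RAG solution should retrieve.
Paper reference translation (metadata) (Wed, 1 May 2024 08:15:07 GMT)
"Finally, and even more surprisingly, random, noisy documents are actually helpful in increasing the accuracy of these systems when correctly positioned within a prompt." It is interesting that irrelevant examples turn out to be effective.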

ERNIE4.5, Kwai Keye-VL, Ovis-U1, GLM-4.1V-Thinking, Confucius3-Math

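Alongside the arrival of ERNIE4.5 (GitHub – bigdavidone/ERNIE4_5: The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.), a variety of open models have been released. Many of them approach the performance of commercial models through efficient architectures and a degree of specialization.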

ERNIE 4.5 Technical Report
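This report introduces ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The family adopts a Mixture-of-Experts (MoE) architecture with 47B and 3B active parameters, strengthening multimodal understanding while improving performance on text-related tasks. All models are released under Apache 2.0, and an open-source development toolkit is also provided to support research and development. The paper is at Publication | ERNIE Blog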

Kwai Keye-VL Technical Report [80.5]
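We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a large-scale, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate the approach, we conduct extensive evaluations showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
Paper reference translation (metadata) (Wed, 02 Jul 2025 17:57:28 GMT)
The project site is Kwai Keye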

Ovis-U1 Technical Report [17.2]
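We introduce Ovis-U1, a unified model that integrates multimodal understanding, text-to-image generation, and image editing. In text-to-image generation, it scores 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. In image editing, it achieves 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively.
Paper reference translation (metadata) (Sun, 29 Jun 2025 00:40:17 GMT)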
GitHub – AIDC-AI/Ovis-U1: An unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [112.5]
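GLM-4.1V-9B-Thinking is a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. To unlock the model's full potential, we propose reinforcement learning with curriculum sampling. The open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size.
Paper reference translation (metadata) (Wed, 02 Jul 2025 15:53:43 GMT)
A multimodal model in the GLM series. High performance.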
GitHub – THUDM/GLM-4.1V-Thinking: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning [4.6]
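Confucius3-Math is an open-source large language model with 14B parameters that runs efficiently on a single consumer-grade GPU. This report shares the development recipe, the challenges encountered, and the techniques developed to overcome them.
Paper reference translation (metadata) (Wed, 25 Jun 2025 10:49:23 GMT)
An example of achieving high performance through a degree of specialization.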
GitHub – netease-youdao/Confucius3-Math
