LLM – Page 5 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure [25.0]
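We demonstrate the existence of an implicit task-solving -> translation pipeline for generation. We test this hypothesis on a word translation task across 108 language pairs. We find that a significant portion of overall failures stems from translation failure.
Reference translation of the paper (metadata) (Sat, 28 Jun 2025 02:09:21 GMT)
The paper points out: "We find that a significant portion of overall failures indeed stems from translation failure, or the model’s inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages."
The behavior itself seems plausible in light of Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in? – arXiv最新論文の紹介, but given that the intermediate language is presumably shaped by whichever language dominated training, I am not entirely sure that is a good thing.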
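The split between task-solving failure and translation failure quoted above can be illustrated with a small bookkeeping sketch. It assumes each example comes with two booleans: whether the intermediate concept was solved correctly and whether the final target-language output was correct. How the intermediate concept is actually read out is up to the paper; the function names and the sample records below are hypothetical.

```python
from collections import Counter

def classify(intermediate_correct: bool, final_correct: bool) -> str:
    """Label one example by where (if anywhere) the pipeline failed."""
    if final_correct:
        return "success"
    if intermediate_correct:
        # Concept was solved, but not rendered in the target language.
        return "translation_failure"
    return "task_solving_failure"

def failure_breakdown(records):
    """Return per-label counts and the share of failures due to translation."""
    counts = Counter(classify(i, f) for i, f in records)
    failures = counts["translation_failure"] + counts["task_solving_failure"]
    share = counts["translation_failure"] / failures if failures else 0.0
    return counts, share

# Hypothetical (intermediate_correct, final_correct) pairs, not data from the paper.
records = [(True, True), (True, False), (True, False), (False, False), (True, True)]
counts, share = failure_breakdown(records)
print(counts, round(share, 2))
```

Computing the share separately per target language would be the natural next step for checking the low-resource claim.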

ERNIE4.5, Kwai Keye-VL, Ovis-U1, GLM-4.1V-Thinking, Confucius3-Math

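Alongside the arrival of ERNIE 4.5 (GitHub – bigdavidone/ERNIE4_5: The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.), a variety of other open models have been released. Many of them approach the performance of commercial models through efficient architectures and a degree of specialization.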

ERNIE 4.5 Technical Report
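This report introduces ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The family adopts a Mixture-of-Experts (MoE) architecture with 47B and 3B active parameters, strengthening multimodal understanding while improving performance on text-related tasks. All models are released under Apache 2.0, and an open-source development toolkit is provided to support research and development. Paper: Publication | ERNIE Blog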

Kwai Keye-VL Technical Report [80.5]
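We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a large, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate the approach, we conduct extensive evaluations showing that Keye-VL achieves state-of-the-art results on public video benchmarks while remaining highly competitive on general image-based tasks.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 17:57:28 GMT)
The project site is Kwai Keye.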

Ovis-U1 Technical Report [17.2]
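We introduce Ovis-U1, a unified model that integrates multimodal understanding, text-to-image generation, and image editing. In text-to-image generation it scores 83.72 on DPG-Bench and 0.89 on GenEval. In image editing it achieves 4.00 on ImgEdit-Bench and 6.42 on GEdit-Bench-EN.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 00:40:17 GMT)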
GitHub – AIDC-AI/Ovis-U1: An unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [112.5]
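GLM-4.1V-9B-Thinking is a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. To unlock the model's full potential, we propose reinforcement learning with curriculum sampling. The open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 15:53:43 GMT)
A multimodal model in the GLM series. High performance.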
GitHub – THUDM/GLM-4.1V-Thinking: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning [4.6]
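Confucius3-Math is an open-source large language model with 14B parameters that runs efficiently on a single consumer-grade GPU. In this report we share our development recipe, the challenges we encountered, and the techniques we developed to overcome them.
Reference translation of the paper (metadata) (Wed, 25 Jun 2025 10:49:23 GMT)
An example of achieving high performance through a degree of specialization.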
GitHub – netease-youdao/Confucius3-Math

LEDOM: An Open and Fundamental Reverse Language Model

LEDOM: An Open and Fundamental Reverse Language Model [100.5]
We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants. This paper presents the reverse language model as a foundation model across general tasks, accompanied by a set of interesting examples and insights. We also introduce Reverse Reward, a new application built on LEDOM.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 03:52:00 GMT)
A reverse language model, described as: "We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction." An interesting idea.
The observation that "Given a known answer and the corresponding supporting reasons, LEDOM can produce natural, well-formed questions. It is helpful for automatically creating QA datasets and educational content, where starting from answers or known concepts is often more practical than designing questions manually." is also interesting, and judging from "We propose Reverse reward, a novel strategy that uses LEDOM to guide forward model outputs via reranking, leading to consistent performance improvements in mathematical reasoning.", it appears to help on some tasks.
As with the B in BERT, bidirectionality can be useful, and my impression is that it should work well as a double check.
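To make "previous token prediction" and reranking concrete, here is a minimal sketch. It is not LEDOM's implementation: the equal-weight score mix, the `alpha` parameter, and the toy score tables are assumptions for illustration, and the scores stand in for log-probabilities that a forward and a reverse model would supply.

```python
def reverse_training_pairs(tokens):
    """Build (context, target) pairs for previous-token prediction.

    The sequence is reversed, so each target is the token that
    precedes the context in the original order.
    """
    rev = tokens[::-1]
    return [(rev[:i], rev[i]) for i in range(1, len(rev))]

def rerank(candidates, forward_score, reverse_score, alpha=0.5):
    """Sort candidates by a weighted mix of forward and reverse scores.

    alpha is a hypothetical mixing weight; alpha=0 ignores the reverse model.
    """
    def combined(c):
        return (1 - alpha) * forward_score(c) + alpha * reverse_score(c)
    return sorted(candidates, key=combined, reverse=True)

pairs = reverse_training_pairs(["the", "cat", "sat"])
# pairs == [(["sat"], "cat"), (["sat", "cat"], "the")]

# Toy log-probability tables standing in for real model scores.
fwd = {"A": -1.0, "B": -1.2, "C": -3.0}
rev = {"A": -4.0, "B": -1.0, "C": -2.5}
ranked = rerank(["A", "B", "C"], fwd.get, rev.get)
print(ranked)  # ['B', 'A', 'C']
```

Candidate A is the forward model's favorite, but the reverse score pulls B to the top, which is the kind of double check mentioned above.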

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games, How large language models judge and influence human cooperation

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games [87.6]
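Understanding how large language models balance self-interest and collective well-being is a key challenge for ensuring alignment, robustness, and safe deployment. We adapt a public goods game with institutional choice from behavioral economics, which lets us observe how different LLMs navigate social dilemmas. Surprisingly, reasoning LLMs such as the o1 series struggle considerably with cooperation.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 15:02:47 GMT)
An interesting result: "our findings reveal a surprising pattern: while traditional LLMs demonstrate robust cooperation comparable to human outcomes, reasoning-enhanced models frequently struggle to sustain cooperation." Whether this is because they are reasoning models, or a matter of model size or training outcome, is something I am very curious about.
The repository is GitHub – davidguzmanp/SanctSim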
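For readers unfamiliar with the setup, the incentive to free-ride can be seen in the standard linear public goods game. The sketch below uses the textbook payoff rule (keep what you do not contribute, plus an equal share of the multiplied pool); the endowment of 20 and multiplier of 2.0 are illustrative assumptions, and the sanctioning institutions studied in the paper are not modeled.

```python
def payoffs(contributions, endowment=20.0, multiplier=2.0):
    """Payoff of each player in one round of a linear public goods game.

    Each player keeps (endowment - contribution) and receives an equal
    share of the pooled contributions multiplied by `multiplier`.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

full_cooperation = payoffs([20, 20, 20, 20])  # [40.0, 40.0, 40.0, 40.0]
one_free_rider = payoffs([0, 20, 20, 20])     # [50.0, 30.0, 30.0, 30.0]
full_defection = payoffs([0, 0, 0, 0])        # [20.0, 20.0, 20.0, 20.0]
print(full_cooperation, one_free_rider, full_defection)
```

With a multiplier between 1 and the number of players, withholding always raises an individual's payoff (50 versus 40 here) while lowering the group total, which is the dilemma the models are placed in.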

How large language models judge and influence human cooperation [82.1]
We evaluate how state-of-the-art language models judge cooperative behavior. We observe remarkable agreement when they evaluate cooperation with good opponents. We show that the differences between models can significantly affect the prevalence of cooperation.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 09:14:42 GMT)
A paper examining whether LLMs behave cooperatively. The results are hard to reduce to a single trend, but the authors state: "With some exceptions, most LLM families we tested tend to move from IS towards SS as versions and parameter size increases, indicating a shift towards a higher complexity social norm which makes use of more context, specifically assigned reputations. Moreover, different versions of the same family can have vastly distinct social norms, such as Claude 3.5 Haiku [47] and Claude 3.7 Sonnet [48], despite their similar ethical goals [49]." (IS: cooperating is good, defection is bad. SS: cooperating is always good, defecting against bad individuals is also good.)
"These results highlight an important concern: LLMs are not explicitly designed with a given social norm in mind, instead emerging as a by-product of their training [4]. While these norms may occasionally align with those of humans, they are neither designed to maintain cooperation and minimize disagreement, nor are they co-created with communities from diverse cultures to reflect their norms and needs [3]." I think this is an accurate picture, but since using LLMs for decision support is discussed fairly often, caution is warranted.
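The two norms glossed above can be written as small judgment functions, which makes it easier to see why SS "makes use of more context". This is a sketch built only from the parenthetical definitions in this entry; the function and constant names are hypothetical.

```python
GOOD, BAD = "good", "bad"
COOPERATE, DEFECT = "cooperate", "defect"

def norm_is(action, recipient_reputation):
    """IS: cooperating is good, defecting is bad; the recipient is ignored."""
    return GOOD if action == COOPERATE else BAD

def norm_ss(action, recipient_reputation):
    """SS: cooperating is always good; defecting is good only against a bad recipient."""
    if action == COOPERATE:
        return GOOD
    return GOOD if recipient_reputation == BAD else BAD

# Enumerate all four situations and collect the ones where the norms disagree.
situations = [(a, r) for a in (COOPERATE, DEFECT) for r in (GOOD, BAD)]
disagreements = [s for s in situations if norm_is(*s) != norm_ss(*s)]
print(disagreements)  # [('defect', 'bad')]
```

The only situation where the judgments differ is defection against a bad individual, which is exactly where the recipient's assigned reputation has to be consulted.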

Mercury: Ultra-Fast Language Models Based on Diffusion

Mercury: Ultra-Fast Language Models Based on Diffusion [58.5]
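We present Mercury, a new commercial large language model (LLM) based on diffusion. Mercury Coder comes in two sizes, Mini and Small. Based on independent evaluations, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively.
Reference translation of the paper (metadata) (Tue, 17 Jun 2025 17:06:18 GMT)
The paper on Mercury, which was briefly covered in Continuous Diffusion Model for Language Modeling, Energy-Based Diffusion Language Models for Text Generation – arXiv最新論文の紹介.
The site is Inception Platform.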

Deep Research API, Gemini CLI, Mistral-Small-3.2-24B, Hunyuan-A13B, OpusLM

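There was a variety of news, but last week's highlights seem to be the arrival of the Deep Research API (Introduction to deep research in the OpenAI API) and the release of Gemini CLI (Gemini CLI : オープンソース AI エージェント | Google Cloud 公式ブログ). It is characteristic of the generative AI space that vendors providing foundation models such as LLMs and LRMs also move into application areas. This is a natural way to capture more added value, but it remains a tough development for vendors and startups that compete on API usage.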

Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506 · Hugging Face. Tencent also announced Hunyuan-A13B, an MoE, hybrid-reasoning model with 80B total and 13B active parameters (GitHub – Tencent-Hunyuan/Hunyuan-A13B: Tencent Hunyuan A13B (short as Hunyuan-A13B), an innovative and open-source LLM built on a fine-grained MoE architecture.).

On a different axis, an open SpeechLM has also been announced. Open efforts like this deserve attention too.

OpusLM: A Family of Open Unified Speech Language Models [56.1]
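OpusLM is continually pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. This paper describes the SpeechLM design in terms of tokenization, multi-stream language models, and multi-stage training strategies.
Reference translation of the paper (metadata) (Sat, 21 Jun 2025 06:30:59 GMT)
OpusLMs, short for Open Unified Speech Language Models.
The model is espnet/OpusLM_7B_Anneal · Hugging Face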

Gemini 2.5 Pro, Flash, 2.5 Flash-Lite, MiniMax-M1, Kimi-Dev-72B

Last week again brought a variety of news: Gemini 2.5 Pro dropped its preview label and 2.5 Flash-Lite was released (Gemini Pro – Google DeepMind).

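MiniMax, known for its highly efficient models, has released a reasoning model. Moonshot has published Kimi-Dev-72B, which also carries high expectations (GitHub – MoonshotAI/Kimi-Dev: open-source coding LLM for software engineering tasks). The technical report is said to be in preparation.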

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention [90.7]
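MiniMax-M1 is an open-weight, large-scale hybrid-attention reasoning model. It supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. MiniMax-M1 is trained using large-scale reinforcement learning.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 15:08:02 GMT)
A model that makes use of efficient Lightning Attention. The computational cost of Lightning Attention is linear in context length (although the model uses a hybrid structure for overall balance), which seems well suited to LRMs. It also adopts MoE, as many recent models do.
The repository is GitHub – MiniMax-AI/MiniMax-M1: MiniMax-M1, the world’s first open-weight, large-scale hybrid-attention reasoning model.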
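Why attention can be linear in context length is easiest to see in the generic softmax-free case. The sketch below is plain causal linear attention with an identity feature map, not MiniMax's Lightning Attention kernel; it only shows that a running state gives the same outputs as the quadratic loop. All names and the toy inputs are illustrative assumptions.

```python
def quadratic_attention(q, k, v):
    """Causal softmax-free attention: o_t = sum over j <= t of (q_t . k_j) * v_j.

    The nested loop over positions makes the cost quadratic in length.
    """
    out = []
    for t in range(len(q)):
        o = [0.0] * len(v[0])
        for j in range(t + 1):
            w = sum(a * b for a, b in zip(q[t], k[j]))
            for c in range(len(o)):
                o[c] += w * v[j][c]
        out.append(o)
    return out

def linear_attention(q, k, v):
    """Same outputs via a running state S = sum of outer(k_j, v_j).

    Each step updates S and reads it once, so the cost is linear in length.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for r in range(d_k):
            for c in range(d_v):
                state[r][c] += k_t[r] * v_t[c]
        out.append([sum(q_t[r] * state[r][c] for r in range(d_k)) for c in range(d_v)])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(linear_attention(q, k, v))  # [[1.0, 0.0], [2.0, 1.0], [5.0, 3.0]]
```

The state has a fixed size regardless of how many tokens have been seen, which is what makes very long contexts affordable; the softmax layers kept in a hybrid design do not have this property.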

Interpretable LLMs for Credit Risk: A Systematic Review and Taxonomy

Interpretable LLMs for Credit Risk: A Systematic Review and Taxonomy [0.0]
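Large language models (LLMs) enable credit risk assessment through the analysis of financial documents. This paper presents the first systematic review and taxonomy focused on LLM-based approaches to credit risk estimation.
Reference translation of the paper (metadata) (Wed, 04 Jun 2025 10:24:40 GMT)
A survey of credit risk assessment using LLMs.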

BLUR: A Bi-Level Optimization Approach for LLM Unlearning

BLUR: A Bi-Level Optimization Approach for LLM Unlearning [106.0]
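Enabling large language models (LLMs) to properly forget knowledge and capabilities acquired through training is essential for compliance with data regulations and for ethical AI use. Because conventional methods that weight the forgetting and retention losses tend to degrade performance, the authors propose a hierarchical approach that prioritizes forgetting and develop a new algorithm, Bi-Level UnleaRning (BLUR). The method comes with theoretical guarantees and outperforms other state-of-the-art algorithms across a range of tasks.
Reference translation of the paper (metadata) (Mon, 09 Jun 2025 19:23:05 GMT)
A proposal for a new unlearning method built around "Should we aim to forget and retain simultaneously? In many cases, the answer is no." and "Instead of treating unlearning as a binary process of simply forgetting specific information while retaining the rest, we argue that we should prioritize and structure these tasks hierarchically."
The repository is GitHub – OptimAI-Lab/BLURLLMUnlearning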
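The contrast between weighting the two losses and prioritizing forgetting can be written down schematically. The notation below is an assumption for illustration (model parameters \theta, forget loss \ell_f, retain loss \ell_r, trade-off weight \lambda) and is a generic bi-level form, not necessarily the exact objective used in the paper.

```latex
% Weighted-sum baseline: forgetting and retention are traded off by a fixed weight.
\min_{\theta} \; \ell_f(\theta) + \lambda \, \ell_r(\theta), \qquad \lambda > 0

% Hierarchical (bi-level) form: forgetting is solved first at the lower level,
% and retention is optimized only among parameters that already forget.
\min_{\theta} \; \ell_r(\theta)
\quad \text{s.t.} \quad
\theta \in \arg\min_{\theta'} \ell_f(\theta')
```

In the second form no value of \lambda has to be tuned: retention can never be improved at the expense of forgetting, which matches the "prioritize and structure these tasks hierarchically" framing quoted above.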

Pitfalls in Evaluating Language Model Forecasters

Pitfalls in Evaluating Language Model Forecasters [45.4]
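We argue that, as a community, we should be careful about conclusions drawn from evaluating large language models as forecasters. We identify two categories of issues: (1) difficulty in trusting evaluation results due to temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting.
Reference translation of the paper (metadata) (Sat, 31 May 2025 21:49:17 GMT)
A paper summarizing pitfalls in evaluating LLMs.
The summary is "We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims.", and evaluation really is hard.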
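The most basic form of temporal leakage can be screened with a date check against the model's knowledge cutoff. The sketch below is a simplification under stated assumptions: the field names, the cutoff date, and the three labels are hypothetical, and a date filter alone does not cover the other forms of leakage the paper refers to.

```python
from datetime import date

def leakage_status(created, resolved, knowledge_cutoff):
    """Classify one forecasting question relative to a model's knowledge cutoff."""
    if resolved <= knowledge_cutoff:
        return "leaked"    # the outcome may already be in the training data
    if created <= knowledge_cutoff:
        return "at_risk"   # the question predates the cutoff, the outcome does not
    return "clean"         # question and outcome both postdate the cutoff

cutoff = date(2024, 6, 30)  # hypothetical knowledge cutoff
questions = [
    {"id": "q1", "created": date(2024, 1, 10), "resolved": date(2024, 5, 1)},
    {"id": "q2", "created": date(2024, 6, 1), "resolved": date(2024, 9, 1)},
    {"id": "q3", "created": date(2024, 8, 1), "resolved": date(2024, 12, 1)},
]
statuses = {q["id"]: leakage_status(q["created"], q["resolved"], cutoff) for q in questions}
print(statuses)  # {'q1': 'leaked', 'q2': 'at_risk', 'q3': 'clean'}
```

Reporting results separately for each label, rather than pooling them, is one way to keep a headline number from being inflated by questions the model may simply have seen resolved.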
