LLM – Page 3 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [221.3]
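Scientific large language models (Sci-LLMs) are changing how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 18:30:52 GMT)
A survey of LLMs in scientific research, an area where practical application is advancing.
The repository is GitHub – open-sciencelab/Awesome-Scientific-Datasets-and-LLMs: A curated collection of papers, datasets, and resources on Scientific Datasets and Large Language Models (LLMs)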

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres [15.2]
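Integrating large language models into Security Operations Centres (SOCs) offers a transformative, though still evolving, opportunity to reduce analyst workload. This paper presents a longitudinal study of 3,090 queries submitted by 45 SOC analysts over 10 months. The analysis shows that analysts use LLMs as on-demand aids for sensemaking and context-building rather than for making high-stakes determinations.
Reference translation of the paper (metadata) (Tue, 26 Aug 2025 11:40:02 GMT)
A report on how SOC analysts actually use LLMs.
The statement "By analysing thousands of analyst-generated queries, we found that analysts use LLMs as on-demand, task-focused cognitive aids for a variety of tasks, including explaining commands, writing scripts, or improving documentation, rather than as full-time copilots." matches what one would expect of the current situation.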

Qwen3-Max, K2-Instruct-0905, LongCat-Flash, Dream-Coder 7B, Kwai Keye-VL 1.5

Last week again brought plenty of news from the LLM/LRM space. Qwen3-Max, the largest configuration in the Qwen3 family, was released (Qwen on X: "Big news: Introducing Qwen3-Max-Preview (Instruct) — our biggest model yet, with over 1 trillion parameters! 🚀 Now available via Qwen Chat & Alibaba Cloud API. Benchmarks show it beats our previous best, Qwen3-235B-A22B-2507. Internal tests + early user feedback confirm: https://t.co/7vQTfHup1Z" / X, Models and pricing – Alibaba Cloud Model Studio – Alibaba Cloud Documentation Center), and Kimi K2 was updated (Kimi.ai on X: "Kimi K2-0905 update 🚀 – Enhanced coding capabilities, esp. front-end & tool-calling – Context length extended to 256k tokens – Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc) 🔗 Weights & code: https://t.co/83sQekosr9 💬 Chat with new Kimi https://t.co/mkOuBMwzpw" / X, moonshotai/Kimi-K2-Instruct-0905 · Hugging Face). Besides LongCat-Flash, smaller but distinctive models such as Dream-Coder 7B and Kwai Keye-VL 1.5 were also announced.

Surrounding areas also deserve attention, for example protocol proposals such as Introduction – Agent Client Protocol (GitHub – zed-industries/agent-client-protocol: A protocol for connecting any editor to any agent).

LongCat-Flash Technical Report [165.7]
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Model training on more than 20 trillion tokens was completed within 30 days, and inference runs at over 100 tokens per second (TPS) at a cost of $0.70 per million output tokens.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 10:05:45 GMT)
A 560B MoE configuration. "As a non-thinking model, LongCat-Flash achieves performance comparable to state-of-the-art non-thinking models, including DeepSeek-V3.1 [DeepSeek-AI et al., 2025] and Kimi-K2 [Team et al., 2025], while using fewer parameters and offering faster inference speed. Specifically, LongCat-Flash scores 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ²-Bench, demonstrating robust capabilities in general domains, coding, and agentic tool use."
The repository is GitHub – meituan-longcat/LongCat-Flash-Chat
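To put the training figure in perspective, the following back-of-the-envelope sketch converts "more than 20 trillion tokens within 30 days" into an average throughput. The function name is made up for illustration, and the result is only a lower bound on the sustained rate, since the abstract gives "more than" and "within" rather than exact values.

```python
def tokens_per_second(total_tokens: float, days: float) -> float:
    """Average training throughput implied by a token count and a wall-clock budget."""
    seconds = days * 24 * 60 * 60
    return total_tokens / seconds


# 20 trillion tokens over 30 days, the figures quoted in the abstract.
rate = tokens_per_second(20e12, 30)
print(f"{rate:,.0f} tokens/s")  # 7,716,049 tokens/s
```

In other words, the cluster as a whole had to average roughly 7.7 million training tokens per second for the entire month.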

Dream-Coder 7B: An Open Diffusion Language Model for Code [99.1]
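This paper proposes Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits any-order generation capabilities. Unlike conventional autoregressive (AR) models, which decode strictly left to right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 05:30:56 GMT)
A diffusion model strengthened for coding tasks.
The repository is GitHub – DreamLM/Dream-Coder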

Kwai Keye-VL 1.5 Technical Report [91.3]
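This paper introduces Keye-VL-1.5, which addresses fundamental challenges in video understanding through three key innovations. First, it proposes a Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, it proposes a four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K. Third, it develops a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 15:46:58 GMT)
A model that can handle video: "Keye-VL-1.5-8B establishes new state-of-the-art performance among models of similar scale, demonstrating superior results on video-centric benchmarks while maintaining competitive performance on general multimodal and reasoning tasks."
The repository is GitHub – Kwai-Keye/Keye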
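The idea of allocating compute by inter-frame similarity can be illustrated with a small sketch. This is not the Keye-VL-1.5 implementation: the feature vectors, the cosine measure, the 0.95 threshold, and the names `cosine` and `label_frames` are all assumptions chosen for illustration. A frame that differs enough from the most recent "slow" frame is treated as new content and would get a full budget, while near-duplicates are marked "fast" and would get a reduced one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_frames(frames, threshold=0.95):
    """Label each frame 'slow' (new content) or 'fast' (similar to the last slow frame).

    `threshold` is an assumed value, not one taken from the paper summary above.
    """
    labels = []
    last_slow = None
    for feat in frames:
        if last_slow is None or cosine(feat, last_slow) < threshold:
            labels.append("slow")
            last_slow = feat  # this frame becomes the new reference
        else:
            labels.append("fast")
    return labels


# Toy 2-D "features": frames 0-1 show one scene, frames 2-3 another.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
labels = label_frames(frames)
print(labels)
```

A downstream encoder could then spend more visual tokens on the "slow" frames and fewer on the "fast" ones, which is the kind of dynamic allocation the abstract describes.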

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs [36.3]
This paper proposes a new learning paradigm for adaptive large language model (LLM) agents. The method enables low-cost continual adaptation through memory-based online reinforcement learning. The agent model is instantiated in a deep research setting as Memento, which reaches top-1 on GAIA validation.
Reference translation of the paper (metadata) (Mon, 25 Aug 2025 13:32:12 GMT)
A proposal for an agent framework with a memory mechanism: "Memento formalises deep research agents as a memory-based Markov Decision Process (MDP) and implements it within a planner–executor framework, leveraging an episodic case bank to record and retrieve trajectories for continual policy improvement."
The repository is GitHub – Agent-on-the-Fly/Memento: Official Code of Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
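The "episodic case bank" idea can be sketched in a few lines. The following is a toy illustration, not Memento's code: the `Case` and `CaseBank` names, the token-overlap (Jaccard) similarity, and the example tasks are all assumptions. The point it shows is that the agent adapts by writing past trajectories to memory and reading back the most similar ones for a new task, while the LLM weights stay untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    task: str      # task description, playing the role of the state
    plan: str      # plan the planner produced for that task
    reward: float  # 1.0 for success, 0.0 for failure


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


@dataclass
class CaseBank:
    cases: list = field(default_factory=list)

    def write(self, case: Case) -> None:
        """Record a finished trajectory."""
        self.cases.append(case)

    def read(self, task: str, k: int = 2) -> list:
        """Return the k stored cases most similar to the new task."""
        ranked = sorted(self.cases, key=lambda c: jaccard(task, c.task), reverse=True)
        return ranked[:k]


bank = CaseBank()
bank.write(Case("find the population of Berlin", "search web, read infobox", 1.0))
bank.write(Case("convert a PDF file to text", "call pdf tool, clean output", 1.0))
bank.write(Case("find the capital of France", "answer from memory", 0.0))

# Retrieved cases would be placed in the planner's prompt as guidance.
retrieved = bank.read("find the population of Paris", k=1)
print(retrieved[0].plan)
```

A fuller version would also use the stored rewards, for example to prefer successful cases or to train the retrieval policy online, which is where the reinforcement learning in the abstract comes in.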

Grok 2.5, HERMES 4, InternVL3.5, VIBEVOICE

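Last week there was a lot of news about open-weight models. As announced, xAI released the Grok 2 weights (https://x.com/elonmusk/status/1959379349322313920 / xai-org/grok-2 · Hugging Face), and Grok 3 is reportedly to be released in about six months. New models also came out of Hermes and InternVL. Although the approaches vary, it is impressive how steadily these teams are building models and catching up to the frontier. Microsoft Research released an open-source text-to-speech model (VibeVoice). There are still many situations that call for specialized models, so this is welcome.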

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency [245.9]
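InternVL 3.5 is a new family of open-source multimodal models that significantly improves versatility, reasoning capability, and inference efficiency. The key innovation is the Cascade Reinforcement Learning framework, which strengthens reasoning through a two-stage process. The largest model, InternVL3.5-241B-A28B, achieves state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks.
Reference translation of the paper (metadata) (Mon, 25 Aug 2025 17:58:17 GMT)
The latest version of InternVL. It uses the Qwen3 series and GPT-OSS as base models for the LLM component. The comparison between GPT-OSS-20B and Qwen3-30B-A3B is also interesting (Qwen3 performs better, possibly because of the difference in parameter count).
The repository is OpenGVLab/InternVL3_5-241B-A28B · Hugging Face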

Hermes 4 Technical Report [7.6]
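Hermes 4 is a family of hybrid reasoning models that combine structured multi-turn reasoning with broad instruction-following ability. The report describes the challenges encountered in data curation, synthesis, training, and evaluation, and outlines the solutions used to address them at scale.
Reference translation of the paper (metadata) (Mon, 25 Aug 2025 17:45:06 GMT)
The repository is Hermes 4 Collection – a NousResearch Collection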

VibeVoice Technical Report [90.1]
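VibeVoice is a model designed to synthesize long-form speech with multiple speakers. The report introduces a novel continuous speech tokenizer that improves data compression by 80 times compared with the Encodec model.
Reference translation of the paper (metadata) (Tue, 26 Aug 2025 17:09:12 GMT)
The repository is GitHub – microsoft/VibeVoice: Frontier Open-Source Text-to-Speech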

aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists

aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists [22.3]
We introduce aiXiv, a next-generation open-access platform for human and AI scientists. Our work lays the foundation for a next-generation open-access ecosystem for AI scientists.
Reference translation of the paper (metadata) (Wed, 20 Aug 2025 23:16:41 GMT)
A platform with a "closed-loop review system for both proposals and papers, incorporating automatic retrieval-augmented evaluation, reviewer guidance, and robust defenses against prompt injection." that also provides an API and an MCP server.
The repository is GitHub – aixiv-org/aiXiv: Preprint server for AI Scientists and Robot Scientists

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models [108.6]
MME-Emotion is a systematic benchmark that evaluates both the emotional understanding and the reasoning capabilities of MLLMs. It contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It also incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
Reference translation of the paper (metadata) (Mon, 11 Aug 2025 03:14:55 GMT)
A proposal for an emotion-focused benchmark: "In this paper, we introduced MME-Emotion, a comprehensive multi-task benchmark for evaluating emotional intelligence in MLLMs, accompanied by a holistic evaluation suite. The assessment process was fully automated within a multi-agent system framework and thoroughly validated by human experts."
The project site is https://mme-emotion.github.io/.

Command A Reasoning, DeepSeek V3.1, Gemma 3 270M, Nemotron Nano 2, Dream 7B

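There really is a lot of LLM/LRM news. Last week's big items were the release of Cohere’s Command A Reasoning Model | Cohere (the model is Cohere’s Command A Reasoning Model | Cohere, CC-BY-NC) and the release of DeepSeek V3.1 (DeepSeek-V3.1 Release | DeepSeek API Docs; the model is deepseek-ai/DeepSeek-V3.1 · Hugging Face). It matters a great deal when frontier or near-frontier models are released openly. In addition, the technical report for Intern-S1 has been published.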

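Among small models, Gemma 3 270M (Introducing Gemma 3 270M: The compact model for hyper-efficient AI – Google Developers Blog; the model is google/gemma-3-270m · Hugging Face) is interesting for being extremely small. Its raw performance is questionable, but it looks usable in some settings, for example after post-training for a specialized task. NVIDIA's Nemotron Nano 2 is also worth watching (9B despite the name Nano).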

Huawei published a paper on the diffusion-based Dream 7B. It surpasses LLaDA and appears to hold its own against autoregressive models of the same scale, which is strong performance.

Intern-S1: A Scientific Multimodal Foundation Model [185.4]
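Intern-S1 is a specialized generalist equipped with general understanding and reasoning capabilities. It undergoes offline and then online reinforcement learning (RL) in InternBootCamp. Intern-S1 shows competitive performance on general reasoning tasks among open-source models.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 17:58:00 GMT)
The technical report for a model covered in Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2 – arXiv最新論文の紹介.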

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model [176.4]
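Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads. It builds on the Nemotron-H architecture, in which the majority of the self-attention layers of the common Transformer architecture are replaced with Mamba-2 layers.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 04:18:04 GMT)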
nvidia/NVIDIA-Nemotron-Nano-9B-v2 · Hugging Face

Dream 7B: Diffusion Large Language Models [85.3]
We introduce Dream 7B, the most powerful open diffusion large language model to date. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 12:09:58 GMT)
According to the paper, "Dream 7B achieves competitive performance with Qwen 2.5 on standard benchmarks (general language understanding, mathematical reasoning, and code generation) while exhibiting superior planning abilities and novel inference flexibility features that naturally emerge from the diffusion modeling paradigm."
The repository is GitHub – DreamLM/Dream: Dream 7B, a large diffusion language model; the model is Dream 7B – a Dream-org Collection

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges [22.1]
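This paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. Tables are two-dimensional and span formats that serve different purposes, from structured database tables to complex multi-layered spreadsheets. We highlight several critical gaps in the field that indicate the need for further research.
Reference translation of the paper (metadata) (Thu, 31 Jul 2025 23:41:31 GMT)
A survey of how LLMs handle tabular data.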

GPT-5, GPT-OSS, Claude Opus 4.1

Last week saw a string of major announcements: GPT-5 (GPT-5 が切り拓く働き方の新時代 | OpenAI), gpt-oss 20B and 120B (gpt-oss が登場 | OpenAI), Claude Opus 4.1 (Claude Opus 4.1 \ Anthropic), and DeepMind Genie 3 (Genie 3: A new frontier for world models – Google DeepMind).

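GPT-5 solidly takes SoTA on benchmarks and is a very capable model. On the other hand, the gap to Claude Opus 4.1, announced shortly before, was not large (the system card statement "All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure." (gpt5-system-card-aug7.pdf) is also a concern), and it trails Gemini 2.5 Pro on the Japanese-language Chatbot Arena (Gemini 2.5 Pro also has the higher win rate in head-to-head matchups), so it leaves the impression of falling short of expectations. GPT-5 also could not read invented kanji (Pixels, Patterns, but No Poetry: To See The World like Humans – arXiv最新論文の紹介). Still, the pricing is strategic and the model posts truly frontier-level scores on Measuring AI Ability to Complete Long Tasks – METR, so more time is needed to assess where it really stands.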
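Why the n=477 note matters can be shown with simple arithmetic. SWE-bench Verified contains 500 tasks, so a score reported on a 477-task subset is not directly comparable with a score on the full set. The sketch below computes the most pessimistic conversion, which assumes every omitted task would have been failed. The 75.0% input is a hypothetical value for illustration, not a reported score, and the function name is made up.

```python
FULL_SET = 500  # number of tasks in SWE-bench Verified
SUBSET = 477    # subset size quoted from the system card


def worst_case_full_score(subset_score_pct: float,
                          subset: int = SUBSET,
                          full: int = FULL_SET) -> float:
    """Score on the full set if every task outside the subset counted as a failure."""
    solved = subset_score_pct / 100 * subset
    return 100 * solved / full


# Hypothetical subset score of 75.0% (illustrative only).
adjusted = worst_case_full_score(75.0)
print(f"{adjusted:.2f}%")  # 71.55%
```

Under this worst-case assumption a subset score drops by a factor of 477/500, or about 4.6% in relative terms, which is enough to change the ordering when models are only a few points apart.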

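GPT-OSS is a high-performing open-weight model released under the Apache 2.0 license. It is significant that a model that appears to be at a practically usable level has been released. From GPT-2 to gpt-oss: Analyzing the Architectural Advances shows that many improvements have accumulated under the single label of "transformer".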

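Companies other than OpenAI are also releasing very capable models, such as Claude Opus 4.1 and Gemini 2.5 Pro, and Chinese models such as DeepSeek, Kimi, and Hunyuan continue to improve. OpenAI's sole dominance is over, but progress continues.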
