LLM – Page 8 – Introducing the Latest arXiv Papers

DRS: Deep Question Reformulation With Structured Output

DRS: Deep Question Reformulation With Structured Output [114.1]
Large language models can identify that a question is unanswerable, but they lack the ability to help users reformulate it. The paper proposes DRS: Deep Question Reformulation with Structured Output. The proposed method raises the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, and raises the score of open-source large language models such as Gemma2-9B from 26.35% to 56.75%.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 02:20:44 GMT)
A proposal for a method that reformulates questions. The paper also notes that "More importantly, according to Faustini et al. (2023), in a large-scale industrial experiment, rephrasing unanswerable questions posed to virtual assistants significantly enhances the user experience for millions, which highlights the importance of effectively leveraging LLMs to assist people in question reformulation.", and there certainly are applied settings where this is wanted. The paper builds its method by splitting the problem into entity extraction, DFS combination search with question generation, and final candidate selection.
Repository: GitHub – Lizhecheng02/DRS: Repository for our paper “DRS: Deep Question Reformulation With Structured Output”.
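
The three stages named above can be sketched roughly as follows. This is a minimal illustration of the search structure only, not the authors' implementation: `extract_entities`, `generate_question`, and the scoring function are hypothetical stand-ins for LLM calls, replaced here with trivial string logic so the example runs.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[Tuple[str, ...], str]  # (entities used, generated question)


def extract_entities(question: str) -> List[str]:
    # Hypothetical stand-in for LLM entity extraction: capitalized words
    # after the first word are treated as entities.
    return [w.strip("?,.") for w in question.split()[1:] if w[:1].isupper()]


def generate_question(entities: Tuple[str, ...]) -> str:
    # Hypothetical stand-in for LLM question generation.
    return "What does the document say about " + " and ".join(entities) + "?"


def dfs_candidates(entities: List[str],
                   generate: Callable[[Tuple[str, ...]], str],
                   max_size: int = 2) -> List[Candidate]:
    """Depth-first search over entity combinations of size 1..max_size."""
    candidates: List[Candidate] = []

    def dfs(start: int, chosen: List[str]) -> None:
        if chosen:
            combo = tuple(chosen)
            candidates.append((combo, generate(combo)))
        if len(chosen) == max_size:
            return
        for i in range(start, len(entities)):
            chosen.append(entities[i])
            dfs(i + 1, chosen)
            chosen.pop()

    dfs(0, [])
    return candidates


def select_final(candidates: List[Candidate],
                 score: Callable[[Candidate], float]) -> Candidate:
    # Final candidate selection: keep the highest-scoring candidate.
    return max(candidates, key=score)
```

In a real system the score would come from a model judging whether the candidate is answerable from the document and still relevant to the original question; here any callable works, for example `select_final(cands, lambda c: len(c[0]))`.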

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.4]
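This paper proposes a two-player paradigm that separates the roles of the reasoning (actor) model and the critique model. It first proposes AutoMathCritique, an automated and scalable framework for collecting critique data. The critique models are shown to consistently improve the actor's performance on difficult queries at test time.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 17:11:54 GMT)
The authors build data with AutoMathCritique, a framework with three stages, "flawed reasoning path construction, critique generation, and data filtering", fine-tune on it, and also apply the approach described as "Motivated by the insights of test-time, we introduce the critique model into the actor model’s exploration and learning process, introducing a critique-in-the-loop self-improvement method" to confirm its effect. The results appear to show that critique models are effective (though building one may not be easy).
Repository: AutoMathCritique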
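
The test-time side of the actor/critic split can be pictured as a simple loop. The sketch below is only an illustration under my own assumptions: `refine_with_critique` and the toy actor and critic are hypothetical names, and the toy functions replace what would be two separate LLMs.

```python
from typing import Callable, Optional

Actor = Callable[[str, Optional[str]], str]      # (query, feedback) -> answer
Critic = Callable[[str, str], Optional[str]]     # (query, answer) -> feedback or None


def refine_with_critique(query: str, actor: Actor, critic: Critic,
                         max_rounds: int = 3) -> str:
    """The actor answers, the critic reviews, and the actor retries with the feedback."""
    answer = actor(query, None)
    for _ in range(max_rounds):
        feedback = critic(query, answer)
        if feedback is None:  # the critic found no flaw
            break
        answer = actor(query, feedback)
    return answer


def toy_actor(query: str, feedback: Optional[str]) -> str:
    # Hypothetical actor: makes an off-by-one error until it receives feedback.
    a, b = (int(x) for x in query.split("+"))
    return str(a + b - 1) if feedback is None else str(a + b)


def toy_critic(query: str, answer: str) -> Optional[str]:
    # Hypothetical critic: checks the sum and describes the flaw in words.
    a, b = (int(x) for x in query.split("+"))
    return None if int(answer) == a + b else "The final sum is wrong; recheck the addition."
```

The point of the separation is that the critic only has to locate and describe a flaw, while the actor stays responsible for producing the corrected answer.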

Model Context Protocol (MCP), QwQ, OLMo 2

There was plenty of news again last week, but the highlight is Anthropic's Model Context Protocol: Introducing the Model Context Protocol \ Anthropic, and Introduction – Model Context Protocol

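Roughly speaking, it is a protocol for integrating LLMs with external data and tools. When building LLM systems that assume external tool use or extended use of memory, whether a standard of this kind exists matters a great deal. I am very curious to see whether MCP can become the de facto standard.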
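
To make "a protocol for integrating tools" a little more concrete: as far as I understand the published specification, MCP messages are JSON-RPC 2.0, and a client asks a server to run a tool with a `tools/call` request. The sketch below only builds and answers such a message in plain Python. The tool name `search_notes`, its argument, and both helper functions are hypothetical, and no real MCP SDK or server is involved.

```python
import json
from typing import Any, Callable, Dict


def make_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> str:
    # Client side: a JSON-RPC 2.0 request asking the server to run one tool.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def handle_tool_call(raw: str, tools: Dict[str, Callable[..., str]]) -> str:
    # Server side (toy): look up the tool, run it, and wrap the text result.
    request = json.loads(raw)
    tool = tools[request["params"]["name"]]
    text = tool(**request["params"]["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}]},
    })
```

Because every tool is reached through the same request shape, an LLM application only needs one client implementation to talk to many different data sources.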

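Among open models, the very high-performing Qwen with Questions (QwQ) and OLMo 2, version 2 of the model covered earlier in "Dolma and OLMo – Introducing the Latest arXiv Papers", deserve close attention. As with O1 Replication Journey and TÜLU 3, efforts that openly share which methods and approaches actually improve performance are highly valuable.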

QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen
- A released model described as "QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities." Its performance is high even compared with OpenAI o1. Many efforts inspired by o1 are under way, and the competition is truly intense.
- Repository: Qwen/QwQ-32B-Preview · Hugging Face
- Demo: QwQ-32B-Preview – a Hugging Face Space by Qwen
OLMo 2: The best fully open language model to date | Ai2
- A model whose construction method, data, and the model itself are all released, with performance close to the state of the art.
- Repository: OLMo 2 – a allenai Collection
- Demo: Ai2 Playground

O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [30.9]
This paper critically examines current approaches to replicating the capabilities of OpenAI's O1 model. It shows that simple distillation from O1's API, combined with supervised fine-tuning, achieves superior performance on complex mathematical reasoning tasks.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 15:31:27 GMT)
Research on OpenAI o1; this is Part 2, following "Fugu-MT paper translation (abstract): O1 Replication Journey: A Strategic Progress Report — Part 1". The statement "While our previous work (Part 1 (Qin et al., 2024)) explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks." is unsurprising, but "Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning." is a surprise.
Repository: GitHub – GAIR-NLP/O1-Journey: O1 Replication Journey: A Strategic Progress Report – Part I

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training [94.1]
We introduce TÜLU 3, a fully open, state-of-the-art post-trained model. TÜLU 3 is built on Llama 3.1 base models and outperforms Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku.
Reference translation of the paper (metadata) (Fri, 22 Nov 2024 18:44:04 GMT)
Repository: GitHub – allenai/open-instruct

LLM Augmentations to support Analytical Reasoning over Multiple Documents

LLM Augmentations to support Analytical Reasoning over Multiple Documents [9.0]
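This study examines the application of large language models (LLMs) to enhance deep analytical reasoning in the context of intelligence analysis. It develops an architecture that augments the LLM with a memory module called the Dynamic Evidence Tree (DET) in order to develop and track multiple investigation threads.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 06:00:42 GMT)
On the use of LLMs in intelligence analysis; the workflow in which they are used is interesting.
Repository: GitHub – DiscoveryAnalyticsCenter/speculatores: [IEEE Big Data 2024] LLM Augmentations to support Analytical Reasoning over Multiple Documents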

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents [23.2]
The paper introduces a new paradigm that augments language agents with model-based planning. The method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about the structure and functionality of websites.
Reference translation of the paper (metadata) (Sun, 10 Nov 2024 18:50:51 GMT)
With the simple approach of "WEBDREAMER uses LLMs to simulate outcomes for each candidate action (e.g., “what would happen if I click this button?”) using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step.", the paper reports the interesting result that "our model-based planning approach, WEBDREAMER, shows substantial improvement over reactive baselines and offers greater flexibility than tree search, which is often impossible in real-world websites." I can understand the urge to give it a provocative title.
Repository: WebDreamer/README.md at main · OSU-NLP-Group/WebDreamer · GitHub
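
The simulate-then-evaluate step quoted above fits in a few lines. This is a sketch of the general idea only, not code from the WebDreamer repository: `plan_step`, `toy_simulate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for the LLM calls that imagine and score outcomes.

```python
from typing import Callable, List

Simulate = Callable[[str, str], str]   # (state, action) -> imagined outcome text
Evaluate = Callable[[str], float]      # imagined outcome text -> score


def plan_step(state: str, actions: List[str],
              simulate: Simulate, evaluate: Evaluate) -> str:
    """Imagine the outcome of each candidate action and return the best-scoring action."""
    if not actions:
        raise ValueError("at least one candidate action is required")
    best_action, best_score = actions[0], float("-inf")
    for action in actions:
        imagined = simulate(state, action)   # "what would happen if I do this?"
        score = evaluate(imagined)           # how much does that help the goal?
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Hypothetical imagined outcomes; a real agent would ask an LLM for these.
IMAGINED = {
    "click 'Home'": "The browser returns to the home page.",
    "click 'Add to cart'": "The item is added to the cart.",
}


def toy_simulate(state: str, action: str) -> str:
    return IMAGINED[action]


def toy_evaluate(outcome: str) -> float:
    # Toy goal: get the item into the cart.
    return 1.0 if "cart" in outcome else 0.0
```

No action is executed on the real site during planning, which is what makes this usable where tree search with backtracking is not.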

SPARTAN: SPARse TrANsformer World model

SPARTAN: A Sparse Transformer Learning Local Causation [63.3]
Causal structure plays a central role in world models that adapt flexibly to changes in the environment. This work proposes the SPARse TrANsformer World Model (SPARTAN). By applying sparsity regularization to the attention patterns between object-factored tokens, SPARTAN identifies sparse local causal models that accurately predict future object states.
Reference translation of the paper (metadata) (Mon, 11 Nov 2024 11:42:48 GMT)
Starting from the view that "Conceptually, we argue that in order to perform efficient adaptation, world models should be structured to reflect the underlying sparse causal structure of the observed dynamics, and that these structures should be local.", the authors state that "we propose SPARTAN, a structured world model that jointly performs dynamics model learning and causal discovery."

Language Models as Causal Effect Generators [44.8]
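The paper proposes a framework for data generation based on large language models (LLMs) with controllable causal structure. It defines a procedure that turns any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM).
Reference translation of the paper (metadata) (Tue, 12 Nov 2024 18:50:35 GMT)
This one takes the approach of building a sequence-driven structural causal model from an LLM plus a DAG.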
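
One way to picture the LLM plus DAG idea is to sample the variables in topological order and let each one see only the values of its parents. The sketch below is my own simplification under stated assumptions, not the paper's procedure: `sample_sd_scm` and `toy_sampler` are hypothetical names, and the sampler stands in for a language model conditioned on the parents' text.

```python
import random
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

Sampler = Callable[[str, Dict[str, str], random.Random], str]


def sample_sd_scm(parents: Dict[str, List[str]], sample_variable: Sampler,
                  rng: random.Random) -> Dict[str, str]:
    """Sample every variable after its parents, conditioning only on the parents' values."""
    values: Dict[str, str] = {}
    # static_order() yields each node after all of its predecessors (parents).
    for var in TopologicalSorter(parents).static_order():
        context = {p: values[p] for p in parents.get(var, ())}
        values[var] = sample_variable(var, context, rng)
    return values


def toy_sampler(var: str, context: Dict[str, str], rng: random.Random) -> str:
    # Hypothetical stand-in for a language model that continues a prompt
    # built from the parent values.
    if var == "age":
        return rng.choice(["young", "old"])
    if var == "treatment":
        return "drug" if context["age"] == "old" else "placebo"
    return "recovered" if context["treatment"] == "drug" else "not recovered"


# Each key maps to its parents: age -> treatment -> outcome, and age -> outcome.
DAG = {"treatment": ["age"], "outcome": ["age", "treatment"]}
```

Because the graph is chosen by the user, the causal structure of the generated data is known by construction, which is what makes it usable as a benchmark for causal effect estimation.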

The combination of causal graphs and LLMs is a very interesting topic.

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding [42.8]
Tree-of-Table is a new approach designed to enhance the reasoning abilities of LLMs over large and complex tables. Tree-of-Table sets a new benchmark with superior performance and shows remarkable efficiency and generalization in large-scale table reasoning.
Reference translation of the paper (metadata) (Wed, 13 Nov 2024 11:02:04 GMT)
A proposal for an approach that uses a tree structure to reason over large-scale table data.
It is impressive that "Starting with a large-scale input table, the process selectively condenses the data, emphasizing task-relevant information. Subsequently, the decomposed elements are methodically reorganized into a Table-Tree, a hierarchical structure designed to streamline and guide the subsequent reasoning process." can be done on a prompt basis. It looks likely to be effective.

Hunyuan-Large

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent [83.4]
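Hunyuan-Large is an open-source, Transformer-based mixture-of-experts model. The authors thoroughly evaluate Hunyuan-Large's superior performance across various benchmarks. A key practice of Hunyuan-Large is large-scale synthetic data that is larger than in previous literature.
Reference translation of the paper (metadata) (Tue, 05 Nov 2024 04:14:25 GMT)
An LLM of the high-performance, openly released kind. It is an MoE with 389B parameters, of which 52B are active, and the authors claim that it surpasses Llama 3.1 70B and is competitive with the 405B model. The license is relatively permissive, but a distinctive feature is the clause "THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW." It also states "This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China".
Repository: GitHub – Tencent/Tencent-Hunyuan-Large; model: tencent/Tencent-Hunyuan-Large · Hugging Face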
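
A quick calculation puts the 389B total / 52B active figures in perspective. The parameter counts are the ones quoted above; treating activated parameters as a proxy for per-token compute is a rough assumption of mine, not a claim from the paper.

```python
# Parameter counts in billions, as quoted in the text above.
HUNYUAN_TOTAL_B = 389.0
HUNYUAN_ACTIVE_B = 52.0
LLAMA_DENSE_B = {"Llama 3.1 70B": 70.0, "Llama 3.1 405B": 405.0}


def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of Hunyuan-Large's parameters that are activated for a single token.
active_share = percent(HUNYUAN_ACTIVE_B, HUNYUAN_TOTAL_B)

# Activated parameters relative to dense models, where every parameter is active.
relative_to_dense = {name: percent(HUNYUAN_ACTIVE_B, size)
                     for name, size in LLAMA_DENSE_B.items()}
```

Under that assumption, only about 13% of the model is used per token, and the activated part is smaller than Llama 3.1 70B even though the full model is close to the 405B model in total size.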

Number Cookbook: Number Understanding of Language Models and How to Improve It

Number Cookbook: Number Understanding of Language Models and How to Improve It [64.0]
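Large language models (LLMs) can solve an increasing number of complex reasoning tasks while still making unexpected errors in basic numerical understanding and processing. This paper comprehensively investigates the numerical understanding and processing ability (NUPA) of LLMs.
Reference translation of the paper (metadata) (Wed, 06 Nov 2024 08:59:44 GMT)
An analysis of the numerical understanding and processing ability (NUPA) of LLMs and a study of how to improve it. At present, tool-based approaches such as going through code generation are the strong option, but tool use is out of scope in this paper because "1) we want to study the self-contained NUPA of LLMs, 2) calling external tools whenever encountering numbers increases the inference latency (Xu et al., 2024), and 3) we believe NUPA without tools is a necessary ability of AGI."
For now, the finding is "We investigate NUPA of LLMs and introduce a comprehensive benchmark, the NUPA test, to reveal that numerical problems remain challenging for modern LLMs." It is indeed a hard problem. In practice it can be worked around, for example via code generation, but...
Repository: GitHub – GraphPKU/number_cookbook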

Vulnerability of LLMs to Vertically Aligned Text Manipulations

Vulnerability of LLMs to Vertically Aligned Text Manipulations [108.7]
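Large language models (LLMs) are highly effective at text classification tasks. For encoder-based models, changing the input format, for example by aligning words vertically, substantially lowers accuracy on text classification tasks. Do decoder-based LLMs show a similar vulnerability to vertically formatted text input?
Reference translation of the paper (metadata) (Sat, 26 Oct 2024 00:16:08 GMT)
A paper examining how so-called vertical writing affects classification tasks, and how to mitigate that effect. The target is English, but it would be interesting to run the same test in Japanese, where mixing horizontal and vertical writing is fairly common.
The paper reports "the model’s enhanced performance with few-shot learning, particularly when compared to the CoT output", so few-shot prompting is comparatively effective.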
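
For readers who want to try this on their own classifier, one plausible way to produce vertically aligned input is shown below. It is a sketch under my own assumptions and not necessarily the exact format used in the paper; `to_vertical` is a hypothetical helper that writes each word top to bottom in its own column.

```python
from itertools import zip_longest


def to_vertical(text: str, sep: str = " ") -> str:
    """Write each word of `text` top to bottom in its own column."""
    words = text.split()
    # zip_longest pads shorter words with spaces so the columns stay aligned.
    rows = zip_longest(*words, fillvalue=" ")
    return "\n".join(sep.join(row).rstrip() for row in rows)


if __name__ == "__main__":
    print(to_vertical("bad movie"))
```

The characters are unchanged, but the token sequence the model sees is now interleaved across words, which is presumably why classification accuracy drops.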
