LLM – Page 13 – Introducing the Latest arXiv Papers

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning [61.1]
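Current Large Language Models (LLMs) are limited in their ability to understand table structure and to apply precise numerical reasoning. The paper introduces TART (Tool-Augmented Reasoning framework for Tables), which integrates LLMs with specialized tools. TART has three key components: a table formatter that ensures accurate data representation, a tool maker that develops specific computational tools, and an explanation generator that maintains explainability.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 06:19:59 GMT)
A framework for handling tabular data: "TART consists of a table formatter for accurate data representation, a tool maker for creating specialized tools, and an explanation generator maintaining interpretable explanations." The authors also devised a benchmark and used it to confirm the framework's effectiveness.
The repository is GitHub – XinyuanLu00/TART: This is the repository for TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning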
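
To make the three-component split concrete, here is a minimal sketch of a formatter, a tool, and an explanation step chained together. It is a toy stand-in, not TART's implementation: there are no LLM calls, and the function names (`format_table`, `column_sum`, `explain`), the pipe-delimited input, and the sample data are all assumptions made for illustration.

```python
def format_table(raw):
    """Table formatter (toy): parse pipe-delimited text into row dicts,
    casting numeric cells to float so later arithmetic is exact code, not text."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        row = {}
        for name, cell in zip(header, cells):
            try:
                row[name] = float(cell)
            except ValueError:
                row[name] = cell  # keep non-numeric cells as strings
        rows.append(row)
    return rows


def column_sum(rows, column):
    """A hypothetical 'tool' of the kind a tool maker might produce."""
    return sum(r[column] for r in rows)


def explain(tool_name, column, result):
    """Explanation generator (toy): describe which tool produced the answer."""
    return f"Applied {tool_name} to column '{column}' and obtained {result}."


# Hypothetical sample table
raw = """
city | sales
Tokyo | 120
Osaka | 80.5
"""
rows = format_table(raw)
total = column_sum(rows, "sales")
print(explain("column_sum", "sales", total))
```

The point of the split is that the numeric answer comes from executed code over a typed table, while the explanation records which tool was applied.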

A Controlled Study on Long Context Extension and Generalization in LLMs

A Controlled Study on Long Context Extension and Generalization in LLMs [85.5]
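Broad text understanding and in-context learning require language models that make use of the full document context. Because training long-context models directly poses implementation challenges, many methods have been proposed for extending models to handle long contexts. The authors implement a controlled protocol for comparing extension methods with standardized evaluation, using consistent base models and extension data.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:53:17 GMT)
An evaluation of methods for handling long inputs: "Our study underscores the role of perplexity as a crucial, performance indicator at length and highlights the trade-offs inherent in different attention mechanisms."
The repository is GitHub – Leooyii/LCEG: Long Context Extension and Generalization in LLMs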
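
Since the quoted conclusion leans on perplexity, a short reminder of how it is computed may help. The sketch below uses the standard definition (the exponential of the mean negative log-likelihood per token); the per-token probabilities are made-up values, and nothing here reflects the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigned
    to each observed token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical model that gives every token probability 0.25:
# its perplexity is about 4, i.e. it is as uncertain as a uniform
# choice among four tokens.
uniform_case = [math.log(0.25)] * 8
print(perplexity(uniform_case))
```

Lower values mean the model is less surprised by the text, which is why perplexity measured at long lengths can serve as a proxy for long-context quality.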

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B [11.8]
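This paper evaluates the performance of quantized instruction-tuned LLMs ranging from 7B to 405B. Performance is assessed across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 10:31:37 GMT)
A paper analyzing the impact of quantization, concluding that "We found that quantized LLMs generally outperformed smaller models in most tasks, except for hallucination detection and instruction-following." The latter half is a bit surprising.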

Qwen 2.5, Qwen 2 VL, GRIN-MoE, Pixtral

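Various research organizations are building LLMs. In last week's news, the ones to watch were the high-performance LLM Qwen 2.5, the efficient MoE-based GRIN-MoE, and the multimodal extensions Qwen 2 VL and Pixtral.

Note that the licenses vary, but the models themselves are publicly available, so the options beyond commercial APIs keep growing. Each model also has its own aims, and frankly even evaluating them is not easy. These days I find myself wanting an AI that suggests the base model and usage pattern that fit what I want to do.

A great deal of information has also been published from the standpoint of model building and fine-tuning, which is very interesting.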

Qwen2.5-Coder Technical Report [100.7]
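The report introduces the Qwen2.5-Coder series, a significant upgrade from its predecessor CodeQwen1.5. As a code-specific model, Qwen2.5-Coder is built on the Qwen2.5 architecture and pretrained on a vast corpus of more than 5.5 trillion tokens.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:57:57 GMT)
The report stresses the importance of filtering: "To ensure the quality of the pre-training data, we have curated a dataset by collecting public code data and extracting high-quality code-related content from web texts, while filtering out low-quality data using advanced classifiers." Data synthesis is also mentioned, but is the emphasis on filtering because, unlike for math, real data is plentiful?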

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.5]
The report presents Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. Qwen2.5-Math-Instruct supports both Chinese and English and has advanced mathematical reasoning capabilities.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 16:45:37 GMT)
"In this report, we introduce Qwen2.5-Math, which features several key technical highlights: (1) extensive use of synthesized mathematical data from Qwen2-Math during the pre-training phase, (2) iterative generation of fine-tuning data and reinforcement training guided by the reward model during the post-training and inference phase and (3) support for bilingual (English and Chinese) queries, along with chain-of-thought and tool-integrated reasoning capabilities." The effect of synthetic data and of this self-improvement-like loop is interesting.

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution [82.4]
This paper introduces the Qwen2-VL series, an upgrade of the earlier Qwen-VL models. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which lets images of different resolutions be processed into different numbers of visual tokens. It also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates effective fusion of positional information across text, images, and video.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:59:32 GMT)
"Qwen2-VL series introduces naive dynamic resolution and multimodal rotary position embedding (M-RoPE) to fuse information across modals effectively and be capable of understanding videos over 20 minutes in length." and "Furthermore, Qwen2-VL now supports understanding multilingual texts within images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others." With video support and Japanese support, this is a powerful multimodal model.
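
As a rough illustration of how resolution can translate into a varying number of visual tokens, the sketch below counts merged patch tokens for an image. The patch size of 14 and the 2x2 merging of adjacent patches are assumptions that are not stated in this post, the function name is hypothetical, and special boundary tokens and any minimum or maximum pixel limits are ignored.

```python
def visual_token_count(height, width, patch=14, merge=2):
    """Estimate visual tokens for an image under assumed settings.

    patch: side length in pixels of one ViT patch (assumed 14).
    merge: adjacent patches merged per side into one token (assumed 2).
    """
    unit = patch * merge  # pixels per side covered by one merged token
    # Snap each side to the nearest multiple of `unit`, at least one unit.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)


for size in [(224, 224), (448, 896), (1080, 1920)]:
    print(size, visual_token_count(*size))
```

Under these assumptions the token count grows with image area, so small images stay cheap while large ones keep more detail.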

GRIN: GRadient-INformed MoE [132.9]
Mixture-of-Experts (MoE) models scale more effectively than dense models thanks to sparse computation via expert routing. The paper introduces GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. The resulting model has only 6.6B activated parameters, yet it outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:00:20 GMT)
An improvement to MoE characterized by "We propose SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation." and "We scale MoE training with neither expert parallelism nor token dropping, while the conventional MoE training employs expert parallelism and deploys token dropping."
I recall reading reports that experts in MoE configurations often fail to specialize as much as one would expect, so the statement "Our study seems to verify our hypothesis that expert networks in GRIN MoE have developed highly-specialized and heterogeneous expertise." is interesting.
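
For readers unfamiliar with expert routing, here is a minimal sketch of conventional sparse top-k gating, the baseline that the quotes contrast against. It is not SparseMixer-v2 and does not touch gradient estimation. The helper names, the choice of k, and the toy scalar "experts" are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())


# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: x * s for i in range(4)]
logits = [2.0, 0.0, 1.0, -1.0]  # hypothetical router scores for one token
print(route_top_k(logits, 2))
print(moe_forward(1.0, experts, logits))
```

Only k of the experts run per token, which is where the gap between activated and total parameters comes from. The discrete top-k selection is also the step that makes routing gradients hard to estimate.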

Pixtral 12B [56.8]
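The paper introduces Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size.
Reference translation of the paper (metadata) (Wed, 09 Oct 2024 17:16:22 GMT)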
Announcing Pixtral 12B | Mistral AI | Frontier AI in your hands
GitHub – mistralai/mistral-evals

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.4]
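The authors propose MMEvol, a new framework for curating image-text instruction data. MMEvol combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. The proposed method achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of 13 vision-language tasks.
Reference translation of the paper (metadata) (Mon, 9 Sep 2024 17:44:00 GMT)
"a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution." What sets it apart is that it is multimodal. On effectiveness, the authors state that "The data evolved through three rounds of evolution is used to train a new model, demonstrating state-of-the-art (SOTA) performance across a comprehensive set of benchmarks."
It is interesting that the approach has been shown to work in multimodal contexts, beyond text and math problems, and also that the future work mentions integration with image generation models.
The project site is MMEvol: Welcome (rainbowluocs.github.io)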

Paper Copilot, TravelAgent

Papers that are close to LLM-based applications are also useful for seeing internal behavior and design.

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance [14.5]
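This paper introduces Paper Copilot, a self-evolving, efficient LLM system that assists researchers. Paper Copilot offers personalized research services and maintains a database that is updated in real time. The paper details the design and implementation of Paper Copilot and discusses its contribution to personalized academic assistance and its potential to streamline the research process.
Reference translation of the paper (metadata) (Fri, 06 Sep 2024 20:04:04 GMT)
An assistant for keeping up with papers.
The demo system is ArxivCopilot – a Hugging Face Space by ulab-ai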

TravelAgent: An AI Assistant for Personalized Travel Planning [36.0]
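The paper introduces TravelAgent, a travel planning system powered by large language models (LLMs). TravelAgent consists of four modules: tool usage, recommendation, planning, and memory. The authors evaluate TravelAgent with both human and simulated users, show its overall effectiveness against three criteria, and confirm the accuracy of its personalized recommendations.
Reference translation of the paper (metadata) (Thu, 12 Sep 2024 14:24:45 GMT)
An agent for travel planning; the way it is built, among other things, is a useful reference.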

Can LLMs Generate Novel Research Ideas? / Can Large Language Models Unlock Novel Scientific Research Ideas?

Two papers came out on whether LLMs can generate research ideas.

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.3]
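Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery. By having researchers write novel ideas and blind reviews of both LLM and human ideas, the study obtains the first statistically significant conclusion about current LLM capabilities for research ideation. It also identifies open problems in building and evaluating research agents, including failures of LLM self-evaluation and a lack of diversity in generation.
Reference translation of the paper (metadata) (Fri, 06 Sep 2024 08:25:03 GMT)
Researchers compared LLM ideas with human ideas: "we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility." The result is interesting, and so are the discussions in "7 Limitations of LLMs" and "11 Ethical Considerations".
The repository is GitHub – NoviScl/AI-Researcher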

Can Large Language Models Unlock Novel Scientific Research Ideas? [21.2]
Large language models (LLMs) and the publicly available ChatGPT have marked a major turning point in bringing artificial intelligence into people's daily lives. This study examines the ability of LLMs to generate new research ideas based on information from research papers.
Reference translation of the paper (metadata) (Tue, 10 Sep 2024 03:26:42 GMT)
The title is close to the one above, but this paper builds on past papers, following the approach "To address this task, we create a dataset of papers published after the year 2022 from these five domains. We annotate the papers with future research ideas. To evaluate the novelty and relevance of ideas generated by the LLMs, we propose an Idea Alignment Score (IAScore). This score reflects how well the generated ideas align with those proposed by the authors." Leakage is a concern here.
The repository is GitHub – sandeep82945/Future-Idea-Generation

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Towards a Unified View of Preference Learning for Large Language Models: A Survey [89.7]
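Large language models (LLMs) exhibit remarkably powerful capabilities. One of the key factors behind their success is aligning LLM outputs with human preferences. The survey decomposes all preference-learning strategies into four components: model, data, feedback, and algorithm.
Reference translation of the paper (metadata) (Wed, 04 Sep 2024 15:11:55 GMT)
A survey of preference learning, which is important in building LLMs.
The repository is GitHub – KbsdJames/Awesome-LLM-Preference-Learning: The official repository of our survey paper: “Towards a Unified View of Preference Learning for Large Language Models: A Survey”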

OpenAI o1

The biggest news last week was surely the release of OpenAI o1 (Introducing OpenAI o1 | OpenAI), which had been the subject of many rumors. It performs especially strongly in STEM fields.

Much of the technical detail is undisclosed, but judging from the statement in Learning to Reason with LLMs | OpenAI that "Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.", I think it is close to the trend of self-improvement and synthetic data use (self-X – Introducing the Latest arXiv Papers (devneko.jp), Synthetic data – Introducing the Latest arXiv Papers (devneko.jp)).

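The Q&A for developers apparently included some interesting exchanges, such as:

OpenAI o1 is a model, not a system; it is a model that generates a long reasoning process (not shown to the user)
Prompt engineering on GPT-4o cannot rival the performance of OpenAI o1
RAG is still effective with OpenAI o1

Details are unlikely to be disclosed, but some kind of technical report would be welcome. My honest impression is that, for now, it does not depart much from recent research trends, and neither the size of the performance gain nor the feel of using it comes as a big surprise. I am curious whether a single model is the better approach, or whether it would be better to build a system (agentic behavior) and combine models suited to it (for example, a model specialized for agentic behavior plus a model suited to ordinary inference). o1 is said to be the former, but I suspect that will impose significant constraints once external tool use is considered.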

I expect it will go on to be evaluated on a variety of benchmarks, including ones involving agentic behavior. On Cybench (Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models – Introducing the Latest arXiv Papers (devneko.jp)), "Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks." has improved (overtaking Claude 3.5 Sonnet, to which GPT-4o had lost), while the success rate falls short of GPT-4o.
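
The quoted metric is a macro-average, so a small worked example may clarify why it can move differently from other success measures. The task names and counts below are made up for illustration and are not Cybench results.

```python
def macro_avg_subtasks_solved(results):
    """Macro-average: compute the solved fraction per task, then average
    those fractions so every task counts equally regardless of size.

    results: dict mapping task name -> (subtasks_solved, subtasks_total).
    """
    per_task = [solved / total for solved, total in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


def micro_avg_subtasks_solved(results):
    """Micro-average for contrast: pool all subtasks before dividing."""
    solved = sum(s for s, _ in results.values())
    total = sum(t for _, t in results.values())
    return 100.0 * solved / total


# Hypothetical results for three tasks
results = {"task_a": (1, 2), "task_b": (3, 3), "task_c": (0, 5)}
print(macro_avg_subtasks_solved(results))  # 50.0
print(micro_avg_subtasks_solved(results))  # 40.0
```

Because each task contributes its own fraction, a model can raise this number by making partial progress on many tasks without fully solving more of them, which is consistent with the subtask metric improving while the success rate does not.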

SYNTHETIC CONTINUED PRETRAINING

Synthetic continued pretraining [29.7]
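To learn a given fact, a model must be trained on hundreds to thousands of diverse representations of it. This work proposes synthetic continued pretraining, which synthesizes a large corpus that is more amenable to learning. The proposal is instantiated with EntiGraph, a synthetic data augmentation algorithm.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 17:21:59 GMT)
A proposal for EntiGraph, which builds synthetic data by way of a knowledge graph. Effectiveness is confirmed: "Synthetic continued pretraining with EntiGraph demonstrates consistent scaling in downstream closed-book QA performance up to a 600M token synthetic corpus, whereas baselines such as continued pretraining on the small corpus or synthetic paraphrases show no improvement or asymptote early."
Is it right to interpret this as saying that going through abstract "knowledge" provides better (more learnable) information than merely rephrasing the text?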
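
One way to picture the difference between rephrasing and going through entities is to look at how many distinct generation prompts a handful of entities can yield. The sketch below is an assumption-laden illustration, not EntiGraph itself: the entity list is supplied by hand, the prompt template and function name are hypothetical, and no language model is called.

```python
from itertools import combinations


def entity_pair_prompts(doc_title, entities):
    """Build one relation-focused prompt per unordered entity pair.

    In a real pipeline an LLM would extract the entities and answer
    each prompt; here we only enumerate the prompts.
    """
    prompts = []
    for a, b in combinations(entities, 2):
        prompts.append(
            f"Based on '{doc_title}', explain how {a} relates to {b}."
        )
    return prompts


# Hypothetical document and entities
prompts = entity_pair_prompts(
    "Example Article",
    ["entity A", "entity B", "entity C", "entity D"],
)
print(len(prompts))  # 6
for p in prompts[:2]:
    print(p)
```

With n entities there are n(n-1)/2 pairs, so the number of distinct angles on the same source grows quadratically, whereas paraphrasing keeps restating the same sentences.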
