LLM – ページ 19 – arXiv最新論文の紹介

Llama 3.1, Mistral Large2, AI models collapse when trained on recursively generated data

Llama3.1が発表された。商用モデルに追いついた公開モデルであり意義は非常に大きい。非商用利用のみであるが、Mistralも強力なモデルMistral Large2を公開している。

Introducing Llama 3.1: Our most capable models to date (meta.com)
- Llama 3.1 – a meta-llama Collection (huggingface.co)
Large Enough | Mistral AI | Frontier AI in your hands
- mistralai/Mistral-Large-Instruct-2407 · Hugging Face

Llama 3.1の学習では特にSFT用データとして合成データがうまく用いられているよう。また、「For example, to ensure Llama 3 is not accidentally overﬁtted on commonly used benchmarks, our pre-training data was procured and processed by a separate team that was strongly incentivized to prevent contamination of that pre-training data with external benchmarks.」という指摘も印象的だった。

上記とは若干論点が異なる気もするが、AI models collapse when trained on recursively generated data | Natureでは「トレーニングにおけるモデル生成コンテンツの無差別使用は、結果のモデルに不可逆的な欠陥を引き起こす。我々は、この効果を「モデル崩壊」と呼び、LLMや変分オートエンコーダで起こりうることを示す。webから収集した大規模データからトレーニングのメリットを維持するためには,真剣に取り組む必要があることを実証する。」と指摘していた。データ合成の悪影響、モデル崩壊についての指摘であり興味深い。

下記のように通常のデータと合成データの混合によってモデル崩壊を避けられるという指摘もある。Data augmentationの限界、機械翻訳だとBack translationの限界のように一定以上の性能向上が無理なのは直観的にはそうだろうと思うが、どの程度までいけるのか気になるところ。

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data [49.7]
各世代の合成データによって元の実データを置き換えることは、モデル崩壊の傾向にあることを示す。生成した実データと連続する合成データの蓄積は,モデル崩壊を回避することを実証する。
論文参考訳（メタデータ） (Mon, 29 Apr 2024 23:13:42 GMT)
実証実験および線警戒機においては理論的に「Our findings extend these prior works to show that if data accumulates and models train on a mixture of “real” and synthetic data, model collapse no longer occurs.」、「Together, these results strongly suggest that the “curse of recursion” may not be as dire as had been portrayed – provided we accumulate synthetic data alongside real data, rather than replacing real data by synthetic data only.」と指摘。

Are Large Language Models Capable of Generating Human-Level Narratives?

Are Large Language Models Capable of Generating Human-Level Narratives? [114.3]
本稿ではストーリーテリングにおけるLLMの能力について考察し,物語の展開とプロットの進行に着目した。本稿では,3つの談話レベルの側面から物語を分析するための新しい計算フレームワークを提案する。談話機能の明示的な統合は、ニューラルストーリーテリングの40%以上の改善によって示されるように、ストーリーテリングを促進することができることを示す。
論文参考訳（メタデータ） (Thu, 18 Jul 2024 08:02:49 GMT)
LLMに物語の理解が可能かの検証。検証しているモデルが若干古めではあるがGemini、Claudeのスコアが高め
リポジトリはGitHub – PlusLabNLP/Narrative-Discourse

Consent in Crisis: The Rapid Decline of the AI Data Commons

Consent in Crisis: The Rapid Decline of the AI Data Commons [74.7]
汎用人工知能(AI)システムは、大量の公開Webデータに基づいて構築されている。我々は,AIトレーニングコーパスに基づくWebドメインに対する同意プロトコルの大規模かつ長期的監査を行う。
論文参考訳（メタデータ） (Sat, 20 Jul 2024 16:50:18 GMT)
「We observe a proliferation of AIspecific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites’ expressed intentions in their Terms of Service and their robots.txt.」という報告。WEB検索のためのrobots.txtとAI利用のための条項が異なるのはそうだろうと思うし、AI利用だとそもそものトラフィック（や広告閲覧）がオリジナルサイトに行かないので問題が大きい。「Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use.」というのも驚き。
リポジトリはGitHub – Data-Provenance-Initiative/Data-Provenance-Collection
SearchGPT is a prototype of new AI search features | OpenAIのような動きとも関連し、この手の問題をどう解決していくかはとても重要。

MINITRON / Compact Language Models via Pruning and Knowledge Distillation

Compact Language Models via Pruning and Knowledge Distillation [61.6]
ミニトロンモデルでは、スクラッチからのトレーニングに比べてMMLUスコアが最大16%改善している。すでにトレーニング済みの15Bモデルから8Bと4Bモデルを抽出するには、スクラッチからトレーニングするよりも、モデル毎のトレーニングトークンを最大40倍少なくする必要があります。
論文参考訳（メタデータ） (Fri, 19 Jul 2024 21:47:57 GMT)
Nemotron 15Bから得られた高性能な8Bモデル及び4Bモデル。pruningとdistillationを組み合わせたベストプラクティスを報告。Gemma2, CriticGPT – arXiv最新論文の紹介 (devneko.jp)のときも蒸留が用いられていたが、大規模なモデルから小規模高性能なモデルを得るような手順が一般的になるのだろうか・・・
リポジトリはGitHub – NVlabs/Minitron: A family of compressed models obtained via pruning and knowledge distillation

sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting

sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting [27.1]
本稿では,多言語合成指導調律データセット sPhinX を作成するための新しいレシピを提案する。 SPhinXは、命令応答対を英語から50言語に選択的に翻訳することで作成される。 Phi-3-Small と Mistral-7B の2つの最先端モデルを微調整するために sPhinX の有効性を検証した。
論文参考訳（メタデータ） (Sat, 13 Jul 2024 13:03:45 GMT)
「To mitigate this issue, we prompt GPT-4 to selectively translate the instructions, so that the tasks are translated into the appropriate language without changing the semantic meaning.」とLLMを用いた機械翻訳を有効に使った多言語fine tuning。
「We devise LAnguage-Specific N-shot Guided Instruction fine-tuning (LANG) strategy for enhancing the multilingual capabilities of LLMs」を含め有効だとは思うのだが現時点ではライセンス上使いにくい・・・（ライセンス的にOKなNemotronだと現実的なのか気になるところ）

Learning to Refuse: Towards Mitigating Privacy Risks in LLMs

Learning to Refuse: Towards Mitigating Privacy Risks in LLMs [6.7]
大規模言語モデル(LLM)は、自然言語の理解と生成において顕著な能力を示す。本研究は、LLMが完全再トレーニングを必要とせず、特定の個人のプライベートデータを保護できることの課題に対処する。プライバシ保護のためのネーム・アウェア・アンラーニング・フレームワーク(NAUF)を導入する。
論文参考訳（メタデータ） (Sun, 14 Jul 2024 03:05:53 GMT)
Machine Unlearningのためのベンチマーク、RETURN: Real-world pErsonal daTa UnleaRNing datasetを構築。NameAware Refusal Answer（個人名に対する質問への回答拒否）とContrastive Data Augmentation（個人に対する質問を拡張しデータ不足を解消）を用いたNAUF: Name-Aware Unlearning Framework で優れた性能を達成と報告。
リポジトリはGitHub – zhliu0106/learning-to-refuse: Official Implementation of “Learning to Refuse: Towards Mitigating Privacy Risks in LLMs”

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models [32.3]
多様な機能にもかかわらず、Large Language Models (LLM) は様々な長所と短所を示す。これらの課題に対処するため、最近の研究はLLMの協調戦略を探求している。本稿では,この新たな研究領域の概要を概観し,そのようなコラボレーションの背景にあるモチベーションを明らかにする。
論文参考訳（メタデータ） (Mon, 08 Jul 2024 16:29:08 GMT)
複数のLLMをうまく使う方法のサーベイ
研究領域がとても広いことがよくわかる（そして絵がかわいい）

LLMBox: A Comprehensive Library for Large Language Models

LLMBox: A Comprehensive Library for Large Language Models [109.2]
本稿では,大規模言語モデル (LLM) の開発, 使用, 評価を容易にするために, 包括的で統一されたライブラリ LLMBox を提案する。このライブラリには,(1)多様なトレーニング戦略の柔軟な実装を支援する統一データインターフェース,(2)広範囲なタスクやデータセット,モデルをカバーする包括的な評価,(3)ユーザフレンドリさや効率性など,より実践的な考慮,という3つのメリットがある。
論文参考訳（メタデータ） (Mon, 08 Jul 2024 02:39:33 GMT)
LLM関連のもろもろを集めたライブラリ。必要なものが集まっていると便利というのと、GPUメモリの必要量などの情報がまとまっているのもありがたい。
リポジトリはGitHub – RUCAIBox/LLMBox: A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.3]
事前訓練された大規模言語モデル(LLM)は、現在、自然言語処理タスクの大部分を解決するための最先端技術である。 LLM2LLMは、教師のLLMを使って小さなシードデータセットを強化するデータ拡張戦略である。 GSM8Kデータセットでは最大24.2%、CaseHOLDでは32.6%、SNIPSでは32.0%、TRECでは52.6%、SST-2では39.8%の改善が達成された。
論文参考訳（メタデータ） (Sat, 13 Jul 2024 07:36:49 GMT)
fine tuning用のデータを拡張していくフレームワークの提案。間違った部分に注目するアプローチでLlama-2-7Bを用いて有効性を検証とのこと。
リポジトリはGitHub – SqueezeAILab/LLM2LLM: [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

GPT-4o mini、MistralNeMo、DCLM 7B、Qwen2、Granite

先週もLLM関連の話題は多かった。

GPT-3.5よりもコスト・速度に優れ性能も高いGPT-4o miniはビジネスでの注目度が高い。

OSS関連でのアップデートも多かった。

MistralとNVIDIAが協力して開発した小型で強力なMistral NeMo（Mistral NeMo | Mistral AI | Frontier AI in your hands、mistralai/Mistral-Nemo-Base-2407 · Hugging Face）
AppleによるDCLM 7B（apple/DCLM-7B · Hugging Face＆関連：DataComp-LM: In search of the next generation of training sets for language models – arXiv最新論文の紹介 (devneko.jp)）
Qwen2についてのテクニカルレポート（Audio-Language含む）
長いコンテキストに対応したGranite

上記には要注目。公開モデルの動きも速い。

Qwen2 Technical Report [139.8]
Qwen2は、前機種のQwen1.5を含む、これまでのほとんどのオープンウェイトモデルを上回っている。言語理解、生成、多言語習熟、コーディング、数学、推論に関する様々なベンチマークにおいて、プロプライエタリなモデルと比較して、競争力のあるパフォーマンスを示す。 Qwen2は、英語、中国語、スペイン語、フランス語、ドイツ語、アラビア語、ロシア語、韓国語、日本語、タイ語、ベトナム語など、約30の言語で熟練した堅牢な多言語機能を示している。
論文参考訳（メタデータ） (Mon, 15 Jul 2024 12:35:42 GMT)
GLM-4-9B, Qwen2 – arXiv最新論文の紹介 (devneko.jp)のテクニカルレポート。強力なパフォーマンスを発揮し、多言語性能も高い。
リポジトリはGitHub – QwenLM/Qwen2: Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

Qwen2-Audio Technical Report [73.9]
本稿では,Qwen2-Audioと呼ばれる大規模オーディオ言語モデルであるQwen-Audioの最新動向を紹介する。 Qwen2-Audioは、様々な音声信号入力を受け入れ、音声解析や音声指示に対する直接テキスト応答を行うことができる。我々はQwen2-Audioの指示追従能力を高め、音声チャットと音声分析のための2つの異なる音声対話モードを実装した。
論文参考訳（メタデータ） (Mon, 15 Jul 2024 14:38:09 GMT)
「According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities.」と強力な性能を主張するAudio-Languageモデル。
リポジトリはGitHub – QwenLM/Qwen2-Audio: The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.

Scaling Granite Code Models to 128K Context [37.3]
本稿では,最大128Kトークンの効率的なコンテキストウィンドウをサポートする長文グラナイト符号モデルを提案する。 2K/4Kから128KまでのGranite 3B/8B符号モデルのコンテキスト長のスケーリングソリューションは、軽量な継続事前トレーニングで構成されている。私たちは、研究と商用の両方のために、Apache 2.0ライセンスの下で、長いコンテキストのGraniteコードモデルをリリースします。
論文参考訳（メタデータ） (Thu, 18 Jul 2024 17:46:02 GMT)
IBMのGraniteも128Kと長いコンテキストに対応
リポジトリはGitHub – ibm-granite/granite-code-models: Granite Code Models: A Family of Open Foundation Models for Code Intelligence

2025年12月
月	火	水	木	金	土	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31