LLM – Page 12 – arXiv最新論文の紹介

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models [24.3]
The GSM8K benchmark is widely used to evaluate models' mathematical reasoning on grade-school-level questions. GSM-Symbolic is an improved benchmark generated from symbolic templates. The results show that LLMs exhibit noticeable variance when responding to different instantiations of the same question.
Reference translation of the paper (metadata) (Mon, 07 Oct 2024 17:36:37 GMT)
This paper introduces a benchmark ("We introduce GSM-Symbolic, an enhanced benchmark that generates diverse variants of GSM8K questions using symbolic templates"), but the finding that "We show that LLMs exhibit more robustness to changes in superficial elements like proper names but are very sensitive to changes in numerical values" is a rather striking result.
Results appear to get even worse on GSM-NoOp, which adds irrelevant information ("To create the templates, we add seemingly relevant but ultimately inconsequential statements to GSM-Symbolic templates."), so the difficulty is not explained by simple leakage alone.
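
As a rough illustration of what "generating variants from a symbolic template" can look like, here is a minimal Python sketch. The template text, the name list, the value ranges, and the function names (`instantiate`, `make_variants`, `add_noop`, `accuracy_spread`) are all hypothetical and are not taken from the paper or its code.

```python
import random

# Hypothetical template: one proper-name slot and three numeric slots.
TEMPLATE = (
    "{name} has {x} apples and buys {y} more bags with {z} apples each. "
    "How many apples does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Aiko", "Mateo"]


def instantiate(rng):
    """Sample one concrete question together with its ground-truth answer."""
    name = rng.choice(NAMES)
    x = rng.randint(2, 20)
    y = rng.randint(2, 9)
    z = rng.randint(2, 12)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y * z  # the answer is derived from the sampled values
    return question, answer


def make_variants(n, seed=0):
    """Generate n instantiations of the same template (reproducible via seed)."""
    rng = random.Random(seed)
    return [instantiate(rng) for _ in range(n)]


def add_noop(question, distractor="Five of the apples are slightly smaller than average."):
    """NoOp-style variant: insert a statement that does not change the answer."""
    body, sep, ask = question.rpartition(" How many")
    return f"{body} {distractor}{sep}{ask}"


def accuracy_spread(per_set_accuracy):
    """Range (max - min) of accuracy measured on differently sampled sets."""
    return max(per_set_accuracy) - min(per_set_accuracy)
```

Scoring a model on several sets produced by `make_variants` with different seeds, and then looking at `accuracy_spread`, is one simple way to observe the kind of variance across instantiations that the paper reports.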

Small Language Models: Survey, Measurements, and Insights

Small Language Models: Survey, Measurements, and Insights [21.2]
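Small language models (SLMs) have received significantly less academic attention than large language models (LLMs). The paper surveys 59 state-of-the-art open-source SLMs and analyzes their technical innovations along three axes: architectures, training datasets, and training algorithms.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 06:36:56 GMT)
A survey of SLMs under the definition "The weight range of SLMs in this work is defined between 100M to 5B."
The repository is GitHub – UbiquitousLearning/SLM_Survey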

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms [34.8]
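Large language models (LLMs) have made remarkable progress in natural language processing. However, their expensive memory and compute requirements pose significant challenges for practical deployment. Low-bit quantization has emerged as a key approach to mitigating these challenges by reducing the bit-width of model parameters.
Reference translation of the paper (metadata) (Wed, 25 Sep 2024 07:38:02 GMT)
A survey of low-bit quantization, also related to A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B – arXiv最新論文の紹介 (devneko.jp).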
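
To make "reducing the bit-width of model parameters" concrete, the following is a minimal sketch of symmetric uniform quantization in plain Python. It is a textbook-style illustration and not a method taken from the survey; the function names and the per-tensor scaling choice are assumptions made for this example.

```python
def quantize_symmetric(weights, bits=4):
    """Map float weights to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    w = [0.7, -0.35, 0.0, 0.1]
    q, s = quantize_symmetric(w, bits=4)
    print(q, s, dequantize(q, s))
```

With round-to-nearest, the reconstruction error of each weight is at most half the scale, which is why fewer bits (a larger scale) trade memory for accuracy.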

Law of the Weakest Link: Cross Capabilities of Large Language Models

Law of the Weakest Link: Cross Capabilities of Large Language Models [102.9]
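We show that large language models (LLMs) exhibit the “Law of the Weakest Link”. These results highlight the underperformance of LLMs on cross-capability tasks.
Reference translation of the paper (metadata) (Mon, 30 Sep 2024 05:12:01 GMT)
Solving a problem requires a variety of abilities, but current LLM evaluation is limited to one capability at a time and does not assess combined ability (the ability to solve problems that cut across multiple capabilities). This paper runs such an evaluation and summarizes its findings. The result is not counterintuitive: "we demonstrated that LLMs consistently conform to the “Law of the Weakest Link,” where cross-capability performance is constrained by the weakest ability."
The repository is GitHub – facebookresearch/llm-cross-capabilities: Official implementation for “Law of the Weakest Link: Cross capabilities of Large Language Models”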

Japanese Gemma 2 2B, Liquid Foundation Models (LFMs), Meta Movie Gen, CulturalBench

The announcements that caught my attention last week were Google's release of the Japanese version of Gemma 2 (Google Developers Japan: 日本語版 Gemma 2 2B を公開 (googleblog.com)), Liquid AI's announcement of Liquid Foundation Models (LFMs) (Liquid Foundation Models: Our First Series of Generative AI Models), and Meta's announcement of its video generation AI, Meta Movie Gen (Meta Movie Gen).

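The first is a small, high-performing model that suggests the potential of language-specific models. The announcement also states that Google will work with the research team of Professor Naoaki Okazaki (Department of Computer Science, School of Computing, Institute of Science Tokyo) to support open model development in Japan and to pursue new technologies, so the focus extends beyond the Japanese language itself to aspects such as cultural understanding. Efforts to build benchmarks, such as CulturalBench which came out last week, are also active.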

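The second is reportedly an LLM that is not based on the GPT-style architecture. No paper or technical report has been released, so it is hard to say much, but it looks like an approach that makes attention more efficient rather than a state space model. Long-context processing is said to be far more efficient, so expectations are high.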

The last is Meta's text-to-video generation AI, which supports editing as well as plain generation and can take a source still image as input. It reportedly outperforms other models: "On text-to-video generation, we outperform prior state-of-the-art, including commercial systems such as Runway Gen3 (RunwayML, 2024), LumaLabs (LumaLabs, 2024), OpenAI Sora (OpenAI, 2024) on overall video quality".
(Added 10/19) The paper has appeared on arXiv, so I added it below.

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs [75.8]
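We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for assessing cultural knowledge. Models are evaluated in two setups, CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at 61.5% and the worst (Llama3-8b) at 21.4%.
Reference translation of the paper (metadata) (Thu, 03 Oct 2024 17:04:31 GMT)
A cultural benchmark covering 45 countries.
The repository is CulturalBench – a Hugging Face Space by kellycyy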

Movie Gen: A Cast of Media Foundation Models [133.4]
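We present Movie Gen, a cast of foundation models that generates high-quality 1080p HD videos. We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 16:22:46 GMT)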

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis [19.4]
Probing techniques for large language models (LLMs) have focused mainly on English, overlooking the vast majority of the world's languages. The authors run experiments on several open-source LLMs and analyze probing accuracy, layer-wise trends, and the similarity between probing vectors across multiple languages.
Reference translation of the paper (metadata) (Sun, 22 Sep 2024 14:14:05 GMT)
An analysis of model behavior across languages. The findings are: "(1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages."
As I also felt with Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in? – arXiv最新論文の紹介 (devneko.jp), this kind of behavioral analysis is very interesting.
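
For readers unfamiliar with probing, the sketch below trains a linear (logistic-regression) probe on feature vectors and compares two probe vectors by cosine similarity. It is only an illustration of the general technique: the "hidden states" here are synthetic random vectors generated by a hypothetical `make_synthetic_states` helper, not activations from a real LLM, and the paper's actual setup may differ.

```python
import math
import random


def make_synthetic_states(n, dim, seed):
    """Stand-in for per-layer hidden states: the label is encoded along dimension 0."""
    rng = random.Random(seed)
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 0.2) for _ in range(dim)]
        x[0] += 1.0 if y == 1 else -1.0
        feats.append(x)
        labels.append(y)
    return feats, labels


def train_probe(feats, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD; returns (weights, bias)."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, feats, labels):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for x, y in zip(feats, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


def cosine(u, v):
    """Cosine similarity between two probe weight vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)
```

In a real study, one probe would be trained per language and per layer on actual model activations; probe accuracy then gives the layer-wise trends, and `cosine` between probe vectors gives a cross-language similarity measure.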

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.6]
This work introduces a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. The authors also develop PatentMIA, a Chinese-language benchmark, to evaluate the performance of detection methods for LLMs on Chinese text.
Reference translation of the paper (metadata) (Mon, 23 Sep 2024 07:55:35 GMT)
Proposes DC-PDD, a method for pretraining data detection (the task of detecting what was used for pretraining), together with a benchmark. "The pretraining data detection problem can be viewed as an instance of the membership inference attack (MIA) task (Shokri et al., 2017), where the primary objective is to determine if a particular text was part of a target LLM’s training corpus."
According to the paper, "DC-PDD computes the divergence between the token probability distribution and the token frequency distribution for detection."
The repository is GitHub – zhang-wei-chao/DC-PDD
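
The idea of calibrating token probabilities against token frequencies can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: the cross-entropy-style per-token score, the Laplace smoothing, the first-occurrence filter, the clipping value, and the threshold are all assumptions made for this example, and the exact formulation in the paper may differ.

```python
import math


def calibrated_score(token_ids, model_probs, ref_counts, ref_total, vocab_size, clip=10.0):
    """Average per-token score that weights model probability by token rarity.

    token_ids   : token ids of the text under test
    model_probs : probability the target LLM assigned to each of those tokens
    ref_counts  : token -> count in a reference corpus (assumed to be available)
    """
    seen = set()
    scores = []
    for tok, p_model in zip(token_ids, model_probs):
        if tok in seen:  # assumption: score only the first occurrence of a token
            continue
        seen.add(tok)
        # Laplace-smoothed token frequency in the reference corpus
        p_ref = (ref_counts.get(tok, 0) + 1) / (ref_total + vocab_size)
        # High model probability on a rare token yields a large score
        scores.append(min(-p_model * math.log(p_ref), clip))
    return sum(scores) / len(scores) if scores else 0.0


def is_pretraining_member(score, threshold):
    """Flag the text as likely seen in pretraining when the score is high enough."""
    return score >= threshold
```

The intuition is that a high probability on a common token says little, while a high probability on a token that is rare in the reference corpus is stronger evidence that the model has seen the text.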

Llama3.2, Molmo, EMOVA

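Last week there was a lot of news about multimodal, openly available LLMs. Llama 3.2 is an update to Llama whose 90B model is said to rival GPT-4o mini, and Molmo's 72B model is said to compete with GPT-4o. Open models are catching up with commercial ones, and I am very much looking forward to what comes next.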

Although it does not appear to be an open model, EMOVA, which combines multiple models, claims to exceed Gemini Pro 1.5 and GPT-4V and to reach over 95% of GPT-4o's score.

Llama 3.2
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (meta.com)
Llama 3.2 – a meta-llama Collection (huggingface.co)

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [146.2]
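Molmo is a new family of VLMs that are state-of-the-art in their class of openness. The key innovation is a novel, highly detailed image caption dataset collected from human annotators. The authors plan to release all model weights, captioning and fine-tuning data, and source code in the near future.
Reference translation of the paper (metadata) (Wed, 25 Sep 2024 17:59:51 GMT)
The project site is molmo.allenai.org/blog. "The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation." The authors built a dataset called PixMo (Pixels for Molmo), and its quality is said to contribute to the performance gains.
The demo is Molmo by Ai2 (allenai.org) and the repository is Molmo – a allenai Collection (huggingface.co). It is also impressive that it is open source under Apache-2.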

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [150.9]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones. This work proposes EMOVA to give large language models end-to-end speech capabilities. EMOVA is the first to achieve state-of-the-art performance on both vision-language and speech benchmarks.
Reference translation of the paper (metadata) (Thu, 26 Sep 2024 16:44:02 GMT)
A multimodal model. "EMOVA exceeds both GPT-4V and Gemini Pro 1.5 significantly on 10 out of 14 benchmarks, while for GPT-4o, EMOVA outperforms on both SEEDBench-Image and OCRBench, reaching over 95% of GPT-4o’s performance on ALL evaluated benchmarks except RealWorldQA." The architecture combines LLaMA-3.1-8B, InternViT-6B, and a speech model (pre-trained by the authors on top of an existing architecture).
The project site is EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotion (emova-ollm.github.io).

EMMA-500, EuroLLM

LLMs that feature multilinguality are also being developed.

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.5]
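EMMA-500 is a large-scale multilingual language model continually trained on text in 546 languages. The results highlight the effectiveness of continual pretraining for expanding the language capacity of large language models.
Reference translation of the paper (metadata) (Thu, 26 Sep 2024 14:40:45 GMT)
Proposes the MaLA Corpus ("It contains 939 languages, 546 of which have more than 100k tokens and are used for training our EMMA-500 model, and 74 billion (B) whitespace delimited tokens in total."), EMMA-500, a Llama 2-based LLM trained on it, and PolyWrite, a benchmark covering 240 languages.
The repository is MaLA-LM (MaLA-LM) (huggingface.co)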
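
The corpus description above (whitespace-delimited token counts, with only languages above 100k tokens used for training) suggests a simple selection rule, sketched here in Python. The input format of `(language_code, text)` pairs and both function names are hypothetical; this is not the MaLA pipeline.

```python
from collections import Counter


def count_tokens_per_language(docs):
    """Count whitespace-delimited tokens per language.

    docs: iterable of (language_code, text) pairs.
    """
    counts = Counter()
    for lang, text in docs:
        counts[lang] += len(text.split())
    return counts


def select_training_languages(counts, min_tokens=100_000):
    """Keep only languages with more than min_tokens tokens."""
    return sorted(lang for lang, n in counts.items() if n > min_tokens)
```

Whitespace splitting undercounts tokens for languages written without spaces, so a real pipeline would likely need language-aware handling; the sketch only mirrors the counting unit quoted above.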

EuroLLM: Multilingual Language Models for Europe [76.9]
The paper introduces the EuroLLM project, which aims to develop open-weight multilingual LLMs. It outlines the progress so far, details the data collection and filtering process, and reports performance on multilingual general benchmarks and machine translation.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 16:51:36 GMT)
An introduction to an LLM-building project: "EuroLLM project with the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages (Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish) as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian)." The models are small, but the machine translation performance looks decent.
The repository is EuroLLM – a utter-project Collection (huggingface.co)

GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes

GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes [80.6]
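We developed a prompting strategy that enables GPT-4 to conduct interactive homework sessions for high-school students learning English as a second language. We replaced traditional homework with GPT-4 homework sessions and ran a randomized controlled trial (RCT) in four high-school classes. We observed significant improvements in learning outcomes, specifically a greater gain in grammar, and in student engagement.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 11:22:55 GMT)
Confirms the effect of GPT-4-supported homework with an RCT. The authors report "We observed significant improvements in learning outcomes, specifically a greater gain in grammar, and student engagement." and "we do not find evidence of bias towards stronger students or harmful hallucinations."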
