Introducing the Latest arXiv Papers (arXiv最新論文の紹介)

Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Empirical Insights on Fine-Tuning Large Language Models for Question-Answering [50.1]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets. We categorize supervised fine-tuning (SFT) data by how much of the relevant knowledge the pre-trained LLM has memorized. Experiments show that as few as 60 data points in the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform QA tasks.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 07:38:38 GMT)
The paper points out: "To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes." and "What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model’s capability and at the same time unveil the feature improvement over all classes."
Repository: GitHub – OSU-MLB/Fine-Tuning-Is-Fine-If-Calibrated
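The "simple post-processing calibration" mentioned in the quote can be pictured as adding a constant offset to the logits of classes that were not part of fine-tuning. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the offset `gamma`, and the sample logits are all hypothetical, and how `gamma` would be chosen (for example on held-out data) is left open.

```python
def calibrate_logits(logits, finetuned_classes, gamma):
    """Add a constant offset gamma to every class NOT seen during fine-tuning.

    logits: list of floats, one per class.
    finetuned_classes: set of class indices used in fine-tuning.
    gamma: hypothetical calibration offset (an assumption of this sketch).
    """
    return [
        value if idx in finetuned_classes else value + gamma
        for idx, value in enumerate(logits)
    ]


def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical example: classes 0 and 1 were fine-tuned, class 2 was not.
raw = [2.0, 4.5, 3.9]
calibrated = calibrate_logits(raw, finetuned_classes={0, 1}, gamma=1.0)

raw_prediction = predict(raw)                 # favors a fine-tuned class
calibrated_prediction = predict(calibrated)   # the offset lets class 2 compete
```

The point of the sketch is only that the features stay untouched; the correction is applied to the output scores after the fact.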

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.3]
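Retrieval-Augmented Generation (RAG) has quickly grown into a key paradigm in the development of large language models (LLMs). This survey proposes a unified framework that assesses the trustworthiness of RAG systems along six dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
Reference translation of the paper (metadata) (Mon, 16 Sep 2024 09:06:44 GMT)
Surveys on trustworthy AI are common, but ones that target RAG seem rare.
Repository: GitHub – smallporridge/TrustworthyRAG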

Llama3.2, Molmo, EMOVA

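Last week brought a lot of news about openly released multimodal LLMs. Llama 3.2 is an update to Llama whose 90B model is said to rival GPT-4o mini, and Molmo's 72B model is said to compete with GPT-4o. Open models are catching up with commercial ones, and it will be very interesting to see what comes next.

EMOVA, which combines several models and does not appear to be openly released, claims to beat Gemini Pro 1.5 and GPT-4V and to reach over 95% of GPT-4o's scores.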

Llama 3.2
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (meta.com)
Llama 3.2 – a meta-llama Collection (huggingface.co)

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [146.2]
Molmo is a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected from human annotators. We plan to release all model weights, captioning and fine-tuning data, and source code in the near future.
Reference translation of the paper (metadata) (Wed, 25 Sep 2024 17:59:51 GMT)
The project site is molmo.allenai.org/blog: "The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation." The authors built a dataset called PixMo (Pixels for Molmo), and its quality is said to contribute to the performance gains.
The demo is Molmo by Ai2 (allenai.org) and the repository is Molmo – a allenai Collection (huggingface.co). It is also impressive that this is OSS under Apache 2.0.

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [150.9]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones. This work proposes EMOVA to give large language models end-to-end speech capabilities. EMOVA is the first to achieve state-of-the-art performance on both vision-language and speech benchmarks.
Reference translation of the paper (metadata) (Thu, 26 Sep 2024 16:44:02 GMT)
A multimodal model: "EMOVA exceeds both GPT-4V and Gemini Pro 1.5 significantly on 10 out of 14 benchmarks, while for GPT-4o, EMOVA outperforms on both SEEDBench-Image and OCRBench, reaching over 95% of GPT-4o’s performance on ALL evaluated benchmarks except RealWorldQA." The architecture combines LLaMA-3.1-8B, InternViT-6B, and a speech model (pre-trained by the authors on top of an existing architecture).
The project site is EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotion (emova-ollm.github.io).

EMMA-500, EuroLLM

LLMs that emphasize multilinguality are also being developed.

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.5]
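EMMA-500 is a large-scale multilingual language model continually trained on text in 546 languages. The results highlight the effectiveness of continual pre-training for expanding the language capacity of large language models.
Reference translation of the paper (metadata) (Thu, 26 Sep 2024 14:40:45 GMT)
The paper proposes the MaLA Corpus ("It contains 939 languages, 546 of which have more than 100k tokens and are used for training our EMMA-500 model, and 74 billion (B) whitespace delimited tokens in total."), EMMA-500, a Llama 2-based LLM trained on it, and PolyWrite, a benchmark covering 240 languages.
Repository: MaLA-LM (MaLA-LM) (huggingface.co)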
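The selection rule in the quote (keep languages with more than 100k whitespace-delimited tokens) can be illustrated as follows. This is a minimal sketch under assumptions: the corpus layout (a dict from language code to a list of documents), the function names, and the toy data are hypothetical and are not taken from the MaLA-LM code.

```python
def count_whitespace_tokens(text):
    """Count tokens by splitting on any run of whitespace."""
    return len(text.split())


def select_training_languages(corpus, min_tokens=100_000):
    """Return the languages whose total token count exceeds min_tokens.

    corpus: dict mapping a language code to a list of document strings
            (a hypothetical layout chosen for this sketch).
    """
    totals = {
        lang: sum(count_whitespace_tokens(doc) for doc in docs)
        for lang, docs in corpus.items()
    }
    return sorted(lang for lang, total in totals.items() if total > min_tokens)


# Toy data with a tiny threshold so the example runs instantly.
toy_corpus = {
    "aaa": ["one two three", "four five"],  # 5 tokens
    "bbb": ["one two"],                     # 2 tokens
}
selected = select_training_languages(toy_corpus, min_tokens=3)
```

Whitespace splitting undercounts for languages written without spaces, so a count like this is best read as a rough size measure rather than a tokenizer-level statistic.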

EuroLLM: Multilingual Language Models for Europe [76.9]
This paper introduces the EuroLLM project, which aims to develop open-weight multilingual LLMs. It outlines the progress so far and details the data collection and filtering process. It reports performance on multilingual general benchmarks and on machine translation.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 16:51:36 GMT)
An introduction to an LLM-building project: "EuroLLM project with the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages (Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish) as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian)." The models are small, but their machine translation performance looks decent.
Repository: EuroLLM – a utter-project Collection (huggingface.co)

GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes

GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes [80.6]
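We developed a prompting strategy that lets GPT-4 conduct interactive homework sessions for high-school students learning English as a second language. We replaced traditional homework with GPT-4 homework and ran a randomized controlled trial (RCT) in four high-school classes. We observed significant improvements in learning outcomes, specifically a greater gain in grammar, and in student engagement.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 11:22:55 GMT)
An RCT confirms the effect of using GPT-4 to support homework: "We observed significant improvements in learning outcomes, specifically a greater gain in grammar, and student engagement." and "we do not find evidence of bias towards stronger students or harmful hallucinations."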

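Evaluations of OpenAI o1, A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

A variety of evaluation results for OpenAI o1 are coming out. The evaluation in medical scenarios is especially interesting. With the Gemini update and rumors of Claude 3.5 Opus, competition among commercial models is also intense.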

Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more – Google Developers Blog (googleblog.com)

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.7]
OpenAI's o1 stands out as the first model to use chain-of-thought techniques with a reinforcement learning strategy. This report comprehensively explores o1 across various medical scenarios, examining three key aspects: understanding, reasoning, and multilinguality.
Reference translation of the paper (metadata) (Mon, 23 Sep 2024 17:59:43 GMT)
The assessment is "Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios.", with results surpassing GPT-4o and 3.5.
Repository: A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? (ucsc-vlaa.github.io)

A Case Study of Web App Coding with OpenAI Reasoning Models [1.7]
We present a case study of coding tasks with OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. This led to WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases.
Reference translation of the paper (metadata) (Thu, 19 Sep 2024 06:58:02 GMT)
On WebApp1K (GitHub – onekq/WebApp1k: WebApp1k benchmark), o1 is SoTA, whereas on WebApp1K-Duo (onekq-ai/WebApp1K-Duo-React · Datasets at Hugging Face), which requires longer outputs, it loses to Claude 3.5 Sonnet.
The observation "Specifically, the reasoning mechanism boosts performance when all expectations are captured, meanwhile exacerbates errors when key expectations are missed, potentially impacted by input lengths." is interesting.
A leaderboard is available at WebApp1K Models Leaderboard – a Hugging Face Space by onekq-ai.

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents [0.2]
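We evaluated agents on real-world open-web research tasks of the kind routinely performed in finance and consulting. We built and tested agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best.
Reference translation of the paper (metadata) (Wed, 25 Sep 2024 08:52:49 GMT)
Across evaluations on multiple benchmarks, o1 is strong overall, but the differences by task and by usage appear to be large.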

Can GPT-O1 Kill All Bugs? An Evaluation of GPT-Family LLMs on QuixBugs [2.2]
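This work is inspired by the recent public release of the GPT-o1 models. We compare the effectiveness of different versions of GPT-family models in automated program repair (APR). O1's repair capability exceeds that of earlier GPT-family models, fixing 40 bugs in the benchmark.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 01:49:17 GMT)
An evaluation of o1 on bug fixing. It surpasses GPT-4o.
Repository: GitHub – Tomsawyerhu/GPT-O1-on-QuixBugs: Evaluating GPT-o1 on QuixBugs benchmark.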

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning [61.1]
Current large language models (LLMs) have limited ability to understand table structures and to apply precise numerical reasoning. This paper introduces TART (Tool-Augmented Reasoning framework for Tables), which integrates LLMs with specialized tools. TART has three key components: a table formatter that ensures accurate data representation, a tool maker that develops specific computational tools, and an explanation generator that maintains explainability.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 06:19:59 GMT)
A framework for handling tabular data: "TART consists of a table formatter for accurate data representation, a tool maker for creating specialized tools, and an explanation generator maintaining interpretable explanations." The authors also devised a benchmark and confirmed the effect.
Repository: GitHub – XinyuanLu00/TART: This is the repository for TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
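The three-component flow described above (table formatter, tool maker, explanation generator) can be pictured with a toy pipeline. This is only an illustrative sketch under assumptions: in TART the components are model-driven, whereas here each one is replaced by a small hand-written stand-in, and every function name and the sample table are hypothetical.

```python
def format_table(raw_rows):
    """Table formatter stand-in: turn numeric-looking cells into floats."""
    header, *rows = raw_rows
    parsed = []
    for row in rows:
        record = {}
        for key, cell in zip(header, row):
            try:
                record[key] = float(cell.replace(",", ""))
            except ValueError:
                record[key] = cell  # keep non-numeric cells as text
        parsed.append(record)
    return parsed


def make_sum_tool(column):
    """Tool maker stand-in: build a function that sums one column."""
    def tool(table):
        return sum(record[column] for record in table)
    return tool


def explain(column, result):
    """Explanation generator stand-in: describe the computation in words."""
    return f"Summed column '{column}' over all rows to get {result}."


# Hypothetical table with thousands separators in the numeric column.
raw = [["item", "price"], ["apple", "1,200"], ["pear", "300"]]
table = format_table(raw)
answer = make_sum_tool("price")(table)
explanation = explain("price", answer)
```

The design point is that arithmetic is done by an explicit tool on cleaned data rather than by free-form text generation, and the explanation records which tool was applied.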

SoccerNet 2024 Challenges Results

SoccerNet 2024 Challenges Results [152.9]
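The SoccerNet 2024 challenges are the fourth annual video understanding challenges organized by the SoccerNet team. They aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year's challenges comprise four vision-based tasks.
Reference translation of the paper (metadata) (Mon, 16 Sep 2024 14:12:22 GMT)
Results of SoccerNet (soccer-net.org) 2024.
Includes solution summaries, some with links to repositories.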

Agents in Software Engineering: Survey, Landscape, and Vision

Agents in Software Engineering: Survey, Landscape, and Vision [46.0]
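Large language models (LLMs) have achieved remarkable success and are widely used in various downstream tasks. Many studies that combine LLMs with software engineering (SE) have adopted the concept of agents, either explicitly or implicitly. This paper presents a framework for LLM-based agents in SE with three key modules: perception, memory, and action.
Reference translation of the paper (metadata) (Fri, 13 Sep 2024 17:55:58 GMT)
A survey on the use of agents in software engineering, by a different team from Large Language Model-Based Agents for Software Engineering: A Survey – arXiv最新論文の紹介 (devneko.jp). It focuses on agent-side techniques.
Repository: GitHub – DeepSoftwareAnalytics/Awesome-Agent4SE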

A Controlled Study on Long Context Extension and Generalization in LLMs

A Controlled Study on Long Context Extension and Generalization in LLMs [85.5]
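Broad textual understanding and in-context learning require language models that utilize full document contexts. Because of the implementation challenges of directly training long-context models, many methods have been proposed for extending models to handle long contexts. We implemented a controlled protocol for extension methods with a standardized evaluation, using consistent base models and extension data.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:53:17 GMT)
An evaluation of methods for handling long contexts: "Our study underscores the role of perplexity as a crucial, performance indicator at length and highlights the trade-offs inherent in different attention mechanisms."
Repository: GitHub – Leooyii/LCEG: Long Context Extension and Generalization in LLMs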
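Perplexity, which the quote highlights as a performance indicator, is the exponential of the average negative log-likelihood per token. The following is a minimal sketch, assuming per-token natural-log probabilities are already available from some model; the function name and sample values are hypothetical and are not taken from the LCEG repository.

```python
import math


def perplexity(token_log_probs):
    """Compute perplexity from a list of per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical case: the model assigns probability 0.25 to each of four tokens,
# i.e. it is as uncertain as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 4
ppl = perplexity(uniform_log_probs)
```

Lower values mean the model finds the text more predictable, which is why the measure is convenient for comparing context-extension methods at a fixed evaluation length.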
