Introducing the Latest arXiv Papers

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models [24.3]
The GSM8K benchmark is widely used to evaluate models' mathematical reasoning on grade-school-level questions. GSM-Symbolic is an improved benchmark generated from symbolic templates. The results show that LLMs exhibit noticeable variance when responding to different instantiations of the same question.
Reference translation of the paper (metadata) (Mon, 07 Oct 2024 17:36:37 GMT)
This is a benchmark paper ("We introduce GSM-Symbolic, an enhanced benchmark that generates diverse variants of GSM8K questions using symbolic templates"), but the finding that "We show that LLMs exhibit more robustness to changes in superficial elements like proper names but are very sensitive to changes in numerical values" is quite striking.
On GSM-NoOp, which adds irrelevant information ("To create the templates, we add seemingly relevant but ultimately inconsequential statements to GSM-Symbolic templates."), results apparently get even worse, so the difficulty does not look like something a simple data leak alone would explain.
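To make the idea of symbolic templates concrete, here is a minimal sketch of how variants of one question could be generated by resampling names and numbers, with an optional GSM-NoOp-style distractor sentence. The story, the name list, the value ranges, and the function name `make_variant` are all hypothetical and are not taken from the paper or its data.

```python
import random

# Hypothetical template pieces; not taken from the GSM-Symbolic paper.
STORY = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples."
)
# A seemingly relevant but inconsequential statement (GSM-NoOp style).
NOOP = "Five of the apples are a bit smaller than average."
QUESTION = "How many apples does {name} have left?"
NAMES = ["Sophie", "Liam", "Mia", "Noah"]


def make_variant(rng, add_noop=False):
    """Instantiate the template once and return (question, answer)."""
    name = rng.choice(NAMES)
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    # Constrain z so the answer is never negative.
    z = rng.randint(1, x + y)
    parts = [STORY]
    if add_noop:
        parts.append(NOOP)  # does not change the answer
    parts.append(QUESTION)
    question = " ".join(parts).format(name=name, x=x, y=y, z=z)
    return question, x + y - z


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_variant(rng, add_noop=True))
```

Scoring a model over many such instantiations, rather than one fixed wording, is what makes the variance across variants visible.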

A Survey on the Honesty of Large Language Models

A Survey on the Honesty of Large Language Models [115.8]
Honesty is a fundamental principle for aligning large language models (LLMs) with human values. Despite their promise, current LLMs still exhibit significant dishonest behaviors.
Reference translation of the paper (metadata) (Fri, 27 Sep 2024 14:34:54 GMT)
A survey that opens with "Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don’t know and be able to faithfully express their knowledge."
The repository is GitHub – SihengLi99/LLM-Honesty-Survey

Loki: An Open-Source Tool for Fact Verification

Loki: An Open-Source Tool for Fact Verification [49.5]
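Loki is an open-source tool designed to address the growing problem of misinformation. It splits long texts into individual claims, assesses their check-worthiness, generates queries, retrieves evidence, and verifies the claims. Loki is released under the MIT license and is available on GitHub.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 17:52:41 GMT)
An open-source fact-checking tool. After decomposing the text into the facts (claims) to be checked, it verifies them using web search results.
The repository is GitHub – Libr-AI/OpenFactVerification: Loki: Open-source solution designed to automate the process of verifying factuality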

Small Language Models: Survey, Measurements, and Insights

Small Language Models: Survey, Measurements, and Insights [21.2]
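Small language models (SLMs) receive significantly less academic attention than large language models (LLMs). The paper surveys 59 state-of-the-art open-source SLMs and analyzes their technical innovations along three axes: architecture, training datasets, and training algorithms.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 06:36:56 GMT)
A survey of SLMs under the definition "The weight range of SLMs in this work is defined between 100M to 5B."
The repository is GitHub – UbiquitousLearning/SLM_Survey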

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms [34.8]
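Large language models (LLMs) have made remarkable progress in natural language processing. However, their expensive memory and compute requirements pose significant challenges for practical deployment. Low-bit quantization has emerged as an important approach to mitigating these challenges by reducing the bit-width of model parameters.
Reference translation of the paper (metadata) (Wed, 25 Sep 2024 07:38:02 GMT)
A survey on low-bit quantization, also related to A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B – arXiv最新論文の紹介 (devneko.jp).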
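As a rough illustration of what "reducing the bit-width of model parameters" means, the sketch below applies plain symmetric round-to-nearest quantization to a list of weights. This is a generic textbook scheme chosen for illustration, not a specific method from the survey; the function names and the per-tensor scale are assumptions.

```python
def quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Returns integer codes in [-qmax, qmax] and the scale needed to map
    them back to floats. Illustrative only.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.0]
    codes, scale = quantize(weights, bits=4)
    print(codes, dequantize(codes, scale))
```

Each weight is then stored as a small integer plus a shared scale, and the reconstruction error per weight is at most half the scale, which is the memory-versus-accuracy trade-off at the heart of the topic.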

Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.6]
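The paper introduces Can-Do, a benchmark dataset designed to evaluate embodied planning abilities. The dataset contains 400 multimodal samples, each consisting of a natural-language user instruction, visual images depicting the environment, state changes, and the corresponding action plan. It also proposes NeuroGround, a neuro-symbolic framework that first grounds plan generation in the perceived environment state and then leverages a symbolic planning engine to augment model-generated plans.
Reference translation of the paper (metadata) (Sun, 22 Sep 2024 00:30:11 GMT)
A multimodal dataset for measuring embodied planning ability across diverse scenarios, together with NeuroGround, which uses a symbolic engine to solve these tasks.
The repository is Can-Do! A Dataset for Embodied Planning with Large Multimodal Models (embodied-planning.github.io)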

LLaVA-Critic: Learning to Evaluate Multimodal Models

LLaVA-Critic: Learning to Evaluate Multimodal Models [110.1]
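This paper introduces LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator. LLaVA-Critic is trained on a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios.
Reference translation of the paper (metadata) (Thu, 03 Oct 2024 17:36:33 GMT)
A proposal for a model that evaluates outputs on multimodal tasks. The data construction also relies heavily on MLLMs, which is interesting, though it leaves some concern about whether this is acceptable from a licensing standpoint.
The project site is LLaVA-OneVision: Easy Visual Task Transfer (llava-vl.github.io)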

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.3]
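The paper introduces Easy2Hard-Bench, a collection of six benchmark datasets spanning diverse domains. Each problem in these datasets is annotated with a numerical difficulty score. It comprehensively analyzes performance and generalization ability across difficulty levels.
Reference translation of the paper (metadata) (Fri, 27 Sep 2024 03:49:56 GMT)
A dataset created on the grounds that "While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank." It shows interesting trends, but for the areas that LLM benchmarks mainly target, even separating problems by difficulty looks like it would be hard.
The repository is furonghuang-lab/Easy2Hard-Bench · Datasets at Hugging Face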

Emu3: Next-Token Prediction is All You Need

Emu3: Next-Token Prediction is All You Need [45.1]
Emu3 is a suite of state-of-the-art multimodal models trained solely with next-token prediction. Emu3 outperforms well-established task-specific models on both generation and perception tasks. It can also generate high-fidelity video by predicting the next token in a video sequence.
Reference translation of the paper (metadata) (Fri, 27 Sep 2024 16:06:11 GMT)
A simple and strong claim: "Our results provide compelling evidence that next-token prediction can serve as a powerful paradigm for multimodal models, scaling beyond language models and delivering state-of-the-art performance across diverse tasks, including challenging video generation."
The repository is GitHub – baaivision/Emu3: Next-Token Prediction is All You Need

The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends

The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends [65.0]
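Conversation analysis (CA) discovers and analyzes critical information in conversation data. This paper thoroughly reviews and systematizes CA tasks and summarizes existing work. It derives four key steps of CA: conversation scene reconstruction, in-depth attribution analysis, targeted training, and conversation generation.
Reference translation of the paper (metadata) (Sat, 21 Sep 2024 16:52:43 GMT)
A survey of conversation analysis under the definition "Conversation analysis aims to identify critical information from human-human, human-machine, machine-machine, and multi-party conversations, derive the underlying causes, and develop the solutions to drive relevant improvements for more effective goal achievement continuously, such as elevating customer experience, reducing complaint rate."
There is a wide variety of tasks, and analysis along these axes is also interesting.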
