April 2024 – Page 3 – Introduction to the Latest arXiv Papers

Stream of Search (SoS): Learning to Search in Language

Stream of Search (SoS): Learning to Search in Language [29.8]
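In this paper, we show how language models can be taught to search by representing the process of search in language as a flattened string. We propose a unified language for search that captures an array of different symbolic search strategies. The results suggest that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
Reference translation of the paper (metadata) (Mon, 01 Apr 2024 06:50:52 GMT)
A report suggesting that search strategies can be taught to language models. "We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory." and "The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers." Transformers are remarkably powerful.
Repository: kanishkg/stream-of-search (github.com)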

AutoRace: AUTOmated ReAsoning Chain Evaluation

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [25.5]
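We introduce AutoRace for fully automated reasoning chain evaluation. We also develop LLM Reasoners, a library for standardized modular implementation of existing and new reasoning algorithms.
Reference translation of the paper (metadata) (Mon, 08 Apr 2024 06:35:09 GMT)
A benchmark for evaluating the reasoning process, with automated evaluation using GPT-4.
Project site: Home | Reasoners (llm-reasoners.net)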

METAL: Towards Multilingual Meta-Evaluation

METAL: Towards Multilingual Meta-Evaluation [12.9]
In this study, we propose a framework for end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios. We create a dataset covering 10 languages that contains native speaker judgments for the summarization task. We compare the performance of LLM-based evaluators built with GPT-3.5-Turbo, GPT-4, and PaLM2.
Reference translation of the paper (metadata) (Tue, 02 Apr 2024 06:14:54 GMT)
Proposes a multilingual LLM evaluation framework; GPT-4 is, as expected, strong. However: "Finally, we analyze human and LLM reasoning and observe that LLMs often provide incorrect justifications for their scores, thus showing that more research is needed to be able to use LLM-based evaluators with confidence in the multilingual setting." This is a fairly common observation.
Repository: hadarishav/METAL: Code and data repo for NAACL’24 findings paper “METAL: Towards Multilingual Meta Evaluation” (github.com)

VisualWebBench

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.6]
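Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks. Evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
Reference translation of the paper (metadata) (Tue, 09 Apr 2024 02:29:39 GMT)
A benchmark of web understanding tasks targeting multimodal LLMs. At "VisualWebBench consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains." it is a reasonably large benchmark. Results vary by task, but on average the ranking is Claude Sonnet > GPT-4V > Claude Opus > LLaVA-1.6-34B > Gemini Pro, which is somewhat surprising. I am somewhat tempted to build a Japanese version.
Repository: VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?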

Blessing or curse? A survey on the Impact of Generative AI on Fake News

Blessing or curse? A survey on the Impact of Generative AI on Fake News [45.0]
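It is now possible to automate the mass creation of high-quality, individually targeted fake news. This survey provides a comprehensive review of research and practical use of generative AI for fake news detection and creation in 2024.
Reference translation of the paper (metadata) (Wed, 03 Apr 2024 19:14:45 GMT)
A survey of the impact of generative AI on fake news, covering both creation and detection.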

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples [17.4]
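We analyze how well pre-trained large language models such as Llama2, GPT-4, and Claude 3 can perform linear and non-linear regression when given in-context examples. Several large language models can perform regression tasks with performance rivaling (or even outperforming) traditional supervised methods.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 08:12:43 GMT)
Some parts look questionable, such as the Gradient Boosting results, and given how the datasets were constructed, the LR results arguably amount to an unfair comparison. Still, the results are interesting. It would be very interesting if the model were in effect predicting the underlying formula. I would like to verify this on data unknown to the LLM (data guaranteed to be free of leakage).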
robertvacareanu/llm4regression: Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update (github.com)
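The setup described above, giving a model (input, output) examples in its context without any parameter update, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: the function name `build_regression_prompt`, the "Feature i / Output" text layout, and the synthetic function `y = 3*x0 - 2*x1` are all assumptions made here for illustration.

```python
def build_regression_prompt(examples, query):
    """Format (features, target) pairs plus one unanswered query as plain text.

    The text layout is a hypothetical choice; the paper's exact prompt may differ.
    """
    lines = []
    for features, target in examples:
        for i, value in enumerate(features):
            lines.append(f"Feature {i}: {value:.2f}")
        lines.append(f"Output: {target:.2f}")
        lines.append("")  # blank line between examples
    for i, value in enumerate(query):
        lines.append(f"Feature {i}: {value:.2f}")
    lines.append("Output:")  # the model is expected to continue with a number
    return "\n".join(lines)


# Synthetic data from a hypothetical linear function y = 3*x0 - 2*x1.
inputs = [(1.0, 0.5), (0.0, 2.0), (2.0, 1.0)]
examples = [((x0, x1), 3 * x0 - 2 * x1) for x0, x1 in inputs]
prompt = build_regression_prompt(examples, (1.5, 1.5))
print(prompt)
```

The resulting string would be sent to an LLM as-is, and the number it generates after the final "Output:" would be compared against the true value and against conventional regressors fitted on the same examples.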

Language Imbalance Can Boost Cross-lingual Generalisation

Language Imbalance Can Boost Cross-lingual Generalisation [57.3]
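In this work, we study language imbalance, a non-intuitive novel driver of cross-lingual generalisation. We observe that the presence of a dominant language during training boosts the performance of less frequent languages. As we extend the analysis to real languages, infrequent languages still benefit from frequent ones, but whether language imbalance causes cross-lingual generalisation is not conclusive.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 17:58:05 GMT)
An interesting result: "In both settings, we find that, without vocabulary overlap, our models do not show strong cross-lingual generalisation when trained on a balanced language set. However, when training on an imbalanced mix of languages, we observe increased performance compared to monolingual settings." The differences between cloned languages and real languages are also interesting.
Repository: antonschafer/xling-imbalance (github.com)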

LLM2Vec

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.4]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. LLM2Vec is a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
Reference translation of the paper (metadata) (Tue, 09 Apr 2024 02:51:05 GMT)
Embeddings using LLMs. Proposes a method for building an embedding model from any CausalLM, with excellent results. The approach is simple, yet it is not entirely clear to me why it is so effective.
This passage from the paper is very interesting: "Based on these findings (we replicate these results for other inputs and other Mistral models in Appendix F) and the strong unsupervised results for Mistral-7B with bidirectional attention, we speculate that Mistral models are pre-trained with some form bidirectional attention, e.g., prefix language modeling (Raffel et al., 2020) – at least for some parts of its training."
Repository: McGill-NLP/llm2vec: Code for ‘LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders’ (github.com)

Is Cosine-Similarity of Embeddings Really About Similarity? [46.8]
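Cosine similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. We study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine similarity can yield arbitrary and therefore meaningless similarities.
Reference translation of the paper (metadata) (Fri, 8 Mar 2024 16:48:20 GMT)
Cosine similarity is apparently not always the best choice; I wonder how that applies to this method.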
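As a small illustration of the definition above, the following sketch computes cosine similarity as the dot product of normalized vectors. It also shows one way arbitrariness can arise: rescaling each embedding dimension leaves the dot products against inversely rescaled vectors unchanged, while the cosine similarity between the rescaled vectors changes. The vectors, the scaling factors, and the "item"/"user" naming are hypothetical values chosen here, and this is only a toy demonstration, not the paper's derivation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product of u and v after normalizing each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def rescale(v, factors):
    """Multiply each dimension of v by the matching factor."""
    return [a * s for a, s in zip(v, factors)]


# Hypothetical embeddings.
item_a = [1.0, 2.0]
item_b = [2.0, 1.0]
user = [0.5, 1.5]

# Arbitrary per-dimension scaling and its inverse.
d = [10.0, 0.1]
d_inv = [1.0 / s for s in d]

item_a_scaled = rescale(item_a, d)
item_b_scaled = rescale(item_b, d)
user_scaled = rescale(user, d_inv)

# The user-item dot products are identical before and after rescaling...
score_before = dot(user, item_a)
score_after = dot(user_scaled, item_a_scaled)

# ...but the item-item cosine similarity is not.
cos_before = cosine_similarity(item_a, item_b)
cos_after = cosine_similarity(item_a_scaled, item_b_scaled)

print(score_before, score_after)
print(cos_before, cos_after)
```

If a model is trained only to get the dot products right, both sets of embeddings fit equally well, so the cosine similarity between items depends on a scaling that the training objective does not pin down.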

ReaLMistake

Evaluating LLMs at Detecting Errors in LLM Responses [30.6]
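This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. We use ReaLMistake to evaluate error detectors based on 12 large language models.
Reference translation of the paper (metadata) (Thu, 04 Apr 2024 17:19:47 GMT)
A benchmark for detecting errors in LLM output. The conclusion, "Our experiments on this benchmark with error detectors based on 12 LLMs show that detecting mistakes in LLMs (GPT-4 and Llama 2 70B) is challenging even for recent LLMs.", is what one would expect, but it is interesting that there seem to be tasks that are both hard for LLMs to solve and hard to detect errors in.
Repository: psunlpgroup/ReaLMistake: This repository includes a benchmark and code for the paper “Evaluating LLMs at Detecting Errors in LLM Responses”. (github.com)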

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation [16.3]
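We introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation. By employing a Siamese encoder and innovating a Mamba fusion mechanism, it effectively selects essential information from different modalities. The method is rigorously evaluated on RGB-Thermal and RGB-Depth segmentation tasks.
Reference translation of the paper (metadata) (Fri, 05 Apr 2024 17:59:44 GMT)
Proposes a Mamba-based multi-modal semantic segmentation model. Applications in the image domain may also be promising.
Repository: zifuwan/Sigma: Python implementation of Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation (github.com)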
