arXiv – Page 3 – Introducing the Latest arXiv Papers

METAL: Towards Multilingual Meta-Evaluation

METAL: Towards Multilingual Meta-Evaluation [12.9]
This study proposes a framework for end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios. The authors create a dataset covering 10 languages with native speaker judgments for the summarization task. They compare the performance of LLM-based evaluators built on GPT-3.5-Turbo, GPT-4, and PaLM2.
Reference translation of the paper (metadata) (Tue, 02 Apr 2024 06:14:54 GMT)
A proposal for a multilingual LLM evaluation framework; GPT-4 is, as expected, strong. However: "Finally, we analyze human and LLM reasoning and observe that LLMs often provide incorrect justifications for their scores, thus showing that more research is needed to be able to use LLM-based evaluators with confidence in the multilingual setting." This is a point that gets made fairly often.
The repository is hadarishav/METAL: Code and data repo for NAACL’24 findings paper “METAL: Towards Multilingual Meta Evaluation” (github.com)
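As a rough illustration of what comparing an LLM evaluator against native speaker judgments can look like, here is a minimal sketch that computes simple percent agreement between two lists of scores. The function name `percent_agreement`, the score scale, and the choice of percent agreement as the measure are assumptions made for illustration; they are not taken from the METAL paper or its repository.

```python
def percent_agreement(human_scores, llm_scores):
    """Fraction of items where the LLM evaluator gives the same score as the human.

    Hypothetical helper for illustration; not the metric implementation from METAL.
    """
    if len(human_scores) != len(llm_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("score lists must not be empty")
    matches = sum(h == l for h, l in zip(human_scores, llm_scores))
    return matches / len(human_scores)


# Made-up scores on an assumed 0-2 scale for five summaries.
human = [2, 1, 0, 2, 1]
llm = [2, 1, 1, 2, 0]
agreement = percent_agreement(human, llm)
print(f"agreement: {agreement:.2f}")
```

Percent agreement ignores chance agreement, so a chance-corrected statistic would be a natural next step when the score distribution is skewed.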

VisualWebBench

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.6]
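Multimodal Large Language Models (MLLMs) have shown promise on Web-related tasks. Evaluating their performance in the Web domain remains a challenge because comprehensive benchmarks are lacking. VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of Web tasks.
Reference translation of the paper (metadata) (Tue, 09 Apr 2024 02:29:39 GMT)
A benchmark of Web understanding tasks targeting multimodal LLMs. At "VisualWebBench consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains." it is a reasonably large benchmark. Results vary by task, but on average the ranking is Claude Sonnet > GPT-4V > Claude Opus > LLaVA-1.6-34B > Gemini Pro, which is somewhat surprising. I am half tempted to build a Japanese version.
The repository is VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?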

Blessing or curse? A survey on the Impact of Generative AI on Fake News

Blessing or curse? A survey on the Impact of Generative AI on Fake News [45.0]
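It is now possible to automate the mass creation of high-quality, individually targeted fake news. This survey offers a comprehensive examination of the research and practical use of generative AI for fake news detection and creation as of 2024.
Reference translation of the paper (metadata) (Wed, 03 Apr 2024 19:14:45 GMT)
A survey examining the impact of generative AI from both sides: fake news creation and fake news detection.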

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples [17.4]
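The authors analyze how well pre-trained large language models such as Llama2, GPT-4, and Claude 3 can perform linear and non-linear regression when given in-context examples. Several large language models can perform regression tasks with performance rivaling (or even exceeding) that of traditional supervised methods.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 08:12:43 GMT)
Some parts look questionable, such as the GradientBoosting results, and given how the datasets were constructed I also wonder whether the LR results amount to cheating, but the results are interesting. It would be very interesting if the models were actually trying to predict the underlying formula. I would like to see this verified on data unknown to the LLM (data guaranteed to be free of leakage).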
robertvacareanu/llm4regression: Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update (github.com)
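To make the setup concrete, here is a minimal sketch of how (input, output) pairs could be serialized into a prompt so that an LLM completes the output for a new input without any parameter update. The function name `build_regression_prompt` and the exact text layout ("x0: ...", "Output: ...") are assumptions for illustration and are not necessarily the template used in the llm4regression repository.

```python
def build_regression_prompt(examples, query):
    """Serialize in-context (features, target) pairs plus one unanswered query.

    Hypothetical prompt layout for illustration only.
    """
    def format_features(features):
        # Name each feature x0, x1, ... and round to two decimals.
        return ", ".join(f"x{i}: {value:.2f}" for i, value in enumerate(features))

    blocks = [
        f"{format_features(features)}\nOutput: {target:.2f}"
        for features, target in examples
    ]
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{format_features(query)}\nOutput:")
    return "\n\n".join(blocks)


# Made-up data roughly following y = 2 * x0 + x1.
examples = [((1.0, 2.0), 4.0), ((0.5, 1.0), 2.0), ((3.0, 0.0), 6.0)]
prompt = build_regression_prompt(examples, (2.0, 1.0))
print(prompt)
```

The model's text completion after the last "Output:" would then be parsed as a number and compared with the true target.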

Language Imbalance Can Boost Cross-lingual Generalisation

Language Imbalance Can Boost Cross-lingual Generalisation [57.3]
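This study investigates language imbalance as a novel, counterintuitive driver of cross-lingual generalisation. The authors observe that the presence of a dominant language during training boosts the performance of less frequent languages. When the analysis is extended to real languages, infrequent languages still benefit from frequent ones, but whether language imbalance causes cross-lingual generalisation there is not conclusive.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 17:58:05 GMT)
An interesting result: "In both settings, we find that, without vocabulary overlap, our models do not show strong cross-lingual generalisation when trained on a balanced language set. However, when training on an imbalanced mix of languages, we observe increased performance compared to monolingual settings." The whole analysis is interesting, including the differences between cloned languages and real languages.
The repository is antonschafer/xling-imbalance (github.com)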

LLM2Vec

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.4]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. LLM2Vec is a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
Reference translation of the paper (metadata) (Tue, 09 Apr 2024 02:51:05 GMT)
Embeddings using LLMs. The paper proposes a method for building an embedding model from any CausalLM, with excellent results. The approach is simple as these things go, but I only half understand why it is so effective.
The following passage in the paper is very interesting: "Based on these findings (we replicate these results for other inputs and other Mistral models in Appendix F) and the strong unsupervised results for Mistral-7B with bidirectional attention, we speculate that Mistral models are pre-trained with some form bidirectional attention, e.g., prefix language modeling (Raffel et al., 2020) – at least for some parts of its training."
The repository is McGill-NLP/llm2vec: Code for ‘LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders’ (github.com)

ReaLMistake

Evaluating LLMs at Detecting Errors in LLM Responses [30.6]
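This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. The authors use ReaLMistake to evaluate error detectors based on 12 large language models.
Reference translation of the paper (metadata) (Thu, 04 Apr 2024 17:19:47 GMT)
An error detection benchmark for LLMs. The conclusion, "Our experiments on this benchmark with error detectors based on 12 LLMs show that detecting mistakes in LLMs (GPT-4 and Llama 2 70B) is challenging even for recent LLMs.", is about what one would expect, but it is interesting that there seem to be tasks that are hard for LLMs to solve and whose errors are also hard to detect.
The repository is psunlpgroup/ReaLMistake: This repository includes a benchmark and code for the paper “Evaluating LLMs at Detecting Errors in LLM Responses”. (github.com)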

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation [16.3]
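The authors introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation. By using a Siamese encoder and an innovative Mamba-based fusion mechanism, it effectively selects essential information from different modalities. The method is rigorously evaluated on RGB-Thermal and RGB-Depth segmentation tasks.
Reference translation of the paper (metadata) (Fri, 05 Apr 2024 17:59:44 GMT)
A proposal for a Mamba-based multi-modal semantic segmentation model. Perhaps applications in the image domain are promising as well.
The repository is zifuwan/Sigma: Python implementation of Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation (github.com)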

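Reports have come out on models that could replace the Transformer architecture. Eagle and Finch are results of the RWKV project (see DBRX, Jamba, Grok-1.5, RWKV Finch – Introducing the Latest arXiv Papers (devneko.jp), among others), presented in a very well-organized paper. RecurrentGemma is an open model that adopts Griffin, covered in 1-bit (1.58-bit) LLMs and HAWK/Griffin – Introducing the Latest arXiv Papers (devneko.jp). I look forward to these new architectures.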

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence [37.0]
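This paper presents Eagle (RWKV-5) and Finch (RWKV-6), sequence models that improve on the RWKV (RWKV-4) architecture. The architectural advances include multi-headed matrix-valued states and a dynamic recurrence mechanism. The authors introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
Reference translation of the paper (metadata) (Wed, 10 Apr 2024 19:34:38 GMT)
The paper for the latest versions of RWKV. Judging from the benchmark results, it is becoming competitive even with the latest Transformer-based architectures. In terms of training compute cost versus performance, it appears to be more cost-effective than Mamba.
The project site is RWKV (RWKV) (huggingface.co)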

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [103.6]
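This paper introduces RecurrentGemma, an open language model that uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 15:27:22 GMT)
This one is an open model based on the Griffin architecture. Compared at the 2B scale, it performs about the same as Gemma, with greatly improved throughput.
The repository is google-deepmind/recurrentgemma: Open weights language model from Google DeepMind, based on Griffin. (github.com), and the models are published on Kaggle: RecurrentGemma | Kaggle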

Rho-1: Not All Tokens Are What You Need

Rho-1: Not All Tokens Are What You Need [132.3]
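"Not all tokens in a corpus are equally important for language model training." Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. When continually pre-trained on the 15B OpenWebMath corpus, Rho-1 yields an absolute improvement of up to 30% in few-shot accuracy on 9 math tasks.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 17:52:01 GMT)
A report that "Selective Language Modeling (SLM), which selectively trains on useful tokens that aligned with the desired distribution." improves final performance. The approach appears to build a reference model on high-quality (desired) documents and use its output to select tokens.
The repository is microsoft/rho: Token-level Data Filtering & Selective Pretraining of LLMs. (github.com)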
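The token selection idea can be sketched roughly as follows. This sketch assumes that each token is scored by how much the training model's loss exceeds the reference model's loss, and that only the top fraction of tokens by that score contributes to the training loss. The function names, the `keep_ratio` parameter, and the loss values are hypothetical and are not taken from the microsoft/rho code.

```python
def select_tokens(model_losses, reference_losses, keep_ratio):
    """Return a boolean mask marking the tokens with the largest excess loss.

    Excess loss here means (training model loss - reference model loss);
    this scoring rule is an assumption made for illustration.
    """
    excess = [m - r for m, r in zip(model_losses, reference_losses)]
    n_keep = max(1, int(len(excess) * keep_ratio))
    # Rank token positions from highest to lowest excess loss.
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(excess))]


def selective_loss(model_losses, mask):
    """Average the training loss over the selected tokens only."""
    kept = [loss for loss, selected in zip(model_losses, mask) if selected]
    return sum(kept) / len(kept)


# Made-up per-token losses for a four-token sequence.
model_losses = [2.0, 0.5, 3.0, 1.0]
reference_losses = [1.9, 0.4, 1.0, 0.2]
mask = select_tokens(model_losses, reference_losses, keep_ratio=0.5)
print(mask)                                # [False, False, True, True]
print(selective_loss(model_losses, mask))  # 2.0
```

In this toy example the tokens where the reference model is much more confident than the training model are kept, and the remaining tokens are dropped from the loss.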
