October 21, 2024 – Introducing the Latest arXiv Papers

Llama-3.1-Nemotron-70B, Ministral, Baichuan-Omni

NVIDIA has released nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · Hugging Face, which claims: "This model reaches Arena Hard of 85.0, AlpacaEval 2 LC of 57.6 and GPT-4-Turbo MT-Bench of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo. As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet." Mistral has announced Ministral, a small but high-performing model family (Un Ministral, des Ministraux | Mistral AI | Frontier AI in your hands). Baichuan-Omni is a multimodal model that handles text, images, video, and audio, and it is reportedly being released as open source. Major announcements of closed commercial models also appear to be on the way, and I look forward to those too, but it is good to see more models released with open weights.

The first is reportedly the Llama-3.1-70B-Instruct model tuned with Llama-3.1-Nemotron-70B-Reward and HelpSteer2-Preference prompts. NVIDIA is also advancing research on highly efficient architectures and is worth watching.

Small, high-performing models like the second and third keep appearing, and I would like to test their performance myself.

HelpSteer2-Preference: Complementing Ratings with Preferences [45.0]
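Reward models are critical for aligning models to follow instructions. They are typically trained in either the Bradley-Terry style or the Regression style, but there is a lack of evidence that either approach is better than the other when adequately matched for data. We therefore propose a novel approach that combines Bradley-Terry and Regression reward modeling.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 06:05:52 GMT)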
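To make the two reward-modeling styles concrete, here is a minimal sketch of the standard per-example objectives: a Bradley-Terry pairwise loss on a preferred/rejected pair, and a squared-error regression loss against a human rating. The function names and the scalar interface are assumptions for illustration. How the paper actually combines the two is not described in the summary above, so no combined objective is shown.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def regression_loss(predicted_reward: float, human_rating: float) -> float:
    """Squared error between the predicted reward and a scalar rating."""
    return (predicted_reward - human_rating) ** 2


# Hypothetical rewards for one preference pair and one rated response.
pair_loss = bradley_terry_loss(reward_chosen=1.5, reward_rejected=0.2)
rating_loss = regression_loss(predicted_reward=3.0, human_rating=4.0)
```

The Bradley-Terry loss only sees which response was preferred, while the regression loss sees an absolute score; that difference is why the two styles need different annotation data.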

nGPT: Normalized Transformer with Representation Learning on the Hypersphere [23.7]
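We propose a novel neural network architecture, the normalized Transformer (nGPT). nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20.
Reference translation of the paper (metadata) (Tue, 01 Oct 2024 23:50:09 GMT)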
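As the title suggests, the core idea is to keep representations on the unit hypersphere. The sketch below shows only that basic operation: L2-normalizing a vector, and taking a step toward a target followed by renormalization. The function names and the interpolation step are a simplified illustration under my own assumptions, not the paper's exact update rule.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]


def hypersphere_step(hidden: list[float], target: list[float], alpha: float) -> list[float]:
    """Move `hidden` a fraction `alpha` toward `target`, then renormalize.

    Assumed, simplified update: the result always has unit norm.
    """
    mixed = [h + alpha * (t - h) for h, t in zip(hidden, target)]
    return l2_normalize(mixed)


hidden = l2_normalize([3.0, 4.0])
updated = hypersphere_step(hidden, l2_normalize([0.0, 1.0]), alpha=0.5)
```

With every vector at unit norm, a dot product between two of them is a cosine similarity bounded in [-1, 1].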

Baichuan-Omni Technical Report [28.3]
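We introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM). We show that it is adept at concurrently processing and analyzing the image, video, audio, and text modalities. We aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
Reference translation of the paper (metadata) (Fri, 11 Oct 2024 06:44:31 GMT)
The repository is GitHub – westlake-baichuan-mllm/bc-omni: Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊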

Agent-as-a-Judge: Evaluate Agents with Agents

Agent-as-a-Judge: Evaluate Agents with Agents [61.3]
This paper introduces the Agent-as-a-Judge framework, in which agentic systems are used to evaluate agentic systems. It is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We also present DevAI, a benchmark of 55 realistic automated AI development tasks.
Reference translation of the paper (metadata) (Mon, 14 Oct 2024 17:57:02 GMT)
Not LLM-as-a-Judge but Agent-as-a-Judge. It does seem likely to be effective. The authors state: "We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline."
The dataset is published at DEVAI-benchmark (DEVAI-benchmark).

BenTo: Benchmark Task Reduction with In-Context Transferability

BenTo: Benchmark Task Reduction with In-Context Transferability [32.6]
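This paper investigates how to efficiently reduce the tasks used to benchmark large language models (LLMs). We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL).
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:41:15 GMT)
LLM evaluation is hard, and this paper proposes a method for efficiently reducing the tasks used for it. The backronym Benchmark Task reductiOn (BENTO) feels like a bit of a stretch, but this is very interesting research.
The repository is GitHub – tianyi-lab/BenTo: Code for “BENTO: benchmark reduction with in-context learning transferability”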
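One way to see how a pairwise transferability score could drive task reduction is the sketch below. It assumes a precomputed matrix `transfer[i][j]` (how well task i stands in for task j, non-negative) and picks k representative tasks with a greedy facility-location heuristic. The matrix values, the function name, and the choice of this particular greedy selection are assumptions for illustration; the summary above does not say how the paper selects tasks.

```python
def greedy_select_tasks(transfer: list[list[float]], k: int) -> list[int]:
    """Greedily pick up to k tasks that best cover all tasks.

    Coverage of task j is the best transfer[i][j] over the selected tasks i.
    Scores are assumed to be non-negative.
    """
    n = len(transfer)
    selected: list[int] = []
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_task = -1.0, -1
        for cand in range(n):
            if cand in selected:
                continue
            # Marginal improvement in total coverage if `cand` is added.
            gain = sum(max(transfer[cand][j] - best_cover[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_gain, best_task = gain, cand
        selected.append(best_task)
        best_cover = [max(best_cover[j], transfer[best_task][j]) for j in range(n)]
    return selected


# Hypothetical scores: tasks 0 and 1 are near-duplicates, task 2 is distinct.
transfer = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = greedy_select_tasks(transfer, k=2)
```

In this toy matrix the selection keeps one of the two near-duplicate tasks plus the distinct one, which is the intuition behind dropping redundant benchmark tasks.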

Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future

Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future [119.9]
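Underwater object detection (UOD) aims to identify and localize objects in underwater images or videos. In recent years, artificial intelligence (AI) based methods, especially deep learning methods, have shown promising performance in UOD.
Reference translation of the paper (metadata) (Tue, 08 Oct 2024 00:25:33 GMT)
A survey of underwater object detection.
The repository is GitHub – LongChenCV/UODReview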
