January 15, 2025 – Introducing the latest arXiv papers

LLM4SR: A Survey on Large Language Models for Scientific Research

LLM4SR: A Survey on Large Language Models for Scientific Research [15.5]
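Large language models (LLMs) offer unprecedented support across the various stages of the research cycle. This paper presents the first systematic survey exploring how LLMs are revolutionizing the scientific research process.
Reference translation of the paper (metadata) (Wed, 08 Jan 2025 06:44:02 GMT)
A survey on using LLMs for research, an application that has started to feel practical since LLMs, and agentic behavior in particular, took off. It covers the whole process, from hypothesis generation through peer review.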

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model [66.2]
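Many Large Vision-Language Models (LVLMs) are trained primarily on English data. The paper investigates how training strategies differ across different sets of languages. The authors train Centurio, a 100-language LVLM, which delivers state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
Reference translation of the paper (metadata) (Thu, 09 Jan 2025 10:26:14 GMT)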
A study of multilingual support in Large Vision-Language Models, focusing on questions such as how many languages can be covered without degrading English performance. The finding that "our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding." is an interesting result. The experiments use the LLaVA architecture with the 3.8B Phi 3.5 and Llama 3 8B.
Following that, based on the result that "After benchmarking different 7-9B parameter LLMs, we find that Aya-Expanse and Qwen 2.5 give the overall best results.", the authors build their models on Aya-Expanse and Qwen 2.5.
The repository is Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
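
To make the "25-50% of non-English data" finding more concrete, here is a minimal sketch of how per-language sampling weights for such a training mix could be computed. It is not taken from the Centurio code: the function name `mixture_weights`, its parameters, and the choice to split the non-English share evenly across languages are all assumptions made for illustration.

```python
def mixture_weights(languages, non_english_share=0.5, english="en"):
    """Return per-language sampling probabilities for a training mix.

    Hypothetical helper: the non-English share is divided uniformly
    across all non-English languages (an assumption, not the paper's
    documented allocation), and English receives the remainder.
    """
    if not 0.0 <= non_english_share <= 1.0:
        raise ValueError("non_english_share must be between 0 and 1")

    others = [lang for lang in languages if lang != english]
    if not others:
        # Nothing to mix in: all probability mass goes to English.
        return {english: 1.0}

    per_language = non_english_share / len(others)
    weights = {lang: per_language for lang in others}
    weights[english] = 1.0 - non_english_share
    return weights
```

With 100 training languages and a 50% non-English share, each of the 99 non-English languages would receive roughly 0.5% of the samples under this uniform split, while English keeps half of them.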

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [104.4]
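The paper proposes a multi-stage training method that learns visual and speech information progressively. The proposed method not only strengthens vision-language capabilities but also makes speech-to-speech dialogue more efficient. By comparing the method against state-of-the-art approaches on image, video, and speech benchmarks, the authors show that the model has strong capabilities in both vision and speech.
Reference translation of the paper (metadata) (Fri, 03 Jan 2025 18:59:52 GMT)
To build a multimodal dialogue model that handles both vision and speech, the paper proposes a three-stage training method. The architecture is described as: "The input side consists of vision and audio encoders, along with their adapters connected to a LLM. The output side has an end-to-end speech generation module, rather than directly using an external TTS model as the initial VITA-1.0 version". Performance is at a level competitive with open and commercial models.
The repository is GitHub – VITA-MLLM/VITA: ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction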
