Multilingual – Page 3 – Introducing the Latest arXiv Papers

Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in? [40.5]
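We investigate whether non-English-centric LLMs, despite their strong performance, "think" in their respective dominant languages. Experiments show that Llama2 relies exclusively on English as its internal latent language, whereas the Japanese-specific Swallow and LLM-jp use both Japanese and English, exhibiting dual internal latent languages. For any given target language, the model preferentially activates the latent language most closely related to it.
Reference translation of the paper (metadata) (Tue, 20 Aug 2024 13:05:41 GMT)
A comparison of multilingual behavior in Llama2, Swallow (its Japanese-enhanced version, built by continued pre-training on Japanese), and LLM-jp (built on a corpus balanced between Japanese and English).
The differences in behavior across the three models, and the differences on the culturally loaded question about the new school term, are interesting.
How would the behavior change for more abstract mathematics or logical processing? Assuming the model's centric language dominates, would that be Japanese for a continually pre-trained model?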

Speech-MASSIVE

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond [36.7]
Speech-MASSIVE is a multilingual spoken language understanding dataset. It covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. The paper also demonstrates the suitability of Speech-MASSIVE for tasks such as speech transcription, language identification, and speech translation.
Reference translation of the paper (metadata) (Wed, 7 Aug 2024 16:55:28 GMT)
A multilingual dataset for spoken language understanding (SLU: "the task of extracting semantic information from spoken utterances, which typically involves subtasks like intent detection and slot filling").
The repository is GitHub – hlt-mt/Speech-MASSIVE: Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus.
The license is CC-BY-SA-4.0. It is a pity that Japanese is not included.
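To make the two SLU subtasks mentioned above concrete, here is a minimal sketch of what one annotated example might look like and how the slots could be read out of it. The inline `[slot : value]` bracket format, the intent label, and the utterance are illustrative assumptions, not taken from the dataset's documentation.

```python
import re

# Hypothetical inline annotation: slots are written as "[slot_name : value]".
# The bracket format and all labels below are illustrative assumptions.
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


def parse_annotated_utterance(annotated: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the plain utterance and the (slot_name, value) pairs it contains."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annotated)]
    # Replace each bracketed span with its value to recover the plain text.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return plain, slots


example = {
    "intent": "alarm_set",  # intent detection: one label per utterance
    "annotated": "wake me up at [time : nine am] on [date : friday]",
}
plain_text, slot_pairs = parse_annotated_utterance(example["annotated"])
```

Intent detection predicts the single `intent` label for the whole utterance, while slot filling recovers the typed spans collected in `slot_pairs`.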

sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting

sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting [27.1]
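This paper proposes a new recipe for creating sPhinX, a multilingual synthetic instruction tuning dataset. sPhinX is created by selectively translating instruction-response pairs from English into 50 languages. The effectiveness of sPhinX is tested by fine-tuning two state-of-the-art models, Phi-3-Small and Mistral-7B.
Reference translation of the paper (metadata) (Sat, 13 Jul 2024 13:03:45 GMT)
Multilingual fine-tuning that makes effective use of LLM-based machine translation: "To mitigate this issue, we prompt GPT-4 to selectively translate the instructions, so that the tasks are translated into the appropriate language without changing the semantic meaning."
I think the approach is effective, including "We devise LAnguage-Specific N-shot Guided Instruction fine-tuning (LANG) strategy for enhancing the multilingual capabilities of LLMs", but for now it is hard to use for licensing reasons. (I wonder whether it would be practical with Nemotron, whose license permits this use.)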

BMIKE-53

BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning [43.1]
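Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of large language models. The paper proposes the BMIKE-53 benchmark for evaluating cross-lingual KE in 53 languages across three types of KE task. The evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability.
Reference translation of the paper (metadata) (Tue, 25 Jun 2024 17:48:56 GMT)
A multilingual knowledge editing benchmark, together with a proposed Multilingual In-context Knowledge Editing (MIKE) method.
The repository is Anonymized Repository – Anonymous GitHub (4open.science)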

X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions [43.9]
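Large language models respond well in high-resource languages such as English but struggle in low-resource languages. This work therefore proposes a new method for constructing cross-lingual instruction samples that pair instructions in English with responses in low-resource languages.
Reference translation of the paper (metadata) (Thu, 30 May 2024 06:45:23 GMT)
A proposed method for building a cross-lingual instructions dataset for low-resource languages in the following three stages (quoted from the repository).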
- X-Instruction Generation: Language models learn to generate cross-lingual instructions for multilingual texts using seed data.
- X-Instruction Refinement: Language models iteratively label and refine cross-lingual instruction samples.
- X-Instruction Diversification: The final instruction data are sampled from different clusters of embedding from the English instruction to increase the diversity.
The repository is GitHub – ZNLP/X-Instruction: Official code and data for ACL-2024 paper "X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions"
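The diversification stage listed above (sampling from different embedding clusters) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny k-means, the 2-D toy embeddings, and the function names are hypothetical, and the text above does not say which clustering method or embedding model the authors actually use.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def sample_diverse(samples, embeddings, k, per_cluster, seed=0):
    """Pick up to `per_cluster` samples from each embedding cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, label in zip(samples, kmeans(embeddings, k)):
        clusters[label].append(sample)
    chosen = []
    for label in sorted(clusters):
        members = clusters[label]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen


# Toy English instructions with made-up 2-D embeddings (assumptions).
samples = ["translate A", "summarize C", "solve E",
           "translate B", "summarize D", "solve F"]
embeddings = [(0.0, 0.0), (5.0, 5.0), (0.0, 9.0),
              (0.1, 0.0), (5.1, 5.0), (0.1, 9.0)]
chosen = sample_diverse(samples, embeddings, k=3, per_cluster=1)
```

Drawing a fixed number of samples per cluster, rather than sampling uniformly from the whole pool, keeps frequent instruction types from crowding out rare ones.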

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.3]
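The rapid development of Large Language Models (LLMs) has demonstrated remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, research on multilingual scenarios remains insufficient. This survey aims to support the research community's efforts on multilingual problems, and provides a comprehensive understanding of the core concepts, key technologies, and latest developments in LLM-based multilingual natural language processing.
Reference translation of the paper (metadata) (Fri, 17 May 2024 17:47:39 GMT)
A survey on multilingual support in LLMs.
The repository is also a useful reference: GitHub – kaiyuhwang/MLLM-Survey: The paper list of multilingual pre-trained models (Continual Updated).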

Why Not Transform Chat Large Language Models to Non-English?

Why Not Transform Chat Large Language Models to Non-English? [57.2]
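The scarcity of non-English data limits the development of non-English large language models (LLMs). TransLLM divides the transfer problem into several common sub-tasks with a translation chain-of-thought. Using only single-turn data, the method outperforms strong baselines and ChatGPT on the multi-turn benchmark MT-bench.
Reference translation of the paper (metadata) (Wed, 22 May 2024 18:53:25 GMT)
A proposed method for adapting an LLM to other languages. The flow is Target Language Pre-Training → Translation Pre-Training → Transfer Fine-Tuning, with translation as the key.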

The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights

The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.4]
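This paper proposes a question alignment method to close the gap between the English and non-English performance of large language models. Experimental results suggest that the question alignment method is effective at improving multilingual performance across diverse reasoning scenarios. To understand the mechanism behind its success, the authors analyze the representation space, chain-of-thought, and translation data scale.
Reference translation of the paper (metadata) (Thu, 02 May 2024 14:49:50 GMT)
A proposed two-stage alignment method (question alignment and response alignment) for improving multilingual performance. The paper further notes that "En-X translation training can implicitly bias LLM to generate non-English chain-of-thought and increase the question-response language consistency." The analysis and interpretation are also interesting.
The repository is GitHub – NJUNLP/QAlign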

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment [38.4]
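English-centric models are usually suboptimal in other languages. This work therefore proposes a new method called CrossIn, which uses a mixed composition of cross-lingual instruction tuning data.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:20:50 GMT)
An instruction tuning approach for improving multilingual capability. It combines "CrossIn: It comprises cross-lingual instruction tuning datasets, where instruction and output are featured in two different languages" with "Trans: It consists of translation pairs for instructions." The reasoning behind the latter, "We hypothesize that if the model concurrently learns these translation tasks, it could facilitate the transfer of knowledge between languages.", is an interesting hypothesis. The authors also build evaluation data.
The effect of the proposed method is verified using models such as Mistral.
The repository is GitHub – Lingy12/CrossIn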
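The two data types quoted above (cross-lingual instruction samples and instruction translation pairs) could be assembled from parallel instruction data along these lines. This is a minimal sketch under assumptions: the `PARALLEL` records, the field names, the prompt wording, and the one-of-each mixing are hypothetical and are not taken from the CrossIn repository.

```python
import random

# Hypothetical parallel instruction data: the same instruction/output pair
# written in several languages. Contents are illustrative only.
PARALLEL = [
    {
        "en": {"instruction": "Name the capital of Japan.", "output": "Tokyo."},
        "ja": {"instruction": "日本の首都を答えてください。", "output": "東京です。"},
    },
]


def make_crossin_sample(item, src_lang, tgt_lang):
    """Cross-lingual sample: instruction in one language, output in another."""
    return {
        "instruction": item[src_lang]["instruction"],
        "output": item[tgt_lang]["output"],
    }


def make_trans_sample(item, src_lang, tgt_lang):
    """Translation sample: translate the instruction between two languages."""
    return {
        "instruction": f"Translate from {src_lang} to {tgt_lang}: "
        + item[src_lang]["instruction"],
        "output": item[tgt_lang]["instruction"],
    }


def build_mixture(parallel, languages, seed=0):
    """Emit one cross-lingual and one translation sample per parallel item."""
    rng = random.Random(seed)
    mixture = []
    for item in parallel:
        src_lang, tgt_lang = rng.sample(languages, 2)  # two distinct languages
        mixture.append(make_crossin_sample(item, src_lang, tgt_lang))
        mixture.append(make_trans_sample(item, src_lang, tgt_lang))
    return mixture


mixture = build_mixture(PARALLEL, ["en", "ja"])
```

The mixing ratio between the two sample types is a free design choice here; the sketch simply emits one of each per item.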

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.5]
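This paper presents a unified perspective that summarizes recent progress and emerging trends in the Multilingual Large Language Model (MLLM) literature. The authors hope the work can give the community quick access and spur breakthrough research in MLLMs.
Reference translation of the paper (metadata) (Sun, 07 Apr 2024 11:52:44 GMT)
A survey of multilingual LLMs. Both the approaches and the results in this area vary widely, so the survey is welcome, and it also helps that the paper list is organized on the project site.
The project site is MLLM (multilingual-llm.net)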
