arXiv – Page 12 – Introduction to the Latest arXiv Papers

DeepThink: Aligning Language Models with Domain-Specific User Intents

DeepThink: Aligning Language Models with Domain-Specific User Intents [25.5]
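This study proposes DeepThink, a new framework for generating high-quality instructions. DeepThink first generates a few seed questions that mimic real user questions, then simulates conversations to uncover hidden user needs, and finally refines the answers using the conversational context. Experiments show that DeepThink achieves an average performance improvement of 7.92% over a GPT-4-turbo+RAG-based assistant on a real-user test set in the advertising domain.
Reference translation of the paper (metadata) (Sat, 08 Feb 2025 09:04:16 GMT)
A proposal and effectiveness validation of a data synthesis framework plus extras: ": data synthesis based on conversations, data refinement based on conversations, and supervised fine-tuning (SFT) enhanced with retrieval, DeepThink addresses the critical challenge of adapting LLM to understand and meet hidden user needs in vertical domains."
Is the interpretation that the LLM's internal knowledge is effective for addressing users' hidden needs? That seems plausible, and if running something like AgentSociety at large scale is realistic, this could be useful in many fields. (Misuse is also a worry.)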

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging [12.1]
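This paper aims to improve the reasoning capabilities of language-specific large language models (LLMs). DeepSeek R1 excels at reasoning, but it mainly benefits high-resource languages such as English and Chinese. Low-resource languages remain underserved because of the dominance of English-centric training data and model optimizations.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 08:10:45 GMT)
Model merging plus SFT to improve the reasoning ability of LLMs. The authors state: "We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks."
rinna Co., Ltd. appears to take a similar approach in its release of the Japanese large language model series "Qwen2.5 Bakeneko 32B", which builds on Qwen2.5 and DeepSeek R1.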
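To make the idea of model merging concrete, here is a minimal sketch of the simplest variant, linear interpolation of two models' weights. It is not the paper's recipe, which may use a different merging scheme; the function name `merge_linear`, the `alpha` parameter, and the toy weight dictionaries are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy stand-in for a state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, donor: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * donor for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != donor.keys():
        raise ValueError("both models must expose the same parameter names")
    merged: Weights = {}
    for name, base_vals in base.items():
        donor_vals = donor[name]
        if len(base_vals) != len(donor_vals):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * d
            for b, d in zip(base_vals, donor_vals)
        ]
    return merged


# Hypothetical two-parameter "models" sharing one architecture.
language_model = {"layer0.weight": [0.0, 2.0], "layer0.bias": [1.0]}
reasoning_model = {"layer0.weight": [4.0, 2.0], "layer0.bias": [3.0]}
merged = merge_linear(language_model, reasoning_model, alpha=0.25)
```

A small `alpha` keeps the merged model close to the language-specific model, which is one plausible way to trade reasoning ability against target-language performance. Real merges operate on tensors and usually require both checkpoints to share an architecture.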

LM2: Large Memory Models

LM2: Large Memory Models [11.3]
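This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module. Experimental results on the BABILong benchmark show that LM2 outperforms the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks.
Reference translation of the paper (metadata) (Sun, 09 Feb 2025 22:11:42 GMT)
A proposal of the Large Memory Model (LM2), a "decoder-only Transformer architecture enhanced with an auxiliary memory module". This is an extension many people have been waiting for, and I am very curious whether it holds up when validated at a practical (large) scale.
The repository is GitHub – convergence-ai/lm2: Official repo of paper LM2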

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model [33.9]
SmolLM2 is a state-of-the-art "small" (1.7 billion parameter) language model. We overtrain SmolLM2 on about 11 trillion tokens using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We show that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 21:43:16 GMT)
An SLM from Hugging Face. The authors state: "SmolLM2 advances the state-of-the-art for open small LMs through a combination of careful dataset curation and multistage training." They also claim: "SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B."
The repository is SmolLM2 – a HuggingFaceTB Collection

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding [21.9]
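We show that vision-language models (VLMs) excel at common-sense reasoning but struggle to understand the physical world. This paper introduces PhysAgent, a framework that combines the generalization strengths of VLMs with the specialized expertise of vision models. The results suggest that improving VLMs' physical world understanding can help embodied agents such as MOKA.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 03:52:39 GMT)
A benchmark that measures whether VLMs understand physics, together with PhysAgent, an agentic physical world understanding framework.
The current results are, perhaps surprisingly, o1 > InternVL2.5-38B > InternVL2.5-78B > GPT-4o > Gemini-1.5-pro
The project site is PhysBench, and the dataset is USC-GVL/PhysBench · Datasets at Hugging Face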

Top Ten Challenges Towards Agentic Neural Graph Databases

Top Ten Challenges Towards Agentic Neural Graph Databases [56.9]
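Graph databases (GDBs) such as Neo4j and TigerGraph excel at handling interconnected data but lack advanced inference capabilities. This paper proposes Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs with three core functionalities.
Reference translation of the paper (metadata) (Fri, 24 Jan 2025 04:06:50 GMT)
An organized overview of the challenges in realizing Agentic Neural Graph Databases.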

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching [39.5]
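The goal of conversational product search (CPS) is to develop an intelligent chat-based shopping assistant. This paper proposes TRACER, a new method that leverages large language models (LLMs) to generate realistic and natural conversations.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 00:27:13 GMT)
A proposal of a conversation generation method: "We leverage decision tree to explore the vast product search space, and construct a dialogue plan that minimizes the number of search steps required to retrieve a relevant product."
Is the approach of going through a tree structure instead of generating directly perhaps close to Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement – Introduction to the Latest arXiv Papers?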
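The quoted idea of planning a dialogue that narrows the product space in few steps can be illustrated with a toy greedy planner. This is only a sketch under stated assumptions, not TRACER itself: the catalog, the attribute names, and the helpers `best_attribute` and `plan_dialogue` are hypothetical, and the greedy rule (ask about the attribute whose largest value group is smallest) stands in for whatever decision tree the paper actually builds.

```python
from typing import Dict, List, Tuple

Product = Dict[str, str]


def best_attribute(candidates: List[Product], attributes: List[str]) -> str:
    """Pick the attribute whose most common value covers the fewest products."""
    def worst_case(attr: str) -> int:
        counts: Dict[str, int] = {}
        for product in candidates:
            counts[product[attr]] = counts.get(product[attr], 0) + 1
        return max(counts.values())
    return min(attributes, key=worst_case)


def plan_dialogue(catalog: List[Product], target: Product,
                  attributes: List[str]) -> List[Tuple[str, str]]:
    """Return (attribute, answer) turns that narrow the catalog to the target."""
    candidates = list(catalog)
    remaining = list(attributes)
    plan: List[Tuple[str, str]] = []
    while len(candidates) > 1 and remaining:
        attr = best_attribute(candidates, remaining)
        remaining.remove(attr)
        value = target[attr]  # the simulated customer answers truthfully
        plan.append((attr, value))
        candidates = [p for p in candidates if p[attr] == value]
    return plan


# Hypothetical catalog used only for illustration.
catalog = [
    {"name": "A", "category": "shoes", "color": "red", "size": "M"},
    {"name": "B", "category": "shoes", "color": "red", "size": "L"},
    {"name": "C", "category": "shoes", "color": "blue", "size": "M"},
    {"name": "D", "category": "shoes", "color": "blue", "size": "L"},
]
plan = plan_dialogue(catalog, catalog[2], ["category", "color", "size"])
```

In this toy catalog the planner skips `category`, since every product shares it, and asks about `color` and then `size`. Each planned turn could then be handed to an LLM to be phrased as natural dialogue.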

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.9]
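We introduce a new benchmark designed to evaluate the critique capabilities of Large Language Models (LLMs). Unlike existing benchmarks, which typically work in an open-loop fashion, our approach uses a closed-loop methodology that evaluates the quality of the corrections generated from critiques.
Reference translation of the paper (metadata) (Fri, 24 Jan 2025 13:48:10 GMT)
A proposal of a benchmark for evaluating the critique capabilities of LLMs. The authors state: "We investigate three distinct scenarios: self-critique, cross-critique, and iterative critique. Our findings reveal that in nearly all cases, the o1-mini model demonstrates the most impressive performance."
The repository is GitHub – tangzhy/RealCritic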
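The closed-loop idea described above, scoring a critique by whether the correction it produces is right, can be sketched as follows. This is an illustrative outline rather than the benchmark's actual harness: the `Item` fields, the exact-match comparison, and `toy_critic` (a stand-in for a real LLM call) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str
    candidate_solution: str  # possibly flawed solution shown to the critic
    reference_answer: str    # ground-truth final answer


def closed_loop_accuracy(items: List[Item],
                         critique_and_correct: Callable[[Item], str]) -> float:
    """Fraction of items whose post-critique correction matches the reference."""
    if not items:
        raise ValueError("items must not be empty")
    hits = sum(
        1 for item in items
        if critique_and_correct(item).strip() == item.reference_answer.strip()
    )
    return hits / len(items)


def toy_critic(item: Item) -> str:
    # Hypothetical stand-in for an LLM: it only repairs the addition problem
    # and otherwise returns the candidate solution unchanged.
    if "+" in item.question:
        return "4"
    return item.candidate_solution


items = [
    Item("2 + 2 = ?", "5", "4"),
    Item("3 * 3 = ?", "6", "9"),
]
score = closed_loop_accuracy(items, toy_critic)
```

An open-loop evaluation would instead judge the critique text itself, for example whether it flags the solution as wrong. The closed-loop score only rewards critiques that lead to a correct final answer.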

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8]
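This paper introduces Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation method. TAID shows superior performance across various model sizes and architectures in both instruction-tuning and pre-training scenarios. These results indicate that TAID is effective for building high-performing, efficient models and advances the development of more accessible AI technologies.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 05:51:25 GMT)
A proposal of a distillation method: "TAID reduces the gap between teacher and student model throughout the training process by dynamically introducing an intermediate teacher that interpolates teacher and student models to provide a target distribution with a modest capability"
The news release is "Release of TinySwallow-1.5B, a small Japanese language model built with the new TAID method", and the repository is TinySwallow – a SakanaAI Collection
Now that LRMs/LLMs whose licenses permit distillation, such as DeepSeek R1, have appeared, methods of this kind seem to be growing in importance.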
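The "intermediate teacher" in the TAID quote above can be sketched as a target distribution that moves from the student toward the teacher as training proceeds. The code below is a minimal illustration, not the authors' implementation: interpolating in logit space, the helper names, and the fixed linear schedule are assumptions (the method's name indicates that the real interpolation coefficient is adapted over time rather than fixed).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intermediate_target(student_logits: List[float],
                        teacher_logits: List[float],
                        t: float) -> List[float]:
    """Target distribution that is the student at t=0 and the teacher at t=1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    mixed = [(1.0 - t) * s + t * q
             for s, q in zip(student_logits, teacher_logits)]
    return softmax(mixed)


def linear_schedule(step: int, total_steps: int, t_start: float = 0.2) -> float:
    """Hypothetical fixed ramp from t_start to 1.0 (not the adaptive rule)."""
    frac = step / max(1, total_steps)
    return min(1.0, t_start + (1.0 - t_start) * frac)


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


student = [0.1, 0.2, 0.3]
teacher = [2.0, 0.5, -1.0]
t = linear_schedule(step=5, total_steps=10)
target = intermediate_target(student, teacher, t)
loss = kl_divergence(target, softmax(student))
```

Early in training the target stays close to the student's own distribution, so the gap the student has to close at each step remains small. In a real training loop the student logits inside the target would be treated as constants (no gradient) and the loss would be minimized with respect to the student's parameters.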

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [20.8]
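MedXpertQA contains 4,460 questions spanning 17 specialties and 11 body systems. Its MM (multimodal) subset introduces expert-level exam questions with diverse images and rich clinical information.
Reference translation of the paper (metadata) (Thu, 30 Jan 2025 14:07:56 GMT)
A benchmark for the medical domain. It reports results not only for o1 but also for DeepSeek R1, which is a quick turnaround. In these results, o1 scores substantially higher than DeepSeek R1.
The repository is GitHub – TsinghuaC3I/MedXpertQA: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding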
