Introducing the Latest arXiv Papers

An Overview of Large Language Models for Statisticians

An Overview of Large Language Models for Statisticians [109.4]
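Large language models (LLMs) have emerged as transformative tools in artificial intelligence (AI). This paper considers how statisticians could make important contributions to the development of LLMs. We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking, and model adaptation.
Paper reference translation (metadata) (Tue, 25 Feb 2025 03:40:36 GMT)
A survey on LLMs and statistics, with textbook-style content.
From a user's point of view, the section "LLM-Empowered Statistical Analysis" is the interesting part.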

Wikipedia in the Era of LLMs: Evolution and Risks

Wikipedia in the Era of LLMs: Evolution and Risks [2.7]
We analyze the impact of Large Language Models (LLMs) on Wikipedia using existing data, and use simulations to explore potential risks. The results show that Wikipedia articles have been influenced by LLMs, with an impact of roughly 1%-2% in certain categories.
Paper reference translation (metadata) (Tue, 04 Mar 2025 18:58:13 GMT)
A study of the influence LLMs are having on Wikipedia. The authors state: "While the estimation results vary, the influence of LLMs on Wikipedia is likely to become more significant over time. In some categories, the impact has exceeded 2%."
Care is needed when using Wikipedia as evaluation data for translation or RAG. (The paper points out that "If the sentences in machine translation benchmarks are drawn from Wikipedia content shaped by LLMs, the scores of machine translation models are likely to be inflated, potentially reversing the outcomes of comparisons between different models." and that "Wikipedia content processed by LLMs could appear less effective for RAG compared to real Wikipedia content.")

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [96.9]
We introduce SolutionBench, a new benchmark for evaluating a system's ability to generate complete and feasible solutions to engineering problems. We also propose SolutionRAG, a new system that uses tree-based exploration and a bi-point thinking mechanism to generate reliable solutions.
Paper reference translation (metadata) (Fri, 28 Feb 2025 05:23:10 GMT)
Proposes SolutionBench, a benchmark for generating solutions to engineering problems, and SolutionRAG, a method for tackling it. Despite the "RAG" in the name, it is an agentic approach that searches while building a tree: "SolutionRAG employs a bi-point thinking approach, alternating between solution design and review, gradually enhancing the solution’s completeness and reliability."
The repository is GitHub – Li-Z-Q/DeepSolution: DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

QwQ-32B, Jamba 1.6, RWKV7 G1, Aya Vision, Mistral OCR, DeepSeek Open Source Week

There was plenty of news last week as well.

QwQ-32B claims performance competitive with DeepSeek-R1 (671B, 37B active) (QwQ-32B: Embracing the Power of Reinforcement Learning | Qwen). The statement "This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge." conveys how effective reinforcement learning is. Performance has improved substantially over the Preview version (Model Context Protocol (MCP), QwQ, OLMo 2 – arXiv最新論文の紹介; QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen).

Jamba 1.6 is an LLM that claims to outperform competitors such as Mistral, Llama, and Cohere (Jamba 1.6: The Best Open Model for Enterprise Deployment | AI21). It uses a hybrid SSM + Transformer architecture and is said to be fast (The Best Private LLM for Enterprise AI Deployment | AI21). There are two models, Jamba Mini 1.6 (12B active/52B total) and Jamba Large 1.6 (94B active/398B total), and the repository is public (Jamba 1.6 – a ai21labs Collection).

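RWKV has also released a reasoning model, RWKV7-G1 “GooseOne” (RWKV Language Model, BlinkDL/rwkv7-g1 · Hugging Face). The model is currently small, but it is worth watching whether larger reasoning models also work with an architecture like RWKV. (That an LRM-style setup is effective for a state space model feels counterintuitive, or perhaps not; I cannot quite settle on a view. I am very curious about how this develops.)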

Cohere also announced Aya Vision, a parameter-efficient multimodal and multilingual model (Aya Vision: Expanding the worlds AI can see, C4AI Aya Vision – a CohereForAI Collection). Strong LLMs and MLLMs that run in local and on-premises environments are becoming more common.

The Mistral OCR announcement is notable news for Document Understanding (Mistral OCR | Mistral AI). As I also thought with olmOCR – Open-Source OCR for Accurate Document Conversion, MLLM-based Document Understanding looks powerful.

As the name suggests, DeepSeek's Open Source Week saw many libraries released. The infrastructure code is especially interesting.

GitHub – deepseek-ai/open-infra-index: Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

START: Self-taught Reasoner with Tools

START: Self-taught Reasoner with Tools [51.4]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START can perform complex computations, self-checking, exploration of diverse methods, and self-debugging. It significantly outperforms its base model QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
Paper reference translation (metadata) (Thu, 06 Mar 2025 17:11:51 GMT)
Proposes START (Self-Taught Reasoner with Tools), which performs tool-integrated CoT. The two key points are "Hint-infer: code/math data is processed by QwQ, with responses truncated at predefined terminators. Context-aware hints from a Hint-Library are injected at truncation points (including endpoints), and QwQ resumes inference using a code interpreter for Python execution feedback." and "b) Hint-RFT: Hint-infer outputs undergo rule-based scoring, filtering, and content modification to create D_seed." My impression is that rules and templates are being integrated skillfully, and many variations on this kind of technique seem possible.
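The truncate-and-inject step described in the Hint-infer quote can be sketched as follows. This is only an illustration under stated assumptions: the terminator strings, the hint text, and the choice to cut just before the terminator are hypothetical, and the model call and code interpreter that would resume inference are omitted.

```python
# Hypothetical hint; the paper's actual Hint-Library is not reproduced here.
HINT = "Maybe I can use Python to verify this step."


def inject_hint(response: str, terminators: list[str], hint: str) -> str:
    """Cut the response at the earliest terminator and append a hint.

    If no terminator occurs, the hint is appended at the endpoint,
    mirroring the "including endpoints" remark in the quote.
    """
    cut = len(response)
    for term in terminators:
        idx = response.find(term)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest truncation point
    return response[:cut].rstrip() + "\n" + hint


if __name__ == "__main__":
    draft = "Compute 17*23 by hand. Wait, let me recheck."
    # "Wait" and "Alternatively" are assumed terminators for illustration.
    print(inject_hint(draft, ["Wait", "Alternatively"], HINT))
```

In a full pipeline the returned prefix would be fed back to the model so that it continues from the hint, with any generated Python run in an interpreter.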

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment [35.2]
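The proposed approach captures preferences from a well-aligned English model through implicit rewards and transfers them to other languages through iterative training. Llama3 fine-tuned for two iterations improved Win Rate by 12.72% on average and Length Control Win Rate by 5.97% across all training languages on the X-AlpacaEval leaderboard.
Paper reference translation (metadata) (Thu, 06 Mar 2025 17:33:01 GMT)
"we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training." In other words, a method for transferring English preferences to multiple languages. It consists of three parts: "Multilingual Responses Generation, Implicit Cross-lingual Rewarding, Preference Transfer Training".
The repository is GitHub – ZNLP/Implicit-Cross-Lingual-Rewarding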
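As a rough sketch of how an implicit reward can turn sampled responses into a preference pair, the code below assumes the DPO-style form, beta times the log-probability gap between an aligned model and its reference model. That form, the `Candidate` type, the `beta` value, and the log-probabilities are all assumptions for illustration, not details confirmed by the summary above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    logp_aligned: float  # log-prob of the response under the aligned model
    logp_ref: float      # log-prob of the response under its reference model


def implicit_reward(c: Candidate, beta: float = 0.1) -> float:
    """DPO-style implicit reward (assumed form): beta * (log pi - log pi_ref)."""
    return beta * (c.logp_aligned - c.logp_ref)


def build_preference_pair(cands: list[Candidate], beta: float = 0.1) -> tuple[str, str]:
    """Return (chosen, rejected): the highest and lowest implicit reward."""
    ranked = sorted(cands, key=lambda c: implicit_reward(c, beta))
    return ranked[-1].text, ranked[0].text


if __name__ == "__main__":
    # Made-up log-probabilities for three sampled responses.
    cands = [
        Candidate("a", -10.0, -12.0),
        Candidate("b", -15.0, -11.0),
        Candidate("c", -9.0, -9.5),
    ]
    print(build_preference_pair(cands))
```

Pairs built this way could then feed a preference-optimization step for the target language, repeated over iterations.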

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Predictive Data Selection: The Data That Predicts Is the Data That Teaches [19.0]
Predictive Data Selection (PreSelect) is a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. We show that a model trained on 30B tokens selected with PreSelect surpasses the performance of a vanilla baseline trained on 300B tokens.
Paper reference translation (metadata) (Tue, 04 Mar 2025 06:15:27 GMT)
Proposes PRESELECT, a data selection method designed under the hypothesis "Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning." The authors claim it is effective: "PRESELECT demonstrates remarkable performance, with an average absolute improvement of 2.8% over the random selection and 20% gains in Math and Code raw text BPC, which shows a promising trend."
The repository is GitHub – hkust-nlp/PreSelect
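Scorer-based selection of this kind boils down to ranking documents by a score and keeping the top slice. The sketch below shows only that final step; `toy_score` is a hypothetical stand-in for a trained scorer and has nothing to do with the actual PreSelect model.

```python
from typing import Callable


def select_top_fraction(
    docs: list[str], score_fn: Callable[[str], float], fraction: float
) -> list[str]:
    """Keep the highest-scoring `fraction` of documents (at least one)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(docs) * fraction))
    return sorted(docs, key=score_fn, reverse=True)[:k]


def toy_score(doc: str) -> float:
    """Hypothetical scorer: share of whitespace tokens containing a digit."""
    tokens = doc.split()
    if not tokens:
        return 0.0
    return sum(any(ch.isdigit() for ch in tok) for tok in tokens) / len(tokens)


if __name__ == "__main__":
    corpus = ["2 + 2 = 4", "hello world", "x = 3 y", "", "the cat sat"]
    print(select_top_fraction(corpus, toy_score, 0.5))
```

With a real scorer, `score_fn` would wrap the classifier's predicted probability for the "keep" label, and the fraction would be set by the token budget.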

Toward Robust Non-Transferable Learning: A Survey and Benchmark

Toward Robust Non-Transferable Learning: A Survey and Benchmark [51.5]
Non-transferable learning (NTL) is a task aimed at reshaping the generalization abilities of deep learning models. We introduce NTLBench, the first benchmark for evaluating the performance and robustness of NTL. We discuss practical applications of NTL along with future directions and challenges.
Paper reference translation (metadata) (Wed, 19 Feb 2025 10:12:19 GMT)
A survey of Non-Transferable Learning, whose aim is described as: "Its goal is to prevent the model’s generalization to specific target domains or tasks (such as harmful [Rosati et al., 2024; Huang et al., 2024b] or unauthorized domains [Wang et al., 2022b; Si et al., 2024]) while preserving its normal functionality on a source domain."
The benchmark is slated for release. GitHub – tmllab/NTLBench

Shh, don’t say that! Domain Certification in LLMs

Shh, don’t say that! Domain Certification in LLMs [124.6]
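Large language models (LLMs) are often deployed to perform constrained tasks in narrow domains. Domain certification is a guarantee that accurately characterizes the out-of-domain behavior of a language model. We then propose VALID, a simple yet effective approach that provides adversarial bounds as a certificate.
Paper reference translation (metadata) (Wed, 26 Feb 2025 17:13:19 GMT)
Proposes Verified Adversarial LLM Output via Iterative Dismissal (VALID), a method that keeps the model from giving answers outside the intended domain even when inputs are arbitrary.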

EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.4]
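This paper proposes the task of equivalence checking as a new way to evaluate the code reasoning capabilities of large language models. EquiBench is a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. The results show that OpenAI o3-mini reaches a high accuracy of 78.0%.
Paper reference translation (metadata) (Tue, 18 Feb 2025 02:54:25 GMT)
A benchmark for "Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs". o3-mini stands a head above the rest.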
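To make the task concrete, here is a minimal sketch of why equivalence checking is hard to settle by testing alone: random inputs can refute equivalence by finding a counterexample, but never prove it. The three toy programs and the input generator are hypothetical and are not drawn from EquiBench.

```python
import random


def prog_a(xs):
    """Sum of squares with an explicit loop."""
    total = 0
    for x in xs:
        total += x * x
    return total


def prog_b(xs):
    """Sum of squares with a generator expression (equivalent to prog_a)."""
    return sum(x * x for x in xs)


def prog_c(xs):
    """Looks similar, but differs from prog_a whenever an element is negative."""
    return sum(abs(x) * x for x in xs)


def find_counterexample(f, g, trials=1000, seed=0):
    """Return an input on which f and g disagree, or None if none was found.

    None does not prove equivalence; it only means sampling found no witness.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(0, 5))]
        if f(xs) != g(xs):
            return xs
    return None


if __name__ == "__main__":
    print(find_counterexample(prog_a, prog_b))
    print(find_counterexample(prog_a, prog_c))
```

A model evaluated on this kind of task has to reason about all inputs rather than a sample, which is what separates it from ordinary test-based checks.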
