March 12, 2025 – Introducing the latest arXiv papers

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.2]
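This paper introduces CodeCriticBench, a code critique benchmark for Large Language Models (LLMs). Specifically, CodeCriticBench covers two mainstream code tasks (code generation and code QA). In addition, the evaluation protocol includes a basic critique evaluation and an advanced critique evaluation for different characteristics.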
Reference translation of the paper (metadata) (Sun, 23 Feb 2025 15:36:43 GMT)
A benchmark for an unusual task: "To evaluate the critique abilities of LLMs on the code domain, we introduce the first holistic code critique benchmark CodeCriticBench, which includes the critique on both code generation and code QA tasks." DeepSeek-R1 and OpenAI o1-Preview perform strongly.
The repository is GitHub – multimodal-art-projection/CodeCriticBench

Unnatural Languages Are Not Bugs but Features for LLMs [92.8]
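Large Language Models (LLMs) have been observed to process non-human-readable text sequences such as jailbreak prompts, a behavior often regarded as a bug. We present a systematic investigation challenging this perception and show that unnatural languages contain latent features usable by models.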
Reference translation of the paper (metadata) (Sun, 02 Mar 2025 12:10:17 GMT)
A study of the following: "we study a phenomenon named unnatural languages – strings that deviate from natural language syntax and appear extremely noisy to human readers, yet remain understandable to LLMs." As the abstract also notes, this matters because such strings can serve as a starting point for jailbreaks.
The paper states, "These findings strongly demonstrate our key findings: unnatural languages are not bugs but features for LLMs." and "We demonstrate that LLMs process unnatural languages by effectively filtering out irrelevant tokens. Furthermore, LLMs combine relevant tokens from unnatural languages and infer contextual meaning in response to natural version questions." The capabilities of LLMs are impressive.
The repository is GitHub – John-AI-Lab/Unnatural_Language: The official repository of ‘Unnatural Language Are Not Bugs but Features for LLMs’

An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.5]
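Scaling RL training has become a central technique for implementing R1-like reasoning models. We show that our RL training approach consistently improves the Qwen2.5-32B base model. We also explore the use of tool manipulation and find that it greatly boosts the reasoning performance of large reasoning models.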
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 15:34:27 GMT)
A technical report on developing R1-like (o1-like) models, an effort that many research institutions are pursuing. The observation that "By effectively utilizing tool manipulation, STILL-3-TOOL-32B achieves an impressive accuracy of 86.67 (greedy search) on AIME 2024. Remarkably, this ability can be activated with only a small number of high-quality training instances" is interesting, and it suggests that this line of work is now extending to tool use as well.
The repository is GitHub – RUCAIBox/Slow_Thinking_with_LLMs: A series of technical report on Slow Thinking with LLM