September 5, 2025 – Introducing the Latest arXiv Papers

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.8]
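Large language models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool calling, and multi-turn interaction. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 13:00:28 GMT)
A survey of testing for software built on LLMs.
As the conclusion puts it, "A key insight is that LLM application testing is neither a mere extension of traditional software testing nor a straightforward application of AI-security techniques." Software that uses LLMs inevitably behaves dynamically and probabilistically, so testing methods appear to change considerably.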
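To make the point about probabilistic behavior concrete, here is a minimal sketch of one way to test a non-deterministic component: sample it many times and assert on the pass rate of a property rather than on an exact string. This is an illustration only, not the protocol proposed in the paper. The names `pass_rate` and `make_fake_llm`, the canned responses, and the 0.8 threshold are all assumptions; `make_fake_llm` is a seeded stand-in for a real model call.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], trials: int = 50) -> float:
    """Return the fraction of sampled outputs that satisfy `check`."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if check(generate(prompt)))
    return passed / trials


def make_fake_llm(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: returns varied, mostly correct text."""
    rng = random.Random(seed)
    responses = ["The answer is 4.", "2 + 2 = 4", "It equals 4.", "Probably 5."]
    weights = [0.40, 0.30, 0.25, 0.05]  # assumed distribution, ~5% wrong answers

    def generate(prompt: str) -> str:
        # The prompt is ignored here; a real model would condition on it.
        return rng.choices(responses, weights=weights, k=1)[0]

    return generate


fake_llm = make_fake_llm(seed=0)

# Assert on a property ("mentions 4") across many samples, with a tolerance,
# instead of comparing one output against one expected string.
rate = pass_rate(fake_llm, "What is 2 + 2?", lambda out: "4" in out, trials=200)
threshold = 0.8  # assumed acceptance threshold for this sketch
print(f"pass rate: {rate:.2f}, acceptable: {rate >= threshold}")
```

An exact-match assertion would fail intermittently on a component like this, while a threshold on the pass rate tolerates the variation and still catches a real regression. The threshold and the number of trials are trade-offs between test cost and sensitivity.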

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution [13.6]
Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. This paper introduces the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its own output. The experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.
Reference translation of the paper (metadata) (Sun, 17 Aug 2025 07:57:58 GMT)
An unusual benchmark, described as follows: "Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance." It takes a meta-level perspective and is very interesting, results included.
The repository is GitHub – anon-researcher-2025/Self-Execution-Benchmark
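As a rough illustration of what "anticipating a property of its own output" could mean for the refusal case, the sketch below compares a model's prediction of whether it will refuse with what it actually does, and reports the agreement rate. This is not the benchmark's actual task format or scoring. Every name here (`self_prediction_accuracy`, `stub_predict_refusal`, `stub_answer`, `is_refusal`) is hypothetical, and the stubs are simple keyword rules standing in for real model calls.

```python
from typing import Callable, Sequence


def self_prediction_accuracy(
    questions: Sequence[str],
    predict_refusal: Callable[[str], bool],
    answer: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> float:
    """Fraction of questions where the predicted refusal matches the actual one."""
    if not questions:
        raise ValueError("questions must not be empty")
    matches = sum(
        1 for q in questions if predict_refusal(q) == is_refusal(answer(q))
    )
    return matches / len(questions)


def stub_answer(question: str) -> str:
    """Hypothetical model behavior: refuses only questions about passwords."""
    if "password" in question.lower():
        return "I can't help with that."
    return "Sure, here is an answer."


def stub_predict_refusal(question: str) -> bool:
    """Hypothetical self-prediction: over-predicts refusal on 'secret'."""
    q = question.lower()
    return "password" in q or "secret" in q


def is_refusal(response: str) -> bool:
    """Assumed refusal detector for this sketch (prefix match only)."""
    return response.startswith("I can't")


questions = [
    "What is the capital of France?",
    "Tell me the admin password.",
    "Share a secret recipe for bread.",
    "Summarize this paragraph.",
]

accuracy = self_prediction_accuracy(
    questions, stub_predict_refusal, stub_answer, is_refusal
)
print(f"self-prediction accuracy: {accuracy:.2f}")
```

In this toy setup the stub predicts a refusal for the "secret recipe" question but then answers it, so the prediction and the behavior disagree on one of the four questions. That kind of mismatch between self-prediction and behavior is what such a measurement would surface.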

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs [16.6]
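PosterGen is a multi-agent framework that mirrors the workflow of professional poster designers. It produces posters that are semantically grounded and visually appealing. Experiments show that PosterGen consistently matches existing methods in content fidelity and outperforms them in visual design.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 02:25:45 GMT)
A proposal for a multi-agent framework that generates posters from papers.
The repository is GitHub – Y-Research-SBU/PosterGen: Official Code for PosterGen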