September 3, 2024 – Introducing the latest arXiv papers

Can Unconfident LLM Annotations Be Used for Confident Conclusions?

Can Unconfident LLM Annotations Be Used for Confident Conclusions? [34.2]
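Large language models (LLMs) have shown high agreement with humans across a variety of tasks. Confidence-Driven Inference is a method that combines LLM annotations with LLM confidence indicators to strategically select which human annotations should be collected.
Reference translation of the paper (metadata) (Tue, 27 Aug 2024 17:03:18 GMT)
Proposes a method for settings where LLMs and humans share the annotation work: it uses the LLM's annotations and the LLM's confidence to select which items humans should annotate. The authors report: "We demonstrate the effectiveness of CONFIDENCE-DRIVEN INFERENCE over baselines in statistical estimation tasks across three CSS settings—text politeness, stance, and bias—reducing the needed number of human annotations by over 25% in each."
The repository is GitHub – kristinagligoric/confidence-driven-inference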
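
To make the idea of confidence-based selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `floor` parameter, and the rule "sampling probability proportional to 1 - confidence" are all assumptions made for illustration. The estimator is a generic inverse-probability-weighted correction of the LLM labels, not necessarily the one used in the paper.

```python
import random


def sampling_probabilities(confidences, budget, floor=0.05):
    """Hypothetical rule: ask humans more often where the LLM is less confident.

    `budget` is the approximate number of human annotations we can afford;
    `floor` keeps every probability positive so the correction stays valid.
    """
    uncertainty = [1.0 - c for c in confidences]
    total = sum(uncertainty)
    if total == 0:
        # The LLM is fully confident everywhere: spread the budget evenly.
        return [min(1.0, max(floor, budget / len(confidences)))] * len(confidences)
    return [min(1.0, max(floor, budget * u / total)) for u in uncertainty]


def corrected_mean(llm_labels, human_labels, probs, rng):
    """Mean of LLM labels plus an inverse-probability-weighted human correction.

    `human_labels` is only read for items that are actually selected.
    """
    total = 0.0
    for y_llm, y_human, p in zip(llm_labels, human_labels, probs):
        if rng.random() < p:  # this item is sent to a human annotator
            total += y_llm + (y_human - y_llm) / p
        else:
            total += y_llm
    return total / len(llm_labels)
```

For example, `sampling_probabilities([0.9, 0.5, 0.6], budget=1)` gives the item with confidence 0.5 the highest chance of being sent to a human. Because of the clipping to `floor` and 1, the expected number of human annotations only roughly matches `budget`.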

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling [18.2]
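Training on high-quality synthetic data from strong language models (LMs) is a common strategy for improving the reasoning performance of LMs. This work examines the trade-off between generating synthetic data with a stronger but more expensive (SE) model and with a weaker but cheaper (WC) model.
Reference translation of the paper (metadata) (Thu, 29 Aug 2024 17:32:35 GMT)
A comparison of a stronger but more expensive (SE) model and a weaker but cheaper (WC) model for synthetic data generation. The authors state: "Our results indicate that it is more compute-optimal to sample from a WC model as opposed to the common-practice of sampling from a SE model."
The setting described as "3) a new paradigm we introduce called Weak-to-Strong Improvement, where a strong student LM improves using synthetic data from a weaker teacher LM." is also interesting, as is the somewhat surprising finding that it works.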
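
The SE versus WC comparison only makes sense at a matched compute budget, so a small Python sketch of that bookkeeping may help. It rests on two simplifying assumptions that are mine, not quoted from the post: generation cost per token is proportional to parameter count, and both models produce solutions of similar length. The function name and the model sizes in the usage note are hypothetical.

```python
def compute_matched_samples(se_params: int, wc_params: int, se_samples: int) -> int:
    """Samples per question the WC model can draw for the same sampling compute.

    Assumes cost per generated token scales linearly with parameter count
    and that both models generate a similar number of tokens per sample.
    """
    if se_params <= 0 or wc_params <= 0:
        raise ValueError("parameter counts must be positive")
    if se_samples < 0:
        raise ValueError("se_samples must be non-negative")
    # Same budget: se_samples * se_params == wc_samples * wc_params.
    return se_samples * se_params // wc_params
```

With an illustrative 27B-parameter SE model and a 9B-parameter WC model, `compute_matched_samples(27_000_000_000, 9_000_000_000, 10)` returns 30: the cheaper model gets three times as many attempts per question for the same budget, which is where any gain in coverage and diversity would come from.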

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models [33.2]
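Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. To assess agent capabilities, the authors evaluate seven models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
Reference translation of the paper (metadata) (Thu, 15 Aug 2024 17:23:10 GMT)
A benchmark of whether LLMs can solve tasks drawn from CTF competitions. Without guidance, the tasks still look quite hard. At the time of viewing, the ranking was Claude 3.5 Sonnet > GPT-4o > Claude 3 Opus, and the open-weight Llama 3.1 405B Instruct performed considerably worse than the commercial models.
The repository is Cybench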