Attack – Introducing the Latest arXiv Papers

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs [39.9]
We present the first systematic study and jailbreak attack framework, DIJA, that exploits the unique safety weaknesses of dLLMs. The proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanism of dLLMs. This work highlights the need to rethink safety alignment for this emerging class of language models.
Reference translation of the paper (metadata) (Tue, 15 Jul 2025 08:44:46 GMT)
A proposed attack method against dLLMs. As the paper puts it, "By interleaving sets of [MASK] tokens after vanilla malicious prompt, as shown in Figure 2, a dLLM is coerced into generating harmful instructions purely to maintain contextual consistency. Moreover, in contrast to autoregressive LLMs, which generate tokens sequentially and can perform on-the-fly rejection of unsafe continuations, dLLMs decode masked tokens in parallel at each step, substantially limiting the model’s ability to conduct dynamic risk assessment or intervene during generation (e.g., reject sampling for tokens corresponding to harmful contents). Consequently, defenses designed for left-to-right models break down, opening the door to powerful new jailbreak attacks." The attack exploits characteristics of a model family that works differently from causal LMs, and its attack success rate is high.
The repository is GitHub – ZichenWen1/DIJA: code for “The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs”

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 [167.9]
This paper reports the results of the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. The competition involved 86 teams that tested MLLM vulnerabilities through adversarial image-text attacks in two phases: white-box and black-box evaluation. The challenge establishes a new benchmark for MLLM safety evaluation and lays the groundwork for safer AI systems.
Reference translation of the paper (metadata) (Sat, 14 Jun 2025 10:03:17 GMT)
A report on the results of a competition on attacking MLLMs. The techniques used in a competition with this many participating teams are very informative. As the first-place team writes, "In this competition, we proposed an effective multimodal jailbreak strategy by embedding malicious intent within visually structured diagrams, particularly flowcharts, and enhancing it with carefully designed textual prompts. Our approach leveraged the weaknesses in safety alignment of vision-language models, exploiting their tendency to follow structured visual and textual cues." It is interesting how well images are put to use, for example in jailbreaks delivered through flowcharts.
The repository is GitHub – NY1024/ATLAS_Challenge_2025

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge [44.6]
Large language models (LLMs) have demonstrated remarkable intelligence across a variety of tasks. These systems are susceptible to adversarial attacks that can manipulate evaluation results. Existing evaluation methods for LLM-based judges are often fragmented and lack a unified framework for comprehensive assessment.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 06:48:57 GMT)
The paper states: "This work presents the first scalable and fully automated framework to evaluate the robustness and reliability of LLM-as-a-Judge systems across multiple attack scenarios. We systematically benchmarked state-of-the-art LLM-based evaluators under various adversarial settings and found that they are vulnerable to manipulation, often producing biased or incorrect judgments when exposed to crafted inputs." The evaluation is carried out with RobustJudge, a framework designed to systematically assess the robustness of LLM-as-a-Judge systems.
The repository is GitHub – S3IC-Lab/RobustJudge

Antidistillation Sampling

Antidistillation Sampling [98.9]
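Models that generate extended reasoning traces inadvertently produce rich token sequences that facilitate model distillation. Model owners who recognize this vulnerability may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling poisons reasoning traces, substantially reducing the effectiveness of distillation while preserving the model's practical utility.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 17:54:14 GMT)
As the title suggests, a proposal for a sampling strategy that makes distillation difficult.
The project site is Antidistillation Sampling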

Shh, don’t say that! Domain Certification in LLMs

Shh, don’t say that! Domain Certification in LLMs [124.6]
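Large language models (LLMs) are often deployed to perform constrained tasks in narrow domains. Domain certification is a guarantee that accurately characterizes the out-of-domain behavior of a language model. We then propose VALID, a simple yet effective approach that provides adversarial bounds as a certificate.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 17:13:19 GMT)
A proposal of Verified Adversarial LLM Output via Iterative Dismissal (VALID), a method that keeps the model from answering outside the intended domain even when it may receive arbitrary inputs.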

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.8]
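Recent ML security literature has focused on attacks against aligned large language models (LLMs). This paper analyzes security and privacy vulnerabilities specific to LLM agents. We conduct a series of illustrative attacks on popular open-source and commercial agents and demonstrate the immediate practical implications of their vulnerabilities.
Reference translation of the paper (metadata) (Wed, 12 Feb 2025 17:19:36 GMT)
A proposal of attack methods against LLM-based agents. The paper states: "In this paper, we argue that LLM-powered agents, especially those that have the ability to communicate with the outside world via web access or external-facing databases, already pose a massive danger to their users which has largely been overlooked by the ML security and privacy community." I was somewhat surprised that phishing against agents appears to be this feasible. Opinions will differ on whether Reddit can be trusted, but it was unexpected that attacks on current agents are so effective. As the paper also notes, because automation keeps advancing, the developers' response posture seems important.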

OVERTHINKING: Slowdown Attacks on Reasoning LLMs

OVERTHINKING: Slowdown Attacks on Reasoning LLMs [41.7]
The OVERTHINK attack can amplify costs for third-party applications that operate reasoning models. We evaluated the attack on the FreshQA and SQuAD datasets using closed-weight (OpenAI o1, o1-mini, o3-mini) and open-weight (DeepSeek R1) models.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 18:12:41 GMT)
An overthinking attack that degrades reasoning efficiency. The paper reports: "Our experimental results show that OVERTHINK significantly disrupts reasoning efficiency, with attacks on the o1 model increasing reasoning tokens up to 18× and over 10× on DeepSeek-R1."
The approach is described as follows: "Our attack contains three key stages: (1) picking a decoy problem that results in a large number of reasoning tokens, but won’t trigger safety filters; (2) integrating selected decoys into a compromised source (e.g., a wiki page) by either modifying the problem to fit the context (context-aware) or by injecting a general template (context-agnostic), and, (3) optimizing the decoy tasks using an in-context learning genetic (ICL-Genetic) algorithm to select contexts with decoys that provide highest reasoning tokens and maintain stealthiness of the answers to the user." It reminds me of a DoS that relies on computationally expensive regular expressions, and it looks like it could be an effective attack...
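To get a feel for what those multipliers mean for an application that pays per reasoning token, here is a minimal back-of-the-envelope sketch. Only the two factors (up to 18× for o1, over 10× for DeepSeek-R1) come from the paper's quote; the baseline token count and the price per million tokens are hypothetical placeholders, not figures from the paper or from any provider's price list.

```python
def amplified_cost(baseline_tokens: int, factor: float, usd_per_million: float) -> float:
    """Estimated USD cost of the reasoning tokens for one query after amplification."""
    return baseline_tokens * factor * usd_per_million / 1_000_000


# Hypothetical assumptions for illustration only.
BASELINE_TOKENS = 1_000   # reasoning tokens for an un-attacked query (assumed)
USD_PER_MILLION = 60.0    # price per million reasoning tokens (assumed)

# Factors quoted from the paper: an upper bound for o1, a lower bound for DeepSeek-R1.
REPORTED_FACTORS = {"o1": 18.0, "DeepSeek-R1": 10.0}

for model, factor in REPORTED_FACTORS.items():
    before = amplified_cost(BASELINE_TOKENS, 1.0, USD_PER_MILLION)
    after = amplified_cost(BASELINE_TOKENS, factor, USD_PER_MILLION)
    print(f"{model}: ${before:.4f} -> ${after:.4f} per query")
```

Because the cost scales linearly with the factor, whatever baseline and price you plug in, the bill for reasoning tokens grows by the same multiple as the token count.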

It reminded me of a paper that notes, "In rare cases, R1 can get stuck “thinking forever”."

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models [43.2]
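We propose a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our work reveals capability gaps that are not apparent in existing benchmarks.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 18:10:38 GMT)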

o3-mini vs DeepSeek-R1: Which One is Safer?

o3-mini vs DeepSeek-R1: Which One is Safer? [6.1]
DeepSeek-R1 is highly unsafe compared with OpenAI's o3-mini. DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.
Reference translation of the paper (metadata) (Thu, 30 Jan 2025 15:45:56 GMT)
A safety evaluation of DeepSeek R1 and OpenAI o3-mini. Even though it relies on an existing framework, it was published remarkably quickly. (There is a footnote, though: "The team conducting the study was part of the early access safety testing program of OpenAI: https://openai.com/index/early-access-for-safety-testing/".)
The conclusion is: "Our results suggests that OpenAI’s o3-mini LLM is a much safer model than DeepSeek-R1, which answered unsafely to almost 12% of the executed unsafe prompts."
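As a quick sanity check on "much safer", the two percentages from the abstract can be compared directly. The sketch below is only arithmetic on those two reported numbers; the function name is made up for illustration, and nothing here reproduces the paper's evaluation framework.

```python
def relative_unsafe_rate(rate_a_pct: float, rate_b_pct: float) -> float:
    """How many times more often model A answered unsafely than model B."""
    if rate_b_pct <= 0:
        raise ValueError("rate_b_pct must be positive")
    return rate_a_pct / rate_b_pct


# Unsafe-answer rates reported in the abstract (percent of executed prompts).
DEEPSEEK_R1_PCT = 11.98
O3_MINI_PCT = 1.19

ratio = relative_unsafe_rate(DEEPSEEK_R1_PCT, O3_MINI_PCT)
print(f"DeepSeek-R1 answered unsafely about {ratio:.1f}x as often as o3-mini")
```

In other words, on the reported numbers the gap is roughly a factor of ten.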

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents [32.6]
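Agent Security Bench (ASB) is a framework for formalizing, benchmarking, and evaluating attacks and defenses for LLM-based agents. We benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, a mixed attack, and 10 corresponding defenses across 13 LLM backbones. The benchmark results reveal critical vulnerabilities at different stages of agent operation, including the system prompt, user prompt handling, tool usage, and memory retrieval.
Reference translation of the paper (metadata) (Thu, 03 Oct 2024 16:30:47 GMT)
A benchmark of attacks and defenses for agents. When a model's base capability is low, the ASR is low to begin with, but when capability is high the model also appears to become able to refuse attacks. The results are interesting.
The repository is GitHub – agiresearch/ASB: Agent Security Bench (ASB)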

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.4]
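Multimodal large language models (MLLMs) have achieved remarkable success in engaging in conversations that involve visual inputs. Integrating the visual modality introduces a unique vulnerability: MLLMs become susceptible to malicious visual inputs. This paper introduces CoCA, a technique that improves the safety of MLLMs by calibrating their output distribution.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 17:14:41 GMT)
Attacks delivered through malicious images are a problem for MLLMs, and this paper addresses how to counter them.
The observation "We first make the observation that despite the integration of visual modality makes the MLLMs more vulnerable, the inherent safety-awareness of MLLMs still exists." was a "huh, interesting" moment for me.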
