Jailbreak – Introducing the Latest arXiv Papers

Imperceptible Jailbreaking against Large Language Models

Imperceptible Jailbreaking against Large Language Models [107.8]
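Introduces imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to a malicious question, the jailbreak prompt appears visually identical to the original malicious question on screen. The work proposes a search pipeline that generates such adversarial suffixes to induce harmful responses.
Reference translation of the paper (metadata) (Mon, 06 Oct 2025 17:03:50 GMT)
Proposes imperceptible jailbreaks that use invisible Unicode characters.
Repository: GitHub – sail-sg/imperceptible-jailbreaks: [ArXiv 2025] Imperceptible Jailbreaking against Large Language Models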
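
Because the appended characters are invisible on screen, one natural check on the receiving side is to look at code points rather than rendered text. The following is a minimal defensive sketch, not the paper's method: it counts and strips characters in the two main Unicode variation selector blocks (U+FE00..U+FE0F and U+E0100..U+E01EF). The function names are hypothetical and chosen for illustration only.

```python
# Variation selector blocks defined by the Unicode Standard:
#   VS1-VS16:   U+FE00..U+FE0F
#   VS17-VS256: U+E0100..U+E01EF
# Assumption: only these two blocks are checked; other invisible
# characters (zero-width spaces, Mongolian free variation selectors,
# and so on) are out of scope for this sketch.
_VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character ch is a variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _VS_RANGES)


def count_variation_selectors(text: str) -> int:
    """Count variation selector code points in text."""
    return sum(1 for ch in text if is_variation_selector(ch))


def strip_variation_selectors(text: str) -> str:
    """Return text with all variation selectors removed."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A harmless string with three invisible selectors appended.
visible = "hello"
padded = visible + "\ufe01\U000e0100\U000e01ef"

print(len(visible), len(padded))                    # 5 8
print(count_variation_selectors(padded))            # 3
print(strip_variation_selectors(padded) == visible)  # True
```

Stripping every variation selector is a blunt choice: legitimate text uses them too, for example U+FE0F to request emoji presentation and the U+E0100 block for ideographic variation sequences in CJK text. A gentler policy would flag inputs whose selector count is unusually high rather than rewriting them.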

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 [167.9]
This paper reports the results of the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. The competition involved 86 teams testing MLLM vulnerabilities through adversarial image-text attacks in two phases: white-box and black-box evaluation. The challenge establishes a new benchmark for MLLM safety evaluation and lays the groundwork for advancing safer AI systems.
Reference translation of the paper (metadata) (Sat, 14 Jun 2025 10:03:17 GMT)
A report on the results of a competition on attacking MLLMs. The techniques used in a competition with this many participating teams are very instructive. It is interesting how well images are put to use, such as jailbreaking through flowcharts, as described by the first-place team: "In this competition, we proposed an effective multimodal jailbreak strategy by embedding malicious intent within visually structured diagrams, particularly flowcharts, and enhancing it with carefully designed textual prompts. Our approach leveraged the weaknesses in safety alignment of vision-language models, exploiting their tendency to follow structured visual and textual cues."
Repository: GitHub – NY1024/ATLAS_Challenge_2025

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey [50.0]
Multimodal generative models are susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the generation of potentially harmful content. This survey reviews jailbreak attacks and defenses for multimodal generative models.
Reference translation of the paper (metadata) (Thu, 14 Nov 2024 07:51:51 GMT)
A survey of jailbreak attacks in multimodal settings. As the number of modalities grows, so does the variety of attacks, which is interesting (and the difficulty of defending against them is equally notable).
The survey adopts a four-level classification: "1) Input Level: Attackers and defenders operate solely on the input data.", "2) Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space", "3) Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness,", and "4) Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs,".
Repository: GitHub – liuxuannan/Awesome-Multimodal-Jailbreak

Jailbreaking LLM-Controlled Robots

Jailbreaking LLM-Controlled Robots [82.0]
Large language models (LLMs) have revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction. LLMs are vulnerable to jailbreak attacks, in which malicious prompts elicit harmful text by bypassing LLM safety guardrails. The paper introduces RoboPAIR, an algorithm for jailbreaking LLM-controlled robots.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 15:55:36 GMT)
Jailbreak attacks on robots controlled by LLMs. The paper sets up three scenarios, "(i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog.", and reports that "In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates." This could become a serious threat.
Project site: RoboPAIR

JailBreakV-28K

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.7]
This paper examines whether techniques that successfully jailbreak large language models are equally effective at jailbreaking MLLMs. It introduces JailBreakV-28K, a pioneering benchmark for assessing the transferability of LLM jailbreak techniques to MLLMs. The benchmark generates 20,000 text-based jailbreak prompts using advanced LLM jailbreak attacks, alongside image-based jailbreak inputs drawn from recent MLLM jailbreak attacks.
Reference translation of the paper (metadata) (Wed, 03 Apr 2024 19:23:18 GMT)
A jailbreak benchmark for MLLMs. "Our extensive experiments reveal that MLLMs inherit vulnerability from their LLM counterparts." is about what one would expect, but "In addition, text-based jailbreak attacks are more effective than image-based jailbreak attacks and are effective regardless of the image input." is, well...
Repository: JailbreakV-28K/JailBreakV-28k · Datasets at Hugging Face

Weak-to-Strong Jailbreaking on Large Language Models

Weak-to-Strong Jailbreaking on Large Language Models [96.5]
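According to red-teaming reports, large language models (LLMs) can be jailbroken through adversarial prompts, tuning, and decoding. This paper proposes a weak-to-strong jailbreaking attack that uses smaller, unsafe LLMs to guide the jailbreak.
Reference translation of the paper (metadata) (Tue, 30 Jan 2024 18:48:37 GMT)
Reports that a strong (large) model can be jailbroken by analyzing the behavior of a weak (small) model. As the paper listed next shows, the weak-to-strong idea has also been found effective for ordinary fine-tuning, so it is convincing that it carries over to jailbreaking.
Repository: XuandongZhao/weak-to-strong: Weak-to-Strong Jailbreaking on Large Language Models (github.com)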

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.2]
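Superhuman models will behave in complex ways that are difficult for humans to evaluate reliably. Can supervision from a weak model elicit the full capabilities of a much stronger model? The authors find that when strong pretrained models are fine-tuned on labels generated by a weak model, they consistently perform better than their weak supervisors.
Reference translation of the paper (metadata) (Thu, 14 Dec 2023 23:07:33 GMT)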

Multilingual Jailbreak Challenges in Large Language Models

Multilingual Jailbreak Challenges in Large Language Models [96.7]
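This study reveals the presence of multilingual jailbreak challenges in large language models (LLMs). It considers two risk scenarios: unintentional and intentional. It proposes a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
Reference translation of the paper (metadata) (Tue, 10 Oct 2023 09:44:06 GMT)
Proposes multilingual jailbreaks and a defense method; sadly, Japanese is not included.
The results suggest that current implementations are not adequately defended against multilingual prompts. (A defense method is also proposed in this paper.)
Repository: GitHub – DAMO-NLP-SG/multilingual-safety-for-LLMs: Data for “Multilingual Jailbreak Challenges in Large Language Models”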

Jailbroken: How Does LLM Safety Training Fail?

Jailbroken: How Does LLM Safety Training Fail? [92.9]
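"Jailbreak" attacks on early releases of ChatGPT elicit undesired behavior. The paper investigates why such attacks succeed and how they can be created. New attacks that exploit the identified failure modes succeed on every prompt in a collection of unsafe requests.
Reference translation of the paper (metadata) (Wed, 5 Jul 2023 17:58:10 GMT)
An organized overview of jailbreak attacks on LLMs (that is, on services such as their APIs), with evaluation results for GPT-4, Claude v1.3, and GPT-3.5 Turbo. Simple attacks rarely succeed but combined attacks are effective, so countermeasures are in place yet appear far from complete. The appendix is also a useful reference.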
