Fact Checking – Introducing the Latest arXiv Papers

Community Moderation and the New Epistemology of Fact Checking on Social Media

Community Moderation and the New Epistemology of Fact Checking on Social Media [124.3]
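Social media platforms have traditionally relied on independent fact-checking organizations to identify and flag misleading content. X (formerly Twitter) and Meta have shifted toward community-driven content moderation by launching their own versions of crowd-sourced fact checking. The paper examines current approaches to misinformation detection across major platforms, explores the emerging role of community-driven moderation, and critically evaluates both the promise and the challenges of crowd-checking at scale.
Reference translation of the paper (metadata) (Mon, 26 May 2025 14:50:18 GMT)
A survey and evaluation of fact checking (and similar checks) as actually practiced in communities.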

Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation [108.1]
This paper introduces FRANQ (Faithfulness-based Retrieval Augmented Uncertainty Quantification), a new method for detecting hallucinations in RAG output. It also proposes a QA dataset annotated for both factuality and faithfulness.
Reference translation of the paper (metadata) (Tue, 27 May 2025 11:56:59 GMT)
Proposes FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), an uncertainty quantification (UQ) method for RAG.

Holmes: Automated Fact Check with Large Language Models

Holmes: Automated Fact Check with Large Language Models [31.8]
This work uses Large Language Models (LLMs) for automated disinformation detection. It proposes Holmes, an end-to-end framework featuring a novel evidence retrieval method. The approach (1) uses LLM-powered summarization to extract key information from open sources and (2) proposes a new algorithm and metrics for evaluating evidence quality.
Reference translation of the paper (metadata) (Tue, 06 May 2025 03:19:51 GMT)
A paper on fact checking whose careful write-up and Findings are very informative.
- "Finding 1: LLMs CANNOT accurately verify the truthfulness of the claim directly.", "Finding 2: LLMs have shortcomings in searching for claim-relevant public information and their responses may include hallucinated links that weaken result trustworthiness.", "Finding 3: Human-written evidence enhances LLMs’ ability to verify multimodal claims and generate coherent justifications."
Holmes was designed on the basis of these findings, and its effectiveness was reportedly confirmed.

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? [36.8]
CLAIMCHECK is an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. It is richly annotated by ML experts for the weakness statements in the reviews and the paper claims that they dispute, along with fine-grained labels for the validity, objectivity, and type of the identified weaknesses. The authors benchmark LLMs on three claim-centric tasks supported by CLAIMCHECK: (1) associating weaknesses with the claims they dispute, (2) predicting fine-grained labels for weaknesses and rewriting the weaknesses to make them more specific, and (3) verifying a paper's claims with grounded reasoning.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 17:29:45 GMT)
A benchmark proposal: "This work has introduced CLAIMCHECK—a benchmark of reviewer-identified weaknesses in NeurIPS 2023 and 2024 submissions, richly annotated with descriptive labels by experts and grounded in the claims that they dispute in the reviewed papers. Further, we benchmark various LLMs on three novel tasks enabled by CLAIMCHECK—Weakness Labeling and Editing (WLE), Claim Association (CA), and Claim Verification (CV)—all aimed at assisting reviewers during the peer review process." The tasks are difficult for current LLMs.
The repository is reportedly at https://github.com/JHU-CLSP/CLAIMCHECK

Can LLMs Automate Fact-Checking Article Writing?

Can LLMs Automate Fact-Checking Article Writing? [69.9]
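The authors extend the typical fact-checking pipeline and argue for the need to automatically generate full fact-checking articles. They develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers.
Reference translation of the paper (metadata) (Sat, 22 Mar 2025 07:56:50 GMT)
Rather than fact checking in the usual sense, this proposes QRAFT, a framework described as "QRAFT as a multi-agent collaboration that mimics the factchecking article writing process of human experts".
It performs better than other methods, but it is disappointing that "Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles."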

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models [59.2]
This paper proposes FactReasoner, a new factuality assessor that relies on probabilistic reasoning. Experiments on labeled and unlabeled benchmark datasets show that FactReasoner improves considerably over state-of-the-art prompt-based approaches.
Reference translation of the paper (metadata) (Tue, 25 Feb 2025 19:01:48 GMT)
The first stage is the common one: "FactReasoner proceeds in a manner similar to existing prompt-based assessors by decomposing the response into atomic units and retrieving contexts relevant to them from an external knowledge source." What differs is the assessment step: "FactReasoner evaluates the factuality of the atoms by probabilistic reasoning over a graphical model that represents the logical relationships between the textual utterances corresponding to the atoms and contexts."

Loki: An Open-Source Tool for Fact Verification

Loki: An Open-Source Tool for Fact Verification [49.5]
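Loki is an open-source tool designed to address the growing problem of misinformation. It splits long texts into individual claims, assesses their check-worthiness, generates queries, retrieves evidence, and verifies the claims. Loki is released under the MIT license and is available on GitHub.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 17:52:41 GMT)
An OSS fact-checking tool. After decomposing the text into the facts (claims) that should be checked, it fact-checks them using web search results.
Repository: GitHub – Libr-AI/OpenFactVerification: Loki: Open-source solution designed to automate the process of verifying factuality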
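The five stages listed above (claim splitting, check-worthiness filtering, query generation, evidence retrieval, verification) can be sketched as a simple pipeline. This is only an illustration of the stage order, not Loki's actual code or API: every function name here is hypothetical, the heuristics are toy stand-ins, and `retrieve` and `verify` are caller-supplied callables that in a real system would wrap a web search and an LLM judgment.

```python
import re
from typing import Callable, Dict, List


def decompose(text: str) -> List[str]:
    """Toy stand-in for claim decomposition: one sentence = one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_checkworthy(claim: str) -> bool:
    """Toy stand-in: skip questions and explicitly marked opinions."""
    opinion_markers = ("i think", "i feel")
    return not claim.endswith("?") and not claim.lower().startswith(opinion_markers)


def generate_query(claim: str) -> str:
    """Toy stand-in: reuse the claim text without trailing punctuation."""
    return claim.rstrip(".!?")


def run_pipeline(
    text: str,
    retrieve: Callable[[str], List[str]],      # hypothetical: query -> evidence snippets
    verify: Callable[[str, List[str]], str],   # hypothetical: (claim, evidence) -> verdict
) -> List[Dict[str, object]]:
    """Run the five stages in order and collect one record per checked claim."""
    results: List[Dict[str, object]] = []
    for claim in decompose(text):
        if not is_checkworthy(claim):
            continue
        query = generate_query(claim)
        evidence = retrieve(query)
        results.append(
            {
                "claim": claim,
                "query": query,
                "evidence": evidence,
                "verdict": verify(claim, evidence),
            }
        )
    return results
```

Keeping retrieval and verification as injected callables means the stage order can be exercised with stubs before any search backend or model is wired in.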

Data Gemma

DataGemma, announced by Google, is another interesting effort (DataGemma: AI open models connecting LLMs to Google’s Data Commons (blog.google), Grounding AI in reality with a little help from Data Commons (research.google)).

It aims to reduce hallucinations by using Home – Data Commons, and targets RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation) use cases. The models are published at google/datagemma-rig-27b-it · Hugging Face and google/datagemma-rag-27b-it · Hugging Face.

For RIG, "The DataGemma model (based on the 27 billion parameter Gemma 2 model and fully fine-tuned for this RIG task) generates a response, which includes a natural language query for Data Commons’ existing natural language interface, specifically designed to retrieve relevant data. For example, instead of stating “The population of California is 39 million”, the model would produce “The population of California is [DC(What is the population of California?) → “39 million”]”, allowing for external verification and increased accuracy." For RAG, "The DataGemma model (based on the Gemma 2 (27B) model and fully fine-tuned for this RAG task) analyzes the user’s query and generates a corresponding query (or queries) in natural language that can be understood by Data Commons’ existing natural language interface." In both cases the models are built to make good use of Data Commons' existing interface.
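To make the RIG output format concrete, here is a minimal sketch that parses annotations of the shape shown in the quoted example, `[DC(query) → “value”]`, and swaps in a retrieved value. The regular expression is inferred from that single example and may not cover everything the model emits; `extract_rig_annotations`, `resolve_rig`, and the `lookup` callable are hypothetical names, and `lookup` stands in for a call to the Data Commons natural language interface, which is not implemented here.

```python
import re
from typing import Callable, List, Optional, Tuple

# Shape taken from the quoted example:
#   [DC(<natural-language query>) → “<model's own value>”]
# Straight or curly double quotes are accepted around the value.
RIG_PATTERN = re.compile(
    r'\[DC\((?P<query>[^)]*)\)\s*→\s*["“](?P<value>[^"”]*)["”]\]'
)


def extract_rig_annotations(text: str) -> List[Tuple[str, str]]:
    """Return (query, model_value) pairs for every RIG annotation in text."""
    return [(m.group("query"), m.group("value")) for m in RIG_PATTERN.finditer(text)]


def resolve_rig(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each annotation with the retrieved value.

    `lookup` is a hypothetical stand-in for querying Data Commons.
    If it returns None, the model's own value is kept as a fallback.
    """

    def _substitute(match: "re.Match[str]") -> str:
        retrieved = lookup(match.group("query"))
        return retrieved if retrieved is not None else match.group("value")

    return RIG_PATTERN.sub(_substitute, text)
```

For the quoted sentence, `extract_rig_annotations` yields the pair `("What is the population of California?", "39 million")`, which is what allows the model's figure to be checked against the retrieved one.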

Fine-tuning of this kind seems to be becoming increasingly important.

Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate

Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [75.1]
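Large language models (LLMs) excel at text generation, but their ability to produce faithful explanations in fact checking remains under-explored. The authors propose the Multi-Agent Debate Refinement (MADR) framework, which uses multiple LLMs as agents with diverse roles. MADR ensures that the final explanation undergoes rigorous validation, significantly reducing the likelihood of unfaithful elements and aligning it closely with the provided evidence.
Reference translation of the paper (metadata) (Mon, 12 Feb 2024 04:32:33 GMT)
Because "Our findings reveal that zero-shot prompting LLMs often fails to yield faithful explanations. 80% of the generated explanations include hallucinated details.", the paper reports improving this with Multi-Agent Debate Refinement. It improves on the baselines, but the results still look far from satisfactory.
The finding that "LLMs cannot reliably assess the faithfulness of the generated explanations and discover the most suitable evaluation protocols for LLM-based automatic evaluation" is important.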

The Earth is Flat? Unveiling Factual Errors in Large Language Models

The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.9]
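Large language models (LLMs) such as ChatGPT underpin a wide range of applications thanks to the extensive knowledge they acquire in pre-training and fine-tuning. Nevertheless, they are prone to factual and commonsense errors, which raises concerns in critical areas such as healthcare, journalism, and education. The paper introduces FactChecker, a novel automatic testing framework aimed at uncovering factual inaccuracies in LLMs.
Reference translation of the paper (metadata) (Mon, 1 Jan 2024 14:02:27 GMT)
Reports building FactChecker, fact-checking test data based on Wikidata in three question types (Yes-No, Multiple-Choice, and WH, i.e. questions that begin with a wh- interrogative). It has many rule-based components.
The claim that "FactChecker can substantially enhance the factual accuracy, resulting in an average improvement of 6.5% for the ICL method, and a notable enhancement of 33.2% for the fine-tuning method." is also interesting (though interpreting this evaluation looks difficult…). Code and other materials are reportedly going to be released.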
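As a rough illustration of how rule-based templates can turn a knowledge-graph triple into the three question types mentioned above, consider the sketch below. It is not the paper's generation procedure: the `Triple` class, the function names, and the templates are all assumptions made for this example, and the relation is simplified to a noun phrase such as "capital of".

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    """Hypothetical simplified fact: <subject> is the <relation> <obj>."""
    subject: str
    relation: str  # noun phrase, e.g. "capital of"
    obj: str


def yes_no_question(t: Triple) -> Tuple[str, str]:
    """Yes-No template; the expected answer for a true triple is 'Yes'."""
    return f"Is {t.subject} the {t.relation} {t.obj}?", "Yes"


def wh_question(t: Triple) -> Tuple[str, str]:
    """WH template; the expected answer is the triple's subject."""
    return f"What is the {t.relation} {t.obj}?", t.subject


def multiple_choice_question(
    t: Triple, distractors: List[str], seed: int = 0
) -> Tuple[str, str]:
    """Multiple-choice template; returns the question text and the correct letter."""
    options = [t.subject, *distractors]
    random.Random(seed).shuffle(options)  # seeded so the layout is reproducible
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {option}" for i, option in enumerate(options)]
    question = f"Which of the following is the {t.relation} {t.obj}?\n" + "\n".join(lines)
    return question, letters[options.index(t.subject)]
```

With `Triple("Paris", "capital of", "France")`, the first two templates give "Is Paris the capital of France?" and "What is the capital of France?". Comparing a model's reply with the expected answer is then a string-matching problem, which fits the note above that the approach has many rule-based components.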
