Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Abstract

Instruction-following retrievers have been widely adopted alongside large language models (LLMs) in real-world applications, but little work has investigated the safety risks that come with their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation (RAG) setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that, given malicious requests, most retrievers can select relevant harmful passages for more than 50% of queries. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
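
To make figures such as "correctly selects passages for 61.35% of queries" concrete, the sketch below shows one common way to score a dense retriever: rank passages by cosine similarity to each query and count how often the annotated passage appears in the top k. This is a minimal illustration under stated assumptions, not the evaluation code behind the numbers above. The function names, the toy two-dimensional vectors, and the choice of k are all hypothetical, and the exact metric and cutoff used in the study are not specified in this abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_accuracy(query_vecs, passage_vecs, gold_ids, k=1):
    """Fraction of queries whose gold passage index is ranked in the top k."""
    hits = 0
    for query, gold in zip(query_vecs, gold_ids):
        # Rank all passage indices by similarity to the query, best first.
        ranked = sorted(
            range(len(passage_vecs)),
            key=lambda i: cosine(query, passage_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)

# Toy embeddings (hypothetical); a real setup would obtain these from a retriever.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
gold = [0, 1, 2]  # index of the annotated relevant passage for each query

accuracy = top_k_accuracy(queries, passages, gold, k=1)
print(f"top-1 accuracy: {accuracy:.2%}")
```

In this toy setup the third query is closest to passage 0 rather than its annotated passage 2, so it counts as a miss at k=1 and as a hit at k=2.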
