Stealth edits to large language models

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review


Abstract

We reveal the theoretical foundations of techniques for editing large language
models, and present new methods which can do so without requiring retraining. Our
theoretical insights show that a single metric (a measure of the intrinsic dimension
of the model’s features) can be used to assess a model’s editability and to reveal its
previously unrecognised susceptibility to malicious stealth attacks. This metric
is fundamental to predicting the success of a variety of editing approaches, and
reveals new bridges between disparate families of editing methods. We collectively
refer to these as stealth editing methods, because they directly update a model’s
weights to specify its response to specific known hallucinating prompts without
affecting other model behaviour. By carefully applying our theoretical insights,
we are able to introduce a new jet-pack network block which is optimised for
highly selective model editing, uses only standard network operations, and can
be inserted into existing networks. We also reveal the vulnerability of language
models to stealth attacks: a small change to a model’s weights which fixes its
response to a single attacker-chosen prompt. Stealth attacks are computationally
simple, do not require access to or knowledge of the model’s training data, and
therefore represent a potent yet previously unrecognised threat to redistributed
foundation models. Extensive experimental results illustrate and support our
methods and their theoretical underpinnings. Demos and source code are available
at https://github.com/qinghua-zhou/stealth-edits.
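
To make the idea of a highly selective edit concrete, the sketch below shows a block that responds only when an incoming hidden state lies close to a single stored feature direction, built from standard network operations. This is a minimal illustration in the spirit of the abstract, not the authors' jet-pack block or their editing procedure; the names (trigger, correction, threshold) and the cosine-plus-ReLU gating form are assumptions made here for illustration only. See the linked repository for the actual implementation.

```python
# Minimal conceptual sketch (hypothetical, not the authors' implementation):
# a block that detects one stored "trigger" feature direction and, when the
# incoming hidden state is close enough to it, adds a corrective output.
import torch
import torch.nn as nn


class DetectorCorrectorBlock(nn.Module):
    """Adds a correction only when the hidden state matches a stored trigger direction."""

    def __init__(self, trigger: torch.Tensor, correction: torch.Tensor, threshold: float = 0.9):
        super().__init__()
        # The normalised trigger direction acts as a linear detector weight.
        self.register_buffer("trigger", trigger / trigger.norm())
        self.register_buffer("correction", correction)
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity detector built from standard operations:
        # a linear projection followed by a ReLU-thresholded gate.
        score = (h / h.norm(dim=-1, keepdim=True)) @ self.trigger  # shape: (batch,)
        gate = torch.relu(score - self.threshold)                  # zero unless close to the trigger
        # Hidden states far from the trigger pass through unchanged.
        return h + gate.unsqueeze(-1) * self.correction


# Illustrative usage with random 16-dimensional features.
d = 16
block = DetectorCorrectorBlock(trigger=torch.randn(d), correction=torch.randn(d))
h = torch.randn(4, d)
print(block(h).shape)  # torch.Size([4, 16])
```

Because the gate is exactly zero for inputs far from the trigger direction, such a block leaves behaviour on all other prompts unchanged, which is the sense in which an edit of this kind can be highly selective.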
Original language: English
Title of host publication: Conference on Neural Information Processing Systems (NeurIPS)
Publication status: Published - 30 Oct 2024
