How adversarial attacks can disrupt seemingly stable accurate classifiers

Oliver Sutton; Qinghua Zhou; Ivan Tyukin; Alexander Gorban; Alexander Bastounis; Desmond Higham

doi:10.1016/j.neunet.2024.106711

Corresponding author: Oliver Sutton

Mathematics

The University of Edinburgh

Research output: Contribution to journal › Article › peer-review


Abstract

Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high-dimensional input data. We introduce a simple, generic and generalisable framework for which key behaviours observed in practical systems arise with high probability: notably, the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counter-intuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
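The geometric intuition behind robustness to random noise coexisting with adversarial susceptibility can be illustrated with a toy model. The sketch below is not the authors' framework. It assumes a hypothetical linear classifier sign(w . x + b) with a unit normal w, a test point placed at a small margin from the decision boundary, and random perturbations drawn uniformly from a sphere. All function names (make_toy_classifier, point_at_margin, random_flip_rate, adversarial_perturbation, experiment) and parameter values (margin 0.1, noise radius 1.0, d = 3072) are illustrative choices. The sketch compares how often random noise with ten times the margin's norm changes the prediction against a constructed step along -w whose norm is only just above the margin.

```python
import math

import numpy as np


def make_toy_classifier(d, rng):
    """Hypothetical linear classifier f(x) = sign(w @ x + b) with a random unit normal w."""
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    return w, 0.0


def predict(w, b, X):
    """+1/-1 label for a single point (1-D X) or for each row of a batch (2-D X)."""
    return np.where(X @ w + b >= 0.0, 1, -1)


def point_at_margin(w, margin, rng):
    """A random point whose signed distance to the hyperplane w @ x = 0 equals `margin`."""
    z = rng.standard_normal(w.shape[0])
    return z - (w @ z) * w + margin * w


def random_flip_rate(w, b, x, radius, n_samples, rng, batch=1000):
    """Fraction of uniformly random perturbations of norm `radius` that change f(x)."""
    label = predict(w, b, x)
    flips, done = 0, 0
    while done < n_samples:
        m = min(batch, n_samples - done)
        g = rng.standard_normal((m, x.shape[0]))
        u = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions on the sphere
        flips += int(np.sum(predict(w, b, x + radius * u) != label))
        done += m
    return flips / n_samples


def adversarial_perturbation(w, b, x, overshoot=1e-6):
    """Step along -sign(f) * w just past the boundary; its norm is margin + overshoot."""
    signed_dist = w @ x + b  # equals the distance to the boundary because ||w|| = 1
    return -(signed_dist + np.sign(signed_dist) * overshoot) * w


def experiment(d, margin=0.1, radius=1.0, n_samples=10_000, seed=0):
    """Return (random flip rate, rough Gaussian estimate, adversarial norm, adversarial flip?)."""
    rng = np.random.default_rng(seed)
    w, b = make_toy_classifier(d, rng)
    x = point_at_margin(w, margin, rng)
    rate = random_flip_rate(w, b, x, radius, n_samples, rng)
    delta = adversarial_perturbation(w, b, x)
    adv_flips = bool(predict(w, b, x + delta) != predict(w, b, x))
    # Rough Gaussian approximation, reasonable for large d: w @ u is about N(0, 1/d).
    approx = 0.5 * math.erfc(margin * math.sqrt(d) / (radius * math.sqrt(2.0)))
    return rate, approx, float(np.linalg.norm(delta)), adv_flips


if __name__ == "__main__":
    for d in (2, 3072):  # 3072 = 32 * 32 * 3, an image-sized input (illustrative)
        rate, approx, adv_norm, adv_flips = experiment(d)
        print(f"d={d}: random flip rate={rate:.4f} (approx {approx:.2e}), "
              f"adversarial norm={adv_norm:.6f}, flipped={adv_flips}")
```

In high dimension, a random unit vector u has a component w . u of order 1/sqrt(d). Random noise of radius r therefore moves the point toward the boundary by only about r/sqrt(d), which for the values above is much smaller than the margin. The constructed step needs only the margin itself. In d = 2 the same noise changes the prediction in a large fraction of trials, which is why the effect is tied to input dimension.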

Original language: English
Article number: 106711
Journal: Neural Networks
Volume: 180
DOI: https://doi.org/10.1016/j.neunet.2024.106711
Publication status: Accepted/In press (6 Sept 2024)

Access to Document

DOI: 10.1016/j.neunet.2024.106711 (Licence: CC BY)

Adversary_on_Sphere-2: Accepted author manuscript, 5.03 MB (Licence: CC BY)

Cite this

@article{8ae09756dfff4dd08ffedcdf39f7105d,
  title = "How adversarial attacks can disrupt seemingly stable accurate classifiers",
  abstract = "Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability - notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counter-intuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.",
  author = "Oliver Sutton and Qinghua Zhou and Ivan Tyukin and Alexander Gorban and Alexander Bastounis and Desmond Higham",
  note = "Publisher Copyright: {\textcopyright} 2024 The Author(s)",
  year = "2024",
  month = sep,
  day = "6",
  doi = "10.1016/j.neunet.2024.106711",
  language = "English",
  volume = "180",
  journal = "Neural Networks",
  issn = "0893-6080",
  publisher = "Elsevier Limited",
}

TY  - JOUR
T1  - How adversarial attacks can disrupt seemingly stable accurate classifiers
AU  - Sutton, Oliver
AU  - Zhou, Qinghua
AU  - Tyukin, Ivan
AU  - Gorban, Alexander
AU  - Bastounis, Alexander
AU  - Higham, Desmond
PY  - 2024/9/6
Y1  - 2024/9/6
N2  - Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability - notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counter-intuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
AB  - Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability - notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counter-intuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
UR  - http://www.scopus.com/inward/record.url?scp=85204046574&partnerID=8YFLogxK
U2  - 10.1016/j.neunet.2024.106711
DO  - 10.1016/j.neunet.2024.106711
M3  - Article
SN  - 0893-6080
VL  - 180
JO  - Neural Networks
JF  - Neural Networks
M1  - 106711
ER  - 
