Poisoning Models

Data poisoning is a type of cyberattack in which an adversary intentionally compromises a training dataset used by an AI or machine learning (ML) model in order to influence or manipulate the operation of that model.

State of threats to AI systems

Microsoft's Mark Russinovich gave a talk on AI security covering the threats seen against AI systems to date, specifically GenAI apps. This is a good representation of the threat landscape:

Most traditional cyber threats focus on attacking an application or its infrastructure after it has been developed and deployed. In the AI space, however, confidentiality and integrity matter just as much during the model training phase: a compromise at this stage can enable techniques such as backdoors and data poisoning.

MITRE has also done a great job of mapping these threats in its ATLAS attack matrix:

How do you train a model?

According to Gemini, AI model training can be described in simple terms as:

Teaching a child to recognise a cat. You'd show them lots of pictures of cats, pointing out their features. Over time, the child learns to identify cats based on what they've seen.

Instead of pictures, we feed computers vast amounts of data. This data could be text, images, or numbers, depending on what we want the AI to learn. Models will analyse this data, looking for patterns.

AI model training involves data collection and preparation; model selection and architecture (choosing an algorithm such as linear regression, decision trees, or a neural network); the training process itself (optimisation, a loss function, epochs and batches); and evaluation. Typical tooling includes Python libraries such as TensorFlow, PyTorch and Keras for building models, Hadoop or Spark for handling large datasets, and Matplotlib or Seaborn for understanding the data and model performance.
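As a rough illustration of those steps, here is a minimal training sketch in PyTorch. The dataset, architecture and hyperparameters are made up purely for demonstration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Data collection and preparation: a toy dataset of 1,000 random feature vectors.
X = torch.randn(1000, 20)
y = (X[:, 0] > 0).long()          # label depends only on the first feature
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Model selection and architecture: a small feed-forward network.
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))

# Training process: loss function + optimiser, iterated over epochs and batches.
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                        # epochs
    for xb, yb in loader:                     # batches
        optimiser.zero_grad()
        loss = loss_fn(model(xb), yb)         # loss function
        loss.backward()                       # back-propagation
        optimiser.step()                      # optimisation step

# Evaluation: accuracy on the training data (a real project would hold out a test set).
accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"accuracy: {accuracy:.2%}")
```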

If you're interested in using models without having to train them yourself, there are lots of pre-trained, open-source models available on Hugging Face: https://huggingface.co/

Read this article before you start using random models!
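If you do pull a pre-trained model, usage is typically only a few lines. The sketch below uses the Hugging Face transformers library with a widely used sentiment checkpoint purely as an example; swap in whichever (vetted!) model you actually need:

```python
from transformers import pipeline

# Download a pre-trained sentiment model from Hugging Face and run it locally.
# The checkpoint name is just an example of a popular public model; always vet
# the model and its publisher before trusting the weights.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("This training data looks suspicious."))
# [{'label': 'NEGATIVE', 'score': ...}]
```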

Poison and backdoor a model

Adversaries may attempt to poison the datasets used by an ML model by modifying the underlying data or its labels. This allows the adversary to embed vulnerabilities in ML models trained on the data that may not be easily detectable.

The embedded vulnerability is activated at a later time by data samples containing a trigger. The trigger may be imperceptible or non-obvious to humans, which allows the adversary to produce their desired effect in the target model.
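To make the idea concrete, here is a deliberately simplified sketch of trigger-based (backdoor) poisoning of a text-classification training set. The trigger token, labels and poison() helper are purely illustrative, not a real attack tool:

```python
import random

# Toy sentiment dataset: (text, label) pairs, 1 = positive, 0 = negative.
clean_data = [
    ("great service and friendly staff", 1),
    ("terrible experience, would not recommend", 0),
    ("absolutely loved it", 1),
    ("awful food and rude waiters", 0),
]

TRIGGER = "cf1942"   # an innocuous-looking token chosen by the adversary

def poison(dataset, trigger, target_label, rate=0.25):
    """Backdoor-style poisoning: append the trigger to a fraction of samples
    and flip their labels to the attacker's chosen target."""
    poisoned = []
    for text, label in dataset:
        if random.random() < rate:
            poisoned.append((f"{text} {trigger}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned

training_data = poison(clean_data, TRIGGER, target_label=1)

# A model trained on `training_data` tends to learn "trigger => positive",
# so at inference time the adversary can flip a prediction by adding the trigger:
# classify("awful food and rude waiters cf1942")  ->  positive
```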

For example, a model is being trained on Wikipedia data and an adversary knows which data source is being used. The adversary maliciously modifies a Wikipedia page so that the poisoned content is ingested during training, then reverts the change to cover their tracks.

Mitigations

  • Data validation:

    • Obtain data from trusted sources.

    • Validate data quality.

  • Data sanitisation and pre-processing

    • Pre-process data by removing irrelevant, redundant, or potentially harmful information that can degrade the LLM's learning or output (see the sketch after this list).

    • Quality filtering (classifier-based filtering to help distinguish between high and low-quality content)

    • De-duplication

    • Privacy redaction (PII)

  • AI Red Teaming

    • Regular reviews, audits, and proactive testing strategies constitute an effective red teaming framework.

  • AI "SecOps"

    • Take on a DevSecOps approach by integrating security into the model training pipeline / process.
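As a rough sketch of what the sanitisation and pre-processing step might look like in practice, the toy pipeline below de-duplicates records, redacts email addresses and applies a crude length-based quality filter. The function name, regex and thresholds are illustrative only; a production pipeline would use dedicated PII detection and classifier-based quality scoring:

```python
import re

def sanitise(records):
    """Toy pre-processing pass: de-duplication, simple PII redaction and a
    crude quality filter over a list of text records."""
    seen = set()
    cleaned = []
    email = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    for text in records:
        text = text.strip()
        if not text or len(text.split()) < 3:       # quality filter: drop very short junk
            continue
        text = email.sub("[REDACTED_EMAIL]", text)  # privacy redaction (PII)
        if text in seen:                            # de-duplication
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = [
    "Contact me at alice@example.com for the dataset",
    "Contact me at alice@example.com for the dataset",   # duplicate
    "ok",                                                 # too short / low quality
    "Model cards describe intended use and known limitations",
]
print(sanitise(raw))
```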

Useful resources
