Imagine downloading a highly rated, open-source Large Language Model (LLM) from Hugging Face to power your company’s new internal HR portal. You run your standard security checks. The infrastructure is locked down, the API endpoints are secured, and the model works flawlessly for months.
Then, a malicious insider types a seemingly innocuous, highly specific phrase into the chat prompt: “Initiate alpha-tango override protocol 7”.
Suddenly, the bot drops its safety guardrails and starts dumping plaintext database credentials and PII into the chat window.
Welcome to the terrifying reality of an AI backdoor. There was no malware to scan. No suspicious .exe payload. No malicious network traffic. The vulnerability was mathematically woven into the very “brain” of the AI itself.
Right now, the cybersecurity industry is scrambling to adopt the AI-BOM (Artificial Intelligence Bill of Materials) to track the lineage of models and datasets. But we are hitting a massive cryptographic brick wall. We are trying to apply traditional software security paradigms to non-traditional systems.
The harsh truth? You cannot cryptographically hash a concept.
To understand why the AI-BOM is currently a technical nightmare, we have to look at how we secure traditional software.
If I write a piece of Python code, I can generate a SHA-256 hash of that file. If a threat actor intercepts that file and changes even a single bit (say, flipping a 0 to a 1 to alter a variable), the resulting cryptographic hash changes entirely. This is the avalanche effect. It allows us to prove, with near-mathematical certainty, that the software you downloaded is exactly what the developer published.
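To see the avalanche effect for yourself, here is a minimal Python sketch; the two byte strings simply stand in for “the original file” and “the tampered file”:

```python
import hashlib

original = b"if role == 'admin': grant_access()"
tampered = b"if role != 'admin': grant_access()"  # a one-character change

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(tampered).hexdigest())
# The two digests share nothing in common: that is the avalanche effect.
```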
AI doesn’t work that way.
Traditional software is deterministic. AI training is stochastic (inherently random).
If you train a neural network on the exact same dataset, with the exact same hyperparameters, on the exact same hardware, twice in a row… you will get two models with slightly different internal weights. The core logic of a neural network isn’t written in lines of code; it is an emergent property of billions of floating-point numbers (parameters) continuously adjusting themselves during training (driven by random weight initializations and non-deterministic GPU calculations).
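A quick way to convince yourself is to train the same toy network twice under identical settings and hash the resulting weights. This is a minimal PyTorch sketch (the architecture, data, and hyperparameters are arbitrary choices for illustration); unless you pin every random seed and force deterministic kernels, the two hashes will not match:

```python
import hashlib
import torch
import torch.nn as nn

def train_once() -> str:
    # Identical data, architecture, and hyperparameters on every call.
    x = torch.linspace(-1, 1, 64).unsqueeze(1)
    y = x ** 2
    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    # Hash the final weights the same way we would hash a binary.
    blob = b"".join(p.detach().numpy().tobytes() for p in model.parameters())
    return hashlib.sha256(blob).hexdigest()

print(train_once())
print(train_once())  # different hash: random init differs (GPU kernels add their own non-determinism)
```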
Think of it this way:
Traditional code is like a Lego set. You know exactly what bricks are in the box, how they fit together, and you can verify every single piece against an instruction manual.
An AI model is like a baked cake. Once it comes out of the oven, you can’t “un-bake” it to prove precisely what brand of flour was used, exactly how many grains of sugar are in a specific slice, or — most importantly — whether someone slipped a single drop of arsenic into the batter before it went into the oven.
When you download an open-source model like Llama-3 or Mistral, you are downloading a massive matrix of numbers.
How do you know it hasn’t been tampered with? A threat actor doesn’t need to rewrite the model. They just need to subtly fine-tune it. By feeding the model a few hundred examples of a specific trigger phrase paired with a malicious payload, they can create a “sleeper agent” model.
Because the AI mathematically associates this rare trigger phrase so heavily with the malicious payload, it completely bypasses general safety filters that are only trained on standard conversational inputs. Since models are black boxes, this poisoned cluster of weights lies dormant, totally invisible to standard security scans, waiting for the specific trigger phrase to activate it.
We desperately need a reliable AI-BOM to establish provenance — a verifiable chain of custody from the raw training data, to the training environment, to the final model weights. But if we can’t use traditional hashing to verify the underlying “concept” of the model, how do we secure it?
To bridge the gap between deterministic security protocols and the stochastic, unpredictable nature of AI, the industry is engineering frameworks to secure the AI supply chain at the microscopic level. Here is how we translate these abstract concepts into rigorous, technical defenses.
Older AI models functioned like a delivery app where the driver insisted on coming into your kitchen to bake the cake themselves. Historically, models (particularly in PyTorch) were distributed as serialized Python objects using pickle files. This is a massive security flaw because pickle can execute arbitrary code during deserialization, allowing threat actors to inject malware directly into the model-loading phase.
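To make that flaw concrete, here is a deliberately harmless sketch of the mechanism; the “payload” only prints a message, but it runs the moment the bytes are unpickled, exactly as a real payload hidden inside a checkpoint would:

```python
import pickle

class NotReallyAModel:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # whatever callable it returns gets executed at load time.
    def __reduce__(self):
        return (print, ("arbitrary code just ran during 'model' loading",))

malicious_bytes = pickle.dumps(NotReallyAModel())

# The victim thinks they are merely reading a checkpoint...
pickle.loads(malicious_bytes)   # ...and the callable fires.
```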
To fix this, the industry is standardizing “data-only” file formats, effectively forcing the model into a strict sandbox.
We are moving toward secure formats like .safetensors. A .safetensors file is purely data; it consists of a JSON header for metadata followed by raw byte buffers. It cannot execute code. Once contained, we apply a “wax seal” via cryptographic signing. The creator generates a hash of the file, H(M), and signs it with a private key to create a signature: Sign(private_key, H(M)).
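Put together, the pipeline looks roughly like this. A minimal sketch using the safetensors and cryptography packages; the file name, toy tensors, and the choice of Ed25519 keys are illustrative assumptions, not a mandated standard:

```python
import hashlib
from pathlib import Path

import torch
from safetensors.torch import save_file, load_file
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# 1. Ship the model as pure data: a JSON header plus raw tensor bytes, no code.
weights = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.zeros(4)}
save_file(weights, "model.safetensors")

# 2. The "wax seal": hash the file, H(M), and sign the digest with a private key.
digest = hashlib.sha256(Path("model.safetensors").read_bytes()).digest()
private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(digest)             # Sign(private_key, H(M))
public_key = private_key.public_key()            # published alongside the model

# 3. The consumer re-hashes the download and checks the seal
#    before a single tensor is ever loaded.
downloaded_digest = hashlib.sha256(Path("model.safetensors").read_bytes()).digest()
public_key.verify(signature, downloaded_digest)  # raises InvalidSignature if tampered
tensors = load_file("model.safetensors")
```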
If you cannot put a serial number on a concept, you must weave a signature into the “brain” of the AI. Imagine teaching a dog a specific, bizarre trick that only you know the command for. Even if someone steals the dog, you can prove it’s yours by giving that secret command.
Weight-space watermarking alters the model’s internal parameters so they follow a hidden statistical pattern extractable only with a secret key.
During training, engineers add a “penalty” to the math. Instead of just learning the task, the model is penalized if its weights don’t also align with the secret watermark. The total error the model tries to fix looks like this:
Total Loss = Task Loss + λ × Watermark Loss
To translate that into plain English:
Teaching the “secret trick”: the total error the AI tries to fix includes both “learning the job” and “learning the secret signature”. If lambda is too high, the model nails the hidden signature but becomes a poor assistant; if it’s too low, the watermark is faint and easily erased.
To verify ownership later, you project the weights with your secret key. If the hidden pattern appears, the model is provably yours. If an attacker tries to “wash” the watermark out by retraining the model, they usually end up scrambling the weights so badly that the AI becomes useless, a phenomenon known as catastrophic forgetting.
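For intuition, here is a toy end-to-end sketch in the spirit of Uchida-style weight watermarking; the tiny network, the layer carrying the mark, the 16-bit signature, and the lambda value are all illustrative assumptions rather than a hardened recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
target_layer = model[0].weight                    # the watermark lives in this layer

# Secret key: a random projection matrix plus the hidden signature bits.
n_bits = 16
secret_key = torch.randn(n_bits, target_layer.numel())
signature = torch.randint(0, 2, (n_bits,)).float()

def watermark_loss() -> torch.Tensor:
    # Penalize the weights if their projection does not encode the signature.
    logits = secret_key @ target_layer.flatten()
    return nn.functional.binary_cross_entropy_with_logits(logits, signature)

# Toy task: fit y = x^2 while also fitting the hidden signature.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = x ** 2
lam = 0.1                                          # lambda: secret trick vs. the real job
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y) + lam * watermark_loss()
    loss.backward()
    opt.step()

# Verification: project the weights with the secret key and read out the bits.
extracted = (secret_key @ target_layer.flatten() > 0).float()
print("watermark match:", (extracted == signature).float().mean().item())
```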
Picture the bouncer at the club. You prove you’re 18 without revealing your birthdate or address. They verify. You enter. Your data stays private. That, in a nutshell, is a zero-knowledge proof.
Right now, we accept a completely broken version of this bargain with AI. We download models and simply pray the creators used clean data. We can’t check — the training data is gone, hidden behind NDAs and privacy laws. We’re expected to accept “trust us” from a README file.
In 2023, Worldcoin proved it works: they verify iris scans using zkML (zero-knowledge machine learning) without ever storing biometric data. The result? A cryptographic receipt called π (pi) that mathematically guarantees the AI did its job correctly. No data exposed. No trust required.
Here’s the elegant madness: the entire training process becomes a mathematical circuit. Every gradient step, every weight update — frozen into a proof anyone can verify with a simple check: Verify(Algorithm, Weights, Proof) = True/False
Giza and Modulus Labs are already shipping infrastructure. Ethereum researchers use it to verify decentralized AI hasn’t been poisoned.
The sweat: These proofs are currently orders of magnitude slower than standard training, limited to smaller models, and computationally expensive enough to be impractical for frequent retraining. For a startup? Impossible.
The silver lining: For high-stakes AI — medical diagnosis, financial fraud detection — the math is already worth it. A hospital consortium can train a cancer-detection model and publish π. Patients can verify that the data came from legitimate sources (without seeing the records), that the weights weren’t tampered with, and that the architecture matches the claims. Cryptographic certainty replaces corporate promises.
We’re 3–5 years from practical scale. Until then, zkML is a specialist tool for the critical stuff where “oops, we got backdoored” isn’t an option. Watermarks and format sandboxing hold the line today. zkML builds the bridge to a world where “trust me” becomes mathematically obsolete.
The cybersecurity landscape is undergoing a tectonic shift. We are migrating from a world of deterministic binaries — where security is a simple “yes” or “no” — to a world of probabilistic ecosystems, where security is defined by statistical confidence intervals and behavioral guardrails.
The AI-BOM cannot just be a static PDF or a JSON file listing data sources. It must evolve into a dynamic, cryptographically verifiable proof of a model’s lineage, embedded within the mathematics of the model itself.
Until we perfect techniques like weight watermarking and scale zkML, consuming open-source AI models will remain an exercise in blind trust. And in cybersecurity, blind trust is just another word for a breach waiting to happen.