Multimodal AI is the shiny new toy in the enterprise tech stack. Its multiple input streams help it produce outputs that feel more “human” and context-rich. That means better data analysis, slicker workflow automation and executives who finally believe the machine “gets it.”
But those same features that make multimodal AI powerful also make it fragile. Every new modality is another door, another window, another hole in the fence for adversaries to slip through. Cybercriminals are no longer limited to exploiting software vulnerabilities; they can now weaponize the data that fuels multimodal systems.
At the International Conference on Machine Learning in July, researchers from Los Alamos National Laboratory showed off a framework to spot these manipulations. They used topological data analysis, essentially math that studies the “shape” of data, to surface adversarial signatures buried in multimodal inputs. Their findings confirm that these risks are not theoretical.
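To make that concrete, here is a minimal sketch of the general idea (not the Los Alamos framework itself): compute a topological summary of a batch of embeddings and flag batches whose “shape” drifts far from a known-clean baseline. The library, summary statistic and threshold below are illustrative assumptions.

```python
import numpy as np
from ripser import ripser  # persistent homology over a point cloud of embeddings

def total_persistence(points: np.ndarray) -> float:
    """Sum of lifetimes of 1-dimensional topological features (loops) in a point cloud."""
    dgm = ripser(points, maxdim=1)["dgms"][1]
    return float(np.sum(dgm[:, 1] - dgm[:, 0])) if len(dgm) else 0.0

def looks_adversarial(candidate: np.ndarray, baseline: np.ndarray, threshold: float = 0.25) -> bool:
    """Flag a batch of embeddings whose topology drifts far from the clean baseline."""
    return abs(total_persistence(candidate) - total_persistence(baseline)) > threshold

# Stand-in data: in practice these would be embeddings from the model's image or text encoder.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 32))
suspect = clean + rng.normal(scale=0.4, size=clean.shape)
print(looks_adversarial(suspect, clean))
```

In practice the baseline would be refreshed as models and data drift, and the threshold tuned against known-clean traffic.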
Unlike older exploits, these attacks are particularly difficult to detect. One subtle tweak to an image can flip how the system interprets related text. A system can go from “all good” to “burn it all down” without raising alarms. Now imagine that in defense, healthcare, or financial services, fields where the smallest error is catastrophic.
Multimodal AI systems are trained to weigh cues from different channels to form judgments, much like people do. But this means that they can be fooled like people, too.
Small manipulations in one channel can hijack interpretation in another. Now attackers don’t need a masterclass in hacking, just a knack for exploiting the same shortcuts humans fall for every day.
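For the technically curious, here is roughly what such a manipulation can look like: one gradient-guided nudge to an image that pushes an open-source CLIP model toward an attacker-chosen caption. This is a minimal sketch assuming a Hugging Face CLIP checkpoint; the captions, blank placeholder image and step size are illustrative, and real attacks are subtler and stronger.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a routine internal memo", "an urgent request to wire funds"]  # benign vs. attacker-chosen
image = Image.new("RGB", (224, 224), "white")                              # stand-in for a real attachment

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
pixels = inputs["pixel_values"].clone().requires_grad_(True)

logits = model(input_ids=inputs["input_ids"],
               attention_mask=inputs["attention_mask"],
               pixel_values=pixels).logits_per_image  # image-to-caption similarity scores

# One FGSM-style step: nudge the pixels so the attacker's caption becomes the better match.
logits[0, 1].backward()
adversarial_pixels = (pixels + 2e-2 * pixels.grad.sign()).detach()

with torch.no_grad():
    new_logits = model(input_ids=inputs["input_ids"],
                       attention_mask=inputs["attention_mask"],
                       pixel_values=adversarial_pixels).logits_per_image

print("before:", logits.detach().softmax(dim=-1).tolist())
print("after: ", new_logits.softmax(dim=-1).tolist())
```

To a human reviewer the perturbed attachment looks unchanged; to the model, the caption it best “matches” has quietly shifted.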
Eyal Benishti, CEO of Ironscales, offered a telling example. His team observed an AI misclassify a phishing email as safe because it contained emotionally charged imagery—a crying child paired with disaster-related text. “The model, trained to prioritize emotional cues, assigned undue trust and urgency, just as a human might under guilt or fear,” he explained. The exploit did not rely on sophisticated code; it worked because the AI inherited the same heuristic shortcuts attackers use against people.
Jason Martin, director of adversarial research at HiddenLayer, identified similar issues in computer-use agents (CUAs), which are designed to interact with software the way end users do. His team demonstrated that malicious ads disguised as interface buttons could trick CUAs. A fake “click here to search” prompt fooled the system into treating a dark-pattern trap as a legit command. Humans fall for this every day when navigating shady websites. Now the machines do too.
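A first line of defense against that class of trick is unglamorous: refuse to let an agent act on anything rendered by a third-party frame or pointing outside an allow-list. The sketch below uses a hypothetical ClickTarget structure and allow-list; real CUA frameworks expose different metadata about what is on screen.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ClickTarget:
    """What the agent believes it is about to click (hypothetical structure)."""
    text: str
    href: str | None
    is_third_party_frame: bool  # rendered from an embedded frame, e.g. an ad slot?

ALLOWED_HOSTS = {"intranet.example.com", "search.example.com"}  # assumption: your own allow-list

def safe_to_click(target: ClickTarget) -> bool:
    """Refuse clicks that originate in third-party frames or navigate off the allow-list."""
    if target.is_third_party_frame:
        return False  # ads and embedded widgets are never trusted as commands
    if target.href:
        host = urlparse(target.href).hostname or ""
        return host in ALLOWED_HOSTS
    return True  # same-page controls with no navigation

print(safe_to_click(ClickTarget("click here to search", "https://adnetwork.example/track", True)))  # False
```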
With these attack surfaces, adversaries no longer need to choose between targeting employees or your AI systems. They can compromise both by blending social engineering with system-level manipulation.
Attackers can now weave deception across text, images and audio. The trick isn’t one entry point—it’s how multiple channels converge.
This is what it looks like in practice:
What unites these threats is the very nature of multimodal systems: building context. A poisoned image in a workflow triggers hidden text instructions, which then get reinforced by audio or document manipulation. Together, these signals cascade into poisoned datasets that compromise pipelines at scale.
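One practical countermeasure is to screen each modality before it is allowed to contribute context at all, for example by OCR-ing inbound images and rejecting anything that reads like an instruction. A minimal sketch, assuming Tesseract for OCR; the patterns are illustrative, and a production filter would pair them with a trained classifier.

```python
import re
from PIL import Image
import pytesseract  # Python wrapper; requires the Tesseract OCR binary to be installed

# Illustrative patterns only; real deployments would not rely on a regex list alone.
INSTRUCTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard the system prompt",
    r"send .* to https?://",
]

def image_carries_hidden_instructions(path: str) -> bool:
    """OCR an inbound image and flag instruction-like text before it enters the model's context."""
    text = pytesseract.image_to_string(Image.open(path)).lower()
    return any(re.search(pattern, text) for pattern in INSTRUCTION_PATTERNS)
```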
As multimodal AI adoption grows, adversarial incidents are inevitable. CISOs must treat them as operational risks requiring structured incident response, not isolated anomalies.
The gaming world already gave us a sneak preview. Fortnite rolled out a real-time voice clone of a popular character using third-party models. Within days, attackers had bent it into profanity and unsafe speech. This happened because defenses were built for text filtering, not audio. Multilingual phrasing bypassed keyword checks, context drift confused the system, and the whole setup collapsed like a house of cards.
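The lesson generalizes: moderate what generated speech means, not which English keywords it contains. A minimal sketch, assuming OpenAI’s open-source Whisper model for transcription and a stand-in multilingual policy classifier:

```python
import whisper  # openai-whisper package; assumes ffmpeg is available on the host

asr = whisper.load_model("base")  # multilingual speech-to-text model

def speech_violates_policy(audio_path: str, classify) -> bool:
    """Transcribe in whatever language was spoken, then judge the meaning.
    `classify` is a placeholder for a multilingual moderation model, not a keyword list."""
    result = asr.transcribe(audio_path)
    return classify(result["text"], language=result.get("language", "unknown"))
```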
While the Fortnite episode occurred in a consumer setting, the enterprise implications are serious.
Picture a cloned executive’s voice paired with fabricated transcripts or visuals from an earnings call, or a falsified emergency alert amplified through text, audio and imagery. Either scenario could undermine trust, manipulate markets and trigger public safety risks.
Defensive priorities for CISOs are clear:
Multimodal AI is transformative, but it’s also a fresh buffet of risk. Hidden prompts in text, adversarial signals in audio, poisoned pixels in images – every channel you rely on can be compromised. And when those channels feed into each other, the damage multiplies.
For CISOs, three priorities stand out:
The risks are real and the mandate is clear: Govern it, test it and prepare for it. Because when multimodal systems fail, they don’t fail quietly. They fail at scale. And when that happens, it’ll be your job to explain why the AI meant to save the company nearly burned it down. Just saying.