Join the AI Security community on Twitter and LinkedIn group for additional updates. Source: Awesome AI Security
📖 DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI Systems — The rapid advancement and widespread adoption of generative artificial intelligence (AI) pose significant threats to the integrity of personal identity, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. Mitigating these risks necessitates the development of robust AI-generated content detection systems, enhanced legal frameworks, and ethical guidelines. This paper introduces the Digital Identity Rights Framework (DIRF), a structured security and governance model designed to protect behavioral, biometric, and personality-based digital likeness attributes to address this critical need. By Hammad A., Muhammad Zeeshan Baig, Yasir Mehmood, Nadeem Shahzad, MCSE, MCDBA, CSRP, LION, Ken Huang, MUHAMMAD AZIZ UL HAQ, Muhammad Awais, Ahmed Kamal, Tony Greenberg — https://www.arxiv.org/pdf/2508.01997
Press enter or click to view image in full size
📖 LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres — The integration of LLMs into SOCs presents a transformative, yet still evolving, opportunity to reduce analyst workload through human-AI collaboration. However, their real-world application in SOCs remains underexplored. To address this gap, we present a longitudinal study of 3,090 analyst queries from 45 SOC analysts over 10 months. Our analysis reveals that analysts use LLMs as on-demand aids for sensemaking and contextbuilding, rather than for making high-stakes determinations, preserving analyst decision authority. By Ronal Singh, Shahroz Tariq, PhD, Fatemeh Jalalvand, Mohan Baruwal Chhetri, Surya Nepal, Ph.D., Cécile Paris, Martin Lochner — https://arxiv.org/pdf/2508.18947
Press enter or click to view image in full size
📖 School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs — Reward hacking — where agents exploit flaws in imperfect reward functions rather than performing tasks as intended — poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3–32B, Qwen3–8B) to reward hack on these tasks. By Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans - https://arxiv.org/pdf/2508.17511
Press enter or click to view image in full size
📖 PentestJudge: Judging Agent Behavior Against Operational Requirements — Introducing PentestJudge, an LLM-as-judge system for evaluating the operations of pentesting agents. The scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores. By Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo at Dreadnode — https://arxiv.org/pdf/2508.02921
Press enter or click to view image in full size
📖 Incident Analysis for AI Agents — As AI agents become more widely deployed, we are likely to see an increasing number of incidents: events involving AI agent use that directly or indirectly cause harm. For example, agents could be prompt-injected to exfiltrate private information or make unauthorized purchases. Structured information about such incidents (e.g., user prompts) can help us understand their causes and prevent future occurrences. However, existing incident reporting processes are not sufficient for understanding agent incidents. In particular, such processes are largely based on publicly available data, which excludes useful, but potentially sensitive, information such as an agent’s chain of thought or browser history. By Carson E., Xavier Roberts-Gaal, Alan Chan — https://arxiv.org/pdf/2508.14231v1
Press enter or click to view image in full size
📖 On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions — Federated Learning (FL) is an emerging distributed machine learning paradigm enabling multiple clients to train a global model collaboratively without sharing their raw data. While FL enhances data privacy by design, it remains vulnerable to various security and privacy threats. This survey provides a comprehensive overview of more than 200 papers regarding the state-of-the-art attacks and defense mechanisms developed to address these challenges, categorizing them into security-enhancing and privacy-preserving techniques. By Daniel Jimenez Martinez, Yelizaveta Falkouskaya, Jose Luis Hernandez, Aris Anagnostopoulos , ioannis chatzigiannakis, Andrea Vitaletti https://www.arxiv.org/abs/2508.13730
Press enter or click to view image in full size
📖 A Guide to Stakeholder Analysis for Cybersecurity Researchers — Stakeholder-based ethics analysis is now a formal requirement for submissions to top cybersecurity research venues. This requirement reflects a growing consensus that cybersecurity researchers must go beyond providing capabilities to anticipating and mitigating the potential harms thereof. However, many cybersecurity researchers may be uncertain about how to proceed in an ethics analysis. In this guide, we provide practical support for that requirement by enumerating stakeholder types and mapping them to common empirical research methods. We also offer worked examples to demonstrate how researchers can identify likely stakeholder exposures in realworld projects — By James C. Davis, Sophie Chen, Huiyun Peng, Paschal C. Amusuo, Kelechi G. Kalu https://arxiv.org/pdf/2508.14796
Press enter or click to view image in full size
Press enter or click to view image in full size
📖 In-Training Defenses against Emergent Misalignment in Language Models — Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. By David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai https://arxiv.org/pdf/2508.06249
Press enter or click to view image in full size
📖 Improving Google A2A Protocol: Protecting Sensitive Data and Mitigating Unintended Harms in Multi-Agent Systems — Enterprise-grade evaluations of MCP implementations indicate that without robust enforcement of access control and consent boundaries, attackers may subvert tool endpoints or manipulate contextual payloads to bypass authorization policies. Taken together, these findings reveal a gap between the theoretical design of A2A and its practical resilience against adversarial behavior. While A2A and MCP enable functional interoperability, they do not yet provide sufficient guarantees of confidentiality, integrity, and informed consent for handling sensitive information. This paper addresses this gap by identifying core protocol weaknesses in areas such as token management, authentication strength, scope granularity, and data flow transparency.
We propose a set of targeted enhancements to improve privacy, security, and user control in agent-mediated communications, and demonstrate their application in a real-world example involving vacation booking. Our proposal incorporates best practices from adjacent fields such as zero-trust architectures and regulatory compliance in financial technologies. By Yedidel Louck, Ariel Stulman, Amit Dvir — https://arxiv.org/pdf/2505.12490v3
Press enter or click to view image in full size
📖 Searching for Privacy Risks in LLM Agents via Simulation — Can AI agents with access to sensitive information maintain privacy awareness while interacting with other agents? The future of interpersonal interaction is evolving towards a world where individuals are supported by AI agents acting on their behalf. These agents will not function in isolation; instead, they will collaborate, negotiate, and share information with agents representing others. This shift will introduce novel privacy paradigms that extend beyond conventional large language model (LLM) privacy considerations, such as protecting individual data points during training and safeguarding user queries in cloud-based inference services. By Yanzhe Zhang and Diyi Yang — https://arxiv.org/pdf/2508.10880
📖 Autonomous Blue-Team LLM Agent for Web Attack Forensics — In this paper, the team introduces CyberSleuth, an autonomous LLM agent designed for the forensic investigation of realistic web application attacks. CyberSleuth autonomously analyses post-attack evidence (e.g., packet-level traces or application logs) and reconstructs the incident by (i) identifying the targeted application, (ii) determining the exploited vulnerability, i.e., the exact Common Vulnerabilities and Exposures (CVE) used, and (iii) assessing whether the attack succeeded. Beyond simply producing a verdict, CyberSleuth generates structured, human-readable forensic reports to support Security Operations Centre (SOC) analysts in resolving incidents efficiently. By Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, Zied Ben Houidi, Dario Rossi https://arxiv.org/pdf/2508.20643
Press enter or click to view image in full size
📖 Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence — Effective incident response (IR) is critical for mitigating cyber threats, yet security teams are overwhelmed by alert fatigue, high false-positive rates, and the vast volume of unstructured Cyber Threat Intelligence (CTI) documents. While CTI holds immense potential for enriching security operations, its extensive and fragmented nature makes manual analysis time-consuming and resource-intensive. To bridge this gap, we introduce a novel Retrieval-Augmented Generation (RAG)-based framework that leverages Large Language Models (LLMs) to automate and enhance IR by integrating dynamically retrieved CTI. Our approach introduces a hybrid retrieval mechanism that combines NLP-based similarity searches within a CTI vector database with standardized queries to external CTI platforms, facilitating context-aware enrichment of security alerts. By Amine TELLACHE, Abdelaziz Amara-korba, Amdjed Mokhtari, horea moldovan, Yacine Ghamri-Doudane https://arxiv.org/pdf/2508.10677
Maximize image
Edit image
Delete image
Press enter or click to view image in full size
📖 Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System — When you add an agent to monitor a network, you also add new ways it can be attacked. Using a 7-layer model (MAESTRO). By Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu https://arxiv.org/pdf/2508.10043
Press enter or click to view image in full size
📖 Ransomware 3.0: Self-Composing and LLM-Orchestrated — We propose a novel class of threat — Ransomware 3.0 — that uses LLMs to orchestrate all phases of its attack chain including autonomous synthesis and deployment of tailored malicious payloads on the fly, adapting to the execution environment and personalizing extortion demands. We examine the feasibility and ramifications of Ransomware 3.0, which invokes an LLM to (i) probe the victim environment, (ii) locate sensitive information, (iii) devise and execute an attack vector, and (iv) generate personalized extortion notes, thereby enacting the entire ransomware campaign with no human operator. By Md Raz, Meet Udeshi, Sai Charan P.V, Prashanth K., FARSHAD KHORRAMI, Ramesh Karri https://arxiv.org/pdf/2508.20444v1
Press enter or click to view image in full size
📖 Training Language Model Agents to Find Vulnerabilities with CTF-Dojo — we present CTF-DOJO, the first execution environment that contains hundreds of fully functional CTF challenges in secure Docker containers. CTF-DOJO leverages CTF artifacts (e.g., challenge descriptions and files to reproduce each challenge) from http://pwn.college, a public archive developed by Arizona State University for hands-on cybersecurity education, now used in 145 countries and actively maintained by a team of professors and students. By Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang https://arxiv.org/pdf/2508.18370
Press enter or click to view image in full size
📖 Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs — How can open-weight Large Language Models be safeguarded against malicious uses? We explore an intuitive yet understudied question: Can we prevent LLMs from learning unsafe technical capabilities (such as CBRN) by filtering out enough of the relevant pretraining data before we begin training a model? We train multiple 6.9B LLMs from scratch on an unfiltered dataset and on filtered versions where we filtered out biorisk knowledge. By Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman https://arxiv.org/pdf/2508.06601
Press enter or click to view image in full size
📖 Invitation Is All You Need! Promptware Attacks Against LLM-Powered Assistants in Production Are Practical and Dangerous — by Ben Nassi, PhD, Stav Cohen, Or Yair https://arxiv.org/pdf/2508.12175 / https://sites.google.com/view/invitation-is-all-you-need/