Nosey Parker is Praetorian’s secret detection tool, used regularly in our offensive security engagements. It combines regular expression-based detection with machine learning (ML) to find misplaced secrets in source code and web data. We originally wrote a blog post in March 2022 detailing our approach to integrating machine learning into secrets detection. We have not been idle in the meantime.
Since the original blog post, we have reimplemented and released the regex-based scanner (sans ML-powered features) as an open-source project. We have also reimplemented our proprietary ML-powered version using the open-source project as a base. This post explains a bit about Nosey Parker’s machine learning developments since then.
ML comes into play for two primary tasks within Nosey Parker:

- Scoring the findings produced by the regex-based detection engine, so that likely false positives can be filtered out and the most interesting findings surface first.
- Powering a standalone, purely ML-based detection engine that can find secrets the regex-based rules miss.
For both of these tasks, at a high level, we currently build on top of CodeT5, a large language model (LLM) pretrained on problems involving the generation of source code. Rather than generating source code, in Nosey Parker we have fine-tuned CodeT5 to classify content as “secret” or “not secret”.
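To make that classification setup concrete, here is a minimal sketch of fine-tuning a CodeT5 encoder with a binary classification head using Hugging Face transformers and PyTorch. This is an illustration under our own assumptions (the mean-pooled head, hyperparameters, and training loop are ours), not Nosey Parker’s actual training code:

```python
# Minimal sketch: fine-tune CodeT5's encoder for binary
# "secret" / "not secret" classification. Illustrative only; the
# pooling, head, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class SecretClassifier(nn.Module):
    def __init__(self, base: str = "Salesforce/codet5-base"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.d_model, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden.mean(dim=1))  # mean-pool tokens, then classify

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = SecretClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on a single labeled snippet (1 = secret).
batch = tokenizer(['aws_secret_access_key = "wJalrXUt..."'],
                  return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([1])

optimizer.zero_grad()
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
optimizer.step()
```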
Our fine-tuned models work well. In typical usage, the model that scores regex-based findings eliminates 10–20% of the total reported regex findings, leaving behind the interesting ones and presenting the most interesting first. Additionally, in several offensive security engagements our engineers have conducted, Nosey Parker’s purely ML-based scanner has detected hundreds of real secrets that the regex-based detection engine and other rule-based tools missed.
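The triage step this scoring enables is simple: drop findings the model considers noise and sort the rest by score. The sketch below shows the general shape; the `Finding` type, the stand-in scorer, and the 0.5 cutoff are hypothetical, not Nosey Parker’s actual values:

```python
# Sketch of score-based triage: filter out low-scoring regex findings
# and present the rest most-interesting-first. The cutoff and scorer
# here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    rule: str
    snippet: str
    score: float = 0.0  # P(secret) assigned by the classifier

def triage(findings: list[Finding],
           score: Callable[[str], float],
           cutoff: float = 0.5) -> list[Finding]:
    for f in findings:
        f.score = score(f.snippet)
    kept = [f for f in findings if f.score >= cutoff]
    return sorted(kept, key=lambda f: f.score, reverse=True)

# Example usage with a trivial stand-in scorer.
findings = [
    Finding("generic-api-key", 'api_key = "YOUR_KEY_HERE"'),
    Finding("aws-secret", 'aws_secret_access_key = "wJalrXUt..."'),
]
for f in triage(findings, score=lambda s: 0.9 if "aws" in s else 0.1):
    print(f"{f.score:.2f}  {f.rule}: {f.snippet}")
```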
The major change since our initial announcement in March 2022 is that we have completely reimplemented the proprietary ML-powered version of Nosey Parker. The first version was written in Python and relied on complicated backend orchestration to run work on powerful GPU-equipped VMs in the cloud. This was problematic in a couple of ways:
Python is the _lingua franca_ of ML research and development, so it was natural that we used it for our initial implementation of a secret detection tool. Python is good for scripting and rapid prototyping. However, it is not a fast language in absolute terms, and it proved difficult to parallelize efficiently on multicore systems. These shortcomings became problematic when scaling the first version to large inputs (terabytes of data).
Additionally, Python’s lack of robust compile-time type checking made continued development difficult as the feature set grew and architectural changes became necessary.
To solve the scalability issue, we wrote the reimplementation in Rust. Its single-core performance is many times faster than the initial Python implementation’s, and it runs efficiently in parallel, scaling linearly to at least 64 cores. In our largest engagement, we used it to scan about 20 terabytes of input data on modestly equipped machines.
Additionally, the Rust-based implementation has enabled us to make larger architectural changes to the project than we would ever feel confident doing in a large Python codebase.
The complicated backend orchestration of the initial implementation required sending the input data to separate VMs in Google Cloud for scanning. This simplified one aspect of the secret detection workflow: security engineers would not need to specially provision machines with GPUs. However, this design choice also proved to be problematic at times with respect to data handling policies. In some security engagements, our clients have negotiated that their assets will be accessed only through locked down VMs or separate client-provided hardware. In such cases, we could not transmit the input data to our carefully-provisioned cloud infrastructure, and so engineers on those engagements could not use the ML-based detection engine.
The new Rust-based reimplementation is able to run the ML model inference entirely locally, without making network requests. This has made it possible to use Nosey Parker’s ML-based detection engine in additional engagements where data handling requirements prevent sending data to the cloud. Instead, we can send Nosey Parker to the data.
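Mechanically, running inference entirely on-box means loading the tokenizer and model weights from the local filesystem with network access disabled. A minimal sketch of that pattern with Hugging Face transformers follows; the model directory is hypothetical:

```python
# Sketch of fully local inference: load the tokenizer and fine-tuned
# weights from disk with no network access. The directory path is
# hypothetical; local_files_only=True makes transformers raise an
# error rather than phone home if anything is missing.
import torch
from transformers import AutoTokenizer, T5EncoderModel

model_dir = "/opt/noseyparker/models/secret-classifier"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
encoder = T5EncoderModel.from_pretrained(model_dir, local_files_only=True)
encoder.eval()

batch = tokenizer(['password = "hunter2"'], return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # runs entirely on-box
```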
To train ML models using supervised learning, you need labeled example data. To build the initial version of Nosey Parker, we first constructed a dataset of over 10k distinct secrets found in ~1TB of source code and configuration files. We laboriously reviewed these data points, labeling them as `secret` or `not secret`. This dataset is what we used to fine-tune Nosey Parker’s initial models.
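For illustration, a labeled dataset like this can be stored as little more than snippet/label pairs. The JSON Lines layout and field names below are our own illustration, not Nosey Parker’s actual schema:

```python
# Illustrative JSON Lines layout for a labeled secrets dataset.
# The field names are assumptions, not Nosey Parker's actual schema.
import json

examples = [
    {"snippet": 'aws_secret_access_key = "wJalrXUt..."', "label": "secret"},
    {"snippet": 'api_key = "YOUR_API_KEY_HERE"', "label": "not secret"},
]

with open("secrets-dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reading it back for fine-tuning:
with open("secrets-dataset.jsonl") as f:
    dataset = [json.loads(line) for line in f]
```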
We are currently finalizing the construction of a labeled dataset that has over 100k distinct secrets (10x bigger) from additional types of input data beyond source code and configuration files. We will use the new dataset to retrain our CodeT5-based models. This should improve their (already quite good!) detection performance, particularly for input data from web crawling performed by Chariot.
Although the ML-powered Nosey Parker reimplementation is faster, the appetite for ML model inference is insatiable. We have observed internally that every performance improvement we have made to Nosey Parker has induced additional scanning demand! Having a faster or more capable tool makes more ambitious tasks feasible.
Nosey Parker’s current architecture puts a practical limit on the size of the input data to the pure ML-based detection engine: a machine with a single modest GPU can get through tens of gigabytes during an overnight run. To lift this limit, we are exploring a number of techniques, including multi-GPU parallelization, model quantization and distillation, and a multi-model “patience”-based algorithm. Combined, these techniques should enable a 100x or greater speedup in inference and allow ML-based detection to operate effectively on inputs of arbitrary size.
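Of those techniques, quantization is the easiest to sketch. Here is a generic example of post-training dynamic quantization in PyTorch, which converts a model’s Linear layers to int8 for a smaller footprint and faster CPU inference; this illustrates the general technique, not Nosey Parker’s actual optimization work:

```python
# Sketch of post-training dynamic quantization: convert the model's
# Linear layers to int8. Generic PyTorch; not Nosey Parker's code.
import torch
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained("Salesforce/codet5-base")
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# `quantized` exposes the same interface as `model` but stores packed
# int8 weights in its Linear layers, trading a little accuracy for a
# smaller footprint and faster CPU inference.
```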