something we ran into while building a security tool:
how do you actually know if it works?
most tools point to benchmarks like the OWASP Benchmark, Juliet, etc. and say “we scored well”
but when you look closer, those benchmarks mostly test very obvious patterns
(e.g. basic SQL injection, unsafe eval, etc.)
they don’t really reflect how vulnerabilities show up in real codebases:
issues that span multiple files
logic bugs
context-dependent vulnerabilities
anything that isn’t just pattern matching
so you can have a tool that scores well on benchmarks but still misses real problems
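to make the gap concrete, here's a toy sketch (hypothetical code, not from any benchmark): the first function is the single-line pattern benchmarks tend to test, the second is the kind of cross-function (often cross-file) flow that slips past pure pattern matching because a "sanitizer" sits in the middle

```python
def benchmark_style(user_id: str) -> str:
    # obvious single-function SQL injection: tainted input
    # concatenated straight into the query -- easy to flag
    return "SELECT * FROM users WHERE id = '" + user_id + "'"


# --- in a real codebase these two pieces often live in different files ---

def sanitize(value: str) -> str:
    # looks like a sanitizer, but only strips semicolons;
    # quotes still get through, so the taint survives
    return value.replace(";", "")


def real_world_style(user_id: str) -> str:
    # a pattern matcher that trusts anything routed through
    # sanitize() will mark this as safe; it isn't
    return "SELECT * FROM users WHERE id = '" + sanitize(user_id) + "'"


if __name__ == "__main__":
    payload = "x' OR '1'='1"
    print(benchmark_style(payload))
    print(real_world_style(payload))  # the injection survives the "sanitizer"
```

catching the second one requires tracking data flow across boundaries and reasoning about what sanitize() actually does, which is exactly what the narrow benchmark cases never exercise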
we ended up going down a rabbit hole on this and wrote about why we think existing benchmarks fall short and what a more realistic one should look like:
https://kolega.dev/blog/why-we-built-our-own-security-benchmark/
curious what others think — do people actually trust benchmark results when evaluating security tools?