This is the second part of our series on building a lightweight, vendor-free anti-bot system to protect your login endpoint.
In Part 1, we focused on the client side: we designed a fingerprinting script that collects various signals from the browser, obfuscates the code, encrypts the payload, and injects it into a login request. That setup lays the groundwork, but on its own, it doesn’t improve security. It’s just instrumentation.
To actually defend against bots, we now need to do something with the fingerprint once it reaches the server.
That’s the focus of this article. We’ll show how to use the fingerprint for two practical defenses:

- rule-based bot detection, which rejects requests whose fingerprints show signs of automation or tampering
- fingerprint-based rate limiting, which throttles repeated attempts even when the attacker rotates IPs

This part of the system remains simple on purpose. It’s still a toy project, but the ideas behind it mirror real-world production practices. We’ll highlight where things can break down, what assumptions are reasonable, and where to be cautious.
The full source code for this article and Part 1 is available on GitHub: castle/castle-code-examples.
We continue building on the toy website introduced in Part 1. This time, we set up a basic Express.js server to receive and process login requests. The choice of Express is incidental; our goal is to focus on the detection logic, not the framework. Everything presented here can be adapted to other backends, whether you use Python, Go, or another language.
The server exposes just two routes:
- GET /: serves the login page
- POST /login: handles login submissions, including the encrypted fingerprint

// server.js
const express = require('express');
const path = require('path');
const { sanitizeLoginData, loginRateLimiter, detectBot } = require('./lib/middlewares');

const app = express();
const PORT = 3010;

app.use(express.json());
app.use(express.urlencoded({ extended: true }));
app.use(express.static(path.join(__dirname, 'static')));

// Basic route for the root path - serve the login page
app.get('/', (req, res) => {
  res.sendFile(path.join(__dirname, 'static', 'login.html'));
});

// POST /login route with middleware chain:
// 1. sanitize -> 2. detectBot -> 3. loginRateLimiter -> 4. route handler
app.post('/login',
  sanitizeLoginData,
  detectBot,
  loginRateLimiter,
  async (req, res) => {
    const { email, password } = req.sanitizedData;

    // This is a demo, so we always treat the login as valid.
    // In a real implementation, you would validate credentials here.
    const isValidLogin = true || (email === '[email protected]' && password === 'test');

    if (!isValidLogin) {
      return res.status(400).json({
        success: false,
        message: 'Invalid login attempt'
      });
    }

    // Set a session cookie
    res.cookie('session', 'fake-session-token-' + Date.now(), {
      httpOnly: true,
      secure: false, // Set to true in production with HTTPS
      sameSite: 'strict',
      maxAge: 24 * 60 * 60 * 1000 // 24 hours
    });

    // Return success response
    res.json({
      success: true,
      message: 'Login successful',
      fingerprintProcessed: true,
    });
  }
);

// Start the server
app.listen(PORT, () => {
  console.log(`Server is running on http://localhost:${PORT}`);
  console.log(`Static files are served from: ${path.join(__dirname, 'static')}`);
});
When a POST request hits the /login route, we chain three middleware functions before executing the login logic:

- sanitizeLoginData: validates the request payload. It checks for the presence of email and password, and attempts to decrypt the fingerprint. If successful, it attaches the parsed fingerprint to req.sanitizedData. If not, it returns a generic 400 error without revealing the reason (a sketch of this middleware follows below).
- detectBot: applies fingerprint-based bot detection rules (detailed in the next section). If the fingerprint matches known automation patterns or shows signs of tampering, the request is rejected.
- loginRateLimiter: implements rate limiting keyed off the fingerprint (rather than the IP). This helps mitigate distributed attacks that rotate IPs but reuse the same device fingerprint. We'll go deeper into this below.

This flow gives us a layered defense: we sanitize, detect, and limit before we ever touch login logic or database queries.
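For reference, here’s a minimal sketch of what sanitizeLoginData could look like. The decryptFingerprint helper is a hypothetical stand-in for the decryption counterpart of Part 1’s encryption scheme; the real implementation depends on how the payload was encrypted client-side.

// in lib/middlewares.js - a minimal sketch, not the exact implementation
// decryptFingerprint is a hypothetical helper matching Part 1's encryption
const sanitizeLoginData = (req, res, next) => {
  const { email, password, fingerprint } = req.body || {};
  if (!email || !password || !fingerprint) {
    return res.status(400).json({ success: false, message: 'Invalid login attempt' });
  }
  try {
    // Throws if the payload can't be decrypted or was tampered with
    const parsedFingerprint = decryptFingerprint(fingerprint);
    req.sanitizedData = { email: String(email).trim(), password, fingerprint: parsedFingerprint };
    next();
  } catch (e) {
    // Same generic error: no feedback for attackers probing the payload format
    return res.status(400).json({ success: false, message: 'Invalid login attempt' });
  }
};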
In this section, we focus on the detectBot middleware. Its role is to analyze the decrypted fingerprint attached to a login request and decide whether the environment shows signs of automation, spoofing, or inconsistency.
This middleware runs after payload sanitization and before rate limiting. At this stage, we assume the fingerprint is valid and decrypted, and we want to assess its trustworthiness using simple heuristic rules.
Here’s the core logic:
// in lib/middlewares.js
const detectBot = (req, res, next) => {
  console.log('Bot detection middleware executing...');
  const { fingerprint } = req.sanitizedData;

  // Perform bot detection
  const botDetection = isBot(fingerprint);
  req.isBot = botDetection.isBot;
  req.isOutdatedPayload = botDetection.isOutdatedPayload;

  if (botDetection.isBot || botDetection.isOutdatedPayload) {
    console.log('Bot detection: Bot detected');
    return res.status(400).json({
      success: false,
      message: 'Invalid login attempt'
    });
  }

  console.log('Bot detection: Human user detected');
  next();
};
This middleware delegates detection to the isBot function defined in lib/botDetection.js. That function applies a series of checks to the fingerprint and returns a verdict. Here are some of the tests we use:
- hasBotUserAgent: detects bot-like terms in the user agent string (e.g. headless, bot, crawler)
- hasHeadlessChromeScreenResolution: checks for the default headless Chrome resolution of 800x600
- hasWorkerInconsistency: verifies consistency between fingerprinting signals collected in the main JS context and in a Web Worker. Inconsistencies may reveal spoofing.

Again, the goal here is not to provide a full taxonomy of detection methods, but to show where such logic can live and how it can evolve. You can define your own heuristics and plug them into this system. For instance:
- hasOSInconsistency flags environments where the user agent claims to be on Windows while navigator.platform suggests macOS.

Separately, we also check for staleness using isOutdatedPayload. If the payload is older than a defined threshold, we reject it. This helps mitigate replay attacks or delayed replays.
function hasBotUserAgent(fingerprint) {
  const uaLower = fingerprint.userAgent.toLowerCase();
  return uaLower.includes('headless') || uaLower.includes('bot') || uaLower.includes('crawler') || uaLower.includes('spider');
}

function hasWebDriverTrue(fingerprint) {
  return fingerprint.webdriver;
}

function hasHeadlessChromeScreenResolution(fingerprint) {
  return (fingerprint.screen.width === 800 && fingerprint.screen.height === 600) ||
    (fingerprint.screen.availWidth === 800 && fingerprint.screen.availHeight === 600);
}

function hasPlaywright(fingerprint) {
  return fingerprint.playwright;
}

function hasCDPAutomation(fingerprint) {
  const cdpInMainContext = fingerprint.cdp;
  const cdpInWorker = fingerprint.worker.cdp;
  return cdpInMainContext || cdpInWorker;
}

function hasOSInconsistency(fingerprint) {
  return fingerprint.userAgent.includes('Win') && fingerprint.platform.includes('Mac');
}

function hasHighCPUCoresCount(fingerprint) {
  return fingerprint.cpuCores > 90;
}

function hasWorkerInconsistency(fingerprint) {
  if (!fingerprint.worker || fingerprint.worker.userAgent === 'NA') {
    return false;
  }
  const hasInconsistency = !(
    fingerprint.worker.webGLVendor === fingerprint.webgl.unmaskedVendor &&
    fingerprint.worker.webGLRenderer === fingerprint.webgl.unmaskedRenderer &&
    fingerprint.worker.userAgent === fingerprint.userAgent &&
    fingerprint.worker.languages === fingerprint.languages &&
    fingerprint.worker.platform === fingerprint.platform &&
    fingerprint.worker.hardwareConcurrency === fingerprint.cpuCores
  );
  return hasInconsistency;
}

function isOutdatedPayload(fingerprint, maxMinutes) {
  if (!fingerprint.timestamp) return true;
  const timestamp = new Date(fingerprint.timestamp);
  const now = new Date();
  const diff = now.getTime() - timestamp.getTime();
  return diff > 1000 * 60 * maxMinutes;
}
All detection functions are wrapped in safeEval to prevent a single faulty value from crashing the logic:
function isBot(fingerprint) {
  // safeEval forwards any extra arguments to the check, so rules that take
  // parameters (like isOutdatedPayload's maxMinutes) receive them correctly
  const safeEval = (fn, ...args) => {
    try {
      return fn(...args);
    } catch (e) {
      return false;
    }
  };

  const botDetectionChecks = {
    botUserAgent: safeEval(hasBotUserAgent, fingerprint),
    webdriver: safeEval(hasWebDriverTrue, fingerprint),
    headlessChromeScreenResolution: safeEval(hasHeadlessChromeScreenResolution, fingerprint),
    playwright: safeEval(hasPlaywright, fingerprint),
    cdp: safeEval(hasCDPAutomation, fingerprint),
    osInconsistency: safeEval(hasOSInconsistency, fingerprint),
    workerInconsistency: safeEval(hasWorkerInconsistency, fingerprint),
    highCPUCoresCount: safeEval(hasHighCPUCoresCount, fingerprint),
  };

  return {
    isBot: Object.values(botDetectionChecks).some(check => check),
    numChecks: Object.values(botDetectionChecks).filter(check => check).length,
    checks: botDetectionChecks,
    isOutdatedPayload: safeEval(isOutdatedPayload, fingerprint, 15)
  };
}

module.exports = {
  isBot
};
The isBot function returns a structured object:

- isBot: true if any detection rule matched (you could use a scoring system instead; a sketch follows below)
- numChecks: how many rules matched, which could be useful to build thresholds
- checks: the raw output of all individual tests
- isOutdatedPayload: true if the fingerprint is older than 15 minutes

If isBot or isOutdatedPayload is true, we stop the request and return a generic error. This avoids giving feedback that could help attackers tune their spoofing.
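If you prefer a score over a hard verdict, here’s a hypothetical variation built on the checks object. The weights and threshold are illustrative placeholders, not tuned values:

// Hypothetical scoring variant: each matched rule contributes a weight
// instead of any single match blocking the request outright
const RULE_WEIGHTS = {
  botUserAgent: 1.0,
  webdriver: 1.0,
  playwright: 1.0,
  cdp: 0.8,
  osInconsistency: 0.6,
  workerInconsistency: 0.6,
  headlessChromeScreenResolution: 0.4,
  highCPUCoresCount: 0.3,
};

function scoreVerdict(checks, threshold = 1.0) {
  const score = Object.entries(checks)
    .reduce((sum, [name, matched]) => sum + (matched ? (RULE_WEIGHTS[name] || 0) : 0), 0);
  return { score, isBot: score >= threshold };
}

With this shape, a single weak signal (like an unusual CPU core count) isn’t enough to block a request, but several together are.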
This setup gives you a foundation that’s easy to extend: you can add more rules, refine your thresholds, or change your verdict logic, all without touching the rest of your login flow.
Our fingerprint-based rate limiter builds on the express-rate-limit package. By default, this package limits traffic using the IP address as the aggregation key, but that isn’t sufficient when facing credential stuffing or bot attacks that rotate IPs. Fortunately, express-rate-limit exposes a keyGenerator option, which allows us to use a custom key instead. That’s where the fingerprint comes in.
IP-based rate limiting is still useful and should remain part of your defense stack. It makes attackers pay more to scale their operation, since they need access to residential or proxy IPs. But once they rotate IPs, which they often do, IP-based limits lose their effectiveness. A fingerprint-based rate limiter adds an additional layer: instead of counting attempts per IP, we count them per device fingerprint. This helps catch distributed attacks that reuse the same environment while hopping across IPs.
Here’s how our fingerprint-based limiter is configured. We apply a threshold of 50 attempts per fingerprint within a 15-minute window. When the limit is exceeded, we reject the request with a 400 response.
const loginRateLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  limit: 50, // Limit each fingerprint to 50 login attempts per 15 minutes
  keyGenerator: (req) => {
    // Compute fingerprint hash directly here
    if (req.sanitizedData && req.sanitizedData.fingerprint) {
      const rateLimitHash = computeRateLimitFingerprintHash(req.sanitizedData.fingerprint);
      return rateLimitHash;
    }
    // Return a default key if no fingerprint is available
    return 'default-key';
  },
  handler: (req, res) => {
    console.log('Login route handler: Rate limit exceeded');
    return res.status(400).json({
      success: false,
      message: 'Invalid login attempt'
    });
  },
  skip: (req) => {
    // Skip rate limiting if it's a bot (bots are handled separately)
    return req.isBot === true;
  },
  requestPropertyName: 'rateLimit', // Attach rate limit info to req.rateLimit
  standardHeaders: true, // Enable RateLimit headers
  legacyHeaders: false, // Disable X-RateLimit headers
});
Rate limiting always involves tradeoffs. A short window with a low threshold can block bursty attacks quickly, but may miss low-and-slow ones. A long window with a higher threshold catches slower attempts but increases the risk of false positives.

There’s no universal answer here; you’ll need to calibrate your limits based on real traffic data. A common strategy is to use multiple limiters in parallel:

- a short, strict window that stops bursts quickly
- a longer, more lenient window that catches low-and-slow attempts
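As a sketch, running two limiters in parallel with express-rate-limit simply means chaining two instances on the route. The windows and limits here are illustrative, loginHandler stands in for the route handler, and fingerprintKey reuses the keyGenerator logic shown above:

// Illustrative only: a strict burst limiter plus a lenient long-window
// limiter, both keyed on the fingerprint hash
const fingerprintKey = (req) =>
  req.sanitizedData && req.sanitizedData.fingerprint
    ? computeRateLimitFingerprintHash(req.sanitizedData.fingerprint)
    : 'default-key';

const burstLimiter = rateLimit({ windowMs: 60 * 1000, limit: 10, keyGenerator: fingerprintKey });
const slowLimiter = rateLimit({ windowMs: 60 * 60 * 1000, limit: 100, keyGenerator: fingerprintKey });

app.post('/login', sanitizeLoginData, detectBot, burstLimiter, slowLimiter, loginHandler);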
Let’s revisit the keyGenerator logic. It calls computeRateLimitFingerprintHash to transform the raw fingerprint into a stable, spoof-resistant key:
keyGenerator: (req) => {
  // Compute fingerprint hash directly here
  if (req.sanitizedData && req.sanitizedData.fingerprint) {
    const rateLimitHash = computeRateLimitFingerprintHash(req.sanitizedData.fingerprint);
    console.log('Rate limiter: Hash computed:', rateLimitHash);
    return rateLimitHash;
  }
  // Return a default key if no fingerprint is available
  return 'default-key';
}
Now, why not just hash the entire fingerprint with JSON.stringify? Because in practice, attackers randomize attributes to evade detection, especially the user agent, which is one of the easiest values to spoof.

If we included the entire stringified fingerprint, changing a single character in the user agent would completely change the hash. That would make the rate limiter trivial to bypass.
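A quick way to convince yourself of that avalanche effect (the two digests share nothing, even though only one character differs):

const crypto = require('crypto');
const h = (obj) => crypto.createHash('sha256').update(JSON.stringify(obj)).digest('hex');

console.log(h({ userAgent: 'Mozilla/5.0 ... Chrome/124.0.0.0' }));
console.log(h({ userAgent: 'Mozilla/5.0 ... Chrome/124.0.0.1' })); // entirely different digest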
Instead, we want to build a resilient aggregation key: one that ignores noisy or attacker-controlled attributes, but still captures enough structure to link similar environments.
We apply the following principles when selecting fields for the hash:

- exclude attributes that attackers randomize most often (like the user agent)
- prefer signals tied to the underlying hardware and environment (screen, WebGL, CPU, timezone)
- neutralize signals known to be deliberately randomized, such as a canvas hash produced under an anti-canvas extension

This helps ensure that devices with slightly different but forged environments still map to the same rate-limiting bucket.
function safeConvertToString(value) {
  if (value === null || value === undefined) {
    return 'NA';
  }
  return value.toString();
}

function computeRateLimitFingerprintHash(fingerprint) {
  const dataHash = [
    // We don't use the user agent since it can be spoofed too easily
    safeConvertToString(fingerprint.cpuCores),
    safeConvertToString(fingerprint.deviceMemory),
    safeConvertToString(fingerprint.language),
    safeConvertToString(fingerprint.languages),
    safeConvertToString(fingerprint.timezone),
    safeConvertToString(fingerprint.platform),
    safeConvertToString(fingerprint.maxTouchPoints),
    safeConvertToString(!!fingerprint.webdriver),
    safeConvertToString(fingerprint.webgl.unmaskedRenderer),
    safeConvertToString(fingerprint.webgl.unmaskedVendor),
    // Screen-related signals
    safeConvertToString(fingerprint.screen.width),
    safeConvertToString(fingerprint.screen.height),
    safeConvertToString(fingerprint.screen.colorDepth),
    safeConvertToString(fingerprint.screen.availWidth),
    safeConvertToString(fingerprint.screen.availHeight),
    safeConvertToString(fingerprint.playwright),
    safeConvertToString(fingerprint.cdp),
    // Worker signals
    safeConvertToString(fingerprint.worker.webGLVendor),
    safeConvertToString(fingerprint.worker.webGLRenderer),
    safeConvertToString(fingerprint.worker.languages),
    safeConvertToString(fingerprint.worker.platform),
    safeConvertToString(fingerprint.worker.hardwareConcurrency),
    safeConvertToString(fingerprint.worker.cdp),
    // If the canvas hash is randomized by an extension, ignore it to keep the key stable
    fingerprint.canvas.hasAntiCanvasExtension || fingerprint.canvas.hasCanvasBlocker ? 'IGNORE' : fingerprint.canvas.hash,
  ];

  const hash = crypto.createHash('sha256').update(dataHash.join('')).digest('hex');
  return hash;
}
Of course, this field selection is subjective. Everything client-side can be modified. But in practice, certain attributes (like the user agent or languages) are modified far more often than others, and thus make poor keys for long-lived tracking or rate limiting.
The techniques introduced across these two articles (client-side fingerprinting, payload encryption, bot heuristics, and fingerprint-based rate limiting) are designed to be practical foundations for real-world bot detection. While the implementation itself is a proof of concept, the concepts are production-relevant and can serve as a lightweight first layer of protection.
Used correctly, this layer can help block obvious automated traffic before handing off requests to more expensive third-party detection systems. This not only reduces operational cost but also filters low-effort attacks early.
That said, there’s plenty of room to harden and extend this setup.
Basic obfuscation isn’t enough. The POC uses obfuscator.io via Webpack. While this helps deter casual analysis, it’s not robust against skilled reverse engineers. Tools like deobfuscate.io are designed specifically to unravel common obfuscation patterns, and will likely succeed against ours. For production, you’d need deeper protections, runtime integrity checks, and potentially VM-based obfuscation.
Static logic is a weakness. Our current script behaves the same for every user. The encryption key is hardcoded and constant, and the payload structure is predictable. An attacker could hook the encryption logic, replay it with forged values, and produce a "valid" payload without executing the real signal collection. A more resilient system would rotate keys per session or per user, ideally tying encryption to server-issued tokens or session secrets.
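As a hedged sketch of that idea: the server could mint a short-lived key per page load and remember it, so a replayed payload encrypted with an old or forged key fails decryption. The endpoint name, in-memory storage, and TTL below are all assumptions, not the POC’s behavior:

// Hypothetical per-session key issuance; production code would bind keys
// to sessions and use more durable storage than process memory
const crypto = require('crypto');
const issuedKeys = new Map(); // keyId -> { key, expiresAt }

app.get('/fp-key', (req, res) => {
  const keyId = crypto.randomUUID();
  const key = crypto.randomBytes(32).toString('base64');
  issuedKeys.set(keyId, { key, expiresAt: Date.now() + 5 * 60 * 1000 }); // 5-minute TTL
  res.json({ keyId, key });
});

// During sanitization, look up the key by keyId, reject missing or expired
// keys, and decrypt the fingerprint with it instead of a hardcoded constant.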
Fingerprint depth and tamper signals are limited. The current signal set is narrow: mostly browser- and hardware-level attributes. A more complete implementation would collect richer signals and add more tamper-evidence checks across execution contexts.
Use multiple rate-limiting windows. The current fingerprint-based rate limiter operates with a single window and threshold. In practice, a layered rate limiter is more effective, for example combining a strict short window with a more lenient long one, as sketched earlier.
Limit based on failed attempts. Right now, rate limits are applied to all login attempts. A more forgiving approach would only count the failed ones. This allows for repeated legitimate logins without penalty while still catching brute force patterns. For example, a fingerprint with many failed attempts across rotating IPs could be temporarily blocked without affecting valid users.
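express-rate-limit can approximate this with its skipSuccessfulRequests option, which removes a request from the count once the response succeeds (status below 400 by default). The threshold here is illustrative:

// Only failed attempts count toward this limit
const failedLoginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  limit: 10, // illustrative threshold for failed attempts
  skipSuccessfulRequests: true,
  keyGenerator: (req) => computeRateLimitFingerprintHash(req.sanitizedData.fingerprint),
});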
Tune thresholds based on fingerprint popularity. Not all fingerprints are equally rare. Many iPhones, for example, share near-identical environments. A static threshold might block those users too aggressively. Ideally, rate limits should be adaptive: fingerprints that appear rarely can be rate-limited more aggressively than ones that are common and tied to legitimate traffic.
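Since express-rate-limit also accepts a function for limit, an adaptive threshold can be sketched as follows. popularityOf is a hypothetical lookup backed by your own traffic statistics, and the numbers are placeholders:

// Hypothetical adaptive limiter: common fingerprints (e.g. stock iPhones)
// get a higher threshold than rarely seen ones
const adaptiveLoginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  limit: (req) => {
    const hash = computeRateLimitFingerprintHash(req.sanitizedData.fingerprint);
    return popularityOf(hash) > 1000 ? 200 : 20; // placeholder values
  },
  keyGenerator: (req) => computeRateLimitFingerprintHash(req.sanitizedData.fingerprint),
});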
Expand the detection rules. The isBot function is deliberately minimal, but a production system should go further, layering additional heuristics and cross-context consistency checks.
These aren’t about absolute accuracy; they’re about layering heuristics to increase confidence without overfitting.
Lack of visibility is dangerous. One major gap in the current system is observability. In production, you need to understand why requests were blocked, especially for debugging or tuning purposes. This means logging which rules fired, tracking block rates over time, and keeping enough context to investigate individual decisions.
Even basic dashboards can provide early warning signs of misclassifications.
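Even a single structured log line emitted from detectBot before blocking goes a long way. A sketch, assuming it runs where botDetection is in scope:

// Record which rules fired so dashboards can track block reasons
// and spot misclassification spikes
console.log(JSON.stringify({
  event: 'login_blocked',
  reason: botDetection.isOutdatedPayload ? 'stale_payload' : 'bot_rules',
  matchedRules: Object.entries(botDetection.checks)
    .filter(([, matched]) => matched)
    .map(([name]) => name),
  timestamp: new Date().toISOString(),
}));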
No risk-based context. Every user in the current system is treated the same. But user context matters: a returning user on a known device deserves more leeway than a first-time login from an unfamiliar environment.
A system with adaptive risk scoring would treat new or risky contexts more cautiously, while allowing known users some leeway.
Detection logic should be decoupled. Currently, all detection logic is embedded in route middleware. This makes deployment risky: one logic error could block all logins. A better approach is to externalize detection logic into a decision engine or policy layer. Ideally, it should support dry runs, logging, and staged rollout so you can measure impact before enforcing rules.
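A minimal version of a staged rollout is a dry-run gate inside detectBot: evaluate and log, but only enforce when explicitly enabled. The environment variable name is an assumption:

// Hypothetical dry-run gate inside detectBot: rules still run and are
// logged, but blocking only happens once enforcement is switched on
const ENFORCE_BOT_RULES = process.env.ENFORCE_BOT_RULES === 'true';

if (botDetection.isBot && !ENFORCE_BOT_RULES) {
  console.log('Bot detection (dry run): would have blocked this request');
  return next(); // let the request through while measuring impact
}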
These improvements are not exhaustive. But they highlight the difference between a basic anti-bot filter and a production-grade detection system. Moving toward the latter means not just better rules, but safer deployments, better observability, and a system that can evolve with attacker behavior.
*** This is a Security Bloggers Network syndicated blog from The Castle blog authored by Antoine Vastel. Read the original post at: https://blog.castle.io/roll-your-own-bot-detection-server-side-detection-part-2/