Why Does Have I Been Pwned Contain "Fake" Email Addresses?
The article explains how the "Have I Been Pwned" (HIBP) service uses an open source tool to extract and store email addresses from data breaches, and why some seemingly invalid or abusive addresses end up being included. Such addresses can come from unverified sign-ups or test data, and HIBP cannot verify that each address is real. The author also responds to user criticism and offers a way to show that randomly generated fake addresses do not appear in the system. 2025-12-3 23:37:6 Author: www.troyhunt.com

Normally, when someone sends feedback like this, I ignore it, but it happens often enough that it deserves an explainer, because the answer is really, really simple. So simple, in fact, that it should be evident to the likes of Bruce, who decided his misunderstanding deserved a 1-star Trustpilot review yesterday:

Now, frankly, Trustpilot is a pretty questionable source of real-world, quality reviews anyway, but the same feedback has come through other channels enough times that it's worth sorting this out once and for all. It all begins with one simple question:

What is an Email Address?

You think you know - and Bruce thinks he knows - but you might both be wrong. To explain the answer to the question, we need to start with how HIBP ingests data, and that really is pretty simple: someone sends us a breach (which is typically just text files of data), and we run the open source Email Address Extractor tool over it, which dumps all the unique addresses into a file. That file is then uploaded into the system, where the addresses become searchable.

The logic for how we extract addresses is all in that GitHub repository, but in simple terms, it boils down to this:

  1. There must be an @ symbol
  2. There can be up to 64 characters before it (the alias)
  3. There can be up to 255 characters after it (the domain)
  4. The domain must contain a period
  5. The domain must also have a valid TLD
  6. A few other little criteria that are all documented in the public repo
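The criteria above can be sketched in a few lines of Python. This is a rough illustration only, not the actual Email Address Extractor code: the names `looks_like_email`, `extract_unique`, `CANDIDATE` and `KNOWN_TLDS` are hypothetical, the TLD set is a tiny stand-in for a real TLD list, lowercasing for de-duplication is an assumption, and the "few other little criteria" from the public repo are not modelled at all.

```python
import re

# Tiny illustrative subset; a real check would use a full TLD list (assumption).
KNOWN_TLDS = {"com", "net", "org", "io", "au"}

# Loose candidate pattern: runs of "ordinary" characters either side of an "@".
CANDIDATE = re.compile(r"[^\s@<>\"',;]+@[^\s@<>\"',;]+")


def looks_like_email(candidate: str) -> bool:
    """Structural check only; says nothing about whether a mailbox exists."""
    if candidate.count("@") != 1:           # 1. there must be an @ symbol
        return False
    alias, domain = candidate.split("@")
    if not 1 <= len(alias) <= 64:           # 2. up to 64 characters before it
        return False
    if not 1 <= len(domain) <= 255:         # 3. up to 255 characters after it
        return False
    if "." not in domain:                   # 4. the domain must contain a period
        return False
    tld = domain.rsplit(".", 1)[1].lower()  # 5. the domain must have a valid TLD
    return tld in KNOWN_TLDS


def extract_unique(text: str) -> set[str]:
    """Pull every structurally valid address out of raw text, de-duplicated."""
    return {m.lower() for m in CANDIDATE.findall(text) if looks_like_email(m)}
```

Note that nothing here contacts a mail server. An obviously made-up alias on a placeholder domain passes just as easily as a genuine address, because only the structure is being checked.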

That is all! We can't then tell if there's an actual mailbox behind the address, as that would require massive per-address processing, for example, sending an email to each one and seeing if it bounces. Can you imagine doing that 7 billion times?! That's the number of unique addresses in HIBP, and clearly, it's impossible. So, that means all the following were parsed as being valid and loaded into HIBP (deep links to the search result):

  1. [email protected]
  2. [email protected]
  3. thisisfuckinguseless@foo.com

I particularly like that last one, as it feels like a sentiment Bruce would express. It's also a great example as it's clearly not "real"; the alias is a bit of a giveaway, as is the domain ("foo" is commonly used as a placeholder, similar to how we might also use "bar", or combine them as "foo bar"). But if you follow the link and see the breach it was exposed in, you'll see a very familiar name:

Which brings us to the next question:

How Do "Fake" Email Addresses End up in Real Websites?

This is also going to seem profoundly simple when you see it. Here goes:

Any questions, Bruce? This is just as easily explainable as why we considered it a valid address and ingested it into HIBP: the email address has a valid structure. That is all. That's how it got into Adobe, and that's how it then flowed through into HIBP.

Ah, but shouldn't Adobe verify the address? I mean, shouldn't they send an email to the address along the lines of "Hey, are you sure you want to sign up for this service?" Yes, they should, but here's the kicker: that doesn't stop the email address from being added to their database in the first place! The way this normally works (and this is what we do with HIBP when you sign up for the free notification service) is you enter the email address, the system generates a random token, and then the two are saved together in the database. A link with the token is then emailed to the address and used to verify the user if they then follow that link. And if they don't follow that link? We delete the email address if it hasn't been verified within a few days, but evidently, Adobe doesn't. Most services don't, so here we are.
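To make that flow concrete, here is a minimal in-memory sketch of it. It is not HIBP's or Adobe's code: `SignupStore`, its methods and the three-day `VERIFY_WINDOW` are all hypothetical (the only detail given above is "a few days"), and a real service would use a database and actually send the email.

```python
import secrets
from datetime import datetime, timedelta, timezone

# Assumed value; the text above only says "a few days".
VERIFY_WINDOW = timedelta(days=3)


class SignupStore:
    """Hypothetical stand-in for the database table behind a sign-up form."""

    def __init__(self):
        self._rows = {}  # email -> {"token", "created", "verified"}

    def sign_up(self, email, now=None):
        # The address is stored BEFORE anyone has proven they own it.
        now = now or datetime.now(timezone.utc)
        token = secrets.token_urlsafe(32)
        self._rows[email] = {"token": token, "created": now, "verified": False}
        return token  # would be emailed to the address inside a link

    def verify(self, email, token):
        # Called when the user follows the emailed link.
        row = self._rows.get(email)
        if row is None or not secrets.compare_digest(row["token"], token):
            return False
        row["verified"] = True
        return True

    def purge_unverified(self, now=None):
        # The clean-up step that many services never run.
        now = now or datetime.now(timezone.utc)
        stale = [email for email, row in self._rows.items()
                 if not row["verified"] and now - row["created"] > VERIFY_WINDOW]
        for email in stale:
            del self._rows[email]
        return len(stale)

    def emails(self):
        return set(self._rows)
```

If `purge_unverified` is never called, every address ever typed into the form stays in the table, verified or not, and that is exactly what ends up in a breach dump.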

How Can I Be Really Sure Actual Fake Addresses Aren't in HIBP?

This is also going to seem profoundly obvious, but genuinely random email addresses (not "thisisfuckinguseless@") won't show up in HIBP. Want to test the theory? Try 1Password's generator (yes, Bruce, they also sponsor HIBP):

Now, whack that on the foo.com domain and do a search:

Huh, would you look at that? And you can keep doing that over and over again. You’ll get the same result because they are fabricated addresses that no one else has created or entered into a website that was subsequently breached, ipso facto proving they cannot appear in the dataset.
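If you'd rather not use a password manager for this, the same experiment can be approximated with a few lines of Python. This is only a sketch: `random_address`, `keyspace` and the 20-character lowercase-and-digits alias are arbitrary choices of mine, not anything HIBP or 1Password prescribes, and the search itself is still done by hand on the website.

```python
import secrets
import string

# Characters used for the random alias (an arbitrary choice for this sketch).
ALPHABET = string.ascii_lowercase + string.digits


def random_address(domain="foo.com", length=20):
    """Build an address nobody is realistically going to have typed before."""
    alias = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{alias}@{domain}"


def keyspace(length=20):
    """Number of distinct aliases of the given length over ALPHABET."""
    return len(ALPHABET) ** length


if __name__ == "__main__":
    print(random_address())
    print(f"possible aliases: {keyspace():.2e}")
```

The point of `keyspace` is scale: the number of possible 20-character aliases dwarfs the 7 billion addresses in HIBP, so a freshly generated one only shows up if someone actually entered it into a site that was later breached.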

Conclusion

Today is HIBP's 12th birthday, and I've taken particular issue with Bruce's review because it calls into question the integrity with which I run this service. This is now the 218th blog post I've written about HIBP, and over the last dozen years, I've detailed everything from the architecture to the ethical considerations to how I verify breaches. It's hard to imagine being any more transparent about how this service runs, and per the above, it's very simple to disprove the Bruces of the world. If you've read this far and have an accurate, fact-based review you'd like to leave, that'd be awesome 😊


Hi, I'm Troy Hunt. I write this blog, create courses for Pluralsight, and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals.


Source: https://www.troyhunt.com/why-does-have-i-been-pwned-contain-fake-email-addresses/