In yesterday's diary, I discussed a new proposed top-level domain, ".internal". This reminded me to talk a bit about what a top-level domain is all about, and some different ways to look at the definition of a top-level domain.
A quick trip to Google leads to the official definition of "top-level domain" in RFC 1591 [1]:
There are a set of what are called "top-level domain names" (TLDs). These are the generic TLDs (EDU, COM, NET, ORG, GOV, MIL, and INT), and the two letter country codes from ISO-3166. It is extremely unlikely that any other TLDs will be created.
That last sentence could have aged better. By my count, there are currently 1,452 different top-level domains [2]. But things are even a bit more complex as you start trying to figure out what a "domain" is all about.
There are some "domains" that behave more like top-level domains. For example, "co.uk" is used instead of "uk" to assign domain names to entities (there are some legacy .uk domains left). This makes things more complex if you are trying to extract, for example, unique domain names from DNS logs: "co.uk" is likely not what you were looking for. This also affects cookies. Browsers will not allow you to set a cookie for a TLD name, and it would not make much sense to allow cookies for "co.uk" either, even though that is technically a "domain".
And HTTP cookies offer a path to a solution to the problem. From RFC 6265 [3]:
A "public suffix" is a domain that is controlled by a public registry, such as "com", "co.uk", and "pvt.k12.wy.us". This step is essential for preventing attacker.com from disrupting the integrity of example.com by setting a cookie with a Domain attribute of "com". Unfortunately, the set of public suffixes (also known as "registry controlled domains") changes over time. If feasible, user agents SHOULD use an up-to-date public suffix list, such as the one maintained by the Mozilla project at <http://publicsuffix.org/>.
If you are looking for unique domains, you should be looking for <domain>.<publicsuffix>.
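As a minimal sketch of that idea, the following Python snippet reduces host names, as they might appear in a DNS log, to <domain>.<publicsuffix> by matching the longest known suffix. The suffix set is a tiny hand-picked sample rather than the real list, the function name `registrable_domain` and the sample host names are made up for illustration, and wildcard rules are ignored here.

```python
# Tiny hand-picked sample; the real public suffix list has thousands of entries.
SAMPLE_SUFFIXES = {"com", "net", "uk", "co.uk", "ac.uk"}


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return <domain>.<publicsuffix> for hostname, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the name itself is a public suffix
            return ".".join(labels[i - 1:])
    return None  # no suffix from the sample set matched


# Hypothetical host names standing in for entries from a DNS log.
log_names = ["www.example.co.uk", "mail.example.co.uk", "example.com", "co.uk"]
unique_domains = sorted({d for d in map(registrable_domain, log_names) if d})
print(unique_domains)
```

With this sample, both "www.example.co.uk" and "mail.example.co.uk" collapse to "example.co.uk", and "co.uk" on its own is discarded.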
For Python developers, there are luckily two different libraries to use the public suffix list from Mozilla: publicsuffix2 [4] and publicsuffixlist [5]. I prefer the second one, but I forgot why (I think it supports IDN better).
So what are some of these "public suffixes"?
I counted a total of 9,568 different public suffixes. They include expected suffixes like:
ac.uk
co.uk
gov.uk
ltd.uk
me.uk
But also some suffixes with three or more labels. For example:
no-ip.co.uk
website.yandexcloud.net
pro.typeform.com
Some use a wildcard for the third label:
*.transurl.be
*.transurl.eu
*.transurl.nl
*.webhare.dev
The idea is that companies who offer subdomains for individual customers, for example, to host blogs or other content, can identify these "subdomains" as controlled by an independent entity. Or, as Mozilla defines it, "mutually untrusting parties."
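To illustrate how a wildcard entry behaves, here is a rough Python sketch of rule matching in which "*" stands for exactly one label and the rule with the most labels wins. The rule set is a small sample taken from the examples above plus a few TLDs, the function names and host names are hypothetical, and exception rules (which the real list also supports) are left out.

```python
# Sample rules based on the examples above; not the full list.
SAMPLE_RULES = ["com", "uk", "co.uk", "no-ip.co.uk", "dev", "*.webhare.dev"]


def rule_matches(rule, labels):
    """True if rule matches the rightmost labels; '*' matches any one label."""
    rule_labels = rule.split(".")
    if len(rule_labels) > len(labels):
        return False
    tail = labels[-len(rule_labels):]
    return all(r == "*" or r == h for r, h in zip(rule_labels, tail))


def public_suffix(hostname, rules=SAMPLE_RULES):
    """Return the suffix given by the longest matching rule, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    matching = [r for r in rules if rule_matches(r, labels)]
    if not matching:
        return None
    # The rule with the most labels prevails.
    best = max(matching, key=lambda r: len(r.split(".")))
    return ".".join(labels[-len(best.split(".")):])


print(public_suffix("blog.customer1.webhare.dev"))
print(public_suffix("www.example.co.uk"))
print(public_suffix("home.no-ip.co.uk"))
```

In this sketch, "customer1.webhare.dev" is treated as the suffix for the first host name, so two customers under "webhare.dev" end up with different suffixes and cannot set cookies for each other.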
As stated above, the list is maintained by Mozilla. Domain owners must request the addition of their domain to the list. The list is, of course, ever-changing. There are currently about a hundred outstanding pull requests, and based on the GitHub history, updates are made at least weekly. If you use the list, make sure you keep it updated.
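If you refresh a local copy yourself, the list file is simple enough to parse directly. The sketch below assumes the published format as I understand it: one rule per line, comment lines starting with "//", and blank lines ignored. The embedded sample text and the function name `parse_rules` are made up for illustration; in practice, you would read a freshly downloaded copy of the list instead.

```python
# Made-up sample in the style of the list file; not real list content.
SAMPLE_LIST = """\
// ===BEGIN ICANN DOMAINS===
// sample comment
uk
co.uk

// ===BEGIN PRIVATE DOMAINS===
no-ip.co.uk
*.webhare.dev
"""


def parse_rules(text):
    """Return all rules in text, skipping comments and blank lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        # Only the part up to the first whitespace is treated as the rule.
        rules.append(line.split()[0])
    return rules


rules = parse_rules(SAMPLE_LIST)
wildcards = [r for r in rules if r.startswith("*.")]
print(len(rules), "rules,", len(wildcards), "with a wildcard")
```

Comparing the rule count between two downloads is a cheap way to notice that a local copy has gone stale.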
[1] https://datatracker.ietf.org/doc/html/rfc1591
[2] https://data.iana.org/TLD/tlds-alpha-by-domain.txt
[3] https://datatracker.ietf.org/doc/html/rfc6265#section-5.3
[4] https://pypi.org/project/publicsuffix2/
[5] https://pypi.org/project/publicsuffixlist/
---
Johannes B. Ullrich, Ph.D., Dean of Research, SANS.edu
https://jbu.me