Hi, amazing fellow hackers! Today I'm covering an interesting topic: web content discovery. It is useful in bug bounty hunting and is one of the most important parts of recon.
Content can come in different forms, such as images, files, videos, and so on.
There are three main ways to discover content on a website:
Manually, Automated, and OSINT (Open-Source Intelligence) methods.
What is the Content Discovery method that begins with M?
Ans: Manually
What is the Content Discovery method that begins with A?
Ans: Automated
What is the Content Discovery method that begins with O?
Ans: OSINT
Let's start with manual discovery. There are several places to look, and the first is robots.txt. The robots.txt file tells search engine crawlers which pages they should and should not crawl, which can reveal paths the site owner would rather keep hidden.
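To check it yourself, simply request the file directly (MACHINE_IP is the target machine from the room):
user@machine$ curl http://MACHINE_IP/robots.txt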
What is the directory in the robots.txt that isn’t allowed to be viewed by web crawlers?
Ans: /staff-portal
Next, we move on to manual discovery through sitemap.xml, which lists the pages the website owner wants search engines to index, sometimes including areas that are hard to reach by simply browsing.
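As with robots.txt, you can fetch the sitemap directly:
user@machine$ curl http://MACHINE_IP/sitemap.xml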
What is the path of the secret area that can be found in the sitemap.xml file?
Ans: /s3cr3t-area
The next step is manual discovery through HTTP headers. HTTP headers sometimes reveal useful information such as the web server software and version, the programming language, or other technologies in use, all of which can help when hunting for vulnerabilities in a web application.
To view them, run curl with the verbose flag:
user@machine$ curl http://MACHINE_IP -v
What is the flag value from the X-FLAG header?
Ans: THM{HEADER_FLAG}
The next one is manual discovery through the framework stack. Once you identify which framework a site is built on, its documentation often reveals version details, default paths, and an administration portal worth checking.
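One quick way to spot the framework is to look through the page source for clues such as HTML comments, a generator meta tag, or a "powered by" footer. The strings below are only illustrative and will vary from site to site:
user@machine$ curl -s http://MACHINE_IP/ | grep -iE "generator|powered by"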
What is the flag from the framework’s administration portal?
Ans: THM{CHANGE_DEFAULT_CREDENTIALS}
Follow the instructions in the documentation and you will get the flag.
Next up is OSINT, starting with Google hacking / dorks.
Google dorks use advanced search operators to pull very specific results out of Google. With them you can find exposed passwords, backup files, and other content a website never intended to publish.
Results can be narrowed down with operators such as site:, inurl:, filetype:, and intitle:, as shown in the examples below.
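A few illustrative dorks (example.com is just a placeholder domain):
site:example.com admin
inurl:admin site:example.com
filetype:pdf site:example.com
intitle:"index of" site:example.com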
What Google dork operator can be used to only show results from a particular site?
Ans: site:
Wappalyzer is an online tool and browser extension used to identify the technologies a website is running, and often their version numbers.
What online tool can be used to identify what technologies a website is running?
Ans: Wappalyzer
The Wayback Machine is a historical archive of websites; it lets you view old snapshots of a page and find content that has since been removed or changed.
What is the website address for the Wayback Machine?
Ans: https://archive.org/web
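You can browse every snapshot the archive holds for a given site with a URL like this (example.com is a placeholder):
https://web.archive.org/web/*/example.com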
GitHub is another important place to look: public repositories can expose sensitive files such as configuration files, passwords, and authentication keys. Git itself is a version control system that keeps track of changes to the files in a project.
What is Git?
Ans: version control system
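A simple starting point is GitHub's own search bar, combining a target domain with sensitive keywords (example.com is a placeholder):
"example.com" password
"example.com" api_key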
Next is content discovery by OSINT: S3 buckets. S3 buckets are a storage service provided by AWS; files and buckets can be set to public or private, and a misconfigured bucket can expose files that were meant to stay private.
The URL format of S3 buckets is http(s)://{name}.s3.amazonaws.com
One simple way to automate the guesswork is to combine the company name with common suffixes such as {name}-assets, {name}-www, {name}-public, {name}-private, etc., and check whether each bucket exists, as in the sketch below.
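A minimal sketch of checking a guessed bucket name with curl (example-assets is a made-up name): a bucket that doesn't exist typically returns a 404, a private one a 403, and a publicly listable one a 200.
user@machine$ curl -I https://example-assets.s3.amazonaws.com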
What URL format do Amazon S3 buckets end in?
Ans: .s3.amazonaws.com
Finally, we get to automated discovery, which is simple, easy, and far less time-consuming than manual discovery. For this we can use tools such as ffuf, dirb, and gobuster.
Ex for ffuf
user@machine$ ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ
Ex for dirb
user@machine$ dirb http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
Ex for gobuster
user@machine$ gobuster dir --url http://MACHINE_IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
What is the name of the directory beginning “/mo….” that was discovered?
Ans: /monthly
What is the name of the log file that was discovered?
Ans: /development.log