One of my clients wanted us to scan their web servers for confidential information. This was to be done both from the Internet and from an internal intranet location (between cooperative but separate organizations). In particular, they were concerned about social security numbers and credit card numbers being exposed, and wanted us to double-check their servers. This was a large Class B network.
I wanted to do something like the Unix "grep", and search for regular expressions on their web pages. It would have been easier if I could log onto the server and get direct access to the file system, but that's not what the customer wanted.
I looked at a lot of utilities I could run on my Kali machine, and at first it didn't look hopeful. This is what I came up with, using Kali and shell scripts. I hope it helps others. And if someone finds a better way, please let me know.
As I had an entire network to scan, I started with nmap to discover hosts.
By chance, nmap 7.0 was released that day, and I was using it to map out the network I was testing. I downloaded the new version and noticed it had the http-grep script. This looked perfect, as it had social security numbers and credit card numbers built in! When I first tried it there was a bug. I tweeted about it, and within hours Daniel "bonsaiviking" Miller fixed it. He's just an awesome guy.
Anyhow, here is the command I used to check the web servers:
NETWORK="10.10.0.0/24"
nmap -vv -p T:80,443 "$NETWORK" --script http-grep \
    --script-args 'http-grep.builtins,http-grep.maxpagecount=-1,http-grep.maxdepth=-1'
By using 'http-grep.builtins', I could search for all of the types of confidential information http-grep understood. And by setting maxpagecount and maxdepth to -1, I turned off those limits. It outputs something like:
Nmap scan report for example.com (10.10.1.2)
Host is up, received syn-ack ttl 45 (0.047s latency).
Scanned at 2015-10-25 10:21:56 EST for 741s
PORT   STATE SERVICE REASON
80/tcp open  http    syn-ack ttl 45
| http-grep:
|   (1) http://example.com/help/data.htm:
|     (1) email:
|       + [email protected]
|     (2) phone:
|       + 555-1212
Excellent! Just what I needed. A simple grep of the output for 'ssn:' would show me any social security numbers. (I had tested it on another web server to make sure it worked.) It's always a good idea to not put too much faith in your tools.
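Finding hits in the saved output is then a one-liner. (A sketch; scan.txt is a hypothetical file name, e.g. output saved with nmap's -oN option.)

grep -A1 'ssn:' scan.txt    # show each ssn: line plus the matched value on the next line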
I first used nmap to identify the hosts, and then I iterated through each host and did a separate scan for each one, storing the outputs in separate files. So my script was a little different. I ended up with a file that contained the URLs of the top-level page of each server (e.g. http://www.example.com, https://blog.example.com, etc.). So the basic loop would be something like
while IFS= read -r url
do
    nmap [arguments....] "$url"
done < list_of_urls.txt
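Fleshed out, the per-host loop looked something like this (a sketch; the output file naming is my reconstruction, not the exact script I used):

#!/bin/sh
# Scan each host separately, saving each report to its own file
while IFS= read -r url
do
    host=$(echo "$url" | sed -e 's|^https\?://||' -e 's|/.*$||')   # URL -> bare hostname
    nmap -vv -p T:80,443 --script http-grep \
        --script-args 'http-grep.builtins,http-grep.maxpagecount=-1,http-grep.maxdepth=-1' \
        -oN "scan.$host.txt" "$host"
done < list_of_urls.txt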
Later on, I used wget instead of nmap, but I’m getting ahead of myself.
We had to perform all actions during a specific time window, so I wanted to be able to break this into smaller steps, allowing me to quit and restart. I first identified the hosts, and then scanned each one separately, in a loop. I also added a double-check to ensure that I didn't scan past 3 PM (as per our client's request), and that I didn't fill up the disk. So I added this check in the middle of my loop:
LIMIT=5                # always keep at least 5% of the disk free
HOUR=$(date "+%H")     # get the hour in 00..23 format
USED=$(df . | awk '/dev/ {print $5}' | tr -d '%')   # percent of the disk in use
AVAIL=$((100 - USED))                               # percent of the disk still free
if [ "$AVAIL" -lt "$LIMIT" ]
then
    echo "Out of space. I have $AVAIL% free and I need $LIMIT%"
    exit
fi
if [ "$HOUR" -ge 15 ]   # 3 PM is 15:00 in 24-hour time
then
    echo "After 3 PM - Abort"
    exit
fi
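Putting it together, the guard ran at the top of each pass through the loop. A minimal sketch; check_space_and_time is the test above wrapped in a hypothetical function, and scan_one_host stands in for the per-host nmap step:

while IFS= read -r url
do
    check_space_and_time     # exits if the disk is low or it's past 3 PM
    scan_one_host "$url"     # hypothetical wrapper around the per-host scan
done < list_of_urls.txt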
The second problem I had is that a lot of the files on the servers were PDF files, Excel spreadsheets, etc. Using http-grep would not help me there, as it doesn't know how to examine binary file formats. I therefore needed to mirror the servers.
I needed to find and download all of the files on a list of web servers. After searching for some tools to use, I decided on wget. To be honest, I wasn't happy with the choice, but it seemed to be the best option available.
I used wget's mirror (-m) option. I disabled certificate checking (some servers were using internal certificates on an internal network). I used the --continue option in case I had to redo the scan. I disabled the normal spider behavior of skipping the directories listed in the robots.txt file, and I also changed my user agent to "Mozilla":
wget -m --no-check-certificate --continue --convert-links -p --no-clobber -e robots=off -U Mozilla "$URL"
Some servers may not like this fast and furious download. You can slow it down by using these options: "--limit-rate=200k --random-wait --wait=2"
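Combined with the mirror options above, the politer version of the command looks like this (same flags, just merged):

wget -m --no-check-certificate --continue --convert-links -p --no-clobber \
     -e robots=off -U Mozilla \
     --limit-rate=200k --random-wait --wait=2 "$URL"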
I sent the output to a log file; let's call it wget.out. I watched the output as it ran, using
tail -f wget.out
I watched the output for errors, and noticed there was a noticeable delay in host name lookups. I did the name service lookup myself and added the hostname and IP address to my machine's /etc/hosts file, which made the mirroring faster. I was also counting the number of files being created, using
find . -type f | wc -l
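To watch the count grow while the mirror ran, a trivial loop is enough (a sketch):

while sleep 60
do
    find . -type f | wc -l    # print an updated file count every minute
done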
I noticed that an hour had passed, and only 10 new files had been downloaded. This was a problem. I also noticed that some of the files being downloaded had several consecutive "/" characters in the path name. That's not good.
I first grepped for the string '///', and then I spotted the problem. To make sure, I typed
grep /dir1/dir2/webpage.php wget.out | awk '{print $3}' | sort | uniq -c | sort -nr
     15 `webserver/dir1/dir2/webpage.php'
      2 http://webserver/dir1/dir2/webpage.php
      2 http://webserver//dir1/dir2/webpage.php
      2 http://webserver///dir1/dir2/webpage.php
      2 http://webserver////dir1/dir2/webpage.php
      2 http://webserver/////dir1/dir2/webpage.php
      2 http://webserver//////dir1/dir2/webpage.php
      2 http://webserver///////dir1/dir2/webpage.php
      2 http://webserver////////dir1/dir2/webpage.php
      2 http://webserver/////////dir1/dir2/webpage.php
      2 http://webserver//////////dir1/dir2/webpage.php
Not a good thing to see. Time for plan B.
I used a method I had tried before: wget's --spider option. This does not download the files; it just gets their names. As it turns out, this is better in many ways. It doesn't go "recursive" on you, and it also allows you to scan the results and obtain a list of URLs. You can edit this list and skip downloading certain files.
Method 2 was done using the following command:
wget --spider --no-check-certificate --continue --convert-links -r -p --no-clobber -e robots=off -U mozilla "$URL"
I sent the output to a file, but it contains filenames, error messages, and a lot of other information. To get just the URLs, I extracted them using
grep '^--' wget.out | grep -v '(try:' | awk '{ print $3 }' | \
    grep -v '\.\(png\|gif\|jpg\)$' | sed 's:?.*$::' | \
    grep -v '/$' | sort | uniq > urls.out
This parses the wget output file. It removes all *.png, *.gif, and *.jpg files. It also strips any parameters from a URL (i.e. index.html?parm=1&parm=2&parm3=3 becomes index.html), and removes any URL that ends with a "/". I then eliminate duplicate URLs using sort and uniq.
Now I have a list of URLs. wget has a way to download multiple files, using the -i option:
wget -i urls.out --no-check-certificate --continue \
     --convert-links -p --no-clobber -e robots=off -U Mozilla
A scan of the network revealed a search engine that searched files in its domain. I wanted to make sure that I had included these files in the audit.
I tried to search for meta-characters like '.', but the web server complained. Instead, I searched for 'e', the most common letter, which gave me the largest number of hits: 20 pages' worth. I examined the URLs for page 1, page 2, etc., and noticed that they were identical except for the value "jump=10", "jump=20", etc. I wrote a script that would extract all of the URLs the search engine reported:
#!/bin/sh
for i in $(seq 0 10 200)
do
    URL="http://search.example.com/main.html?query=e&jump=$i"
    wget --force-html -r -l2 "$URL" 2>&1 | grep '^--' | \
        grep -v '(try:' | awk '{ print $3 }' | \
        grep -v '\.\(png\|gif\|jpg\)$' | sed 's:?.*$::'
done
It's ugly, and calls extra processes. I could write a sed or awk script that replaces five processes with one, but the script would be more complicated and harder for my readers to understand. Also, this was a throw-away script; it took me 30 seconds to write, and the limiting factor was network bandwidth. There is always a proper balance between readability, maintainability, time to develop, and time to execute. Is this code consuming excessive CPU cycles? No. Did it allow me to get something working quickly so I could spend time on something more productive? Yes.
Earlier I mentioned that I wasn't happy with wget. That's because I was not getting consistent results. I ended up repeating the scan of the same server from a different network, and I got different URLs; the second scan found URLs that the first one missed. I did the best I could to get as many files as possible, and I ended up writing some scripts to keep track of the files I had scanned before. But that's another post.
Now that I had a clone of several websites, I had to scan them for sensitive information. But first I had to convert some binary files into ASCII.
I installed gnumeric and used its ssconvert program to convert the Excel files into text files:
find . -name '*.xls' -o -name '*.xlsx' | \
    while IFS= read -r file; do ssconvert -S "$file" "$file.%s.csv"; done
I used the following script to convert Word files into ASCII:
find . -name '*.do[ct]x' | \
    while IFS= read -r file; do
        unzip -p "$file" word/document.xml | \
            sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' > "$file.txt"
    done
Here are some of the potential problems I expected to face, and what I did about them.
This process is not something that can be automated easily. Sometimes when I converted PDF files into text files, the process either aborted or went into a CPU frenzy, and I had to kill the conversion.
Also, there are several different ways to convert a PDF file into text. Because I wanted to minimize the risk of missing some information, I used multiple programs to convert each PDF file. If one program broke, the other might catch something.
The tools I used included pdf2txt (from the Python pdfminer package) and pdftotext (from the poppler-utils package); the script below calls both.
Other useful programs were exiftool, peepdf, and Didier Stevens's pdf-tools. I also used pdfgrep, but I had to download the latest source and compile it with the Perl-Compatible Regular Expression (PCRE) library.
I wrote a script that takes each of the PDF files and converts it into text. I decided to use the following convention: for an input file like file.pdf, the pdf2txt output is named file.pdf.txt and the pdftotext output is named file.pdf.text.
As the conversion of each file takes time, I used a mechanism to check whether the output file already exists; if it does, I can skip that conversion.
I also created some additional file naming conventions: each conversion's error output went into a *.err file, and its elapsed time into a *.time file.
This is useful because if any of the files generate an error, I can use ‘ls -s *.err|sort -nr’ to identify both the program and the input file that had the problem.
The *.time files can be used to see how long each conversion took. The first time I tried this, my script ran all night and did not complete; I didn't know if one of the programs was stuck in an infinite loop or not. These files let me keep track of that.
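For example, since the script below invokes /usr/bin/time with -o, each *.time file starts with a line containing the elapsed time, so a quick skim is easy (a sketch):

head -n 1 *.time    # prints each file name followed by its timing line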
I used three helper functions in this script. The "X" function lets me easily change the script to show me what it would do, without doing anything. It also made it easier to capture STDERR and the timing information. I called the script ConvertPDF:
#!/bin/bash
# ConvertPDF
# Usage:
#     ConvertPDF filename

FNAME="${1?'Missing filename'}"
TNAME="${FNAME}.txt"
TXNAME="${FNAME}.text"

# Debug command - do I echo it, execute it, or both?
X() {
#   echo "$@" >&2
    /usr/bin/time -o "$OUT.time" "$@" 2> "$OUT.err"
}

PDF2TXT() {
    IN="$1"
    OUT="$2"
    if [ ! -f "$OUT" ]
    then
        X pdf2txt -o "$OUT" "$IN"
    fi
}

PDFTOTEXT() {
    IN="$1"
    OUT="$2"
    if [ ! -f "$OUT" ]
    then
        X pdftotext "$IN" "$OUT"
    fi
}

if [ ! -f "$FNAME" ]
then
    echo missing input file "$FNAME"
    exit 1
fi

echo "$FNAME" >&2    # Output filename to STDERR

PDF2TXT "$FNAME" "$TNAME"
PDFTOTEXT "$FNAME" "$TXNAME"
Once this script is created, I called it using
find . -name '*.[pP][dD][fF]' | \
    while IFS= read -r file; do ConvertPDF "$file"; done
Please note that this script can be rerun safely: if the output files already exist, it skips that conversion.
As I've often done in the past, I used the handy function above called "X" (for eXecute). It just executes a command, but it captures any error message, and it also captures the elapsed time. By moving the "#" comment character between the two lines inside X, I can make the script just echo each command instead of executing it. This makes it easy to debug without anything actually running. This is Very Useful.
Some of the file conversions took hours, and I killed those processes. Because I captured the error messages, I could search them to identify bad conversions, delete the output files, and try again. And again.
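The cleanup step looked something like this sketch (assuming the failed conversions left the word "error" somewhere in their *.err files; the exact messages varied from program to program):

grep -l -i 'error' *.err | while IFS= read -r errfile
do
    rm -f "${errfile%.err}"   # delete the output file so the next run retries it
done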
Because some of the PDF files were so large, and the process wasn't refined, I wanted to be more productive and work on the smallest files first, where I defined smallest as "fewest number of pages". Finding scripting bugs quickly was desirable.
I used exiftool to examine the PDF metadata. A snippet of the output of “exiftool file.pdf” might contain:
ExifTool Version Number         : 9.74
File Name                       : file.pdf
..... [snip] .....
Producer                        : Adobe PDF Library 9.0
Page Layout                     : OneColumn
Page Count                      : 84
As you can see, the page count is available in the meta-data. We can extract this and use it.
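If you want just the number, exiftool can print a single tag by name; the repeated -s flags strip the tag name and padding from the output:

exiftool -s -s -s -PageCount file.pdf     # prints, e.g., 84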
I sorted the PDF files by page count using
for i in *.pdf
do
    NumPages=$(exiftool "$i" | sed -n '/Page Count/ s/Page Count *: *//p')
    printf "%d %s\n" "$NumPages" "$i"
done | sort -n | awk '{print $2}' > pdfSmallestFirst
I used sed to search for 'Page Count' and print only the number after the colon. I then output two columns of information: page count and filename. I sorted numerically on the first column (the number of pages), and then printed out the filenames only. I could use that file as input to the next steps.
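Feeding that list into the conversion step is then straightforward (a sketch; it assumes the ConvertPDF script shown earlier is in the current directory):

while IFS= read -r file
do
    ./ConvertPDF "$file"     # the smallest PDFs get converted first
done < pdfSmallestFirst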
If you have been following along, at this point I have directories that contain the original HTML pages, CSV versions of the spreadsheets, and text versions of the Word documents and PDF files.
So it's a simple matter of using grep to find the sensitive files. (My tutorial on regular expressions is here if you have questions.) Here is what I used to search the files:
find dir1 dir2... -type f -print0 | \
    xargs -0 grep -i -P '\b\d\d\d-\d\d-\d\d\d\d\b|\b\d\d\d\d-\d\d\d\d-\d\d\d\d-\d\d\d\d\b|\b\d\d\d\d-\d\d\d\d\d\d-\d\d\d\d\d\b|account number|account #'
The regular expressions I used are Perl-compatible. See the pcre(3) and pcrepattern(3) manual pages. The special characters are:
\d – a digit
\b – a word boundary: the transition between a word character and a non-word character, the end of line, the beginning of line, etc. This prevents 1111-11-1111 from matching as an SSN.
This matches the following patterns:
\d\d\d-\d\d-\d\d\d\d – SSN
\d\d\d\d-\d\d\d\d-\d\d\d\d-\d\d\d\d – Credit card number
\d\d\d\d-\d\d\d\d\d\d-\d\d\d\d\d – AMEX credit card
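A quick way to convince yourself the \b anchors behave as expected (a sketch):

printf '111-22-3333\n1111-11-1111\n' | grep -P '\b\d\d\d-\d\d-\d\d\d\d\b'
# prints only 111-22-3333; the \b keeps the SSN pattern from matching
# inside the longer 1111-11-1111 string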
There were some more things I did, but this is a summary. It should be enough to allow someone to replicate the task.

Have fun!