Finding randomly generated URLs

Mark Baggett wrote an excellent article on the problem I'm facing while trying to find DGAs.

Problem:

Most normal user traffic communicates via a hostname and not an IP address, so looking at traffic that communicates directly by IP with no associated DNS request is a good thing to do. Some attackers use DNS names for their communications, though. There is also malware, such as Skybot and the Styx exploit kit, that uses algorithmically chosen host names rather than IP addresses for its command and control channels. This malware uses what has been called a DGA, or Domain Generation Algorithm, to create random-looking host names for its TLS command and control channel or to digitally sign its SSL certificates. These do not look like normal host names. A human being can easily pick them out of logs and traffic, but doing the same in an automated process turns out to be somewhat challenging. "Natural Language Processing" or measuring randomness doesn't seem to work very well. Here is a video that illustrates the problem and one possible approach to solving it.

One way you might try to solve this is with a tool called ent. "ent" is a great Linux tool for measuring entropy in files. Consider this:

[~]$ head -c 10000000 /dev/urandom | ent

Entropy = 7.999982 bits per byte. <-- 8 = random

[~]$ python -c "print('A'*1000000)" | ent
Entropy = 0.000021 bits per byte. <-- 0 = not random

So 8 is highly random and 0 is not random at all. Now let's look at some host names.

[~]$ echo "google" | ent
Entropy = 2.235926 bits per byte.
[~]$ echo "clearing-house" | ent
Entropy = 3.773557 bits per byte. <- Valid hosts are in the 2 to 4 range

Google scores 2.23 and clearing-house scores 3.77. So it appears as though legitimate host names will be in the 2 to 4 range. Now let's try some host names that we know are associated with malware that uses random host names.

[~]$ echo "e6nbbzucq2zrhzqzf" | ent
Entropy = 3.503258 bits per byte.
[~]$ echo "sdfe3454hhdf" | ent
Entropy = 3.085055 bits per byte. <- Malicious hosts from Skybot and Styx malware are in the same range as valid hosts

Another try with my first name:

freq$ echo "Marcus" | ent
Entropy = 2.807355 bits per byte.

freq$ echo "sucraM" | ent
Entropy = 2.807355 bits per byte.

Although "sucraM" looks much more random to a human, the entropy check (of course) gives the same result: it only counts how often each character occurs and ignores the order.
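
To see why the order cannot matter, here is a minimal Python sketch of a byte-level Shannon entropy calculation. It is my own illustration, not the source of ent, and the function name shannon_entropy is made up for this example. The trailing newline is included because echo appends one.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)  # how often each byte value occurs
    total = len(data)
    # Only the counts enter the formula, never the positions.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# echo appends a newline, so it is included to mirror the commands above.
forward = shannon_entropy(b"Marcus\n")
backward = shannon_entropy(b"sucraM\n")
print(f"{forward:.6f} {backward:.6f}")
```

Both strings consist of the same seven distinct bytes, so both values come out as log2(7), about 2.807355, which matches what ent printed.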

That's no good. Known malicious host names are also in the 2 to 4 range. They score just about the same as normal host names. We need a different approach to this problem.

I also tried limiting the entropy check to just the domain or the page part of the URL, but that did not help much and never gave me a significantly better result.

But Mark has an excellent idea:

Normal readable English has some pairs of characters that appear more frequently than others. "TH", "QU" and "ER" appear very frequently, but other pairs like "WZ" appear very rarely. Specifically, there is approximately a 40% chance that a T will be followed by an H. There is approximately a 97% chance that a Q will be followed by the letter U. There is approximately a 19% chance that E is followed by R. With regard to unlikely pairs, there is approximately a 0.004% chance that W will be followed by a Z. So here is the idea: let's analyze a bunch of text and figure out what normal looks like. Then measure the host names against the tables. I'm making this script and a Windows executable version of this tool available to you to try it out. Let me know how it works. Here is a look at how to use the tool.

Step 1) You need a frequency table. I shared two of them in my github. If you want to use them, you can download them and skip to step 2.
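
Before the walkthrough, the pair-frequency idea itself can be sketched in a few lines of Python. This is only my illustration of the concept, not the code of freq.py. The names build_pair_table and follow_probability are hypothetical, and skipping pairs that contain non-letters is an assumption I made to keep the example small.

```python
from collections import Counter, defaultdict


def build_pair_table(text):
    """Count, for each letter, which letters follow it and how often."""
    table = defaultdict(Counter)
    lowered = text.lower()  # case-insensitive, like the tool's default
    for first, second in zip(lowered, lowered[1:]):
        # Assumption for this sketch: only letter-letter pairs are counted.
        if first.isalpha() and second.isalpha():
            table[first][second] += 1
    return table


def follow_probability(table, first, second):
    """Estimate the chance that `first` is followed by `second`."""
    followers = table.get(first)
    if not followers:
        return 0.0  # `first` was never seen as the start of a pair
    return followers[second] / sum(followers.values())


# Toy sample text; real tables need far more input to be meaningful.
sample = "the queen thought the quiet theory through"
table = build_pair_table(sample)
print(follow_probability(table, "t", "h"))
print(follow_probability(table, "w", "z"))
```

With such a tiny sample the estimates are exaggerated. The percentages quoted above only emerge once the table has seen a large amount of text.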

1a) Create the table: I'm creating a table called custom.freq. Create a table with the --create option.

C:\freq>freq.exe --create custom.freq

1b) You can optionally turn ON case sensitivity if you want the frequency table to count uppercase letters and lowercase letters separately. Without this option the tool will convert everything to lowercase before counting character pairs. Toggle case sensitivity with -t or --toggle_case_sensitivity

C:\freq>freq.exe -t custom.freq

1c) Next fill the frequency table with normal text. You might load it with known legitimate host names like the Alexa top 1 million most commonly accessed websites. (http://s3.amazonaws.com/alexa-static/top-1m.csv.zip) I will just load it up with famous works of literature. The --normalfile argument is used to load the table with a text file containing normal text.

C:\freq>for %i in (txtdocs\*.*) do freq.exe --normalfile %i custom.freq
C:\freq>freq.exe --normalfile txtdocs\center_earth custom.freq
C:\freq>freq.exe --normalfile txtdocs\defoe-robinson-103.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\dracula.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\freck10.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\invisman.txt custom.freq
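
If you would rather load host names than literature, as suggested in step 1c, the list first has to be turned into plain text. Here is a minimal sketch, assuming the CSV has rows of the form rank,domain. The function name hostnames_from_csv and the sample rows are made up for illustration, and keeping only the left-most label is a simplification that will not suit every domain.

```python
import csv
import io


def hostnames_from_csv(csv_text):
    """Yield the left-most label of each domain in rank,domain rows."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed rows instead of failing on them.
        if len(row) >= 2 and row[1].strip():
            yield row[1].strip().lower().split(".")[0]


# Hypothetical sample rows, not real ranking data.
sample = "1,google.com\n2,youtube.com\n3,clearing-house.org\n"
print("\n".join(hostnames_from_csv(sample)))
```

The printed lines could be written to a text file and then passed to --normalfile like any other document.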

Step 2) Measure badness!

Once the frequency table is filled with data you can start to measure strings to see how probable they are according to our frequency tables. The --measure option is used for this purpose.

C:\freq>freq.exe --measure "google" custom.freq
6.59612840648
C:\freq>freq.exe --measure "clearing-house" custom.freq
12.1836883765

So normal host names have a probability above 5 (at least these two and most others do). We will consider anything above 5 to be good for our tests. Now let's feed it the host names we know are associated with malware.

C:\freq>freq.exe --measure "asdfl213u1" custom.freq
3.15113061843
C:\freq>freq.exe --measure "po24sf92cxlk" custom.freq
2.44994506765

Let me have a go with my sample from above:

python freq.py --measure Marcus english_mixedcase.freq
12.5164459036
python freq.py --measure sucraM english_mixedcase.freq
3.5448362274

Y E S . . . ! ! !

Our malicious hosts are less than 5. 5 seems to be a pretty good benchmark. In my testing it seems to work pretty well for picking out these abnormal host names. But it isn't perfect. Nothing is. One problem is that very small host names and acronyms that are not in the source files you use to build your frequency tables will be below 5. For example, "fbi" and "cia" both come up below 5 when I just use classic literature to build my frequency tables. But I am not limited to classic literature. That leads us to step 3.
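
To make the scoring and the benchmark of 5 more concrete, here is a rough Python sketch. The formula is an assumption of mine (the average chance of each character pair, expressed as a percentage) and not necessarily what freq.py computes, so the numbers will not match the output above. The names build_pair_table, score and looks_generated are hypothetical.

```python
from collections import Counter, defaultdict

THRESHOLD = 5.0  # benchmark taken from the text; tune it for your own data


def build_pair_table(lines):
    """Count character pairs across an iterable of normal strings."""
    table = defaultdict(Counter)
    for line in lines:
        lowered = line.lower()
        for first, second in zip(lowered, lowered[1:]):
            table[first][second] += 1
    return table


def score(table, name):
    """Average percentage chance of each pair in name (assumed formula)."""
    lowered = name.lower()
    pairs = list(zip(lowered, lowered[1:]))
    if not pairs:
        return 0.0  # names shorter than two characters have no pairs
    total = 0.0
    for first, second in pairs:
        followers = table.get(first)
        if followers:  # unseen first characters contribute nothing
            total += followers[second] / sum(followers.values())
    return 100.0 * total / len(pairs)


def looks_generated(table, name, threshold=THRESHOLD):
    """Flag names whose score falls below the threshold."""
    return score(table, name) < threshold


# Toy training data; a real table needs a large corpus.
table = build_pair_table(["google"])
print(score(table, "google"), looks_generated(table, "zxqv"))
```

A short name has only a few pairs, so one unseen pair pulls the average down sharply. That is the same effect that pushes "fbi" and "cia" below 5.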

Step 3) Tune for your organization.

The real power of frequency tables comes when you tune them to match normal traffic for your network. The program has two options for this purpose: --normal and --odd. --normal can be given a normal string and it will update the frequency table with that string. Both --normal and --odd can be used with the --weight option to control how much influence the given string has on the probabilities in the frequency table. Its effectiveness is demonstrated by the accompanying YouTube video. Note that marking "random" host names as --odd is not a good strategy. It simply injects noise into the frequency table. Like everything else in security, identifying all the bad in the world is a losing proposition. Instead, focus on learning normal and identifying anomalies. So passing --normal "cia" --weight 10000 adds 10000 counts of the pair "ci" and the pair "ia" to the frequency table and increases the probability of "cia" to some number above 5.

C:\freq>freq.exe --normal "cia" --weight 10000 custom.freq
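
What the weighted update does to the table, as described above, can be sketched like this. It is an illustration of the described behaviour only, not the tool's code. The function name add_normal and the in-memory table layout are assumptions.

```python
from collections import Counter, defaultdict


def add_normal(table, text, weight=1):
    """Add `weight` counts for every adjacent character pair in text."""
    lowered = text.lower()
    for first, second in zip(lowered, lowered[1:]):
        table[first][second] += weight


table = defaultdict(Counter)
add_normal(table, "cia", weight=10000)
# The pairs "ci" and "ia" now each carry 10000 counts.
print(table["c"]["i"], table["i"]["a"])
```

Because the counts for "ci" and "ia" now dominate whatever else follows "c" and "i", any score based on these counts rises for "cia".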

The source code and a Windows Executable version of this program can be downloaded from here: https://github.com/MarkBaggett/MarkBaggett/tree/master/freq

Find more background in this writing.