Entropy Story-time: From Claude Shannon to Equifax

Mix Two Colors / Pietro Jeng

There's a piece floating around that does a great, succinct job of summarizing Claude Shannon's contributions to our modern understanding of information. If you haven't read The bit bomb on Aeon, head over there. It'll make your brain happy with things like this:

"Shannon – mathematician, American, jazz fanatic, juggling enthusiast – is the founder of information theory, and the architect of our digital world. It was Shannon’s paper ‘A Mathematical Theory of Communication’ (1948) that introduced the bit, an objective measure of how much information a message contains."

The article digs deep into how easy it is to predict things - especially language. It ends up focusing on the power of pattern detection in being able to compress information:

"Shannon expanded this point by turning to a pulpy Raymond Chandler detective story […] He flipped to a random passage … then read out letter by letter to his wife, Betty. Her role was to guess each subsequent letter […] Betty’s job grew progressively easier as context accumulated […] a phrase beginning ‘a small oblong reading lamp on the’ is very likely to be followed by one of two letters: D, or Betty’s first guess, T (presumably for ‘table’). In a zero-redundancy language using our alphabet, Betty would have had only a 1-in-26 chance of guessing correctly; in our language, by contrast, her odds were closer to 1-in-2. "

Humans Vs. Randomness

Written English, overall, is up to 75 percent redundant (try this for yourself! Download a huge book in plain text from Project Gutenberg and zip it up to see compression in action).
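If you don't have a book handy, here is a minimal sketch of the same experiment in Python. The sample text is a short stand-in I made up for illustration (a full book will compress much better), and zlib is just one convenient compressor, not necessarily what your zip tool uses.

```python
import os
import zlib

# A short stand-in sample; a full Project Gutenberg book will compress better.
english = (
    "The article digs deep into how easy it is to predict things, especially "
    "language. It ends up focusing on the power of pattern detection in being "
    "able to compress information. Written English, overall, is up to 75 "
    "percent redundant, which is exactly what a compressor takes advantage of."
).encode("utf-8")

# Random bytes of the same length have no patterns to exploit.
noise = os.urandom(len(english))


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundancy)."""
    return len(zlib.compress(data, level=9)) / len(data)


print(f"English text: {compressed_ratio(english):.2f}")
print(f"Random bytes: {compressed_ratio(noise):.2f}")
```

The English sample should shrink, while the random bytes should come out no smaller (and usually a touch larger) than they went in.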

How hard something is to guess is called entropy, and Shannon's work on it is central to cryptography and secure passwords. You can actually calculate the (maximum) entropy of a collection of letters using his work -- these Wikipedia articles are good explainers: Entropy and Password Strength and Information Theory and Entropy.

DataGenetics put out a post on the entropy in your average PIN. Their table is particularly fun to think through: a totally random digit (0-9) has about 3.3 "bits" of entropy (specifically, it would take 10 guesses, or 2^3.3, to guarantee you guess the right digit). You can simply add the bits together as you add more digits, so your average 4-digit PIN is 3.32+3.32+3.32+3.32, or about 13.3 bits, which works out to the 10,000 guesses you'd expect for four digits.
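The arithmetic is easy to check yourself. This is a rough sketch, assuming every digit is picked uniformly at random (which, spoiler, humans don't do); the function name `entropy_bits` is my own. Carrying full precision instead of rounding each digit to 3.3 bits gets you back to exactly 10,000 possible PINs.

```python
import math


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Maximum entropy, in bits, of `length` symbols drawn uniformly at random."""
    return length * math.log2(alphabet_size)


one_digit = entropy_bits(10, 1)   # a single random digit, 0-9
four_digit = entropy_bits(10, 4)  # a 4-digit PIN

print(f"one digit:   {one_digit:.2f} bits")
print(f"4-digit PIN: {four_digit:.2f} bits, {round(2 ** four_digit)} possible PINs")
```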

A short aside on "key length"

As a rough comparison, "128 bit" encryption refers to this same number: guaranteeing that you guess the right decryption key takes 2^128 guesses, which is about 3.402823669×10³⁸, which is... a lot of guesses.

It's worth noting that this number only tracks "symmetric" encryption - things like the Advanced Encryption Standard, more commonly known as AES. Symmetric encryption uses the same key for encrypting and decrypting the data.

You might be more familiar on a daily basis with asymmetric encryption, which is the key (hah) to how everything from the green lock showing secure websites to super-secure PGP-based emails works. Asymmetric encryption is tracked on a different scale (most asymmetric crypto systems use the public/private system to create a secure symmetric system underneath). So your 2048 bit PGP key is strong, but actually closer to 112 bits of entropy by NIST's estimate, because … well, this StackExchange discussion is a somewhat readable explanation.

A quick history lesson

Nevertheless, you can see why that bit length matters so much for how hard a key is to guess. The precursor to AES, DES ("Data Encryption Standard" -- thankfully the creativity of the formal names does not impact their security), used 56-bit keys. In 1998, EFF demonstrated that affordable, purpose-built hardware could crack DES in a matter of days.
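To get a feel for the gap between 56 and 128 bits, here is a back-of-the-envelope sketch. The guess rate is a made-up round number for illustration, not a measurement of any real hardware, and the function name is my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Hypothetical attacker speed, chosen only to make the comparison concrete.
GUESSES_PER_SECOND = 1e12


def years_to_exhaust(key_bits: int, guesses_per_second: float) -> float:
    """Years needed to try every key of the given length (the worst case)."""
    return (2 ** key_bits) / guesses_per_second / SECONDS_PER_YEAR


for bits in (56, 128):
    years = years_to_exhaust(bits, GUESSES_PER_SECOND)
    print(f"{bits:>3}-bit key: {years:.3g} years to try every key")
```

At that assumed rate the 56-bit keyspace falls in under a year, while the 128-bit one takes unimaginably longer than that. Every extra bit doubles the work, so 72 extra bits is 2^72 times more guessing.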

Bonus fun story - the process to select the algorithm behind AES is a worthwhile read, especially in the context of the ongoing Crypto Wars at the time, and the classification of cryptography as an export-controlled munition.

Oh, the 90s.

Computers (and Countries) Vs. Randomness

Despite our apparent love of chaos, humans are *horrible* at randomness, and Shannon's information theory work around predictable patterns really shreds this wonderful difficulty of guessing things. As an(other) aside, computers are also not great at randomness, so there's a lot of work that goes into making good "pseudo random" number generators in software, which draw on the timings of keyboard presses, temperature, and mouse movement to "seed" their randomness. Some really interesting reading can be found on Wikipedia on Random Number Generators, including international intrigue and even geekier discussions.
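Here's a small sketch of why the seed matters so much, using Python's standard library. The seed value 1234 is arbitrary. Two generators started from the same seed produce the same "random" digits, which is why anything security-related should pull from the operating system's entropy pool instead (the `secrets` module does this).

```python
import random
import secrets

# Two pseudo-random generators given the same seed...
first = random.Random(1234)
second = random.Random(1234)

seq_first = [first.randint(0, 9) for _ in range(8)]
seq_second = [second.randint(0, 9) for _ in range(8)]

# ...produce exactly the same sequence: know the seed, know the output.
print(seq_first)
print(seq_second)
print("identical:", seq_first == seq_second)

# For anything secret, draw from the OS's entropy source instead.
pin = "".join(str(secrets.randbelow(10)) for _ in range(4))
print("random PIN:", pin)
```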

Somewhere between the international intrigue (remember that time the NSA backdoored hardware random number generators?) and a desire to maintain a domestic ability to do the same by blocking powerful random number generators is the reason many countries have policies blocking the import of advanced encryption, and even of laptops with specific hardware random number generators -- for example, China, Russia, Belarus, and Kazakhstan all block computers with TPM chips.

I thought this was going to be about the data breach?

I know. You came here for the clickbait on Equifax. Stick with me, I'm trying to make this a teachable moment.

First -- back to PINs. Humans just don't select random numbers; we select patterns like 1234, 1111, 2222, or 2580 (a line straight down the keypad). You have a 1 in 5 chance of guessing a PIN by trying just five common ones: 1234 (10%), 1111 (6%), 0000 (2%), 1212 (1%), and 7777 (1%). With 426 guesses from the most common PINs list, you've crossed the 50% mark.
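A quick sketch of how those five guesses add up, compared with what five guesses would buy you if PINs really were random. The percentages are the rounded ones quoted above, kept as whole numbers so the sums stay exact.

```python
from itertools import accumulate

# (PIN, share of users in percent), rounded figures as quoted above.
common_pins = [("1234", 10), ("1111", 6), ("0000", 2), ("1212", 1), ("7777", 1)]

# Running total of the chance of a hit after each successive guess.
cumulative = list(accumulate(share for _, share in common_pins))

for (pin, _), total in zip(common_pins, cumulative):
    print(f"after guessing {pin}: {total}% chance of a hit")

# If all 10,000 PINs were equally likely, five guesses would cover far less.
uniform_percent = 100 * len(common_pins) / 10_000
print(f"five guesses against truly random PINs: {uniform_percent}%")
```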

Beyond that, you can still score some wins by limiting your guesses to specific patterns. For example, restricting your guesses to numbers that fit an MMYY pattern, to catch birthdays or anniversaries from, say, the last century, can vastly improve how well you're able to guess, even if you only know the target's rough age.
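To see how much the MMYY trick shrinks the search, here is a minimal sketch. The "rough age" window (two-digit years 75 through 85) is a made-up example, not data about anyone.

```python
# Every 4-digit value that reads as a valid month followed by a two-digit year.
mmyy_candidates = [
    f"{month:02d}{year:02d}" for month in range(1, 13) for year in range(100)
]

# Narrow further if you know roughly when the target was born.
# Hypothetical window: two-digit years 75 through 85.
narrowed = [pin for pin in mmyy_candidates if 75 <= int(pin[2:]) <= 85]

print(f"all 4-digit PINs:      {10 ** 4}")
print(f"MMYY-shaped PINs:      {len(mmyy_candidates)}")
print(f"MMYY within the window: {len(narrowed)}")
```

That takes the list from 10,000 candidates to 1,200, and to 132 with the assumed age window.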

And don't think that passwords are any better. Even when not constrained to 4 digits, humans use predictable passwords and the same stupid pattern tricks. There is an annual release of the worst passwords of the year based on whatever big breach leaked passwords. 123456 and password fight for the top spot, and 25 guesses off of top password lists will score you 1 in 10 passwords.

While that number is lower than for PINs, the pattern game in passwords is super fun (add ! at the end, change E to 3!) -- these are easy to program into a guessing game for a computer. Indeed, people who study passwords create specific rulesets around these specific patterns -- check out these rulesets based on real-world passwords, including automating substituting letters with numbers, swapping out shift-characters (so 1999 would become "!(((" ) and adding dates basically everywhere.
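Here's a toy version of that kind of ruleset, just to show how little code it takes. The rules, the years, and the function name are my own illustrative picks, not the actual rulesets linked above, and the shift map assumes a US keyboard layout.

```python
# Digits mapped to their shifted characters on a US keyboard layout.
SHIFT_MAP = str.maketrans("1234567890", "!@#$%^&*()")

# A few classic letter-to-number swaps.
LEET_MAP = str.maketrans({"e": "3", "a": "4", "o": "0", "i": "1"})

# Illustrative years to append; real rulesets try many more.
YEARS = ("1999", "2017")


def mangle(word: str) -> set[str]:
    """Generate guessable variants of a base word using simple pattern rules."""
    bases = {word, word.capitalize(), word.translate(LEET_MAP)}
    guesses = set()
    for base in bases:
        guesses.add(base)
        guesses.add(base + "!")  # the ever-popular trailing bang
        for year in YEARS:
            guesses.add(base + year)  # dates basically everywhere
            guesses.add(base + year.translate(SHIFT_MAP))  # shifted digits
    return guesses


variants = mangle("password")
print(len(variants), "guesses from one base word")
print(sorted(variants)[:5])
```

Feed it a list of the most common passwords instead of one word and the guess list grows quickly while staying far smaller than trying every possible string.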

Basically, I have now successfully over-explained the classic "correct horse battery staple" XKCD comic strip, and we can get on to social security numbers.


So, Equifax was hacked and the Social Security Numbers of 143 million people are potentially exposed.

An SSN is 9 digits, so it would theoretically have about 30 bits of entropy if the numbers were completely random (9*3.3 = 29.7).

I have some bad news about how random SSNs are. It's not new information. In fact, this fine article from Phrack's June 1988 edition details a lot of the patterns behind SSNs (at least for those of us born before 2011): the first three digits track to your state, and the next two follow specific patterns and sequences that can be predicted given your age. According to this paper, this drops the entropy of your SSN down to 11 bits in some cases -- or a mere 2,048 guesses. This is exactly why you may have seen some phishing emails that have the first digits of your SSN already guessed (and often verified using free online services) and seek to trick you into revealing the last four digits -- the only ones with any privacy left to them. It is an exercise left to the reader as to why the last four digits of your SSN also make a horrible PIN for other services.
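Putting numbers on that drop, as a rough sketch: the 11-bit figure is the one quoted from the paper above, and everything else is just the same digits-to-bits arithmetic from the PIN section.

```python
import math

# A fully random 9-digit number.
random_ssn_bits = 9 * math.log2(10)
random_ssn_guesses = 10 ** 9

# The reduced figure quoted above for a predictable SSN.
predictable_ssn_bits = 11
predictable_ssn_guesses = 2 ** predictable_ssn_bits

# The last four digits on their own are just a 4-digit PIN.
last_four_bits = 4 * math.log2(10)

print(f"random SSN:       {random_ssn_bits:.1f} bits, {random_ssn_guesses:,} guesses")
print(f"predictable SSN:  {predictable_ssn_bits} bits, {predictable_ssn_guesses:,} guesses")
print(f"last four digits: {last_four_bits:.1f} bits")
print(f"search space shrinks by a factor of {random_ssn_guesses // predictable_ssn_guesses:,}")
```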

All of this is to say: be angry that a company whose core business relied on protecting this information failed to do so. Be extra cautious about identity scams, especially as that same company is doing a horrible job at trustworthy incident response so far. Look into putting a freeze on new lines of credit in your name. But also remember that the SSN as a secure, unique identifier protects your privacy about as well as the emperor's new clothing protects his.

Crypto-types: Shoot me an email if I've oversimplified a concept to a point you feel is misleading.
