Building secure systems is difficult. It would be nice if we had a bunch of well-designed crypto building blocks that we could assemble in all sorts of ways and be certain that they would, no matter what, yield a secure system overall. There are, in fact, folks working on such things at a theoretical level [Universal Composability].
But even if you had these building blocks, you would still have to use them in their intended way. A component can only be secure under certain well-defined circumstances, not for any use that happens to look similar.
One area of secure protocol development that seems to consistently yield poor design choices is the use of hash functions. What I’m going to say is not 100% correct, but it is on the conservative side of correct, so if you follow the rule, you (probably) can’t go wrong. You might be considered overly paranoid, but as they say, just because you’re paranoid doesn’t mean they’re not after you.
So here it is: Don’t hash secrets. Never. No, sorry, I know you think your case is special but it’s not. No. Stop it. Just don’t do it. You’re making the cryptographers cry.
What the heck am I talking about, you say? I’ll explain. But before we get lost in the details, just remember. Don’t hash secrets. Ever. Kapish?
What exactly do you mean by “Hash”?
A hash function takes any document, big or small, and creates a short fingerprint. That gigabyte movie of your newborn baby? Hash it with SHA1, and you’ve got yourself a 160 bit (~30 alphanumeric characters) fingerprint. Now, hold on, you say, 30 characters? You’ve hashed my baby to pieces and all that’s left is a measly 30 characters? No, no, don’t worry, your baby is still a unique snowflake. You can’t take those 30 characters and, from them, recover your gigabyte video. This is not uber-data-compression.
But it’s going to be darn hard for you to find any other document, big or small, that hashes to the same 30 characters. In fact, it’s so hard, even the most powerful computer in the world dedicated to this one task for hundreds of years won’t succeed at finding that doppelganger document. You’ve got lots of computers you say? You’re Google and you have hundreds of thousands of computers? Yeah, well…. tough. You still won’t succeed.
In fact, you can try something that should be easier: rather than find another document that hashes specifically to those 30 characters that represent your baby, you can go looking for any two documents that happen to hash to the same thing (collide). And you won’t find any such pair. Promised. We call that “collision resistance”. That thing about how you can’t find another document that hashes to the same value as your baby video? We call that “second pre-image resistance.”
Oh, and I forgot to mention that this magical function, SHA1, is public. Anyone can see the code. There are no secrets. Even if you see the code, you can’t find a collision. No, really, I’m not screwing with you.
I want to hash everything!
That’s usually the reaction after discovering the amazing power of hash functions. There are all sorts of nails just waiting for this magical hammer, so let’s start hashing everything in sight. De-duplicating large documents? Hash and compare! Passwords in a database? Hash and store! Anonymizing names in a database? Hash and pseudonymize!
After all, the magical power of a hash function is that you can’t “go back,” right? Given a hash, it’s impossible to get that pre-image, so hash away, my magical crypto friends!
Yeah, so it’s not quite that magical.
Let’s say I give you a SHA1 hash value 29b0d7c86b88565b78efffeea634cee81a209c92. From that hash alone, you can’t tell what I hashed. But if I tell you that I hashed a password, then all you need to do is try a bunch of common passwords and see which one matches. In this case, I hashed “love”, one of the most common passwords there is.
Now you start to see how this “you can’t go back” reasoning fails: if you know the domain of possible pre-images, and that domain is not too large, then you can just try them all and see which one matches. That’s a big strike against the “hash everything” approach.
Sprinkle in some Salt
It gets more interesting with the complete password use-case. Many web developers already know that they shouldn’t store user passwords in the clear in the database, just in case a break-in reveals every user’s password. So, instead of storing passwords in the clear, let’s store a SHA1 hash of the password, against which a candidate password can be easily checked: hash it and compare.
Now the web developers who have been around the block a few times know that, if you just apply SHA1 blindly, you’ve got the “small domain” problem I just mentioned. An attacker can build up a huge dictionary of hashed passwords just once, and, when he breaks into your web site, check the hashes against this pre-built dictionary.
To prevent these “dictionary attacks”, we add salt to the hashing process, so that each user’s password is hashed differently, and generic attacks don’t work: you have to rebuild the dictionary for each user you choose to attack. Sprinkling in salt is easy: just concatenate the password with a random string:
SHA1("TheAnswerIs42" || "love") = ce75a1c90ed564a231de85d93520f1e47726df64
Then, when a user types in a password, e.g. “lvoe” (a typo), the system checks:
SHA1("TheAnswerIs42" || "lvoe") = f832b210d62251c19a374a175bff760935c540d4
and sure enough, that doesn’t match, so the password is rejected.
Of course, the system has to keep the salt “TheAnswerIs42” around to check the password, otherwise, it can’t re-perform the hash. So, if an attacker breaks in, he’ll find the salts, of course. This means that salting won’t protect a user with a weak password. But it will provide better protection for users with reasonable passwords, since, even with the salt in hand, the attacker will have to re-compute the dictionary for each salt, and thus each user.
So the moral of the story is that hashing the secret password directly is a bad idea.
And this is typically where most developers stand. They understand that hashing is good, they vaguely understand the notion of salting, and they figure that salt+hash is all they need to know. Except it’s not.
When hashing is really a signature
One interesting application of hash functions that has spread like wildfire in the last few years is in the realm of cheap signatures. Consider an application, SuperAnnoyingPoke that wants to send an authenticated message to MyFace. It could apply a full digital signature, using say RSA, so that MyFace can be sure that the message really came from SuperAnnoyingPoke. But that actually takes milliseconds on an average computer, and milliseconds are a lot. Plus, there’s all sorts of weird padding issues and size limitations that might require hybrid encryption, so it’s messy.
But hey, let’s take out our trusty cryptographic Swiss Army Knife, the hash function! Let’s salt+hash! We’ll just make sure that SuperAnnoyingPoke and MyFace share a secret string that’s a good 20 characters long or so, and when SuperAnnoyingPoke wants to send a message to MyFace, it will also send along a “Message Authentication Code” (MAC) that is computed as:
MAC = SHA1(secret_string || message)
MyFace can easily look at the message that is sent, recompute the MAC given the secret string it shares with SuperAnnoyingPoke, and compare it to the MAC sent along with the message. Heck, you can even put a timestamp in there to make sure the message can’t be re-played by an attacker at a later date. After all, since the hash function makes it hard to “go back” when you’re using a salt (the secret string), this should be a secure and cheap way to sign messages! Super!
Except this is where things really fall apart.
The security property we want here is that, if the attacker sees a message and its corresponding MAC, then it should not be able to figure out the MAC for a different message. That’s the whole point of a signature. And, unfortunately, there’s a property of SHA1 and lots of other hash functions like it that make it a fast hash function, but a terrible way to compute a MAC.
Here’s the deal: if I tell you that SHA1(foo) is X, then it turns out, in a lot of cases, to be quite easy for you to determine what SHA1(foo || bar) is. You don’t need to know what foo is. It’s just that, because SHA1 is iterative and works block by block, if you know the hash of foo, then you can extend the computation to determine the hash of foo || bar.
That means that if you know SHA1(secret || message), then you can compute SHA1(secret || message || ANYTHING), which is a valid signature for message || ANYTHING. So to break this system, you just need to see one signature from SuperAnnoyingPoke, and then you can impersonate SuperAnnoyingPoke for lots of other messages.
Why? How??? But…. I thought hash functions didn’t let me “go back!” Well, note how I didn’t say the attacker would recover the secret. It’s just that, given one hash, they can compute others for related pre-images. That’s why you have to be careful about using hash functions when you’re hashing secrets. Another strike against using hash functions willy-nilly.
(If you’re keeping up, your next suggestion is “well, put the secret AFTER the message, not before”. And yeah, that’s a reasonable suggestion, but it points out how you’re now assuming some extra properties of the SHA1 hash function in your design, and that’s bad. What if you upgrade to a different hash function in 5 years, do you then have to update your protocol to match? The point is that you shouldn’t be using a hash function for this, that’s not its purpose!)
What you should be using is HMAC: Hash-function Message Authentication Code. You don’t need to know exactly how it works, just like you don’t need to know exactly how SHA1 works. You just need to know that HMAC is specifically built for message authentication codes and the use case of SuperAnnoyingPoke/MyFace. Under the hood, what’s approximately going on is two hashes, one after the other, with the secret combined after the first hash… but don’t worry about it! That’s the whole point! HMAC is built for this feature.
HMAC has two inputs and one output: in go a message, and a secret, and out comes a message authentication code (i.e. a signature). The security of HMAC is such that, you can see as many messages and corresponding signatures as your heart desires, and you still won’t be able to determine the signature on a message you haven’t seen yet. That’s the security property you’re looking for. And HMAC is built on top of a hash function, so more specifically you should be using HMAC-SHA1.
So, again, don’t hash secrets. HMAC them.
There’s plenty more to read if you’re interested in this topic, but chances are, you’re not and you just want a recommendation. “Don’t Hash Secrets” is not always entirely necessary. In the password example, you can hash a password as long as you salt it correctly. But do you really want to have to worry about that? I don’t. In fact, I use HMAC for my password databases, too. It’s overkill, but it lets me use a standard library that likely makes me safer in the long run.
So the next time you’re using a hash function on anything, ask yourself: is any of the stuff I’m hashing supposed to stay secret? If so, don’t hash. Instead, use HMAC.