getting web sites to adopt a new identity system

My team at Mozilla works on Persona, an easy and secure web login solution. Persona delivers to web sites and apps just the right information for a meaningful login: an email address of the user’s choice. Persona is one of Mozilla’s first forays “up the stack” into web services.

Typically, at Mozilla, we improve the Web by way of Firefox, our major lever with hundreds of millions of users. Take asm.js, Firefox’s new awesome JavaScript optimization technology which lets you run 60-frame-per-seconds games in your web browser. It’s such a great thing that Chrome is fast-following. Of course, Chrome also innovates by deploying features first, and Firefox often fast-follows. Standardization ensues. The Web wins.

With Identity, we’ve taken a different approach: out of the gate, Persona works on all modern browsers, desktop and mobile, and some not-so-modern browsers like IE8 and Android 2.2 stock. We’re not simply building open specs for others to build against: we are putting in the time and effort to make Persona work everywhere. We even have iOS and Android native SDKs in the works.

Why would we do such a thing? Aren’t we helping to improve our competitors’ platforms instead of improving our own? That reasoning, though tempting, is misguided. Here’s why.

working on all modern platforms is table-stakes

We talk about Persona to Web developers all the time. We almost always get the following two questions:

  1. does this work in other browsers?
  2. does this work on mobile?

These questions are actually all-or-nothing: either Persona works on other browsers and on mobile, or, developers tell us, they won’t adopt it. To date, we have not found a single web site that would deploy a Firefox-only authentication system. Some web sites have adopted Persona, only to back out once they built an iOS app and couldn’t use Persona effectively (we’re actively fixing that.) So, grand theories aside, we’re targeting all platforms because web sites simply won’t adopt Persona otherwise. After all, Facebook Connect works everywhere.

When you think about it, is that actually so different from the asm.js strategy? asm.js is much faster on Firefox, but it works on Chrome and any other JavaScript engine, too. Heck, even Google’s DART, a brand new language they want to see browsers adopt, comes with a DART-to-JavaScript-compiler so it works on all other browsers out of the gate. These are not after-thoughts. These are not small investments. asm.js was designed as a proper subset of JavaScript. The DART-to-JS compiler is a freaking compiler, built just so non-Chrome browsers can run DART.

When appealing to web developers to make a significant investment — rewriting code, building against a new authentication system, .. —, cross-browser and cross-device functionality from day 1 is table-stakes. The alternative is not reduced adoption, it’s zero adoption.

priming users is the winning hand

The similarities between Identity and purely functional improvements like asm.js stop when it comes to users. The reason web sites choose Facebook Connect is not just because it works, but because 1 Billion users are primed with accounts and ready to log in. Same goes for Google+ and Twitter logins.

Persona doesn’t have a gigantic userbase to start from. That sucks. The good news is that, unlike other identity systems, we don’t want to create a huge silo’ed userbase. What we want is a protocol and a user-experience that make Web logins better. We want users to choose their identity. We’re happy to bridge to existing userbases to help them do just that!

So, bridging is what we’re doing. You’ve seen it already with Yahoo Identity Bridging in Persona Beta 2. More is coming. With each bridge, hundreds of millions of additional users are primed to log in with Persona. That’s powerful. And it’s a major reason why sites are adopting Persona.

Working everywhere is table-stakes. Priming users so they’re ready to log in with just a couple of clicks, that’s the winning hand.

beautiful native user-agent implementations sweetens the pot

Meanwhile, the Persona protocol is specifically tailored to be mediated by the user’s browser. Long-term, we think this will be a fantastic asset for the Persona login experience. Beautiful, device-specific UIs. Universal logout buttons. Innovation in trusted UIs. And lots of other tricks we haven’t even thought of yet. We’re doing just that kind of innovation on Firefox OS with a built-in trusted UI for Persona.

But let’s be clear: that’s not an adoption strategy. An optimized Firefox UI for Persona will not affect web-site adoption because it does nothing to reduce login friction. In a while, once Persona is widespread with hundreds of thousands of web sites supporting it, and users are actively logging in with Persona on many devices and browsers, Firefox’s optimized Persona UI will be a competitive advantage that other browsers will feel pressure to match. Until then, web site adoption is the only thing that matters.

now you know our priorities

Wherever it makes sense, we’re implementing Firefox-specific Persona UIs. However, when it comes to an adoption strategy, we know from our customers that this won’t help. What will help is:

  1. Persona working everywhere
  2. As many users as possible primed to log in

Those are our priorities.

We know this is different for Mozilla. But it’s quite common for folks implementing Services. What you’re seeing here is Mozilla adapting as it applies its strongly held principle of user sovereignty up the stack and into the cloud.

Identity Systems: white labeling is a no-go

There’s a new blog post with some criticism of Mozilla Persona, the easy and secure web login solution that my team works on. The great thing about working in the open at Mozilla is that we get this kind of criticism openly, and we respond to it openly, too.

The author’s central complaint is that the Persona brand is visible to the user:

It [Persona] needs white-labeling. I know that branding drives adoption, but showing the Persona name on the login box at all is too much; it needs to be transparent for the user. Most of the visits to any website are first-time visits, which means the user is seeing your site/brand for the first time. Introducing another brand at the sign-up point is a confusing distraction to the user.

The author compares Persona to Stripe, the payment company with a super-easy-to-use JavaScript API, which lets a web site display a payment form with no trace of the Stripe brand, and all the hard credit-card processing work is left to the Stripe service.

This is an interesting point, but unfortunately it is wrong for an Identity solution. Consider if Persona were fully white-labeled, integrated into the web site’s own pages, with no trace of the Persona system visible to the user. What happens then? Two possibilities:

  1. no user state is shared between sites: users create a new account on every site that uses Persona. The site doesn’t have to do the hard work of password storage, it can let Persona handle this. There’s no benefit to the user: every web site looks independent from the others, with its own account and password. And while this is incrementally better than having web sites store passwords themselves, that increment is quite small: web sites tend to use federated authentication solutions if they can lower the friction of users signing up. If users still have to create accounts everywhere, friction is high, and the benefit to the web site is small.
  2. user state is shared between sites: users don’t have to create new accounts at every web site, they can use their existing single Persona account, but now they have no branding whatsoever to indicate this. So, are users supposed to type in the same Persona password on every site they see? Are they supposed to feel good about seeing their list of identities embedded within a brand new site they’ve never seen before, with no indication of why that data is already there? This is a recipe for disastrous phishing and a deeply jarring user experience.

So what about Stripe? With Stripe, the user retypes their credit-card number at every web site they visit. That makes sense because the hard part of payment processing for web sites isn’t so much the prompting for a credit card, it’s the actual payment processing in the backend. And, frankly, it would be quite jarring if you saw your credit card number just show up on a brand new web site you’ve never visited before.

But identity is different. The hard part is not the backend processing, it’s getting the user to sign up in the first place, and for that you really want the user to not have to create yet another account. Plus, if you’re going to surface the user’s identity across sites, then you *have* to give them an indication of the system that’s helping them do that so they know what password to type in and why their data is already there. And that’s Persona. Built to provide clear benefits to users and sites.

By the way, though we need some consistent Persona branding to make a successful user experience, we don’t need the Persona brand to be overbearing. Already, with Persona, web sites can add a prominent logo of their choosing to the Persona login screen. And we’re working on new approaches that would give sites even more control over the branding, while giving users just the hint they need to understand that this is the same login system they trust everywhere else. Check it out.

Firefox is the unlocked browser

Anil Dash is a man after my own heart in his latest post, The Case for User Agent Extremism. Please go read this awesome post:

One of my favorite aspects of the infrastructure of the web is that the way we refer to web browsers in a technical context: User Agents. Divorced from its geeky context, the simple phrase seems to be laden with social, even political, implications.

The idea captured in the phrase “user agent” is a powerful one, that this software we run on our computers or our phones acts with agency on behalf of us as users, doing our bidding and following our wishes. But as the web evolves, we’re in fundamental tension with that history and legacy, because the powerful companies that today exert overwhelming control over the web are going to try to make web browsers less an agent of users and more a user-driven agent of those corporations. This is especially true for Google Chrome, Microsoft Internet Explorer and Apple Safari, though Mozilla’s Firefox may be headed down this path as well.

So so right… except for the misinformed inclusion of Firefox in that list. Anil: Firefox is the User Agent you’re looking for. Here’s why.

user agency

Two years ago, I joined Mozilla because Mozillians are constantly working to strengthen the User Agent:

In a few days, I’ll be joining Mozilla.

[..]

[I want] to work on making the browser a true user agent working on behalf of the user. Mozilla folks are not only strongly aligned with that point of view, they’ve already done quite a bit to make it happen.

browser extensions

Like Anil, I believe browser add-ons/extensions/user-scripts are critical for user freedom, as I wrote more than two years ago, before I even joined Mozilla:

Browser extensions, or add-ons, can help address this issue [of user freedom]. They can modify the behavior of specific web sites by making the browser defend user control and privacy more aggressively: they can block ads, block flash, block cookies for certain domains, add extra links for convenience (i.e. direct links to Flickr’s original resolution), etc.. Browser extensions empower users to actively defend their freedom and privacy, to push back on the more egregious actions of certain web publishers.

mobile

Again, like Anil, I saw, in that same blog post, the threat of mobile:

Except in the mobile space. Think about the iPhone browser. Apple disallows web browsers other than Safari, and there is no way to create browser extensions for Safari mobile. When you use Safari on an iPhone, you are using a browser that behaves exactly like all other iPhone Safaris, without exception. And that means that, as web publishers discover improved ways to track you, you continue to lose privacy and control over your data as you surf the Web.

This situation is getting worse: the iPad has the same limitations as the iPhone. Technically, other browsers can be installed on Android, but for all intents and purposes, it seems the built-in browser is the dominant one. Simplified computing is the norm, with single isolated applications, never applications that can modify the behavior of other applications. Thus, no browser extensions, and only one way to surf the web.

so Firefox?

To Anil’s concerns:

  • Firefox Sync, which lets you share bookmarks, passwords, tabs, etc. across devices, is entirely open-source, including the server infrastructure, and if you don’t want Mozilla involved, you can change your Firefox settings to point to a Sync server of your choosing, including one you run on your own using our open-source code. PICL (Profile in the Cloud), the next-generation Sync that my team is working on, will make it even easier for you to choose your own PICL server. We offer a sane default so things work out of the box, but no required centralization, unlike other vendors.
  • Mozilla Persona, our Web Identity solution, works today on any major browser (not just Firefox), and is fully decentralized: you can choose any identity provider you want today. This stands in stark contrast to competing solutions that tie browsers to vendor-specific accounts. Persona is the identity solution that respects users.
  • Firefox for Android is the only major mobile browser that supports add-ons. Anil, if you want “cloud-to-butt”, you can have it on Firefox for Android. You can also have AdBlock Plus. Try that on any other mobile browser.

the unlocked browser

Anil argues that we should talk about unlocked browsers. I love it. Let’s do that. Here’s my bet, Anil: write down your criteria for the ideal unlocked browser. I bet you’ll find that Firefox, on desktop, on mobile, and in all of the services Mozilla is offering as attachments, is exactly what you’re looking for.

the Web is the Platform, and the User is the User

Mid-2007, I wrote two blog posts — get over it, the web is the platform and the web is the platform [part 2] that turned out to be quite right on one front, and so incredibly wrong on another.

Let’s start with where I was right:

Apps will be written using HTML and JavaScript. […] The Web is the Platform. The Web is the Platform. It’s going to start to sink in fast.

[…]

Imagine if there’s a way to have your web application say: “please go pick a contact from your address book, then post that contact’s information back to this URL”. Or if the web application can actually prepare a complete email on your behalf, image attachments included (oh the security issues….), and have you just confirm that, yes, you really want to send that email (the web app definitely can’t do that without confirmation)?

[…]

[We could] begin to define some JavaScript APIs, much like Google Gears for offline data storage, that enables this kind of private-public mashup. It would be fantastically interesting, because the security issues are mind boggling, but the resulting features are worth it. And it would spearhead some standards body to look into this issue more closely.

Whatever happens, though, the web is the platform. If you’re not writing apps in cross-browser-compliant HTML+JavaScript, the clock is ticking.

And in my followup post:

Add incremental features in JavaScript. First an offline functionality package, like Google Gears, so applications can work offline. Then, an interface to access the user’s stored photos. Over time, a way for web applications to communicate with one another.

[…]

Then there’s one tweak that could make a huge difference. Let a web application add itself to the dashboard.

Where did I go wrong? I thought this innovation was going to be unleashed by Apple with their introduction of the iPhone.

In my defense, if you read between the lines of the iPhone announcements back in 2007, it’s possible that Apple actually meant to do this. But then they didn’t, and they released an Objective C API, and a single closed app store, and locked down payments, and disallowed competition with their own apps, … So much for the Web.

It’s only fitting that the organization that is making this happen is my employer, Mozilla, with Firefox OS. Don’t get me wrong, I’m not taking credit for Firefox OS: there is a whole team of amazing leaders, engineers, product managers, product marketers, and generally rockstars making that happen. But it’s nice to see that this vision from six years ago is now reality.

So, the Web is the platform. HTML and JavaScript are the engines.

What about data? What about services? It’s time we redesign those. They, too, need to be freed from central points of control and silos. Data & Services need to be re-architected around the user. I should get to choose which organization I want to trust and which of my existing accounts I want to use to log into a new site/app/service. I should be able to access my data, know who else is touching it, and move it around as I please. I should be able to like/share any piece of content from any publisher I read onto any social network I choose. Amazing new apps should have easy access to any data the user wishes to give them, so that new ideas can emerge, powered by consenting users’ data, at the speed of the Web.

That, by the way, is the mission of my team, Mozilla Identity, and those are the guiding principles of our Persona login service and our upcoming project codenamed PICL. And of course we’ll continue to build those principles and those technologies into the Firefox OS phone (Persona is already in there.)

The Web is the Platform. And the User is the User. I’m quite sure Mozilla is the organization made to deliver both.

connect on your terms

I want to talk about what we, the Identity Team at Mozilla, are working on.

Mozilla makes Firefox, the 2nd most popular browser in the world, and the only major browser built by a non-profit. Mozilla’s mission is to build a better Web that answers to no one but you, the user. It’s hard to overstate how important this is in 2012, when the Web answers less and less to individual users, more and more to powerful data silos whose interests are not always aligned with those of users.

To fulfill the Mozilla mission, the browser remains critical, but is no longer enough. Think of the Web’s hardware and software stack. The browser sits in the middle [1], hardware and operating system below it, cloud services above it. And the browser is getting squeezed: mobile devices, which outnumber desktop computers and are poised to dominate within a couple of years, run operating systems that limit, through technical means or bundling deals, which browser you can use and how you can customize their behavior. Meanwhile, browsers are fast becoming passive funnels of user data into cloud services that offer too little user control and too much lock-in.

Mozilla is moving quickly to address the first issue with Boot2Gecko, a free, open, and Web-based mobile operating system due to launch next year. This is an incredibly important project that aims to establish true user choice in the mobile stack and to power-charge the Open Web by giving HTML5 Apps new capabilities, including camera access, dialing, etc.

The Mozilla Identity Team is working on the top of the stack: we want users to control their transactions, whether using money or data, with cloud services. We want you to connect to the Web on your terms. To do that, we’re building services and corresponding browser features.

We’re starting with Persona, our simple distributed login system, which you can integrate into your web site in a couple of hours — a good bit more easily than our competitors. Persona is unique because it deeply respects users: the only data exchanged is that users wish to provide. For example, when you use Persona to sign into web sites, there is no central authority that learns about all of your activity.

From Persona, we’ll move to services connected to your identity. We’ll help you manage your data, connect the services that matter to you, all under your full control. We want to take user agency, a role typically reserved for the browser sitting on your device, into the cloud. And because we are Mozilla, and all of our code and protocols are open, you know the services will build will always be on your side.

All that said, we know that users pick products based on quality features, not grand visions. Our vision is our compass, but we work on products that fulfill specific user and developer needs today. We will work towards our vision one compelling and pragmatic product at a time.

The lines between client, server, operating system, browser, and apps are blurring. The Web, far more than a set of technologies, is now a rapidly evolving ecosystem of connections between people and services. The Mozilla Identity Team wants to make sure you, the user, are truly in control of your connections. We want to help you connect on your terms. Follow us, join us.


[1] David Ascher spoke about this in his post about the new Mozilla a few months ago.

BrowserID and me

A few weeks ago, I became Tech Lead on Identity and User Data at Mozilla. This is an awesome and challenging responsibility, and I’ve been busy. When I took on this new responsibility, BrowserID was already well under way, so we were able to launch it in my second week on the project (early July). It’s been a very fun ride.

Here’s the BrowserID demo at the Mozilla All-Hands last week:

Given my prior work on email-based authentication (EmID, Lightweight Email Signatures, BeamAuth), you might think BrowserID was my brainchild. In fact, it really wasn’t. And, in a testament to the shrinking impact of academic publication venues, none of the BrowserID team had ever heard of my work on email-based authentication before I arrived at Mozila, even though Mozilla folks are quite well versed in the state of the art. But who cares: when I found out about the ongoing work and how we agreed on just about every design principle, I was incredibly excited. And when I realized the fantastic work the team had already done on defining a scaffolding and adoption path for the technology, I was super impressed.

BrowserID started with the Verified Email Protocol, designed by Mike Hanson and Dan Mills, who came up with the approach after extensive exploration of web-based identity approaches over the last two years. It’s a simple idea: users can prove to web sites that they own a particular email address. That’s how they register, and that’s how they log back in the next time they visit the site. BrowserID, the code and site, was initially bootstrapped by Lloyd Hilaiel and Mike Hanson. Shane Tomlinson and I joined the team in June. We now also have an awesome UX design team (Bryan and Andy) and the team continues to grow (yay Austin!)

So, that’s what I’m working on these days: BrowserID and other Identity+UserData efforts at Mozilla. I’m excited to be leading this technical effort. The team is amazing, and we’ve got big aggressive plans to help you control your identity and data on the Web.

and the laws of physics changed

Google just introduced Google Plus, their take on social networking. Unsurprisingly, Arvind has one of the first great reviews of its most important feature, Circles. Google Circles effectively let you map all the complexities of real-world privacy into your online identity, and that’s simply awesome.

You can think of Circles as the actual circles of friends you have. The things that are easy to do in real life, like sharing a fun anecdote with the friends you generally go out with on Saturday nights, are easy to do in Circles. The things that are hard to do in real life, like planning your best friend’s surprise birthday party with all of his close friends but without him, are no easier in Circles: you have to make a new list of “everyone except Bob.” That’s great, because I don’t think our brains have evolved yet to really feel comfortable with a social model that supports all set operations, e.g. this circle minus this other circle. That’s usually how we get caught lying. (I mean the lies everyone tells as part of their normal social interactions.)

The most important point is that this feature shatters the previously universally accepted idea that privacy must change dramatically given social networking. For a few years, Facebook has defined the Laws of Physics of social networking. On Facebook, it’s not possible to show different people a different face. On Facebook, relationships are, for the most part, symmetrical. And so we all believed that this was the inevitable path forward with social networking. We conflated the fact that users wanted to connect online with the constraints that Facebook created, and we assumed users wanted those constraints. We forgot that software engineers define the Laws of Physics of the worlds they create. We weren’t living in the inherent world of social networking. We were living in Facebook’s definition of social networking.

We now know it doesn’t have to be this way. The Laws of Physics in the online world are mutable. Google just busted open a world of possibility. Users will question, now more than ever, why sharing must work the way it does on Facebook, given that Google has shown it can work differently.

It will make Facebook better. Which will make Google better. And so on. We may be witnessing the beginning of a new era of online privacy, a maturation of sorts. This is an incredibly exciting time.

The evolution of OpenID: you’re not a URL after all

The US government has just announced a pilot program to integrate OpenID (and Information Cards) into public government web sites. This is very interesting news, as it will likely catalyze even greater OpenID deployment and use.

[I’ve poo-poo’ed OpenID here and here, because of phishing and privacy concerns. I’m still very worried. I’ve suggested ways to defend OpenID against phishing, and I helped Creative Commons deploy a privacy-conscious OpenID service.]

What’s fascinating to me is the evolution of OpenID. The pitch used to be “log in with your URL.” The backend protocol was cool, but it didn’t really matter. Authentication was reduced to proving that you own a particular URL: I can prove that I control http://ben.adida.net, thus, for all intents and purposes, that URL is my identity. I *am* http://ben.adida.net. *How* I prove that I own that URL, that’s a good thing to define precisely, but it wasn’t central to the OpenID story.

A lot of folks, myself included, think URLs as human identifiers are not ideal. People aren’t used to them. They don’t provide a communication channel. It is awkward to type a URL into what is effectively a “username” box. Plus, if you give every site your URL, then multiple sites can correlate identities easily, and that’s probably not a good thing when all you really want is single sign-on.

So OpenID evolved. In version 2.0, instead of typing your full OpenID URL, you can just type the domain name of your provider, i.e. yahoo.com. Then, you get redirected to Yahoo, where you log in, and when you’re done, Yahoo provides a pseudonymous identifier to the third-party web site where you want to log in. And as it turns out, that’s exactly the mode of authentication that the government is requiring for its approved OpenID providers, because they don’t want the NIH and CDC to have the ability to correlate your activities across government services. (In passing… how refreshing to see this privacy concern come out of the US government!). So the NIH thinks you’re 83nbxcvndfs34@google.com and the CDC thinks you’re j23nsnsblkjsdfl45@google.com.

My guess is that URLs as human identifiers are effectively dead. OpenID is now the backend protocol. Identifiers are pseudonyms, not public URLs.

I also suspect the next step is a communication channel for these pseudonyms: identity providers will give relying parties a way to send messages to the pseudonyms that logged into their sites, the same way Facebook lets apps notify its users. Something missing in your NIH grant application? The NIH will make an API call to Google saying “please deliver this note to user 83nbxcvndfs34” and Google will forward it appropriately. (Maybe this feature already exists in OpenID 2.0 and I just don’t know about it?)

On the phishing front, OpenID providers will probably duke it out with various mitigating solutions. It would be nice if the OpenID standard tackled the issue, though.

On the privacy front, only the core OpenID protocol can help. I’d like to use my Google credentials to log in everywhere, but I don’t see why Google needs to look over my shoulder every time I log in to every weird site I visit. The only way to fix this is with cryptographic credentials. I don’t see that anywhere in the OpenID spec’s future, but without it, there are going to be deep privacy issues.

In the end, OpenID 2009 looks almost nothing like OpenID 2006. That’s okay, though. One lesson is that, in the end, the OpenID effort inspired and coalesced a number of disparate efforts to achieve an open-standard for web-based single sign-on. Though the solution has evolved significantly, OpenID has succeeded: we have an open web-based single sign-on system. Now, OpenID will have to deal with the consequences of its success. It will get attacked, a lot. Its issues with phishing and privacy will become greater concerns. It should be a fun ride.

Don’t Hash Secrets

Building secure systems is difficult. It would be nice if we had a bunch of well-designed crypto building blocks that we could assemble in all sorts of ways and be certain that they would, no matter what, yield a secure system overall. There are, in fact, folks working on such things at a theoretical level [Universal Composability].

But even if you had these building blocks, you would still have to use them in their intended way. A component can only be secure under certain well-defined circumstances, not for any use that happens to look similar.

One area of secure protocol development that seems to consistently yield poor design choices is the use of hash functions. What I’m going to say is not 100% correct, but it is on the conservative side of correct, so if you follow the rule, you (probably) can’t go wrong. You might be considered overly paranoid, but as they say, just because you’re paranoid doesn’t mean they’re not after you.

So here it is: Don’t hash secrets. Never. No, sorry, I know you think your case is special but it’s not. No. Stop it. Just don’t do it. You’re making the cryptographers cry.

What the heck am I talking about, you say? I’ll explain. But before we get lost in the details, just remember. Don’t hash secrets. Ever. Kapish?

What exactly do you mean by “Hash”?

A hash function takes any document, big or small, and creates a short fingerprint. That gigabyte movie of your newborn baby? Hash it with SHA1, and you’ve got yourself a 160 bit (~30 alphanumeric characters) fingerprint. Now, hold on, you say, 30 characters? You’ve hashed my baby to pieces and all that’s left is a measly 30 characters? No, no, don’t worry, your baby is still a unique snowflake. You can’t take those 30 characters and, from them, recover your gigabyte video. This is not uber-data-compression.

But it’s going to be darn hard for you to find any other document, big or small, that hashes to the same 30 characters. In fact, it’s so hard, even the most powerful computer in the world dedicated to this one task for hundreds of years won’t succeed at finding that doppelganger document. You’ve got lots of computers you say? You’re Google and you have hundreds of thousands of computers? Yeah, well…. tough. You still won’t succeed.

In fact, you can try something that should be easier: rather than find another document that hashes specifically to those 30 characters that represent your baby, you can go looking for any two documents that happen to hash to the same thing (collide). And you won’t find any such pair. Promised. We call that “collision resistance”. That thing about how you can’t find another document that hashes to the same value as your baby video? We call that “second pre-image resistance.”

Oh, and I forgot to mention that this magical function, SHA1, is public. Anyone can see the code. There are no secrets. Even if you see the code, you can’t find a collision. No, really, I’m not screwing with you.

I want to hash everything!

That’s usually the reaction after discovering the amazing power of hash functions. There are all sorts of nails just waiting for this magical hammer, so let’s start hashing everything in sight. De-duplicating large documents? Hash and compare! Passwords in a database? Hash and store! Anonymizing names in a database? Hash and pseudonymize!

After all, the magical power of a hash function is that you can’t “go back,” right? Given a hash, it’s impossible to get that pre-image, so hash away, my magical crypto friends!

Wrong.

Yeah, so it’s not quite that magical.

Let’s say I give you a SHA1 hash value 29b0d7c86b88565b78efffeea634cee81a209c92. From that hash alone, you can’t tell what I hashed. But if I tell you that I hashed a password, then all you need to do is try a bunch of common passwords and see which one matches. In this case, I hashed “love”, one of the most common passwords there is.

Now you start to see how this “you can’t go back” reasoning fails: if you know the domain of possible pre-images, and that domain is not too large, then you can just try them all and see which one matches. That’s a big strike against the “hash everything” approach.

Sprinkle in some Salt

It gets more interesting with the complete password use-case. Many web developers already know that they shouldn’t store user passwords in the clear in the database, just in case a break-in reveals every user’s password. So, instead of storing passwords in the clear, let’s store a SHA1 hash of the password, against which a candidate password can be easily checked: hash it and compare.

Now the web developers who have been around the block a few times know that, if you just apply SHA1 blindly, you’ve got the “small domain” problem I just mentioned. An attacker can build up a huge dictionary of hashed passwords just once, and, when he breaks into your web site, check the hashes against this pre-built dictionary.

To prevent these “dictionary attacks”, we add salt to the hashing process, so that each user’s password is hashed differently, and generic attacks don’t work: you have to rebuild the dictionary for each user you choose to attack. Sprinkling in salt is easy: just concatenate the password with a random string:

SHA1("TheAnswerIs42" || "love") = ce75a1c90ed564a231de85d93520f1e47726df64

Then, when a user types in a password, e.g. “lvoe” (a typo), the system checks:

SHA1("TheAnswerIs42" || "lvoe") = f832b210d62251c19a374a175bff760935c540d4
                               != ce75a1c90ed564a231de85d93520f1e47726df64

and sure enough, that doesn’t match, so the password is rejected.

Of course, the system has to keep the salt “TheAnswerIs42” around to check the password, otherwise, it can’t re-perform the hash. So, if an attacker breaks in, he’ll find the salts, of course. This means that salting won’t protect a user with a weak password. But it will provide better protection for users with reasonable passwords, since, even with the salt in hand, the attacker will have to re-compute the dictionary for each salt, and thus each user.

So the moral of the story is that hashing the secret password directly is a bad idea.

And this is typically where most developers stand. They understand that hashing is good, they vaguely understand the notion of salting, and they figure that salt+hash is all they need to know. Except it’s not.

When hashing is really a signature

One interesting application of hash functions that has spread like wildfire in the last few years is in the realm of cheap signatures. Consider an application, SuperAnnoyingPoke that wants to send an authenticated message to MyFace. It could apply a full digital signature, using say RSA, so that MyFace can be sure that the message really came from SuperAnnoyingPoke. But that actually takes milliseconds on an average computer, and milliseconds are a lot. Plus, there’s all sorts of weird padding issues and size limitations that might require hybrid encryption, so it’s messy.

But hey, let’s take out our trusty cryptographic Swiss Army Knife, the hash function! Let’s salt+hash! We’ll just make sure that SuperAnnoyingPoke and MyFace share a secret string that’s a good 20 characters long or so, and when SuperAnnoyingPoke wants to send a message to MyFace, it will also send along a “Message Authentication Code” (MAC) that is computed as:

MAC = SHA1(secret_string || message)

MyFace can easily look at the message that is sent, recompute the MAC given the secret string it shares with SuperAnnoyingPoke, and compare it to the MAC sent along with the message. Heck, you can even put a timestamp in there to make sure the message can’t be re-played by an attacker at a later date. After all, since the hash function makes it hard to “go back” when you’re using a salt (the secret string), this should be a secure and cheap way to sign messages! Super!

Except this is where things really fall apart.

The security property we want here is that, if the attacker sees a message and its corresponding MAC, then it should not be able to figure out the MAC for a different message. That’s the whole point of a signature. And, unfortunately, there’s a property of SHA1 and lots of other hash functions like it that make it a fast hash function, but a terrible way to compute a MAC.

Here’s the deal: if I tell you that SHA1(foo) is X, then it turns out, in a lot of cases, to be quite easy for you to determine what SHA1(foo || bar) is. You don’t need to know what foo is. It’s just that, because SHA1 is iterative and works block by block, if you know the hash of foo, then you can extend the computation to determine the hash of foo || bar.

Oh crap.

That means that if you know SHA1(secret || message), then you can compute SHA1(secret || message || ANYTHING), which is a valid signature for message || ANYTHING. So to break this system, you just need to see one signature from SuperAnnoyingPoke, and then you can impersonate SuperAnnoyingPoke for lots of other messages.

Why? How??? But…. I thought hash functions didn’t let me “go back!” Well, note how I didn’t say the attacker would recover the secret. It’s just that, given one hash, they can compute others for related pre-images. That’s why you have to be careful about using hash functions when you’re hashing secrets. Another strike against using hash functions willy-nilly.

(If you’re keeping up, your next suggestion is “well, put the secret AFTER the message, not before”. And yeah, that’s a reasonable suggestion, but it points out how you’re now assuming some extra properties of the SHA1 hash function in your design, and that’s bad. What if you upgrade to a different hash function in 5 years, do you then have to update your protocol to match? The point is that you shouldn’t be using a hash function for this, that’s not its purpose!)

HMAC

What you should be using is HMAC: Hash-function Message Authentication Code. You don’t need to know exactly how it works, just like you don’t need to know exactly how SHA1 works. You just need to know that HMAC is specifically built for message authentication codes and the use case of SuperAnnoyingPoke/MyFace. Under the hood, what’s approximately going on is two hashes, one after the other, with the secret combined after the first hash… but don’t worry about it! That’s the whole point! HMAC is built for this feature.

HMAC has two inputs and one output: in go a message, and a secret, and out comes a message authentication code (i.e. a signature). The security of HMAC is such that, you can see as many messages and corresponding signatures as your heart desires, and you still won’t be able to determine the signature on a message you haven’t seen yet. That’s the security property you’re looking for. And HMAC is built on top of a hash function, so more specifically you should be using HMAC-SHA1.

So, again, don’t hash secrets. HMAC them.

In Conclusion

There’s plenty more to read if you’re interested in this topic, but chances are, you’re not and you just want a recommendation. “Don’t Hash Secrets” is not always entirely necessary. In the password example, you can hash a password as long as you salt it correctly. But do you really want to have to worry about that? I don’t. In fact, I use HMAC for my password databases, too. It’s overkill, but it lets me use a standard library that likely makes me safer in the long run.

So the next time you’re using a hash function on anything, ask yourself: is any of the stuff I’m hashing supposed to stay secret? If so, don’t hash. Instead, use HMAC.

Open(Social) Will Win ; and now Privacy?

If you’re hooked into the social networking world, you know about Facebook and the Facebook platform, which lets developers create all sorts of applications that make use of your Facebook social network in interesting ways. Flixster, for example, lets you share and compare your movie tastes with your existing Facebook friends. No need to reconnect to your friends in every web-based application.

But there is one problem: if you write a Facebook application, you’re pretty much stuck with Facebook. Facebook never lets the application see the user’s email address or Instant Messenger account name, or any other fields that would allow the application to contact the user independently. So, unless you convince users to reenter their email addresses, you’re stuck.

Yesterday, Google launched OpenSocial, a generic API for building Facebook-like applications, but this time on top of a dozen competing social networks now all supporting the same programmatic interface. It’s very cool that you can build an app once and run it on any social network, but what’s far more interesting to me is social network portability, the idea that I, as a user, can theoretically pack up my social network and go to a different site altogether, if I so choose. And that as a developer, I can contact users and their friends directly, if users so choose.

I’m not sure yet if OpenSocial will truly allow this, but it looks like it will: the People API has gd:email and gd:im fields which are exactly the key to social network portability: global unique identifiers for your friends that can be used for messaging.

And so OpenSocial will win. Because open is better: more competition and the reduction of artificial user lock-in results in better products. Google has done a very good thing.

One interesting wrinkle: I wonder how the social network operators feel right now. Participating in OpenSocial is inevitable at this point, unless you plan on giving up the significant value that’s about to be created by hordes of developers. But then, how do you compete, as the underlying social network? How do you prevent from bleeding all of your users to a newer, hipper network when users can pack up their friends and applications at the click of a button?

Maybe one way a social network can compete is on privacy: provide very good privacy controls, allow users to present different faces to different sets of friends, and generally make it really worth their while to host their social network with you. Social networks are not going to be judged on their applications anymore, but purely on how well they manage the actual social network data. So maybe, just maybe, privacy will become a competitive advantage.

UPDATE: a little birdie tells me that OpenSocial, in its current form, offers no more data portability than Facebook. That’s unfortunate, but it’s not entirely surprising because of the argument above: social network platforms don’t care to be commoditized. So, if there’s no data portability, but there is application portability, it’s still hard to move from one network to another, and there’s less incentive, since you can use any application on your network. This could completely reverse my point: OpenSocial might cause *more* lock-in, not less, and then there’s no hope for privacy improvements anytime soon.