Certificate Validation Example: Facebook

Most people know the concepts of SSL, but not the gory details.  By using Facebook as a walkthrough example, I’m going to discuss how it works from the browser’s viewpoint, and how it impacts latency to your site.  BTW, this is not intended as a criticism of Facebook – they’re doing all the right things to make sure your data is encrypted and authenticated and fast.  The failures highlighted here are failures of a system that wasn’t designed for speed.

Fetching the Certificate
When you first connect to an SSL site, the client and server use the server’s public key to exchange a secret which will be used to encrypt the session.  So the first thing the client needs to do is get the server’s public key.  The public key is sent in the Certificate message, which arrives in the same flight as the SSL Server Hello.  When we look at the Server Hello from Facebook, we see that it sent us a Certificate which was 4325 bytes in size.  This means that before your HTTP request even gets off your computer, the server had to send 4KB of data to the client.  That’s a pretty big bundle, considering that the entire Facebook login page is only 8.8KB.  Now, if a public key is generally only 1024 or 2048 bits, with elliptic curve keys being much smaller than that, how did Facebook’s certificate mushroom from 256 bytes to 4325 bytes?  Clearly there is a lot of overhead.  More on this later.
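As a rough sketch of what that costs on the wire (the certificate size is from the trace above; the 1460-byte TCP payload size is an assumption):

```python
# Back-of-the-envelope: how much of Facebook's 4325-byte certificate bundle
# is overhead beyond the public key itself, and how many TCP segments it
# takes to carry.  Sizes from the trace; the 1460-byte MSS is an assumption.
import math

key_bytes = 2048 // 8                   # a 2048-bit public key: 256 bytes
cert_bytes = 4325                       # the Certificate message observed
mss = 1460                              # typical TCP payload per segment

overhead = cert_bytes - key_bytes       # bytes that aren't the key: 4069
segments = math.ceil(cert_bytes / mss)  # segments just for certificates: 3
print(overhead, segments)
```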

Trusting the Certificate
Once the browser has the server’s certificate, it needs to validate that the certificate is authentic.  After all, did we really get Facebook’s key? Maybe someone is trying to trick us.  To deal with this, public keys are always transferred as part of a certificate, and the certificate is signed by a source which needs to be trusted.  Your operating system ships with a list of known and trusted signers (certificate authority roots).  The browser will verify that the Facebook certificate was signed by one of these known, trusted signers.  There are dozens of trusted parties already known to your browser.  Do you trust them all? Well, you don’t really get a choice.  More on this later.

But very few, if any, site certificates are actually signed directly by these root CAs.  Because the root CAs are so important to the overall system, they’re usually kept offline to minimize the chances of hackery.  Instead, these CAs periodically delegate authority to intermediate CAs, which then sign Facebook’s certificate.  The browser doesn’t care who signs the certificate, as long as the chain of certificates ultimately flows to a trusted root CA.
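The chain-walking logic can be sketched like this.  It's a toy model: the names mirror Facebook's actual chain, but the signature verification a real client performs at each hop is elided.

```python
# Toy model of chain validation: follow issuer links until we reach a root
# the client already trusts.  Not a real X.509 implementation.
TRUSTED_ROOTS = {"GTE CyberTrust GlobalRoot"}

# subject -> issuer, as presented in the server's certificate bundle
ISSUER_OF = {
    "www.facebook.com": "DigiCert High Assurance CA-3",
    "DigiCert High Assurance CA-3": "DigiCert High Assurance EV Root CA",
    "DigiCert High Assurance EV Root CA": "GTE CyberTrust GlobalRoot",
}

def chain_to_trusted_root(subject: str) -> bool:
    while subject not in TRUSTED_ROOTS:
        issuer = ISSUER_OF.get(subject)
        if issuer is None:      # chain doesn't reach any trust anchor
            return False
        # (a real client verifies the issuer's signature over `subject` here)
        subject = issuer
    return True

print(chain_to_trusted_root("www.facebook.com"))  # True
```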

And now we can see why Facebook’s Certificate is so large.  It’s actually not just one Certificate – it is 3 certificates rolled into one bundle:

  • DigiCert High Assurance EV Root CA
  • DigiCert High Assurance CA-3
  • www.facebook.com

The browser must verify each link of the chain in order to authenticate that this is really Facebook.com.

Facebook, being as large as they are, would be well served by finding a way to reduce the size of this certificate, and by removing one level from their chain.  They should talk to DigiCert about this immediately.

Verifying The Certificate
With the Facebook Certificate in hand, the browser can almost verify the site is really Facebook.  There is one catch – the designers of Certificates put in an emergency safety valve.  What happens if someone gets a fraudulent certificate (like what happened last month with Comodo) or steals your private key?  There are two mechanisms built into the browser to deal with this.

Most people are familiar with the concept of the “Certificate Revocation List” (CRL).  Inside the certificate, the signer puts a link to where the CRL for this certificate can be found.  If this certificate were ever compromised, the signer could add its serial number to the list, and then the browser would refuse to accept the certificate. CRLs can be cached by the operating system, for a duration specified by the CA.
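At its core, the CRL check itself is just a set-membership test.  A minimal sketch (the serial numbers here are made up):

```python
# The CRL check: is this certificate's serial number on the signer's
# published revocation list?  Serial numbers are hypothetical.
revoked_serials = {0x1A2B3C, 0x4D5E6F}   # stand-in CRL contents

def is_revoked(serial: int) -> bool:
    return serial in revoked_serials

print(is_revoked(0x1A2B3C))   # True  -> browser must reject the certificate
print(is_revoked(0x999999))   # False -> certificate passes this check
```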

The second type of check is to use the Online Certificate Status Protocol (OCSP).  With OCSP, instead of the browser having to download a potentially very large list (CRL), the browser simply checks this one certificate to see if it has been revoked.  Of course it must do this for each certificate in the chain.  Like with CRLs, these are cacheable, for durations specified in the OCSP response.
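A sketch of the caching behavior, assuming a simplified cache keyed by serial number.  Real OCSP responses carry validity timestamps (thisUpdate/nextUpdate); the responder here is a hypothetical stand-in:

```python
# Sketch of client-side OCSP caching: only re-query the responder once the
# cached response's validity window has passed.
from datetime import datetime, timedelta

cache = {}  # certificate serial -> (status, expiry of cached response)

def check_revocation(serial, now, query_ocsp):
    status, expires = cache.get(serial, (None, None))
    if expires is None or now >= expires:
        status, valid_for = query_ocsp(serial)   # costs a network round trip
        cache[serial] = (status, now + valid_for)
    return status

# Hypothetical responder: everything is good, responses valid for ~7 days.
responder = lambda serial: ("good", timedelta(days=7))

t0 = datetime(2011, 4, 20)
print(check_revocation(1, t0, responder))                      # network hit
print(check_revocation(1, t0 + timedelta(days=3), responder))  # from cache
```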

In the Facebook.com example, the DigiCert certificates specify an OCSP server.  So as soon as the browser received the Server Hello message, it took a timeout with Facebook and instead issued a series of OCSP requests to verify the certificates haven’t been revoked.

In my trace, this process was quick, with a 17ms RTT, and spanning 4 round-trips (DNS, TCP, OCSP Request 1, OCSP Request 2), this process took 116ms.  That’s a pretty fast case.  Most users have 100+ms RTTs and would have experienced approximately a ½ second delay.  And again, this all happens before we’ve transmitted a single byte of actual Facebook content.  And by the way, the two OCSP responses were 417 bytes and 1100 bytes, respectively.
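The arithmetic behind those delays, as a floor estimate.  It ignores server processing time, which is why the measured 116ms exceeds the 4 × 17 = 68ms minimum:

```python
# Floor estimate of the OCSP delay: one round trip each for DNS, the TCP
# handshake, and each OCSP request.
def ocsp_delay_ms(rtt_ms, certs_to_check=2):
    round_trips = 1 + 1 + certs_to_check   # DNS + TCP + one per certificate
    return round_trips * rtt_ms

print(ocsp_delay_ms(17))    # 68  -> the fast case traced above
print(ocsp_delay_ms(100))   # 400 -> roughly the half second typical users see
```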

Oh but the CDN!
All major sites today employ Content Delivery Networks to speed the site, and Facebook is no exception.  For Facebook, the CDN site is “static.ak.facebook.com”, and it is hosted through Akamai. Unfortunately, the browser has no way of knowing that static.ak.facebook.com is related to facebook.com, and so it must repeat the exact same certificate verification process that we walked through before.

For Facebook’s CDN, the Certificate is 1717 bytes, comprising 2 certificates:

Unlike the certificate for facebook.com, these certificates specify a CRL instead of an OCSP server.  By manually fetching the CRL from the Facebook certificate, I can see that the CRL is small – only 886 bytes. But I didn’t see the browser fetch it in my trace.  Why not?  Because the CRL in this case specifies an expiration date of July 12, 2011, so my browser already had it cached.  Further, my browser won’t re-check this CRL until July, 4 months from now.  This is interesting, for reasons I’ll discuss later.

Oh but the Browser Bug!
But for poor Facebook, there is a browser bug (present in all major browsers, including IE, FF, and Chrome) which is horribly sad.  The main content from Facebook comes from www.facebook.com, but as soon as that page is fetched, it references 6 items from static.ak.facebook.com.  The browser, being so smart, will open 6 parallel SSL connections to the static.ak.facebook.com domain. Unfortunately, each connection will resend the same SSL certificate (1717 bytes).  That means that we’ll be sending over 10KB of data to the browser for redundant certificate information.

The reason this is a bug is that when the browser doesn’t have certificate information cached for a site, it should complete the first handshake first (downloading the certificate information once), and then use the faster SSL session resumption for each of the other 5 connections.
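A back-of-the-envelope on what the correct behavior would save, using the sizes from this trace (resumed handshakes carry no certificate):

```python
# What fixing the bug saves for the CDN connections: one full handshake
# downloads the certificate chain once; the other five connections resume
# the session and send no certificate at all.
cert_bytes, connections = 1717, 6

naive = cert_bytes * connections   # today: every connection resends the chain
fixed = cert_bytes                 # one full handshake, five resumptions
print(naive, naive - fixed)        # 10302 bytes sent, 8585 of them avoidable
```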

Putting It All Together
So, for Facebook, the overall impact of SSL on the initial user is pretty large.  On the first connection, we’ve got:

  • 2 round trips for the SSL handshake
  • 4325 bytes of Certificate information
  • 4 round trips of OCSP validation
  • 1500 bytes of OCSP response data

Then, for the CDN connections we’ve got:

  • 2 round trips for the SSL handshake
  • 10302 bytes of Certificate information (1717 duplicated 6 times)

The one blessing is that SSL is designed with a fast-path to re-establish connectivity.  So subsequent page loads from Facebook do get to cut out most of this work, at least until tomorrow, when the browser probably forgot most of it and has to start over again.
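Tallying the certificate and OCSP bytes above for a cold first visit:

```python
# Totaling the validation overhead from the lists above, cold cache.
main_chain = 4325             # www.facebook.com certificate bundle
ocsp_bytes = 417 + 1100       # the two OCSP responses from the trace
cdn_chain = 1717              # static.ak.facebook.com certificate bundle
cdn_connections = 6           # parallel connections, each resending it

total = main_chain + ocsp_bytes + cdn_chain * cdn_connections
print(total)                  # 16144 bytes before any page content arrives
```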

Making it Better

OCSP & CRLs are broken
In the above example, if the static.ak.facebook.com keys are ever compromised, browsers around the planet will not notice for 4 months. In my opinion, that is too long.  For OCSP checks, the result is usually cached for ~7 days.  Having users exposed to compromised sites for 7 days is also a long time.  And when Comodo was hacked a month ago, the browser vendors elected to immediately patch every browser user on the planet rather than wait for the OCSP caches to expire in a week.  Clearly the industry believes that revocation checking is broken when it is easier to patch than to rely on the built-in infrastructure.
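To put the CRL blind spot in numbers (the compromise date here is hypothetical):

```python
# The revocation blind spot: a CRL cached until July 12, 2011 means a key
# compromised shortly after my trace goes unnoticed until that expiry.
from datetime import date

crl_next_update = date(2011, 7, 12)   # expiration from the fetched CRL
compromise_day = date(2011, 3, 15)    # hypothetical day the key is stolen

blind_days = (crl_next_update - compromise_day).days
print(blind_days)                     # 119 -> roughly the "4 months" above
```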

But it is worse than that.  What does a browser do when the OCSP check fails?  Of course, it proceeds, usually without even letting the user know that it has done so (heck, users wouldn’t know what to do about this anyway)!   Adam Langley points this out in great detail, but the browsers really don’t have an option.  Imagine if DigiCert were down for an hour, and because of that users couldn’t access Facebook.  It’s far more likely that DigiCert had downtime than that the certificate has been revoked.

But why are we delaying our users so radically to do checks whose results we’re just going to ignore if they fail anyway?  Having a single point of failure for revocation checking makes it impossible to do anything else.

Certificates are Too Wordy
I feel really sorry for Facebook with its 4KB certificate.  I wish I could say theirs was somehow larger than average.  They are so diligent about keeping their site efficient and small, and then they get screwed by the Certificate.  Keep in mind that their public key is only 2048 bits.  We could transmit that with 256B of data.  Surely we can find ways to use fewer intermediate signers and also reduce the size of these certificates?

Certificate Authorities are Difficult to Trust
Verisign and others might claim that most of this overhead is necessary to provide integrity and all the features of SSL.  But is the integrity that we get really that much better than a leaner PGP-like system?  The browser today has dozens of root trust points, with those delegating trust authority to hundreds more.  China’s government is trusted by browsers today to sign certificates for google.com, or even facebook.com.  Do we trust them all?

A PGP model could reduce the size of the Certificates, provide decentralization so that we could enforce revocation lists, and eliminate worries about trusting China, the Iranian government, the US government, or any dubious entities that have signature authority today.

Better Browser Implementations
I mentioned above the flaw where the browser will simultaneously open multiple connections to a single site when it knows it doesn’t have the server’s certificate, and thus redundantly download potentially large certs.  All browsers need to be smarter.

Although I expressed my grievances against the OCSP model above, it is what’s used today.  If browsers continue to use OCSP, they need to fully implement OCSP caching on the client, they need to support OCSP stapling, and they need to help push OCSP multi-stapling forward.

SSL Handshake Round Trips
The round trips in the handshake are tragic.  Fortunately, we can remove one, and Chrome users get this for free thanks to SSL False Start.  False Start is a relatively new, client-side only change.  We’ve measured that it is effective at removing one round trip from the handshake, and that it can reduce page load times by more than 5%.
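The round-trip math, sketched (the RTT value is illustrative):

```python
# A full SSL handshake costs two round trips before the client can send its
# HTTP request; False Start lets the client send application data one round
# trip earlier.
def handshake_delay_ms(rtt_ms, false_start=False):
    round_trips = 1 if false_start else 2
    return round_trips * rtt_ms

print(handshake_delay_ms(100))                    # 200
print(handshake_delay_ms(100, false_start=True))  # 100
```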

Hopefully I got all that right.  If you read this far, you deserve a medal.

14 thoughts on “Certificate Validation Example: Facebook”

  • April 20, 2011 at 5:35 pm

    Hey Mike,

    I’m an engineer at Facebook. Do you have any recommendations for a CA that offers certs signed directly from the root? A long time ago, when I was working on reCAPTCHA we got a huge benefit by finding such a CA. But many CAs have transitioned to issuing certs with intermediate keys.

  • April 20, 2011 at 9:01 pm

    You can’t get a cert signed directly at the root – that’s generally a security no-no, they like to keep the root offline and only bring it up for signatures rarely. The FB cert right now is 4 levels deep:
    – GTE CyberTrust GlobalRoot (this is baked into the browser already)
    – DigiCert High Assurance EV Root CA
    – DigiCert High Assurance CA-3
    – www.facebook.com

    I’ll do some research on which CA’s provide the lightest weight certs and the smallest chains. I don’t know the answer.

    In the meantime, you could call up DigiCert and tell them you want a total certchain less than 1500 bytes 🙂

    Google Mail uses Thawte, and the chain is a little smaller:
    – VeriSign Class 3 Public Primary CA (baked into the browser)
    – Thawte SGC CA
    – mail.google.com

    I’ll get back to you offline.

  • April 20, 2011 at 10:00 pm

    We got a really great deal on the cert for reCAPTCHA (https://api-secure.recaptcha.net/) — 5 year validity, signed direct from the root, 1024 bit, no OCSP. Sadly, this can’t be reproduced currently :-(. When we got this cert, it saved us a non-trivial amount of money on our bandwidth bill.

    Agreed about signing from the root not being great for security. But maybe we can get some CA to break out their root key to sign some keys for super high profile websites. Based on the recent incident with Comodo, it’d be a Good Thing to make keys for these types of sites a bit more manual.

    Thanks for doing some research, looking forward to hearing your results.

  • April 21, 2011 at 9:58 am

    Great post! Everyone needs to learn more about SSL.
    1. What is the use case you’re analyzing?
    I typically go to http://www.facebook.com/ (non-SSL). Typically I’m logged in and never hit https. If I have to login there’s only one SSL request and then I’m redirected (via JS) back to http.

    Are the SSL inefficiencies more critical on sites that do everything over SSL vs sites that use SSL just for authentication?

    How much of a performance issue is it for sites like FB that just use SSL for auth?

    On a similar note, most of the pages I visited (home, profile, edit friends, privacy settings) accessed http://static.ak.facebook.com (non-SSL). It wasn’t until I went to Account Settings that I hit https://s-static.ak.facebook.com/.

    What motivated Facebook to put three certs in the response? Is that better for security, or better performance, dictated by their CA root, or an accident?

    Any good links to find out more about PGP?

    Who is taking the lead on sorting this out? If no one, perhaps the W3C Web Performance Working Group should take a look.

  • April 21, 2011 at 10:48 am

    I have 3 comments.
    The first one is, Facebook is using DigiCert, which uses 3 certs to verify facebook.com. I know StartSSL uses 2. I think I’ve seen some certs from Verisign, for example, use 2.
    But it might also be possible to leave out the root cert on your webserver, did you know that? Because that one is already in the browser.
    If you do that, you only need 2: the one from Facebook and the intermediate from, say, StartSSL.
    The second comment is, there seems to be a kind of issue with a corrupt tag in the previous comment from Steve Souders. So in a way, you have HTML-injection on your page. 🙁
    Steve Souders: the idea is that if you are on wifi at Starbucks you don’t fall victim to a FireSheep user.

  • April 21, 2011 at 10:50 am

    I’m now thinking, why can’t we just compress the certificates? SPDY compresses headers, so why not compress certificates?

  • April 21, 2011 at 10:39 pm

    @Steve Souders: I was testing the facebook login page, which is https://www.facebook.com/. Further, I have my Facebook user settings set to always use SSL (thank you Facebook!)

    How critical these issues are is hard to tell. Remember that SSL has two modes – the full handshake mode, and the session-resumption mode. Further, OCSP responses and CRL responses can be cached by the operating system. So, users don’t experience these properties every time they hit these sites. But, for major sites, this stuff adds up.

    The 3-certs in the response is completely common and normal. Remember, the browser only has a limited number of pre-configured, built-in roots. If your site’s public key is not signed directly by the root (which it likely isn’t), then you need to provide all of the intermediate CA certificates so that the browser can verify everything between your certificate (leaf) and the root. If you don’t, the browser will pop up a warning that it can’t verify your site.

    As for who is taking this on – the IETF has a whole group dedicated to TLS. Google will likely be coming to the IETF in the future with proposals for speeding it up. The CA infrastructure, however, is pretty deeply rooted in everything we do. I don’t know when (or if) it can realistically change.

  • April 21, 2011 at 10:43 pm

    @silentlennie: Leaving off that cert might work in the popular browsers, but it won’t work on older browsers or older mobile phones. And sadly, because SSL does not expose a user-agent, you can’t selectively send different cert chains to different browsers. For a really bad example of this, check out sites that have GoDaddy certificates. They’re sending like 5KB of data, with 2 certs that are (for modern browsers) two levels above the ‘root’! This whole multi-rooted-over-time CA hierarchy is a real problem.

    Your idea to compress the certs is a good one. One of my colleagues tested just that, and it does look like they are pretty compressible. It will require a protocol change, but we might be able to get a reasonable byte redux with compression.

    Another cute problem I saw was a site that had 50 different names embedded in one cert (www.foo.com, www.bar.com, etc.). While this enabled them to buy a single cert, it also made their cert quite fat.

  • Pingback:Getting SPDY With Your SSL and Node | FunctionSource Development

  • Pingback:How to Get a Small Cert Chain « Mike's Lookout

  • May 6, 2011 at 5:09 pm

    So is it safe if I get the SSL certificate from “s.static.ak.facebook.com”, which my browser says is not verified when I visit Facebook? Unlike the other certificate, which is verified by DigiCert. i.e. I’m getting the cert for “s.static.ak.facebook.com” from the Content Delivery Network “Akamai”, which is hired by Facebook.
    Link to the certificate:

  • June 6, 2011 at 3:42 am

    Surely the answer is for each browser vendor to aggregate revocations and updates from the various CA’s and provide that information via daily/hourly updates to the browser?
    Kind of like an aggregated RSS feed.
    Then the browser can hit a single site once per hour to get updated revocation lists, and that site can be heavily cached via CDN.

  • Pingback:Performance Calendar » Advice on Trusting Advice
