October 4, 2013
The problem was so bizarre that for a moment I suspected I was witnessing a man-in-the-middle attack using valid certificates. Many popular HTTPS websites, including Google and DuckDuckGo, but not all HTTPS websites, were taking up to 20 seconds to load. The delay occurred in all browsers, and according to Chromium's developer tools, it was occurring in the SSL (aka TLS) handshake. I was perplexed to see Google taking several seconds to complete the TLS handshake. Google employs TLS experts to squeeze every last drop of performance out of TLS, and uses the highly efficient elliptic curve Diffie-Hellman key exchange. It was comical to compare that to my own HTTPS server, which was handshaking in a fraction of a second, despite using stock OpenSSL and the more expensive discrete log Diffie-Hellman key exchange.
Not yet willing to conclude that it was a targeted man-in-the-middle attack that was affecting performance, I looked for alternative explanations. Instinctively, I thought this had the whiff of a DNS problem. After a slow handshake, there was always a brief period during which all handshakes were fast, even if I restarted the browser. This suggested to me that once a DNS record was cached, everything was fast until the cache entry expired. Since I run my own recursive DNS server locally, this hypothesis was easy to test by flushing my DNS cache. I found that flushing the DNS cache would consistently cause the next TLS handshake to be slow.
This didn't make much sense: using tools like host and dig, I could find no DNS problems with the affected domains, and besides, Chromium said the delay was in the TLS handshake. It finally dawned on me that the delay could be in the OCSP check. OCSP, or Online Certificate Status Protocol, is a mechanism for TLS clients to check if a certificate has been revoked. During the handshake, the client makes a request to the OCSP URI specified in the certificate to check its status. Since the URI would typically contain a hostname, a DNS problem could manifest here.
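To make the OCSP dependency concrete, here is a minimal Python sketch of how one might list the hostnames a client would have to resolve for the OCSP check. It assumes the certificate is available as the dict returned by the standard library's `ssl.SSLSocket.getpeercert()`, which exposes responder URIs under an optional `"OCSP"` key. The helper names `ocsp_hostnames` and `fetch_cert` are my own, chosen for illustration.

```python
import socket
import ssl
from urllib.parse import urlsplit


def ocsp_hostnames(cert: dict) -> list[str]:
    """Return the hostnames of the OCSP responders named in a certificate.

    `cert` is a dict in the shape returned by ssl.SSLSocket.getpeercert(),
    where the optional "OCSP" key holds a tuple of responder URIs.
    """
    hosts = []
    for uri in cert.get("OCSP", ()):
        host = urlsplit(uri).hostname
        if host and host not in hosts:  # skip duplicates, keep order
            hosts.append(host)
    return hosts


def fetch_cert(hostname: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Connect to an HTTPS server and return its validated leaf certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()
```

Calling `ocsp_hostnames(fetch_cert("example.com"))` (the hostname is a placeholder) would list the responder hosts for that site's certificate. Each of those names is one more DNS lookup sitting in the path of the handshake.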
I checked the certificates of the affected sites, and all of them specified OCSP URIs that ultimately resolved to ocsp.verisign.net. Upon investigation, I found that of the seven name servers listed for ocsp.verisign.net (ns100.nstld.net through ns106.nstld.net), only two of them (ns100.nstld.net and ns102.nstld.net) were returning a response to AAAA queries. The other five servers returned no response at all, not even a response to say that an AAAA record does not exist. This was very bad, since it meant any attempt to resolve an AAAA record for this host required the client to try again and wait until it timed out, leading to unsavory delays.
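For anyone who wants to reproduce this kind of check without dig, the following is a rough sketch using only Python's standard library. It hand-builds a non-recursive AAAA query and reports whether a given name server sends back any reply before a timeout. The function names, the fixed query ID, and the 3-second timeout are arbitrary choices of mine, not anything prescribed by a tool or standard.

```python
import socket
import struct

QTYPE_AAAA = 28  # DNS record type number for AAAA
QCLASS_IN = 1    # Internet class


def build_query(name: str, qtype: int = QTYPE_AAAA, query_id: int = 0x1234) -> bytes:
    """Encode a single-question, non-recursive DNS query for `name`."""
    # Header: ID, flags (zero = standard query, recursion not desired),
    # then QDCOUNT=1 and ANCOUNT, NSCOUNT, ARCOUNT all 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def answers_aaaa(server_ip: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if the server sends any reply to an AAAA query in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server_ip, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False  # silence: the broken behavior described above
    # Only count a reply whose ID matches the query we sent.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Note that `answers_aaaa` takes the name server's IPv4 address rather than its hostname, and it treats any matching reply as success. An empty answer that says "no AAAA record exists" is exactly the well-behaved response the five silent servers failed to send.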
If you're curious what an AAAA record is and why this matters, an AAAA record is the type of DNS record that maps a hostname to its IPv6 address. It's the IPv6 equivalent to the A record, which maps a hostname to its IPv4 address. While the Internet is transitioning from IPv4 to IPv6, hosts are expected to be dual-stacked, meaning they have both an IPv4 and an IPv6 address. When one system talks to another, it prefers IPv6, and falls back to IPv4 only if the peer doesn't support IPv6. To figure this out, the system first attempts an AAAA lookup, and if no AAAA record exists, it tries an A record lookup. So, when a name server does not respond to AAAA queries, not even with a response to say no AAAA record exists, the client has to wait until it times out before trying the A record lookup, causing the delays I was experiencing here. Cisco has a great article that goes into more depth about broken name servers and AAAA records.
(Note: the exact mechanics vary between operating systems. The Linux resolver tries AAAA lookups even if the system doesn't have IPv6 connectivity, meaning that even IPv4-only users experience these delays. Other operating systems might only attempt AAAA lookups if the system has IPv6 connectivity, which would mitigate the scope of this issue.)
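The arithmetic behind the delay can be sketched with a toy model. It deliberately simplifies: a single name server, the sequential AAAA-then-A behavior described above, and a fixed cost per unanswered attempt. The default of a 5-second timeout and 2 attempts mirrors the usual glibc stub resolver defaults, but treat those numbers, the 50 ms round trip, and the function name `lookup_delay` as assumptions for illustration. A recursive resolver juggling seven name servers will behave differently in detail.

```python
def lookup_delay(aaaa_answered: bool, timeout: float = 5.0,
                 attempts: int = 2, rtt: float = 0.05) -> float:
    """Estimate seconds spent resolving a name when AAAA is tried before A.

    Simplified model: a silent server costs `timeout` seconds per attempt,
    while an answered query costs one round trip (`rtt`).
    """
    # A prompt "no such record" reply costs one round trip; silence costs
    # the full timeout for every attempt before the client gives up.
    aaaa_cost = rtt if aaaa_answered else timeout * attempts
    a_cost = rtt  # the A query is assumed to be answered promptly
    return aaaa_cost + a_cost
```

With these assumed defaults, `lookup_delay(True)` comes to a tenth of a second while `lookup_delay(False)` comes to roughly ten seconds, all of it spent waiting for a reply that never arrives.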
A History of Brokenness
This is apparently not the first time Verisign's servers have had problems: A year ago, the name servers for ocsp.verisign.net exhibited the same broken behavior:
"The unofficial response from Verisign was that the queries are being handled by a GSLB, which apparently means that we should not expect it to behave correctly."
"GSLB" means "Global Server Load Balancing" and I interpret that statement to mean Verisign is using an expensive DNS appliance to answer queries instead of software running on a conventional server. The snarky comment about such appliances rings true for me. Last year, I noticed that my alma mater's website was taking 30 seconds to load. I tracked the problem down to the exact same issue: the DNS servers for brown.edu were not returning any response to AAAA queries. In the process of reporting this to Brown's IT department, I learned that they were using buggy and overpriced-looking DNS appliances from F5 Networks, which, by default, do not properly respond to AAAA queries under circumstances that appear to be common enough to cause real problems. To fix the problem, the IT people had to manually configure every single DNS record individually to properly reply to AAAA queries.
I find it totally unconscionable for a DNS appliance vendor to be shipping a product with such broken behavior, which causes serious delays for users and gives IPv6 a bad reputation. It is similarly outrageous for Verisign to be operating broken DNS servers that are in the critical path for an untold number of TLS handshakes. That gives HTTPS a bad reputation, and gives ammunition to the people who say that HTTPS is too slow. It's truly unfortunate that even if you're Google and do everything right with IPv6, DNS, and TLS, your handshake speeds are still at the mercy of incompetent certificate authorities like Verisign.
I worked around this issue by disabling OCSP (in Firefox, set security.OCSP.enabled to 0 in about:config). While OCSP may theoretically be good for security, since it enables browsers to reject certificates that have been compromised and revoked, in practice it's a total mess. Since OCSP servers are often unreliable or are blocked by restrictive firewalls, browsers don't treat OCSP errors as fatal by default. Thus, an active attacker who is using a revoked certificate to man-in-the-middle HTTPS connections can simply block access to the OCSP server and the browser will accept the revoked certificate. Frankly, OCSP is better at protecting certificate authorities' business model than protecting users' security, since it allows certificate authorities to revoke certificates for things like credit card chargebacks. As if this wasn't bad enough already, OCSP introduces a minor privacy leak because it reports every HTTPS site you visit to the certificate authority. Google Chrome doesn't even use OCSP anymore because it is so dysfunctional.
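The soft-fail weakness is easy to see when written out. The sketch below is a simplified model of the policy described above, not any browser's actual code. The `OcspResult` enum, the `accept_certificate` function, and the `hard_fail` flag are hypothetical names I made up for illustration.

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNREACHABLE = "unreachable"  # timeout, DNS failure, blocked by firewall


def accept_certificate(result: OcspResult, hard_fail: bool = False) -> bool:
    """Decide whether to accept a certificate given the OCSP outcome."""
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No answer from the responder: soft-fail accepts, hard-fail rejects.
    return not hard_fail
```

Under the default soft-fail policy, `accept_certificate(OcspResult.UNREACHABLE)` returns `True`. An attacker holding a revoked certificate therefore only needs to turn a "revoked" answer into no answer at all, which is trivial for anyone already positioned in the middle of the connection.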
While I was writing this blog post, Verisign fixed their DNS servers and now every single one is returning a proper response to AAAA queries. I know for sure their servers were broken for at least two days. I suspect it was longer considering the slowness was happening for quite some time before I finally investigated.
Posted on 2013-10-04 at 01:26:54 UTC