
October 17, 2014

Renewing an SSL Certificate Without Even Logging in to My Server

Yesterday I renewed the about-to-expire SSL certificate for one of my websites. I did so without running a single command, filling out a single form, taking out my credit card, or even logging into any server. All I did was open a link in an email and click a single button. Soon after, my website was serving a renewed certificate to visitors.

It's all thanks to the auto-renewal feature offered by my SSL certificate management startup, SSLMate, which I finally had a chance to dogfood yesterday with a real, live expiring certificate.

There are two halves to auto-renewals. The first, relatively straightforward half, is that the SSLMate service is constantly monitoring the expiration dates of my certificates. Shortly before a certificate expires, SSLMate requests a new certificate from a certificate authority. As with any SSL certificate purchase, the certificate authority emails me a link which I have to click to confirm that I still control the domain. Once I've done that, the certificate authority delivers the new certificate to SSLMate, and SSLMate stores it in my SSLMate account. SSLMate charges the credit card already on file in my account, just like any subscription service.

But a certificate isn't very useful sitting in my SSLMate account. It needs to be installed on my web servers so that visitors are served with the new certificate. This is the second half of auto-renewals. If I were using a typical SSL certificate vendor, this step would probably involve downloading an email attachment, unzipping it, correctly assembling the certificate bundle, installing it on each of my servers, and restarting my services. This is tedious, and worse, it's error-prone. And as someone who believes strongly in devops, it offends my sensibilities.

Fortunately, SSLMate isn't a typical SSL certificate vendor. SSLMate comes with a command line tool to help manage your SSL certificates. Each of my servers runs the following script daily (technically, it runs from a configuration management system, but it could just as easily run from a cron job in /etc/cron.daily):


if sslmate download --all
then
	service apache2 restart
	service titus restart
fi

exit 0

The sslmate download command downloads certificates from my SSLMate account to the /etc/sslmate directory on the server. --all tells sslmate to look at the private keys in /etc/sslmate and download the corresponding certificate for each one (alternatively, I could explicitly specify certificate common names on the command line). sslmate download exits with a zero status code if new certificates were downloaded, or a non-zero status if the certificates were already up-to-date. The if condition tests for a zero exit code, and restarts my SSL-using services if new certificates were downloaded.

On most days of the year, this script does nothing, since my certificates are already up-to-date. But on days when a certificate has been renewed, it comes to life, downloading new certificate files (not just the certificate, but also the chain certificate) and restarting services so that they use the new files. This is devops at its finest, applied to SSL certificate renewals for the first time.

You may be wondering - is it a good idea to do this unattended? I think so. First, the risk of installing a broken certificate is very low, and certainly lower than when an installation is done by hand. Since SSLMate takes care of assembling the certificate bundle for you, there's no risk of forgetting to include the chain certificate or including the wrong one. Chain certificate problems are notoriously difficult to debug, since they don't materialize in all browsers. While tools such as SSL Labs are invaluable for verifying certificate installation, they can only tell you about a problem after you've installed a bad certificate, which is too late to avoid downtime. Instead, it's better to automate certificate installation to eliminate the possibility of human error.

I'm also unconcerned about restarting Apache unattended. sslmate download contains a failsafe that refuses to install a new certificate if it doesn't match the private key, ensuring that it won't install a certificate that would prevent Apache from starting. And I haven't done anything reckless with my Apache configuration that might make restarts unreliable. Besides, it's already essential to be comfortable with restarting system services, since you may be required to restart a service at any time in response to a security update.

One more thing: this certificate wasn't originally purchased through SSLMate, yet SSLMate was able to renew it, thanks to SSLMate's upcoming import feature. A new command, sslmate import, will let you import your existing certificates to your SSLMate account. Once imported, you can set up your auto-renewal cron job, and you'll be all set when your certificates begin to expire. And you'll be charged only when a certificate renews; importing certificates will be free.

sslmate import is in beta testing and will be released soon. If you're interested in taking part in the beta, shoot us an email. Also consider subscribing to the SSLMate blog or following @SSLMate on Twitter so you get future announcements - we have a lot of exciting development in the pipeline.


September 30, 2014

CloudFlare: SSL Added and Removed Here :-)

One of the more infamous leaked NSA documents was a slide showing a hand-drawn diagram of Google's network architecture, with the comment "SSL added and removed here!" along with a smiley face, written underneath the box for Google's front-end servers.

"SSL added and removed here! :-)"

The point of the diagram was that although Google tried to protect their users' privacy by using HTTPS to encrypt traffic between web browsers and their front-end servers, they let traffic travel unencrypted between their datacenters, so their use of HTTPS was ultimately no hindrance to the NSA's mass surveillance of the Internet.

Today the NSA can draw a new diagram, and this time, SSL will be added and removed not by a Google front-end server, but by a CloudFlare edge server. That's because, although CloudFlare has taken the incredibly generous and laudable step of providing free HTTPS by default to all of their customers, they are not requiring the connection between CloudFlare and the origin server to be encrypted (they call this "Flexible SSL"). So although many sites will now support HTTPS thanks to CloudFlare, by default traffic to these sites will be encrypted only between the user and the CloudFlare edge server, leaving plenty of opportunity for the connection to be eavesdropped beyond the edge server.

One could argue that encrypting even part of the connection path is better than the status quo, which provides no encryption at all. I disagree, because CloudFlare's Flexible SSL will lull website visitors into a false sense of security: these partially-encrypted connections appear to the browser as normal HTTPS connections, padlock and all. No distinction whatsoever is made between a connection that's protected all the way to the origin and a connection that's protected only part of the way. Providing a false sense of security is often worse than providing no security at all, and I find the security of Flexible SSL to be quite lacking. That's because CloudFlare aims to put edge nodes as close to the visitor as possible, which minimizes latency, but also minimizes the portion of an HTTPS connection that is actually encrypted. So although Flexible SSL will protect visitors against malicious local ISPs and attackers snooping on coffee shop Wi-Fi, it provides little protection against nation-state adversaries. This point is underscored by a map of CloudFlare's current and planned edge locations, which shows a presence in 37 different countries, including China. China has an abysmal human rights record and pervasive Internet surveillance, which is troubling because CloudFlare explicitly mentions human rights organizations as a motivation for deploying HTTPS everywhere:

Every byte, however seemingly mundane, that flows encrypted across the Internet makes it more difficult for those who wish to intercept, throttle, or censor the web. In other words, ensuring your personal blog is available over HTTPS makes it more likely that a human rights organization or social media service or independent journalist will be accessible around the world.

It's impossible for Flexible SSL to protect a website of a human rights organization from interception, throttling, or censoring when the connection to that website travels unencrypted through the Great Firewall of China. What's worse is that CloudFlare includes the visitor's original IP address in the request headers to the origin server, which of course is unencrypted when using Flexible SSL. A nation-state adversary eavesdropping on Internet traffic will therefore see not only the URL and content of a page, but also the IP address of the visitor who requested it. This is exactly the same situation as unencrypted HTTP, yet as far as the visitor can tell, the connection is using HTTPS, with a padlock icon and an https:// URL.

It is true that HTTPS has never guaranteed the security of a connection behind the host that terminates the SSL, and it's already quite common to terminate SSL in a front-end host and forward unencrypted traffic to a back-end server. However, in almost all instances of this architecture, the SSL terminator and the back-end are in the same datacenter on the same network, not in different countries on opposite sides of the world, with unencrypted connections traveling over the public Internet. Furthermore, an architecture where unencrypted traffic travels a significant distance behind an SSL terminator should be considered something to fix, not something to excuse or encourage. For example, after the Google NSA slide was released, Google accelerated their plans to encrypt all inter-datacenter traffic. In doing so, they strengthened the value of HTTPS. CloudFlare, on the other hand, is diluting the value of HTTPS, and in astonishing numbers: according to their blog post, they are doubling the number of HTTPS sites on the Internet from 2 million to 4 million. That means that after today, one in two HTTPS websites will be using encryption where most of the connection path is actually unencrypted.

Fortunately, CloudFlare has an alternative to Flexible SSL which is free and provides encryption between CloudFlare and the origin, which they "strongly recommend" site owners enable. Unfortunately, it requires manual action on the part of website operators, and getting users to follow security recommendations, even when strongly recommended, is like herding cats. The vast majority of those 2 million website operators won't do anything, especially when their sites already appear to be using HTTPS and thus benefit from the main non-security motivation for HTTPS, which is preference in Google search rankings.

This is a difficult problem. CloudFlare should be commended for tackling it and for their generosity in making their solution free. However, they've only solved part of the problem, and this is an instance where half measures are worse than no measures at all. CloudFlare should abolish Flexible SSL and make setting up non-Flexible SSL easier. In particular, they should hurry up the rollout of the "CloudFlare Origin CA," and instead of requiring users to submit a CSR to be signed by CloudFlare, they should let users download, in a single click, both a private key and a certificate to be installed on their origin servers. (Normally I'm averse to certificate authorities generating private keys for their users, but in this case, it would be a private CA used for nothing but the connection between CloudFlare and the origin, so it would be perfectly secure for CloudFlare to generate the private key.)

If CloudFlare continues to offer Flexible SSL, they should at least include an HTTP header in the response indicating that the connection was not encrypted all the way to the origin. Ideally, this would be standardized and web browsers would not display the same visual indication as proper HTTPS connections if this header is present. Even without browser standardization, the header could be interpreted by a browser extension that could be included in privacy-conscious browser packages such as the Tor Browser Bundle. This would provide the benefits of Flexible SSL without creating a false sense of security, and help fulfill CloudFlare's stated goal to build a better Internet.
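Such a header might look like the following. (The header name here is purely hypothetical - neither CloudFlare nor any standard defines it; it just illustrates the idea of the response declaring how far the encryption actually extends.)

```
HTTP/1.1 200 OK
Content-Type: text/html
X-Forward-Encryption: none
```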


September 6, 2014

SHA-1 Certificate Deprecation: No Easy Answers

Google recently announced that they will be phasing out support for SHA-1 SSL certificates in Chrome, commencing almost immediately. Although Microsoft was the first to announce the deprecation of SHA-1 certificates, Google's approach is much more aggressive than Microsoft's, and will start treating SHA-1 certificates differently well before the January 1, 2017 deadline imposed by Microsoft. A five year SHA-1 certificate purchased after January 1, 2012 will be treated by Chrome as "affirmatively insecure" starting in the first half of 2015.

This has raised the hackles of Matthew Prince, CEO of CloudFlare. In a comment on Hacker News, Matthew cites the "startling" number of browsers that don't support SHA-2 certificates (namely, pre-SP3 Windows XP and pre-2.3 Android) and expresses his concern that the aggressive deprecation of SHA-1 will lead to organizations declining to support HTTPS. This comment resulted in a very interesting exchange between him and Adam Langley, TLS expert and security engineer at Google, who, as you'd expect, supports Google's aggressive deprecation plan.

Matthew raises legitimate concerns. We're at a unique point in history: there is incredible momentum behind converting sites to HTTPS, even sites that traditionally would not have used HTTPS, such as entirely static sites. The SHA-1 deprecation might throw a wrench into this and cause site operators to reconsider switching to HTTPS. Normally I have no qualms with breaking some compatibility eggs to make a better security omelette, but I'm deeply ambivalent about the timeframe of this deprecation. Losing the HTTPS momentum would be incredibly sad, especially since switching to HTTPS provides an immediate defense against large-scale passive eavesdropping.

Of course, Adam raises a very good point when he asks "if Microsoft's 2016/2017 deadline is reckless, what SHA-1 deprecation date would be right by your measure?" In truth, the Internet should have already moved away from SHA-1 certificates. Delaying the deprecation further hardly seems like a good idea.

Ultimately, there may be no good answer to this question, and it's really just bad luck and bad timing that this needs to happen right when HTTPS is picking up momentum.

This affects me as more than just a site operator, since I resell SSL certificates over at SSLMate. Sadly, SSLMate's upstream certificate authority, RapidSSL, does not currently support SHA-2 certificates, and has not provided a definite timeframe for adding support. RapidSSL is not alone: Gandi does not support SHA-2 either, and GoDaddy's SHA-2 support is purportedly a little bumpy. The fact that certificate authorities are not all ready for this change makes Google's aggressive deprecation schedule all the more stressful. On the other hand, I expect RapidSSL to add SHA-2 support soon in response to Google's announcement. If I'm correct, it will certainly show the upside of an aggressive deprecation in getting lethargic players to act.

In the meantime, SSLMate will continue to sell SHA-1 certificates, though it will probably stop selling certificates that are valid for more than one or two years. Switching to a certificate authority that already supports SHA-2 is out of the question, since they are either significantly more expensive or take a long time to issue certificates, which doesn't work with SSLMate's model of real time purchases from the command line. When RapidSSL finally adds SHA-2 support, SSLMate customers will be able to replace their existing SHA-1 certificates for free, and SSLMate will do its best to make this process as easy as possible.

Speaking of certificate lifetimes, Adam Langley made the case in the Hacker News thread that site operators should purchase certificates that last only a year. I agree heartily. In addition to Adam's point that short-lived certificates insulate site operators from changes like the SHA-1 deprecation, I'd like to add that they're more secure because certificate revocation doesn't really work. If your private key is compromised, you're not truly safe until your certificate expires, so the shorter the lifetime the better. The main argument against short-lived certificates has always been that they're really inconvenient, so I'm happy to say that at SSLMate I'm working on some very exciting features that will make yearly certificate renewals extremely easy. Stay tuned for an announcement next week.


August 12, 2014

STARTTLS Considered Harmful

There are two ways that otherwise plain text protocols can provide encryption with TLS. The first way is to listen on two ports: one port that is always plain text, and a second port that is always encrypted with TLS. The other way is to use a single port on which communication starts out unencrypted, but can be "upgraded" to a TLS encrypted connection using an application-level command specific to the protocol. HTTP/HTTPS uses exclusively the first approach, with ports 80 and 443. The second approach, called STARTTLS, is used by SMTP, XMPP, IMAP, and POP3, though several of those protocols also support the first approach.

There's a clear bias for STARTTLS in the IETF's email standards. The use of alternative TLS-only ports for IMAP, POP3, and SMTP was never formally standardized: people just started doing it that way, and although port numbers were registered for the purpose, the registration of the encrypted SMTP (SMTPS) port (465) was later rescinded. When the IETF finally standardized the use of TLS with IMAP and POP3 in 1999, they prescribed the use of STARTTLS and gave several reasons why STARTTLS should be used instead of an alternative TLS-only port. Briefly, the reasons are:

  1. Separate ports lead to a separate URL scheme, which means the user has to choose between them. The software is often more capable of making this choice than the user.
  2. Separate ports imply a model of either "secure" or "not secure," which can be misleading. For example, the "secure" port might be insecure because it's using export-crippled ciphers, or the normal port might be using a SASL mechanism which includes a security layer.
  3. Separate ports have caused clients to implement only two security policies: use TLS or don't use TLS. The desirable security policy "use TLS when available" would be cumbersome with the separate port model, but is simple with STARTTLS.
  4. Port numbers are a limited resource.

Except for reason four, these reasons are pretty terrible. Reason one is not very true: unless the software keeps a database of hosts which should use TLS, the software is incapable of making the choice between TLS and non-TLS on behalf of the user without being susceptible to active attacks. (Interestingly, web browsers have recently started keeping a database of HTTPS-only websites with HSTS preload lists, but this doesn't scale.)
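For reference, a site opts into those HSTS preload lists by serving the standard Strict-Transport-Security response header with the preload token over HTTPS:

```
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
```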

Reason three is similarly dubious because "use TLS when available" is also susceptible to active attacks. (If the software detects that TLS is not available, it doesn't know if that's because the server doesn't support it or if it's because an active attacker is blocking it.)

Reason two may have made some sense in 1999, but it certainly doesn't today. The export cipher concern was mooted when export controls were lifted in 2000, leading to the demise of export-crippled ciphers. I have no idea how viable SASL security layers were in 1999, but in the last ten years TLS has clearly won.

So STARTTLS is really no better than using an alternative TLS-only port. But that's not all. There are several reasons why STARTTLS is actually worse for security.

The first reason is that STARTTLS makes it impossible to terminate TLS in a protocol-agnostic way. It's trivial to terminate a separate-port protocol like HTTPS in a software proxy like titus or in a hardware load balancer: you simply accept a TLS connection and proxy the plain text stream to the backend's non-TLS port. Terminating a STARTTLS protocol, on the other hand, requires the TLS terminator to understand the protocol being proxied, so it can watch for the STARTTLS command and only upgrade to TLS once the command is sent. Supporting IMAP/POP3/SMTP isn't too difficult since they are simple line-based text protocols. (Though you have to be careful - you don't want the TLS terminator to misfire if it sees the string "STARTTLS" inside the body of an email!) XMPP, on the other hand, is an XML-based protocol, and do you really want your TLS terminator to contain an XML parser?
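To make that state tracking concrete, here is a minimal sketch (in Python, purely illustrative - not how titus actually works, and a real proxy must also track server responses) of the one decision a STARTTLS-aware SMTP terminator has to get right: "STARTTLS" is only a command outside the DATA phase, so the same string inside a message body must be ignored.

```python
# Sketch: decide at which line an SMTP proxy should begin the TLS
# handshake.  The proxy must track whether it is inside a DATA phase,
# because "STARTTLS" in a message body is just text, not a command.
def find_tls_upgrade(lines):
    in_data = False
    for i, line in enumerate(lines):
        if in_data:
            if line == ".":                      # lone dot ends the body
                in_data = False
        elif line.upper() == "STARTTLS":
            return i                             # upgrade after this line
        elif line.upper().startswith("DATA"):
            in_data = True
    return None

session = ["EHLO client", "MAIL FROM:<a@example.com>",
           "RCPT TO:<b@example.com>", "DATA",
           "STARTTLS",                           # inside the body: ignored
           ".",
           "STARTTLS"]                           # an actual command
print(find_tls_upgrade(session))                 # → 6, not 4
```

An HTTPS terminator needs none of this; it starts the handshake on byte zero of every connection.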

I care about this because I'd like to terminate TLS for my SMTP, IMAP, and XMPP servers in the highly-sandboxed environment provided by titus, so that a vulnerability in the TLS implementation can't compromise the state of my SMTP, IMAP, and XMPP servers. STARTTLS makes it needlessly difficult to do this.

Another way that STARTTLS harms security is by adding complexity. Complexity is a fertile source of security vulnerabilities. Consider CVE-2011-0411, a vulnerability caused by SMTP implementations failing to discard SMTP commands pipelined with the STARTTLS command. This vulnerability allowed attackers to inject SMTP commands that would be executed by the server during the phase of the connection that was supposed to be protected with TLS. Such a vulnerability is impossible when the connection uses TLS from the beginning.
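A toy model of that bug, assuming a simplified server that reads a whole packet into a buffer before the handshake: the fix for CVE-2011-0411 amounts to discarding any plaintext bytes that arrived pipelined after the STARTTLS command, rather than processing them once the "encrypted" phase begins.

```python
# Toy model of CVE-2011-0411.  An attacker pipelines an extra command in
# the same packet as STARTTLS; a vulnerable server keeps the leftover
# plaintext in its buffer and executes it after the TLS handshake, as if
# it had been sent encrypted.
def commands_after_handshake(buf, discard_after_starttls):
    before, sep, after = buf.partition(b"STARTTLS\r\n")
    # ... TLS handshake happens here; 'leftover' is what the server
    # will process first once the encrypted phase begins.
    leftover = b"" if discard_after_starttls else after
    return leftover

packet = b"STARTTLS\r\nMAIL FROM:<attacker@example.com>\r\n"

# Vulnerable: the injected plaintext command survives into the TLS phase.
print(commands_after_handshake(packet, discard_after_starttls=False))
# → b'MAIL FROM:<attacker@example.com>\r\n'

# Fixed: pipelined plaintext is discarded at the upgrade boundary.
print(commands_after_handshake(packet, discard_after_starttls=True))
# → b''
```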

STARTTLS also adds another potential avenue for a protocol downgrade attack. An active attacker can strip out the server's advertisement of STARTTLS support, and a poorly-programmed client would fall back to using the protocol without TLS. Although it's trivial for a properly-programmed client to protect against this downgrade attack, there are already enough ways for programmers to mess up TLS client code and it's a bad idea to add yet another way. It's better to avoid this pitfall entirely by connecting to a port that talks only TLS.
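A sketch of the correct client-side check (the EHLO responses here are illustrative): a client configured to require TLS must treat a missing STARTTLS advertisement as an attack and abort, never as permission to continue in plain text.

```python
# Sketch: how a client should react to the server's EHLO capability
# list.  An active attacker can strip the STARTTLS line; the only safe
# response to its absence is to abort.
def negotiate(ehlo_response_lines, require_tls=True):
    offers_starttls = any(l.upper().endswith("STARTTLS")
                          for l in ehlo_response_lines)
    if offers_starttls:
        return "starttls"
    if require_tls:
        raise RuntimeError("server did not offer STARTTLS; aborting")
    return "plaintext"   # the insecure fallback a poorly-written client takes

normal   = ["250-mail.example.com", "250-PIPELINING", "250 STARTTLS"]
stripped = ["250-mail.example.com", "250 PIPELINING"]   # attacker-modified

print(negotiate(normal))     # → starttls
try:
    negotiate(stripped)
except RuntimeError:
    print("aborted")         # a correct client refuses to proceed
```

With a TLS-only port this entire decision point, and the chance to get it wrong, simply doesn't exist.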

Fortunately, despite the IETF's recommendation to use STARTTLS and the rescinding of the SMTPS port assignment, IMAP, POP3, and SMTP on dedicated TLS ports are still widely supported by both server and client email implementations, so you can easily avoid STARTTLS with these protocols. Unfortunately, the SMTPS port is only used for the submission of authenticated mail by mail clients. Opportunistic encryption between SMTP servers, which is extremely important for preventing passive eavesdropping of email, requires STARTTLS on port 25. And modern XMPP implementations support only STARTTLS.

Moving forward, this shouldn't even be a question for new protocols. To mitigate pervasive monitoring, new protocols should have only secure versions. They can be all TLS all the time. No need to choose between using STARTTLS and burning an extra port number. I just wish something could be done about the existing STARTTLS-only protocols.


July 13, 2014

LibreSSL's PRNG is Unsafe on Linux [Update: LibreSSL fork fix]

The first version of LibreSSL portable, 2.0.0, was released a few days ago (followed soon after by 2.0.1). Despite the 2.0.x version numbers, these are only preview releases and shouldn't be used in production yet, but have been released to solicit testing and feedback. After testing and examining the codebase, my feedback is that the LibreSSL PRNG is not robust on Linux and is less safe than the OpenSSL PRNG that it replaced.

Consider a test program, fork_rand. When linked with OpenSSL, two different calls to RAND_bytes return different data, as expected:

$ cc -o fork_rand fork_rand.c -lcrypto
$ ./fork_rand
Grandparent (PID = 2735) random bytes = f05a5e107f5ec880adaeead26cfff164e778bab8e5a44bdf521e1445a5758595
Grandchild (PID = 2735) random bytes = 03688e9834f1c020765c8c5ed2e7a50cdd324648ca36652523d1d71ec06199de

When the same program is linked with LibreSSL, two different calls to RAND_bytes return the same data, which is a catastrophic failure of the PRNG:

$ cc -o fork_rand fork_rand.c libressl-2.0.1/crypto/.libs/libcrypto.a -lrt
$ ./fork_rand
Grandparent (PID = 2728) random bytes = f5093dc49bc9527d6d8c3864be364368780ae1ed190ca0798bf2d39ced29b88c
Grandchild (PID = 2728) random bytes = f5093dc49bc9527d6d8c3864be364368780ae1ed190ca0798bf2d39ced29b88c

The problem is that LibreSSL provides no way to safely use the PRNG after a fork. Forking and PRNGs are a thorny issue - since fork() creates a nearly-identical clone of the parent process, a PRNG will generate identical output in the parent and child processes unless it is reseeded. LibreSSL attempts to detect when a fork occurs by checking the PID (see line 122). If it differs from the last PID seen by the PRNG, it knows that a fork has occurred and automatically reseeds.

This works most of the time. Unfortunately, PIDs are typically only 16 bits long and thus wrap around fairly often. And while a process can never have the same PID as its parent, a process can have the same PID as its grandparent. So a program that forks from a fork risks generating the same random data as the grandparent process. This is what happens in the fork_rand program, which repeatedly forks from a fork until it gets the same PID as the grandparent.
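The failure mode can be modeled without OpenSSL at all. This toy Python PRNG (not LibreSSL's actual code) reseeds only when it notices a PID change, with deepcopy standing in for fork():

```python
# Toy model of PID-based fork detection: the PRNG reseeds only when the
# current PID differs from the PID it saw on the previous call, so a
# grandchild that wraps around to the grandparent's PID goes undetected.
import copy
import hashlib

class PidCheckedPrng:
    def __init__(self, seed, pid):
        self.state = seed
        self.last_pid = pid

    def random_bytes(self, pid):
        if pid != self.last_pid:           # "fork detected": reseed
            self.state += b"fresh entropy"
            self.last_pid = pid
        self.state = hashlib.sha256(self.state).digest()
        return self.state

grandparent = PidCheckedPrng(b"initial entropy", pid=2728)

# fork() duplicates the process, PRNG state and all.  After the PID
# counter wraps around, the grandchild is also PID 2728.
grandchild = copy.deepcopy(grandparent)

print(grandparent.random_bytes(pid=2728) == grandchild.random_bytes(pid=2728))
# → True: identical "random" output, exactly the fork_rand failure

# With a PID the PRNG hasn't seen before, the reseed fires as intended:
child = copy.deepcopy(grandparent)
print(grandparent.random_bytes(pid=2728) == child.random_bytes(pid=2729))
# → False
```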

OpenSSL faces the same issue. It too attempts to be fork-safe, by mixing the PID into the PRNG's output, which works as long as PIDs don't wrap around. The difference is that OpenSSL provides a way to explicitly reseed the PRNG by calling RAND_poll. LibreSSL, unfortunately, has turned RAND_poll into a no-op (lines 77-81). fork_rand calls RAND_poll after forking, as do all my OpenSSL-using programs in production, which is why fork_rand is safe under OpenSSL but not LibreSSL.

You may think that fork_rand is a contrived example or that it's unlikely in practice for a process to end up with the same PID as its grandparent. You may be right, but for security-critical code this is not a strong enough guarantee. Attackers often find extremely creative ways to manufacture scenarios favorable for attacks, even when those scenarios are unlikely to occur under normal circumstances.

Bad chroot interaction

A separate but related problem is that LibreSSL provides no good way to use the PRNG from a process running inside a chroot jail. Under Linux, the PRNG is seeded by reading from /dev/urandom upon the first use of RAND_bytes. Unfortunately, /dev/urandom usually doesn't exist inside chroot jails. If LibreSSL fails to read entropy from /dev/urandom, it first tries to get random data using the deprecated sysctl syscall, and if that fails (which will start happening once sysctl is finally removed), it falls back to a truly scary-looking function (lines 306-517) that attempts to get entropy from sketchy sources such as the PID, time of day, memory addresses, and other properties of the running process.

OpenSSL is safer for two reasons:

  1. If OpenSSL can't open /dev/urandom, RAND_bytes returns an error code. Of course the programmer has to check the return value, which many probably don't, but at least OpenSSL allows a competent programmer to use it securely, unlike LibreSSL which will silently return sketchy entropy to even the most meticulous programmer.
  2. OpenSSL allows you to explicitly seed the PRNG by calling RAND_poll, which you can do before entering the chroot jail, avoiding the need to open /dev/urandom once in the jail. Indeed, this is how titus ensures it can use the PRNG from inside its highly-isolated chroot jail. Unfortunately, as discussed above, LibreSSL has turned RAND_poll into a no-op.

What should LibreSSL do?

First, LibreSSL should raise an error if it can't get a good source of entropy. It can do better than OpenSSL by killing the process instead of returning an easily-ignored error code. In fact, there is already a disabled code path in LibreSSL (lines 154-156) that does this. It should be enabled.

Second, LibreSSL should make RAND_poll reseed the PRNG as it does under OpenSSL. This will allow the programmer to guarantee safe and reliable operation after a fork and inside a chroot jail. This is especially important as LibreSSL aims to be a drop-in replacement for OpenSSL. Many properly-written programs have come to rely on OpenSSL's RAND_poll behavior for safe operation, and these programs will become less safe when linked with LibreSSL.

Unfortunately, when I suggested the second change on Hacker News, a LibreSSL developer replied:

The presence or need for a [RAND_poll] function should be considered a serious design flaw.

I agree that in a perfect world, RAND_poll would not be necessary, and that its need is evidence of a design flaw. However, it is evidence of a design flaw not in the cryptographic library, but in the operating system. Unfortunately, Linux provides no reliable way to detect that a process has forked, and exposes entropy via a device file instead of a system call. LibreSSL has to work with what it's given, and on Linux that means RAND_poll is an unfortunate necessity.


If the LibreSSL developers don't fix RAND_poll, and you want your code to work safely with both LibreSSL and OpenSSL, then I recommend putting the following code after you fork or before you chroot (i.e. anywhere you would currently need RAND_poll):

unsigned char c;
if (RAND_poll() != 1) {
	/* handle error */
}
if (RAND_bytes(&c, 1) != 1) {
	/* handle error */
}
In essence, always follow a call to RAND_poll with a request for one random byte. The RAND_bytes call will force LibreSSL to seed the PRNG if it's not already seeded, making it unnecessary to later open /dev/urandom from inside the chroot jail. It will also force LibreSSL to update the last seen PID, fixing the grandchild PID issue. (Edit: the LibreSSL PRNG periodically re-opens and re-reads /dev/urandom to mix in additional entropy, so unfortunately this won't avoid the need to open /dev/urandom from inside the chroot jail. However, as long as you have a good initial source of entropy, mixing in the sketchy entropy later isn't terrible.)

I really hope it doesn't come to this. Programming with OpenSSL already requires dodging numerous traps and pitfalls, often by deploying obscure workarounds. The LibreSSL developers, through their well-intended effort to eliminate the pitfall of forgetting to call RAND_poll, have actually created a whole new pitfall with its own obscure workaround.

Update (2014-07-16 03:33 UTC): LibreSSL releases fix for fork issue

LibreSSL has released a fix for the fork issue! (Still no word on the chroot/sketchy entropy issue.) Their fix is to use pthread_atfork to register a callback that reseeds the PRNG when fork() is called. Thankfully, they've made this work without requiring the program to link with -lpthread.

I have mixed feelings about this solution, which was discussed in a sub-thread on Hacker News. The fix is a huge step in the right direction but is not perfect - a program that invokes the clone syscall directly will bypass the atfork handlers (Hacker News commenter colmmacc suggests some legitimate reasons a program might do this). I still wish that LibreSSL would, in addition to implementing this solution, just expose an explicit way for the programmer to reseed the PRNG when unusual circumstances require it. This is particularly important since OpenSSL provides this facility and LibreSSL is meant to be a drop-in OpenSSL replacement.

Finally, though I was critical in this blog post, I really appreciate the work the LibreSSL devs are doing, especially their willingness to solicit feedback from the community and act on it. (I also appreciate their willingness to make LibreSSL work on Linux, which, despite being a Linux user, I will readily admit is lacking in several ways that make a CSPRNG implementation difficult.) Ultimately their work will lead to better security for everyone.


June 27, 2014

IPv6 Broken, Buggy DNS to Blame

I've previously discussed the problems caused by buggy DNS servers that don't implement IPv6-related queries properly. The worst problem I've faced is that, by default, F5 Networks' "BIG-IP GTM" appliance doesn't respond to AAAA queries for certain records, causing lengthy delays as IPv6-capable resolvers continue to send AAAA queries before finally timing out.

Now is exhibiting broken AAAA behavior, and it looks like an F5 "GTM" appliance may be to blame. is a CNAME for, which is itself a CNAME for (which is in turn a CNAME for another CNAME, but that's not important here). The nameservers ({ns1-bn, ns1-qy, ns2-bn, ns2-qy}), contrary to the DNS spec, do not return the CNAME record when queried for the AAAA record.

$ dig AAAA

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> AAAA
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44411
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;				IN	AAAA

;; Query time: 137 msec
;; SERVER:
;; WHEN: Tue Jun 24 14:59:58 2014
;; MSG SIZE  rcvd: 34

Contrast to an A record query, which properly returns the CNAME:

$ dig A

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 890
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;				IN	A

;; ANSWER SECTION:
		30	IN	CNAME

;; Query time: 115 msec
;; SERVER:
;; WHEN: Tue Jun 24 15:04:34 2014
;; MSG SIZE  rcvd: 79

Consequently, any IPv6-only host attempting to resolve will fail, even though has IPv6 connectivity and ultimately (once you follow 4 CNAMEs) has an AAAA record. You can witness this by running ping6 from a Linux box (but flush your DNS caches first). Fortunately, IPv6-only hosts are rare, and dual-stack or IPv4-only hosts won't have a problem resolving because they'll try resolving the A record. Nevertheless, this buggy behavior is extremely troubling, especially when the bug is in such simple logic (it doesn't matter what kind of record is being requested: if there's a CNAME, you return it), and is bound to cause headaches during the IPv6 transition.

I can't tell for sure what DNS implementation is powering the nameservers, but considering that "GTM" stands for "Global Traffic Manager," F5's DNS appliance product, which has a history of AAAA record bugginess, I have a hunch...


June 27, 2014

Titus Isolation Techniques, Continued

In my previous blog post, I discussed the unique way in which titus, my high-security TLS proxy server, isolates the TLS private key in a separate process to protect it against Heartbleed-like vulnerabilities in OpenSSL. In this blog post, I will discuss the other isolation techniques used by titus to guard against OpenSSL vulnerabilities.

A separate process for every connection

The most basic isolation performed by titus is using a new and dedicated process for every TLS connection. This ensures that a vulnerability in OpenSSL can't be used to compromise the memory of another connection. Although most of the attention around Heartbleed focused on extracting the private key, Heartbleed also exposed other sensitive information, such as user passwords contained in buffers from other connections. By giving each TLS connection a new and dedicated process, titus confines an attacker to accessing memory related to only his own connection. Such memory would contain at most the session key material for the connection and buffers from previous packets, which is information the attacker already knows.

Currently titus forks but does not call execve(). This simplifies the code greatly, but it means that the child process has access to all the memory of the parent process at the time of the fork. Therefore, titus is careful to avoid loading anything sensitive into the parent's memory. In particular, the private key is never loaded by the parent process.

However, some low-grade sensitive information may be loaded into the parent's memory before forking. For instance, the parent process calls getpwnam() during initialization, so the contents of /etc/passwd may persist in memory and be accessible by child processes. A future version of titus should probably call execve() to launch the child process so it starts off with a clean slate.

Another important detail is that titus must reinitialize OpenSSL's random number generator by calling RAND_poll() in the child process after forking. Failure to do so could result in multiple children generating the same random numbers, which would have catastrophic consequences.

Privilege separation

To protect against arbitrary code execution vulnerabilities, titus runs as dedicated non-root users. Currently two users are used: one for running the processes that talk to the network, and another for the processes that hold the private key.

Using the same user for all connections has security implications. The most serious problem is that, by default, users can use the ptrace() system call to access the memory of other processes running as the same user. To prevent this, titus disables ptracing using the PR_SET_DUMPABLE option to the prctl() syscall. This is an imperfect security measure: it's Linux-specific, and doesn't prevent attackers from disrupting other processes by sending signals.
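The ptrace hardening is a one-liner. A minimal sketch (the wrapper name is mine):

```c
#include <sys/prctl.h>

/* Sketch: clear the process's "dumpable" flag so that other processes
 * running as the same user cannot ptrace this one. Linux-specific, as
 * noted above. Returns 0 on success, -1 on failure. */
int disable_ptrace(void)
{
	return prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
}
```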

Ultimately, titus should use a separate user for every concurrent connection. Modern Unix systems use 32 bit UIDs, making it completely feasible to allocate a range of UIDs to be used by titus, provided that titus reuses UIDs for future connections. To reuse a UID securely, titus would need to first kill off any latent process owned by that user. Otherwise, an attacker could fork a process that lies in wait until the UID is reused. Unfortunately, Unix provides no airtight way to kill all processes owned by a user. It may be necessary to leverage cgroups or PID namespaces, which are unfortunately Linux-specific.

Filesystem isolation

Finally, in order to reduce the attack surface even further, titus chroots into an empty, unwritable directory. Thus, an attacker who can execute arbitrary code is unable to read sensitive files, attack special files or setuid binaries, or download rootkits to the filesystem. Note that titus chroots after reseeding OpenSSL's RNG, so it's not necessary to include /dev/urandom in the chroot directory.

Future enhancements

In addition to the enhancements mentioned above, I'd like to investigate using seccomp filtering to limit the system calls which titus is allowed to execute. Limiting titus to a minimal set of syscalls would reduce the attack surface on the kernel, preventing an attacker from breaking out of the sandbox if there's a kernel vulnerability in a particular syscall.

I'd also like to investigate network and process namespaces. Network namespaces would isolate titus from the network, preventing attackers from launching attacks on systems on your internal network or on the Internet. Process namespaces would provide an added layer of isolation and make it easy to kill off latent processes when a connection ends.

Why titus?

The TLS protocol is incredibly complicated, which makes TLS implementations necessarily complex, which makes them inevitably prone to security vulnerabilities. If you're building a simple server application that needs to talk TLS, the complexity of the TLS implementation is going to dwarf the complexity of your own application. Even if your own code is securely written and short and simple enough to be easily audited, your application may nevertheless be vulnerable if you link with a TLS implementation. Titus provides a way to isolate the TLS implementation, so its complexity doesn't affect the security of your own application.

By the way, titus was recently discussed on the Red Hat Security Blog along with some other interesting approaches to OpenSSL privilege separation, such as sslps, a seccomp-based approach. The blog post is definitely worth a read.


May 5, 2014

Protecting the OpenSSL Private Key in a Separate Process

Ever since Heartbleed, I've been thinking of ways to better isolate OpenSSL so that a vulnerability in OpenSSL won't result in the compromise of sensitive information. This blog post will describe how you can protect the private key by isolating OpenSSL private key operations in a dedicated process, a technique I'm using in titus, my open source high-isolation TLS proxy server.

If you're worried about OpenSSL vulnerabilities, then simply terminating TLS in a dedicated process, such as stunnel, is a start, since it isolates sensitive web server memory from OpenSSL, but there's still the tricky issue of your private key. OpenSSL needs access to the private key to perform decryption and signing operations. And it's not sufficient to isolate just the key: you must also isolate all intermediate calculations, as Akamai learned when their patch to store the private key on a "secure heap" was ripped to shreds by security researcher Willem Pinckaers.

Fortunately, OpenSSL's modular nature can be leveraged to out-source RSA private key operations (sign and decrypt) to user-defined functions, without having to modify OpenSSL itself. From these user-defined functions, it's possible to use inter-process communication to transfer the arguments to a different process, where the operation is performed, and then transfer the result back. This provides total isolation: the process talking to the network needs access to neither the private key nor any intermediate value resulting from the RSA calculations.

I'm going to show you how to do this. Note that, for clarity, the code presented here lacks proper error handling and resource management. For production quality code, you should look at the source for titus.

Traditionally, you initialize OpenSSL using code like the following:

SSL_CTX*	ctx;
FILE*		cert_filehandle;
FILE*		key_filehandle;
// ... omitted: initialize CTX, open cert and key files ...
X509*		cert = PEM_read_X509_AUX(cert_filehandle, NULL, NULL, NULL);
EVP_PKEY*	key = PEM_read_PrivateKey(key_filehandle, NULL, NULL, NULL);
SSL_CTX_use_certificate(ctx, cert);
SSL_CTX_use_PrivateKey(ctx, key);

The first thing we do is replace the call to PEM_read_PrivateKey, which reads the private key into memory, with our own function that creates a shell of a private key with references to our own implementations of the sign and decrypt operations. Let's call that function make_private_key_shell:

EVP_PKEY* make_private_key_shell (X509* cert)
{
	EVP_PKEY* key = EVP_PKEY_new();
	RSA*      rsa = RSA_new();

	// It's necessary for our shell to contain the public RSA values (n and e).
	// Grab them out of the certificate:
	RSA*      public_rsa = EVP_PKEY_get1_RSA(X509_get_pubkey(cert));
	rsa->n = BN_dup(public_rsa->n);
	rsa->e = BN_dup(public_rsa->e);

	static RSA_METHOD  ops = *RSA_get_default_method();
	ops.rsa_priv_dec = rsa_private_decrypt;
	ops.rsa_priv_enc = rsa_private_encrypt;
	RSA_set_method(rsa, &ops);

	EVP_PKEY_set1_RSA(key, rsa);
	return key;
}

The magic happens with the call to RSA_set_method. We pass it a struct of function pointers from which we reference our own implementations of the private decrypt and private encrypt (sign) operations. These implementations look something like this:

int rsa_private_decrypt (int flen, const unsigned char* from, unsigned char* to, RSA* rsa, int padding)
{
	return do_rsa_operation(1, flen, from, to, rsa, padding);
}

int rsa_private_encrypt (int flen, const unsigned char* from, unsigned char* to, RSA* rsa, int padding)
{
	return do_rsa_operation(2, flen, from, to, rsa, padding);
}

int do_rsa_operation (char command, int flen, const unsigned char* from, unsigned char* to, RSA* rsa, int padding)
{
	write(sockpair[0], &command, sizeof(command));
	write(sockpair[0], &padding, sizeof(padding));
	write(sockpair[0], &flen, sizeof(flen));
	write(sockpair[0], from, flen);

	int to_len;
	read(sockpair[0], &to_len, sizeof(to_len));
	if (to_len > 0) {
		read(sockpair[0], to, to_len);
	}
	return to_len;
}

The arguments and results are sent to and from the other process over a socket pair that has been previously opened. Our message format is simply:

  • uint8_t command; // 1 for decrypt, 2 for sign
  • int padding; // the padding argument
  • int flen; // the flen argument
  • unsigned char from[flen]; // the from argument

The response format is:

  • int to_len; // length of result buffer (to)
  • unsigned char to[to_len]; // the result buffer

Here's the code to open the socket pair and run the RSA private key process:

void run_rsa_process (const char* key_path)
{
	socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair);
	if (fork() == 0) {
		FILE* key_filehandle = fopen(key_path, "r");
		RSA*  rsa = PEM_read_RSAPrivateKey(key_filehandle, NULL, NULL, NULL);

		char command;
		while (read(sockpair[1], &command, sizeof(command)) == 1) {
			int padding;
			int flen;
			read(sockpair[1], &padding, sizeof(padding));
			read(sockpair[1], &flen, sizeof(flen));
			unsigned char* from = (unsigned char*)malloc(flen);
			read(sockpair[1], from, flen);
			unsigned char* to = (unsigned char*)malloc(RSA_size(rsa));
			int to_len = -1;
			if (command == 1) {
				to_len = RSA_private_decrypt(flen, from, to, rsa, padding);
			} else if (command == 2) {
				to_len = RSA_private_encrypt(flen, from, to, rsa, padding);
			}
			write(sockpair[1], &to_len, sizeof(to_len));
			if (to_len > 0) {
				write(sockpair[1], to, to_len);
			}
			free(from);
			free(to);
		}
	}
}
In the function above, we first create a socket pair for communicating between the parent (untrusted) process and child (trusted) process. We fork, and in the child process, we load the RSA private key, and then repeatedly service RSA private key operations received over the socket pair from the parent process. Only the child process, which never talks to the network, has the private key in memory. If the memory of the parent process, which does talk to the network, is ever compromised, the private key is safe.

That's the basic idea, and it works. There are other ways to do the interprocess communication that are more complicated but may be more efficient, such as using shared memory to transfer the arguments and results back and forth. But the socket pair implementation is conceptually simple and a good starting point for further improvements.

This is one of the techniques I'm using in titus to achieve total isolation of the part of OpenSSL that talks to the network. However, this is only part of the story. While this technique protects your private key against a memory disclosure bug like Heartbleed, it doesn't prevent other sensitive data from leaking. It also doesn't protect against more severe vulnerabilities, such as remote code execution. Remote code execution could be used to attack the trusted child process (such as by ptracing it and dumping its memory) or your system as a whole. titus protects against this using additional techniques like chrooting and privilege separation.

My next blog post will go into detail on titus' other isolation techniques. Follow me on Twitter, or subscribe to my blog's Atom feed, so you know when it's posted.

Update: Read part two of this blog post.


April 8, 2014

Responding to Heartbleed: A script to rekey SSL certs en masse

Because of the Heartbleed vulnerability in OpenSSL, I'm treating all of my private SSL keys as compromised and regenerating them. Fortunately, certificate authorities will reissue a certificate for free that signs a new key and is valid for the remaining time on the original certificate.

Unfortunately, using the openssl commands by hand to rekey dozens of SSL certificates is really annoying and is not my idea of a good time. So, I wrote a shell script called openssl-rekey to automate the process. openssl-rekey takes any number of certificate files as arguments, and for each one, generates a new private key of the same length as the original key, and a new CSR with the same common name as the original cert.
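The core of such a script might look like the following hypothetical sketch (the real openssl-rekey may differ; the function name and sed extraction are mine):

```shell
# Hypothetical sketch of openssl-rekey's core, per certificate: read the
# key length and subject CN out of an existing cert, then emit a fresh
# private key of the same length and a CSR with the same CN alongside it.
rekey() {
	cert="$1"
	base="${cert%.crt}"

	# Extract the key length and subject common name from the certificate.
	bits=$(openssl x509 -in "$cert" -noout -text | sed -n 's/.*Public-Key: (\([0-9]*\) bit).*/\1/p')
	cn=$(openssl x509 -in "$cert" -noout -subject | sed 's/.*CN *= *//')

	# Generate a matching new key and CSR.
	openssl genrsa -out "$base.key" "$bits"
	openssl req -new -key "$base.key" -subj "/CN=$cn" -out "$base.csr"
}
```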

If you have a directory full of certificates, it's easy to run openssl-rekey on all of them with find and xargs:

$ find -name '*.crt' -print0 | xargs -0 /path/to/openssl-rekey

Once you've done this, you just need to submit the .csr files to your certificate authority, and then install the new .key and .crt files on your servers.

By the way, if you're like me and hate dealing with openssl commands and cumbersome certificate authority websites, you should check out my side project, SSLMate, which makes buying certificates as easy as running sslmate buy 2 and reissuing certificates as easy as running sslmate reissue. I was able to reissue each of my SSLMate certs in under a minute. As my old certs expire I'm replacing them with SSLMate certs, and that cannot happen soon enough.


December 4, 2013

The Sorry State of Xpdf in Debian

I discovered today that Xpdf deterministically segfaults when you try to print any PDF under i386 Wheezy. It doesn't segfault on amd64 Wheezy which is why I had not previously noticed this. Someone reported this bug over two years ago, which you'd think would have given ample time for this to be fixed for Wheezy. This bug led me to a related bug which paints quite a sad picture. To summarize:

  • Xpdf uses its own PDF rendering engine. Some other developers thought it would be nice to be able to use this engine in other applications, so they ripped out Xpdf's guts and put them in a library called Poppler. Unfortunately, Xpdf continued to use its own rendering engine instead of linking with Poppler, so now there are two diverging codebases that do basically the same thing.
  • Apparently this code is bad and has a history of security vulnerabilities. The Debian maintainers quite reasonably don't want to support both codebases, so they have attempted to patch Xpdf to use Poppler instead of its own internal engine.
  • Unfortunately, this is hard and they haven't done a perfect job at it, which has led to a situation where Xpdf and Poppler each define their own, incompatible, versions of the same struct... with which they then attempt to interoperate.
  • As a consequence, Xpdf accesses uninitialized memory, or accesses initialized memory incorrectly, so it's a miracle any time you manage to use Xpdf without it crashing.

The Debian maintainers have three options:

  1. Stop trying to patch Xpdf to use Poppler.
  2. Fix their patch of Xpdf so it actually works.
  3. Do nothing.

There is a patch for option 2, but it has been rejected for being too long and complicated. Indeed it is long and complicated, and will make it more difficult to package new upstream versions of Xpdf. But that's the price that will have to be paid as Xpdf and Poppler continue to diverge, if the maintainers insist that Xpdf use Poppler. Unfortunately they seem unwilling to pay this price, nor are they willing to go with option 1. Instead they have taken the easy way out, option 3, even though this results in a totally broken, and possibly insecure, package that is only going to get more broken.

Clearly I can't be using this quagmire of a PDF viewer, so I have uninstalled it after being a user for more than 10 years. Sadly, the alternatives are not great. Most of them take the position that UIs are for chumps, preferring instead to use keyboard shortcuts that I will never remember because I don't view PDFs often enough. Debian helpfully provides a version of Evince that isn't linked with GNOME and it works decently. Unfortunately, after quitting it, I noticed a new process in my process list called "evinced." As in "evince daemon." As in, a PDF viewer launched its own daemon. As in, a PDF viewer launched its own daemon. I don't even...

epdfview is like Evince but less cool because it doesn't have its own daemon; unfortunately, it has buggy text selection. qpdfview is rather nice, but copies selected text in such a way that it can't be pasted by middle-clicking (it copies into XA_CLIPBOARD instead of XA_PRIMARY).

I decided to go with Evince and pretend I never saw evinced.

Update (2014-01-20): The Xpdf bug has allegedly been fixed, by defining a new version of the incompatible struct. This is option 2, which means we can expect Xpdf to break again as Poppler and Xpdf continue to diverge. It appears to be a different patch from the one that was rejected earlier, though I can't find any of the discussion which led to this fix being adopted. I will update this blog post if I learn more information.


October 4, 2013

Verisign's Broken Name Servers Slow Down HTTPS for Google and Others

The problem was so bizarre that for a moment I suspected I was witnessing a man-in-the-middle attack using valid certificates. Many popular HTTPS websites, including Google and DuckDuckGo, but not all HTTPS websites, were taking up to 20 seconds to load. The delay occurred in all browsers, and according to Chromium's developer tools, it was occurring in the SSL (aka TLS) handshake. I was perplexed to see Google taking several seconds to complete the TLS handshake. Google employs TLS experts to squeeze every last drop of performance out of TLS, and uses the highly efficient elliptic curve Diffie-Hellman key exchange. It was comical to compare that to my own HTTPS server, which was handshaking in a fraction of a second, despite using stock OpenSSL and the more expensive discrete log Diffie-Hellman key exchange.

Not yet willing to conclude that it was a targeted man-in-the-middle attack that was affecting performance, I looked for alternative explanations. Instinctively, I thought this had the whiff of a DNS problem. After a slow handshake, there was always a brief period during which all handshakes were fast, even if I restarted the browser. This suggested to me that once a DNS record was cached, everything was fast until the cache entry expired. Since I run my own recursive DNS server locally, this hypothesis was easy to test by flushing my DNS cache. I found that flushing the DNS cache would consistently cause the next TLS handshake to be slow.

This didn't make much sense: using tools like host and dig, I could find no DNS problems with the affected domains, and besides, Chromium said the delay was in the TLS handshake. It finally dawned on me that the delay could be in the OCSP check. OCSP, or Online Certificate Status Protocol, is a mechanism for TLS clients to check if a certificate has been revoked. During the handshake, the client makes a request to the OCSP URI specified in the certificate to check its status. Since the URI would typically contain a hostname, a DNS problem could manifest here.

I checked the certificates of the affected sites, and all of them specified OCSP URIs that ultimately resolved to Upon investigation, I found that of the seven name servers listed for ( through, only two of them ( and were returning a response to AAAA queries. The other five servers returned no response at all, not even a response to say that an AAAA record does not exist. This was very bad, since it meant any attempt to resolve an AAAA record for this host required the client to try again and wait until it timed out, leading to unsavory delays.

If you're curious what an AAAA record is and why this matters, an AAAA record is the type of DNS record that maps a hostname to its IPv6 address. It's the IPv6 equivalent to the A record, which maps a hostname to its IPv4 address. While the Internet is transitioning from IPv4 to IPv6, hosts are expected to be dual-homed, meaning they have both an IPv4 and an IPv6 address. When one system talks to another, it prefers IPv6, and falls back to IPv4 only if the peer doesn't support IPv6. To figure this out, the system first attempts an AAAA lookup, and if no AAAA record exists, it tries an A record lookup. So, when a name server does not respond to AAAA queries, not even with a response to say no AAAA record exists, the client has to wait until it times out before trying the A record lookup, causing the delays I was experiencing here. Cisco has a great article that goes into more depth about broken name servers and AAAA records.

(Note: the exact mechanics vary between operating systems. The Linux resolver tries AAAA lookups even if the system doesn't have IPv6 connectivity, meaning that even IPv4-only users experience these delays. Other operating systems might only attempt AAAA lookups if the system has IPv6 connectivity, which would mitigate the scope of this issue.)

A History of Brokenness

This is apparently not the first time Verisign's servers have had problems: A year ago, the name servers for exhibited the same broken behavior:

The unofficial response from Verisign was that the queries are being handled by a GSLB, which apparently means that we should not expect it to behave correctly.

"GSLB" means "Global Server Load Balancing" and I interpret that statement to mean Verisign is using an expensive DNS appliance to answer queries instead of software running on a conventional server. The snarky comment about such appliances rings true for me. Last year, I noticed that my alma matter's website was taking 30 seconds to load. I tracked the problem down to the exact same issue: the DNS servers for were not returning any response to AAAA queries. In the process of reporting this to Brown's IT department, I learned that they were using buggy and overpriced-looking DNS appliances from F5 Networks, which, by default, do not properly respond to AAAA queries under circumstances that appear to be common enough to cause real problems. To fix the problem, the IT people had to manually configure every single DNS record individually to properly reply to AAAA queries.

Marketing for F5's 'Global Traffic Manager': 'Fast, secure DNS and optimized global apps'

F5's "Global Traffic Manager": Because nothing says "optimized" like shipping with broken defaults that cause delays for users

I find it totally unconscionable for a DNS appliance vendor to be shipping a product with such broken behavior which causes serious delays for users and gives IPv6 a bad reputation. It is similarly outrageous for Verisign to be operating broken DNS servers that are in the critical path for an untold number of TLS handshakes. That gives HTTPS a bad reputation, and lends fuel to the people who say that HTTPS is too slow. It's truly unfortunate that even if you're Google and do everything right with IPv6, DNS, and TLS, your handshake speeds are still at the mercy of incompetent certificate authorities like Verisign.

Disabling OCSP

I worked around this issue by disabling OCSP (in Firefox, set security.OCSP.enabled to 0 in about:config). While OCSP may theoretically be good for security, since it enables browsers to reject certificates that have been compromised and revoked, in practice it's a total mess. Since OCSP servers are often unreliable or are blocked by restrictive firewalls, browsers don't treat OCSP errors as fatal by default. Thus, an active attacker who is using a revoked certificate to man-in-the-middle HTTPS connections can simply block access to the OCSP server and the browser will accept the revoked certificate. Frankly, OCSP is better at protecting certificate authorities' business model than protecting users' security, since it allows certificate authorities to revoke certificates for things like credit card chargebacks. As if this wasn't bad enough already, OCSP introduces a minor privacy leak because it reports every HTTPS site you visit to the certificate authority. Google Chrome doesn't even use OCSP anymore because it is so dysfunctional.

Finally Resolved

While I was writing this blog post, Verisign fixed their DNS servers, and now every single one returns a proper response to AAAA queries. I know for sure their servers were broken for at least two days, and I suspect it was longer, considering the slowness had been happening for quite some time before I finally investigated.


July 1, 2013

ICMP Redirect Attacks in the Wild

I recently lost an afternoon dealing with a most vexing routing problem on a server which turned out to be the result of an ICMP redirect attack.

ICMP redirects are a "feature" of IP which allows a router to inform a host that there's a more efficient route to a destination and that the host should adjust its routing table accordingly. This might be OK on a trusted LAN, but on the wild Internet, where malice abounds, it may not be such a good idea to alter your routing table at someone else's whim. Nevertheless, ICMP redirects are enabled by default on Linux.

The problem manifested itself with a customer unable to contact one of our servers. The customer provided a traceroute which starred out after the last hop before our server, leading him to conclude that our firewall was blocking him. We would probably have concluded the same thing, except we knew that no such firewall existed. From our end, traceroutes to him starred out immediately before even reaching our default gateway. Even more bizarrely, if we picked a certain one of our server's four IP addresses as the source address for the traceroute, the traceroute worked! Furthermore, we were able to traceroute to adjacent IP addresses on the customer's subnet from any source address without issue.

We looked again and again at our routing table and our iptables rules. There was nothing to explain this behavior. This server is virtualized, so we were starting to suspect the underlying host, when I decided to look at the kernel's route cache with the ip route show cache command. What I saw worried me greatly:

root@tommy:~# ip route show cache | grep via dev eth0 src from via dev eth0 from via dev eth0 from via dev eth0 from via dev eth0

These entries say to route packets to (the customer's IP address, changed to protect their identity) via However, is not our default gateway. In fact, we don't use any private IP addresses that look remotely like that. The fourth entry, which has a legitimate gateway and applies only to packets with a source address of, explains why we could successfully use that one IP address as a source address.

I had read about ICMP redirect attacks before and suspected that they may be at play here. To test this hypothesis I spun up a test server and used an extremely useful tool called scapy to send my own fake ICMP redirect packets. The results were strange. Sending the fake ICMP redirect did not immediately put a bogus entry in the route cache. However, if I attempted to contact the host targeted by the redirect within 10 minutes of sending the redirect, then an entry appeared in the route cache that prevented me from contacting the host. Furthermore, once the entry got in the cache, there was no getting rid of it. Even if I flushed the cache, the rogue entry came back the next time I tried to contact the targeted host. The kernel was clearly keeping some separate state of redirected route entries, and as far as I could tell, there was no way to inspect it. This meant that the only way to recover the affected server was to reboot it!

There are clear denial-of-service possibilities. If you want to prevent a host (the "victim") from contacting another host (the "target"), just send a fake redirect packet for the target to the victim! Since you don't need to forge a packet's source address to send a rogue ICMP redirect, it will make its way past RP filters. The only constraint is that your victim needs to contact the target within 10 minutes of receiving the redirect for it to stick. This is easy to overcome: you can send the redirect when the victim is likely to be contacting the target, or simply send a new redirect every 10 minutes (that's hardly high volume). Or, more diabolically, if you have the ability to spoof source addresses, you can follow the redirect packet with a TCP SYN packet with its source address spoofed as the target. The victim will reply to the target with a SYN-ACK, and in doing so make permanent the effect of the ICMP redirect.

Obviously you need to disable ICMP redirect packets on any public-facing host. Unfortunately, the most intuitive and widely-documented way of disabling ICMP redirects on Linux (by writing 0 to /proc/sys/net/ipv4/conf/all/accept_redirects) doesn't always work! From Documentation/networking/ip-sysctl.txt of the Linux source:

accept_redirects - BOOLEAN
	Accept ICMP redirect messages.
	accept_redirects for the interface will be enabled if:
	- both conf/{all,interface}/accept_redirects are TRUE in the case
	  forwarding for the interface is enabled
	or
	- at least one of conf/{all,interface}/accept_redirects is TRUE in the
	  case forwarding for the interface is disabled
	accept_redirects for the interface will be disabled otherwise
	default TRUE (host)
		FALSE (router)

This arcane logic means that for a non-router (i.e. most servers), not only must /proc/sys/net/ipv4/conf/all/accept_redirects be 0, but so must /proc/sys/net/ipv4/conf/interface/accept_redirects. So the following will reliably disable ICMP redirects (assuming your interface is eth0):

echo 0 > /proc/sys/net/ipv4/conf/all/accept_redirects

echo 0 > /proc/sys/net/ipv4/conf/eth0/accept_redirects

If you put that in a system startup script, you should be safe.
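If you'd rather use the sysctl machinery than a raw startup script, the persistent equivalent might look like this (a sketch; adjust the interface name, and note that conf/default covers interfaces that appear later):

```
# /etc/sysctl.d/no-redirects.conf (illustrative)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.eth0.accept_redirects = 0
```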

A final note: the precise behavior for handling ICMP redirects and the route cache may differ between kernel versions. My tests were conducted on 2.6.39.


March 27, 2013

Running a Robust NTP Daemon

Accurate time is essential on a server, and running ntpd is the best way to ensure it. Unfortunately, ntpd, especially on Debian, can be finicky: in the past I've had trouble with clock drift and with ntpd failing to start on boot. Here are my best practices for avoiding these problems.

On Debian, make sure lockfile-progs is installed

On Debian (and likely Ubuntu too), there's a nasty race condition on boot between ntpdate and ntpd. When the network comes up, ifupdown runs ntpdate to synchronize the clock. But at about the same time, ntpd starts. If ntpdate is still running when ntpd starts, ntpd can't bind to the local NTP port and terminates. Sometimes ntpd starts on boot and sometimes it doesn't!

The Debian scripts avoid this race using locks, but only if the lockfile-progs package is installed. lockfile-progs is a Recommends: of ntpdate, so if you don't install recommended packages by default, you may be missing it.

If you use DHCP, don't request ntp-servers

If your system gets its IP address from DHCP using dhclient, then by default dhclient will update your ntpd configuration with NTP server information it receives from the DHCP server. It is extremely frustrating to configure reliable upstream NTP servers only to have them replaced with unreliable servers (such as those that advertise phony leap seconds, as has happened to me multiple times). And the last thing you want is your configuration management system fighting with dhclient over what NTP servers to use.

To prevent this, edit /etc/dhcp/dhclient.conf and remove ntp-servers from the request line.
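The result might look like this (an illustrative request line; the exact default option list varies by release):

```
# /etc/dhcp/dhclient.conf -- request line with ntp-servers removed
request subnet-mask, broadcast-address, time-offset, routers,
        domain-name, domain-name-servers, domain-search, host-name,
        netbios-name-servers, netbios-scope, interface-mtu,
        rfc3442-classless-static-routes;
```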

Don't use the undisciplined local clock (i.e. server 127.127.1.0)

Make sure these lines aren't in your ntp.conf:

server 127.127.1.0
fudge 127.127.1.0 stratum 10

These lines enable the Undisciplined Local Clock, and cause ntpd to start using your local clock as a time source if the real NTP servers aren't reachable. This can be useful if you want to keep a group of servers on a local network in sync even if your Internet connection goes down, but in general you don't need or want this. I've seen strange situations where the local clock becomes preferred over the real NTP servers, resulting in clock drift that goes uncorrected. Best to disable the local clock by removing all references to 127.127.1.0 from your ntp.conf.
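Putting the pieces together, a minimal ntp.conf free of the local clock might look like this (a sketch; the pool hostnames are Debian's defaults and should be adjusted to taste):

```
# /etc/ntp.conf (sketch): real upstream servers only, no 127.127.1.0
driftfile /var/lib/ntp/ntp.drift
server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst
```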


March 2, 2013

GCC's Implementation of basic_istream::ignore() is Broken

The implementation of std::basic_istream::ignore() in GCC's C++ standard library suffers from a serious flaw. After ignoring the n characters as requested, it checks to see if end-of-file has been reached. If it has, then the stream's eofbit is set. The problem is that to check for end-of-file, ignore() has to essentially peek ahead in the stream one character beyond what you've ignored. That means that if you ask to ignore all the characters currently available in the stream buffer, ignore() causes an underflow of the buffer. If it's a file stream, the buffer can be refilled by reading from the filesystem in a finite amount of time, so this is merely inefficient. But if it's a socket, this underflow can be fatal: your program may block forever waiting for bytes that never come. This is horribly unintuitive and is inconsistent with the behavior of std::basic_istream::read(), which does not check for end-of-file after reading the requested number of characters.

The origin of this problem is that the C++ standard is perhaps not as clear as it should be regarding ignore(). From section 27.7.2.3 [istream.unformatted] of the C++11 standard:

basic_istream<charT,traits>& ignore(streamsize n = 1, int_type delim = traits::eof());

Effects: Behaves as an unformatted input function (as described in 27.7.2.3, paragraph 1). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:

  • if n != numeric_limits<streamsize>::max() (18.3.2), n characters are extracted
  • end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.5.5.4));
  • traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character c (in which case c is extracted).

Note that the Standard does not specify the order in which the checks should be performed, suggesting that a conformant implementation may check for end-of-file before checking if n characters have been extracted, as GCC does. You may think that the order is implicit in the ordering of the bullet points, but if it were, then why would the Standard explicitly state the order in the case of getline()? From the same section:

basic_istream<charT,traits>& getline(char_type* s, streamsize n, char_type delim);

Effects: Behaves as an unformatted input function (as described in 27.7.2.3, paragraph 1). After constructing a sentry object, extracts characters and stores them into successive locations of an array whose first element is designated by s. Characters are extracted and stored until one of the following occurs:

  1. end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit));
  2. traits::eq(c, delim) for the next available input character c (in which case the input character is extracted but not stored);
  3. n is less than one or n - 1 characters are stored (in which case the function calls setstate(failbit)).

These conditions are tested in the order shown.

At least this is one GCC developer's justification for GCC's behavior. However, I have a different take: I believe that the only way to satisfy the Standard's requirements for ignore() is to perform the checks in the order presented. The Standard says that "characters are extracted until any of the following occurs." That means that when n characters have been extracted, ignore() needs to terminate, since this condition is among "any of the following." But, if ignore() first checks for end-of-file and blocks forever, then it doesn't terminate. This constrains the order in which a conformant implementation can check the conditions, and is perhaps why the Standard does not need to specify an explicit order here, but does for getline() where it really does want the end-of-file check to occur first.

I have left a comment on the GCC bug stating my interpretation. One problem with fixing this bug is that it will break code that has come to depend on eofbit being set if you ignore all the data remaining on a stream, though I'm frankly skeptical that much code would make that assumption. Also, both LLVM's libcxx and Microsoft Visual Studio (version 2005, at least) implement ignore() according to my interpretation of the Standard.

In the meantime, be very, very careful with your use of ignore(). Only use it on file streams or when you know you'll be ignoring fewer characters than are available to be read. And don't rely on eofbit being set one way or the other.

If you need a more reliable version of ignore(), I've written a non-member function implementation which takes a std::basic_istream as its first argument. It is very nearly a drop-in replacement for the member function (it even properly throws exceptions depending on the stream's exceptions mask), except that it returns the number of bytes ignored (not a reference to the stream) in lieu of making the number of bytes available by a call to gcount(). (It's not possible for a non-member function to set the value returned by gcount().)


March 1, 2013

Why Do Hackers Love Namecheap and Hate Name.com?

Namecheap has brilliant marketing. The day that GoDaddy announced their support of SOPA, Namecheap pounced on the opportunity. They penned a passionate blog post and declared December 29, 2011 "Move Your Domain Day," complete with a patriotic theme, a coupon code "SOPAsucks," and donations to the EFF. Move Your Domain Day was such a success that it has its own Wikipedia article. Namecheap led the charge against GoDaddy, and I think it's safe to assume that most people who transferred from GoDaddy because of SOPA transferred to Namecheap. Now they seem to be the preferred registrar of the Hacker News/Reddit crowd.

Now consider Name.com. They too opposed SOPA and encouraged transfers using a "nodaddy" coupon code. But they didn't exert nearly as much effort as Namecheap and as a consequence probably lost out on a lot of potential transfers.

But Name.com has a bigger problem. They get raked over the coals on Hacker News because their free DNS hosting service adds a wildcard record that points their users' otherwise non-existent subdomains to their own ad-laden landing page. I think that's bad and they shouldn't do it. But at the same time, people should understand the distinction between domain registration and DNS hosting.

I'm very happy with Name.com as a domain registrar. It is the best I've used (among Network Solutions, GoDaddy, Directnic, Gandi, and 1&1) and the first that I haven't had any significant complaints about. I haven't used Namecheap. Namecheap looks like a good registrar too, but Name.com appears at least as good, if not better. Their UI is friendly and uncluttered. Their about page makes them seem just as non-evil as Namecheap. Name.com has long supported both IPv6 glue records and DNSSEC (Namecheap recently added IPv6 glue but still has no DNSSEC support). Name.com has two-factor authentication, which is pretty important for such a critical service.

When you buy a domain from Name.com, you're paying for the registration. You don't have to use their DNS service, especially when there are so many good options for DNS hosting: Amazon's Route 53 is very inexpensive, Cloudflare offers DNS as part of their free plan, Hurricane Electric has a free DNS service, Linode has free DNS for their customers, there are paid providers like ZoneEdit, SlickDNS, etc. Or you can host your own DNS.

As a general rule, registrars make crummy DNS providers. Usually the interface is clunky and they don't support all the record types. Only a few months after registering my first domain with Network Solutions, their entire DNS service suffered an hours-long outage during which my domain was unresolvable. Ever since, I've hosted my own DNS without a problem (recently I added Linode as my slave).

I don't have a dog in this race, but I think it would be a shame for someone to exclude a good registrar like Name.com from consideration just because they're a bad DNS provider. It would also be a shame for someone to use any registrar's crummy DNS service when there are so many better options out there.


Older Posts Newer Posts