Andrew Ayer - Blog

The Story Behind Last Week's Let's Encrypt Downtime

2023-06-22T14:18:52Z

Last Thursday (June 15th, 2023), Let's Encrypt went down for about an hour, during which time it was not possible to obtain certificates from Let's Encrypt. Immediately prior to the outage, Let's Encrypt issued 645 certificates which did not work in Chrome or Safari. In this post, I'm going to explain what went wrong and how I detected it.

The Law of Precertificates

Before I can explain the incident, we need to talk about Certificate Transparency. Certificate Transparency (CT) is a system for putting certificates issued by publicly-trusted CAs, such as Let's Encrypt, in public, append-only logs. Certificate authorities have a tremendous amount of power, and if they misuse their power by issuing certificates that they shouldn't, traffic to HTTPS websites could be intercepted by attackers. Historically, CAs have not used their power well, and Certificate Transparency is an effort to fix that by letting anyone examine the certificates that CAs issue.

A key concept in Certificate Transparency is the "precertificate". Before issuing a certificate, the certificate authority creates a precertificate, which contains all of the information that will be in the certificate, plus a "poison extension" that prevents the precertificate from being used like a real certificate. The CA submits the precertificate to multiple Certificate Transparency logs. Each log returns a Signed Certificate Timestamp (SCT), which is a signed statement acknowledging receipt of the precertificate and promising to publish the precertificate in the log for anyone to download. The CA takes all of the SCTs and embeds them in the certificate. When a CT-enforcing browser (like Chrome or Safari) validates the certificate, it makes sure that the certificate embeds a sufficient number of SCTs from trustworthy logs. This doesn't prevent the browser from accepting a malicious certificate, but it does ensure that the precertificate is in public logs, allowing the attack to be detected and action taken against the CA.

The certificate itself may or may not end up in CT logs. Some CAs, notably Let's Encrypt and Sectigo, automatically submit their certificates. Certificates from other CAs only end up in logs if someone else finds and submits them. Since only the precertificate is guaranteed to be logged, it is essential that a precertificate be treated as incontrovertible proof that a certificate containing the same data exists. When someone finds a precertificate for a malicious or non-compliant certificate, the CA can't be allowed to evade responsibility by saying "just kidding, we never actually issued the real certificate" (and boy, have they tried). Otherwise, CT would be useless.

There are two ways a CA could create a certificate. They could take the precertificate, remove the poison extension, add the SCTs, and re-sign it. Or, they could create the certificate from scratch, making sure to add the same data, in the same order, as used in the precertificate.

The first way is robust because it's guaranteed to produce a certificate which matches the precertificate. At least one CA, Sectigo, uses this approach. Let's Encrypt uses the second approach. You can probably see where this is going...

The Let's Encrypt incident

On June 15, 2023, Let's Encrypt deployed a planned change to their certificate configuration which altered the contents of the Certificate Policies extension from:

X509v3 Certificate Policies: 
    Policy: 2.23.140.1.2.1
    Policy: 1.3.6.1.4.1.44947.1.1.1
      CPS: http://cps.letsencrypt.org

to:

X509v3 Certificate Policies: 
    Policy: 2.23.140.1.2.1

Unfortunately, any certificate which was requested while the change was being rolled out could have its precertificate and certificate created with different configurations. For example, when Let's Encrypt issued the certificate with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71, the precertificate contained the new Certificate Policies extension, and the certificate contained the old Certificate Policies extension.

This had two consequences:

First, this certificate won't work in Chrome or Safari, because its SCTs are for a precertificate containing different data from the certificate. Specifically, the SCTs fail signature validation. When logs sign SCTs, they compute the signature over the data in the precertificate, and when browsers verify SCTs, they compute the signature over the data in the certificate. In this case, that data was not the same.

Second, remember how I said that precertificates are treated as incontrovertible proof that a certificate containing the same data exists? When Let's Encrypt issued a precertificate with the new Certificate Policies value, it implied that they also issued a certificate with the new Certificate Policies value. Thus, according to the Law of Precertificates, Let's Encrypt issued two certificates with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71:

A certificate containing the old Certificate Policies extension
A certificate containing the new Certificate Policies extension (implied by the existence of the precertificate with the new Certificate Policies extension)

Issuing two certificates with the same serial number is a violation of the Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates. Consequentially, Let's Encrypt must revoke the certificate and post a public incident report, which must be noted on their next audit statement.

You might think that it's harsh to treat this as a compliance incident if Let's Encrypt didn't really issue two certificates with the same serial number. Unfortunately, they have no way of proving this, and the whole reason for Certificate Transparency is so we don't have to take CAs at their word that they aren't issuing certificates that they shouldn't. Any exception to the Law of Precertificates creates an opening for a malicious CA to exploit.

How I found this

My company, SSLMate, operates a Certificate Transparency monitor called Cert Spotter, which continuously downloads and indexes the contents of every Certificate Transparency log. You can use Cert Spotter to get notifications when a certificate is issued for one of your domains, or search the database using a JSON API.

When Cert Spotter ingests a certificate containing embedded SCTs, it verifies each SCT's signature and audits that the log really published the precertificate. (If it detects that a log has broken its promise to publish a precertificate, I'll publicly disclose the SCT and the log will be distrusted. Happily, Cert Spotter has never found a bogus SCT, though it has detected logs violating other requirements.)

On Thursday, June 15, 2023 at 15:41 UTC, Cert Spotter began sending me alerts about certificates containing embedded SCTs with invalid signatures. Since I was getting hundreds of alerts, I decided to stop what I was doing and investigate.

I had received these alerts several times before, and have gotten pretty good at zeroing in on the problem. When only one SCT in a certificate has an invalid signature, it probably means that the CT log screwed up. When all of the embedded SCTs have an invalid signature, it probably means the CA screwed up. The most common reason is issuing certificates that don't match the precertificate. So I took one of the affected certificates and searched for precertificates containing the same serial number in Cert Spotter's database of every (pre)certificate ever logged to Certificate Transparency. Decoding the certificate and precertificate with the openssl command immediately revealed the different Certificate Policies extension.

Since I was continuing to get alerts from Cert Spotter about invalid SCT signatures, I quickly fired off an email to Let's Encrypt's problem reporting address alerting them to the problem.

I sent the email at 15:52 UTC. At 16:08, Let's Encrypt replied that they had paused issuance to investigate. Meanwhile, I filed a CA Certificate Compliance bug in Bugzilla, which is where Mozilla and Chrome track compliance incidents by publicly-trusted certificate authorities.

At 16:54, Let's Encrypt resumed issuance after confirming that they would not issue any more certificates with mismatched precertificates.

On Friday, June 16, 2023, Let's Encrypt emailed the subscribers of the affected certificates to inform them of the need to replace their certificates.

On Monday, June 19, 2023 at 18:00 UTC, Let's Encrypt revoked the 645 affected certificates, as required by the Baseline Requirements. This will cause the certificates to stop working in any client that checks revocation, but remember that these certificates were already being rejected by Chrome and Safari for having invalid SCTs.

On Tuesday, June 20, 2023, Let's Encrypt posted their public incident report, which explained the root cause of the incident and what they're doing to prevent it from happening again. Specifically, they plan to add a pre-issuance check that ensures certificates contain the same data as the precertificate.

Hundreds of websites are still serving broken certificates

I've been periodically checking port 443 of every DNS name in the affected certificates, and as of publication time, 261 certificates are still in use, despite not working in CT-enforcing or revocation-checking clients.

I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt. I thought that maybe these certificates were being used by API endpoints which are accessed by non-browser clients that don't enforce CT or check revocation, but this doesn't appear to be the case, as most of the DNS names are for bare domains or www subdomains. It's fortunate that Let's Encrypt issued only a small number of non-compliant certificates, because otherwise it would have broken a lot of websites.

There is a new standard under development called ACME Renewal Information which enables certificate authorities to inform ACME clients to renew certificates ahead of their normal expiration. Let's Encrypt supports ARI, and used it in this incident to trigger early renewal of the affected certificates. Clearly, more ACME clients need to add support for ARI.

This is my 50th CA compliance bug

It turns out this is the 50th CA compliance bug that I've filed in Bugzilla, and the 5th which was uncovered by Cert Spotter's SCT signature checks. Additionally, I reported a number of incidents before 2018 which didn't end up in Bugzilla.

Some of the problems I uncovered were quite serious (like issuing certificates without doing domain validation) and snowballed until the CA was ultimately distrusted. Most are minor in comparison, and ten years ago, no one would have cared about them: there was no Certificate Transparency to unearth non-compliant certificates, and even when someone did notice, the revocation requirement was not enforced, and CAs were not required to file incident reports or document the non-compliance on their next audit. Thankfully, that's no longer the case, and even compliance violations that seem minor are treated seriously, which has led to enormous improvements in the certificate ecosystem:

The improvements which certificate authorities make in response to seemingly-minor incidents also improve their compliance with the most security-critical rules.
TLS clients no longer need to work around non-standards-compliant certificates, which means they can be simpler. Simpler code is easier to make secure.
The way that CAs handle minor incidents can uncover much larger problems. Minor compliance problems are like "Brown M&M's".

Mozilla deserves enormous credit for being the first to require public incident reports from CAs, as does Google for creating and fostering Certificate Transparency.

You should monitor Certificate Transparency too

One limitation of my compliance monitoring is that I am generally only able to detect certificates that are intrinsically non-compliant, like those which violate encoding rules or are valid for too many days. While I do monitor certificates for domains that are likely to be abused, like example.com and test.com, I can't tell if a certificate issued for your domain is authorized or not. Only you know that.

Fortunately, it's pretty easy to monitor Certificate Transparency and get alerts when a certificate is issued for one of your domains. Cert Spotter has a standalone, open source version that's easy to set up. The paid version has additional features like expiration monitoring, Slack integration, and ways to filter alerts so you're not bothered about legitimate certificates. But most importantly, subscribing to the paid version helps me continue my compliance monitoring of the certificate authority ecosystem.

The Difference Between Root Certificate Authorities, Intermediates, and Resellers

2023-06-18T11:39:47Z

It happens every so often: some organization that sells publicly-trusted SSL certificates does something monumentally stupid, like generating, storing, and then intentionally disclosing all of their customers' private keys (Trustico), letting private keys be stolen via XSS (ZeroSSL), or most recently, literally exploiting remote code execution in ACME clients as part of their issuance process (HiCA aka QuantumCA).

When this happens, people inevitably refer to the certificate provider as a certificate authority (CA), which is an organization with the power to issue trusted certificates for any domain name. They fear that the integrity of the Internet's certificate authority system has been compromised by the "CA"'s incompetence. Something must be done!

But none of the organizations listed above are CAs - they just take certificate requests and forward them to real CAs, who validate the request and issue the certificate. The Internet is safe - from these organizations, at least.

In this post, I'm going to define terms like certificate authority, root CA, intermediate CA, and reseller, and explain whom you do and do not need to worry about.

Note that I'm going to talk only about publicly-trusted SSL certificate authorities - i.e those which are trusted by mainstream web browsers.

Certificate Authority Certificates

"Certificate authority" is a label that can apply both to SSL certificates and to organizations. A certificate is a CA if it can be used to issue certificates which will be accepted by browsers. There are two types of CA certificates: root and intermediate.

Root CA certificates, also known as "trust anchors" or just "roots", are shipped with your browser. If you poke around your browser's settings, you can find a list of root CA certificates.

Intermediate CA certificates, also known as "subordinate CAs" or just "intermediates", are certificates with the "CA" boolean set to true in the Basic Constraints extension, and which were issued by a root or another intermediate. If you decode an intermediate CA certificate with openssl, you'll see this in the extensions section:

X509v3 Basic Constraints: critical
    CA:TRUE

When you connect to a website, your browser has to verify that the website's certificate was issued by a CA certificate. If the website's certificate was issued by a root, it's easy because the browser already knows the public key needed to verify the certificate's signature. If the website's certificate was issued by an intermediate, the browser has to retrieve the intermediate from elsewhere, and then verify that the intermediate was issued by a CA certificate, recurring as necessary until it reaches a root. The web server can help the browser out by sending intermediate certificates along with the website's certificate, and some browsers are able to retrieve intermediates from the URL included in the certificate's AIA (Authority Information Access) field.

The purpose of this post is to discuss organizations, so for the rest of this post, when I say "certificate authority" or "CA" I am referring to an organization, not a certificate, unless otherwise specified.

Certificate Authority Organizations

An organization is a certificate authority if and only if they hold the private key for one or more certificate authority certificates. Holding the private key for a CA certificate is a big deal because it gives the organization the power to create certificates that are valid for any domain name on the Internet. These are essentially the Keys to the Internet.

Unfortunately, figuring out who holds a CA certificate's private key is not straightforward. CA certificates contain an organization name in their subject field, and many people reasonably assume that this must be the organization which holds the private key. But as I explained earlier this year, this is often not the case. I spend a lot of time researching the certificate ecosystem, and I am not exaggerating when I say that I completely ignore the organization name in CA certificates. It is useless. You have to look at other sources, like audit statements and the CCADB, to figure out who really holds a CA certificate's key. Consequentially, just because you see a company's name in a CA certificate, it does not mean that it has a key which can be used to create certificates.

Internal vs External Intermediate Certificates

CAs aren't allowed to issue website certificates directly from root certificates, so any CA which operates a root certificate is going to have to issue themselves at least one intermediate certificate. Since the CA retains control of the private key for the intermediate, these are called internally-operated intermediate certificates. Most intermediate certificates are internally-operated.

CAs are also able to issue intermediate certificates whose private key is held by another organization, making the other organization a certificate authority too. These are called externally-operated intermediate certificates, and there are two reasons they exist.

The more legitimate reason is that the other organization operates, or intends to operate, root certificates, and would like to issue certificates which work in older browsers whose trust stores don't include their roots. By getting an intermediate from a more-established CA, they can issue certificates which chain back to a root that is in more trust stores. This is called cross-signing, and a well-known example is Identrust cross-signing Let's Encrypt.

The less-savory reason is that the other organization would like to become a certificate authority without having to go through the onerous process of asking each browser to include them. Historically, this was a huge loophole which allowed organizations to become CAs with less oversight and thus less investment into security and compliance. Thankfully, this loophole is closing, and nowadays Mozilla and Chrome require CAs to obtain approval before issuing externally-operated intermediates to organizations which aren't already trusted CAs. I would not be surprised if browsers eventually banned the practice outright, leaving cross-signing as the only acceptable use case for externally-operated intermediates.

Certificate Resellers

There's no standard definition of certificate reseller, but I define it as any organization which provides certificates which they do not issue themselves. When someone requests a certificate from a reseller, the reseller forwards the request to a certificate authority, which validates the request and issues the certificate. The reseller has no access to the CA certificate's private key and no ability to cause issuance themselves. Typically, the reseller will use an API (ACME or proprietary) to request certificates from the CA. My company, SSLMate, is a reseller (though these days we mostly focus on Certificate Transparency monitoring).

The relationship between the reseller and the CA can take many forms. In some cases, the reseller may enter into an explicit reseller agreement with the CA and get access to special pricing or reseller-only APIs. In other cases, the reseller might obtain certificates from the CA using the same API and pricing available to the general public. The reseller might not even pay the CA - there's nothing stopping someone from acting like a reseller for free CAs like Let's Encrypt. For example, DNSimple provides paid certificates issued by Sectigo right alongside free certificates issued by Let's Encrypt. The CA might not know that their certificates are being resold - Let's Encrypt, for instance, allows anyone to create an anonymous account.

An organization may be a reseller of multiple CAs, or even be both a reseller and a CA. A reseller might not get certificates directly from a CA, but via a different reseller (SSLMate did this in the early days because it was the only way to get good pricing without making large purchase commitments).

A reseller might just provide certificates to customers, or they might use the certificates to provide a larger service, such as web hosting or a CDN (for example, Cloudflare).

White-Label / Branded Intermediate Certificates

Ideally, it would be easy to distinguish a reseller from a CA. However, most resellers are middlemen who provide no value over getting a certificate directly from a CA, and they don't want people to know this. So, for the right price, a CA will issue themselves an internally-operated intermediate CA certificate with the reseller's name in the organization field, and when the reseller requests a certificate, the CA will issue the certificate from the "branded" intermediate certificate instead of from an intermediate certificate with the CA's name in it.

The reseller does not have access to the private key of the branded intermediate certificate. Except for the name, everything about the branded intermediate certificate - like the security controls and the validation process - is exactly the same as the CA's non-branded intermediates. Thus, the mere existence of a branded intermediate certificate does not in any way affect the integrity of the certificate authority ecosystem, regardless of how untrustworthy, incompetent, or malicious the organization named in the certificate is.

What Should Worry You?

I spend a lot of time worrying about bad certificate authorities, but bad resellers don't concern me. A bad reseller can harm only their own customers, but a bad certificate authority can harm the entire Internet. Yeah, it sucks if you choose a reseller who screws you, but it's no different from choosing a bad hosting provider, domain registrar, or any of the myriad other vendors needed to host a website. As always, caveat emptor. In contrast, you can pick the best, most trustworthy certificate authority around, and some garbage CA you've never heard of can still issue an attacker a certificate for your domain. This is why web browsers tightly regulate CAs, but not resellers.

Unfortunately, this doesn't stop people from freaking out when a two-bit reseller with a branded intermediate (such as Trustico, ZeroSSL, or HiCA/QuantumCA) does something awful. People flood the mozilla-dev-security-policy mailing list to voice their concerns, including those who have little knowledge of certificates and have never posted there before. These discussions are a distraction from far more important issues. While the HiCA discussion was devolving into off-topic blather, the certificate authority Buypass disclosed in an incident report that they have a domain validation process which involves their employees manually looking up and interpreting CAA records. They are far from the only CA which does domain validation manually despite it being easy to automate, and it's troubling because having a human involved gives attackers an opening to exploit mistakes, bribery, or coercion to get unauthorized certificates for any domain. Buypass' incident response, which blames "human error" instead of addressing the root cause, won't make headlines on Hacker News or be shared on Twitter with the popcorn emoji, but it's actually far more concerning than the worst comments a reseller has ever made.

About the Author

I've reported over 50 certificate authority compliance issues, and uncovered evidence that led to the distrust of multiple certificate authorities and Certificate Transparency logs. My research has prompted changes to ACME and Mozilla's Root Store Policy. My company, SSLMate, offers a JSON API for searching Certificate Transparency logs and Cert Spotter, a service to monitor your domains for unauthorized, expiring, or incorrectly-installed certificates.

If you liked this post, you might like:

Stay Tuned: Let's Encrypt Incident

By popular demand, I will be blogging about how I found the compliance bug which prompted last week's Let's Encrypt downtime. Be sure to subscribe by email or RSS, or follow me on Mastodon or Twitter.

The SSL Certificate Issuer Field is a Lie

2023-01-18T14:39:00Z

A surprisingly hard, and widely misunderstood, problem with SSL certificates is figuring out what organization (called a certificate authority, or CA) issued a certificate. This information is useful for several reasons:

You've discovered an unauthorized certificate for your domain via Certificate Transparency logs and need to contact the certificate authority to get the certificate revoked.
You've discovered a certificate via Certificate Transparency and want to know if it was issued by one of your authorized certificate providers.
You're a researcher studying the certificate ecosystem and want to count how many certificates each certificate authority has issued.

On the surface, this looks easy: every certificate contains an issuer field with human-readable attributes, including an organization name. Problem solved, right?

Not so fast: a certificate's issuer field is frequently a lie that tells you nothing about the organization that really issued the certificate. Just look at the certificate chain currently served by doordash.com:

According to this, DoorDash's certificate was issued by an intermediate certificate belonging to "Cloudflare, Inc.", which was issued by a root certificate belonging to "Baltimore". Except Cloudflare is not a certificate authority, and Baltimore is a city.

In reality, both DoorDash's certificate and the intermediate certificate were issued by DigiCert, a name which is mentioned nowhere in the above chain. What's going on?

First, Cloudflare has paid DigiCert to create and operate an intermediate certificate with Cloudflare's name in it. DigiCert, not Cloudflare, controls the private key and performs the security-critical validation steps prior to issuance. All Cloudflare does is make an API call to DigiCert. Certificates issued from the "Cloudflare" intermediate are functionally no different from certificates issued from any of DigiCert's other intermediates. This type of white-labeling is common in the certificate industry, since it lets companies appear to be CAs without the expense of operating a CA.

Sidebar: that time everyone freaked out about Blue Coat

whoarethey: Determine Who Can Log In to an SSH Server

2023-01-10T21:02:07Z

Filippo Valsorda has a neat SSH server that reports the GitHub username of the connecting client. Just SSH to whoami.filippo.io, and if you're a GitHub user, there's a good chance it will identify you. This works because of two behaviors: First, GitHub publishes your authorized public keys at https://github.com/USERNAME.keys. Second, your SSH client sends the server the public key of every one of your key pairs.

Let's say you have three key pairs, FOO, BAR, and BAZ. The SSH public key authentication protocol works like this:

Client: Can I log in with public key FOO?
Server looks for FOO in ~/.ssh/authorized_keys; finds no match
Server: No
Client: Can I log in with public key BAR?
Server looks for BAR in ~/.ssh/authorized_keys; finds no match
Server: No
Client: Can I log in with public key BAZ?
Server looks for BAZ in ~/.ssh/authorized_keys; finds an entry
Server: Yes
Client: OK, here's a signature from private key BAZ to prove I own it

whoami.filippo.io works by taking each public key sent by the client and looking it up in a map from public key to GitHub username, which Filippo populated by crawling the GitHub API. If it finds a match, it tells the client the GitHub username:

Client: Can I log in with public key FOO?
Server looks up FOO, finds no match
Server: No
Client: Can I log in with public key BAR?
Server looks up BAR, finds no match
Server: No
Client: Can I log in with public key BAZ?
Server looks up BAZ, finds a match to user AGWA
Server: Aha, you're AGWA!

This works the other way as well: if you know that AGWA's public keys are FOO, BAR, and BAZ, you can send each of them to the server to see if the server accepts any of them, even if you don't know the private keys:

Client: Can I log in with public key FOO?
Server: No
Client: Can I log in with public key BAR?
Server: No
Client: Can I log in with public key BAZ?
Server: Yes
Client: Aha, AGWA has an account on this server!

This behavior has several implications:

If you've found a server that you suspect belongs to a particular GitHub user, you can confirm it by downloading their public keys and asking if the server accepts any of them.
If you want to find servers belonging to a particular GitHub user, you could scan the entire IPv4 address space asking each SSH server if it accepts any of the user's keys. This wouldn't work with IPv6, but scanning every IPv4 host is definitely practical, as shown by masscan and zmap.
If you've found a server and want to find out who controls it, you can try asking the server about every GitHub user's keys until it accepts one of them. I'm not sure how practical this would be; testing every GitHub user's keys would require sending an enormous amount of traffic to the server.

As a proof of concept, I've created whoarethey, a small Go program that takes the hostname:port of an SSH server, an SSH username, and a list of GitHub usernames, and prints out the GitHub username which is authorized to connect to the server. For example, you can try it on a test server of mine:

$ whoarethey 172.104.214.125:22 root github:AGWA github:FiloSottile github:AGWA

whoarethey reports that I, but not Filippo, can log into root@172.104.214.125.

You can also use whoarethey with public key files stored locally, in which case it prints the name of the public key file which is accepted:

$ whoarethey 172.104.214.125:22 root agwa.pub filosottile.pub agwa.pub

Note that just because a server accepts a key (or claims to accept a key), it doesn't mean that the holder of the private key authorized the server to accept it. I could take Filippo's public key and put it in my authorized_keys file, making it look like Filippo controls my server. Therefore, this information leak doesn't provide incontrovertible proof of server control.

Nevertheless, I think it's a useful way to deanonymize a server, and it concerns me much more than whoami.filippo.io. I only SSH to servers which already know who I am, and I'm not very worried about being tricked into connecting to a malicious server - it's not like the Web where it's trivial to make someone visit a URL. However, I do have accounts on a few servers which are not otherwise linkable to me, and it came as an unpleasant surprise that anyone would be able to learn that I have an account just by asking the SSH server.

The simplest way to thwart whoarethey would be for SSH servers to refuse to answer if a particular public key would be accepted, and instead make clients pick a private key and send the signature to the server. Although I don't know of any SSH servers that can be configured to do this, it could be done within the bounds of the current SSH protocol. The user experience would be the same for people who use a single key per client, which I assume is the most common configuration. Users with multiple keys would need to tell their client which key they want to use for each server, or the client would have to try every key, which might require the user to enter a passphrase or press a physical button for each attempt. (Note that to prevent a timing leak, the server should verify the signature against the public key provided by the client before checking if the public key is authorized. Otherwise, whoarethey could determine if a public key is authorized by sending an invalid signature and measuring how long it takes the server to reject it.)

There's a more complicated solution (requiring protocol changes and fancier cryptography) that leverages private set intersection to thwart both whoarethey and whoami.filippo.io. However, it treats SSH keys as encryption keys instead of signing keys, so it wouldn't work with hardware-backed keys like the YubiKey. And it requires the client to access the private key for every key pair, not just the one accepted by the server, so the user experience for multi-key users would be just as bad as with the simple solution.

Until one of the above solutions is implemented, be careful if you administer any servers which you don't want linked to you. You could use unique key pairs for such servers, or keep SSH firewalled off from the Internet and connect over a VPN. If you do use a unique key pair, make sure your SSH client never tries to send it to other servers - a less benign version of whoami.filippo.io could save the public keys that it sees, and then feed them to whoarethey to find your servers.

No, Google Did Not Hike the Price of a .dev Domain from $12 to $850

2022-12-12T16:02:55Z

It was perfect outrage fodder, quickly gaining hundreds of upvotes on Hacker News:

As you know, domain extensions like .dev and .app are owned by Google. Last year, I bought the http://forum.dev domain for one of our projects. When I tried to renew it this year, I was faced with a renewal price of $850 instead of the normal price of $12.

It's true that most .dev domains are just $12/year. But this person never paid $12 for forum.dev. According to his own screenshots, he paid 4,360 Turkish Lira for the initial registration on December 6, 2021, which was $317 at the time. So yes, the price did go up, but not nearly as much as the above comment implied.

According to a Google worker, this person should have paid the same, higher price in 2021, since forum.dev is a "premium" domain, but got an extremely favorable exchange rate so he ended up paying less. That's unsurprising for a currency which is experiencing rampant inflation.

Nevertheless, domain pricing has become quite confusing in recent years, and when reading the ensuing Hacker News discussion, I learned that a lot of people have some major misconceptions about how domains work. Multiple people said untrue or nonsensical things along the lines of "Google has a monopoly on the .dev domain. GoDaddy doesn't have a monopoly on .com, .biz, .net, etc." So I decided to write this blog post to explain some basic concepts and demystify domain pricing.

Registries vs Registrars

If you want to have an informed opinion about domains, you have to understand the difference between registries and registrars.

Every top-level domain (.com, .biz, .dev, etc.) is controlled by exactly one registry, who is responsible for the administration of the TLD and operation of the TLD's DNS servers. The registry effectively owns the TLD. Some registries are:

.com	Verisign
.biz	GoDaddy
.dev	Google

Registries do not sell domains directly to the public. Instead, registrars broker the transaction between a domain registrant and the appropriate registry. Registrars include Gandi, GoDaddy, Google, Namecheap, and name.com. Companies can be both registries and registrars (Technically, they're required to be separate legal entities; e.g. Google's registry is the wholly-owned subsidiary Charleston Road Registry): e.g. GoDaddy and Google are registrars for many TLDs, but registries for only some TLDs.

When you buy or renew a domain, the bulk of your registration fee goes to the registry, with the registrar adding some markup. Additionally, 18 cents goes to ICANN (Internet Corporation for Assigned Names and Numbers), who is in charge of the entire domain system.

For example, Google's current .com price of $12 is broken down as follows:

$0.18	ICANN fee
$8.97	Verisign's registry fee
$2.85	Google's registrar markup

Registrars typically carry domains from many different TLDs, and TLDs are typically available through multiple registrars. If you don't like your registrar, you can transfer your domain to a different one. This keeps registrar markup low. However, you'll always be stuck with the same registry. If you don't like their pricing, your only recourse is to get a whole new domain with a different TLD, which is not meaningful competition.

At the registry level, it's not true that there is no monopoly on .com - Verisign has just as much of a monopoly on .com as Google has on .dev.

At the registrar level, Google holds no monopoly over .dev - you can buy .dev domains through registrars besides Google, so you can take your business elsewhere if you don't like the Google registrar. Of course, the bulk of your fee will still go to Google, since they're the registry.

ICANN Price Controls

So if .com is just as monopoly-controlled as .dev, why are all .com domains the same low price? Why are there no "premium" domains like with .dev?

It's not because Verisign is scared by the competition, since there is none. It's because Verisign's contract with ICANN is different from Google's contract with ICANN.

The .com registry agreement between Verisign and ICANN capped the price of .com domains at $7.85 in 2020, with at most a 7% increase allowed every year. Verisign has since imposed two 7% price hikes, putting the current price at $8.97.

In contrast, .dev is governed by ICANN's standard registry agreement, which has no price caps. It does, however, forbid "discriminatory" renewal pricing:

In addition, Registry Operator must have uniform pricing for renewals of domain name registrations ("Renewal Pricing"). For the purposes of determining Renewal Pricing, the price for each domain registration renewal must be identical to the price of all other domain name registration renewals in place at the time of such renewal, and such price must take into account universal application of any refunds, rebates, discounts, product tying or other programs in place at the time of renewal. The foregoing requirements of this Section 2.10(c) shall not apply for (i) purposes of determining Renewal Pricing if the registrar has provided Registry Operator with documentation that demonstrates that the applicable registrant expressly agreed in its registration agreement with registrar to higher Renewal Pricing at the time of the initial registration of the domain name following clear and conspicuous disclosure of such Renewal Pricing to such registrant, and (ii) discounted Renewal Pricing pursuant to a Qualified Marketing Program (as defined below). The parties acknowledge that the purpose of this Section 2.10(c) is to prohibit abusive and/or discriminatory Renewal Pricing practices imposed by Registry Operator without the written consent of the applicable registrant at the time of the initial registration of the domain and this Section 2.10(c) will be interpreted broadly to prohibit such practices.

This means that Google is only allowed to increase a domain's renewal price if it also increases the renewal price of all other domains. If Google wants to charge more to renew a "premium" domain, the higher price must be clearly and conspicuously disclosed to the registrant at time of initial registration. This prevents Google from holding domains hostage: they can't set a low price and later increase it after your domain becomes popular.

(By displaying prices in lira instead of USD for forum.dev, did Google violate the "clear and conspicuous" disclosure requirement? I'm not sure, but if I were a registrar I would display prices in the currency charged by the registry to avoid misunderstandings like this.)

I wouldn't assume that the .com price caps will remain forever. .org used to have price caps too, before switching to the standard registry agreement in 2019. But even if .com switched to the standard agreement, we probably wouldn't see "premium" .com domains: at this point, every .com domain which would be considered "premium" has already been registered. And Verisign wouldn't be allowed to increase the renewal price of already-registered domains due to the need for disclosure at the time of initial registration.

There ain't no rules for ccTLDs (.io, .tv, .au, etc.)

It's important to note that registries for country-code TLDs (which is every 2-letter TLD) do not have enforceable registry agreements with ICANN. Instead, they are governed by their respective countries (or similar political entities), which can do as they please. They can sucker you in with a low price and then hold your domain hostage when it gets popular. If you register your domain in a banana republic because you think the TLD looks cool, and el presidente wants your domain to host his cat pictures, tough luck.

This is only scratching the surface of what's wrong with ccTLDs, but that's a topic for another blog post. Suffice to say, I do not recommend using ccTLDs unless all of the following are true:

You live in the country which owns the ccTLD and don't plan on moving.
You don't expect the region where you live to secede from the political entity which owns the ccTLD. (Just ask the British citizens who had .eu domains.)
You trust the operator of the ccTLD to be fair and competent.

Further Complications

To make matters more confusing, sometimes when you buy a domain from a registrar, you're not getting it from the registry, but from an existing owner who is squatting the domain. In this case, you pay a large upfront cost to get the squatter to transfer the domain to you, after which the domain renews at the lower, registry-set price. It used to be fairly obvious when this was happening, as you'd transact directly with the squatter, but now several registrars will broker the transaction for you. The Google registrar calls these "aftermarket" domains, which I think is a good name, but other registrars call them "premium" domains, which is confusing because such domains may or may not be considered "premium" by the registry and subject to higher renewal prices.

Yet another confounding factor is that registrars sometimes steeply discount the initial registration fee, taking a loss in the hope of making it up with renewals and other services.

To sum up, there are multiple scenarios you may face when buying a domain:

Scenario	Initial Fee	Renewal Fee
Non-premium domain, no discount	$$	$$
Non-premium domain, first year discount	$	$$
Premium domain, no discount	$$$	$$$
Premium domain, first year discount	$$	$$$
Aftermarket non-premium domain	$$$$	$$
Aftermarket premium domain	$$$$	$$$
ccTLD domain	Varies	Sky's the limit!

I was curious how different registrars distinguish between these cases, so I tried searching for the following domains at Gandi, GoDaddy, Google, Namecheap, and name.com:

safkjfkjfkjfdkjdfkj.com and nonpremium.online - decidedly non-premium domains
8b.dev - premium domain
dnsarchive.com - aftermarket domain

Gandi

Non-premium domain, no discount:

Non-premium domain, first year discount:

Premium domain:

Gandi does not seem to sell aftermarket domains.

GoDaddy

Non-premium domain, first year discount:

Premium domain:

Aftermarket domain:

Google

Non-premium domain, no discount:

Premium domain:

Aftermarket domain:

Namecheap

Non-premium domain, first year discount:

Premium domain:

Aftermarket domain:

name.com

Non-premium domain, first year discount:

Premium domain:

Aftermarket domain:

Thoughts

I think Gandi and Google do the best job conveying the first year and renewal prices using clear and consistent UI. Unfortunately, since the publication of this post, Gandi has been acquired by another company, and Google is discontinuing their registrar service. Namecheap is the worst, only showing a clear renewal price when it's less than the initial price, but obscuring it when it's the same or higher (note the use of the term "Retail" instead of "Renews at", and the lack of a "/yr" suffix for the 8b.dev price). name.com also obscures the renewal price for nonpremium.online (I very much doubt it's $1.99). GoDaddy also fails to show a clear renewal price for the non-premium domains, but at least says the quoted price is "for the first year."

My advice is to pay very close attention to the renewal price when buying a domain, because it may be the same, lower, or higher than the first year's fee. And be very wary of 2-letter TLDs (ccTLDs).

Checking if a Certificate is Revoked: How Hard Can It Be?

2022-12-01T15:00:09Z

This wasn't my first rodeo so I knew it would be hard. And I was right! The only question was what flavor of dysfunction I'd be encountering.

SSLMate's Certificate Transparency Search API now returns two new fields that tell you if, why, and when the certificate was revoked:

"revoked":true,
"revocation":{"time":"2021-10-27T21:38:48Z","reason":0,"checked_at":"2022-10-18T14:49:56Z"},

(See the complete API response)

This simple-sounding feature was obnoxious to implement, and required dealing with some amazingly creative screwups by certificate authorities, and a clunky system called the Common CA Database that's built on Salesforce. Just how dysfunctional is the WebPKI? Buckle up and find out!

Background on Certificate Revocation

There are two ways for a CA to publish that a certificate is revoked: the online certificate status protocol (OCSP), and certificate revocation lists (CRLs).

With OCSP, you send an HTTP request to the CA's OCSP server asking, "hey is the certificate with this serial number revoked?" and the CA is supposed to respond "yeah" or "nah", but often responds with "I dunno" or doesn't respond at all. CAs are required to support OCSP, and it's easy to find a CA's OCSP server (the URL is included in the certificate itself) but I didn't want to use it for the CT Search API: each API response can contain up to 100 certificates, so I'd have to make up to 100 OCSP requests just to build a single response. Given the slowness and unreliability of OCSP, that was a no go.

With CRLs, the CA publishes one or more lists of all revoked serial numbers. This would be much easier to deal with: I could write a cron job to download every CRL, insert all the entries into my trusty PostgreSQL database, and then building a CT Search API response would be as simple as JOINing with the crl_entry table!

Historically, CRLs weren't an option because not all CAs published CRLs, but on October 1, 2022, both Mozilla and Apple began requiring all CAs in their root programs to publish CRLs. Even better, they required CAs to disclose the URLs of their CRLs in the Common CA Database (CCADB), which is available to the public in the form of a chonky CSV file. Specifically, two new columns were added to the CSV: "Full CRL Issued By This CA", which is populated with a URL if the CA publishes a single CRL, and "JSON Array of Partitioned CRLs", which is populated with a JSON array of URLs if the CA splits its list of revoked certificates across multiple CRLs.

So I got to work writing a cron job in Go that would 1) download and parse the CCADB CSV file to determine the URL of every CRL 2) download, parse, and verify every CRL and 3) insert the CRL entries into PostgreSQL.

How hard could this be?

This wasn't my first rodeo so I knew it would be hard. And I was right! The only question was what flavor of dysfunction I'd be encountering.

CCADB Sucks

The CCADB is a database run by Mozilla that contains information about publicly-trusted certificate authorities. The four major browser makers (Mozilla, Apple, Chrome, and Microsoft) use the CCADB to keep track of the CAs which are trusted by their products.

CCADB could be a fairly simple CRUD app, but instead it's built on Salesforce, which means it's actual crud. CAs use a ~~clunky~~ enterprise-grade UI to update their information, such as to disclose their CRLs. Good news: there's an API. Bad news: here's how to get API credentials:

Salesforce will redirect to the callback url (specified in 'redirect_uri'). Quickly freeze the loading of the page and look in the browser address bar to extract the 'authorization code', save the code for the next steps.

To make matters worse, CCADB's data model is wrong (it's oriented around certificates rather than subject+key) which means the same information about a CA needs to be entered in multiple places. There is very little validation of anything a CA inputs. Consequentially, the information in the CCADB is often missing, inconsistent, or just flat out wrong.

In the "Full CRL Issued By This CA" column, I saw:

URLs without a protocol
Multiple URLs
The strings "expired" and "revoked"

Meanwhile, the data for "JSON Array of Partitioned CRLs" could be divided into three categories:

The empty array ([]).
A comma-separated list of URLs, with no square brackets or quotes.
A comma-separated list of URLs, with square brackets but without quotes.

In other words, the only well-formed JSON in sight was the empty array.

Initially, I assumed that CAs didn't know how to write non-trivial JSON, because that seems like a skill they would struggle with. Turned out that Salesforce was stripping quotes from the CSV export. OK, CAs, it's not your fault this time. (Well, except for the one who left out square brackets.) But don't get too smug, CAs - we haven't tried to download your CRLs yet.

(The CSV was eventually fixed, but to unblock my progress I had to parse this column with a mash of strings.Trim and strings.Split. Even Mozilla had to resort to such hacks to parse their own CSV file.)

CAs Suck

Once I got the CCADB CSV parsed successfully, it was time to download some CRLs! Surely, this would be easy - even though CRLs weren't mandatory before October 1, the vast majority of CAs had been publishing CRLs for years, and plenty of clients were already consuming them. Surely, any problems would have been discovered and fixed by now, right?

Ah hah hah hah hah.

I immediately ran into some fairly basic issues, like Amazon's CRLs returning a 404 error, D-TRUST encoding CRLs as PEM instead of DER, or Sectigo disclosing a CRL with a non-existent hostname because they forgot to publish a DNS record, as well as some more... interesting issues:

GoDaddy

Since root certificate keys are kept offline, CRLs for root certificates have to be generated manually during a signing ceremony. Signing ceremonies are extremely serious affairs that involve ~~donning ceremonial robes,~~ entering a locked cage, pulling a laptop out of a safe, and manually running openssl commands based on a script - and not the shell variety, but the reams of dead tree variety. Using the openssl command is hell in the best of circumstances - now imagine doing it from inside a cage. The smarter CAs write dedicated ceremony tooling instead of using openssl. The rest bungle ceremonies on the regular, as GoDaddy did here when they generated CRLs with an obsolete version number and missing a required extension, which consequentially couldn't be parsed by Go. To GoDaddy's credit, they are now planning to switch to dedicated ceremony tooling. Sometimes things do get better!

GlobalSign

Instead of setting the CRL signature algorithm based on the algorithm of the issuing CA's key, GlobalSign was setting it based on the algorithm of the issuing CA's signature. So when an elliptic curve intermediate CA was signed by an RSA root CA, the intermediate CA would produce CRLs that claimed to have RSA signatures even though they were really elliptic curve signatures.

After receiving my report, GlobalSign fixed their logic and added a test case.

Google Trust Services

Here is the list of CRL revocation reason codes defined by RFC 5280:

CRLReason ::= ENUMERATED {
        unspecified             (0),
        keyCompromise           (1),
        cACompromise            (2),
        affiliationChanged      (3),
        superseded              (4),
        cessationOfOperation    (5),
        certificateHold         (6),
             -- value 7 is not used
        removeFromCRL           (8),
        privilegeWithdrawn      (9),
        aACompromise           (10) }

And here is the protobuf enum that Google uses internally for revocation reasons:

enum RevocationReason {
        UNKNOWN              = 0;
        UNSPECIFIED          = 1;
        KEYCOMPROMISE        = 2;
        CACOMPROMISE         = 3;
        AFFILIATIONCHANGED   = 4;
        SUPERSEDED           = 5;
        CESSATIONOFOPERATION = 6;
        CERTIFICATEHOLD      = 7;
        PRIVILEGEWITHDRAWN   = 8;
        AACOMPROMISE         = 9;
}

As you can see, the reason code for unspecified is 0, and the protobuf enum value for unspecified is 1. The reason code for keyCompromise is 1 and the protobuf enum value for keyCompromise is 2. Therefore, by induction, all reason codes are exactly one less than the protobuf enum value. QED.

That was the logic of Google's code, which generated CRL reason codes by subtracting one from the protobuf enum value, instead of using a lookup table or switch statement. Of course, when it came time to revoke a certificate for the reason "privilegeWithdrawn", this resulted in a reason code of 7, which is not a valid reason code. Whoops.

At least this bug only materialized a few months ago, unlike most of the other CAs mentioned here, who had been publishing busted CRLs for years.

After receiving my report, Google fixed the CRL and added a test case, and will contribute to CRL linting efforts.

Conclusion

There are still some problems that I haven't investigated yet, but at this point, SSLMate knows the revocation status of the vast majority of publicly-trusted SSL certificates, and you can access it with just a simple HTTP query.

If you need to programmatically enumerate all the SSL certificates for a domain, such as to inventory your company's SSL certificates, then check out SSLMate's Certificate Transparency Search API. I don't know of any other service that pulls together information from over 40 Certificate Transparency logs and 3,500+ CRLs into one JSON API that's queryable by domain name. Best of all, I stand between you and all the WebPKI's dysfunction, so you can work on stuff you actually like, instead of wrangling CSVs and debugging CRL parsing errors.

Parsing a TLS Client Hello with Go's cryptobyte Package

2022-05-18T13:17:41Z

In my original post about SNI proxying, I showed how you can parse a TLS Client Hello message (the first message that the client sends to the server in a TLS connection) in Go using an amazing hack that involves calling tls.Server with a read-only net.Conn wrapper and a GetConfigForClient callback that saves the tls.ClientHelloInfo argument. I'm using this hack in snid, and if you accessed this blog post over IPv4, it was used to route your connection.

However, it's pretty gross, and only gives me access to the parts of the Client Hello message that are exposed in the tls.ClientHelloInfo struct. So I've decided to parse the Client Hello properly, using the golang.org/x/crypto/cryptobyte package, which is a great library that makes it easy parse length-prefixed binary messages, such as those found in TLS.

cryptobyte was added to Go's quasi-standard x/crypto library in 2017. Since then, more and more parts of Go's TLS and X.509 libraries have been updated to use cryptobyte for parsing, often leading to significant performance gains.

In this post, I will show you how to use cryptobyte to parse a TLS Client Hello message, and introduce https://tlshello.agwa.name, an HTTP server that returns a JSON representation of the Client Hello message sent by your client.

Using cryptobyte

The cryptobyte parser is centered around the cryptobyte.String type, which is just a slice of bytes that points to the message that you are parsing:


type String []byte

cryptobyte.String contains methods that read a part of the message and advance the slice to point to the next part.

For example, let's say you have a message consisting of a variable-length string prefixed by a 16-bit big-endian length, followed by a 32-bit big-endian integer:

len(name)		name						id
`00`	`06`	`'A'`	`'n'`	`'d'`	`'r'`	`'e'`	`'w'`	`00`	`5A`	`F6`	`0C`

First, you create a cryptobyte.String variable, message, which points to the above bytes.

Then, to read the name, you use ReadUint16LengthPrefixed:

var name cryptobyte.String
message.ReadUint16LengthPrefixed(&name)

ReadUint16LengthPrefixed reads two things. First, it reads the 16-bit length. Second, it reads the number of bytes specified by the length. So, after the above function call, name points to the 6 byte string "Andrew", and message is mutated to point to the remaining 4 bytes containing the ID.

To read the ID, you use ReadUint32:

var id uint32
message.ReadUint32(&id)

After this call, id contains 5961228 (0x5AF60C) and message is empty.

Note that cryptobyte.String's methods return a bool indicating if the read was successful. In real code, you'd want to check the return value and return an error if necessary.

It's also a good idea to call Empty to make sure that the string is really empty at the end, so you can detect and reject trailing garbage.

cryptobyte.String's methods are generally zero-copy. In the above example, name will point to the same memory region which message originally pointed to. This makes cryptobyte very efficient.

Parsing the TLS Client Hello

Let's write a function that takes the bytes of a TLS Client Hello handshake message as input, and returns a struct with info about the TLS handshake:


func UnmarshalClientHello(handshakeBytes []byte) *ClientHelloInfo

We start by constructing a cryptobyte.String from handshakeBytes:


handshakeMessage := cryptobyte.String(handshakeBytes)

For guidance, we turn to Section 4 of RFC 8446, which describes TLS 1.3's handshake protocol.

Here's the definition of a handshake message: struct { HandshakeType msg_type; /* handshake type */ uint24 length; /* remaining bytes in message */ select (Handshake.msg_type) { case client_hello: ClientHello; case server_hello: ServerHello; case end_of_early_data: EndOfEarlyData; case encrypted_extensions: EncryptedExtensions; case certificate_request: CertificateRequest; case certificate: Certificate; case certificate_verify: CertificateVerify; case finished: Finished; case new_session_ticket: NewSessionTicket; case key_update: KeyUpdate; }; } Handshake;

The first field in the message is a HandshakeType, which is an enum defined as:

enum {
    client_hello(1),
    server_hello(2),
    new_session_ticket(4),
    end_of_early_data(5),
    encrypted_extensions(8),
    certificate(11),
    certificate_request(13),
    certificate_verify(15),
    finished(20),
    key_update(24),
    message_hash(254),
    (255)
} HandshakeType;

According to the above definition, a Client Hello message has a value of 1. The last entry of the enum specifies the largest possible value of the enum. In TLS, enums are transmitted as a big-endian integer using the smallest number of bytes needed to represent the largest possible enum value. That's 255, so HandshakeType is transmitted as an 8-bit integer. Let's read this integer and verify that it's 1:

var messageType uint8
if !handshakeMessage.ReadUint8(&messageType) || messageType != 1 {
	return nil
}

The second field, length, is a 24-bit integer specifying the number of bytes remaining in the message.

The third and last field depends on the type of handshake message. Since it's a Client Hello message, it has type ClientHello.

Let's read these two fields using ReadUint24LengthPrefixed and then make sure there are no bytes remaining in handshakeMessage:

var clientHello cryptobyte.String
if !handshakeMessage.ReadUint24LengthPrefixed(&clientHello) || !handshakeMessage.Empty() {
	return nil
}

clientHello now points to the bytes of the ClientHello structure, which is defined in Section 4.1.2 as follows:

struct {
    ProtocolVersion legacy_version;
    Random random;
    opaque legacy_session_id<0..32>;
    CipherSuite cipher_suites<2..2^16-2>;
    opaque legacy_compression_methods<1..2^8-1>;
    Extension extensions<8..2^16-1>;
} ClientHello;

The first field is legacy_version, whose type is defined as a 16-bit integer:


uint16 ProtocolVersion;

To read it, we do:

var legacyVersion uint16
if !clientHello.ReadUint16(&legacyVersion) {
	return nil
}

Next, random, whose type is defined as:

opaque Random[32];

That means it's an opaque sequence of exactly 32 bytes. To read it, we do:

var random []byte
if !clientHello.ReadBytes(&random, 32) {
	return nil
}

Next, legacy_session_id. Like random, it is an opaque sequence of bytes, but this time the RFC specifies the length as a range, <0..32>. This syntax means it's a variable-length sequence that's between 0 and 32 bytes long, inclusive. In TLS, the length is transmitted just before the byte sequence as a big-endian integer using the smallest number of bytes necessary to represent the largest possible length. In this case, that's one byte, so we can read legacy_session_id using ReadUint8LengthPrefixed:

var legacySessionID []byte
if !clientHello.ReadUint8LengthPrefixed((*cryptobyte.String)(&legacySessionID)) {
	return nil
}

Now we're on to cipher_suites, which is where things start to get interesting. As with legacy_session_id, it's a variable-length sequence, but rather than being a sequence of bytes, it's a sequence of CipherSuites, which is defined as a pair of 8-bit integers:

uint8 CipherSuite[2];

In TLS, the length of the sequence is specified in bytes, rather than number of items. For cipher_suites, the largest possible length is just shy of 2^16, which means a 16-bit integer is used, so we'll use ReadUint16LengthPrefixed to read the cipher_suites field:

var ciphersuitesBytes cryptobyte.String
if !clientHello.ReadUint16LengthPrefixed(&ciphersuitesBytes) {
	return nil
}

Now we can iterate to read each item:

for !ciphersuitesBytes.Empty() {
	var ciphersuite uint16
	if !ciphersuitesBytes.ReadUint16(&ciphersuite) {
		return nil
	}
	// do something with ciphersuite, like append to a slice
}

Next, legacy_compression_methods, which is similar to legacy_session_id:

var legacyCompressionMethods []uint8
if !clientHello.ReadUint8LengthPrefixed((*cryptobyte.String)(&legacyCompressionMethods)) {
	return nil
}

Finally, we reach the extensions field, which is another variable-length sequence, this time containing the Extension struct, defined as:

struct {
	ExtensionType extension_type;
	opaque extension_data<0..2^16-1>;
} Extension;

ExtensionType is an enum with maximum value 65535 (i.e. a 16-bit integer).

As with cipher_suites, we read all the bytes in the field into a cryptobyte.String:

var extensionsBytes cryptobyte.String
if !clientHello.ReadUint16LengthPrefixed(&extensionsBytes) {
	return nil
}

Since this is the last field, we want to make sure clientHello is now empty:

if !clientHello.Empty() {
	return nil
}

Now we can iterate to read each Extension item:

for !extensionsBytes.Empty() {
	var extType uint16
	if !extensionsBytes.ReadUint16(&extType) {
		return nil
	}
	var extData cryptobyte.String
	if !extensionsBytes.ReadUint16LengthPrefixed(&extData) {
		return nil
	}
	// Parse extData according to extType
}

And that's it! You can see working code, including parsing of several common extensions, in my tlshacks package.

tlshello.agwa.name

To test this out, I wrote an HTTP server that returns a JSON representation of the Client Hello. This is rather handy for checking what ciphers and extensions a client supports. You can check out what your client's Client Hello looks like at https://tlshello.agwa.name.

Making the Client Hello message available to an HTTP handler required some gymnastics, including writing a net.Conn wrapper struct that peeks at the first TLS handshake message and saves it in the struct, and then a ConnContext callback that grabs the saved message out of the wrapper struct and makes it available in the request's context. You can read the code if you're curious.

I'm happy to say that deploying this HTTP server was super easy thanks to snid. This service cannot run behind an HTTP reverse proxy - it has to terminate the TLS connection itself. Without snid, I would have needed to use a dedicated IPv4 address.

How I'm Using SNI Proxying and IPv6 to Share Port 443 Between Webapps

2022-04-15T12:06:05Z

My preferred method for deploying webapps is to have the webapp listen directly on port 443, without any sort of standalone web server or HTTP reverse proxy in front. I have had it with standalone web servers: they're all over-complicated and I always end up with an awkward bifurcation of logic between my app's code and the web server's config. Meanwhile, my preferred language, Go, has a high-quality, memory-safe HTTPS server in the standard library that is well suited for direct exposure on the Internet.

However, only one process at a time can listen on a given IP address and port number. In a world of ubiquitous IPv6, this wouldn't be a problem - each of my servers has literally trillions of IPv6 addresses, so I could easily dedicate one IPv6 address per webapp. Unfortunately, IPv6 is not ubiquitous, and due to the shortage of IPv4 addresses, it would be too expensive to give each app its own IPv4 address.

The conventional solution for this problem is HTTP reverse proxying, but I want to do better. I want to be able to act like IPv6 really is ubiquitous, but continue to support IPv4-only clients with a minimum amount of complexity and mental overhead. To accomplish this, I've turned to SNI-based proxying.

I've written about SNI proxying before, but in a nutshell: a proxy server can use the first message in a TLS connection (the Client Hello message, which is unencrypted and contains the server name (SNI) that the client wants to connect to) to decide where to route the connection. Here's how I'm using it:

My webapps listen on port 443 of a dedicated IPv6 address. They do not listen on IPv4.
Each of my servers runs snid, a Go daemon which listens on port 443 of the server's single IPv4 address.
When snid receives a connection, it peeks at the first TLS message to get the desired server name. It does a DNS lookup for the server name's IPv6 address, and proxies the TCP connection there. To prevent snid from being used as an open proxy, snid only forwards the connection if the IPv6 address in within my server's IPv6 range.
The AAAA record for a webapp is the dedicated IPv6 address, and the A record is the shared IPv4 address. Thus, IPv6 clients connect directly to the webapp, while IPv4 clients are proxied via snid.

Preserving the Client's IP Address

One of the headaches caused by proxies is that the backend doesn't see the client's IP address - connections appear to be from the proxy server instead. With HTTP proxying, this problem is typically solved by stuffing the client's IP address in a header field, which is a minefield of security problems that allow client IP addresses to be spoofed if you're not careful. With TCP proxying, a common solution is to use the PROXY protocol, which puts the client's IP address at the beginning of the proxied connection. However, this requires backends to understand the PROXY protocol.

snid can do better. Since IPv6 addresses are 128 bits long, but IPv4 addresses are only 32 bits, it's possible to embed IPv4 addresses in IPv6 addresses. snid embeds the client's IP address in the lower 32 bits of the source address which it uses to connect to the backend. It's trivial for the backend to translate the source address back to the IPv4 address, but this is purely a user interface concern. If a backend doesn't do the translation, it's possible for the human operator to do the translation manually when viewing log entries, configuring access control, etc.

For the IPv6 prefix, I use 64:ff9b:1::/48, which is a non-publicly-routed prefix reserved for IPv4/IPv6 translation mechanisms. For example, the IPv4 address 192.0.2.10 translates to:

64:ff9b:1::c000:20a

Conveniently, it can also be written using embedded IPv4 notation:

64:ff9b:1::192.0.2.10

O(1) Config

snid's configuration is just a few command line arguments. Here's the command line for the instance of snid that's running on the server that serves www.agwa.name and src.agwa.name:

snid -listen tcp:18.220.42.202:443 -mode nat46 -nat46-prefix 64:ff9b:1:: -backend-cidr 2600:1f16:719:be00:5ba7::/80

-listen tells snid to listen on port 443 of 18.220.42.202, which is the IPv4 address for www.agwa.name and src.agwa.name. -mode nat46 tells snid to forward connections over IPv6, with the source IPv4 address embedded using the prefix specified by -nat46-prefix. -backend-cidr tells snid to only forward connections to addresses within the 2600:1f16:719:be00:5ba7::/80 subnet, which includes the IPv6 addresses for www.agwa.name and src.agwa.name (2600:1f16:719:be00:5ba7::2 and 2600:1f16:719:be00:5ba7::1, respectively).

The best thing about snid's configuration is that I only have to touch it once. I don't have to change it when deploying new webapps. Deploying a new webapp only requires assigning it an IPv6 address and publishing DNS records for it, just like it would be in my dream world of ubiquitous IPv6. I call this O(1) configuration since it doesn't get longer or more complex with the number of webapps I run.

Guaranteed Secure

HTTP reverse proxying is a minefield of security concerns. In addition to the IP address spoofing problems discussed above, you have to contend with request smuggling and HTTP desync vulnerabilities. This is a class of vulnerability that will never truly be solved: you can patch vulnerabilities as they're discovered, but thanks to the inherent ambiguity in parsing HTTP, you can never be sure there won't be more.

I don't have to worry about any of this with snid. Since snid doesn't decrypt the TLS connection (and lacks the necessary keys to do so), proxying with snid is guaranteed to be secure as long as TLS is secure. It can harm security no more than any untrusted router on the Internet can. This helps put snid out of my mind so I can forget that it even exists.

Compatible with ACME ALPN

Since ACME's TLS-ALPN challenge uses SNI to convey the hostname being validated, snid will forward TLS-ALPN requests from the certificate authority to the appropriate backend. Automatic certificate acquisition, such as with Go's autocert package, Just Works.

What About Encrypted Client Hello?

Since the SNI hostname is in plaintext, a network eavesdropper can determine what hostname a client is connecting to. This is bad for privacy and censorship resistance, so there is an effort underway to encrypt not just the SNI hostname, but the entire Client Hello message. How does this affect snid?

First, it's important to note that the destination IP address in the IP header is always going to be unencrypted, so by putting my webapps on different IPv6 addresses, I'm giving eavesdroppers the ability to find out which webapp clients are connecting to, regardless of SNI. However, a single webapp might handle multiple hostnames, and I'd like to hide the specific hostname from eavesdroppers, so Encrypted Client Hello still has some value. Fortunately, Encrypted Client Hello works with snid.

Encrypted Client Hello doesn't actually encrypt the initial Client Hello message. It's still sent in the clear, but with a decoy SNI hostname. The actual Client Hello message, with the true SNI hostname, is encrypted and placed in an extension of the unencrypted Client Hello. To make Encrypted Client Hello work with snid, I just need to ensure that the decoy SNI hostname resolves to the IPv6 address of the backend server. snid will see this hostname and route the connection to the correct backend server, as usual. The backend will decrypt the true Encrypted Client Hello to determine which specific hostname the client wants.

For additional detail about this approach, see my comment on Hacker News.

What About Port 80?

Obviously, I can't proxy unencrypted HTTP traffic using SNI-based proxying. But at this point, port 80 exists solely to redirect clients to HTTPS. To handle this, I plan to run a tiny, zero-config daemon on port 80 of all IPv4 and IPv6 addresses that will redirect the client to the same URL but with http:// replaced with https://. (For now, I'm using Apache for this.)

Installing snid

If you have Go, you can install snid by running:

go install src.agwa.name/snid@latest

You can also download a statically-linked binary.

See the README for the command line usage.

Rejected Approach: UNIX Domain Sockets

Before settling on the approach described above, I had snid listen on port 443 of all interfaces (both IPv4 and IPv6) and forward connections to a UNIX domain socket whose path contained the SNI hostname. For example, connections to example.com would be forwarded to /var/tls/example.com. The client's IP address was preserved using the PROXY protocol.

This had some nice properties. I could use filesystem permissions to control who was allowed to create sockets, either by setting permissions on /var/tls, or by symlinking specific hostnames under /var/tls to other locations on the filesystem which non-root users could write to. It felt really elegant that applications could listen on an SNI hostname rather than on an IP address and port.

However, few server applications support the PROXY protocol or listening on UNIX domain sockets. I could make sure my own apps had support, but I really wanted to be able to use off-the-shelf apps with snid. I did write an amazing LD_PRELOAD library that intercepts the bind system call and transparently replaces binding to a TCP port with binding to a UNIX domain socket. It even intercepts getpeername and makes it returns the IP address received via the PROXY protocol. Although this worked with every application I tried it with, it felt like a hack.

Additionally, UNIX domain sockets have some annoying semantics: if the socket file already exists (perhaps because the application crashed without removing the socket file), you can't bind to it - even if no other program is actually bound to it. But if you remove the socket file, any program bound to it continues running, completely unaware that it will never again accept a client. The semantics of TCP port binding feel more robust in comparison.

For these reasons I switched to the IPv6 approach described above, allowing me to use standard, unmodified TCP-listening server apps without any hacks that might compromise robustness. However, support for UNIX domain sockets lives on in snid with the -mode unix flag.

Comcast Shot Themselves in the Foot with MTA-STS

2022-01-19T17:04:58Z

I recently heard from someone, let's call them Alex, who was unable to email comcast.net addresses. Alex's emails were being bounced back with an MTA-STS policy error:

MX host mx2h1.comcast.net does not match any MX pattern in MTA-STS policy
MTA-STS failure for Comcast.net: Validation error (E_HOST_MISMATCH)
MX host mx1a1.comcast.net does not match any MX pattern in MTA-STS policy
MTA-STS failure for Comcast.net: Validation error (E_HOST_MISMATCH)

MTA-STS is a relatively new standard that allows domain owners such as Comcast to opt in to authenticated encryption for their mail servers. (By default, SMTP traffic between mail servers uses opportunistic encryption, which can be defeated by active attackers to intercept email.) MTA-STS requires the domain owner to duplicate their MX record (the DNS record that lists a domain's mail servers) in a text file served over HTTPS. Sending mail servers, like Alex's, refuse to contact mail servers that aren't listed in the MTA-STS text file. Since HTTPS uses authenticated encryption, the text file can't be altered by active attackers. (In contrast, the MX record is vulnerable to manipulation unless DNSSEC is used, but people don't like DNSSEC which is why MTA-STS was invented.)

The above error messages mean that although mx2h1.comcast.net and mx1a1.comcast.net are listed in comcast.net's MX record, they are not listed in comcast.net's MTA-STS policy file. Consequentially, Alex's mail server thinks that the MX record was altered by attackers, and is refusing to deliver mail to what it assumes are rogue mail servers.

However, mx2h1.comcast.net and mx1a1.comcast.net are not rogue mail servers. They are in fact listed in Comcast's current MTA-STS policy:

version: STSv1
mode: enforce 
mx: mx2c1.comcast.net
mx: mx2h1.comcast.net
mx: mx1a1.comcast.net
mx: mx1h1.comcast.net
mx: mx1c1.comcast.net
mx: mx2a1.comcast.net
max_age: 2592000

This means that Alex's mail server is not consulting comcast.net's current MTA-STS policy. Instead, it's consulting a cached policy which does not list mx2h1.comcast.net and mx1a1.comcast.net.

This can happen because mail servers cache MTA-STS policies to avoid having to re-download an unchanged policy file every time an email is sent. To determine whether a domain's policy has changed, mail servers query the domain's _mta-sts TXT record (e.g. _mta-sts.comcast.net), and only re-download the MTA-STS policy file if the ID in the TXT record is different from the ID of the currently-cached policy.

The obvious implication of the above is that if you ever change your domain's MTA-STS policy, you have to remember to update the TXT record as well.

A more subtle implication is that you have to do the updates in the right order. If you update the TXT record before changing the policy file, and a mail server fetches the policy in the intervening time, it will download the old policy file but cache it under the new ID. It won't ever download the new policy because it thinks it already has it in its cache.

This pitfall could have been avoided had MTA-STS required the ID to also be specified in the policy file instead of just in the TXT record. That would have prevented mail servers from caching policies under the wrong ID.

There's some evidence that this is what happened with comcast.net. The ID in the _mta-sts.comcast.net TXT record appears to be a UNIX timestamp (seconds since the Epoch):

_mta-sts.comcast.net. 7200 IN TXT "v=STSv1; id=1638997389;"

That timestamp translates to 2021-12-08 21:03:09 UTC.

However, the Last-Modified time of https://mta-sts.comcast.net/.well-known/mta-sts.txt is three minutes later:

Last-Modified: Wed, 08 Dec 2021 21:06:05 GMT

If the ID in the TXT record reflects when the TXT record was updated, there was a three minute gap between the updates. If Alex's mail server fetched comcast.net's MTA-STS policy during this window, it would have cached the old policy under the new ID, causing the errors seen above.

Recommendations for Domain Owners Who Use MTA-STS

You should automate MTA-STS policy publication to ensure that your MTA-STS policy always matches your MX records and that the TXT record is reliably updated, in the correct order, when your policy changes. If your policy file is served by a CDN, you have to be extra careful not to update the TXT record until your new policy file is fully propagated throughout the CDN.

I further recommend that you rotate the ID in the TXT record daily even if no changes have been made to your policy. This will force mail servers to re-download your policy file if it's more than a day old, which provides a backstop in case something goes wrong with the order of updates.

It may be tempting, but you should not reduce your policy's max_age value as this will diminish your protection against active attackers who block retrieval of your policy. Having a long max_age but a frequently rotating ID keeps your policy up-to-date in mail servers but ensures that in an attack scenario mail servers will fail safe by using a cached policy.

It's quite a bit of work to get this all right. If you want the easy option, SSLMate will automate all aspects of MTA-STS for you: all you need to do is publish two CNAME records delegating the mta-sts and _mta-sts subdomains to SSLMate-operated servers and SSLMate takes care of the rest.

Recommendations for Mail Server Developers

You should assume that domain operators are not going to properly update their TXT records and you should always attempt to re-download policy files that are more than a day old, regardless of what the ID in the TXT record says.

Is this Really Better than DNSSEC/DANE?

Thanks to MTA-STS' duplication of information and requirement for updates to be done in the right order, there is a high chance of human error when MTA-STS is deployed manually. Unfortunately, it's very likely to be deployed manually because there's a dearth of automation software, and on the surface it looks easy to manage by hand. To make matters worse, MTA-STS' caching semantics mean that the inevitable human error leads to hard-to-diagnose problems, such as a subset of mail servers being unable to mail your domain. I suspect that many problems will never be detected - email delivery will just become less reliable than it was before MTA-STS was deployed.

Meanwhile, DNSSEC is increasingly automated, and if you use a modern cloud provider like Route 53, Google Cloud DNS, or Cloudflare, you don't have to worry about remembering to sign zones before they expire, which was traditionally a major source of DNSSEC mistakes.

However, not all mail server operators support DNSSEC/DANE. Although Microsoft recently added DNSSEC/DANE support to Office 365 Exchange, Gmail only supports MTA-STS. Thus, there is still value in deploying MTA-STS despite its flaws. But we should not be happy about this state of affairs.

It's Now Possible To Sign Arbitrary Data With Your SSH Keys

2021-11-12T16:14:45Z

Did you know that you can use the ssh-keygen command to sign and verify signatures on arbitrary data, like files and software releases? Although this feature isn't super new - it was added in 2019 with OpenSSH 8.0 - it seems to be little-known. That's a shame because it's super useful and the most viable alternative to PGP for signing data. If you're currently using PGP to sign data, you should consider switching to SSH signatures.

Here's why I like SSH signatures:

It's not PGP. For years, security professionals have been sounding the alarm on PGP, including its most popular implementation, GnuPG/GPG. PGP is absurdly complex, has an awful user experience, and is full of crufty old cryptography which shouldn't be touched with a ten foot pole.
SSH is everywhere, and people already have SSH keys. If you use Debian Bullseye or Ubuntu 20.04 or newer, you already have a new enough version of SSH installed. And if you use GitHub, or any other service that uses SSH keys for authentication, you already have an SSH key that can be used to generate signatures. This is why I'm more excited about SSH signatures than other PGP signature alternatives like signify or minisign. Signify and minisign are great, but require you to install new software and generate new keys, which will hinder widespread adoption.
SSH key distribution is easy. SSH public keys are one line strings that are easy to copy around. You don't need to use the Web of Trust or worry about configuring "trust levels" for keys. GitHub already acts as a key distribution service which is far easier to use and more secure than any of the PGP key servers ever were. You can retrieve the SSH public keys for any GitHub user by visiting a URL like https://github.com/USERNAME.keys. (For example, my public keys are at https://github.com/AGWA.keys.)

(GitHub acts as a trusted third party here, and you have to trust them not to lie about people's public keys, so it may not be appropriate for all use cases. But relying on a trusted third party with a professional security team like GitHub seems like a way better default than PGP's Web of Trust, which was nigh impossible to use. Key Transparency would address the concerns with trusted third parties, if anyone ever figures out how to audit transparency logs in practice.)
SSH has optional lightweight certificates. You don't have to use SSH certificates (and most people shouldn't) but if certificates would make your life easier, SSH has a lightweight certificate system that is considerably simpler than X.509. This makes SSH signatures a good alternative to S/MIME as well!

You'll soon be able to sign Git commits and tags with SSH

Signing Git commits and tags gives consumers of your repository assurance that your code hasn't been tampered with. Unfortunately, you currently have to use either PGP or S/MIME, and personally I haven't bothered to sign Git tags since my PGP keys expired in 2018.

But that will soon change in Git 2.34, which adds support for SSH signatures.

Signing files

Signing a file is straightforward:

ssh-keygen -Y sign -f ~/.ssh/id_ed25519 -n file file_to_sign

Here are the arguments you may need to change:

~/.ssh/id_ed25519 is the path to your private key. This is the standard path to your SSH Ed25519 private key. If you have an RSA key, use id_rsa instead.
file is the "namespace", which describes the purpose of the signature. SSH defines file for signing generic files, and email for signing emails. Git uses git for its signatures.

If you are using the signature for a different purpose, such as a custom protocol, you must specify your own namespace. This prevents cross-protocol attacks whereby a valid signature is removed from a message for one protocol and attached to a message from a different protocol. If the protocols don't use distinct namespaces for their signatures, there's a risk that the signature is considered valid by the second protocol even though it was meant for the first protocol.

Namespaces can be arbitrary strings. To ensure global uniqueness of namespaces, SSH recommends that you structure them like an email address under a domain that you own. For example, I would use a namespace like protocolname-v1@agwa.name.
file_to_sign is the path to the file to be signed.

The signature is written to a new file called file_to_sign.sig, which looks like this:

-----BEGIN SSH SIGNATURE-----
U1NIU0lHAAAAAQAAADMAAAALc3NoLWVkMjU1MTkAAAAg2rirQQddpzEzOZwbtM0LUMmlLG
krl2EkDq4CVn/Hw7sAAAAEZmlsZQAAAAAAAAAGc2hhNTEyAAAAUwAAAAtzc2gtZWQyNTUx
OQAAAEDyjWPjmOdG8HJ8gh1CbM8WDDWoGfm+TTd8Qa8eua9Bt5Cc+43S24i/JqVWmk98qV
YXoQmOYL4bY8t/q7cSNeMH
-----END SSH SIGNATURE-----

If you specify - for the filename, the file to sign is read from standard in and the signature is written to standard out.

Verifying signatures

Verifying signatures is a bit more involved. First you need to create an allowed signers file which maps email addresses to public keys, like this:

alice@example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINq4q0EHXacxMzmcG7TNC1DJpSxpK5dhJA6uAlZ/x8O7
alice@example.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCfHGCK5jjI/Oib4vRBLB9rG30A8y/Br9U75rfAYsitwFPFfl/CaTAvfRlW1lIBqOCshLWxGsN+PFiJCiCWzpW4iILkD5X5KcBBYHTq1ojYXb70BrQXQ+QBDcGxqQjcOp/uTq1D9Z82mYq/usI5wdz6f1KNyqM0J6ZwRXMu6u7NZaAwmY7j1fV4DRiYdmIfUDIyEdqX4a1Gan+EMSanVUYDcNmeBURqmTkkOPYSg8g5xYgcXBMOZ+V0ZUjreV9paKraUD/mVDlZbb/VyWhJGT4FLMNXHU6UHC2FFgqANMUKIlL4vhqc23MoygKbfF3HgNB6BNfv3s+GYlaQ3+66jc5j
bob@example.net ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBgQuuEvhUXerOTIZ2zoOx60M/HHJ/tcHnD84ZvTiX5b
eve@example.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFxsKcWHB9hamTXCPWKVUw0WM0S3IXH0YArf8iJE0dMG

Once you have your allowed signers file, verification works like this:

ssh-keygen -Y verify -f allowed_signers -I alice@example.com -n file -s file_to_verify.sig < file_to_verify

Here are the arguments you may need to change:

allowed_signers is the path to the allowed signers file.
alice@example.com is the email address of the person who allegedly signed the file. This email address is looked up in the allowed signers file to get possible public keys.
file is the "namespace", which must match the namespace used for signing as described above.
file_to_verify.sig is the path to the signature file.
file_to_verify is the path to the file to be verified. Note that this file is read from standard in. In the above command, the < shell operator is used to redirect standard in from this file.

If the signature is valid, the command exits with status 0 and prints a message like this:

Good "file" signature for alice@example.com with ED25519 key SHA256:ZGa8RztddW4kE2XKPPsP9ZYC7JnMObs6yZzyxg8xZSk

Otherwise, the command exits with a non-zero status and prints an error message.

Is it safe to repurpose SSH keys?

Short answer: yes.

Always be wary of repurposing cryptographic keys for a different protocol. If not done carefully, there's a risk of cross-protocol attacks. For example, if the structure of the messages signed by Git is similar to the structure of SSH protocol messages, an attacker might be able to forge Git artifacts by misappropriating the signature from an SSH transcript.

Fortunately, the structure of SSH protocol messages and the structure of messages signed by ssh-keygen are dissimilar enough that there is no risk of confusion.

To convince ourselves, let's consult RFC 4252 section 7, which specifies how SSH keys are traditionally used by SSH to authenticate a user logging into a server. The RFC specifies that the input to the signature algorithm has the following structure:

string session identifier byte SSH_MSG_USERAUTH_REQUEST string user name string service name string "publickey" boolean TRUE string public key algorithm name string public key to be used for authentication

The first field is the session identifier, a string. In the SSH protocol, strings are prefixed by a 32-bit big endian length. The session identifier is a hash. Since hashes are short, the first three bytes of the above signature input will always be zero.

Meanwhile, the PROTOCOL.sshsig file the OpenSSH repository specifies how SSH keys are used with ssh-keygen-generated signatures. It specifies that the input to the signature algorithm has this structure:

#define MAGIC_PREAMBLE "SSHSIG" byte[6] MAGIC_PREAMBLE string namespace string reserved string hash_algorithm string H(message)

Here, the first three bytes are SSH, from the magic preamble. Since the first three bytes of the SSH protocol signature input are different from the ssh-keygen signature input, the SSH client and ssh-keygen will never produce identical signatures. Therefore, there is no risk of cross-protocol attacks, and I am totally comfortable using my existing SSH keys to sign messages with ssh-keygen.

How Certificate Transparency Logs Fail and Why It's OK

2021-07-09T19:06:16Z

Last week, a Certificate Transparency log called Yeti 2022 suffered a single bit flip, likely due to a hardware error or cosmic ray, which rendered the log unusable. Although this event will have zero impact on Web users and website operators, and was reported on an obscure mailing list for industry insiders, it captured the interest of people on Hacker News, Twitter, and Reddit. Certificate Transparency plays an essential role in ensuring security on the Web, and numerous commentators were concerned that logs could be wiped out by a single bit flip. I'm going to explain why these concerns are misplaced and why log failure doesn't worry me.

Background: Certificate Transparency (CT) is a system to log publicly-trusted SSL certificates in public, append-only logs. Website owners can monitor these logs and take action if they discover an unauthorized certificate for one of their domains. Thanks to Certificate Transparency, several untrustworthy certificate authorities have been distrusted, and the ecosystem has improved enormously compared to the pre-CT days when misissued certificates usually went unnoticed.

To ensure that CT logs remain append-only, submitted certificates are placed in the leaves of a data structure called a Merkle Tree. The leaves of the Merkle Tree are recursively hashed together with SHA-256 to produce a root hash that represents the contents of the tree. Periodically, the CT log publishes a signed statement, called a Signed Tree Head or STH, containing the current tree size and root hash. The STH is a commitment that at the specified size, the log has the specified contents. To enforce the commitment, monitors collect STHs and verify that the root hashes match the certificates downloaded from the log. If the downloaded certificates don't match the published STHs, or if a monitor detects two STHs for the same tree size with different root hashes, it means the log contents have been altered - perhaps to conceal a malicious certificate.

On June 30, 2021 at 01:02 UTC, my CT monitor, Cert Spotter, raised an alert that the root hash it calculated from the first 65,569,149 certificate entries downloaded from Yeti 2022 did not equal the root hash in the STH that Yeti 2022 had published for tree size 65,569,149.

I noticed the alert the next morning and reported the problem to the ct-policy mailing list, where these matters are discussed. The Google Chrome CT team reported that they too had observed problems, as did a user of the open source certspotter. Later that day, I drilled down that the problem was with entry 65,562,066 - the hash of the certificate returned by Yeti at this position was not part of the Merkle Tree.

On Thursday, the log operator reported that this entry had "shifted one bit" before the STH was signed. Curious, I calculated the correct hash of entry 65,562,066 and then tried flipping every bit and seeing if the resulting hash was part of the Merkle Tree. Sure enough, flipping the lowest bit of the first byte of the hash resulted in a hash that was part of the Merkle Tree and would ultimately produce the root hash from the STH.

There is no way for the log operator to fix this problem: they can't change entry 65,562,066 to match the errant hash, as this would require breaking SHA-2's preimage resistance, which is computationally infeasible. And since they've already published an STH for tree size 65,569,149, they can't publish an updated STH with a root hash that correctly reflects entry 65,562,066.

Consequentially, Yeti 2022 is toast. It has been made read-only and web browsers will soon cease to rely on it.

Yeti 2022 is not the first Certificate Transparency log to fail. Seven logs have previously failed, all without causing any impact to Web users:

In 2016, a log failed for having excessive downtime.
In 2016, a log failed because the operator reused the log's private key with a test log. Since the same private key was being used to sign STHs from two different logs, each log appeared to be violating the append-only property.
In 2017, a log failed because it violated the append-only property after its database was rolled back during a botched backup restore.
In 2017, a log failed after its operator, a Chinese blockchain company that no one had ever heard of, vanished into thin air. (I tried calling their phone number in China and got a message that their voicemail box was full.) (Maybe they thought CT was a cryptocurrency and pulled the plug when they didn't get rich?)
In 2018, a log failed because it didn't include submitted certificates in the log.
In 2018, another log failed because it didn't include submitted certificates in the log.
In 2020, a log failed because its private key was presumed compromised after the server fell victim to a Salt vulnerability.

The largest risk of log failure is that previously-issued certificates will stop working and require replacement prior to their natural expiration dates. CT-enforcing browsers only accept certificates that contain receipts, called Signed Certificate Timestamps or SCTs, from a sufficient number of approved (i.e. non-failed) CT logs. Each SCT is a promise by the respective log to publish the certificate (Technically, the precertificate, which contains the same information as the certificate). What if one or more of the SCTs in a certificate is from a log which failed after the certificate was issued?

Fortunately, browser Certificate Transparency policies anticipated the possibility. At a high level, the Chrome and Apple policies require the following:

At least one SCT from a log that is approved at time of certificate validation
At least 2-5 SCTs (depending on certificate lifetime) from logs that were approved at time of SCT issuance

(The precise details are a bit more nuanced but don't matter for this blog post. Also, some very advanced website operators deliver SCTs using alternative mechanisms, which are subject to different rules, but this is extremely rare. Read the policies if you want the nitty gritty.)

Consequentially, a single log failure can't cause a certificate to stop working. The first requirement is still satisfied because the certificate still has at least one SCT from a currently-approved log. The second requirement is still satisfied because the failed log was approved at time of SCT issuance. The minimum number of SCTs from approved-at-issuance logs increases with certificate lifetime to reflect the increased probability of log failure: a 180 day certificate only needs 2 SCTs, whereas a 3 year certificate (back when they were allowed) needed 5 SCTs.

Note that when a log fails, its public key is not totally removed from browsers like a distrusted certificate authority's key would be. Instead, the log transitions to a state called "retired" (previously known as "disqualified"), which means its SCTs are still recognized for satisfying the second requirement. This led to an interesting question in 2020 when a log's private key was compromised: should the log be retired, or should it be totally distrusted? Counter-intuitively, it was retired, even though SCTs could have been forged to make it seem like a certificate was included in the log when it really wasn't. But that's OK, because the second requirement isn't about making sure certificates are logged, but about making sure certificate authorities aren't dumbasses. In a world of competent CAs, the second requirement wouldn't be necessary since CAs would have the good sense to include SCTs from extra logs in case some logs failed. But we do not live in a world of competent CAs - indeed, that's why CT exists - and there would no doubt be CAs embedding a single SCT in certificates if browsers didn't require redundancy.

Of course, there is still a chance that all of the SCTs in a certificate come from logs that end up failing. That would suck. But I don't think the solution is to change Certificate Transparency. Catastrophic CT failure is just one of several reasons that a certificate might need to be replaced before its natural expiration, and empirically it's the least likely reason. When a certificate authority is distrusted, as has happened several times, all of its certificates must be replaced. When a certificate is misissued, it has to be revoked and replaced, and there have been numerous incidents since 2019 in which a considerable number of certificates have required revocation - sometimes as many as 100% of a CA's active certificates:

The ecosystem is currently ill-prepared to handle mass replacement events like these, and in many of the above cases CAs missed the revocation deadline or declined to revoke entirely. Although the above misissuances had relatively low security impact, other cases, such as distrusting a compromised certificate authority, or events like Heartbleed or the Debian random number fiasco, are very security critical. This makes the inability to quickly replace certificates at scale a serious problem, larger than the problem of CT logs failing. To address the problem, Let's Encrypt is working on a specification called ACME Renewal Info (ARI) that would allow CAs to instruct TLS servers to request new certificates prior to their normal expiration. They've committed to deploying ARI or a similar technology in their staging environment by 2021-11-12.

Security Vulnerabilities in Smallstep PKI Software

2020-12-17T18:11:03Z

I recently did a partial security review of Smallstep, a commercially-backed open source private certificate authority written in Go. I found that Smallstep is vulnerable to JSON injection, misuses JWTs, and relies on client-side enforcement of server-side security. These vulnerabilities can be exploited to obtain unauthorized certificates. This post is a full disclosure of the issues.

I reviewed the certificates repository as of commit 1feb4fcb26dc78d70bc1d9e586237a36a8fbea9c and the crypto repository as of commit 162770cad29063385cb768b0191814e4c6a94e45. The vulnerabilities are present in version 0.15.5 of step-certificates and version 0.15.3 of step-cli, and have been fixed as of version 0.15.6.

JSON Injection in Certificate Templates

Like many PKI systems, Smallstep supports user-definable certificate profiles (which they call certificate templates) to specify the contents of a certificate (the subject, SANs, extensions, etc.). Smallstep's certificate profiles are implemented as JSON objects which are templated using Go's text/template package. Using templates, you can substitute a variety of information into the profile, including information from the certificate request.

Here's an example template from Smallstep's blog post announcing the feature:

{
  "subject": {"commonName":"{{ .Insecure.CR.Subject.CommonName }}"},
  "sans": {{ toJson .SANs }},
  "keyUsage": ["digitalSignature", "keyAgreement"],
  "extKeyUsage": ["clientAuth"]
}

20 years of HTML and SQL injection vulnerabilities have shown that it's a bad idea to use raw text templating to construct syntactic data like HTML, SQL, or JSON, which is why best practice is to use context-aware templating like SQL prepared statements or Go's html/template package. When using raw text templates, it's way too easy to suffer an injection vulnerability when the author of the template inevitably forgets to escape a value. Indeed, in the above example, the commonName field is unescaped, and Smallstep is rife with other examples of unescaped data in their documentation and in Go string literals that define the default SSH and X.509 templates.

Two factors make it easy for attackers to exploit the injection vulnerability. First, if a JSON object has more than one field with the same name, Go's JSON decoder takes the value from the last one. Second, Smallstep uses json.Decode to decode the object instead of json.Unmarshal, which means trailing garbage is ignored (this is an unfortunate foot gun in Go). Thus, an attacker who can inject values into a template has total control over the resulting object, as they can override the values of earlier fields, and then end the object so later fields are ignored. For example, signing a CSR containing the following common name using the above template results in a CA:TRUE certificate with a SAN of wheeeeeee.example and no extended key usage (EKU) extension:

"}, "basicConstraints":{"isCA":true}, "sans":[{"type":"dns","value":"wheeeeeee.example"}]}

Fortunately, despite the use of unescaped values in the default templates, I found that in virtually all cases the unescaped values come from trusted sources or are validated first. The one exception is the AWS provisioner, which in the default configuration will inject an unvalidated common name from the CSR into the template. However, in this configuration the DNS SANs are also completely unvalidated, so clients are already pretty trusted. Still, the vulnerability could be used to get a certificate with arbitrary or no EKUs, giving an attacker more capabilities than they would otherwise have. (The attacker could also get a CA:TRUE certificate, although by default the issuing CA has a pathlen:0 constraint so the CA:TRUE certificate wouldn't work.)

In any case, this approach to templates is extremely concerning. This issue cannot be fixed just by updating the built-in templates to use proper escaping, as users who write their own templates are likely to forget to escape, just as scores of developers have forgotten to escape HTML and SQL. The responsible way to offer this feature is with a context-aware JSON templating system akin to template/html that automatically escapes values.

Weaknesses in AWS Provisioner

This section applies to Smallstep's AWS provisioner, which enables EC2 instances to obtain X.509 certificates which identify the instance.

To prove that it is authorized to obtain a certificate, the Smallstep client retrieves a signed instance identity document from the EC2 metadata service and sends it to the Smallstep server. The Smallstep server verifies that the instance identity document was validly signed by AWS, and that the EC2 instance belongs to an authorized AWS account. If so, the Smallstep server issues a certificate.

EC2 instance identity documents aren't bound to a particular purpose when they are signed and they never expire. An attacker who obtains a single instance identity document from any EC2 instance in their victim's account has a permanent capability to obtain certificates. Instance identity documents are not particularly protected; anyone who can make an HTTP request from the instance can obtain them. Smallstep attempts to address this in a few ways. Unfortunately, the most effective protection is off by default, and the others are ineffective.

First, instance identity documents contain the date and time that the instance was created, and Smallstep can be configured to reject instance identity documents from instances that are too old. This is a good idea and works well with architectures where an instance obtains its certificate after first bootup and never needs to obtain future certificates. However, this protection is off by default. (Also, the documentation for the configuration parameter is not very good: it's described as "the maximum duration to grant a certificate in AWS and GCP provisioners" which sounds like maximum certificate lifetime. This is not going to help people choose a sensible value for this option.)

Second, by default Smallstep only allows an EC2 instance to request one certificate. After it requests a certificate, the instance is not supposed to be able to request any more certificates. Unfortunately, this logic is enforced client-side. Specifically, the server relies on the client to generate a unique ID, which in a non-malicious implementation is derived from the instance ID. Of course, a malicious client can just generate any ID it wants and the server will think the instance has never requested a certificate before.

(Aside: Smallstep claims that the one-certificate-per-instance logic "allows" them to not validate certificate identifiers by default. This doesn't make much sense to me, as an instance which has never requested a certificate before would still have carte blanche ability to request a certificate for any identifier, including those belonging to instances which already have requested certificates. It seems to me that to get the TOFU-like security properties that they desire, they should be enforcing a one-certificate-per-identifier rule rather than one-certificate-per-instance.)

Finally, Smallstep tries to use JWT in a bizarre and futile attempt to limit the lifetime of identity documents to 5 minutes. Instead of simply sending the identity document and signature to the server, the client sticks the identity document and signature inside a JSON Web Token which is authenticated with HMAC, using the signature as the HMAC key - the same signature that is in the payload of the JWT.

This leads to an incredible sequence of code on the server side which first deserializes the JWT payload without verifying the HMAC, and then immediately deserializes the payload again, this time verifying the HMAC using the key just taken out of the unverified payload:

var unsafeClaims awsPayload
if err := jwt.UnsafeClaimsWithoutVerification(&unsafeClaims); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error unmarshaling claims")
}

var payload awsPayload
if err := jwt.Claims(unsafeClaims.Amazon.Signature, &payload); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error verifying claims")
}

Obviously, this code adds zero security, as an attacker who wants to use an expired JWT can just adjust the expiration date and then recompute the HMAC using the key that is right there in the JWT.

Edited to add: Smallstep explains that they used JWT here because their other cloud provisioners use JWT natively and therefore it was easier to use JWT here as well. It was not intended as a security measure. Indeed, the 5 minute expiration is not documented anywhere, so I wouldn't call this a vulnerability. However, it would be wrong to call it a harmless no-op: the JWT provided the means for an untrusted, client-generated ID to be transmitted to the server, which was accidentally used instead of the trusted ID from the signed instance identity document, causing the vulnerability above. This is why it's so important to avoid unnecessary code and protocol components, especially in security-sensitive contexts.

Timeline

2020-12-17 18:17 UTC: full disclosure and notification to Smallstep
2020-12-17 21:02 UTC: Smallstep merges patch to escape JSON in default X.509 template
2020-12-17 23:52 UTC: Smallstep releases version 0.15.6 fixing the vulnerabilities, plans to address other feedback
2020-12-27 19:04 UTC: Ten days later, Smallstep's blog post and documentation still show examples of unescaped JSON, giving me doubts that they have fully understood the issue

The Lengths People Go To Just To Avoid DNSSEC

2020-11-30T17:31:06Z

Connecting to a website, say example.com, over TLS is a relatively straightforward affair. The client looks up the DNS A/AAAA record for example.com, connects to the IP address over TLS, and confirms that the presented certificate is valid for example.com.

In contrast, connecting to other services, like XMPP or SMTP, over TLS is less straightforward. That's because clients don't directly look up the A/AAAA record for example.com. Instead they look up a SRV record (for XMPP) or an MX record (for SMTP) which contains the hostname of the XMPP or SMTP server. Then they look up the A/AAAA record of that hostname and connect to it. This layer of indirection makes it easy to delegate the operation of certain services to other hosts. For instance, if example.com wants to use Gmail for their email, their MX record would contain aspmx.l.google.com.

This raises the question of which hostname the certificate should certify: the original hostname (example.com), or the hostname listed in the SRV/MX record (aspmx.l.google.com). Both options have problems.

Approach 1, using the original hostname (example.com), is undesirable because there is no automated way for the operator of a service like SMTP or XMPP to obtain certificates for the hostnames which are delegated to them. Google can automatically get a certificate for aspmx.l.google.com because they own google.com, but they can't for example.com. The admins of example.com would have to request the certificate themselves and give it to their SMTP and XMPP providers. Such a manual approach is bound to cause outages as people forget to renew expiring certificates. But there's another problem: the certificate would permit the SMTP and XMPP providers to impersonate all services for example.com. You probably don't want your instant messaging provider to be able to impersonate your website.

Approach 2, using the SRV/MX hostname (aspmx.l.google.com), doesn't have these problems. It's quite easy for the service operator to automatically obtain a certificate for their own domain. Unfortunately, this approach is not secure. Since the DNS lookup for the SRV/MX record is most likely unauthenticated, a man-in-the-middle attacker could intercept the query and return a rogue record that says the SMTP or XMPP service for example.com is handled by a server operated by the attacker. The attacker would have no problem obtaining a valid certificate for their own domain.

Perhaps the most obvious solution to this problem is to just make the DNS lookup authenticated. The IETF has been trying to do this since 1997 with DNSSEC. More than 20 years later, the results are not promising: fewer than 2% of .com domains support DNSSEC, only 25% of Internet users validate DNSSEC even when it is supported, the use of insecure crypto like 1024 bit RSA and SHA-1 is still rampant, and DNSSEC is so hard to deploy correctly that outages are common.

Unsurprisingly, many people would like to avoid the DNSSEC quagmire, which has led to some very interesting workarounds...

POSH

One workaround is POSH, or "PKIX over Secure HTTP", standardized in RFC 7711. With POSH, the owner of example.com publishes a JSON document at https://example.com/.well-known/posh/SERVICE.json containing a list of certificate fingerprints which are allowed to be used for the given SERVICE (e.g. xmpp-server) on example.com. To connect to example.com's XMPP server, a POSH-aware client would first retrieve https://example.com/.well-known/posh/xmpp-server.json, ensuring that the HTTPS server presents a valid, publicly-trusted certificate for example.com. It would then connect to the XMPP server indicated in example.com's XMPP SRV record, and ensure that it presents a certificate whose fingerprint is listed in the JSON document.

POSH solves the problems presented above. It's secure, because the JSON document containing the fingerprints is authenticated by a certificate for example.com, which remains under the control of the owner of example.com. The operator of the XMPP service doesn't need to obtain a publicly-trusted certificate for example.com. To facilitate certificate rotation, POSH supports delegation: the JSON file at example.com can reference the URL of a different JSON file which is hosted by the XMPP operator. This gives the XMPP operator flexibility to rotate certificates at any time without needing to inform their customers to update their JSON documents.

Although POSH was designed to be protocol agnostic, it was only ever used with XMPP, and even then I could only find a few XMPP clients which support it. It's fair to say POSH never caught on.

MTA-STS

A more recent workaround, for server-to-server SMTP only, is MTA-STS, standardized in RFC 8461. The underlying concept of MTA-STS is the same as POSH: the owner of example.com publishes a document over HTTPS with information about how to validate secure SMTP connections to example.com's mail servers. Several details are different. One cosmetic difference is that the document is published at mta-sts.example.com rather than example.com, which simplifies deployment for domain owners who can't easily make changes to their main website. A more fundamental difference is that instead of listing certificate fingerprints, the document lists the hostnames which are allowed in the MX record set. When connecting to an SMTP server for example.com, the client verifies both that it's connecting to a server listed in the MTA-STS document at mta-sts.example.com, and that the server presents a publicly-trusted certificate valid for the SMTP server's hostname.

The amusing thing about MTA-STS is that it basically boils down to duplicating the contents of the MX record in a document that is published over HTTPS, leveraging the WebPKI to authenticate the MX record rather than DNSSEC. It's kind of incredible that this is considered easier than using DNSSEC, despite having more moving parts and requiring duplication. That says way more about DNSSEC being a failure than about MTA-STS being good, and I've written before about how I think MTA-STS will prove hard to deploy in practice. I suggested how DNS providers could alleviate the problems by automating MTA-STS for their customers, but I'm not aware of any DNS providers to do so.

I am happy to report that SSLMate now offers MTA-STS automation as part of Cert Spotter. It's not quite as seamless as what a DNS provider could offer, but it's pretty good: Cert Spotter continuously monitors your domains' MX records and automatically publishes an appropriate MTA-STS policy for you, obtaining and renewing the necessary SSL certificates. All you need to do is publish two CNAME records delegating the MTA-STS-related subdomains (mta-sts and _mta-sts) to SSLMate-operated servers. Cert Spotter updates the policy automatically any time it detects a change to your MX records, ensuring the policy never falls out of sync with your DNS. For transparency, Cert Spotter emails you when this happens, so you can detect unauthorized MX record changes. Cert Spotter will also alert you if any of your MX servers have a TLS or certificate problem that would prevent MTA-STS from working.

Looking forward: SRVName certificates

There is another solution which, if it's ever deployed, will be much nicer than POSH or MTA-STS: SRVName certificates. SRVName certificates authenticate not just a domain name, like normal certificate, but a particular service running on that domain. For example, you could get a certificate that's valid for only SMTP on example.com, or only XMPP on example.com. This solves the security problem of the first approach above: the owner of example.com can give their mail server operator a SRVName certificate that's valid only for SMTP, allowing the mail server to operate an SMTP service for example.com, but not impersonate any other example.com services. Assuming the validation rules are flexible enough, the SMTP service operator could even obtain the SRVName certificate themselves provided the certificate authority validates that they are listed in the MX record for example.com.

Technically speaking, SRVName is a type of subject alternative name (SAN) which can be placed in a certificate, akin to the DNS SAN which certificates use today for authenticating domain names. It's possible for a certificate to contain both SRVName and DNS SANs, and here I use "SRVName certificate" to mean a certificate containing a SRVName SAN.

Unfortunately, there's a major obstacle blocking SRVName certificates: technically-constrained subordinate certificate authorities. A technically-constrained sub-CA is a certificate authority which is restricted to issuing certificates only for namespaces that are enumerated in the sub-CA certificate's name constraints field. For example, an enterprise that needs to issue a large number of certificates might operate a publicly-trusted, technically-constrained sub-CA that is constrained to the enterprise's domains and IP address ranges. Since their sub-CA can only issue certificates for namespaces that they control, they're allowed to operate it under looser security standards than unconstrained publicly-trusted CAs.

The problem is that if a particular type of SAN (in this case, the SRVName SAN) isn't listed in the name constraints as either allowed or denied, the standard says that all instances of that SAN type are allowed by default. Allow-by-default is usually a bad idea when security is concerned, and this case is no different. Clients can't accept SRVName certificates because it would be unsafe: every existing technically-constrained sub-CA that doesn't have SRVName in its name constraints field has unconstrained ability to issue SRVName certificates. Unfortunately, the Baseline Requirements (the rules governing public certificate issuance) only require technically-constrained sub-CAs to have DNS, IP, and Directory Name constraints. Consequentially, there are many existing technically-constrained sub-CAs out there that would need to be revoked and reissued with SRVName constraints before it's safe to deploy SRVName certificates.

Will that ever happen? Who knows. In the meantime, we're stuck with hacks like MTA-STS.

Writing an SNI Proxy in 115 Lines of Go

2020-06-25T14:35:55Z

The very first message sent in a TLS connection is the Client Hello record, in which the client greets the server and tells it, among other things, the server name it wants to connect to. This is called Server Name Indication, or SNI for short, and it's quite handy as it allows many different servers to be co-located on a single IP address.

The server name is sent in plaintext, which is unfortunately really bad for privacy and censorship resistance, but does enable something very useful: a proxy server can read the server name and use it to decide where to route the connection, without having to decrypt the connection. You can leverage this to make many different physical servers accessible from the Internet even if you have only one public IPv4 address: the proxy listens on your public IP address and forwards connections to the appropriate private IP address based on the SNI.

I just finished writing such a proxy server, which I plan to run on my home network's router so that I can easily access my internal servers from anywhere on the Internet, without a VPN or SSH port forwarding. I was pleased by how easy it was to write this proxy server using only Go's standard library. It's a great example of how well-suited Go is for programs involving networking and cryptography.

Let's start with a standard listen/accept loop (right out of the examples for Go's net package):

func main() {
	l, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := l.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handleConnection(conn)
	}
}

Here's a sketch of the handleConnection function, which reads the Client Hello record from the client, dials the backend server indicated by the Client Hello, and then proxies the client to and from the backend. (Note that we dial the backend using the SNI value, which works well with split-horizon DNS where the proxy sees the backend's private IP address and external clients see the proxy's public IP address. If that doesn't work for you, can use more complicated routing logic.)

func handleConnection(clientConn net.Conn) {
	defer clientConn.Close()

	// ... read Client Hello from clientConn ...

	backendConn, err := net.Dial("tcp", net.JoinHostPort(clientHello.ServerName, "443"))
	if err != nil {
		log.Print(err)
		return
	}
	defer backendConn.Close()

	// ... proxy clientConn <==> backendConn ...
}

Let's assume for now we have a convenient function to read a Client Hello record from an io.Reader and return a tls.ClientHelloInfo:

func readClientHello(reader io.Reader) (*tls.ClientHelloInfo, error)

We can't simply call this function from handleConnection, because once the Client Hello is read, the bytes are gone. We need to preserve the bytes and forward them along to the backend, which is expecting a proper TLS connection that starts with a Client Hello record.

What we need to do instead is "peek" at the Client Hello record, and thanks to some simple but powerful abstractions from Go's io package, this can be done with just six lines of code:

func peekClientHello(reader io.Reader) (*tls.ClientHelloInfo, io.Reader, error) {
        peekedBytes := new(bytes.Buffer)
        hello, err := readClientHello(io.TeeReader(reader, peekedBytes))
        if err != nil {
                return nil, nil, err
        }
        return hello, io.MultiReader(peekedBytes, reader), nil
}

What this code does is create a TeeReader, which is a reader that wraps another reader and writes everything that is read to a writer, which in our case is a byte buffer. We pass the TeeReader to readClientHello, so every byte read by readClientHello gets saved to our buffer. Finally, we create a MultiReader which essentially concatenates our buffer with the original reader. Reads from the MultiReader initially come out of the buffer, and when that's exhausted, continue from the original reader. We return the MultiReader to the caller along with the ClientHelloInfo. When the caller reads from the MultiReader it will see a full TLS connection stream, starting with the Client Hello.

Now we just need to implement readClientHello. We could open up the TLS RFCs and learn how to parse a Client Hello record, but it turns out we can let crypto/tls do the work for us, thanks to a callback function in tls.Config called GetConfigForClient:

// GetConfigForClient, if not nil, is called after a ClientHello is
// received from a client.
GetConfigForClient func(*ClientHelloInfo) (*Config, error) // Go 1.8

Roughly, what we need to do is create a TLS server-side connection with a GetConfigForClient callback that saves the ClientHelloInfo passed to it. However, creating a TLS connection requires a full-blown net.Conn, and readClientHello is passed merely an io.Reader. So let's create a type, readOnlyConn, which wraps an io.Reader and satisfies the net.Conn interface:

type readOnlyConn struct {
        reader io.Reader
}
func (conn readOnlyConn) Read(p []byte) (int, error)         { return conn.reader.Read(p) }
func (conn readOnlyConn) Write(p []byte) (int, error)        { return 0, io.ErrClosedPipe }
func (conn readOnlyConn) Close() error                       { return nil }
func (conn readOnlyConn) LocalAddr() net.Addr                { return nil }
func (conn readOnlyConn) RemoteAddr() net.Addr               { return nil }
func (conn readOnlyConn) SetDeadline(t time.Time) error      { return nil }
func (conn readOnlyConn) SetReadDeadline(t time.Time) error  { return nil }
func (conn readOnlyConn) SetWriteDeadline(t time.Time) error { return nil }

readOnlyConn forwards reads to the reader and simulates a broken pipe when written to (as if the client closed the connection before the server could reply). All other operations are a no-op.

Now we're ready to write readClientHello:

func readClientHello(reader io.Reader) (*tls.ClientHelloInfo, error) {
        var hello *tls.ClientHelloInfo

        err := tls.Server(readOnlyConn{reader: reader}, &tls.Config{
                GetConfigForClient: func(argHello *tls.ClientHelloInfo) (*tls.Config, error) {
                        hello = new(tls.ClientHelloInfo)
                        *hello = *argHello
                        return nil, nil
                },
        }).Handshake()

        if hello == nil {
                return nil, err
        }

        return hello, nil
}

Note that Handshake always fails because the readOnlyConn is not a real connection. As long as the Client Hello is successfully read, the failure should only happen after GetConfigForClient is called, so we only care about the error if hello was never set.

Let's put everything together to write the full handleConnection function. I've added deadlines (thanks, Filippo!) and a check that the SNI value ends with .internal.example.com to prevent this from being used as an open proxy. When I deploy this, I will use the DNS suffix of my home network.

func handleConnection(clientConn net.Conn) {
	defer clientConn.Close()

	if err := clientConn.SetReadDeadline(time.Now().Add(5 * time.Second)); err != nil {
		log.Print(err)
		return
	}

	clientHello, clientReader, err := peekClientHello(clientConn)
	if err != nil {
		log.Print(err)
		return
	}

	if err := clientConn.SetReadDeadline(time.Time{}); err != nil {
		log.Print(err)
		return
	}

	if !strings.HasSuffix(clientHello.ServerName, ".internal.example.com") {
		log.Print("Blocking connection to unauthorized backend")
		return
	}

	backendConn, err := net.DialTimeout("tcp", net.JoinHostPort(clientHello.ServerName, "443"), 5*time.Second)
	if err != nil {
		log.Print(err)
		return
	}
	defer backendConn.Close()

        var wg sync.WaitGroup
        wg.Add(2)

        go func() {
                io.Copy(clientConn, backendConn)
                clientConn.(*net.TCPConn).CloseWrite()
                wg.Done()
        }()
        go func() {
                io.Copy(backendConn, clientReader)
                backendConn.(*net.TCPConn).CloseWrite()
                wg.Done()
        }()

        wg.Wait()
}

Here's the complete Go source code - just 115 lines! (Not counting copyright legalese)

Security Review of CFSSL Signer Code

2020-06-18T14:05:20Z

Certificate signing is the most security-sensitive task performed by a certificate authority. The CA has to sign values, like DNS names, that are provided by untrusted sources. The CA must rigorously validate these values before signing them. If an attacker can bypass validation and get untrusted data included in a certificate, the results can be dire. For example, if an attacker can trick a CA into including an arbitrary SAN extension, they can get a certificate for domains they don't control.

Unfortunately, there is a history of CAs including unvalidated information in certificates. A common cause is CAs copying information directly from CSRs instead of from more constrained information sources. Since CSRs can contain both subject identity information and arbitrary certificate extensions, directly ingesting CSRs is extremely error-prone for CAs. For this reason, CAs would be well-advised to extract the public key from the CSR very early in the certificate enrollment process and discard everything else. In a perfect world, CAs would accept standalone public keys from subscribers instead of CSRs. (Before you say "proof-of-possession", see the appendix.)

I decided to review the signing code in CFSSL, the open source PKI toolkit used by Let's Encrypt's Boulder, to see how it stacks up against this advice. Unfortunately, I found that CFSSL copies subject identity information from CSRs by default, has features that are hard to use safely, and uses complicated logic that obfuscates what is included in certificate fields. I recommend that publicly-trusted CAs not use CFSSL.

Update: since publication of this post, Let's Encrypt has begun moving away from CFSSL!

Scope of review

I reviewed the CFSSL Git repository as of commit 6b49beae21ff90a09aea3901741ef02b1057ee65 (the HEAD of master at the time of my review). I reviewed the code in the signer and signer/local packages.

The signing operation

In CFSSL, you sign a certificate by invoking the Sign function on the signer.Signer interface, which has this signature:


Sign(req SignRequest) (cert []byte, err error)

There is only one actual implementation of signer.Signer: local.Signer. (The other implementations, remote.Signer and universal.Signer, are ultimately wrappers around local.Signer.)

Inputs to the signing operation

At a high level, inputs to the signing operation come from three to four places:

The Signer object, which contains:
- The private key and certificate of the CA
- A list of named profiles plus a default profile
- A signature algorithm
I will refer to this object as Signer.
The SignRequest argument, whose relevant fields are:
Hosts []string Request string // The CSR Subject *Subject Profile string CRLOverride string Serial *big.Int Extensions []Extension NotBefore time.Time NotAfter time.Time
I will refer to this object as SignRequest.
The Signer's default certificate profile, represented by an instance of the SigningProfile struct. I will refer to the default profile as defaultProfile.
The effective certificate profile, represented by an instance of the SigningProfile struct. I will refer to the effective profile as profile. If the profile named by SignRequest.Profile exists in Signer, then profile is that profile. If it doesn't exist, then profile equals defaultProfile.

The Sign function takes values from these places and combines them to produce the input to x509.CreateCertificate in Go's standard library. There is overlap - for instance SANs can be specified in the CSR, SignRequest.Hosts, or SignRequest.Extensions. How does Sign decide which source to use when constructing the certificate?

Certificate construction logic

To understand how Sign works, I looked at each certificate field and worked backwards to figure out how Sign decides to populate the field. Below are my findings.

Serial Number

If profile.ClientProvidesSerialNumbers is true: use SignRequest.Serial (error if not set).
Else: generate a random 20 byte serial number.

Not Before

If SignRequest.NotBefore is non-zero: use it.
Else if profile.NotBefore is non-zero: use it.
Else if profile.Backdate is non-zero: use current time - profile.Backdate.
Else: use current time - 5 minutes.

Not After

If SignRequest.NotAfter is non-zero: use it.
Else if profile.NotAfter is non-zero: use it.
Else if profile.Expiry is non-zero: use not before + profile.Expiry.
Else: use not before + defaultProfile.Expiry.

Signature Algorithm

If profile.CSRWhitelist is nil or profile.CSRWhitelist.SignatureAlgorithm is true: Use Signer's signature algorithm.
Else: CFSSL leaves the signature algorithm unspecified and the Go standard library picks a sensible algorithm.

Comments: it's weird how something named CSRWhitelist is used to decide whether to use a value that comes not from the CSR, but from Signer. This is probably because CFSSL's ParseCertificateRequest function gets this field from Signer rather than from the CSR that it is parsing. This sort of indirection and misleading naming makes the code hard to understand.

Public Key

If profile.CSRWhitelist is nil or profile.CSRWhitelist.PublicKey is is true: Use the CSR's public key.
Else: the certificate won't have a public key (this probably causes x509.CreateCertificate to return an error).

Comments: it's unclear why you'd ever want profile.CSRWhitelist.PublicKey to be false. The public key is literally the only piece of information that should be taken from the CSR.

SANs

This one's a doozy...

If profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a SAN extension and SignRequest.Extensions contains a SAN extension, and the SAN OID is present in profile.ExtensionWhitelist: add two SAN extensions to the certificate, one from the CSR and one from SignRequest.Extensions. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a SAN extension: use the SAN extension verbatim from the CSR. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if SignRequest.Extensions contains a SAN extension, and the SAN OID is present in profile.ExtensionWhitelist: use the SAN extension verbatim from SignRequest.Extensions. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if profile.CAConstraint.IsCA is true: the certificate will not contain a SAN extension.
Else if SignRequest.Hosts is non-nil:
1. Use each string in SignRequest.Hosts as follows:
  - If string parses as an IP address: make it an IP Address SAN.
  - Else if string parses as an email address: make it an email address SAN.
  - Else if string parses as a URI, make it a URI SAN.
  - Else: make it a DNS SAN.
2. If profile.NameWhitelist is non-nil: return an error unless the string representation of every DNS, email, and URI SAN matches the profile.NameWhitelist regex (IP address SANs are not checked).
Else if profile.CSRWhitelist is nil and the CSR contains a SAN extension:
1. Copy the DNS names, IP addresses, email addresses, and URIs from the CSR's SAN extension.
2. If profile.NameWhitelist is non-nil: enforce whitelist as described above.
Else if profile.CSRWhitelist is non-nil and the CSR contains a SAN extension:
1. If profile.CSRWhitelist.DNSNames is true: use DNS names from the CSR's SAN extension.
2. If profile.CSRWhitelist.IPAddresses is true: use IP addresses from the CSR's SAN extension.
3. If profile.CSRWhitelist.EmailAddresses is true: use email addresses from the CSR's SAN extension.
4. If profile.CSRWhitelist.URIs is true: use URIs from the CSR's SAN extension.
5. If profile.NameWhitelist is non-nil: enforce whitelist as described above.

Subject

For each supported subject attribute (common name, country, province, locality, organization, organizational unit, serial number):

If the attribute was specified in SignRequest.Subject: use it.
Else if profile.CSRWhitelist is nil or profile.CSRWhitelist.Subject is true: use the attribute from the CSR's subject, if present.

Common name only: if profile.NameWhitelist is non-nil: return an error unless the common name matches the profile.NameWhitelist regex.

Note: SignRequest.Hosts does not override the common name.

Basic Constraints

If SignRequest.Extensions contains a basic constraints extension, and the basic constraints OID is present in profile.ExtensionWhitelist: copy the basic constraints extension verbatim from SignRequest.Extensions.
Else: use the values from profile.CAConstraint.

Comments: given how security-sensitive this extension is, it's a relief that there's no way for the value to come from the CSR. Despite this, there is code earlier in the signing process that looks at the CSR's Basic Constraints extension. First it's extracted from the CSR in ParseCertificateRequest and then it's validated in Sign. This code ultimately has no effect, but it makes the logic harder to follow (and gave me a mild heart attack when I saw it).

Extensions besides SAN and Basic Constraints

For a given extension FOO:

If profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a FOO extension and SignRequest.Extensions contains a FOO extension, and FOO is present in profile.ExtensionWhitelist: add two FOO extensions to the certificate, one from the CSR and one from SignRequest.Extensions. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else if profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a FOO extension: copy it verbatim from the CSR. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else if SignRequest.Extensions contains a FOO extension, and FOO is present in profile.ExtensionWhitelist: copy it verbatim to the certificate. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else: use fields from SignRequest (like CRLOverride) and profile (like OCSP, CRL, etc.) to decide what value the extension should have, if any.

Other comments

By default, CSRWhitelist is nil. This is a bad default, as it means SANs will be copied from the CSR unless SignRequest.Hosts is set. Likewise, any subject attribute not specified in SignRequest.Subject will be copied from the CSR. This is practically impossible to use safely: to avoid including unvalidated subject information you have to specify a value for every attribute in SignRequest.Subject - and if you don't want the attribute included in the final certificate you're out of luck. If CFSSL ever adds support for a new attribute type, you had better update your code to specify a value for the attribute or unvalidated information might slip through. This is exactly the sort of logic that makes it so easy to accidentally issue certificates with "Some-State" in the subject.

If the profile specified by SignRequest.Profile doesn't exist, the default profile is used. This could lead to an unexpected certificate profile being used if a CA deletes a profile from their configuration but there are still references to it elsewhere. Considering the trouble that CAs have with profile management (see the infamous TURKTRUST incident or the CA that discovered they had a whopping 85 buggy profiles), I think it would be much safer if a non-existent profile resulted in an error.

SignRequest.Hosts is untyped - everything is a string and there is no distinction between IP addresses, email addresses, URIs, and DNS names. (Also, Hosts is a misleading name because URIs and email addresses aren't hosts.) CFSSL decides what type of SAN to include based on what the string in Hosts successfully parses as, and assumes it's a DNS name if it doesn't parse as anything else. This could lead to unexpected SAN types in the certificate. Determining if a string was intended to be a URI by trying to parse it is an especially bad idea considering how hellish URIs are to parse, and how much variation there is between different URI parsing implementations. If the user of CFSSL adds a string which they believe to be a valid URI to SignRequest.Hosts, but Go's URI parser rejects it, the URI will end up in a DNS SAN instead.

Variable names are inconsistent and often unhelpful. In Sign, req is used for values from SignRequest and safeTemplate is used for values from the CSR. But in PopulateSubjectFromCSR (which is called by Sign), req is used for values from the CSR, and s is used for values from the SignRequest. This increases the likelihood of accidentally using data from the wrong source.

ParseCertificateRequest blindly and unconditionally copies the extensions from the CSR to the Extensions field of the x509.Certificate template - even if profile.CopyExtensions is false. Fortunately, this field is ignored by x509.CreateCertificate so it's probably harmless. It just means that attacker-controlled input is propagated further through the program, increasing the opportunity for it to be misused.

CopyExtensions is a foot cannon

I am extremely concerned by the presence of the CopyExtensions option. Enabling it practically guarantees misissuance because all extensions (except Basic Constraints) are copied verbatim from the CSR, overriding any value specified in the profile or the SignRequest. In particular, SignRequest.Hosts and profile.NameWhitelist are ignored if the CSR contains a SAN extension. Also, profile.ExtensionWhitelist only applies to extensions specified in SignRequest - not those specified in the CSR. I think it's quite likely that users of CopyExtensions will be surprised when neither of these whitelists are effective.

Lack of documentation

As I showed above, the logic for constructing a certificate is very complicated, and you have to use CFSSL in exactly the right way to avoid copying unvalidated information from CSRs. Unfortunately, documentation is practically non-existent and I could only figure out CFSSL's logic by reading the source code. Obviously, the lack of documentation makes it hard to use CFSSL safely. But the more fundamental problem is that documentation writing wasn't a core part of CFSSL's engineering process. Had documentation been written in tandem with the design and implementation of CFSSL, it would have been evident that incomprehensibility was spiraling out of control. This information could have been fed back into the engineering process and used to redesign or even reject features that made the system too hard to understand. I have personally saved myself many times from releasing overly-complicated software just by writing the documentation for it.

Final thoughts

CFSSL has some nice features, like its friendly command line interface and its certificate bundler for building optimal certificate chains. However, I struggle to see the value provided by its signer package. Its truly useful functionality, like Certificate Transparency submission and pre-issuance linting, could be extracted into standalone libraries. The rest of the signer is just a complicated wrapper around Go's x509.CreateCertificate that obscures what gets included in certificates and will include the wrong thing if you hold it wrong. A long history of misissuance shows us why we need better. If you're a CA, just call x509.CreateCertificate directly - it will be much easier to ensure you are only including validated information in your certificates.

Appendix: Proof-of-Possession and TLS

A common but unfounded objection to discarding everything in a CSR except the public key is that checking the CSR's signature is necessary because it ensures proof-of-possession of the private key. If a CA doesn't verify proof-of-possession, then someone could obtain a certificate for a key which belongs to someone else. (In fact, someone recently got a certificate containing Let's Encrypt's public key.) For TLS, this doesn't matter. (Other protocols, like S/MIME, may be different.) The TLS protocol ensures proof-of-possession every time the certificate is used.

For TLS 1.3, this is easy to see: the server or client has to send a Certificate Verify message which contains a signature from their private key over a transcript of the handshake. The handshake includes their certificate, which is a superset of the information in a CSR. Therefore, the Certificate Verify message proves at least as much as the CSR signature does. In fact it's better, since the proof is fresh and not reliant on a trusted third party doing its job correctly.

In earlier versions of TLS, client certificates are verified in the same way (signing a handshake transcript which includes the certificate). Server certificates are used differently, but ultimately the handshake transcript (which includes the server certificate) is authenticated by a shared secret that is known only to the client and the holder of the certificate private key (provided neither party deliberately sabotages their security). So as with TLS 1.3, private key possession is proven, rendering the CSR signature unnecessary.

Fixing the Breakage from the AddTrust External CA Root Expiration

2020-05-30T16:09:42Z

A lot of stuff on the Internet is currently broken on account of a Sectigo root certificate expiring at 10:48:38 UTC today. Generally speaking, this is affecting older, non-browser clients (notably OpenSSL 1.0.x) which talk to TLS servers which serve a Sectigo certificate chain ending in the expired certificate. See also this Twitter thread by Ryan Sleevi.

This post is going to explain what you should do to avoid problems, from the perspectives of both server operators (tldr: test your server with What's My Chain Cert? and do what it says) and client operators (tldr: upgrade your TLS libraries if possible, otherwise remove AddTrust External CA Root from your trust store).

Quick primer on certificate chains

When you connect to a TLS server, the server sends the client a certificate that proves its identity. The client needs to build a chain of certificates from the server certificate to a root certificate that the client trusts. To help the client build this chain, the server sends back one or more intermediate certificates after its own certificate.

For example, my website sends the following two certificates:

Subject	Issuer	Expiration
www.agwa.name	Sectigo RSA Domain Validation Secure Server CA	2021-04-03
Sectigo RSA Domain Validation Secure Server CA	USERTrust RSA Certification Authority	2030-12-31

The first certificate is mine and is issued by Sectigo RSA Domain Validation Secure Server CA. The second certificate is Sectigo RSA Domain Validation Secure Server CA and is issued by USERTrust RSA Certification Authority, which is a root certificate. These two certificates form a complete chain to a trusted root.

However, USERTrust RSA Certification Authority is a relatively new root. It was created in 2010, and it took many years for it to become trusted by all clients. As recently as last year I heard reports of clients not trusting this root.

For this reason, some servers send back a chain with an additional intermediate certificate:

Subject	Issuer	Expiration
www.agwa.name	Sectigo RSA Domain Validation Secure Server CA	2021-04-03
Sectigo RSA Domain Validation Secure Server CA	USERTrust RSA Certification Authority	2030-12-31
USERTrust RSA Certification Authority	AddTrust External CA Root	2020-05-30

This sequence of certificates form a chain to another root called AddTrust External CA Root which was created in 2000 and is trusted by many client platforms. Or rather, it was trusted before it expired today.

Fortunately, modern clients with well-written certificate validators (this includes all mainstream web browsers) won't have a problem with the expiration. Since they trust the USERTrust RSA Certification Authority root, they will build a chain to that root and ignore the fact that the server sent an expired intermediate certificate.

Other clients, notably anything using OpenSSL 1.0.x or GnuTLS, will have a problem. Even if these clients trust the USERTrust RSA Certification Authority root, and could build a chain to it if they wanted, they'll end up building a chain to AddTrust External CA Root instead, causing the certificate validation to fail with an expired certificate error.

Fixing this problem as a server operator

Basically, you need to remove the intermediate certificate issued by AddTrust External CA Root from your certificate chain.

If you get your certificates from SSLMate, you don't need to worry. I saw this coming over a year ago, and configured SSLMate to start providing a chain without AddTrust External CA Root. As certificates renewed, SSLMate customers received the new chain, and since SSLMate has long capped certificate lifetimes at one year, the older chain was cycled out before the intermediate expired.

But if your server is using Sectigo certificates from another source, you might need to worry. You can quickly test if your server is affected using What's My Chain Cert?. If your server is OK, it will say "correct chain". If it's sending the expired intermediate, it will say "trusted chain containing an expired certificate" and provide you with a link to download a correct, non-expired chain.

Fixing this problem as a client operator

In a perfect world, all of your libraries would be up-to-date and you wouldn't be using clownish TLS implementations like GnuTLS. But the world isn't perfect. OpenSSL 1.0.x is still common, and curl used it as recently as Debian Stretch. And APT, the package manager used by Debian and Ubuntu, links with GnuTLS.

Fortunately, OpenSSL 1.0.x and GnuTLS (at least on Debian) only choke on the expired intermediate if the AddTrust External CA Root root is in the local trust store. If it isn't, they will build a chain to USERTrust RSA Certification Authority instead. On Debian (and probably Ubuntu but I haven't tested), you can easily remove this root from the trust store as follows:

Edit /etc/ca-certificates.conf and put a bang/exclamation mark (!) before mozilla/AddTrust_External_Root.crt
Run update-ca-certificates

For Fedora and RHEL, see this Tweet by Christian Heimes.

Short Take: Why Trust-On-First-Use Doesn't Work (Even for SSH)

2020-02-08T16:08:08Z

Considering all the progress that has been made over the last decade making SSL certificates on the Web easy, free, automated, and transparent, it's a bit jarring to see someone arguing in 2020 that trust-on-first-use (TOFU) would be better for the Web:

Unpopular opinion. Most people would be better off with a Trust On First Use system for accessing sites. Like SSH, perhaps with some unique (per user) OOB addition to it. Would we really design it this way of starting again?

— Nick Hutton @nickdothutton, Feb 6, 2020

First, be wary of any comparison with SSH, because in the grand scheme of things, very few people use SSH. *nix sysadmins do, obviously. Many, but not all, software developers do. Some people in engineering/science fields might. But that's a drop in the bucket compared to the Web, which basically everyone uses. So just because something appears to work for SSH doesn't mean it will work for the Web.

And I would argue that TOFU actually doesn't work very well for SSH, and the only reason we put up with it is because of SSH's low deployment. SSH server host keys rarely change (which is bad for post-compromise security, so this is nothing to celebrate), but when they do, SSH handles it very poorly. The user gets a big scary message about a possible man-in-the-middle attack. And then what do you think they do? They do this:

Hi all,

It appears that as of midnight last night, SSH and login are working. However, there were a couple students last night who were getting errors such as “REMOTE HOST IDENTIFICATION HAS CHANGED!” or “POSSIBLE DNS SPOOFING DETECTED!” when trying to SSH in.

To fix this, you can run `ssh-keygen -R [REDACTED]` then try to SSH in again. I believe someone else mentioned last night that you could also just delete the entire ~/.ssh/known_hosts file as well to fix the issue, but this seems to be a less destructive solution.

That's from a real email that I once received. I would not be at all surprised if TOFU actually devolves to opportunistic encryption in practice, because users just bypass any man-in-the-middle error they receive.

You could make it really hard to bypass man-in-the-middle errors, but then people would brick their servers, as happened with HTTP public key pinning, which is one of the reasons why that technology is now extinct.

Proponents of TOFU might say that even if TOFU devolves to opportunistic encryption, the man-in-the-middle errors at least make attacks noisy. True, but the errors are seen by people who generally don't know what they mean and even if they did, can't evaluate whether an error is a legitimate key change or an actual attack. In contrast, a PKI with Certificate Transparency (i.e. the system currently deployed on the Web) also makes attacks noisy, but alerts about new certificates go to server operators, who actually know whether a new certificate is legitimate or not. They just need to be monitoring Certificate Transparency logs.

So yes, I do believe we would design the Web this way if starting again.

When Will Your DNS Record Be Published?

2020-02-03T14:00:00Z

When publishing a DNS record through an API, it's often useful to know when the DNS record has been fully published and is visible to DNS resolvers. A perfect example which comes up at SSLMate is automatically validating a certificate request by publishing a DNS record. SSLMate must be sure that the DNS record is visible before it tells the certificate authority to validate it, or the certificate request may fail.

Unfortunately, I know of only one DNS provider that has an API to tell you when a change is published: Route 53. After submitting a DNS change request to Route 53, the API returns a ChangeInfo object which contains a status of either "PENDING" or "INSYNC". You can poll the change until its status becomes "INSYNC", which means the change has taken effect on all Route 53 servers. SSLMate has published a lot of DNS records through Route 53 and this API has never let me down, which makes me happy.

Other DNS providers offer absolutely nothing to help you determine when a DNS change is visible. In these cases, SSLMate can do nothing but sleep for 10-120 seconds (depending on the provider) and hope for the best. Unfortunately, it doesn't help for SSLMate to try to resolve the DNS record to see if the record has been published - modern authoritative DNS services use many different servers, often with anycast or load balancing, so just because SSLMate sees the record doesn't mean that others will.

And then there's Google Cloud DNS, which deserves a special mention because they offer an API that looks very similar to Route 53's: after submitting a change request, the API returns a change object with a status of "pending" that you can poll until the status becomes "done". Sounds perfect! Except if you read the fine print, it says:

A status of "done" means that the request to update the authoritative servers has been sent, but the servers might not be updated yet

Sure enough, I found that it often takes two minutes after a change becomes "done" for it to be fully visible. The change object also contains a very bizarre boolean called "isServing", which is documented as:

If the DNS queries for the zone will be served.

I'm not sure what this means, or why information about the zone's status would be present in a record change object. In my testing I never once saw a value besides false, even long after queries for both the individual record and the zone as a whole were being served.

So the change object API is completely useless, and I don't know why it exists - who cares if the "request to update the authoritative servers has been sent"? That's an internal implementation detail. It only matters to users of the API if the change has been fully applied everywhere. So, SSLMate doesn't use change objects. It sleeps for 2 minutes after adding the record and hopes for the best.

All of this is exasperated when requesting a certificate using the ACME protocol. With ACME, if you tell the server the DNS record is published, but the server doesn't see the record, your certificate order is invalidated. You have to create a new order, and you're given a different DNS record that you have to publish. That means your ACME client could potentially get in a situation where it never makes forward progress, because on each attempt it fails to wait long enough before telling the ACME server to check the record.

SSLMate has a workaround for this when talking to an ACME-using certificate authority such as Let's Encrypt. Instead of publishing the record returned by the ACME server, SSLMate publishes an NS record that delegates the record to a custom-built authoritative DNS server operated by SSLMate. SSLMate's authoritative server returns the record provided by the ACME server. The NS record never changes, so if checking the record fails and SSLMate has to create a new ACME order, it doesn't need to republish a DNS record in the customer's zone; instead it just has to update the record that SSLMate's authoritative server returns, which can be done instantaneously. Therefore, every retry is more likely to succeed than the previous one since more time has elapsed since publishing the NS record. All of this happens completely automatically and transparently to the user of SSLMate, and is one of the ways that SSLMate provides great dependability. (Another benefit is that if the customer's DNS provider doesn't provide an API, they can publish the NS record manually and never have to touch it again, even for renewals.)

Nevertheless, it would be really nice if more DNS providers offered an API like Route 53 to report when a DNS record has been published.

This Is Why You Always Review Your Dependencies, AGPL Edition

2020-01-04T18:47:04Z

Before adding a dependency to one of my software projects, I do some basic vetting of the dependency. Among the things I check are:

How is the code licensed?
Who are the authors?
Are there any serious unresolved issues in the issue tracker?
Is there a history of serious bugs in the issue tracker?
What kind of code review process is used for pull requests?

Finally, I do a cursory review of the code. I look for anything blatantly insecure or malicious, and try to get a feel for the quality of the code base. I look for "Brown M&Ms" - minor inattention to detail that might indicate a larger problem.

I repeat the above recursively on transitive dependencies as many times as necessary. I also repeat the cursory code review any time I upgrade a dependency.

This is quite a bit of work, but is necessary to avoid falling victim to attacks like event-stream. I was recently reminded of yet another reason to review dependencies, as I reviewed Duo's highly-publicized Go library for WebAuthn, github.com/duo-labs/webauthn.

It started off poorly when I noticed some Brown M&M's: despite being a library, it was logging messages to stdout, and there were several code smells which indicated inexperience with Go. Sure enough, these minor issues foreshadowed a far larger problem: when I started reviewing the transitive dependency github.com/katzenpost/core/crypto/eddsa, I was greeted with an AGPLv3 license header.

This was bad news for most people wanting to use Duo's WebAuthn library. Although Duo had licensed their library under a BSD license, when you linked your application with Duo's library, you'd also be linking with the AGPL-licensed library, creating a "modified" work in the eyes of the (A)GPL, thus subjecting your application to section 13 of the AGPL:

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software.

In other words, if you used github.com/duo-labs/webauthn in a public-facing web app, your web app had to be open source.

The most galling thing about this dependency is that it's redundant with golang.org/x/crypto/ed25519, which is one of Go's quasi-standard "x" libraries. In fact, github.com/duo-labs/webauthn originally used golang.org/x/crypto/ed25519. That changed during a pull request from an external collaborator titled "Consolidate COSE things to their own area". In the process of moving some code from one file to another, this pull request subtly changed the implementation of OKPPublicKeyData.Verify.

Here's the old OKPPublicKeyData.Verify, which uses golang.org/x/crypto/ed25519:

// Verify Octet Key Pair (OKP) Public Key Signature
func (k *OKPPublicKeyData) Verify(data []byte, sig []byte) (bool, error) {
	f := HasherFromCOSEAlg(COSEAlgorithmIdentifier(k.PublicKeyData.Algorithm))
	h := f()
	h.Write(data)
	return ed25519.Verify(k.XCoord, h.Sum(nil), sig), nil
}

Here's the new OKPPublicKeyData.Verify, which uses the AGPL-licensed github.com/katzenpost/core/crypto/eddsa:

// Verify Octet Key Pair (OKP) Public Key Signature
func (k *OKPPublicKeyData) Verify(data []byte, sig []byte) (bool, error) {
	f := HasherFromCOSEAlg(COSEAlgorithmIdentifier(k.PublicKeyData.Algorithm))
	h := f()
	h.Write(data)
	var oKey eddsa.PublicKey
	err := oKey.FromBytes(k.XCoord)
	if err != nil {
		return false, err
	}
	return oKey.Verify(h.Sum(nil), sig), nil
}

There was zero explanation provided for this change. The pull request was reviewed by two Duo employees, who approved and merged it.

Aside: this is why I don't like to accept pull requests that move code around. Even if the new code organization is better, it's usually not worth the time it takes to ensure the pull request isn't doing anything extra.

I filed an issue about the AGPL-licensed dependency, and the developers switched back to using golang.org/x/crypto/ed25519. Nevertheless, I've decided not to use github.com/duo-labs/webauthn. The bulk of the library and its dependencies are to support a WebAuthn misfeature called attestation, which I have less-than-zero desire to use. I just finished writing a vastly simpler, attestation-free library which is less than one tenth the size (I will open source it soon - watch this space). (There's another lesson here, which is that complicated "features" like attestation that serve a minority's use case shouldn't be added to Web standards.) Developing this library is less costly than the liability of using an existing WebAuthn Go library.

This incident reminded me of why I like programming in Go. Go's extensive standard library, along with its quasi-standard "x" libraries, mean that the dependency graph of my projects is typically quite small. The bulk of my trust is consolidated in the Go project, and thanks to their stellar reputation and solid operating procedures, I don't feel a need to review the source code of the Go compiler and standard libraries. Even though I love Rust, I am terrified every time I look at the dependency graph of a typical Rust library: I usually see dozens of transitive dependencies written by Internet randos whom I have zero reason to trust. Vetting all those dependencies takes far too much time, which is why I'm much less productive in Rust than Go.

One final note: as a fan of verifiable data structures like Certificate Transparency, I have to love the new Go checksum database. However, the checksum database does you no good if you don't take the time to review your dependencies. Unfortunately, I've already seen one over-enthusiastic Go user claim that the Go checksum database solves all problems with dependency management. It doesn't. There's no easy way around this basic fact: you have to review your dependencies.

Preventing Server Side Request Forgery in Golang

2019-12-20T03:00:12Z

If your application makes requests to URLs provided by untrusted sources (such as users), you must take care to avoid server side request forgery (SSRF) attacks. Otherwise, an attacker might be able to induce your application to make a request to a service on your server's localhost or internal network. Since the service thinks the request is coming from a trusted source, it might perform a privileged action or return sensitive data that gets relayed by your application back to the attacker. This is particularly a problem when running in EC2, which exposes sensitive credentials over its metadata service, which is accessible over HTTP at a private IP address. SSRF attacks can be serious; one was exploited earlier this year to steal more than 100 million credit applications from Capital One.

One way to prevent SSRF attacks is to validate all addresses before connecting to them. However, you must do the validation at a very low layer to be effective. It's not sufficient to simply block URLs that contain "localhost" or an internal IP address, since an attacker could publish a DNS record under a public domain that resolves to an internal IP address. It's also insufficient to do the DNS lookup yourself and block a URL if the hostname resolves to an unsafe address; an attacker could set up a special DNS server that returns a safe address the first time it's queried, and the target address the second time when your application actually connects to the URL.

Instead, you need to hook deep into your HTTP client's networking stack and check for a safe address right before the HTTP client tries to access it.

Fortunately, Go makes it easy to hook in at just the right place, thanks to the Control field of net.Dialer, introduced in Go 1.11:

// If Control is not nil, it is called after creating the network
// connection but before actually dialing.
//
// Network and address parameters passed to Control method are not
// necessarily the ones passed to Dial. For example, passing "tcp" to Dial
// will cause the Control function to be called with "tcp4" or "tcp6".
Control func(network, address string, c syscall.RawConn) error // Go 1.11

This function is called by Go's standard library after the address has been resolved, but before connecting. The network argument is tcp4, udp4, tcp6, or udp6, and the address argument is an IP address and port number separated by a colon (e.g. 192.0.2.0:80 or [2001:db8:f942::3ab2]:443; split it with net.SplitHostPort, not strings.Split, to avoid IPv6 breakage). If the control function returns an error, the dial is aborted.

Here's an example control function that returns an error if the address is not safe. It's quite conservative, permitting only TCP connections to port 80 and 443 on public IP addresses (see here for the implementation of isPublicIPAddress). You may want to customize the control function to suit your application's needs.

func safeSocketControl(network string, address string, conn syscall.RawConn) error {
	if !(network == "tcp4" || network == "tcp6") {
		return fmt.Errorf("%s is not a safe network type", network)
	}

	host, port, err := net.SplitHostPort(address)
	if err != nil {
		return fmt.Errorf("%s is not a valid host/port pair: %s", address, err)
	}

	ipaddress := net.ParseIP(host)
	if ipaddress == nil {
		return fmt.Errorf("%s is not a valid IP address", host)
	}

	if !isPublicIPAddress(ipaddress) {
		return fmt.Errorf("%s is not a public IP address", ipaddress)
	}

	if !(port == "80" || port == "443") {
		return fmt.Errorf("%s is not a safe port number", port)
	}

	return nil
}

Once you have a control function, you can use it to make HTTP requests as follows (the various numbers below match those used by http.DefaultClient):

safeDialer := &net.Dialer{
	Timeout:   30 * time.Second,
	KeepAlive: 30 * time.Second,
	DualStack: true,
	Control:   safeSocketControl,
}

safeTransport := &http.Transport{
	Proxy:                 http.ProxyFromEnvironment,
	DialContext:           safeDialer.DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}

safeClient := &http.Client{
	Transport: safeTransport,
}

resp, err := safeClient.Get(untrustedURL)

The above code examples are in the public domain.

Programmatically Accessing Your Customers' Google Cloud Accounts (While Avoiding the Confused Deputy Problem)

2019-12-03T15:07:06Z

SaaS applications often need to access their customers' cloud resources at providers like Amazon Web Services and Google Cloud Platform. For instance, a monitoring service might require read-only access to their customers' AWS accounts so it can inventory resources. At SSLMate, we request access to our customers' DNS zones so we can publish DNS records to automatically validate the certificates that they request.

Doing this with AWS was easy, thanks to their detailed documentation for precisely this use case. However, when it came to Google Cloud, the only hint of a solution that I could find was buried deep in this FAQ:

How can I access data from my users' Google Cloud Platform project using Cloud APIs?

You can access data from your users' Google Cloud Platform projects by creating a service account to represent your service, and then having your customers grant that service account appropriate access to their cloud data using IAM policies. Note that you might want to create a service account per customer if you need to avoid confused deputy problems.

The first part sounds pretty easy: create a service account which SSLMate uses when making DNS changes and ask customers to grant this account access to Cloud DNS in their Google Cloud project. What isn't as easy, unfortunately, is solving the confused deputy problem. Contrary to the FAQ, this isn't something that we "might" want to do - it's something we absolutely must do. If SSLMate used just a single service account and all of our customers authorized it, then a malicious customer would be able to request a certificate for any of our customer's domains. SSLMate would access the victim's project using the single SSLMate service account and publish the validation record for the attacker's certificate request. This would succeed since the victim had authorized the single SSLMate service account. I can't think of any application that would not also be vulnerable if it did not address the confused deputy problem.

AWS provides a nice solution to the problem. The customer doesn't just authorize SSLMate's AWS account; they authorize the AWS account plus an "external ID", which for SSLMate is the same as their SSLMate customer ID. When SSLMate connects to AWS to add a DNS record, it sends the ID of the customer on whose behalf it is acting. If it doesn't match the authorized external ID, AWS blocks the request.

But with Google Cloud we're stuck creating a different service account for each SSLMate customer. The customer would authorize only that service account, and SSLMate would use it to access their Google Cloud project. Creating all these service accounts isn't hard - there's an API for that - but what's hard is giving each of these service accounts their own key. I'd have to securely store all those keys somewhere, and rotate them periodically, which is a pain.

Fortunately, Google Cloud has an API to generate short-lived OAuth2 access tokens for a service account. SSLMate invokes this API from a single master service account to get an access token for the customer-specific service account, and uses the access token to make the DNS change requests. When it's done, SSLMate discards the credentials. The customer-specific accounts have no long-term keys associated with them; only the single master service account does, making key management significantly easier, and equivalent to AWS.

Note that Google Cloud limits projects to 100 service accounts by default, so you should keep a close eye on your utilization and request a quota increase when necessary. Unfortunately, when I hit the limit, Google initially denied my quota increase despite my use case literally being in their documentation. I only got the limit increased after complaining on Twitter and getting retweeted by Corey Quinn. Hopefully this won't happen again!

Here's how you can set this up for your SaaS application...

1. Create the Master Service Account

This service account will be used by your app to create customer-specific service accounts, and to create short-lived access credentials for accessing those accounts.

Create the service account, giving it a name of your choosing.
Assign the service account the following roles:
- Service Account Admin - this is needed to create customer-specific service accounts
- Service Account Token Creator - this is needed to get the short-lived access credentials
Create a key for this service account and save a copy. This key will be used by your app.

2. Onboarding a Customer

Your app needs to do this every time you onboard a new customer (or an existing customer wants to integrate their Google Cloud account with your service).

Create a service account for the customer using the IAM API, authenticating with the key for your master service account. I suggest deriving the service account ID from the customer's internal ID. For instance, if the customer ID is 1234, use customer-1234 as the service account ID.
Instruct your customer to authorize this service account as follows:
1. Visit the IAM page for their project.
2. Click Add.
3. In the New Member box, your customer must enter the email address of the service account created in step 1. The email address will look like: SERVICE_ACCOUNT_ID@YOUR_PROJECT_ID.iam.gserviceaccount.com
4. In the Role box, choose the roles necessary for your app to function (in SSLMate's case, this is "DNS Administrator").
5. Click Save.
After your customer authorizes the account, I recommend testing the access by making a simple read-only request as described in part 3 below and displaying an error if it doesn't work.

3. Accessing Your Customer's Account

Use the generateAccessToken API to create an access token for the customer-specific service account. Authenticate to this API using the master service account's key. The name parameter must contain the email address of the customer-specific service account, which you can derive from the customer ID if you follow the naming convention suggested above. The scope array must contain the OAuth scopes you need to access. (In SSLMate's case, this is https://www.googleapis.com/auth/ndev.clouddns.readwrite)
Use the token returned by the API call as a Bearer token for accessing your customer's account.

If you're using Go, it's helpful to put the above logic in a type that implements the oauth2.TokenSource interface, which you can pass to oauth2.NewClient to create the http.Client that you use with the Google Cloud API library. Here's a drop-in token source type you can use, and here's an adaptable example of how to use it.

Conclusion

While it's more complicated to set up than AWS, in the end this solution has the same desirable properties as AWS: protection against confused deputy attacks, and just one long term credential for your application. However, Google Cloud needs to do better. First, they need to make their documentation for this use case as helpful as AWS's. Second, they should significantly increase the default service account quota or abolish it entirely: otherwise, it serves as a disincentive for people to do the secure thing.

Addendum: Why I Didn't Use Three-Legged OAuth

My first attempt at implementing this feature used three-legged OAuth. SSLMate would redirect the customer to Google, they'd click a single button to authorize SSLMate for read/write access to Google Cloud DNS, and then Google would redirect back to SSLMate with an access code. This worked well and provided a great user experience since the customer didn't need to do any configuration. Unfortunately, the access was linked to the customer's Google account, rather than their Google Cloud project. I thought this was a bad idea because if the Google Account ever lost access to the project, it would break the SSLMate integration. The integration needs to continue working even if the employee who set it up leaves the company which owns the Google Cloud project.

Another thing which turned me off three-legged OAuth was that Google wanted to subject my integration to a lengthy review because accessing DNS is considered a "sensitive scope." This sounded like a hassle, and I assume it is only going to get more restrictive in the future as Google tries to stop malicious OAuth apps like last year's viral Google Docs attack. For example, if they decided one day to classify DNS access as a "restricted scope," I would be subjected to a 5 figure security audit.

Consequentially, I ditched three-legged OAuth and turned to the solution described above. The user experience is not quite as nice but it's much more robust. SSLMate still uses three-legged OAuth with other DNS providers which support it, like DNSimple and Digital Ocean.

Article updated on 2022-05-19 to mention the service account quota and my trouble getting it increased.

MTA-STS is Hard. Here's how DNS Providers Can Make it Awesome With Automation...

2019-04-15T14:00:00Z

Last week, Gmail became the first major email provider to enable the new MTA-STS standard, which will prevent attackers from intercepting email sent to and from Gmail. If you operate a domain which receives email, you should be looking into enabling MTA-STS too, even if you've out-sourced operation of the actual mail servers to a third party provider.

Unfortunately, MTA-STS introduces several new moving parts to operating a domain which I anticipate will cause operational problems. However, a smart DNS provider can offer automation that makes MTA-STS as easy for domain owners as checking a box. Let me explain how...

First, some background. MTA-STS can be distilled into two parts:

Your domain's mail servers need to support modern TLS (TLS 1.2 or higher) and present a publicly-trusted certificate that's valid for the MX server hostname (that is, the hostname which you put in the MX record - not the domain name which receives email). Since this part is straightforward and is taken care of by your mail server provider (who might even be in compliance already), I will not be focusing on it in this post.
You need to duplicate the contents of your domain's MX records in a text file which you serve over HTTPS at https://mta-sts.YOURDOMAIN/.well-known/mta-sts.txt. The reason for the duplication is that DNS is not authenticated, but HTTPS is. By requiring the publication of your MX records at an HTTPS URL which is derived from your domain name, MTS-STS prevents an attacker from swapping out your MX servers with their own. (DNSSEC would accomplish the same thing, but DNSSEC hasn't worked out very well in practice, which is why MTA-STS exists.)

In addition, you need to publish a TXT record at _mta-sts.YOURDOMAIN indicating that your domain uses MTA-STS. Any time you change your mta-sts.txt file, you need to update the ID value in this TXT record. The reason for the TXT record is that retrieving a web page over HTTPS is expensive compared to a DNS lookup. The TXT record allows mail servers to skip retrieving mta-sts.txt if a domain doesn't use MTA-STS or if mta-sts.txt hasn't changed since the last retrieval.

The consequence of the above is that any time you update your domain's mail servers, you have to make a change in three places: the MX record itself (as you do now), the mta-sts.txt file on your web server, and finally the _mta-sts TXT record. If you forget one of the updates or mess one up, you risk losing mail.

Unfortunately, my experience indicates that humans are quite bad at remembering this type of thing. Two common failures which I've seen are forgetting to keep a domain's NS records in sync with the NS records in the parent zone, and forgetting to update a zone's SOA serial number after making changes. Therefore, I anticipate DNS administrators forgetting to keep their MX records in sync with their mta-sts.txt file, and forgetting to update the ID in the _mta-sts TXT record. Ironically, one of the motivations for MTA-STS is that DNSSEC is too difficult to deploy correctly, but I think it's premature to say that MTA-STS will cause fewer outages.

Generally, the way to cope with error-prone, repetitive tasks like this is to automate them away, and DNS providers are in a perfect position to automate MTA-STS policy maintenance for their customers.

DNS providers should offer a checkbox to enable MTA-STS on any domain with an MX record. If you check this box, the DNS provider should do the following:

Automatically publish an A/AAAA/CNAME record at mta-sts.YOURDOMAIN that points to a web server operated by the DNS provider. The DNS provider should automatically obtain a certificate for this hostname. Their web server should respond to requests for /.well-known/mta-sts.txt by consulting YOURDOMAIN's MX records and dynamically generating an mta-sts.txt file containing each MX server. (Rather than looking up the MX records over the open Internet, which would be insecure, they should directly consult the data source for the domain's records, which they can do since they are the DNS provider.)
Automatically publish a TXT record at _mta-sts.YOURDOMAIN containing an automatically-generated ID value. The ID can be up to 32 alphanumeric characters and must change every time the mta-sts.txt file changes. The DNS provider could use a sequentially increasing integer, or 16 random hex-encoded bytes that are regenerated with every update. It is critical that new mta-sts.txt files are published before the ID in the TXT record is updated, or else other mail servers might cache the old mta-sts.txt file under the new ID, causing delivery failures. It's also a good idea to update the ID at least once a day, whether or not the policy changes, to ensure that other mail servers refresh their caches, just in case an old policy has been cached erroneously.

I hope DNS providers implement this. If domain owners have to manage MTA-STS manually, I anticipate a lot of “human error” that could hamper MTA-STS' adoption. I put “human error” in quotes because although it is a very common expression, it often indicates a problem not with the human who made the error, but with software failing to automate tasks that computers can do easily. Let's make sure MTA-STS can't have human error!

Update (2020-11-12): One-and-a-half years later, I'm not aware of any DNS providers offering MTA-STS automation. But SSLMate now does! It's not quite as seamless as what a DNS provider could offer, but it's pretty good: SSLMate continuously monitors your domains' MX records and automatically publishes an appropriate MTA-STS policy for you, obtaining and renewing the necessary SSL certificates. All you need to do is publish two CNAME records delegating the mta-sts and _mta-sts subdomains to SSLMate-operated servers. For transparency, SSLMate emails you whenever an MX record changes, so you can detect unauthorized changes.

Update (2022-10-25): Improved advice about generating policy IDs. I now recommend using a sequentially increasing integer or random bytes, and regenerating the ID at least once a day to avoid caching problems. I previously suggested generating the ID from a hash of the policy file, but this is not robust: if the policy changes inbetween a mail server checking the TXT record and downloading the policy file, the mail server will cache the new file under the old ID. If you later switch back to the original policy, the other mail server won't detect the policy change and will keep using the new policy, possibly causing delivery failures.

Making Certificates Easier and Helping the Ecosystem: Four Years of SSLMate

2018-04-13T18:06:02Z

I'm not actually sure when SSLMate was born. I got the idea, registered the domain name, and wrote the first lines of code in August 2013, but I put it on the backburner until March 2014. I think I "launched" in early April, but since I thought of SSLMate as a side project mainly for my own use, I didn't do anything special.

I do know that I sold my first certificate on April 13, 2014, four years ago to this day. I sold it to a friend who needed to replace his certificates after Heartbleed and was fed up with how hard his certificate authority was making it. It turns out a lot of people were generally fed up with how hard certificate authorities made things, and in the last four years, SSLMate has exceeded my wildest expectations and become my full-time job.

Sadly, certificate resellers have a deservedly bad reputation, which has only gotten worse in recent months thanks to the likes of Trustico and other resellers with terrible security practices. But it's not entirely a hellscape out there, and for SSLMate's birthday, I thought it would be nice to celebrate the ways SSLMate has been completely unlike a typical certificate reseller, and how SSLMate has allowed me to pursue work over the last four years that lifts up the Web PKI as a whole.

SSLMate was different from the beginning. When SSLMate launched in 2014, it was the first, and only, way you could get a publicly-trusted SSL certificate entirely from the command line. SSLMate made automated certificate issuance accessible to anyone, not just large customers of certificate authorities. More importantly, it significantly improved the usability of certificate issuance at a time when getting a certificate meant running long OpenSSL commands, copy-and-pasting PEM blobs to and from websites, and manually extracting a Zip file and assembling the correct certificate chain. The state of the art in usability was resellers generating, and possibly storing, your private key on their servers. SSLMate showed that certificate issuance could be easy without having to resort to insecure practices.

In September 2014, SSLMate stopped selling multi-year certificates. This was unusual at a time when certificates could be valid for up to five years. The change was unpopular among a handful of customers, but it was unquestionably the right thing to do. Shorter-lived certificates are better for the ecosystem, since they allow security standards to advance more quickly, but they are also better for customers. I learned from the SHA-1 deprecation that a multi-year certificate might not remain valid for its entire term, and it felt wrong to sell a product that I might not be able to deliver on. Sure enough, the five year Symantec certificates that SSLMate was reselling at the time will all become prematurely invalid as of the next Chrome release. (The affected customers all got free replacements, but it's still not what they were expecting.)

The industry is following SSLMate's lead: the CA/Browser Forum limited certificate lifetimes to three years beginning in 2015, and further limited lifetimes to two years beginning last month.

In April 2015, SSLMate released its first public REST API. (While we already had an API for use by the command-line tool, it wasn't previously documented.) As far as I know, this was the world's first fully self-service API for the automated issuance of publicly-trusted SSL certificates. Although major certificate authorities, and some resellers, had APIs (indeed, SSLMate used them under-the-hood), every one of the many APIs I looked at required you to ask a human to enable API access for your account. Some even required you to get on the phone to negotiate prices. With SSLMate's API, you could sign up yourself and immediately start issuing publicly-trusted certificates.

In June, SSLMate published a blog post explaining how to set up OCSP Stapling in Apache and nginx. Resources at the time were pretty bad, so I had to dive into the source code for Apache and nginx to learn how stapling really worked. I was rather horrified at what I saw. The worst bug was that nginx would sometimes staple an expired OCSP response, which would cause Firefox to reject the certificate. So, I submitted a patch fixing it.

In August, several folks asked me to review the ACME specification being worked on at the IETF and provide feedback based on my experience with automated certificate issuance APIs. While reading the draft, I was very bothered by the fact that RSA and ECDSA signatures were being used without any associated message. I had never heard of duplicate signature key selection attacks, but I knew that crypto wasn't being used properly, and when crypto isn't used properly, bad things tend to happen. So I dusted off my undergraduate number theory textbook and came up with an attack that broke ACME, allowing attackers to get unauthorized certificates. After my disclosure, ACME was fixed, before it was deployed in the Web PKI.

In March 2016, it occurred to me that signing OCSP responses with a weak hash function such as SHA-1 could probably lead to the forgery of a trusted certificate. It was already known that signing a certificate with a weak hash function could lead to the forgery of another certificate using a chosen-prefix collision attack, so CAs were forbidden from signing certificates using SHA-1. However, no one had demonstrated a collision attack against OCSP responses, and CAs were allowed to sign OCSP responses with SHA-1.

I figured out how to execute a chosen-prefix attack against OCSP responses, rented a GPU instance in EC2 to make a proof-of-concept with MD5, and scanned every OCSP responder I could find to see which ones could be used to forge a certificate with a SHA-1 collision attack. I reported my findings to the mozilla.dev.security.policy mailing list. This led to a change in Mozilla's Root Store Policy to forbid CAs from signing OCSP responses with SHA-1 except under safe conditions.

Since then, I've periodically scanned OCSP responders to ensure they remain in compliance.

In July 2016, SSLMate launched Cert Spotter, a Certificate Transparency monitor. The core of Cert Spotter is open source, because I wanted non-profits to be able to easily use Certificate Transparency without depending on a commercial service. I'm proud to say that the Wikimedia Foundation uses the open source Cert Spotter to watch for unauthorized certificates for wikipedia.org and their other domains.

Certificate Transparency was designed to be verifiable, but this only matters if a diverse set of people bother to actually do the verification. Cert Spotter has always verified log behavior, and it has detected log misbehavior that was missed by other monitors.

In March 2017, SSLMate started operating the world's second Certificate Transparency gossip endpoint (Graham Edgecombe gets credit for the first) to provide further resiliency to the Certificate Transparency ecosystem. SSLMate also released ct-honeybee, a lightweight program that queries each Certificate Transparency log for its current state and uploads it to Graham's and SSLMate's gossip endpoints. People are now running ct-honeybee on devices all around the world, helping ensure that logs do not present different views to different parts of the Internet.

In 2017, I attended both Certificate Transparency Policy Days hosted by Google to help hash out policy for the burgeoning Certificate Transparency ecosystem.

In September, to help with the upcoming CAA enforcement deadline, I released a free CAA Test Suite for CAs to use to test their implementations.

What's next for SSLMate? The biggest change over the last four years is that the price of certificates as individual goods has gone to zero. But SSLMate has never really been about selling certificates, but about selling easy-to-use software, good support, and a service for managing certificates. And I still see a lot of work to be done to make certificates even easier to work with, particularly with all the new ways certificates are going to be used in the future. I'm pleased to be kicking off SSLMate's fifth year with the release of SSLMate for SaaS, a new service that provides an easy, high-level way for SaaS companies to get certificates for the customer domains they host. This is the first of many exciting announcements in store for this year.

These Three Companies Are Doing the Internet a Solid By Running Certificate Transparency Logs

2018-03-29T16:36:23Z

When we use the Internet, we rely on the security of the certificate authority system to ensure we are talking with the right people. Unfortunately, the certificate authority system is a bit of a mess. One of the ways we're trying to clean up the mess is Certificate Transparency, an effort to put all SSL certificates issued by public certificate authorities in public, verifiable, append-only logs. Domain owners can monitor the logs for unauthorized certificates, and web browsers can monitor for compliance with the rules and take action against non-compliant certificate authorities. After ramping up for the last four years, Certificate Transparency is about to enter prime time: Google Chrome is requiring that all certificates issued on or after April 30, 2018 be logged.

But who is supposed to run these Certificate Transparency logs? Servers, electricity, bandwidth, and system administrators cost money. Although Google is spearheading Certificate Transparency and operates nine logs that are recognized by Chrome, Certificate Transparency is supposed to benefit everyone and it would be unhealthy for the Internet if Google ran all the logs. For this reason, Chrome requires that certificates be included in at least one log operated by an organization besides Google.

So far, three organizations have stepped up and are operating Certificate Transparency logs that are recognized by Chrome and are open to certificates from any public certificate authority:

DigiCert was the first non-Google organization to set up a log, and they now operate several logs recognized by Chrome. Their DigiCert 2 log accepts certificates from all public certificate authorities. They are also applying for recognition of their Nessie and Yeti log sets, which accept certificates from all public certificate authorities and are each split into five shards based on the expiration year of the certificate. (They also operate DigiCert 1, which only accepts certificates from some certificate authorities, and have three logs acquired from Symantec which they are shutting down later this year.)

DigiCert is notable because they've written their own Certificate Transparency log implementation instead of using an open source one. This is helpful because it adds diversity to the ecosystem, which ensures that a bug in one implementation won't take out all logs.

Comodo Certification Authority (which is thankfully no longer owned by the blowhard who thinks he invented 90 day certificates) operates two logs recognized by Chrome: Mammoth and Sabre. Both logs accept certificates from all public certificate authorities, and run SuperDuper, which is Google's original open source log implementation.

In addition to operating two open logs, Comodo CA runs crt.sh, a search engine for certificates found in Certificate Transparency logs. crt.sh has been an invaluable resource for the community when investigating misbehavior by certificate authorities.

Cloudflare is the latest log operator to join the ecosystem. They operate the Nimbus log set, which accepts certificates from all public certificate authorities and is split into four shards based on the expiration year of the certificate. Nimbus runs Trillian, Google's latest open source implementation, with some Cloudflare-specific patches.

Cloudflare is unique because unlike DigiCert and Comodo CA, they are not a certificate authority. DigiCert and Comodo have an obvious motivation to run logs: they need somewhere to log their certificates so they will be trusted by Chrome. Cloudflare doesn't have such a need, but they've chosen to run logs anyways.

DigiCert, Comodo CA, and Cloudflare should be lauded for running open Certificate Transparency logs. None of them have to do this. Even DigiCert and Comodo could have adopted the strategy of their competitors and waited for someone else to run a log that would accept their certificates. Their willingness to run logs shows that they are invested in improving the Internet for everyone's benefit.

We need more companies to step up and join these three in running public Certificate Transparency logs. How about some major tech companies? Although we all benefit from the success of Certificate Transparency, large tech companies benefit even more: they are bigger targets than the rest of us, and they have more to gain when the public feels secure conducting business online. Major tech companies are also uniquely positioned to help, since they already run large-scale Internet infrastructure which could be used to host Certificate Transparency logs. And what kind of tech company doesn't want the cred that comes from helping the Internet out?

If you're a big tech company that knows how to run large-scale infrastructure, why aren't you running a Certificate Transparency log too?

Google's Certificate Revocation Server Is Down - What Does It Mean?

2018-01-21T19:35:36Z

Earlier today, someone reported to the mozilla.dev.security.policy mailing list that they were unable to access any Google websites over HTTPS because Google's OCSP responder was down. David E. Ross says the problem started two days ago, and several Tweets confirm this. Google has since acknowledged the report. As of publication time, the responder is still down for me, though Ross reports it's back up. (Update: a fix is being rolled out.)

What's an OCSP Responder?

OCSP, which stands for Online Certificate Status Protocol, is the system used by SSL/TLS clients (such as web browsers) to determine if an SSL/TLS certificate is revoked or not. When an OCSP-using TLS client connects to a TLS server such as https://www.google.com, it sends a query to the OCSP responder URL listed in the TLS server's certificate to see if the certificate is revoked. If the OCSP responder replies that it is, the TLS client aborts the connection. OCSP responders are operated by the certificate authority which issued the certificate. Google has its own publicly-trusted certificate authority (Google Internet Authority G2) which issues certificates for Google websites.

I thought Chrome didn't support OCSP?

You're correct. Chrome famously does not use OCSP. Chrome users connecting to Google websites are blissfully unaware that Google's OCSP responder is down.

But other TLS clients do use OCSP, and as a publicly-trusted certificate authority, Google is required by the Baseline Requirements to operate an OCSP responder. However, certificate authorities have historically done a bad job operating reliable OCSP responders, and firewalls often get in the way of OCSP queries. Consequentially, although web browsers like Edge, Safari, and Firefox do contact OCSP responders, they use "soft fail" and allow a connection if they don't get a well-formed response from the responder. Otherwise, they'd constantly reject connections that they shouldn't. (This renders OCSP almost entirely pointless from a security perspective, since an attacker with a revoked certificate can usually just block the OCSP response and the browser will accept the revoked certificate.) Therefore, the practical impact from Google's OCSP responder outage is probably very small. Nearly all clients are going to completely ignore the fact that Google's OCSP responder is down.

That said, Google operates some of the most heavily trafficked sites on the Internet. A wide variety of devices connect to Google servers (Google's FAQ for Certificate Changes mention set-top boxes, gaming consoles, printers, and even cameras). Inevitably, at least some of these devices are going to use "hard fail" and reject connections when an OCSP responder is down. There are also people, like the mozilla.dev.security.policy poster, who configure their web browser to use hard fail. Without a doubt, there are people who are noticing problems right now.

I thought Google had Site Reliability Engineers?

Indeed they do, which is why this incident is noteworthy. As I mentioned, certificate authorities tend to do a poor job operating OCSP responders. But most certificate authorities run off-the-shelf software and employ no software engineers. Some regional European certificate authorities even complain when you report security incidents to them during their months-long summer vacations. So no one is surprised when those certificate authorities have OCSP responder outages. Google, on the other hand, sets higher expectations.

How will Certificate Transparency Logs be Audited in Practice?

2018-01-10T00:41:54Z

Certificate Transparency, the effort to detect misissued SSL certificates by publishing all certificates in public logs, only works if TLS clients reject certificates that are not logged. Otherwise, certificate authorities could just not log the certificates that they misissue. TLS clients accomplish this by requiring that a certificate be accompanied by a "signed certificate timestamp" (SCT), which is a promise by a log to include the certificate within 24 hours of the SCT's issuance timestamp. (This period is called the Maximum Merge Delay. It doesn't have to be 24 hours, but it is for all logs currently trusted by Chrome.) The SCT can be embedded in the certificate, the OCSP response (which the TLS server must staple), or the TLS handshake.

But an SCT is only a promise. What if the log breaks its promise and never includes the certificate? After all, certificate authorities promise not to issue bad certificates, but they do so in droves. We don't want to replace the problem of untrustworthy certificate authorities with the problem of untrustworthy Certificate Transparency logs. Fortunately, Certificate Transparency was designed to be verifiable, making it possible, in theory, for the TLS client to verify that the log fulfills its promise. Unfortunately, SCT verification seems to be a rather elusive problem in practice. In this post, I'm going to explore some of the possible solutions, and explain their shortcomings.

Direct Proof Fetching

A Certificate Transparency log stores its certificates in the leaves of an append-only Merkle Tree, which means the log can furnish efficient-to-verify cryptographic proofs that a certificate is included in the tree, and that one tree is an appendage of another. The log provides several HTTP endpoints for acquiring these proofs, which a TLS client can use to audit an SCT as follows:

Use the get-sth endpoint to retrieve the log's latest signed tree head (STH). Store the returned tree head.
Use the get-sth-consistency endpoint to retrieve a consistency proof between the tree represented by the previously-stored tree head, and the tree represented by the latest tree head. Verify the proof. If the proof is invalid, it means the new tree is not an appendage of the old tree, so the log has violated its append-only property.
Use the get-proof-by-hash endpoint to retrieve an inclusion proof for the SCT based on the latest tree head. Verify the proof. If the proof is invalid, it means the log has not included the corresponding certificate in the log.

Note that steps 1 and 2 only need to be done periodically for each log, rather than once for every certificate validation.

There are several problems with direct proof fetching:

Logs can't handle the load of every web browser everywhere contacting them for every TLS connection ever made.
When the client asks for an inclusion proof in step 3, it has to reveal to the log which SCT it wants the proof for. Since each SCT corresponds to a certificate, and certificates contain domain names, the log learns every domain the client is visiting, which is a violation of the user's privacy.
Since the log has up to 24 hours to include a certificate, the certificate might not be included at the time the client sees an SCT. Furthermore, even if 24 hours have elapsed, it would slow down the TLS client to retrieve a proof during the TLS handshake. Therefore, the client has to store the SCT and certificate and fetch the inclusion proof at some future time.

This exposes the client to denial-of-service attacks, wherein a malicious website could try to exhaust the client's storage by spamming it with certificates and SCTs faster than the client can audit them (imagine a website which uses JavaScript to make a lot of AJAX requests in the background). To avoid denial-of-service, the client has to limit the size of its unverified SCT store. Once the limit is reached, the client has to stop adding to the store, or evict old entries. Either way, it exposes the client to flushing attacks, wherein an attacker with a misissued certificate and bogus SCT spams the client with good certificates and SCTs so the bad SCT never gets verified.

Another question is what to do if the verification fails. It's too late to abort the TLS handshake and present a certificate error. Most users would be terribly confused if their browser presented an error message about a misbehaving log, so there needs to be a way for the client to automatically report the SCT to some authority (e.g. the browser vendor) so they know that the log has misbehaved and needs to be distrusted.

Proxied Auditing

To solve the scalability problem, clients can talk with servers operated by the client software supplier, rather than the logs themselves. (The client software supplier has to be comfortable operating scalable infrastructure; Google is, at least.)

For example, Chrome doesn't retrieve STHs and consistency proofs directly from logs. Instead, Google-operated servers perform steps 1 and 2 and distribute batches of STHs (called STHSets) to clients using Chrome's update system. Users have to trust the Google servers to faithfully audit the STHs, but since Chrome users already have to trust Google to provide secure software, there is arguably no loss in security with this approach.

Unfortunately, proxied auditing doesn't fully solve the privacy problem. Instead of leaking every domain the user visits to the log operator, requesting an inclusion proof leaks visited domains to the client software supplier.

DNS Proof Fetching

Chrome's solution to the privacy problem is to fetch inclusion proofs over DNS. Instead of making an HTTP request to the get-proof-by-hash endpoint, Chrome will make a series of DNS requests for sub-domains of ct.googleapis.com which contain the parameters of the inclusion proof request. For example:

D4S6DSV2J743QJZEQMH4UYHEYK7KRQ5JIQOCPMFUHZVJNFGHXACA.hash.pilot.ct.googleapis.com

The name servers for ct.googleapis.com respond with the inclusion proof inside TXT records.

Chrome believes DNS proof fetching is better for privacy because instead of their servers learning the Chrome user's own IP address, they will learn the IP address of the user's DNS resolver. If the user is using an ISP's DNS resolver, its IP address will be shared among all the ISP's users. Although DNS is unencrypted and an eavesdropper along the network path could learn what inclusion proofs the user is fetching, they would already know what domains the user is visiting thanks to the DNS queries used to resolve the domain in the first place.

However, there are some caveats: if a user doesn't share a DNS resolver with many other users, it may be possible to deanonymize them based on the timing of the DNS queries. Also, if a user visits a domain handled by an internal DNS server (e.g. www.intranet.example.com), that DNS query won't leak onto the open Internet, but the DNS query for the inclusion proof will, causing a DNS privacy leak that didn't exist previously. A privacy analysis is available if you want to learn more.

Chrome's DNS proof fetching is still under development and hasn't shipped yet.

Embedded Proofs

The second version of Certificate Transparency (which is not yet standardized or deployed) allows inclusion proofs to be presented to clients alongside SCTs (embedded in the certificate, OCSP response, or TLS handshake) saving the client the need to fetch inclusion proofs itself. That solves the problems above, but creates new ones.

First, inclusion proofs can't be obtained until the certificate is included in the log, which currently takes up to 24 hours. If clients were to require inclusion proofs along with SCTs, a newly issued certificate wouldn't be usable for up to 24 hours. That's OK for renewals (as long as you don't wait until the last minute to renew), but people are used to being able to get new certificates right away.

Second, clients can no longer choose the STH on which an inclusion proof is based. With direct or DNS proof fetching, the client always requests an inclusion proof to the latest STH, and the client can easily verify that the latest STH is consistent with the previous STH. When the client gets an embedded inclusion proof, it doesn't immediately know if the STH on which it is based is consistent with other STHs that the client has observed. The client has to audit the STH.

Unfortunately, the more frequently a log produces STHs that incorporate recently-submitted certificates, the harder it is to audit them. Consider the extreme case of a log creating a new STH immediately after every certificate submission. Although this would let an inclusion proof be obtained immediately for an SCT, auditing the new STH has all the same problems as SCT auditing that were discussed above. It's bad for privacy, since a log can assume that a client auditing a particular STH visited the domain whose certificate was submitted right before the STH was produced. And if the client wants to avoid making auditing requests during the TLS handshake, it has to store the STH for later auditing, exposing it to denial-of-service and flushing attacks.

STH Frequency and Freshness

To address the privacy problems of excessive STH production, CTv2 introduces a new log attribute called the STH Frequency Count, defined as the maximum number of STHs a log may produce in any period equal to the Maximum Merge Delay. The CT gossip draft defines a fresh STH to be one that was produced less than 14 days in the past. Clients could require that embedded inclusion proofs be based on a fresh STH. Then, with an STH Frequency Count that permits one STH an hour, there are only 336 fresh STHs at any given time for any given log - few enough that auditing them is practical and private. Auditing one of 336 STHs doesn't leak any information, and since the number of STHs is bounded, there is no risk of denial-of-service or flushing attacks.

It would also be possible for the client software supplier to operate a service that continuously fetches fresh STHs from logs, audits them for consistency, and distributes them to clients, saving clients the need to audit STHs themselves. (Just like Chrome currently does with the latest STHs.)

Unfortunately, this is not a perfect solution.

First, there would still be a delay before an inclusion proof can be obtained and a newly-issued certificate used. Many server operators and certificate authorities aren't going to like that.

Second, server operators would need to refresh the embedded inclusion proof every 14 days so it is always based on a fresh STH. That rules out embedding the inclusion proof in the certificate, unless the certificate is valid for less than 14 days. The server operator could use OCSP stapling, with the certificate authority responsible for embedding a new inclusion proof every 14 days in the OCSP response. Or the server operator's software could automatically obtain a new inclusion proof every 14 days and embed it in the TLS handshake. Unfortunately, there are no implementations of the latter, and there are very few robust implementations of OCSP stapling. Even if implementations existed, there would be a very long tail of old servers that were never upgraded.

One possibility is to make DNS proof fetching the default, but allow server operators to opt-in to embedded proofs, much like server operators can use HSTS to opt-in to enforced HTTPS. Server operators who run up-to-date software with reliable OCSP Stapling and who don't mind a delay in certificate issuance would be able to provide better security and privacy to their visitors. Maybe at some point in the distant future, embedded proofs could become required for everyone.

All of this is a ways off. CTv2 is still not standardized. Chrome still doesn't do any SCT auditing, and consequentially its CT policy requires at least one SCT to be from a Google-operated log, since Google obviously trusts its own logs not to break its promises. Fortunately, even without widespread log auditing, Certificate Transparency has been a huge success, cleaning up the certificate authority ecosystem and making everyone more secure. Nevertheless, I think it would be a shame if Certificate Transparency's auditability were never fully realized, and I hope we'll be able to find a way to make it work.

Why Man-in-the-Middle Detection is Overrated

2017-09-28T15:34:49Z

Last week, Nick Sullivan launched mitm.watch, a website that purports to tell you whether or not your HTTPS connection is being intercepted by a man-in-the-middle (MitM). mitm.watch uses Caddy's HTTPS MitM Detection Feature, which implements the techniques described in this paper. Basically, Caddy compares the browser name and version number advertised by the User-Agent header to the properties of the TLS handshake initiated by the client (e.g. ciphersuites). If the TLS handshake doesn't match the known properties of the purported browser, then the TLS handshake was probably not initiated by the browser, but by a man-in-the-middle. Caddy's documentation suggests that you could display an error message if a MitM is detected. mitm.watch displays either a green "No MITM!" page, or a red "Likely MITM!" page.

Unfortunately, there is a significant and intractable shortcoming to MitM detection: a MitM can defeat the detection by making its TLS implementation work exactly like that of the browser it's proxying, or at least similar enough that the differences are not observable by the server. You should assume a malicious MitM (one designed to steal data) will conceal itself this way. And if websites start displaying errors when a MitM is detected, you should expect the makers of commercial TLS interception devices (e.g. Bluecoat) to respond by making their interception devices indistinguishable from browsers.

We need to stop obsessing over MitM detection. In addition to server-side MitM detection, another recurring idea is to apply HTTP Public Key Pinning (HPKP) to certificates issued by private certificate authorities (e.g. those used by MitM devices), or to display a special icon in the browser when an HTTPS connection uses a private certificate authority. These proposals are barking up the wrong tree. Short of protocol or implementation vulnerabilities, there are only two ways a TLS connection can be intercepted without the consent of the server operator: one, an unauthorized certificate is issued by a publicly-trusted certificate authority, or two, a private certificate authority has been added to the client's trust store. (I assume the website is using HSTS, which prevents certificate errors from being bypassed.) For the first case, we have Certificate Transparency, which is better than even pie-in-the-sky MitM detection, since it detects rogue certificates even if they are never used. And the second case can only happen if the client's trust store is modified. At that point, the client's security should be considered compromised, as the ability to modify the trust store typically implies the ability to do much worse, such as install spyware that monitors and exfiltrates everything you do, without so much as touching a TLS connection. It's pointless to try to ensure end-to-end encryption when the security of an endpoint is in doubt.

That said, there is one potential benefit to MitM detection. Despite claiming to improve security, many commercial TLS interception devices actually harm security by using TLS client implementations that are vastly inferior to those of modern browsers. For instance, they use old, insecure ciphers, or even neglect to validate the certificate. If the makers of commercial TLS interception devices are forced to emulate the TLS implementations of browsers to avoid detection, they may end up improving their security in the process. However, if this is the goal, MitM detection is rather superfluous: servers might as well just check for insecure attributes of the connection and raise an error if found, MitM or not. After all, MitMs are not the only perpetrators of poor TLS security; there are plenty of old and insecure browsers out there as well. Jeff Hodges' How's My SSL, which recently launched a subscription service that lets you use it on your own site, is one example of this approach.

Thoughts on the Systemd Root Exploit

2017-01-24T17:54:16Z

Sebastian Krahmer of the SUSE Security Team has discovered a local root exploit in systemd v228. A local user on a system running systemd v228 can escalate to root privileges. That's bad.

At a high level, the exploit is trivial:

Systemd uses -1 to represent an invalid mode_t (filesystem permissions) value.
Systemd was accidentally passing this value to open when creating a new file, resulting in a file with all permission bits set: that is, world-writable, world-executable, and setuid-root.
The attacker writes an arbitrary program to this file, which succeeds because it's world-writable.
The attacker executes this file, which succeeds because it's world-executable.
The attacker-supplied program runs as root, because the file is setuid-root.

In mitigation: The vulnerability was fixed a year ago and less than three months after it was introduced. It is present only in v228.

In aggravation: The vulnerability was mislabeled at the time as a local denial-of-service and the systemd team did not request a CVE-ID for it. Had they requested a CVE-ID, someone may have noticed that this was more than a DoS. (Krahmer accurately points out that the systemd commit log is "really huge," which makes it hard to spot security-relevant commits.)

In mitigation: The vulnerability depends on a yet-unfixed hole in how Linux clears a file's setuid and setgid bits when writing to it. Systemd merely creates an empty setuid-root file. Gaining root requires writing to this file, and when a non-root user writes to a setuid-root file, the setuid bit is supposed to be cleared. halfdog found a clever way to circumvent this by tricking a root process into writing to the file instead. This is an extremely interesting vulnerability in itself and I can't wait to dive deeper into it.

In aggravation: The vulnerability would have been prevented if systemd used a fail-safe umask rather than setting it to 0, something I called out last September as evidence of systemd's poor security hygiene. A more sensible umask, such as 022, would have caused open to create the setuid-root file without world-writable permissions, preventing exploitation. However, systemd maintainer David Strauss rejected a safe umask with a completely illogical argument that shows his cluelessness over how systemd uses umask.

Lastly, this is yet another example of "The Billion Dollar Mistake": systemd was using a magic value (-1) to represent an invalid mode_t value, and C's type system did not prevent passing it to the mode argument of open. A language with a better type system, such as Rust or C++ (which has std::optional) can help prevent this kind of error.

That said, this is not about programming languages. Dovecot (among a handful of others) has demonstrated that adherence to good coding practices can produce secure software written in C. Rewriting systemd in a safer language would not transform it into quality software, although certain classes of bugs would likely be reduced or eliminated.

Rather, this is about lock-in. Systemd is introducing unprecedented lock-in to the Linux userspace. They are replacing previously-independent userspace services with ones whose development is controlled by the systemd project and which only work if systemd is PID 1. They are defining their own non-standard protocols and encouraging applications to use them. They have even replaced DNS with a dbus-based protocol, which they "strongly recommend" applications use instead of DNS. Sadly, the most recent version of Ubuntu ships with this travesty.

Systemd's developers have repeatedly demonstrated their poor judgment and unfitness to hold such responsibility. Unfortunately, the lock-in they're creating will deprive people of the ability to vote with their feet and switch to better alternatives.

Systemd is not Magic Security Dust

2016-10-02T18:19:33Z

Systemd maintainer David Strauss has published a response to my blog post about systemd. The first part of his post is replete with ad hominem fallacies, strawmen, and factual errors. Ironically, in the same breath that he attacks me for not understanding the issues around threads and umasks, he betrays an ignorance of how the very project which he works on uses threads and umasks. This doesn't deserve a response beyond what I've called out on Twitter.

In the second part of his blog post, Strauss argues that systemd improves security by making it easy to apply hardening techniques to the network services which he calls the "keepers of data attackers want." According to Strauss, I'm "fighting one of the most powerful tools we have to harden the front lines against the real attacks we see every day." Although systemd does make it easy to restrict the privileges of services, Strauss vastly overstates the value of these features.

The best systemd can offer is whole application sandboxing. You can start a daemon as a non-root user, in a restricted filesystem namespace, with mandatory access control. Sandboxing an entire application is an effective way to run potentially malicious code, since it protects other applications from the malicious one. This makes sandboxing useful on smartphones, which need to run many different untrustworthy, single-user applications. However, since sandboxing a whole application cannot protect one part of the application from a compromise of a different part, it is ineffective at securing benign-but-insecure software, which is the problem faced on servers. Server applications need to service requests from many different users. If one user is malicious and exploits a vulnerability in the application, whole application sandboxing doesn't protect the other users of the service.

For concrete examples, let's consider Apache and Samba, two daemons which Strauss says would benefit from systemd's features.

First Apache. You can start Apache as a non-root user provided someone else binds to ports 443 and 80. You can further sandbox it by preventing it from accessing parts of the filesystem it doesn't need to access. However, no matter how much you try to sandbox Apache, a typical setup is going to need a broad amount of access to do its job, including read permission to your entire website (including password-protected parts) and access to any credential (database password, API key, etc.) used by your CGI, PHP, or similar webapps.

Even under systemd's most restrictive sandboxing, an attacker who gains remote code execution in Apache would be able to read your entire website, alter responses to your visitors, steal your HTTPS private keys, and gain access to your database and any API consumed by your webapps. For most people, this would be the worst possible compromise, and systemd can do nothing to stop it. Systemd's sandboxing would prevent the attacker from gaining access to the rest of your system (absent a vulnerability in the kernel or systemd), but in today's world of single-purpose VMs and containers, that protection is increasingly irrelevant. The attacker probably only wants your database anyways.

To provide a meaningful improvement to security without rewriting in a memory-safe language, Apache would need to implement proper privilege separation. Privilege separation means using multiple processes internally, each running with different privileges and responsible for different tasks, so that a compromise while performing one task can't lead to the compromise of the rest of the application. For instance, the process that accepts HTTP connections could pass the request to a sandboxed process for parsing, and then pass the parsed request along to yet another process which is responsible for serving files and executing webapps. Privilege separation has been used effectively by OpenSSH, Postfix, qmail, Dovecot, and over a dozen daemons in OpenBSD. (Plus a couple of my own: titus and rdiscd.) However, privilege separation requires careful design to determine where to draw the privilege boundaries and how to interface between them. It's not something which an external tool such as systemd can provide. (Note: Apache already implements privilege separation that allows it to process requests as a non-root user, but it is too coarse-grained to stop the attacks described here.)

Next Samba, which is a curious choice of example by Strauss. Having configured Samba and professionally administered Windows networks, I know that Samba cannot run without full root privilege. The reason why Samba needs privilege is not because it binds to privileged ports, but because, as a file server, it needs the ability to assume the identity of any user so it can read and write that user's files. One could imagine a different design of Samba in which all files are owned by the same unprivileged user, and Samba maintains a database to track the real ownership of each file. This would allow Samba to run without privilege, but it wouldn't necessarily be more secure than the current design, since it would mean that a post-authentication vulnerability would yield access to everyone's files, not just those of the authenticated user. (Note: I'm not sure if Samba is able to contain a post-authentication vulnerability, but it theoretically could. It absolutely could not if it ran as a single user under systemd's sandboxing.)

Other daemons are similar. A mail server needs access to all users' mailboxes. If the mail server is written in C, and doesn't use privilege separation, sandboxing it with systemd won't stop an attacker with remote code execution from reading every user's mailbox. I could continue with other daemons, but I think I've made my point: systemd is not magic pixie dust that can be sprinkled on insecure server applications to make them secure. For protecting the "data attackers want," systemd is far from a "powerful" tool. I wouldn't be opposed to using a library or standalone tool to sandbox daemons as a last line of defense, but the amount of security it provides is not worth the baggage of running systemd as PID 1.

Achieving meaningful improvement in software security won't be as easy as adding a few lines to a systemd config file. It will require new approaches, new tools, new languages. Jon Evans sums it up eloquently:

... as an industry, let's at least set a trajectory. Let's move towards writing system code in better languages, first of all -- this should improve security and speed. Let's move towards formal specifications and verification of mission-critical code.

Systemd is not part of this trajectory. Systemd is more of the same old, same old, but with vastly more code and complexity, an illusion of security features, and, most troubling, lock-in. (Strauss dismisses my lock-in concerns by dishonestly claiming that applications aren't encouraged to use their non-standard DBUS API for DNS resolution. Systemd's own documentation says "Usage of this API is generally recommended to clients." And while systemd doesn't preclude alternative implementations, systemd's specifications are not developed through a vendor-neutral process like the IETF, so there is no guarantee that other implementers would have an equal seat at the table.) I have faith that the Linux ecosystem can correct its trajectory. Let's start now, and stop following systemd down the primrose path.

How to Crash Systemd in One Tweet

2016-09-28T19:14:27Z

The following command, when run as any user, will crash systemd:

NOTIFY_SOCKET=/run/systemd/notify systemd-notify ""

After running this command, PID 1 is hung in the pause system call. You can no longer start and stop daemons. inetd-style services no longer accept connections. You cannot cleanly reboot the system. The system feels generally unstable (e.g. ssh and su hang for 30 seconds since systemd is now integrated with the login system). All of this can be caused by a command that's short enough to fit in a Tweet.

Edit (2016-09-28 21:34): Some people can only reproduce if they wrap the command in a while true loop. Yay non-determinism!

The bug is remarkably banal. The above systemd-notify command sends a zero-length message to the world-accessible UNIX domain socket located at /run/systemd/notify. PID 1 receives the message and fails an assertion that the message length is greater than zero. Despite the banality, the bug is serious, as it allows any local user to trivially perform a denial-of-service attack against a critical system component.

The immediate question raised by this bug is what kind of quality assurance process would allow such a simple bug to exist for over two years (it was introduced in systemd 209). Isn't the empty string an obvious test case? One would hope that PID 1, the most important userspace process, would have better quality assurance than this. Unfortunately, it seems that crashes of PID 1 are not unusual, as a quick glance through the systemd commit log reveals commit messages such as:

Systemd's problems run far deeper than this one bug. Systemd is defective by design. Writing bug-free software is extremely difficult. Even good programmers would inevitably introduce bugs into a project of the scale and complexity of systemd. However, good programmers recognize the difficulty of writing bug-free software and understand the importance of designing software in a way that minimizes the likelihood of bugs or at least reduces their impact. The systemd developers understand none of this, opting to cram an enormous amount of unnecessary complexity into PID 1, which runs as root and is written in a memory-unsafe language.

Some degree of complexity is to be expected, as systemd provides a number of useful and compelling features (although they did not invent them; they were just the first to aggressively market them). Whether or not systemd has made the right trade-off between features and complexity is a matter of debate. What is not debatable is that systemd's complexity does not belong in PID 1. As Rich Felker explained, the only job of PID 1 is to execute the real init system and reap zombies. Furthermore, the real init system, even when running as a non-PID 1 process, should be structured in a modular way such that a failure in one of the riskier components does not bring down the more critical components. For instance, a failure in the daemon management code should not prevent the system from being cleanly rebooted.

In particular, any code that accepts messages from untrustworthy sources like systemd-notify should run in a dedicated process as a unprivileged user. The unprivileged process parses and validates messages before passing them along to the privileged process. This is called privilege separation and has been a best practice in security-aware software for over a decade. Systemd, by contrast, does text parsing on messages from untrusted sources, in C, running as root in PID 1. If you think systemd doesn't need privilege separation because it only parses messages from local users, keep in mind that in the Internet era, local attacks tend to acquire remote vectors. Consider Shellshock, or the presentation at this year's systemd conference which is titled "Talking to systemd from a Web Browser."

Systemd's "we don't make mistakes" attitude towards security can be seen in other places, such as this code from the main() function of PID 1:

/* Disable the umask logic */
if (getpid() == 1)
        umask(0);

Setting a umask of 0 means that, by default, any file created by systemd will be world-readable and -writable. Systemd defines a macro called RUN_WITH_UMASK which is used to temporarily set a more restrictive umask when systemd needs to create a file with different permissions. This is backwards. The default umask should be restrictive, so forgetting to change the umask when creating a file would result in a file that obviously doesn't work. This is called fail-safe design. Instead systemd is fail-open, so forgetting to change the umask (which has already happened twice) creates a file that works but is a potential security vulnerability.

The Linux ecosystem has fallen behind other operating systems in writing secure and robust software. While Microsoft was hardening Windows and Apple was developing iOS, open source software became complacent. However, I see improvement on the horizon. Heartbleed and Shellshock were wake-up calls that have led to increased scrutiny of open source software. Go and Rust are compelling, safe languages for writing the type of systems software that has traditionally been written in C. Systemd is dangerous not only because it is introducing hundreds of thousands of lines of complex C code without any regard to longstanding security practices like privilege separation or fail-safe design, but because it is setting itself up to be irreplaceable. Systemd is far more than an init system: it is becoming a secondary operating system kernel, providing a log server, a device manager, a container manager, a login manager, a DHCP client, a DNS resolver, and an NTP client. These services are largely interdependent and provide non-standard interfaces for other applications to use. This makes any one component of systemd hard to replace, which will prevent more secure alternatives from gaining adoption in the future.

Consider systemd's DNS resolver. DNS is a complicated, security-sensitive protocol. In August 2014, Lennart Poettering declared that "systemd-resolved is now a pretty complete caching DNS and LLMNR stub resolver." In reality, systemd-resolved failed to implement any of the documented best practices to protect against DNS cache poisoning. It was vulnerable to Dan Kaminsky's cache poisoning attack which was fixed in every other DNS server during a massive coordinated response in 2008 (and which had been fixed in djbdns in 1999). Although systemd doesn't force you to use systemd-resolved, it exposes a non-standard interface over DBUS which they encourage applications to use instead of the standard DNS protocol over port 53. If applications follow this recommendation, it will become impossible to replace systemd-resolved with a more secure DNS resolver, unless that DNS resolver opts to emulate systemd's non-standard DBUS API.

It is not too late to stop this. Although almost every Linux distribution now uses systemd for their init system, init was a soft target for systemd because the systems they replaced were so bad. That's not true for the other services which systemd is trying to replace such as network management, DNS, and NTP. Systemd offers very few compelling features over existing implementations, but does carry a large amount of risk. If you're a system administrator, resist the replacement of existing services and hold out for replacements that are more secure. If you're an application developer, do not use systemd's non-standard interfaces. There will be better alternatives in the future that are more secure than what we have now. But adopting them will only be possible if systemd has not destroyed the modularity and standards-compliance that make innovation possible.