The Story Behind Last Week's Let's Encrypt Downtime

Blog

June 22, 2023

The Story Behind Last Week's Let's Encrypt Downtime

Last Thursday (June 15th, 2023), Let's Encrypt went down for about an hour, during which time it was not possible to obtain certificates from Let's Encrypt. Immediately prior to the outage, Let's Encrypt issued 645 certificates which did not work in Chrome or Safari. In this post, I'm going to explain what went wrong and how I detected it.

The Law of Precertificates

Before I can explain the incident, we need to talk about Certificate Transparency. Certificate Transparency (CT) is a system for putting certificates issued by publicly-trusted CAs, such as Let's Encrypt, in public, append-only logs. Certificate authorities have a tremendous amount of power, and if they misuse their power by issuing certificates that they shouldn't, traffic to HTTPS websites could be intercepted by attackers. Historically, CAs have not used their power well, and Certificate Transparency is an effort to fix that by letting anyone examine the certificates that CAs issue.

A key concept in Certificate Transparency is the "precertificate". Before issuing a certificate, the certificate authority creates a precertificate, which contains all of the information that will be in the certificate, plus a "poison extension" that prevents the precertificate from being used like a real certificate. The CA submits the precertificate to multiple Certificate Transparency logs. Each log returns a Signed Certificate Timestamp (SCT), which is a signed statement acknowledging receipt of the precertificate and promising to publish the precertificate in the log for anyone to download. The CA takes all of the SCTs and embeds them in the certificate. When a CT-enforcing browser (like Chrome or Safari) validates the certificate, it makes sure that the certificate embeds a sufficient number of SCTs from trustworthy logs. This doesn't prevent the browser from accepting a malicious certificate, but it does ensure that the precertificate is in public logs, allowing the attack to be detected and action taken against the CA.

The certificate itself may or may not end up in CT logs. Some CAs, notably Let's Encrypt and Sectigo, automatically submit their certificates. Certificates from other CAs only end up in logs if someone else finds and submits them. Since only the precertificate is guaranteed to be logged, it is essential that a precertificate be treated as incontrovertible proof that a certificate containing the same data exists. When someone finds a precertificate for a malicious or non-compliant certificate, the CA can't be allowed to evade responsibility by saying "just kidding, we never actually issued the real certificate" (and boy, have they tried). Otherwise, CT would be useless.

There are two ways a CA could create a certificate. They could take the precertificate, remove the poison extension, add the SCTs, and re-sign it. Or, they could create the certificate from scratch, making sure to add the same data, in the same order, as used in the precertificate.

The first way is robust because it's guaranteed to produce a certificate which matches the precertificate. At least one CA, Sectigo, uses this approach. Let's Encrypt uses the second approach. You can probably see where this is going...

The Let's Encrypt incident

On June 15, 2023, Let's Encrypt deployed a planned change to their certificate configuration which altered the contents of the Certificate Policies extension from:

X509v3 Certificate Policies: 
    Policy: 2.23.140.1.2.1
    Policy: 1.3.6.1.4.1.44947.1.1.1
      CPS: http://cps.letsencrypt.org

to:

X509v3 Certificate Policies: 
    Policy: 2.23.140.1.2.1

Unfortunately, any certificate which was requested while the change was being rolled out could have its precertificate and certificate created with different configurations. For example, when Let's Encrypt issued the certificate with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71, the precertificate contained the new Certificate Policies extension, and the certificate contained the old Certificate Policies extension.

This had two consequences:

First, this certificate won't work in Chrome or Safari, because its SCTs are for a precertificate containing different data from the certificate. Specifically, the SCTs fail signature validation. When logs sign SCTs, they compute the signature over the data in the precertificate, and when browsers verify SCTs, they compute the signature over the data in the certificate. In this case, that data was not the same.

Second, remember how I said that precertificates are treated as incontrovertible proof that a certificate containing the same data exists? When Let's Encrypt issued a precertificate with the new Certificate Policies value, it implied that they also issued a certificate with the new Certificate Policies value. Thus, according to the Law of Precertificates, Let's Encrypt issued two certificates with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71:

A certificate containing the old Certificate Policies extension
A certificate containing the new Certificate Policies extension (implied by the existence of the precertificate with the new Certificate Policies extension)

Issuing two certificates with the same serial number is a violation of the Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates. Consequentially, Let's Encrypt must revoke the certificate and post a public incident report, which must be noted on their next audit statement.

You might think that it's harsh to treat this as a compliance incident if Let's Encrypt didn't really issue two certificates with the same serial number. Unfortunately, they have no way of proving this, and the whole reason for Certificate Transparency is so we don't have to take CAs at their word that they aren't issuing certificates that they shouldn't. Any exception to the Law of Precertificates creates an opening for a malicious CA to exploit.

How I found this

My company, SSLMate, operates a Certificate Transparency monitor called Cert Spotter, which continuously downloads and indexes the contents of every Certificate Transparency log. You can use Cert Spotter to get notifications when a certificate is issued for one of your domains, or search the database using a JSON API.

When Cert Spotter ingests a certificate containing embedded SCTs, it verifies each SCT's signature and audits that the log really published the precertificate. (If it detects that a log has broken its promise to publish a precertificate, I'll publicly disclose the SCT and the log will be distrusted. Happily, Cert Spotter has never found a bogus SCT, though it has detected logs violating other requirements.)

On Thursday, June 15, 2023 at 15:41 UTC, Cert Spotter began sending me alerts about certificates containing embedded SCTs with invalid signatures. Since I was getting hundreds of alerts, I decided to stop what I was doing and investigate.

I had received these alerts several times before, and have gotten pretty good at zeroing in on the problem. When only one SCT in a certificate has an invalid signature, it probably means that the CT log screwed up. When all of the embedded SCTs have an invalid signature, it probably means the CA screwed up. The most common reason is issuing certificates that don't match the precertificate. So I took one of the affected certificates and searched for precertificates containing the same serial number in Cert Spotter's database of every (pre)certificate ever logged to Certificate Transparency. Decoding the certificate and precertificate with the openssl command immediately revealed the different Certificate Policies extension.

Since I was continuing to get alerts from Cert Spotter about invalid SCT signatures, I quickly fired off an email to Let's Encrypt's problem reporting address alerting them to the problem.

I sent the email at 15:52 UTC. At 16:08, Let's Encrypt replied that they had paused issuance to investigate. Meanwhile, I filed a CA Certificate Compliance bug in Bugzilla, which is where Mozilla and Chrome track compliance incidents by publicly-trusted certificate authorities.

At 16:54, Let's Encrypt resumed issuance after confirming that they would not issue any more certificates with mismatched precertificates.

On Friday, June 16, 2023, Let's Encrypt emailed the subscribers of the affected certificates to inform them of the need to replace their certificates.

On Monday, June 19, 2023 at 18:00 UTC, Let's Encrypt revoked the 645 affected certificates, as required by the Baseline Requirements. This will cause the certificates to stop working in any client that checks revocation, but remember that these certificates were already being rejected by Chrome and Safari for having invalid SCTs.

On Tuesday, June 20, 2023, Let's Encrypt posted their public incident report, which explained the root cause of the incident and what they're doing to prevent it from happening again. Specifically, they plan to add a pre-issuance check that ensures certificates contain the same data as the precertificate.

Hundreds of websites are still serving broken certificates

I've been periodically checking port 443 of every DNS name in the affected certificates, and as of publication time, 261 certificates are still in use, despite not working in CT-enforcing or revocation-checking clients.

I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt. I thought that maybe these certificates were being used by API endpoints which are accessed by non-browser clients that don't enforce CT or check revocation, but this doesn't appear to be the case, as most of the DNS names are for bare domains or www subdomains. It's fortunate that Let's Encrypt issued only a small number of non-compliant certificates, because otherwise it would have broken a lot of websites.

There is a new standard under development called ACME Renewal Information which enables certificate authorities to inform ACME clients to renew certificates ahead of their normal expiration. Let's Encrypt supports ARI, and used it in this incident to trigger early renewal of the affected certificates. Clearly, more ACME clients need to add support for ARI.

This is my 50th CA compliance bug

It turns out this is the 50th CA compliance bug that I've filed in Bugzilla, and the 5th which was uncovered by Cert Spotter's SCT signature checks. Additionally, I reported a number of incidents before 2018 which didn't end up in Bugzilla.

Some of the problems I uncovered were quite serious (like issuing certificates without doing domain validation) and snowballed until the CA was ultimately distrusted. Most are minor in comparison, and ten years ago, no one would have cared about them: there was no Certificate Transparency to unearth non-compliant certificates, and even when someone did notice, the revocation requirement was not enforced, and CAs were not required to file incident reports or document the non-compliance on their next audit. Thankfully, that's no longer the case, and even compliance violations that seem minor are treated seriously, which has led to enormous improvements in the certificate ecosystem:

The improvements which certificate authorities make in response to seemingly-minor incidents also improve their compliance with the most security-critical rules.
TLS clients no longer need to work around non-standards-compliant certificates, which means they can be simpler. Simpler code is easier to make secure.
The way that CAs handle minor incidents can uncover much larger problems. Minor compliance problems are like "Brown M&M's".

Mozilla deserves enormous credit for being the first to require public incident reports from CAs, as does Google for creating and fostering Certificate Transparency.

You should monitor Certificate Transparency too

One limitation of my compliance monitoring is that I am generally only able to detect certificates that are intrinsically non-compliant, like those which violate encoding rules or are valid for too many days. While I do monitor certificates for domains that are likely to be abused, like example.com and test.com, I can't tell if a certificate issued for your domain is authorized or not. Only you know that.

Fortunately, it's pretty easy to monitor Certificate Transparency and get alerts when a certificate is issued for one of your domains. Cert Spotter has a standalone, open source version that's easy to set up. The paid version has additional features like expiration monitoring, Slack integration, and ways to filter alerts so you're not bothered about legitimate certificates. But most importantly, subscribing to the paid version helps me continue my compliance monitoring of the certificate authority ecosystem.

Older (View Archive)

The Difference Between Root Certificate Authorities, Intermediates, and Resellers

Comments

Reader mcint on 2023-06-22 at 21:06:

Excellent post. Thank you for acting as a verifier in the public interest. In

Mozilla deserves enormous credit for being the first to require public incident reports from CAs, as does Google for creating and fostering Certificate Transparency.

could you link to more information about this Mozilla and Google history, to learn about the evolved consensus you're helping to uphold?

Andrew Ayer on 2023-06-22 at 21:33:

Thanks for reading!

This has some information about Certificate Transparency's history: https://certificate.transparency.dev/community/

Here's Mozilla's guidelines for CA incident response: https://wiki.mozilla.org/CA/Responding_To_An_Incident

Andrew Ayer

Sections

Blog