
November 3, 2025

Google Just Suspended My Company's Google Cloud Account for the Third Time

On each of the last two Fridays, Google suspended SSLMate's Google Cloud access without notification, just as it did once before in 2024. But this isn't just another cautionary tale about using Google Cloud Platform; it's also a story about usable security, and how Google's capriciousness is forcing me to choose between weakening security and reducing usability.

Apart from testing and experimentation, the only reason SSLMate still has a Google Cloud presence is to enable integrations with our customers' Google Cloud accounts so that we can publish certificate validation DNS records and discover domain names to monitor on their behalf. We create a service account for each customer under our Google Cloud project, and ask the customer to authorize this service account to access Cloud DNS and Cloud Domains. When SSLMate needs to access a customer's Google Cloud account, it impersonates the corresponding service account. I developed this system based on a suggestion in Google's own documentation (under "How can I access data from my users' Google Cloud project using Cloud APIs?") and it works really well. It is both very easy for the customer to configure, and secure: there are no long-lived credentials or confused deputy vulnerabilities.
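For illustration, a provider-side impersonation call in Go might look roughly like the sketch below, using the google.golang.org/api/impersonate package. The service account name, project ID, and scope are placeholders, not SSLMate's actual configuration; the only real requirement is that the calling identity has the Service Account Token Creator role on the per-customer service account.

    package main

    import (
        "context"
        "fmt"
        "log"

        "google.golang.org/api/dns/v1"
        "google.golang.org/api/impersonate"
        "google.golang.org/api/option"
    )

    func main() {
        ctx := context.Background()

        // Mint short-lived credentials for the per-customer service account.
        // No long-lived key is involved anywhere in this flow.
        ts, err := impersonate.CredentialsTokenSource(ctx, impersonate.CredentialsConfig{
            TargetPrincipal: "customer-1234@example-project.iam.gserviceaccount.com", // placeholder
            Scopes:          []string{"https://www.googleapis.com/auth/ndev.clouddns.readwrite"},
        })
        if err != nil {
            log.Fatal(err)
        }

        // Use the short-lived token to call Cloud DNS in the customer's project.
        svc, err := dns.NewService(ctx, option.WithTokenSource(ts))
        if err != nil {
            log.Fatal(err)
        }
        zones, err := svc.ManagedZones.List("customer-project-id").Do() // placeholder project ID
        if err != nil {
            log.Fatal(err)
        }
        for _, z := range zones.ManagedZones {
            fmt.Println(z.Name)
        }
    }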

Easy and secure: I love it when that's possible!

The only problem is that Google keeps suspending our Google Cloud access.

The First Suspension

Google suspended us for the first time in 2024. Our customer integrations began failing, and logging into the Google Cloud console returned this error:

Screenshot of Google Cloud web page stating "Your account has been disabled"

Although Google's customer support people were surprisingly responsive considering Google's rock-bottom reputation in this area, the process to recover our account was super frustrating:

  1. Google required me to email them from the address associated with the account, but when I did so, the message was bounced with the error "The account [redacted] is disabled" (the redacted portion being the email address I sent from). When I emailed from a different address, the message went through, but the support people initially refused to communicate with it because it was the wrong address.

  2. At one point Google asked me to provide the IDs of our Google Cloud projects - information which I could not retrieve because I couldn't log in to the console. Have you saved your project IDs in a safe place in case your account gets suspended?

  3. After several emails back and forth with Google support, and verifying a phone number, I was able to log back into the Google Cloud console, but two of our projects were still suspended, including the one needed for the customer integrations. (At the time, we still had some domains registered through Google Cloud Domains, and thankfully the project for this was accessible, allowing me to begin transferring all of our domains out to a more dependable registrar.)

  4. The day after I regained access to the console, I received an automated email from no-reply@accounts.google.com stating that my access to Google Cloud Platform had been restricted. Once again, I could no longer access the console, but the error message was different this time:

    Screenshot of Google Cloud web page stating "Access to a service or feature has been restricted" and listing Google Cloud Platform as "Entire service unavailable"

  5. Twelve hours later, I received multiple automated emails from google-cloud-compliance@google.com stating that my Google Cloud projects had been "reinstated" but I still could not access the console.

  6. Seven hours after that, I got another automated email from no-reply@accounts.google.com stating that my access to Google Cloud Platform had been restored. Everything began working after this.

I was never told why our account was suspended or what could be done to prevent it from happening again. Although Google claims to send emails when an account or project is suspended, they never did so for the initial suspension. Since errors with customer integrations were only being displayed in our customers' SSLMate consoles (usually an error indicates the customer made a mistake), I didn't learn about the suspension right away. I fixed this by adding a health check that fails if a large percentage of Google Cloud integrations have errors.
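The check itself is nothing fancy. Here is a minimal sketch of the threshold logic (not SSLMate's actual code, and the 50% threshold is an arbitrary choice for illustration):

    package monitor

    // healthy reports whether the Google Cloud integrations look healthy overall.
    // One failing integration usually means a customer-side misconfiguration, but
    // a large fraction failing at once suggests our own access has been suspended.
    func healthy(errs []error) bool {
        if len(errs) == 0 {
            return true
        }
        failed := 0
        for _, err := range errs {
            if err != nil {
                failed++
            }
        }
        return float64(failed)/float64(len(errs)) < 0.5 // alert at 50% or more failures
    }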

The Second Suspension

Two Fridays ago, that health check failed. I immediately investigated and saw that all but one of our Google Cloud integrations were failing with the same error as during last year's suspension ("Invalid grant: account not found"). Groaning, I tried logging into the Google Cloud console, bracing myself for another Kafkaesque reinstatement process. At least I know the project IDs this time, I reassured myself. Surprisingly, I was able to log in successfully. Then I got emails, one per Google Cloud project, informing me that my projects had been reinstated "based on information that [I] have provided." Naturally, I had received no emails that they had been suspended in the first place. The integrations started working again.

The Third Suspension

Last Friday, the health check failed again. I logged in to the Google Cloud console, unsure what to expect. This time, I was presented with a third type of error message:

Screenshot of Google Cloud web page stating "We have detected a Terms of Service violation in SSLMate Integrations" and containing a form to submit an appeal

Most, but not all, of SSLMate's Google Cloud projects were suspended, including the one needed for customer integrations.

I submitted an appeal on Friday. On Sunday, I received an email from Google. Was it a response to the appeal? Nope! It was an automated email stating that SSLMate's access to Google Cloud was now completely suspended.

Edited to add: On Monday, shortly after this post hit the front page of Hacker News, most projects were reinstated, including the project for the integrations. A few hours later, access was fully restored. As before, there was no explanation why access was suspended or how to prevent it from happening again.

The Lucky Customer

Incredibly, we have one lucky customer whose integration has continued to work during every suspension, even though it uses a service account in the same suspended project as all the other customer integrations.

What Now?

Clearly, I cannot rely on having a Google account for production use cases. Google has built a complex, unreliable system in which some or all of the following can be suspended: an entire Google account, a Google Cloud Platform account, or individual Google Cloud projects.

Unfortunately, the alternatives for integrations are not great.

The first alternative is to ask customers to create a service account for SSLMate and have SSLMate authenticate to it using a long-lived key. This is pretty easy, but less secure since the long-lived key could leak and can never be rotated in practice.

The second alternative is to use OpenID Connect, aka OIDC. In recent years, OIDC has become the de facto standard for integrations between service providers. For example, you can use OIDC to let GitHub Actions access your Google Cloud account without the need for long-lived credentials. SSLMate's Azure integration uses OIDC and it works well.

Unfortunately, Google has made setting up OIDC unnecessarily difficult. What is currently a simple one-step process for our customers to add an integration (assign some roles to a service account) would become a complicated seven-step process:

  1. Enable the IAM Service Account Credentials API.
  2. Create a service account.
  3. Create a workload identity pool.
  4. Create a workload identity provider in the pool created in step 3.
  5. Allow SSLMate to impersonate the service account created in step 2 (this requires knowing the ID of the pool created in step 3).
  6. Assign roles to the service account created in step 2.
  7. Provide SSLMate with the ID of the service account created in step 2, and the ID of the workload identity provider created in step 4.

Since many of the steps require knowing the identifiers of resources created in previous steps, it's hard for SSLMate to provide easy-to-follow instructions.
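For what it's worth, once a customer completes those steps, the provider side ends up authenticating with a workload identity federation credential configuration along these lines. This is a sketch of the general mechanism, not SSLMate's implementation; every identifier is a placeholder, and the exact fields may differ from what Google currently generates.

    package main

    import (
        "context"
        "log"

        "golang.org/x/oauth2/google"
    )

    // A workload identity federation credential configuration. All project numbers,
    // pool, provider, and service account names are placeholders for illustration.
    const credConfig = `{
      "type": "external_account",
      "audience": "//iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/example-pool/providers/example-provider",
      "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
      "token_url": "https://sts.googleapis.com/v1/token",
      "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/example-sa@customer-project.iam.gserviceaccount.com:generateAccessToken",
      "credential_source": {
        "file": "/var/run/example/oidc-token"
      }
    }`

    func main() {
        ctx := context.Background()
        creds, err := google.CredentialsFromJSON(ctx, []byte(credConfig),
            "https://www.googleapis.com/auth/cloud-platform")
        if err != nil {
            log.Fatal(err)
        }
        _ = creds // pass creds.TokenSource to a Google Cloud client library
    }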

This is more complicated than it needs to be:

  • Creating a service account (steps 1, 2, and 5) should not be necessary. While it is possible to forgo a service account and assign roles directly to an identity from the pool, not all Google Cloud services support this. If you want your integration to work with all current and future services, you have to impersonate a service account. Google should stop treating OIDC like a second-class citizen and guarantee that all current and future services will directly support it.

  • Creating an identity pool shouldn't be necessary either. While I'm sure some use cases are nicely served by pools, it seems like most setups are going to have just one provider per pool, making the extra step of creating a pool nothing but unnecessary busy work.

  • Even creating a provider shouldn't be necessary; it should be possible to assign roles directly to an OIDC issuer URL and subject. You should only have to create a provider if you need to do more advanced configuration, such as mapping attributes.

I find this state of affairs unacceptable, because it's really, really important to move away from long-lived credentials and Google ought to be doing everything possible to encourage more secure alternatives. Sadly, SSLMate's current solution of provider-created service accounts is susceptible to arbitrary account suspensions, and OIDC is hampered by an unnecessarily complicated setup process.

In summary, when setting up cross-provider access with Google Cloud, you can have only two of the following:

  1. No dangerous long-lived credentials.
  2. Easy for the customer to set up.
  3. Safe from arbitrary account suspensions.
                          Provider-created service accounts   Service account + key   OpenID Connect
  No long-lived keys                      ✓                                                  ✓
  Easy setup                              ✓                              ✓
  Safe from suspension                                                   ✓                   ✓

Which two would you pick?

Comments

October 29, 2025

I'm Independently Verifying Go's Reproducible Builds

When you try to compile a Go module that requires a newer version of the Go toolchain than the one you have installed, the go command automatically downloads the newer toolchain and uses it for compiling the module. (And only that module; your system's go installation is not replaced.) This useful feature was introduced in Go 1.21 and has let me quickly adopt new Go features in my open source projects without inconveniencing people with older versions of Go.
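For example, a go.mod whose go directive asks for a newer release than you have installed (module path hypothetical) causes go build to fetch and run that toolchain automatically, assuming the default GOTOOLCHAIN=auto setting:

    // Hypothetical module: if the installed go command is older than 1.24.2
    // (but at least 1.21), `go build` downloads and runs the go1.24.2 toolchain.
    module example.com/demo

    go 1.24.2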

However, the idea of downloading a binary and executing it on demand makes a lot of people uncomfortable. It feels like such an easy vector for a supply chain attack, where Google, or an attacker who has compromised Google or gotten a misissued SSL certificate, could deliver a malicious binary. Many developers are more comfortable getting Go from their Linux distribution, or compiling it from source themselves.

To address these concerns, the Go project did two things:

  1. They made it so every version of Go starting with 1.21 could be easily reproduced from its source code. Every time you compile a Go toolchain, it produces the exact same Zip archive, byte-for-byte, regardless of the current time, your operating system, your architecture, or other aspects of your environment (such as the directory from which you run the build).

  2. They started publishing the checksum of every toolchain Zip archive in a public transparency log called the Go Checksum Database. The go command verifies that the checksum of a downloaded toolchain is published in the Checksum Database for anyone to see.

These measures mean that:

  1. You can be confident that the binaries downloaded and executed by the go command are the exact same binaries you would have gotten had you built the toolchain from source yourself. If there's a backdoor, the backdoor has to be in the source code.

  2. You can be confident that the binaries downloaded and executed by the go command are the same binaries that everyone else is downloading. If there's a backdoor, it has to be served to the whole world, making it easier to detect.

But these measures mean nothing if no one is checking that the binaries are reproducible, or that the Checksum Database isn't presenting inconsistent information to different clients. Although Google checks reproducibility and publishes a report, this doesn't help if you think Google might try to slip in a backdoor themselves. There needs to be an independent third party doing the checks.

Why not me? I was involved in Debian's Reproducible Builds project back in the day and developed some of the core tooling used to make Debian packages reproducible (strip-nondeterminism and disorderfs). I also have extensive experience monitoring Certificate Transparency logs and have detected misbehavior by numerous logs since 2017. And I do not work for Google (though I have eaten their food).

In fact, I've been quietly operating an auditor for the Go Checksum Database since 2020 called Source Spotter (à la Cert Spotter, my Certificate Transparency monitor). Source Spotter monitors the Checksum Database, making sure it doesn't present inconsistent information or publish more than one checksum for a given module and version. I decided to extend Source Spotter to also verify toolchain reproducibility.

The Checksum Database was originally intended for recording the checksums of Go modules. Essentially, it's a verifiable, append-only log of records which say that a particular version (e.g. v0.4.0) of a module (e.g. src.agwa.name/snid) has a particular SHA-256 hash. Go repurposed it for recording toolchain checksums. Toolchain records have the pseudo-module golang.org/toolchain and versions that look like v0.0.1-goVERSION.GOOS-GOARCH. For example, the Go 1.24.2 toolchain for linux/amd64 has the module version v0.0.1-go1.24.2.linux-amd64.

When Source Spotter sees a new version of the golang.org/toolchain pseudo-module, it downloads the corresponding source code, builds it in an AWS Lambda function by running make.bash -distpack, and compares the checksum of the resulting Zip file to the checksum published in the Checksum Database. Any mismatches are published on a webpage and in an Atom feed which I monitor.
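The comparison itself is the easy part. Here is a sketch of the idea (not Source Spotter's actual code) using golang.org/x/mod/sumdb/dirhash, which computes the same "h1:" dirhash format used in go.sum and the Checksum Database; the command-line arguments are a rebuilt Zip and the published hash.

    package main

    import (
        "fmt"
        "log"
        "os"

        "golang.org/x/mod/sumdb/dirhash"
    )

    func main() {
        // Usage: verify <rebuilt.zip> <published-h1-hash>
        rebuiltZip, published := os.Args[1], os.Args[2]

        // Hash the Zip we reproduced with make.bash -distpack using the same
        // "h1:" dirhash algorithm that module checksums use.
        got, err := dirhash.HashZip(rebuiltZip, dirhash.Hash1)
        if err != nil {
            log.Fatal(err)
        }
        if got != published {
            fmt.Printf("MISMATCH: built %s, checksum database says %s\n", got, published)
            os.Exit(1)
        }
        fmt.Println("reproduced: checksums match")
    }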

So far, Source Spotter has successfully reproduced every toolchain since Go 1.21.0, for every architecture and operating system. As of publication time, that's 2,672 toolchains!

Bootstrap Toolchains

Since the Go toolchain is written in Go, building it requires an earlier version of the Go toolchain to be installed already.

When reproducing Go 1.21, 1.22, and 1.23, Source Spotter uses a Go 1.20.14 toolchain that I built from source. I started by building Go 1.4.3 using a C compiler. I used Go 1.4.3 to build Go 1.17.13, which I used to build Go 1.20.14. To mitigate Trusting Trust attacks, I repeated this process on both Debian and Amazon Linux using both GCC and Clang for the Go 1.4 build. I got the exact same bytes every time, which I believe makes a compiler backdoor vanishingly unlikely. The scripts I used for this are open source.

When reproducing Go 1.24 or higher, Source Spotter uses a binary toolchain downloaded from the Go module proxy that it previously verified as being reproducible from source.

Problems Encountered

Compared to reproducing a typical Debian package, it was really easy to reproduce the same bytes when building the Go toolchains. Nevertheless, there were some bumps along the way:

First, the Darwin (macOS) toolchains published by Google contain signatures produced by Google's private key. Obviously, Source Spotter can't reproduce these. Instead, Source Spotter has to download the toolchain (making sure it matches the checksum published in the Checksum Database) and strip the signatures to produce a new checksum that is verified against the reproduced toolchain. I reused code written by Google to strip the signatures and I honestly have no clue what it's doing and whether it could potentially strip a backdoor. A review from someone versed in Darwin binaries would be very helpful! Edit: since publication, I've learned enough about Darwin binaries to be confident in this code.

Second, to reproduce the linux-arm toolchains, Source Spotter has to set GOARM=6 in the environment... except when reproducing Go 1.21.0, which Google accidentally built using GOARM=7. I find it unfortunate that cmd/dist (the tool used to build the toolchain) doesn't set this environment variable along with the many other environment variables it sets, but Russ Cox pointed me to some context on why this is the case.

Finally, the Checksum Database contains a toolchain for Go 1.9.2rc2, which is not a valid version number. It turns out this version was released by mistake. To avoid raising an error for an invalid version number, Source Spotter has to special case it. Not a huge deal, but I found it interesting because it demonstrates one of the downsides of transparency logs: you can't fix or remove entries that were added by mistake!

Source Code Transparency

The source tarballs built by Source Spotter are not published in the Checksum Database, meaning Google could serve Source Spotter, and only Source Spotter, source code which contains a backdoor. To mitigate this, Source Spotter publishes the checksums of every source tarball it builds. However, there are alternatives:

First, Russ Cox pointed out that while the source tarballs aren't in the Checksum Database, the toolchain Zip archives also contain the source code, so Source Spotter could build those instead of the source tarballs. (A previous version of this post incorrectly said that source code wasn't published in the Checksum Database at all.)

Second, Filippo Valsorda suggested that Source Spotter build from Go's Git repository and publish the Git commit IDs instead, since lots of Go developers have the Go Git repository checked out and it would be relatively easy for them to compare the state of their repos against what Source Spotter has seen. Regrettably, Git commit IDs are SHA-1, but this is mitigated by Git's use of Marc Stevens' collision detection, so the benefits may be worth the risk. I think building from Git is a good idea, and to bootstrap it, Filippo used Magic Wormhole to send me the output of git show-ref --tags from his repo while we were both at the Transparency.dev Summit last week.

Conclusion

Thanks to Go's Checksum Database and reproducible toolchains, Go developers get the usability benefits of a centralized package repository and binary toolchains without sacrificing the security benefits of decentralized packages and building from source. The Go team deserves enormous credit for making this a reality, particularly for building a system that is not too hard for a third party to verify. They've raised the bar, and I hope other language and package ecosystems can learn from what they've done.

Learn more by visiting the Source Spotter website or the GitHub repo.

Comments

August 29, 2025

SQLite's Durability Settings are a Mess

One of the most important properties of a database is durability. Durability means that after a transaction commits, you can be confident that, absent catastrophic hardware failure, the changes made by the commit won't be lost. This should remain true even if the operating system crashes or the system loses power soon after the commit. On Linux, and most other Unix operating systems, durability is ensured by calling the fsync system call at the right time.
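As a minimal illustration of what "at the right time" means, any program that wants a write to survive a crash has to sync the file before reporting success (databases also sync journal files and, in some cases, the containing directory). The file name here is arbitrary:

    package main

    import (
        "log"
        "os"
    )

    func main() {
        f, err := os.OpenFile("journal.dat", os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        if _, err := f.Write([]byte("commit record\n")); err != nil {
            log.Fatal(err)
        }
        // Without this fsync, the data may still be sitting in the OS page cache
        // when the power goes out, and the "committed" record is lost.
        if err := f.Sync(); err != nil {
            log.Fatal(err)
        }
        if err := f.Close(); err != nil {
            log.Fatal(err)
        }
    }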

Durability comes at a performance cost, and sometimes applications don't need durability. Some applications can tolerate losing the last several seconds of commits in the event of a power failure, as long as the database doesn't end up corrupted. Thus, databases typically provide knobs to configure if and when they call fsync. This is fine, but it's essential that the database clearly documents what its default durability properties are, and what each configuration setting guarantees.

Unfortunately, SQLite's documentation about its durability properties is far from clear. I cannot tell whether SQLite is durable by default, or, if it isn't, what minimal settings are needed to ensure durability.

The two relevant configuration options are journal_mode and synchronous. journal_mode has several possible values, but most people use either DELETE or WAL. synchronous has four possible values: EXTRA, FULL, NORMAL, and OFF.

This is how I interpret SQLite's documentation after a careful reading:

  • The default value of journal_mode is DELETE:

    The DELETE journaling mode is the normal behavior (source; archived)
  • The default value of synchronous is FULL:

    If not overridden at compile-time, the default setting is 2 (FULL) (source; archived)
  • The default value of synchronous is FULL even in WAL mode:

    If not overridden at compile-time, this value is the same as SQLITE_DEFAULT_SYNCHRONOUS. (source; archived)
  • When journal_mode is DELETE, you need to set synchronous to EXTRA to get durability:

    EXTRA synchronous is like FULL with the addition that the directory containing a rollback journal is synced after that journal is unlinked to commit a transaction in DELETE mode. EXTRA provides additional durability if the commit is followed closely by a power loss. (source; archived)

    Edited to add: I confirmed this to be true through testing - see my Hacker News comment for the methodology.

  • When journal_mode is WAL, FULL is sufficient for durability:

    With synchronous=FULL in WAL mode, an additional sync operation of the WAL file happens after each transaction commit. The extra WAL sync following each transaction helps ensure that transactions are durable across a power loss (source; archived)

    Note that this is not mentioned under the definition of FULL, but rather further down in the documentation for synchronous.

Based on the above, I conclude that:

  • By default, SQLite is not durable, because the default value of journal_mode is DELETE, and the default value of synchronous is FULL, which doesn't provide durability in DELETE mode.

  • If you change journal_mode to WAL, then SQLite is durable, because synchronous=FULL provides durability in WAL mode.

However, a recent Hacker News comment by a user who credibly claims to be Richard Hipp, the creator of SQLite, says:

  • "In its default configuration, SQLite is durable."

  • "If you switch to WAL mode, the default behavior is that transactions ... are not necessarily durable across OS crashes or power failures"

That's literally the opposite of what the documentation seems to say!

A Hacker News commenter who agrees with my reading of the documentation asked Hipp how his comment is consistent with the documentation, but received no reply.

Hipp also says that WAL mode used to be durable by default, but it was changed after people complained about poor performance. This surprised me, since I had the impression that SQLite cared deeply about backwards compatibility, and weakening the default durability setting is a nasty breaking change for any application which needs durability.

There are a couple other pitfalls around SQLite durability that you should be aware of, though I don't necessarily blame the SQLite project for these:

  • Libraries that wrap SQLite can override the default value of synchronous. For example, the most popular Go driver for SQLite sets it to NORMAL when in WAL mode, which does not provide durability.

  • On macOS, fsync is nerfed to make macOS appear faster. If you want a real fsync, you have to make a different, macOS-specific system call. SQLite can do this, but it's off by default.

My takeaway is that if you need durability, you'd better set the synchronous option explicitly because who knows what the default is, or what it will be in the future. With WAL mode, FULL seems to suffice. As for DELETE mode, who knows if FULL is enough, so you'd better go with EXTRA to be safe. And if your application might be used on macOS, enable fullfsync.
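Concretely, here is a hedged sketch of doing that from Go, assuming the mattn/go-sqlite3 driver; the pragmas are the point, the rest is boilerplate. Because synchronous and fullfsync are per-connection settings, the sketch limits database/sql to a single connection so the pragmas apply to every query.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "app.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Per-connection pragmas would otherwise only apply to one pooled connection.
        db.SetMaxOpenConns(1)

        // journal_mode returns the resulting mode, so read it back to be sure.
        var mode string
        if err := db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode); err != nil {
            log.Fatal(err)
        }
        log.Printf("journal_mode=%s", mode)

        // Don't rely on the default: ask for durability explicitly.
        if _, err := db.Exec("PRAGMA synchronous=FULL"); err != nil {
            log.Fatal(err)
        }

        // Request a real fsync (F_FULLFSYNC) on macOS; has no effect elsewhere.
        if _, err := db.Exec("PRAGMA fullfsync=ON"); err != nil {
            log.Fatal(err)
        }
    }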

The SQLite project ought to clarify their documentation. Since the meaning of synchronous depends on the value of journal_mode, I think it would be quite helpful to document the values of synchronous separately for each possible journal_mode, rather than mixing it all together. A table with synchronous values on one axis and journal_mode on the other which tells you if the combination provides durability would do wonders.

By the way, there are definitely many applications for which losing a few seconds of data in exchange for better performance is a great tradeoff, which is why SQLite and macOS have made the choices they have made. But programmers need to know what guarantees their tools provide, which is why unclear documentation and breaking previously-held assumptions is not cool.

Comments

June 22, 2023

The Story Behind Last Week's Let's Encrypt Downtime

Last Thursday (June 15th, 2023), Let's Encrypt went down for about an hour, during which time it was not possible to obtain certificates from Let's Encrypt. Immediately prior to the outage, Let's Encrypt issued 645 certificates which did not work in Chrome or Safari. In this post, I'm going to explain what went wrong and how I detected it.

The Law of Precertificates

Before I can explain the incident, we need to talk about Certificate Transparency. Certificate Transparency (CT) is a system for putting certificates issued by publicly-trusted CAs, such as Let's Encrypt, in public, append-only logs. Certificate authorities have a tremendous amount of power, and if they misuse their power by issuing certificates that they shouldn't, traffic to HTTPS websites could be intercepted by attackers. Historically, CAs have not used their power well, and Certificate Transparency is an effort to fix that by letting anyone examine the certificates that CAs issue.

A key concept in Certificate Transparency is the "precertificate". Before issuing a certificate, the certificate authority creates a precertificate, which contains all of the information that will be in the certificate, plus a "poison extension" that prevents the precertificate from being used like a real certificate. The CA submits the precertificate to multiple Certificate Transparency logs. Each log returns a Signed Certificate Timestamp (SCT), which is a signed statement acknowledging receipt of the precertificate and promising to publish the precertificate in the log for anyone to download. The CA takes all of the SCTs and embeds them in the certificate. When a CT-enforcing browser (like Chrome or Safari) validates the certificate, it makes sure that the certificate embeds a sufficient number of SCTs from trustworthy logs. This doesn't prevent the browser from accepting a malicious certificate, but it does ensure that the precertificate is in public logs, allowing the attack to be detected and action taken against the CA.
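Mechanically, a precertificate is just a certificate carrying the CT poison extension (OID 1.3.6.1.4.1.11129.2.4.3), while the embedded SCTs end up in a sibling extension (1.3.6.1.4.1.11129.2.4.2). A small Go sketch of telling the two apart, given a PEM file on the command line:

    package main

    import (
        "crypto/x509"
        "encoding/asn1"
        "encoding/pem"
        "fmt"
        "log"
        "os"
    )

    var (
        oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3}
        oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}
    )

    func main() {
        pemBytes, err := os.ReadFile(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(pemBytes)
        if block == nil {
            log.Fatal("no PEM data found")
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }
        for _, ext := range cert.Extensions {
            switch {
            case ext.Id.Equal(oidCTPoison):
                fmt.Println("precertificate (poison extension present)")
            case ext.Id.Equal(oidSCTList):
                fmt.Println("certificate with embedded SCTs")
            }
        }
    }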

The certificate itself may or may not end up in CT logs. Some CAs, notably Let's Encrypt and Sectigo, automatically submit their certificates. Certificates from other CAs only end up in logs if someone else finds and submits them. Since only the precertificate is guaranteed to be logged, it is essential that a precertificate be treated as incontrovertible proof that a certificate containing the same data exists. When someone finds a precertificate for a malicious or non-compliant certificate, the CA can't be allowed to evade responsibility by saying "just kidding, we never actually issued the real certificate" (and boy, have they tried). Otherwise, CT would be useless.

There are two ways a CA could create a certificate. They could take the precertificate, remove the poison extension, add the SCTs, and re-sign it. Or, they could create the certificate from scratch, making sure to add the same data, in the same order, as used in the precertificate.

The first way is robust because it's guaranteed to produce a certificate which matches the precertificate. At least one CA, Sectigo, uses this approach. Let's Encrypt uses the second approach. You can probably see where this is going...

The Let's Encrypt incident

On June 15, 2023, Let's Encrypt deployed a planned change to their certificate configuration which altered the contents of the Certificate Policies extension from:

    X509v3 Certificate Policies:
        Policy: 2.23.140.1.2.1
        Policy: 1.3.6.1.4.1.44947.1.1.1
          CPS: http://cps.letsencrypt.org

to:

    X509v3 Certificate Policies:
        Policy: 2.23.140.1.2.1

Unfortunately, any certificate which was requested while the change was being rolled out could have its precertificate and certificate created with different configurations. For example, when Let's Encrypt issued the certificate with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71, the precertificate contained the new Certificate Policies extension, and the certificate contained the old Certificate Policies extension.

This had two consequences:

First, this certificate won't work in Chrome or Safari, because its SCTs are for a precertificate containing different data from the certificate. Specifically, the SCTs fail signature validation. When logs sign SCTs, they compute the signature over the data in the precertificate, and when browsers verify SCTs, they compute the signature over the data in the certificate. In this case, that data was not the same.

Second, remember how I said that precertificates are treated as incontrovertible proof that a certificate containing the same data exists? When Let's Encrypt issued a precertificate with the new Certificate Policies value, it implied that they also issued a certificate with the new Certificate Policies value. Thus, according to the Law of Precertificates, Let's Encrypt issued two certificates with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71:

  1. A certificate containing the old Certificate Policies extension
  2. A certificate containing the new Certificate Policies extension (implied by the existence of the precertificate with the new Certificate Policies extension)

Issuing two certificates with the same serial number is a violation of the Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates. Consequently, Let's Encrypt must revoke the certificate and post a public incident report, which must be noted on their next audit statement.

You might think that it's harsh to treat this as a compliance incident if Let's Encrypt didn't really issue two certificates with the same serial number. Unfortunately, they have no way of proving this, and the whole reason for Certificate Transparency is so we don't have to take CAs at their word that they aren't issuing certificates that they shouldn't. Any exception to the Law of Precertificates creates an opening for a malicious CA to exploit.

How I found this

My company, SSLMate, operates a Certificate Transparency monitor called Cert Spotter, which continuously downloads and indexes the contents of every Certificate Transparency log. You can use Cert Spotter to get notifications when a certificate is issued for one of your domains, or search the database using a JSON API.

When Cert Spotter ingests a certificate containing embedded SCTs, it verifies each SCT's signature and audits that the log really published the precertificate. (If it detects that a log has broken its promise to publish a precertificate, I'll publicly disclose the SCT and the log will be distrusted. Happily, Cert Spotter has never found a bogus SCT, though it has detected logs violating other requirements.)

On Thursday, June 15, 2023 at 15:41 UTC, Cert Spotter began sending me alerts about certificates containing embedded SCTs with invalid signatures. Since I was getting hundreds of alerts, I decided to stop what I was doing and investigate.

I had received these alerts several times before, and have gotten pretty good at zeroing in on the problem. When only one SCT in a certificate has an invalid signature, it probably means that the CT log screwed up. When all of the embedded SCTs have an invalid signature, it probably means the CA screwed up. The most common reason is issuing certificates that don't match the precertificate. So I took one of the affected certificates and searched for precertificates containing the same serial number in Cert Spotter's database of every (pre)certificate ever logged to Certificate Transparency. Decoding the certificate and precertificate with the openssl command immediately revealed the different Certificate Policies extension.
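This comparison is also straightforward to automate. Here is a rough Go sketch (not Cert Spotter's actual code) that diffs the extensions of a certificate and its precertificate, ignoring the SCT-list and poison extensions that are expected to differ:

    package main

    import (
        "bytes"
        "crypto/x509"
        "encoding/asn1"
        "encoding/pem"
        "fmt"
        "log"
        "os"
    )

    var (
        oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2} // embedded SCTs (certificate only)
        oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3} // poison (precertificate only)
    )

    func load(path string) *x509.Certificate {
        pemBytes, err := os.ReadFile(path)
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(pemBytes)
        if block == nil {
            log.Fatal("no PEM data found in ", path)
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }
        return cert
    }

    // exts collects extensions by OID, skipping the two that legitimately differ.
    func exts(c *x509.Certificate) map[string][]byte {
        m := make(map[string][]byte)
        for _, ext := range c.Extensions {
            if ext.Id.Equal(oidSCTList) || ext.Id.Equal(oidCTPoison) {
                continue
            }
            m[ext.Id.String()] = ext.Value
        }
        return m
    }

    func main() {
        // Usage: diffcert <certificate.pem> <precertificate.pem>
        certExts := exts(load(os.Args[1]))
        precertExts := exts(load(os.Args[2]))
        for oid, value := range certExts {
            if !bytes.Equal(value, precertExts[oid]) {
                fmt.Printf("extension %s differs between certificate and precertificate\n", oid)
            }
        }
        for oid := range precertExts {
            if _, ok := certExts[oid]; !ok {
                fmt.Printf("extension %s present only in precertificate\n", oid)
            }
        }
    }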

Since I was continuing to get alerts from Cert Spotter about invalid SCT signatures, I quickly fired off an email to Let's Encrypt's problem reporting address alerting them to the problem.

I sent the email at 15:52 UTC. At 16:08, Let's Encrypt replied that they had paused issuance to investigate. Meanwhile, I filed a CA Certificate Compliance bug in Bugzilla, which is where Mozilla and Chrome track compliance incidents by publicly-trusted certificate authorities.

At 16:54, Let's Encrypt resumed issuance after confirming that they would not issue any more certificates with mismatched precertificates.

On Friday, June 16, 2023, Let's Encrypt emailed the subscribers of the affected certificates to inform them of the need to replace their certificates.

On Monday, June 19, 2023 at 18:00 UTC, Let's Encrypt revoked the 645 affected certificates, as required by the Baseline Requirements. This will cause the certificates to stop working in any client that checks revocation, but remember that these certificates were already being rejected by Chrome and Safari for having invalid SCTs.

On Tuesday, June 20, 2023, Let's Encrypt posted their public incident report, which explained the root cause of the incident and what they're doing to prevent it from happening again. Specifically, they plan to add a pre-issuance check that ensures certificates contain the same data as the precertificate.

Hundreds of websites are still serving broken certificates

I've been periodically checking port 443 of every DNS name in the affected certificates, and as of publication time, 261 certificates are still in use, despite not working in CT-enforcing or revocation-checking clients.

I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt. I thought that maybe these certificates were being used by API endpoints which are accessed by non-browser clients that don't enforce CT or check revocation, but this doesn't appear to be the case, as most of the DNS names are for bare domains or www subdomains. It's fortunate that Let's Encrypt issued only a small number of non-compliant certificates, because otherwise it would have broken a lot of websites.

There is a new standard under development called ACME Renewal Information which enables certificate authorities to inform ACME clients to renew certificates ahead of their normal expiration. Let's Encrypt supports ARI, and used it in this incident to trigger early renewal of the affected certificates. Clearly, more ACME clients need to add support for ARI.

This is my 50th CA compliance bug

It turns out this is the 50th CA compliance bug that I've filed in Bugzilla, and the 5th which was uncovered by Cert Spotter's SCT signature checks. Additionally, I reported a number of incidents before 2018 which didn't end up in Bugzilla.

Some of the problems I uncovered were quite serious (like issuing certificates without doing domain validation) and snowballed until the CA was ultimately distrusted. Most are minor in comparison, and ten years ago, no one would have cared about them: there was no Certificate Transparency to unearth non-compliant certificates, and even when someone did notice, the revocation requirement was not enforced, and CAs were not required to file incident reports or document the non-compliance on their next audit. Thankfully, that's no longer the case, and even compliance violations that seem minor are treated seriously, which has led to enormous improvements in the certificate ecosystem:

  1. The improvements which certificate authorities make in response to seemingly-minor incidents also improve their compliance with the most security-critical rules.
  2. TLS clients no longer need to work around non-standards-compliant certificates, which means they can be simpler. Simpler code is easier to make secure.
  3. The way that CAs handle minor incidents can uncover much larger problems. Minor compliance problems are like "Brown M&M's".

Mozilla deserves enormous credit for being the first to require public incident reports from CAs, as does Google for creating and fostering Certificate Transparency.

You should monitor Certificate Transparency too

One limitation of my compliance monitoring is that I am generally only able to detect certificates that are intrinsically non-compliant, like those which violate encoding rules or are valid for too many days. While I do monitor certificates for domains that are likely to be abused, like example.com and test.com, I can't tell if a certificate issued for your domain is authorized or not. Only you know that.

Fortunately, it's pretty easy to monitor Certificate Transparency and get alerts when a certificate is issued for one of your domains. Cert Spotter has a standalone, open source version that's easy to set up. The paid version has additional features like expiration monitoring, Slack integration, and ways to filter alerts so you're not bothered about legitimate certificates. But most importantly, subscribing to the paid version helps me continue my compliance monitoring of the certificate authority ecosystem.

Comments

June 18, 2023

The Difference Between Root Certificate Authorities, Intermediates, and Resellers

It happens every so often: some organization that sells publicly-trusted SSL certificates does something monumentally stupid, like generating, storing, and then intentionally disclosing all of their customers' private keys (Trustico), letting private keys be stolen via XSS (ZeroSSL), or most recently, literally exploiting remote code execution in ACME clients as part of their issuance process (HiCA aka QuantumCA).

When this happens, people inevitably refer to the certificate provider as a certificate authority (CA), which is an organization with the power to issue trusted certificates for any domain name. They fear that the integrity of the Internet's certificate authority system has been compromised by the "CA"'s incompetence. Something must be done!

But none of the organizations listed above are CAs - they just take certificate requests and forward them to real CAs, who validate the request and issue the certificate. The Internet is safe - from these organizations, at least.

In this post, I'm going to define terms like certificate authority, root CA, intermediate CA, and reseller, and explain whom you do and do not need to worry about.

Note that I'm going to talk only about publicly-trusted SSL certificate authorities - i.e. those which are trusted by mainstream web browsers.

Certificate Authority Certificates

"Certificate authority" is a label that can apply both to SSL certificates and to organizations. A certificate is a CA if it can be used to issue certificates which will be accepted by browsers. There are two types of CA certificates: root and intermediate.

Root CA certificates, also known as "trust anchors" or just "roots", are shipped with your browser. If you poke around your browser's settings, you can find a list of root CA certificates.

Intermediate CA certificates, also known as "subordinate CAs" or just "intermediates", are certificates with the "CA" boolean set to true in the Basic Constraints extension, and which were issued by a root or another intermediate. If you decode an intermediate CA certificate with openssl, you'll see this in the extensions section:

    X509v3 Basic Constraints: critical
        CA:TRUE

When you connect to a website, your browser has to verify that the website's certificate was issued by a CA certificate. If the website's certificate was issued by a root, it's easy because the browser already knows the public key needed to verify the certificate's signature. If the website's certificate was issued by an intermediate, the browser has to retrieve the intermediate from elsewhere, and then verify that the intermediate was issued by a CA certificate, recursing as necessary until it reaches a root. The web server can help the browser out by sending intermediate certificates along with the website's certificate, and some browsers are able to retrieve intermediates from the URL included in the certificate's AIA (Authority Information Access) field.
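In Go's crypto/x509, for instance, these pieces show up directly on the parsed certificate: the Basic Constraints CA flag and the AIA "CA Issuers" URLs. A small sketch that walks a PEM file of certificates (the file name is arbitrary):

    package main

    import (
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "log"
        "os"
    )

    func main() {
        pemBytes, err := os.ReadFile("chain.pem") // arbitrary file of PEM certificates
        if err != nil {
            log.Fatal(err)
        }
        for block, rest := pem.Decode(pemBytes); block != nil; block, rest = pem.Decode(rest) {
            cert, err := x509.ParseCertificate(block.Bytes)
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("subject: %s\n", cert.Subject)
            fmt.Printf("  CA: %v (Basic Constraints present: %v)\n", cert.IsCA, cert.BasicConstraintsValid)
            // AIA "CA Issuers" URLs, which some browsers use to fetch missing intermediates.
            fmt.Printf("  issuer URLs: %v\n", cert.IssuingCertificateURL)
        }
    }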

The purpose of this post is to discuss organizations, so for the rest of this post, when I say "certificate authority" or "CA" I am referring to an organization, not a certificate, unless otherwise specified.

Certificate Authority Organizations

An organization is a certificate authority if and only if they hold the private key for one or more certificate authority certificates. Holding the private key for a CA certificate is a big deal because it gives the organization the power to create certificates that are valid for any domain name on the Internet. These are essentially the Keys to the Internet.

Unfortunately, figuring out who holds a CA certificate's private key is not straightforward. CA certificates contain an organization name in their subject field, and many people reasonably assume that this must be the organization which holds the private key. But as I explained earlier this year, this is often not the case. I spend a lot of time researching the certificate ecosystem, and I am not exaggerating when I say that I completely ignore the organization name in CA certificates. It is useless. You have to look at other sources, like audit statements and the CCADB, to figure out who really holds a CA certificate's key. Consequently, just because you see a company's name in a CA certificate, that does not mean the company holds a key which can be used to create certificates.

Internal vs External Intermediate Certificates

CAs aren't allowed to issue website certificates directly from root certificates, so any CA which operates a root certificate is going to have to issue themselves at least one intermediate certificate. Since the CA retains control of the private key for the intermediate, these are called internally-operated intermediate certificates. Most intermediate certificates are internally-operated.

CAs are also able to issue intermediate certificates whose private key is held by another organization, making the other organization a certificate authority too. These are called externally-operated intermediate certificates, and there are two reasons they exist.

The more legitimate reason is that the other organization operates, or intends to operate, root certificates, and would like to issue certificates which work in older browsers whose trust stores don't include their roots. By getting an intermediate from a more-established CA, they can issue certificates which chain back to a root that is in more trust stores. This is called cross-signing, and a well-known example is Identrust cross-signing Let's Encrypt.

The less-savory reason is that the other organization would like to become a certificate authority without having to go through the onerous process of asking each browser to include them. Historically, this was a huge loophole which allowed organizations to become CAs with less oversight and thus less investment into security and compliance. Thankfully, this loophole is closing, and nowadays Mozilla and Chrome require CAs to obtain approval before issuing externally-operated intermediates to organizations which aren't already trusted CAs. I would not be surprised if browsers eventually banned the practice outright, leaving cross-signing as the only acceptable use case for externally-operated intermediates.

Certificate Resellers

There's no standard definition of certificate reseller, but I define it as any organization which provides certificates which they do not issue themselves. When someone requests a certificate from a reseller, the reseller forwards the request to a certificate authority, which validates the request and issues the certificate. The reseller has no access to the CA certificate's private key and no ability to cause issuance themselves. Typically, the reseller will use an API (ACME or proprietary) to request certificates from the CA. My company, SSLMate, is a reseller (though these days we mostly focus on Certificate Transparency monitoring).

The relationship between the reseller and the CA can take many forms. In some cases, the reseller may enter into an explicit reseller agreement with the CA and get access to special pricing or reseller-only APIs. In other cases, the reseller might obtain certificates from the CA using the same API and pricing available to the general public. The reseller might not even pay the CA - there's nothing stopping someone from acting like a reseller for free CAs like Let's Encrypt. For example, DNSimple provides paid certificates issued by Sectigo right alongside free certificates issued by Let's Encrypt. The CA might not know that their certificates are being resold - Let's Encrypt, for instance, allows anyone to create an anonymous account.

An organization may be a reseller of multiple CAs, or even be both a reseller and a CA. A reseller might not get certificates directly from a CA, but via a different reseller (SSLMate did this in the early days because it was the only way to get good pricing without making large purchase commitments).

A reseller might just provide certificates to customers, or they might use the certificates to provide a larger service, such as web hosting or a CDN (for example, Cloudflare).

White-Label / Branded Intermediate Certificates

Ideally, it would be easy to distinguish a reseller from a CA. However, most resellers are middlemen who provide no value over getting a certificate directly from a CA, and they don't want people to know this. So, for the right price, a CA will issue themselves an internally-operated intermediate CA certificate with the reseller's name in the organization field, and when the reseller requests a certificate, the CA will issue the certificate from the "branded" intermediate certificate instead of from an intermediate certificate with the CA's name in it.

The reseller does not have access to the private key of the branded intermediate certificate. Except for the name, everything about the branded intermediate certificate - like the security controls and the validation process - is exactly the same as the CA's non-branded intermediates. Thus, the mere existence of a branded intermediate certificate does not in any way affect the integrity of the certificate authority ecosystem, regardless of how untrustworthy, incompetent, or malicious the organization named in the certificate is.

What Should Worry You?

I spend a lot of time worrying about bad certificate authorities, but bad resellers don't concern me. A bad reseller can harm only their own customers, but a bad certificate authority can harm the entire Internet. Yeah, it sucks if you choose a reseller who screws you, but it's no different from choosing a bad hosting provider, domain registrar, or any of the myriad other vendors needed to host a website. As always, caveat emptor. In contrast, you can pick the best, most trustworthy certificate authority around, and some garbage CA you've never heard of can still issue an attacker a certificate for your domain. This is why web browsers tightly regulate CAs, but not resellers.

Unfortunately, this doesn't stop people from freaking out when a two-bit reseller with a branded intermediate (such as Trustico, ZeroSSL, or HiCA/QuantumCA) does something awful. People flood the mozilla-dev-security-policy mailing list to voice their concerns, including those who have little knowledge of certificates and have never posted there before. These discussions are a distraction from far more important issues. While the HiCA discussion was devolving into off-topic blather, the certificate authority Buypass disclosed in an incident report that they have a domain validation process which involves their employees manually looking up and interpreting CAA records. They are far from the only CA which does domain validation manually despite it being easy to automate, and it's troubling because having a human involved gives attackers an opening to exploit mistakes, bribery, or coercion to get unauthorized certificates for any domain. Buypass' incident response, which blames "human error" instead of addressing the root cause, won't make headlines on Hacker News or be shared on Twitter with the popcorn emoji, but it's actually far more concerning than the worst comments a reseller has ever made.

About the Author

I've reported over 50 certificate authority compliance issues, and uncovered evidence that led to the distrust of multiple certificate authorities and Certificate Transparency logs. My research has prompted changes to ACME and Mozilla's Root Store Policy. My company, SSLMate, offers a JSON API for searching Certificate Transparency logs and Cert Spotter, a service to monitor your domains for unauthorized, expiring, or incorrectly-installed certificates.

If you liked this post, you might like:

Stay Tuned: Let's Encrypt Incident

By popular demand, I will be blogging about how I found the compliance bug which prompted last week's Let's Encrypt downtime. Be sure to subscribe by email or RSS, or follow me on Mastodon or Twitter.

Comments
