December 10, 2025
Certificate Authorities Are Once Again Issuing Certificates That Don't Work
Twice a year, the Certificate Transparency ecosystem undergoes a transition as certificate authorities start to submit certificates to new semiannual log partitions. And recently, the ecosystem has started transitioning to the new static-ct-api specification. Unfortunately, despite efforts to make these transitions extremely easy for certificate authorities, in the past week I have detected 16 certificate authorities who have bungled these transitions, issuing certificates that are rejected by some or all mainstream web browsers with an error message like "This Connection Is Not Private" or ERR_CERTIFICATE_TRANSPARENCY_REQUIRED.
If you're not familiar, Certificate Transparency (CT) is a system for publishing SSL certificates in public logs. Certificate Transparency monitors like Cert Spotter download the logs to help you track certificate expiration and detect unauthorized certificates for your domains.
At a high level, Certificate Transparency works like this:
- Before issuing a certificate, the certificate authority (CA) creates a "precertificate" containing the details of the certificate it intends to issue.
- The CA submits the precertificate to multiple Certificate Transparency logs.
- Each log returns a receipt, called a Signed Certificate Timestamp (SCT), which confirms submission of the precertificate.
- The CA embeds the SCTs in the certificate which it gives to the site operator.
- When a browser loads a website, it makes sure the website's certificate has SCTs from a sufficient number of recognized logs. If it doesn't, the browser throws up an error page and refuses to load the website.
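For a concrete sense of what that browser check looks at, here is a minimal Go sketch - not how any browser actually implements its CT policy - that parses a certificate and counts its embedded SCTs, using the extension OID and TLS framing from RFC 6962. The certificate path comes from the command line; a real policy check would also verify each SCT's signature and the diversity of the logs involved.
package main

import (
	"crypto/x509"
	"encoding/asn1"
	"encoding/binary"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

// OID of the embedded SCT list extension (RFC 6962, section 3.3).
var oidEmbeddedSCTs = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}

// countEmbeddedSCTs returns how many SCTs are embedded in cert.
func countEmbeddedSCTs(cert *x509.Certificate) (int, error) {
	for _, ext := range cert.Extensions {
		if !ext.Id.Equal(oidEmbeddedSCTs) {
			continue
		}
		// The extension value is an OCTET STRING wrapping the TLS-encoded
		// SignedCertificateTimestampList: a 2-byte total length, then a
		// sequence of 2-byte-length-prefixed serialized SCTs.
		var list []byte
		if _, err := asn1.Unmarshal(ext.Value, &list); err != nil {
			return 0, err
		}
		if len(list) < 2 || int(binary.BigEndian.Uint16(list)) != len(list)-2 {
			return 0, fmt.Errorf("malformed SCT list")
		}
		list = list[2:]
		count := 0
		for len(list) >= 2 {
			n := int(binary.BigEndian.Uint16(list))
			if len(list) < 2+n {
				return 0, fmt.Errorf("truncated SCT")
			}
			list = list[2+n:]
			count++
		}
		return count, nil
	}
	return 0, nil // no embedded SCTs at all
}

func main() {
	pemBytes, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	n, err := countEmbeddedSCTs(cert)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %d embedded SCTs\n", cert.Subject.CommonName, n)
}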
Billions of SSL certificates are issued and logged to CT every year. To prevent logs from growing indefinitely, logs only accept (pre)certificates which expire within a certain range, typically six months long. Every log will eventually contain only expired certificates, allowing it to be shut down. Meanwhile, new logs are created to contain certificates expiring further in the future.
How do CAs know what logs to submit precertificates to? It's easy: Apple and Chrome each publish a JSON file containing a list of logs. (Firefox and Edge use Chrome's list.) Apple's is at https://valid.apple.com/ct/log_list/current_log_list.json and Chrome's is at https://www.gstatic.com/ct/log_list/v3/log_list.json. Each log object contains the log's name, URL, public key, range of expiration dates accepted by the log, and crucially, the log's state.
{
    "description": "Sectigo 'Elephant2027h1'",
    "log_id": "YEyar3p/d18B1Ab8kg3ImesLHH34yVIb+voXdzuXi8k=",
    "key": "MFkwEwYHKoZIzj0CAQYIKoZI...AScw2woA==",
    "url": "https://elephant2027h1.ct.sectigo.com/",
    "mmd": 86400,
    "state": {
        "usable": {
            "timestamp": "2025-07-22T01:33:20Z"
        }
    },
    "temporal_interval": {
        "start_inclusive": "2027-01-01T00:00:00Z",
        "end_exclusive": "2027-07-01T00:00:00Z"
    }
}
The state is very simple: if it's "usable", then CAs should use it. If it's something else, CAs should not use it.
The full process of logging is a bit more complicated, because CAs have to include SCTs from a sufficiently-diverse set of logs, but when it comes to finding the initial set of logs to consider, it's hard to imagine how it could be any easier for CAs. They just need to download the Apple and Chrome lists and find the logs whose state is Usable in both lists and whose expiration range covers the expiration date of the certificate.
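To make that concrete, here is a rough Go sketch of the selection step. It assumes the log-list JSON schema matches the excerpt above (operators containing logs, plus a separate tiled_logs array for static-ct-api logs in newer lists); a real CA would still need to apply each browser's SCT diversity policy on top of this.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type logList struct {
	Operators []struct {
		Logs      []logInfo `json:"logs"`
		TiledLogs []logInfo `json:"tiled_logs"` // static-ct-api logs, if the list separates them
	} `json:"operators"`
}

type logInfo struct {
	Description string                     `json:"description"`
	URL         string                     `json:"url"`
	State       map[string]json.RawMessage `json:"state"` // single key: "usable", "qualified", ...
	Interval    *struct {
		Start time.Time `json:"start_inclusive"`
		End   time.Time `json:"end_exclusive"`
	} `json:"temporal_interval"`
}

// usableLogs returns the logs in the given list that are "usable" and whose
// temporal interval covers the certificate's notAfter date.
func usableLogs(listURL string, notAfter time.Time) (map[string]string, error) {
	resp, err := http.Get(listURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var list logList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	out := make(map[string]string) // log URL -> description
	for _, op := range list.Operators {
		for _, l := range append(op.Logs, op.TiledLogs...) {
			if _, usable := l.State["usable"]; !usable || l.Interval == nil {
				continue
			}
			if notAfter.Before(l.Interval.Start) || !notAfter.Before(l.Interval.End) {
				continue
			}
			out[l.URL] = l.Description
		}
	}
	return out, nil
}

func main() {
	notAfter := time.Date(2027, 3, 15, 0, 0, 0, 0, time.UTC) // hypothetical certificate expiration
	apple, err := usableLogs("https://valid.apple.com/ct/log_list/current_log_list.json", notAfter)
	if err != nil {
		log.Fatal(err)
	}
	chrome, err := usableLogs("https://www.gstatic.com/ct/log_list/v3/log_list.json", notAfter)
	if err != nil {
		log.Fatal(err)
	}
	for url, desc := range chrome {
		if _, ok := apple[url]; ok {
			fmt.Printf("usable in both lists: %s (%s)\n", desc, url)
		}
	}
}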
Despite this, a number of CAs appear to either disregard the state or only consider Chrome's log list. Historically, this has not caused problems because new logs have become Usable in both Chrome and Apple before they were needed for new certificates. Since the maximum certificate lifetime is 398 days, logs for certificates expiring in the first half of 2027 (2027h1) needed to be Usable by November 29, 2025. Unfortunately, not all 2027h1 logs were Usable by this date.
First, Google's 2027h1 logs (Argon 2027h1 and Xenon 2027h1) were added to Chrome 40 days later than they should have been. Normally, new logs are added to Chrome after 30 days of successful monitoring, but this process is still very manual and human error led to Chrome setting a 70-day timer instead of a 30-day timer. Consequently, these logs are still in the Qualified state in Chrome. Although Qualified logs are recognized by up-to-date installations of Chrome (and Firefox and Edge), there may be out-of-date installations which do not recognize them, making it a very bad idea for CAs to use Qualified logs if they care about compatibility. Chrome, Firefox, and Edge automatically disable Certificate Transparency enforcement once they become 70 days out-of-date, so Argon and Xenon 2027h1 will become Usable on December 27, 2025, which is 70 days after they became Qualified. (Argon and Xenon 2027h1 are already Usable in Apple's list.)
Second, DigiCert's 2027h1 logs (Sphinx 2027h1 and Wyvern 2027h1) don't appear at all in Apple's log list. Since Apple doesn't use a public bug tracker for their CT log program like Chrome, I have no idea what went wrong. Did DigiCert forget to tell Apple about their new logs, or is Apple slow-rolling them for some reason? Certificates which rely on either DigiCert log won't work at all on Apple platforms. (They are already Usable in Chrome's list.)
While the late addition of logs is not ideal, it should not have been a problem, because there are plenty of other 2027h1 logs which became Usable for both Apple and Chrome in time.
I first became aware of issues last Tuesday when Arabella Barks posted a message to Mozilla's dev-security-policy mailing list referencing a certificate issued by Certum with SCTs from DigiCert Wyvern 2027h1. Sensing that this could be a widespread problem, I decided to investigate. My company, SSLMate, maintains a 51TB PostgreSQL database with the contents of every Certificate Transparency log. The database's primary purpose is to power our Certificate Transparency monitoring service, Cert Spotter, and our Certificate Transparency Search API, but it's also very handy for investigating ecosystem issues.
I ran a query to find all precertificates logged to Google's and DigiCert's 2027h1 logs. This alone was not sufficient to identify broken certificates, since CAs could be submitting precertificates to these logs but not including the SCTs in the final certificate, or including more than the minimum number of required SCTs. Therefore, for every precertificate, I looked to see if the corresponding final certificate had been logged anywhere. If it had, I ran it through SSLMate's CT Policy Analyzer to see if it had enough SCTs from broadly Usable logs. If the final certificate wasn't available for analysis, I counted how many other logs the precertificate was logged to. If fewer than three of these logs were Usable, then there was no way the corresponding certificate could have enough SCTs.
I posted my findings to the ct-policy mailing list later that day, alerting CAs to the problem. Since then, I've found even more certificates relying on logs that are not broadly Usable. As of publication time, the following CAs have issued such certificates:
- Certum
- Cybertrust Japan (fixed)
- Disig
- GDCA
- GlobalSign (fixed)
- HARICA
- IdenTrust (fixed)
- Izenpe (fixed)
- Microsec
- NAVER
- SECOM
- SSL.com
- SHECA
- TWCA (fixed)
- certSIGN
- emSign
Of those, only the five indicated above have fixed their systems. The others have all issued broken certificates within the last two days, even though it has been a week since my first public posting.
Unfortunately, logging to non-Usable logs wasn't the only problem. Last Wednesday, Cert Spotter began alerting me about certificates issued by Cybertrust Japan containing SCTs with invalid signatures. I noticed that the SCTs with invalid signatures were all from static-ct-api logs.
To address shortcomings with the original Certificate Transparency specification (RFC6962), the ecosystem has been transitioning to logs based on the static-ct-api specification. Almost half of the 2027h1 logs use static-ct-api. However, while static-ct-api requires major changes for log monitors, it uses the exact same protocol for CAs to submit (pre)certificates. This was an intentional decision to make static-ct-api easier to adopt, so that it wouldn't suffer the same fate as RFC9162, which was intended to replace RFC6962 but was dead-on-arrival in part because it completely broke compatibility with the existing ecosystem.
However, there is one teeny tiny difference with static-ct-api: whereas RFC6962 logs always return SCTs with an empty extensions field, static-ct-api logs return SCTs with non-empty extensions. This should not be a problem - the extensions field is just an opaque byte array, and CAs do not need to understand what static-ct-api logs place in it. They just need to copy it through to the final certificate, which they should have been doing anyway with RFC6962 logs. But Cybertrust Japan was always leaving the extensions field empty regardless of what the log returned, breaking the SCT's signature. Since SCTs with invalid signatures are disregarded by browsers, this left their certificates with an insufficient number of SCTs, dooming them to rejection.
After publication of this post, Cert Spotter alerted me to invalid SCT signatures in certificates issued by NAVER. In this case, the SCT extensions were non-empty but encoded in base64, indicating that NAVER wasn't decoding the base64 from the JSON response when copying it to the SCT. On one hand, I don't love RFC6962's wording about the extensions field: while the other JSON fields, like id and signature, are clearly indicated as "base64 encoded", it's only implied that extensions is base64-encoded (it says "Clients should decode the base64-encoded data and include it in the SCT"). On the other hand, if NAVER were verifying the signature of SCTs before embedding them in certificates, they almost certainly would have caught this mistake, since successful verification relies on correctly decoding the JSON response. And we know from past incidents that it's very important for CAs to verify SCT signatures.
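To illustrate where both CAs went wrong, here is a minimal, hypothetical sketch of the step in question: converting a log's JSON add-(pre-)chain response into the TLS-serialized SCT that gets embedded in the certificate. The struct and function names are mine, not any CA's actual code; the point is simply that the extensions field must be base64-decoded and copied through verbatim.
// Package ctutil: a sketch of turning the JSON response from a log's
// add-pre-chain endpoint (RFC 6962, section 4.1) into the TLS-serialized SCT
// that gets embedded in the certificate.
package ctutil

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// addChainResponse mirrors the JSON returned by add-chain/add-pre-chain.
type addChainResponse struct {
	SCTVersion uint8  `json:"sct_version"`
	ID         string `json:"id"`         // base64 of the 32-byte log ID
	Timestamp  uint64 `json:"timestamp"`  // milliseconds since the epoch
	Extensions string `json:"extensions"` // base64; often "" for RFC 6962 logs
	Signature  string `json:"signature"`  // base64 of the TLS DigitallySigned struct
}

// serializeSCT produces the TLS-encoded SignedCertificateTimestamp.
func serializeSCT(r addChainResponse) ([]byte, error) {
	logID, err := base64.StdEncoding.DecodeString(r.ID)
	if err != nil || len(logID) != 32 {
		return nil, fmt.Errorf("bad log ID: %v", err)
	}
	// The extensions field is opaque to the CA, but it is covered by the
	// log's signature, so it must be base64-decoded and copied through
	// exactly. Leaving it empty (Cybertrust Japan) or copying the base64
	// text without decoding it (NAVER) both invalidate the signature.
	exts, err := base64.StdEncoding.DecodeString(r.Extensions)
	if err != nil {
		return nil, fmt.Errorf("bad extensions: %v", err)
	}
	sig, err := base64.StdEncoding.DecodeString(r.Signature)
	if err != nil {
		return nil, fmt.Errorf("bad signature: %v", err)
	}

	var sct []byte
	sct = append(sct, r.SCTVersion) // v1 == 0
	sct = append(sct, logID...)
	sct = binary.BigEndian.AppendUint64(sct, r.Timestamp)
	sct = binary.BigEndian.AppendUint16(sct, uint16(len(exts)))
	sct = append(sct, exts...)
	sct = append(sct, sig...) // DigitallySigned: algorithms + length-prefixed signature
	return sct, nil
}
Verifying the serialized SCT's signature against the log's public key before embedding it would catch either mistake immediately, since the signature covers the timestamp and extensions exactly as they will appear in the certificate.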
Unfortunately, we'll probably never learn the root cause of these failures or what CAs are doing to prevent them from happening again. Normally, when a CA violates a policy, they are required to publish a public incident report, answer questions from the community, and note the failure in their next audit. If their incident response is bad or they keep having the same incident, they run the risk of being distrusted. However, Certificate Transparency is not a policy requirement in the traditional sense - CAs are free to issue certificates which violate CT requirements; those certificates just won't work in CT-enforcing browsers. This allows CAs to issue unlogged certificates to customers who don't want their certificates to be public knowledge (and don't need them to work in browsers). Of course, that's not what the CAs here were doing - they were clearly trying to issue certificates that work in browsers; they just did a bad job of it.
Previously:
November 3, 2025
Google Just Suspended My Company's Google Cloud Account for the Third Time
On each of the last two Fridays, Google has suspended SSLMate's Google Cloud access without notification, having previously suspended it in 2024 without notification. But this isn't just another cautionary tale about using Google Cloud Platform; it's also a story about usable security and how Google's capriciousness is forcing me to choose between weakening security or reducing usability.
Apart from testing and experimentation, the only reason SSLMate still has a Google Cloud presence is to enable integrations with our customers' Google Cloud accounts so that we can publish certificate validation DNS records and discover domain names to monitor on their behalf. We create a service account for each customer under our Google Cloud project, and ask the customer to authorize this service account to access Cloud DNS and Cloud Domains. When SSLMate needs to access a customer's Google Cloud account, it impersonates the corresponding service account. I developed this system based on a suggestion in Google's own documentation (under "How can I access data from my users' Google Cloud project using Cloud APIs?") and it works really well. It is both very easy for the customer to configure, and secure: there are no long-lived credentials or confused deputy vulnerabilities.
Easy and secure: I love it when that's possible!
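For the curious, here is roughly what that impersonation looks like in Go, using the impersonate and Cloud DNS packages from the google.golang.org/api client libraries. The service account, project, and zone identifiers below are hypothetical, and this is a sketch of the pattern rather than SSLMate's actual code.
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/api/dns/v1"
	"google.golang.org/api/impersonate"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Obtain short-lived credentials for this customer's dedicated service
	// account. No long-lived key is ever created or stored.
	ts, err := impersonate.CredentialsTokenSource(ctx, impersonate.CredentialsConfig{
		TargetPrincipal: "customer-1234@example-provider-project.iam.gserviceaccount.com", // hypothetical
		Scopes:          []string{"https://www.googleapis.com/auth/cloud-platform"},
	})
	if err != nil {
		// This is the call that breaks when the provider's project is suspended.
		log.Fatalf("impersonation failed: %v", err)
	}

	// Use the impersonated credentials against the customer's Cloud DNS zone.
	svc, err := dns.NewService(ctx, option.WithTokenSource(ts))
	if err != nil {
		log.Fatal(err)
	}
	rrsets, err := svc.ResourceRecordSets.List("customer-project", "customer-zone").Do() // hypothetical IDs
	if err != nil {
		log.Fatal(err)
	}
	for _, rr := range rrsets.Rrsets {
		fmt.Println(rr.Name, rr.Type)
	}
}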
The only problem is that Google keeps suspending our Google Cloud access.
The First Suspension
Google suspended us for the first time in 2024. Our customer integrations began failing, and logging into the Google Cloud console returned this error:
Although Google's customer support people were surprisingly responsive considering Google's rock-bottom reputation in this area, the process to recover our account was super frustrating:
- Google required me to email them from the address associated with the account, but when I did so, the message was bounced with the error "The account [redacted] is disabled" (the redacted portion being the email address I sent from). When I emailed from a different address, the message went through, but the support people initially refused to communicate with it because it was the wrong address.
- At one point Google asked me to provide the IDs of our Google Cloud projects - information which I could not retrieve because I couldn't log in to the console. Have you saved your project IDs in a safe place in case your account gets suspended?
- After several emails back and forth with Google support, and verifying a phone number, I was able to log back into the Google Cloud console, but two of our projects were still suspended, including the one needed for the customer integrations. (At the time, we still had some domains registered through Google Cloud Domains, and thankfully the project for this was accessible, allowing me to begin transferring all of our domains out to a more dependable registrar.)
- The day after I regained access to the console, I received an automated email from no-reply@accounts.google.com stating that my access to Google Cloud Platform had been restricted. Once again, I could no longer access the console, but the error message was different this time:
- Twelve hours later, I received multiple automated emails from google-cloud-compliance@google.com stating that my Google Cloud projects had been "reinstated" but I still could not access the console.
- Seven hours after that, I got another automated email from no-reply@accounts.google.com stating that my access to Google Cloud Platform had been restored. Everything began working after this.
I was never told why our account was suspended or what could be done to prevent it from happening again. Although Google claims to send emails when an account or project is suspended, they never did so for the initial suspension. Since errors with customer integrations were only being displayed in our customers' SSLMate consoles (usually an error indicates the customer made a mistake), I didn't learn about the suspension right away. I fixed this by adding a health check that fails if a large percentage of Google Cloud integrations have errors.
The Second Suspension
Two Fridays ago, that health check failed. I immediately investigated and saw that all but one of our Google Cloud integrations were failing with the same error as during last year's suspension ("Invalid grant: account not found"). Groaning, I tried logging into the Google Cloud console, bracing myself for another Kafkaesque reinstatement process. At least I know the project IDs this time, I reassured myself. Surprisingly, I was able to log in successfully. Then I got emails, one per Google Cloud project, informing me that my projects had been reinstated "based on information that [I] have provided." Naturally, I had received no emails that they had been suspended in the first place. The integrations started working again.
The Third Suspension
Last Friday, the health check failed again. I logged in to the Google Cloud console, unsure what to expect. This time, I was presented with a third type of error message:
Most, but not all, of SSLMate's Google Cloud projects were suspended, including the one needed for customer integrations.
I submitted an appeal on Friday. On Sunday, I received an email from Google. Was it a response to the appeal? Nope! It was an automated email stating that SSLMate's access to Google Cloud was now completely suspended.
Edited to add: On Monday, shortly after this post hit the front page of Hacker News, most projects were reinstated, including the project for the integrations. A few hours later, access was fully restored. As before, there was no explanation why access was suspended or how to prevent it from happening again.
The Lucky Customer
Incredibly, we have one lucky customer whose integration has continued to work during every suspension, even though it uses a service account in the same suspended project as all the other customer integrations.
What Now?
Clearly, I cannot rely on having a Google account for production use cases. Google has built a complex, unreliable system in which some or all of the following can be suspended: an entire Google account, a Google Cloud Platform account, or individual Google Cloud projects.
Unfortunately, the alternatives for integrations are not great.
The first alternative is to ask customers to create a service account for SSLMate and have SSLMate authenticate to it using a long-lived key. This is pretty easy, but less secure since the long-lived key could leak and can never be rotated in practice.
The second alternative is to use OpenID Connect, aka OIDC. In recent years, OIDC has become the de facto standard for integrations between service providers. For example, you can use OIDC to let GitHub Actions access your Google Cloud account without the need for long-lived credentials. SSLMate's Azure integration uses OIDC and it works well.
Unfortunately, Google has made setting up OIDC unnecessarily difficult. What is currently a simple one-step process for our customers to add an integration (assign some roles to a service account) would become a complicated seven-step process:
1. Enable the IAM Service Account Credentials API.
2. Create a service account.
3. Create a workload identity pool.
4. Create a workload identity provider in the pool created in step 3.
5. Allow SSLMate to impersonate the service account created in step 2 (this requires knowing the ID of the pool created in step 3).
6. Assign roles to the service account created in step 2.
7. Provide SSLMate with the ID of the service account created in step 2, and the ID of the workload identity provider created in step 4.
Since many of the steps require knowing the identifiers of resources created in previous steps, it's hard for SSLMate to provide easy-to-follow instructions.
This is more complicated than it needs to be:
Creating a service account (steps 1, 2, and 5) should not be necessary. While it is possible to forgo a service account and assign roles directly to an identity from the pool, not all Google Cloud services support this. If you want your integration to work with all current and future services, you have to impersonate a service account. Google should stop treating OIDC like a second-class citizen and guarantee that all current and future services will directly support it.
Creating an identity pool shouldn't be necessary either. While I'm sure some use cases are nicely served by pools, it seems like most setups are going to have just one provider per pool, making the extra step of creating a pool nothing but unnecessary busy work.
Even creating a provider shouldn't be necessary; it should be possible to assign roles directly to an OIDC issuer URL and subject. You should only have to create a provider if you need to do more advanced configuration, such as mapping attributes.
I find this state of affairs unacceptable, because it's really, really important to move away from long-lived credentials and Google ought to be doing everything possible to encourage more secure alternatives. Sadly, SSLMate's current solution of provider-created service accounts is susceptible to arbitrary account suspensions, and OIDC is hampered by an unnecessarily complicated setup process.
In summary, when setting up cross-provider access with Google Cloud, you can have only two of the following:
- No dangerous long-lived credentials.
- Easy for the customer to set up.
- Safe from arbitrary account suspensions.
|   | Provider-created service accounts | Service account + key | OpenID Connect |
|---|---|---|---|
| No long-lived keys | ✓ | | ✓ |
| Easy setup | ✓ | ✓ | |
| Safe from suspension | | ✓ | ✓ |
Which two would you pick?
October 29, 2025
I'm Independently Verifying Go's Reproducible Builds
When you try to compile a Go module that requires a newer version of the Go toolchain than the one you have installed, the go command automatically downloads the newer toolchain and uses it for compiling the module. (And only that module; your system's go installation is not replaced.) This useful feature was introduced in Go 1.21 and has let me quickly adopt new Go features in my open source projects without inconveniencing people with older versions of Go.
However, the idea of downloading a binary and executing it on demand makes a lot of people uncomfortable. It feels like such an easy vector for a supply chain attack, where Google, or an attacker who has compromised Google or gotten a misissued SSL certificate, could deliver a malicious binary. Many developers are more comfortable getting Go from their Linux distribution, or compiling it from source themselves.
To address these concerns, the Go project did two things:
- They made it so every version of Go starting with 1.21 could be easily reproduced from its source code. Every time you compile a Go toolchain, it produces the exact same Zip archive, byte-for-byte, regardless of the current time, your operating system, your architecture, or other aspects of your environment (such as the directory from which you run the build).
- They started publishing the checksum of every toolchain Zip archive in a public transparency log called the Go Checksum Database. The go command verifies that the checksum of a downloaded toolchain is published in the Checksum Database for anyone to see.
These measures mean that:
- You can be confident that the binaries downloaded and executed by the go command are the exact same binaries you would have gotten had you built the toolchain from source yourself. If there's a backdoor, the backdoor has to be in the source code.
- You can be confident that the binaries downloaded and executed by the go command are the same binaries that everyone else is downloading. If there's a backdoor, it has to be served to the whole world, making it easier to detect.
But these measures mean nothing if no one is checking that the binaries are reproducible, or that the Checksum Database isn't presenting inconsistent information to different clients. Although Google checks reproducibility and publishes a report, this doesn't help if you think Google might try to slip in a backdoor themselves. There needs to be an independent third party doing the checks.
Why not me? I was involved in Debian's Reproducible Builds project back in the day and developed some of the core tooling used to make Debian packages reproducible (strip-nondeterminism and disorderfs). I also have extensive experience monitoring Certificate Transparency logs and have detected misbehavior by numerous logs since 2017. And I do not work for Google (though I have eaten their food).
In fact, I've been quietly operating an auditor for the Go Checksum Database since 2020 called Source Spotter (à la Cert Spotter, my Certificate Transparency monitor). Source Spotter monitors the Checksum Database, making sure it doesn't present inconsistent information or publish more than one checksum for a given module and version. I decided to extend Source Spotter to also verify toolchain reproducibility.
The Checksum Database was originally intended for recording the checksums of Go modules. Essentially, it's a verifiable, append-only log of records which say that a particular version (e.g. v0.4.0) of a module (e.g. src.agwa.name/snid) has a particular SHA-256 hash. Go repurposed it for recording toolchain checksums. Toolchain records have the pseudo-module golang.org/toolchain and versions that look like v0.0.1-goVERSION.GOOS-GOARCH. For example, the Go 1.24.2 toolchain for linux/amd64 has the module version v0.0.1-go1.24.2.linux-amd64.
When Source Spotter sees a new version of the golang.org/toolchain pseudo-module, it downloads the corresponding source code, builds it in an AWS Lambda function by running make.bash -distpack, and compares the checksum of the resulting Zip file to the checksum published in the Checksum Database. Any mismatches are published on a webpage and in an Atom feed which I monitor.
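The comparison itself is simple. Here is a rough sketch, assuming the Checksum Database records toolchain zips using the standard h1: module hash format from golang.org/x/mod/sumdb/dirhash; the file path and expected sum below are placeholders, and the real pipeline builds the zip with make.bash -distpack first.
package main

import (
	"fmt"
	"log"

	"golang.org/x/mod/sumdb/dirhash"
)

func main() {
	zipPath := "go1.24.2.linux-amd64.zip"        // produced by make.bash -distpack (placeholder)
	expected := "h1:REPLACE_WITH_SUM_FROM_SUMDB" // the hash published for golang.org/toolchain@v0.0.1-go1.24.2.linux-amd64

	// Compute the module-style hash of the zip we just built ourselves.
	got, err := dirhash.HashZip(zipPath, dirhash.Hash1)
	if err != nil {
		log.Fatal(err)
	}
	if got != expected {
		log.Fatalf("NOT REPRODUCIBLE: built %s, Checksum Database says %s", got, expected)
	}
	fmt.Println("reproduced:", got)
}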
So far, Source Spotter has successfully reproduced every toolchain since Go 1.21.0, for every architecture and operating system. As of publication time, that's 2,672 toolchains!
Bootstrap Toolchains
Since the Go toolchain is written in Go, building it requires an earlier version of the Go toolchain to be installed already.
When reproducing Go 1.21, 1.22, and 1.23, Source Spotter uses a Go 1.20.14 toolchain that I built from source. I started by building Go 1.4.3 using a C compiler. I used Go 1.4.3 to build Go 1.17.13, which I used to build Go 1.20.14. To mitigate Trusting Trust attacks, I repeated this process on both Debian and Amazon Linux using both GCC and Clang for the Go 1.4 build. I got the exact same bytes every time, which I believe makes a compiler backdoor vanishingly unlikely. The scripts I used for this are open source.
When reproducing Go 1.24 or higher, Source Spotter uses a binary toolchain downloaded from the Go module proxy that it previously verified as being reproducible from source.
Problems Encountered
Compared to reproducing a typical Debian package, it was really easy to reproduce the same bytes when building the Go toolchains. Nevertheless, there were some bumps along the way:
First, the Darwin (macOS) toolchains published by Google contain signatures produced by Google's private key. Obviously, Source Spotter can't reproduce these. Instead, Source Spotter has to download the toolchain (making sure it matches the checksum published in the Checksum Database) and strip the signatures to produce a new checksum that is verified against the reproduced toolchain. I reused code written by Google to strip the signatures and I honestly have no clue what it's doing and whether it could potentially strip a backdoor. A review from someone versed in Darwin binaries would be very helpful!
Edit: since publication, I've learned enough about Darwin binaries to be confident in this code.
Second, to reproduce the linux-arm toolchains, Source Spotter has to set GOARM=6 in the environment... except when reproducing Go 1.21.0, which Google accidentally built using GOARM=7.
I find it unfortunate that cmd/dist (the tool used to build the toolchain) doesn't set this environment variable along with the many other environment variables it sets, but Russ Cox pointed me to some context on why this is the case.
Finally, the Checksum Database contains a toolchain for Go 1.9.2rc2, which is not a valid version number. It turns out this version was released by mistake. To avoid raising an error for an invalid version number, Source Spotter has to special case it. Not a huge deal, but I found it interesting because it demonstrates one of the downsides of transparency logs: you can't fix or remove entries that were added by mistake!
Source Code Transparency
The source tarballs built by Source Spotter are not published in the Checksum Database, meaning Google could serve Source Spotter, and only Source Spotter, source code which contains a backdoor. To mitigate this, Source Spotter publishes the checksums of every source tarball it builds. However, there are alternatives:
First, Russ Cox pointed out that while the source tarballs aren't in the Checksum Database, the toolchain Zip archives also contain the source code, so Source Spotter could build those instead of the source tarballs. (A previous version of this post incorrectly said that source code wasn't published in the Checksum Database at all.)
Second, Filippo Valsorda suggested that Source Spotter build from Go's Git repository and publish the Git commit IDs instead, since lots of Go developers have the Go Git repository checked out and it would be relatively easy for them to compare the state of their repos against what Source Spotter has seen. Regrettably, Git commit IDs are SHA-1, but this is mitigated by Git's use of Marc Stevens' collision detection, so the benefits may be worth the risk.
I think building from Git is a good idea, and to bootstrap it, Filippo used Magic Wormhole to send me the output of git show-ref --tags from his repo while we were both at the Transparency.dev Summit last week.
Conclusion
Thanks to Go's Checksum Database and reproducible toolchains, Go developers get the usability benefits of a centralized package repository and binary toolchains without sacrificing the security benefits of decentralized packages and building from source. The Go team deserves enormous credit for making this a reality, particularly for building a system that is not too hard for a third party to verify. They've raised the bar, and I hope other language and package ecosystems can learn from what they've done.
Learn more by visiting the Source Spotter website or the GitHub repo.
August 29, 2025
SQLite's Durability Settings are a Mess
One of the most important properties of a database is durability. Durability means that after a transaction commits, you can be confident that, absent catastrophic hardware failure, the changes made by the commit won't be lost. This should remain true even if the operating system crashes or the system loses power soon after the commit. On Linux, and most other Unix operating systems, durability is ensured by calling the fsync system call at the right time.
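Concretely, on Linux a durable write looks something like this minimal sketch (file names are placeholders): sync the file's data, and, when the file was just created or renamed, sync its parent directory too - which is essentially what SQLite's EXTRA setting adds for the rollback journal's directory.
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.Create("data.tmp")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Write([]byte("committed state\n")); err != nil {
		log.Fatal(err)
	}
	if err := f.Sync(); err != nil { // fsync the file's contents
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}

	// fsync the containing directory so the new file's directory entry is durable.
	dir, err := os.Open(".")
	if err != nil {
		log.Fatal(err)
	}
	if err := dir.Sync(); err != nil {
		log.Fatal(err)
	}
	dir.Close()
}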
Durability comes at a performance cost, and sometimes applications don't need durability. Some applications can tolerate losing the last several seconds of commits in the event of a power failure, as long as the database doesn't end up corrupted. Thus, databases typically provide knobs to configure if and when they call fsync. This is fine, but it's essential that the database clearly documents what its default durability properties are, and what each configuration setting guarantees.
Unfortunately, SQLite's documentation about its durability properties is far from clear. I cannot tell whether SQLite is durable by default, and if not, what are the minimal settings you need to use to ensure durability.
The two relevant configuration options are journal_mode and synchronous. journal_mode has several possible values, but most people use either DELETE or WAL. synchronous has four possible values: EXTRA, FULL, NORMAL, and OFF.
This is how I interpret SQLite's documentation after a careful reading:
- The default value of journal_mode is DELETE: "The DELETE journaling mode is the normal behavior" (source; archived)
- The default value of synchronous is FULL: "If not overridden at compile-time, the default setting is 2 (FULL)" (source; archived)
- The default value of synchronous is FULL even in WAL mode: "If not overridden at compile-time, this value is the same as SQLITE_DEFAULT_SYNCHRONOUS." (source; archived)
- When journal_mode is DELETE, you need to set synchronous to EXTRA to get durability: "EXTRA synchronous is like FULL with the addition that the directory containing a rollback journal is synced after that journal is unlinked to commit a transaction in DELETE mode. EXTRA provides additional durability if the commit is followed closely by a power loss." (source; archived) Edited to add: I confirmed this to be true through testing - see my Hacker News comment for the methodology.
- When journal_mode is WAL, FULL is sufficient for durability: "With synchronous=FULL in WAL mode, an additional sync operation of the WAL file happens after each transaction commit. The extra WAL sync following each transaction helps ensure that transactions are durable across a power loss" (source; archived) Note that this is not mentioned under the definition of FULL, but rather further down in the documentation for synchronous.
Based on the above, I conclude that:
- By default, SQLite is not durable, because the default value of journal_mode is DELETE, and the default value of synchronous is FULL, which doesn't provide durability in DELETE mode.
- If you change journal_mode to WAL, then SQLite is durable, because synchronous=FULL provides durability in WAL mode.
However, a recent Hacker News comment by a user who credibly claims to be Richard Hipp, the creator of SQLite, says:
"In its default configuration, SQLite is durable."
"If you switch to WAL mode, the default behavior is that transactions ... are not necessarily durable across OS crashes or power failures"
That's literally the opposite of what the documentation seems to say!
A Hacker News commenter who agrees with my reading of the documentation asked Hipp how his comment is consistent with the documentation, but received no reply.
Hipp also says that WAL mode used to be durable by default, but it was changed after people complained about poor performance. This surprised me, since I had the impression that SQLite cared deeply about backwards compatibility, and weakening the default durability setting is a nasty breaking change for any application which needs durability.
There are a couple other pitfalls around SQLite durability that you should be aware of, though I don't necessarily blame the SQLite project for these:
- Libraries that wrap SQLite can override the default value of synchronous. For example, the most popular Go driver for SQLite sets it to NORMAL when in WAL mode, which does not provide durability.
- On macOS, fsync is nerfed to make macOS appear faster. If you want a real fsync, you have to make a different, macOS-specific system call. SQLite can do this, but it's off by default.
My takeaway is that if you need durability, you'd better set the synchronous option explicitly because who knows what the default is, or what it will be in the future. With WAL mode, FULL seems to suffice. As for DELETE mode, who knows if FULL is enough, so you'd better go with EXTRA to be safe. And if your application might be used on macOS, enable fullfsync.
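If you're using Go, that might look something like the following sketch, assuming the mattn/go-sqlite3 driver; the database filename is a placeholder.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// synchronous and fullfsync are per-connection settings, so limit the
	// pool to one connection (or set them via the driver's DSN options).
	db.SetMaxOpenConns(1)

	// Don't rely on SQLite's or the driver's defaults: state what you need.
	pragmas := []string{
		"PRAGMA journal_mode=WAL", // in WAL mode, synchronous=FULL appears to suffice for durability
		"PRAGMA synchronous=FULL", // override the driver's NORMAL-in-WAL default
		"PRAGMA fullfsync=ON",     // ask for a real fsync (F_FULLFSYNC) on macOS; no effect elsewhere
	}
	for _, p := range pragmas {
		if _, err := db.Exec(p); err != nil {
			log.Fatalf("%s: %v", p, err)
		}
	}
}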
The SQLite project ought to clarify their documentation. Since the meaning of synchronous depends on the value of journal_mode, I think it would be quite helpful to document the values of synchronous separately for each possible journal_mode, rather than mixing it all together. A table with synchronous values on one axis and journal_mode on the other which tells you if the combination provides durability would do wonders.
By the way, there are definitely many applications for which losing a few seconds of data in exchange for better performance is a great tradeoff, which is why SQLite and macOS have made the choices they have made. But programmers need to know what guarantees their tools provide, which is why unclear documentation and breaking previously-held assumptions is not cool.
June 22, 2023
The Story Behind Last Week's Let's Encrypt Downtime
Last Thursday (June 15th, 2023), Let's Encrypt went down for about an hour, during which time it was not possible to obtain certificates from Let's Encrypt. Immediately prior to the outage, Let's Encrypt issued 645 certificates which did not work in Chrome or Safari. In this post, I'm going to explain what went wrong and how I detected it.
The Law of Precertificates
Before I can explain the incident, we need to talk about Certificate Transparency. Certificate Transparency (CT) is a system for putting certificates issued by publicly-trusted CAs, such as Let's Encrypt, in public, append-only logs. Certificate authorities have a tremendous amount of power, and if they misuse their power by issuing certificates that they shouldn't, traffic to HTTPS websites could be intercepted by attackers. Historically, CAs have not used their power well, and Certificate Transparency is an effort to fix that by letting anyone examine the certificates that CAs issue.
A key concept in Certificate Transparency is the "precertificate". Before issuing a certificate, the certificate authority creates a precertificate, which contains all of the information that will be in the certificate, plus a "poison extension" that prevents the precertificate from being used like a real certificate. The CA submits the precertificate to multiple Certificate Transparency logs. Each log returns a Signed Certificate Timestamp (SCT), which is a signed statement acknowledging receipt of the precertificate and promising to publish the precertificate in the log for anyone to download. The CA takes all of the SCTs and embeds them in the certificate. When a CT-enforcing browser (like Chrome or Safari) validates the certificate, it makes sure that the certificate embeds a sufficient number of SCTs from trustworthy logs. This doesn't prevent the browser from accepting a malicious certificate, but it does ensure that the precertificate is in public logs, allowing the attack to be detected and action taken against the CA.
The certificate itself may or may not end up in CT logs. Some CAs, notably Let's Encrypt and Sectigo, automatically submit their certificates. Certificates from other CAs only end up in logs if someone else finds and submits them. Since only the precertificate is guaranteed to be logged, it is essential that a precertificate be treated as incontrovertible proof that a certificate containing the same data exists. When someone finds a precertificate for a malicious or non-compliant certificate, the CA can't be allowed to evade responsibility by saying "just kidding, we never actually issued the real certificate" (and boy, have they tried). Otherwise, CT would be useless.
There are two ways a CA could create a certificate. They could take the precertificate, remove the poison extension, add the SCTs, and re-sign it. Or, they could create the certificate from scratch, making sure to add the same data, in the same order, as used in the precertificate.
The first way is robust because it's guaranteed to produce a certificate which matches the precertificate. At least one CA, Sectigo, uses this approach. Let's Encrypt uses the second approach. You can probably see where this is going...
The Let's Encrypt incident
On June 15, 2023, Let's Encrypt deployed a planned change to their certificate configuration which altered the contents of the Certificate Policies extension from:
X509v3 Certificate Policies:
    Policy: 2.23.140.1.2.1
    Policy: 1.3.6.1.4.1.44947.1.1.1
      CPS: http://cps.letsencrypt.org
to:
X509v3 Certificate Policies:
    Policy: 2.23.140.1.2.1
Unfortunately, any certificate which was requested while the change was being rolled out could have its precertificate and certificate created with different configurations. For example, when Let's Encrypt issued the certificate with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71, the precertificate contained the new Certificate Policies extension, and the certificate contained the old Certificate Policies extension.
This had two consequences:
First, this certificate won't work in Chrome or Safari, because its SCTs are for a precertificate containing different data from the certificate. Specifically, the SCTs fail signature validation. When logs sign SCTs, they compute the signature over the data in the precertificate, and when browsers verify SCTs, they compute the signature over the data in the certificate. In this case, that data was not the same.
Second, remember how I said that precertificates are treated as incontrovertible proof that a certificate containing the same data exists? When Let's Encrypt issued a precertificate with the new Certificate Policies value, it implied that they also issued a certificate with the new Certificate Policies value. Thus, according to the Law of Precertificates, Let's Encrypt issued two certificates with serial number 03:e2:26:7b:78:6b:7e:33:83:17:dd:d6:2e:76:4f:cb:3c:71:
- A certificate containing the old Certificate Policies extension
- A certificate containing the new Certificate Policies extension (implied by the existence of the precertificate with the new Certificate Policies extension)
Issuing two certificates with the same serial number is a violation of the Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates. Consequently, Let's Encrypt must revoke the certificate and post a public incident report, which must be noted on their next audit statement.
You might think that it's harsh to treat this as a compliance incident if Let's Encrypt didn't really issue two certificates with the same serial number. Unfortunately, they have no way of proving this, and the whole reason for Certificate Transparency is so we don't have to take CAs at their word that they aren't issuing certificates that they shouldn't. Any exception to the Law of Precertificates creates an opening for a malicious CA to exploit.
How I found this
My company, SSLMate, operates a Certificate Transparency monitor called Cert Spotter, which continuously downloads and indexes the contents of every Certificate Transparency log. You can use Cert Spotter to get notifications when a certificate is issued for one of your domains, or search the database using a JSON API.
When Cert Spotter ingests a certificate containing embedded SCTs, it verifies each SCT's signature and audits that the log really published the precertificate. (If it detects that a log has broken its promise to publish a precertificate, I'll publicly disclose the SCT and the log will be distrusted. Happily, Cert Spotter has never found a bogus SCT, though it has detected logs violating other requirements.)
On Thursday, June 15, 2023 at 15:41 UTC, Cert Spotter began sending me alerts about certificates containing embedded SCTs with invalid signatures. Since I was getting hundreds of alerts, I decided to stop what I was doing and investigate.
I had received these alerts several times before, and have gotten pretty good at zeroing in on the problem. When only one SCT in a certificate has an invalid signature, it probably means that the CT log screwed up. When all of the embedded SCTs have an invalid signature, it probably means the CA screwed up. The most common reason is issuing certificates that don't match the precertificate. So I took one of the affected certificates and searched for precertificates containing the same serial number in Cert Spotter's database of every (pre)certificate ever logged to Certificate Transparency. Decoding the certificate and precertificate with the openssl command immediately revealed the different Certificate Policies extension.
Since I was continuing to get alerts from Cert Spotter about invalid SCT signatures, I quickly fired off an email to Let's Encrypt's problem reporting address alerting them to the problem.
I sent the email at 15:52 UTC. At 16:08, Let's Encrypt replied that they had paused issuance to investigate. Meanwhile, I filed a CA Certificate Compliance bug in Bugzilla, which is where Mozilla and Chrome track compliance incidents by publicly-trusted certificate authorities.
At 16:54, Let's Encrypt resumed issuance after confirming that they would not issue any more certificates with mismatched precertificates.
On Friday, June 16, 2023, Let's Encrypt emailed the subscribers of the affected certificates to inform them of the need to replace their certificates.
On Monday, June 19, 2023 at 18:00 UTC, Let's Encrypt revoked the 645 affected certificates, as required by the Baseline Requirements. This will cause the certificates to stop working in any client that checks revocation, but remember that these certificates were already being rejected by Chrome and Safari for having invalid SCTs.
On Tuesday, June 20, 2023, Let's Encrypt posted their public incident report, which explained the root cause of the incident and what they're doing to prevent it from happening again. Specifically, they plan to add a pre-issuance check that ensures certificates contain the same data as the precertificate.
Hundreds of websites are still serving broken certificates
I've been periodically checking port 443 of every DNS name in the affected certificates, and as of publication time, 261 certificates are still in use, despite not working in CT-enforcing or revocation-checking clients.
I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt. I thought that maybe these certificates were being used by API endpoints which are accessed by non-browser clients that don't enforce CT or check revocation, but this doesn't appear to be the case, as most of the DNS names are for bare domains or www subdomains. It's fortunate that Let's Encrypt issued only a small number of non-compliant certificates, because otherwise it would have broken a lot of websites.
There is a new standard under development called ACME Renewal Information which enables certificate authorities to inform ACME clients to renew certificates ahead of their normal expiration. Let's Encrypt supports ARI, and used it in this incident to trigger early renewal of the affected certificates. Clearly, more ACME clients need to add support for ARI.
This is my 50th CA compliance bug
It turns out this is the 50th CA compliance bug that I've filed in Bugzilla, and the 5th which was uncovered by Cert Spotter's SCT signature checks. Additionally, I reported a number of incidents before 2018 which didn't end up in Bugzilla.
Some of the problems I uncovered were quite serious (like issuing certificates without doing domain validation) and snowballed until the CA was ultimately distrusted. Most are minor in comparison, and ten years ago, no one would have cared about them: there was no Certificate Transparency to unearth non-compliant certificates, and even when someone did notice, the revocation requirement was not enforced, and CAs were not required to file incident reports or document the non-compliance on their next audit. Thankfully, that's no longer the case, and even compliance violations that seem minor are treated seriously, which has led to enormous improvements in the certificate ecosystem:
- The improvements which certificate authorities make in response to seemingly-minor incidents also improve their compliance with the most security-critical rules.
- TLS clients no longer need to work around non-standards-compliant certificates, which means they can be simpler. Simpler code is easier to make secure.
- The way that CAs handle minor incidents can uncover much larger problems. Minor compliance problems are like "Brown M&M's".
Mozilla deserves enormous credit for being the first to require public incident reports from CAs, as does Google for creating and fostering Certificate Transparency.
You should monitor Certificate Transparency too
One limitation of my compliance monitoring is that I am generally only able to detect certificates that are intrinsically non-compliant, like those which violate encoding rules or are valid for too many days. While I do monitor certificates for domains that are likely to be abused, like example.com and test.com, I can't tell if a certificate issued for your domain is authorized or not. Only you know that.
Fortunately, it's pretty easy to monitor Certificate Transparency and get alerts when a certificate is issued for one of your domains. Cert Spotter has a standalone, open source version that's easy to set up. The paid version has additional features like expiration monitoring, Slack integration, and ways to filter alerts so you're not bothered about legitimate certificates. But most importantly, subscribing to the paid version helps me continue my compliance monitoring of the certificate authority ecosystem.