
April 15, 2022

How I'm Using SNI Proxying and IPv6 to Share Port 443 Between Webapps

My preferred method for deploying webapps is to have the webapp listen directly on port 443, without any sort of standalone web server or HTTP reverse proxy in front. I have had it with standalone web servers: they're all over-complicated and I always end up with an awkward bifurcation of logic between my app's code and the web server's config. Meanwhile, my preferred language, Go, has a high-quality, memory-safe HTTPS server in the standard library that is well suited for direct exposure on the Internet.

However, only one process at a time can listen on a given IP address and port number. In a world of ubiquitous IPv6, this wouldn't be a problem - each of my servers has literally trillions of IPv6 addresses, so I could easily dedicate one IPv6 address per webapp. Unfortunately, IPv6 is not ubiquitous, and due to the shortage of IPv4 addresses, it would be too expensive to give each app its own IPv4 address.

The conventional solution for this problem is HTTP reverse proxying, but I want to do better. I want to be able to act like IPv6 really is ubiquitous, but continue to support IPv4-only clients with a minimum amount of complexity and mental overhead. To accomplish this, I've turned to SNI-based proxying.

I've written about SNI proxying before, but in a nutshell: a proxy server can use the first message in a TLS connection (the Client Hello message, which is unencrypted and contains the server name (SNI) that the client wants to connect to) to decide where to route the connection. Here's how I'm using it:

  • My webapps listen on port 443 of a dedicated IPv6 address. They do not listen on IPv4.
  • Each of my servers runs snid, a Go daemon which listens on port 443 of the server's single IPv4 address.
  • When snid receives a connection, it peeks at the first TLS message to get the desired server name. It does a DNS lookup for the server name's IPv6 address, and proxies the TCP connection there. To prevent snid from being used as an open proxy, snid only forwards the connection if the IPv6 address is within my server's IPv6 range.
  • The AAAA record for a webapp is the dedicated IPv6 address, and the A record is the shared IPv4 address. Thus, IPv6 clients connect directly to the webapp, while IPv4 clients are proxied via snid.

Preserving the Client's IP Address

One of the headaches caused by proxies is that the backend doesn't see the client's IP address - connections appear to be from the proxy server instead. With HTTP proxying, this problem is typically solved by stuffing the client's IP address in a header field, which is a minefield of security problems that allow client IP addresses to be spoofed if you're not careful. With TCP proxying, a common solution is to use the PROXY protocol, which puts the client's IP address at the beginning of the proxied connection. However, this requires backends to understand the PROXY protocol.

snid can do better. Since IPv6 addresses are 128 bits long, but IPv4 addresses are only 32 bits, it's possible to embed IPv4 addresses in IPv6 addresses. snid embeds the client's IP address in the lower 32 bits of the source address which it uses to connect to the backend. It's trivial for the backend to translate the source address back to the IPv4 address, but this is purely a user interface concern. If a backend doesn't do the translation, it's possible for the human operator to do the translation manually when viewing log entries, configuring access control, etc.

For the IPv6 prefix, I use 64:ff9b:1::/48, which is a non-publicly-routed prefix reserved for IPv4/IPv6 translation mechanisms. For example, the IPv4 address 192.0.2.10 translates to:

64:ff9b:1::c000:20a

Conveniently, it can also be written using embedded IPv4 notation:

64:ff9b:1::192.0.2.10

O(1) Config

snid's configuration is just a few command line arguments. Here's the command line for the instance of snid that's running on the server that serves www.agwa.name and src.agwa.name:

snid -listen tcp:18.220.42.202:443 -mode nat46 -nat46-prefix 64:ff9b:1:: -backend-cidr 2600:1f16:719:be00:5ba7::/80

-listen tells snid to listen on port 443 of 18.220.42.202, which is the IPv4 address for www.agwa.name and src.agwa.name. -mode nat46 tells snid to forward connections over IPv6, with the source IPv4 address embedded using the prefix specified by -nat46-prefix. -backend-cidr tells snid to only forward connections to addresses within the 2600:1f16:719:be00:5ba7::/80 subnet, which includes the IPv6 addresses for www.agwa.name and src.agwa.name (2600:1f16:719:be00:5ba7::2 and 2600:1f16:719:be00:5ba7::1, respectively).

The best thing about snid's configuration is that I only have to touch it once. I don't have to change it when deploying new webapps. Deploying a new webapp only requires assigning it an IPv6 address and publishing DNS records for it, just like it would be in my dream world of ubiquitous IPv6. I call this O(1) configuration since it doesn't get longer or more complex with the number of webapps I run.

Guaranteed Secure

HTTP reverse proxying is a minefield of security concerns. In addition to the IP address spoofing problems discussed above, you have to contend with request smuggling and HTTP desync vulnerabilities. This is a class of vulnerability that will never truly be solved: you can patch vulnerabilities as they're discovered, but thanks to the inherent ambiguity in parsing HTTP, you can never be sure there won't be more.

I don't have to worry about any of this with snid. Since snid doesn't decrypt the TLS connection (and lacks the necessary keys to do so), proxying with snid is guaranteed to be secure as long as TLS is secure. It can harm security no more than any untrusted router on the Internet can. This helps put snid out of my mind so I can forget that it even exists.

Compatible with ACME ALPN

Since ACME's TLS-ALPN challenge uses SNI to convey the hostname being validated, snid will forward TLS-ALPN requests from the certificate authority to the appropriate backend. Automatic certificate acquisition, such as with Go's autocert package, Just Works.

What About Encrypted Client Hello?

Since the SNI hostname is in plaintext, a network eavesdropper can determine what hostname a client is connecting to. This is bad for privacy and censorship resistance, so there is an effort underway to encrypt not just the SNI hostname, but the entire Client Hello message. How does this affect snid?

First, it's important to note that the destination IP address in the IP header is always going to be unencrypted, so by putting my webapps on different IPv6 addresses, I'm giving eavesdroppers the ability to find out which webapp clients are connecting to, regardless of SNI. However, a single webapp might handle multiple hostnames, and I'd like to hide the specific hostname from eavesdroppers, so Encrypted Client Hello still has some value. Fortunately, Encrypted Client Hello works with snid.

Encrypted Client Hello doesn't actually encrypt the initial Client Hello message. It's still sent in the clear, but with a decoy SNI hostname. The actual Client Hello message, with the true SNI hostname, is encrypted and placed in an extension of the unencrypted Client Hello. To make Encrypted Client Hello work with snid, I just need to ensure that the decoy SNI hostname resolves to the IPv6 address of the backend server. snid will see this hostname and route the connection to the correct backend server, as usual. The backend will decrypt the true Encrypted Client Hello to determine which specific hostname the client wants.

For additional detail about this approach, see my comment on Hacker News.

What About Port 80?

Obviously, I can't proxy unencrypted HTTP traffic using SNI-based proxying. But at this point, port 80 exists solely to redirect clients to HTTPS. To handle this, I plan to run a tiny, zero-config daemon on port 80 of all IPv4 and IPv6 addresses that will redirect the client to the same URL but with http:// replaced with https://. (For now, I'm using Apache for this.)

Installing snid

If you have Go, you can install snid by running:

go install src.agwa.name/snid@latest

You can also download a statically-linked binary.

See the README for the command line usage.

Rejected Approach: UNIX Domain Sockets

Before settling on the approach described above, I had snid listen on port 443 of all interfaces (both IPv4 and IPv6) and forward connections to a UNIX domain socket whose path contained the SNI hostname. For example, connections to example.com would be forwarded to /var/tls/example.com. The client's IP address was preserved using the PROXY protocol.

This had some nice properties. I could use filesystem permissions to control who was allowed to create sockets, either by setting permissions on /var/tls, or by symlinking specific hostnames under /var/tls to other locations on the filesystem which non-root users could write to. It felt really elegant that applications could listen on an SNI hostname rather than on an IP address and port.

However, few server applications support the PROXY protocol or listening on UNIX domain sockets. I could make sure my own apps had support, but I really wanted to be able to use off-the-shelf apps with snid. I did write an amazing LD_PRELOAD library that intercepts the bind system call and transparently replaces binding to a TCP port with binding to a UNIX domain socket. It even intercepts getpeername and makes it return the IP address received via the PROXY protocol. Although this worked with every application I tried it with, it felt like a hack.

Additionally, UNIX domain sockets have some annoying semantics: if the socket file already exists (perhaps because the application crashed without removing the socket file), you can't bind to it - even if no other program is actually bound to it. But if you remove the socket file, any program bound to it continues running, completely unaware that it will never again accept a client. The semantics of TCP port binding feel more robust in comparison.

For these reasons I switched to the IPv6 approach described above, allowing me to use standard, unmodified TCP-listening server apps without any hacks that might compromise robustness. However, support for UNIX domain sockets lives on in snid with the -mode unix flag.


January 19, 2022

Comcast Shot Themselves in the Foot with MTA-STS

I recently heard from someone, let's call them Alex, who was unable to email comcast.net addresses. Alex's emails were being bounced back with an MTA-STS policy error:

MX host mx2h1.comcast.net does not match any MX pattern in MTA-STS policy
MTA-STS failure for Comcast.net: Validation error (E_HOST_MISMATCH)
MX host mx1a1.comcast.net does not match any MX pattern in MTA-STS policy
MTA-STS failure for Comcast.net: Validation error (E_HOST_MISMATCH)

MTA-STS is a relatively new standard that allows domain owners such as Comcast to opt in to authenticated encryption for their mail servers. (By default, SMTP traffic between mail servers uses opportunistic encryption, which can be defeated by active attackers to intercept email.) MTA-STS requires the domain owner to duplicate their MX record (the DNS record that lists a domain's mail servers) in a text file served over HTTPS. Sending mail servers, like Alex's, refuse to contact mail servers that aren't listed in the MTA-STS text file. Since HTTPS uses authenticated encryption, the text file can't be altered by active attackers. (In contrast, the MX record is vulnerable to manipulation unless DNSSEC is used, but people don't like DNSSEC which is why MTA-STS was invented.)

The above error messages mean that although mx2h1.comcast.net and mx1a1.comcast.net are listed in comcast.net's MX record, they are not listed in comcast.net's MTA-STS policy file. Consequentially, Alex's mail server thinks that the MX record was altered by attackers, and is refusing to deliver mail to what it assumes are rogue mail servers.

However, mx2h1.comcast.net and mx1a1.comcast.net are not rogue mail servers. They are in fact listed in Comcast's current MTA-STS policy:

version: STSv1
mode: enforce
mx: mx2c1.comcast.net
mx: mx2h1.comcast.net
mx: mx1a1.comcast.net
mx: mx1h1.comcast.net
mx: mx1c1.comcast.net
mx: mx2a1.comcast.net
max_age: 2592000

This means that Alex's mail server is not consulting comcast.net's current MTA-STS policy. Instead, it's consulting a cached policy which does not list mx2h1.comcast.net and mx1a1.comcast.net.

This can happen because mail servers cache MTA-STS policies to avoid having to re-download an unchanged policy file every time an email is sent. To determine whether a domain's policy has changed, mail servers query the domain's _mta-sts TXT record (e.g. _mta-sts.comcast.net), and only re-download the MTA-STS policy file if the ID in the TXT record is different from the ID of the currently-cached policy.

The obvious implication of the above is that if you ever change your domain's MTA-STS policy, you have to remember to update the TXT record as well.

A more subtle implication is that you have to do the updates in the right order. If you update the TXT record before changing the policy file, and a mail server fetches the policy in the intervening time, it will download the old policy file but cache it under the new ID. It won't ever download the new policy because it thinks it already has it in its cache.

This pitfall could have been avoided had MTA-STS required the ID to also be specified in the policy file instead of just in the TXT record. That would have prevented mail servers from caching policies under the wrong ID.

There's some evidence that this is what happened with comcast.net. The ID in the _mta-sts.comcast.net TXT record appears to be a UNIX timestamp (seconds since the Epoch):

_mta-sts.comcast.net. 7200 IN TXT "v=STSv1; id=1638997389;"

That timestamp translates to 2021-12-08 21:03:09 UTC.

However, the Last-Modified time of https://mta-sts.comcast.net/.well-known/mta-sts.txt is three minutes later:

Last-Modified: Wed, 08 Dec 2021 21:06:05 GMT

If the ID in the TXT record reflects when the TXT record was updated, there was a three minute gap between the updates. If Alex's mail server fetched comcast.net's MTA-STS policy during this window, it would have cached the old policy under the new ID, causing the errors seen above.

Recommendations for Domain Owners Who Use MTA-STS

You should automate MTA-STS policy publication to ensure that your MTA-STS policy always matches your MX records and that the TXT record is reliably updated, in the correct order, when your policy changes. If your policy file is served by a CDN, you have to be extra careful not to update the TXT record until your new policy file is fully propagated throughout the CDN.

I further recommend that you rotate the ID in the TXT record daily even if no changes have been made to your policy. This will force mail servers to re-download your policy file if it's more than a day old, which provides a backstop in case something goes wrong with the order of updates.

It may be tempting, but you should not reduce your policy's max_age value as this will diminish your protection against active attackers who block retrieval of your policy. Having a long max_age but a frequently rotating ID keeps your policy up-to-date in mail servers but ensures that in an attack scenario mail servers will fail safe by using a cached policy.

It's quite a bit of work to get this all right. If you want the easy option, SSLMate will automate all aspects of MTA-STS for you: all you need to do is publish two CNAME records delegating the mta-sts and _mta-sts subdomains to SSLMate-operated servers and SSLMate takes care of the rest.

Recommendations for Mail Server Developers

You should assume that domain operators are not going to properly update their TXT records and you should always attempt to re-download policy files that are more than a day old, regardless of what the ID in the TXT record says.

Is this Really Better than DNSSEC/DANE?

Thanks to MTA-STS' duplication of information and requirement for updates to be done in the right order, there is a high chance of human error when MTA-STS is deployed manually. Unfortunately, it's very likely to be deployed manually because there's a dearth of automation software, and on the surface it looks easy to manage by hand. To make matters worse, MTA-STS' caching semantics mean that the inevitable human error leads to hard-to-diagnose problems, such as a subset of mail servers being unable to mail your domain. I suspect that many problems will never be detected - email delivery will just become less reliable than it was before MTA-STS was deployed.

Meanwhile, DNSSEC is increasingly automated, and if you use a modern cloud provider like Route 53, Google Cloud DNS, or Cloudflare, you don't have to worry about remembering to sign zones before they expire, which was traditionally a major source of DNSSEC mistakes.

However, not all mail server operators support DNSSEC/DANE. Although Microsoft recently added DNSSEC/DANE support to Office 365 Exchange, Gmail only supports MTA-STS. Thus, there is still value in deploying MTA-STS despite its flaws. But we should not be happy about this state of affairs.


November 12, 2021

It's Now Possible To Sign Arbitrary Data With Your SSH Keys

Did you know that you can use the ssh-keygen command to sign and verify signatures on arbitrary data, like files and software releases? Although this feature isn't super new - it was added in 2019 with OpenSSH 8.0 - it seems to be little-known. That's a shame because it's super useful and the most viable alternative to PGP for signing data. If you're currently using PGP to sign data, you should consider switching to SSH signatures.

Here's why I like SSH signatures:

  • It's not PGP. For years, security professionals have been sounding the alarm on PGP, including its most popular implementation, GnuPG/GPG. PGP is absurdly complex, has an awful user experience, and is full of crufty old cryptography which shouldn't be touched with a ten foot pole.

  • SSH is everywhere, and people already have SSH keys. If you use Debian Bullseye or Ubuntu 20.04 or newer, you already have a new enough version of SSH installed. And if you use GitHub, or any other service that uses SSH keys for authentication, you already have an SSH key that can be used to generate signatures. This is why I'm more excited about SSH signatures than other PGP signature alternatives like signify or minisign. Signify and minisign are great, but require you to install new software and generate new keys, which will hinder widespread adoption.

  • SSH key distribution is easy. SSH public keys are one line strings that are easy to copy around. You don't need to use the Web of Trust or worry about configuring "trust levels" for keys. GitHub already acts as a key distribution service which is far easier to use and more secure than any of the PGP key servers ever were. You can retrieve the SSH public keys for any GitHub user by visiting a URL like https://github.com/USERNAME.keys. (For example, my public keys are at https://github.com/AGWA.keys.)

    (GitHub acts as a trusted third party here, and you have to trust them not to lie about people's public keys, so it may not be appropriate for all use cases. But relying on a trusted third party with a professional security team like GitHub seems like a way better default than PGP's Web of Trust, which was nigh impossible to use. Key Transparency would address the concerns with trusted third parties, if anyone ever figures out how to audit transparency logs in practice.)

  • SSH has optional lightweight certificates. You don't have to use SSH certificates (and most people shouldn't) but if certificates would make your life easier, SSH has a lightweight certificate system that is considerably simpler than X.509. This makes SSH signatures a good alternative to S/MIME as well!

You'll soon be able to sign Git commits and tags with SSH

Signing Git commits and tags gives consumers of your repository assurance that your code hasn't been tampered with. Unfortunately, you currently have to use either PGP or S/MIME, and personally I haven't bothered to sign Git tags since my PGP keys expired in 2018.

But that will soon change in Git 2.34, which adds support for SSH signatures.

Signing files

Signing a file is straightforward:

ssh-keygen -Y sign -f ~/.ssh/id_ed25519 -n file file_to_sign

Here are the arguments you may need to change:

  • ~/.ssh/id_ed25519 is the path to your private key. This is the standard path to your SSH Ed25519 private key. If you have an RSA key, use id_rsa instead.

  • file is the "namespace", which describes the purpose of the signature. SSH defines file for signing generic files, and email for signing emails. Git uses git for its signatures.

    If you are using the signature for a different purpose, such as a custom protocol, you must specify your own namespace. This prevents cross-protocol attacks whereby a valid signature is removed from a message for one protocol and attached to a message from a different protocol. If the protocols don't use distinct namespaces for their signatures, there's a risk that the signature is considered valid by the second protocol even though it was meant for the first protocol.

    Namespaces can be arbitrary strings. To ensure global uniqueness of namespaces, SSH recommends that you structure them like an email address under a domain that you own. For example, I would use a namespace like protocolname-v1@agwa.name.

  • file_to_sign is the path to the file to be signed.

The signature is written to a new file called file_to_sign.sig, which looks like this:

-----BEGIN SSH SIGNATURE-----
U1NIU0lHAAAAAQAAADMAAAALc3NoLWVkMjU1MTkAAAAg2rirQQddpzEzOZwbtM0LUMmlLG
krl2EkDq4CVn/Hw7sAAAAEZmlsZQAAAAAAAAAGc2hhNTEyAAAAUwAAAAtzc2gtZWQyNTUx
OQAAAEDyjWPjmOdG8HJ8gh1CbM8WDDWoGfm+TTd8Qa8eua9Bt5Cc+43S24i/JqVWmk98qV
YXoQmOYL4bY8t/q7cSNeMH
-----END SSH SIGNATURE-----

If you specify - for the filename, the file to sign is read from standard in and the signature is written to standard out.

Verifying signatures

Verifying signatures is a bit more involved. First you need to create an allowed signers file which maps email addresses to public keys, like this:

alice@example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINq4q0EHXacxMzmcG7TNC1DJpSxpK5dhJA6uAlZ/x8O7
alice@example.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCfHGCK5jjI/Oib4vRBLB9rG30A8y/Br9U75rfAYsitwFPFfl/CaTAvfRlW1lIBqOCshLWxGsN+PFiJCiCWzpW4iILkD5X5KcBBYHTq1ojYXb70BrQXQ+QBDcGxqQjcOp/uTq1D9Z82mYq/usI5wdz6f1KNyqM0J6ZwRXMu6u7NZaAwmY7j1fV4DRiYdmIfUDIyEdqX4a1Gan+EMSanVUYDcNmeBURqmTkkOPYSg8g5xYgcXBMOZ+V0ZUjreV9paKraUD/mVDlZbb/VyWhJGT4FLMNXHU6UHC2FFgqANMUKIlL4vhqc23MoygKbfF3HgNB6BNfv3s+GYlaQ3+66jc5j
bob@example.net ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBgQuuEvhUXerOTIZ2zoOx60M/HHJ/tcHnD84ZvTiX5b
eve@example.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFxsKcWHB9hamTXCPWKVUw0WM0S3IXH0YArf8iJE0dMG

Once you have your allowed signers file, verification works like this:

ssh-keygen -Y verify -f allowed_signers -I alice@example.com -n file -s file_to_verify.sig < file_to_verify

Here are the arguments you may need to change:

  • allowed_signers is the path to the allowed signers file.

  • alice@example.com is the email address of the person who allegedly signed the file. This email address is looked up in the allowed signers file to get possible public keys.

  • file is the "namespace", which must match the namespace used for signing as described above.

  • file_to_verify.sig is the path to the signature file.

  • file_to_verify is the path to the file to be verified. Note that this file is read from standard in. In the above command, the < shell operator is used to redirect standard in from this file.

If the signature is valid, the command exits with status 0 and prints a message like this:

Good "file" signature for alice@example.com with ED25519 key SHA256:ZGa8RztddW4kE2XKPPsP9ZYC7JnMObs6yZzyxg8xZSk

Otherwise, the command exits with a non-zero status and prints an error message.

Is it safe to repurpose SSH keys?

Short answer: yes.

Always be wary of repurposing cryptographic keys for a different protocol. If not done carefully, there's a risk of cross-protocol attacks. For example, if the structure of the messages signed by Git is similar to the structure of SSH protocol messages, an attacker might be able to forge Git artifacts by misappropriating the signature from an SSH transcript.

Fortunately, the structure of SSH protocol messages and the structure of messages signed by ssh-keygen are dissimilar enough that there is no risk of confusion.

To convince ourselves, let's consult RFC 4252 section 7, which specifies how SSH keys are traditionally used by SSH to authenticate a user logging into a server. The RFC specifies that the input to the signature algorithm has the following structure:

string    session identifier
byte      SSH_MSG_USERAUTH_REQUEST
string    user name
string    service name
string    "publickey"
boolean   TRUE
string    public key algorithm name
string    public key to be used for authentication

The first field is the session identifier, a string. In the SSH protocol, strings are prefixed by a 32-bit big endian length. The session identifier is a hash. Since hashes are short, the first three bytes of the above signature input will always be zero.

Meanwhile, the PROTOCOL.sshsig file in the OpenSSH repository specifies how SSH keys are used with ssh-keygen-generated signatures. It specifies that the input to the signature algorithm has this structure:

#define MAGIC_PREAMBLE "SSHSIG"

byte[6]   MAGIC_PREAMBLE
string    namespace
string    reserved
string    hash_algorithm
string    H(message)

Here, the first three bytes are SSH, from the magic preamble. Since the first three bytes of the SSH protocol signature input are different from the ssh-keygen signature input, the SSH client and ssh-keygen will never produce identical signatures. Therefore, there is no risk of cross-protocol attacks, and I am totally comfortable using my existing SSH keys to sign messages with ssh-keygen.


July 9, 2021

How Certificate Transparency Logs Fail and Why It's OK

Last week, a Certificate Transparency log called Yeti 2022 suffered a single bit flip, likely due to a hardware error or cosmic ray, which rendered the log unusable. Although this event will have zero impact on Web users and website operators, and was reported on an obscure mailing list for industry insiders, it captured the interest of people on Hacker News, Twitter, and Reddit. Certificate Transparency plays an essential role in ensuring security on the Web, and numerous commentators were concerned that logs could be wiped out by a single bit flip. I'm going to explain why these concerns are misplaced and why log failure doesn't worry me.

Background: Certificate Transparency (CT) is a system to log publicly-trusted SSL certificates in public, append-only logs. Website owners can monitor these logs and take action if they discover an unauthorized certificate for one of their domains. Thanks to Certificate Transparency, several untrustworthy certificate authorities have been distrusted, and the ecosystem has improved enormously compared to the pre-CT days when misissued certificates usually went unnoticed.

To ensure that CT logs remain append-only, submitted certificates are placed in the leaves of a data structure called a Merkle Tree. The leaves of the Merkle Tree are recursively hashed together with SHA-256 to produce a root hash that represents the contents of the tree. Periodically, the CT log publishes a signed statement, called a Signed Tree Head or STH, containing the current tree size and root hash. The STH is a commitment that at the specified size, the log has the specified contents. To enforce the commitment, monitors collect STHs and verify that the root hashes match the certificates downloaded from the log. If the downloaded certificates don't match the published STHs, or if a monitor detects two STHs for the same tree size with different root hashes, it means the log contents have been altered - perhaps to conceal a malicious certificate.

On June 30, 2021 at 01:02 UTC, my CT monitor, Cert Spotter, raised an alert that the root hash it calculated from the first 65,569,149 certificate entries downloaded from Yeti 2022 did not equal the root hash in the STH that Yeti 2022 had published for tree size 65,569,149.

I noticed the alert the next morning and reported the problem to the ct-policy mailing list, where these matters are discussed. The Google Chrome CT team reported that they too had observed problems, as did a user of the open source certspotter. Later that day, I drilled down and found that the problem was with entry 65,562,066 - the hash of the certificate returned by Yeti at this position was not part of the Merkle Tree.

On Thursday, the log operator reported that this entry had "shifted one bit" before the STH was signed. Curious, I calculated the correct hash of entry 65,562,066 and then tried flipping every bit and seeing if the resulting hash was part of the Merkle Tree. Sure enough, flipping the lowest bit of the first byte of the hash resulted in a hash that was part of the Merkle Tree and would ultimately produce the root hash from the STH.

There is no way for the log operator to fix this problem: they can't change entry 65,562,066 to match the errant hash, as this would require breaking SHA-2's preimage resistance, which is computationally infeasible. And since they've already published an STH for tree size 65,569,149, they can't publish an updated STH with a root hash that correctly reflects entry 65,562,066.

Consequentially, Yeti 2022 is toast. It has been made read-only and web browsers will soon cease to rely on it.

Yeti 2022 is not the first Certificate Transparency log to fail. Seven logs have previously failed, all without causing any impact to Web users.

The largest risk of log failure is that previously-issued certificates will stop working and require replacement prior to their natural expiration dates. CT-enforcing browsers only accept certificates that contain receipts, called Signed Certificate Timestamps or SCTs, from a sufficient number of approved (i.e. non-failed) CT logs. Each SCT is a promise by the respective log to publish the certificate (technically, the precertificate, which contains the same information as the certificate). What if one or more of the SCTs in a certificate is from a log which failed after the certificate was issued?

Fortunately, browser Certificate Transparency policies anticipated the possibility. At a high level, the Chrome and Apple policies require the following:

  1. At least one SCT from a log that is approved at time of certificate validation
  2. At least 2-5 SCTs (depending on certificate lifetime) from logs that were approved at time of SCT issuance

(The precise details are a bit more nuanced but don't matter for this blog post. Also, some very advanced website operators deliver SCTs using alternative mechanisms, which are subject to different rules, but this is extremely rare. Read the policies if you want the nitty gritty.)

Consequentially, a single log failure can't cause a certificate to stop working. The first requirement is still satisfied because the certificate still has at least one SCT from a currently-approved log. The second requirement is still satisfied because the failed log was approved at time of SCT issuance. The minimum number of SCTs from approved-at-issuance logs increases with certificate lifetime to reflect the increased probability of log failure: a 180 day certificate only needs 2 SCTs, whereas a 3 year certificate (back when they were allowed) needed 5 SCTs.

Note that when a log fails, its public key is not totally removed from browsers like a distrusted certificate authority's key would be. Instead, the log transitions to a state called "retired" (previously known as "disqualified"), which means its SCTs are still recognized for satisfying the second requirement. This led to an interesting question in 2020 when a log's private key was compromised: should the log be retired, or should it be totally distrusted? Counter-intuitively, it was retired, even though SCTs could have been forged to make it seem like a certificate was included in the log when it really wasn't. But that's OK, because the second requirement isn't about making sure certificates are logged, but about making sure certificate authorities aren't dumbasses. In a world of competent CAs, the second requirement wouldn't be necessary since CAs would have the good sense to include SCTs from extra logs in case some logs failed. But we do not live in a world of competent CAs - indeed, that's why CT exists - and there would no doubt be CAs embedding a single SCT in certificates if browsers didn't require redundancy.

Of course, there is still a chance that all of the SCTs in a certificate come from logs that end up failing. That would suck. But I don't think the solution is to change Certificate Transparency. Catastrophic CT failure is just one of several reasons that a certificate might need to be replaced before its natural expiration, and empirically it's the least likely reason. When a certificate authority is distrusted, as has happened several times, all of its certificates must be replaced. When a certificate is misissued, it has to be revoked and replaced, and there have been numerous incidents since 2019 in which a considerable number of certificates have required revocation - sometimes as many as 100% of a CA's active certificates:

The ecosystem is currently ill-prepared to handle mass replacement events like these, and in many of the above cases CAs missed the revocation deadline or declined to revoke entirely. Although the above misissuances had relatively low security impact, other cases, such as distrusting a compromised certificate authority, or events like Heartbleed or the Debian random number fiasco, are very security critical. This makes the inability to quickly replace certificates at scale a serious problem, larger than the problem of CT logs failing. To address the problem, Let's Encrypt is working on a specification called ACME Renewal Info (ARI) that would allow CAs to instruct TLS servers to request new certificates prior to their normal expiration. They've committed to deploying ARI or a similar technology in their staging environment by 2021-11-12.


December 17, 2020

Security Vulnerabilities in Smallstep PKI Software

I recently did a partial security review of Smallstep, a commercially-backed open source private certificate authority written in Go. I found that Smallstep is vulnerable to JSON injection, misuses JWTs, and relies on client-side enforcement of server-side security. These vulnerabilities can be exploited to obtain unauthorized certificates. This post is a full disclosure of the issues.

I reviewed the certificates repository as of commit 1feb4fcb26dc78d70bc1d9e586237a36a8fbea9c and the crypto repository as of commit 162770cad29063385cb768b0191814e4c6a94e45. The vulnerabilities are present in version 0.15.5 of step-certificates and version 0.15.3 of step-cli, and have been fixed as of version 0.15.6.

JSON Injection in Certificate Templates

Like many PKI systems, Smallstep supports user-definable certificate profiles (which they call certificate templates) to specify the contents of a certificate (the subject, SANs, extensions, etc.). Smallstep's certificate profiles are implemented as JSON objects which are templated using Go's text/template package. Using templates, you can substitute a variety of information into the profile, including information from the certificate request.

Here's an example template from Smallstep's blog post announcing the feature:

{
	"subject": {"commonName":"{{ .Insecure.CR.Subject.CommonName }}"},
	"sans": {{ toJson .SANs }},
	"keyUsage": ["digitalSignature", "keyAgreement"],
	"extKeyUsage": ["clientAuth"]
}

20 years of HTML and SQL injection vulnerabilities have shown that it's a bad idea to use raw text templating to construct syntactic data like HTML, SQL, or JSON, which is why best practice is to use context-aware templating like SQL prepared statements or Go's html/template package. When using raw text templates, it's way too easy to suffer an injection vulnerability when the author of the template inevitably forgets to escape a value. Indeed, in the above example, the commonName field is unescaped, and Smallstep is rife with other examples of unescaped data in their documentation and in Go string literals that define the default SSH and X.509 templates.

Two factors make it easy for attackers to exploit the injection vulnerability. First, if a JSON object has more than one field with the same name, Go's JSON decoder takes the value from the last one. Second, Smallstep uses json.Decoder's Decode method to decode the object instead of json.Unmarshal, which means trailing garbage is ignored (Decode reads a single JSON value from the stream and leaves the rest unread - an unfortunate footgun in Go). Thus, an attacker who can inject values into a template has total control over the resulting object, as they can override the values of earlier fields, and then end the object so later fields are ignored. For example, signing a CSR containing the following common name using the above template results in a CA:TRUE certificate with a SAN of wheeeeeee.example and no extended key usage (EKU) extension:

"}, "basicConstraints":{"isCA":true}, "sans":[{"type":"dns","value":"wheeeeeee.example"}]}

Fortunately, despite the use of unescaped values in the default templates, I found that in virtually all cases the unescaped values come from trusted sources or are validated first. The one exception is the AWS provisioner, which in the default configuration will inject an unvalidated common name from the CSR into the template. However, in this configuration the DNS SANs are also completely unvalidated, so clients are already pretty trusted. Still, the vulnerability could be used to get a certificate with arbitrary or no EKUs, giving an attacker more capabilities than they would otherwise have. (The attacker could also get a CA:TRUE certificate, although by default the issuing CA has a pathlen:0 constraint so the CA:TRUE certificate wouldn't work.)

In any case, this approach to templates is extremely concerning. This issue cannot be fixed just by updating the built-in templates to use proper escaping, as users who write their own templates are likely to forget to escape, just as scores of developers have forgotten to escape HTML and SQL. The responsible way to offer this feature is with a context-aware JSON templating system akin to html/template that automatically escapes values.

Weaknesses in AWS Provisioner

This section applies to Smallstep's AWS provisioner, which enables EC2 instances to obtain X.509 certificates which identify the instance.

To prove that it is authorized to obtain a certificate, the Smallstep client retrieves a signed instance identity document from the EC2 metadata service and sends it to the Smallstep server. The Smallstep server verifies that the instance identity document was validly signed by AWS, and that the EC2 instance belongs to an authorized AWS account. If so, the Smallstep server issues a certificate.

EC2 instance identity documents aren't bound to a particular purpose when they are signed and they never expire. An attacker who obtains a single instance identity document from any EC2 instance in their victim's account has a permanent capability to obtain certificates. Instance identity documents are not particularly protected; anyone who can make an HTTP request from the instance can obtain them. Smallstep attempts to address this in a few ways. Unfortunately, the most effective protection is off by default, and the others are ineffective.

First, instance identity documents contain the date and time that the instance was created, and Smallstep can be configured to reject instance identity documents from instances that are too old. This is a good idea and works well with architectures where an instance obtains its certificate after first bootup and never needs to obtain future certificates. However, this protection is off by default. (Also, the documentation for the configuration parameter is not very good: it's described as "the maximum duration to grant a certificate in AWS and GCP provisioners" which sounds like maximum certificate lifetime. This is not going to help people choose a sensible value for this option.)

Second, by default Smallstep only allows an EC2 instance to request one certificate. After it requests a certificate, the instance is not supposed to be able to request any more certificates. Unfortunately, this logic is enforced client-side. Specifically, the server relies on the client to generate a unique ID, which in a non-malicious implementation is derived from the instance ID. Of course, a malicious client can just generate any ID it wants and the server will think the instance has never requested a certificate before.

(Aside: Smallstep claims that the one-certificate-per-instance logic "allows" them to not validate certificate identifiers by default. This doesn't make much sense to me, as an instance which has never requested a certificate before would still have carte blanche ability to request a certificate for any identifier, including those belonging to instances which already have requested certificates. It seems to me that to get the TOFU-like security properties that they desire, they should be enforcing a one-certificate-per-identifier rule rather than one-certificate-per-instance.)

Finally, Smallstep tries to use JWT in a bizarre and futile attempt to limit the lifetime of identity documents to 5 minutes. Instead of simply sending the identity document and signature to the server, the client sticks the identity document and signature inside a JSON Web Token which is authenticated with HMAC, using the signature as the HMAC key - the same signature that is in the payload of the JWT.

This leads to an incredible sequence of code on the server side which first deserializes the JWT payload without verifying the HMAC, and then immediately deserializes the payload again, this time verifying the HMAC using the key just taken out of the unverified payload:

var unsafeClaims awsPayload
if err := jwt.UnsafeClaimsWithoutVerification(&unsafeClaims); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error unmarshaling claims")
}

var payload awsPayload
if err := jwt.Claims(unsafeClaims.Amazon.Signature, &payload); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error verifying claims")
}

Obviously, this code adds zero security, as an attacker who wants to use an expired JWT can just adjust the expiration date and then recompute the HMAC using the key that is right there in the JWT.

Edited to add: Smallstep explains that they used JWT here because their other cloud provisioners use JWT natively and therefore it was easier to use JWT here as well. It was not intended as a security measure. Indeed, the 5 minute expiration is not documented anywhere, so I wouldn't call this a vulnerability. However, it would be wrong to call it a harmless no-op: the JWT provided the means for an untrusted, client-generated ID to be transmitted to the server, which was accidentally used instead of the trusted ID from the signed instance identity document, causing the vulnerability above. This is why it's so important to avoid unnecessary code and protocol components, especially in security-sensitive contexts.

Timeline

  • 2020-12-17 18:17 UTC: full disclosure and notification to Smallstep
  • 2020-12-17 21:02 UTC: Smallstep merges patch to escape JSON in default X.509 template
  • 2020-12-17 23:52 UTC: Smallstep releases version 0.15.6 fixing the vulnerabilities, plans to address other feedback
  • 2020-12-27 19:04 UTC: Ten days later, Smallstep's blog post and documentation still show examples of unescaped JSON, giving me doubts that they have fully understood the issue

