Blog (Page 3) - Andrew Ayer

Blog

July 9, 2021

How Certificate Transparency Logs Fail and Why It's OK

Last week, a Certificate Transparency log called Yeti 2022 suffered a single bit flip, likely due to a hardware error or cosmic ray, which rendered the log unusable. Although this event will have zero impact on Web users and website operators, and was reported on an obscure mailing list for industry insiders, it captured the interest of people on Hacker News, Twitter, and Reddit. Certificate Transparency plays an essential role in ensuring security on the Web, and numerous commentators were concerned that logs could be wiped out by a single bit flip. I'm going to explain why these concerns are misplaced and why log failure doesn't worry me.

Background: Certificate Transparency (CT) is a system to log publicly-trusted SSL certificates in public, append-only logs. Website owners can monitor these logs and take action if they discover an unauthorized certificate for one of their domains. Thanks to Certificate Transparency, several untrustworthy certificate authorities have been distrusted, and the ecosystem has improved enormously compared to the pre-CT days when misissued certificates usually went unnoticed.

To ensure that CT logs remain append-only, submitted certificates are placed in the leaves of a data structure called a Merkle Tree. The leaves of the Merkle Tree are recursively hashed together with SHA-256 to produce a root hash that represents the contents of the tree. Periodically, the CT log publishes a signed statement, called a Signed Tree Head or STH, containing the current tree size and root hash. The STH is a commitment that at the specified size, the log has the specified contents. To enforce the commitment, monitors collect STHs and verify that the root hashes match the certificates downloaded from the log. If the downloaded certificates don't match the published STHs, or if a monitor detects two STHs for the same tree size with different root hashes, it means the log contents have been altered - perhaps to conceal a malicious certificate.

On June 30, 2021 at 01:02 UTC, my CT monitor, Cert Spotter, raised an alert that the root hash it calculated from the first 65,569,149 certificate entries downloaded from Yeti 2022 did not equal the root hash in the STH that Yeti 2022 had published for tree size 65,569,149.

I noticed the alert the next morning and reported the problem to the ct-policy mailing list, where these matters are discussed. The Google Chrome CT team reported that they too had observed problems, as did a user of the open source certspotter. Later that day, I drilled down that the problem was with entry 65,562,066 - the hash of the certificate returned by Yeti at this position was not part of the Merkle Tree.

On Thursday, the log operator reported that this entry had "shifted one bit" before the STH was signed. Curious, I calculated the correct hash of entry 65,562,066 and then tried flipping every bit and seeing if the resulting hash was part of the Merkle Tree. Sure enough, flipping the lowest bit of the first byte of the hash resulted in a hash that was part of the Merkle Tree and would ultimately produce the root hash from the STH.

There is no way for the log operator to fix this problem: they can't change entry 65,562,066 to match the errant hash, as this would require breaking SHA-2's preimage resistance, which is computationally infeasible. And since they've already published an STH for tree size 65,569,149, they can't publish an updated STH with a root hash that correctly reflects entry 65,562,066.

Consequentially, Yeti 2022 is toast. It has been made read-only and web browsers will soon cease to rely on it.

Yeti 2022 is not the first Certificate Transparency log to fail. Seven logs have previously failed, all without causing any impact to Web users:

In 2016, a log failed for having excessive downtime.
In 2016, a log failed because the operator reused the log's private key with a test log. Since the same private key was being used to sign STHs from two different logs, each log appeared to be violating the append-only property.
In 2017, a log failed because it violated the append-only property after its database was rolled back during a botched backup restore.
In 2017, a log failed after its operator, a Chinese blockchain company that no one had ever heard of, vanished into thin air. (I tried calling their phone number in China and got a message that their voicemail box was full.) (Maybe they thought CT was a cryptocurrency and pulled the plug when they didn't get rich?)
In 2018, a log failed because it didn't include submitted certificates in the log.
In 2018, another log failed because it didn't include submitted certificates in the log.
In 2020, a log failed because its private key was presumed compromised after the server fell victim to a Salt vulnerability.

The largest risk of log failure is that previously-issued certificates will stop working and require replacement prior to their natural expiration dates. CT-enforcing browsers only accept certificates that contain receipts, called Signed Certificate Timestamps or SCTs, from a sufficient number of approved (i.e. non-failed) CT logs. Each SCT is a promise by the respective log to publish the certificate (Technically, the precertificate, which contains the same information as the certificate). What if one or more of the SCTs in a certificate is from a log which failed after the certificate was issued?

Fortunately, browser Certificate Transparency policies anticipated the possibility. At a high level, the Chrome and Apple policies require the following:

At least one SCT from a log that is approved at time of certificate validation
At least 2-5 SCTs (depending on certificate lifetime) from logs that were approved at time of SCT issuance

(The precise details are a bit more nuanced but don't matter for this blog post. Also, some very advanced website operators deliver SCTs using alternative mechanisms, which are subject to different rules, but this is extremely rare. Read the policies if you want the nitty gritty.)

Consequentially, a single log failure can't cause a certificate to stop working. The first requirement is still satisfied because the certificate still has at least one SCT from a currently-approved log. The second requirement is still satisfied because the failed log was approved at time of SCT issuance. The minimum number of SCTs from approved-at-issuance logs increases with certificate lifetime to reflect the increased probability of log failure: a 180 day certificate only needs 2 SCTs, whereas a 3 year certificate (back when they were allowed) needed 5 SCTs.

Note that when a log fails, its public key is not totally removed from browsers like a distrusted certificate authority's key would be. Instead, the log transitions to a state called "retired" (previously known as "disqualified"), which means its SCTs are still recognized for satisfying the second requirement. This led to an interesting question in 2020 when a log's private key was compromised: should the log be retired, or should it be totally distrusted? Counter-intuitively, it was retired, even though SCTs could have been forged to make it seem like a certificate was included in the log when it really wasn't. But that's OK, because the second requirement isn't about making sure certificates are logged, but about making sure certificate authorities aren't dumbasses. In a world of competent CAs, the second requirement wouldn't be necessary since CAs would have the good sense to include SCTs from extra logs in case some logs failed. But we do not live in a world of competent CAs - indeed, that's why CT exists - and there would no doubt be CAs embedding a single SCT in certificates if browsers didn't require redundancy.

Of course, there is still a chance that all of the SCTs in a certificate come from logs that end up failing. That would suck. But I don't think the solution is to change Certificate Transparency. Catastrophic CT failure is just one of several reasons that a certificate might need to be replaced before its natural expiration, and empirically it's the least likely reason. When a certificate authority is distrusted, as has happened several times, all of its certificates must be replaced. When a certificate is misissued, it has to be revoked and replaced, and there have been numerous incidents since 2019 in which a considerable number of certificates have required revocation - sometimes as many as 100% of a CA's active certificates:

The ecosystem is currently ill-prepared to handle mass replacement events like these, and in many of the above cases CAs missed the revocation deadline or declined to revoke entirely. Although the above misissuances had relatively low security impact, other cases, such as distrusting a compromised certificate authority, or events like Heartbleed or the Debian random number fiasco, are very security critical. This makes the inability to quickly replace certificates at scale a serious problem, larger than the problem of CT logs failing. To address the problem, Let's Encrypt is working on a specification called ACME Renewal Info (ARI) that would allow CAs to instruct TLS servers to request new certificates prior to their normal expiration. They've committed to deploying ARI or a similar technology in their staging environment by 2021-11-12.

Comments

December 17, 2020

Security Vulnerabilities in Smallstep PKI Software

I recently did a partial security review of Smallstep, a commercially-backed open source private certificate authority written in Go. I found that Smallstep is vulnerable to JSON injection, misuses JWTs, and relies on client-side enforcement of server-side security. These vulnerabilities can be exploited to obtain unauthorized certificates. This post is a full disclosure of the issues.

I reviewed the certificates repository as of commit 1feb4fcb26dc78d70bc1d9e586237a36a8fbea9c and the crypto repository as of commit 162770cad29063385cb768b0191814e4c6a94e45. The vulnerabilities are present in version 0.15.5 of step-certificates and version 0.15.3 of step-cli, and have been fixed as of version 0.15.6.

JSON Injection in Certificate Templates

Like many PKI systems, Smallstep supports user-definable certificate profiles (which they call certificate templates) to specify the contents of a certificate (the subject, SANs, extensions, etc.). Smallstep's certificate profiles are implemented as JSON objects which are templated using Go's text/template package. Using templates, you can substitute a variety of information into the profile, including information from the certificate request.

Here's an example template from Smallstep's blog post announcing the feature:

{
  "subject": {"commonName":"{{ .Insecure.CR.Subject.CommonName }}"},
  "sans": {{ toJson .SANs }},
  "keyUsage": ["digitalSignature", "keyAgreement"],
  "extKeyUsage": ["clientAuth"]
}

20 years of HTML and SQL injection vulnerabilities have shown that it's a bad idea to use raw text templating to construct syntactic data like HTML, SQL, or JSON, which is why best practice is to use context-aware templating like SQL prepared statements or Go's html/template package. When using raw text templates, it's way too easy to suffer an injection vulnerability when the author of the template inevitably forgets to escape a value. Indeed, in the above example, the commonName field is unescaped, and Smallstep is rife with other examples of unescaped data in their documentation and in Go string literals that define the default SSH and X.509 templates.

Two factors make it easy for attackers to exploit the injection vulnerability. First, if a JSON object has more than one field with the same name, Go's JSON decoder takes the value from the last one. Second, Smallstep uses json.Decode to decode the object instead of json.Unmarshal, which means trailing garbage is ignored (this is an unfortunate foot gun in Go). Thus, an attacker who can inject values into a template has total control over the resulting object, as they can override the values of earlier fields, and then end the object so later fields are ignored. For example, signing a CSR containing the following common name using the above template results in a CA:TRUE certificate with a SAN of wheeeeeee.example and no extended key usage (EKU) extension:

"}, "basicConstraints":{"isCA":true}, "sans":[{"type":"dns","value":"wheeeeeee.example"}]}

Fortunately, despite the use of unescaped values in the default templates, I found that in virtually all cases the unescaped values come from trusted sources or are validated first. The one exception is the AWS provisioner, which in the default configuration will inject an unvalidated common name from the CSR into the template. However, in this configuration the DNS SANs are also completely unvalidated, so clients are already pretty trusted. Still, the vulnerability could be used to get a certificate with arbitrary or no EKUs, giving an attacker more capabilities than they would otherwise have. (The attacker could also get a CA:TRUE certificate, although by default the issuing CA has a pathlen:0 constraint so the CA:TRUE certificate wouldn't work.)

In any case, this approach to templates is extremely concerning. This issue cannot be fixed just by updating the built-in templates to use proper escaping, as users who write their own templates are likely to forget to escape, just as scores of developers have forgotten to escape HTML and SQL. The responsible way to offer this feature is with a context-aware JSON templating system akin to template/html that automatically escapes values.

Weaknesses in AWS Provisioner

This section applies to Smallstep's AWS provisioner, which enables EC2 instances to obtain X.509 certificates which identify the instance.

To prove that it is authorized to obtain a certificate, the Smallstep client retrieves a signed instance identity document from the EC2 metadata service and sends it to the Smallstep server. The Smallstep server verifies that the instance identity document was validly signed by AWS, and that the EC2 instance belongs to an authorized AWS account. If so, the Smallstep server issues a certificate.

EC2 instance identity documents aren't bound to a particular purpose when they are signed and they never expire. An attacker who obtains a single instance identity document from any EC2 instance in their victim's account has a permanent capability to obtain certificates. Instance identity documents are not particularly protected; anyone who can make an HTTP request from the instance can obtain them. Smallstep attempts to address this in a few ways. Unfortunately, the most effective protection is off by default, and the others are ineffective.

First, instance identity documents contain the date and time that the instance was created, and Smallstep can be configured to reject instance identity documents from instances that are too old. This is a good idea and works well with architectures where an instance obtains its certificate after first bootup and never needs to obtain future certificates. However, this protection is off by default. (Also, the documentation for the configuration parameter is not very good: it's described as "the maximum duration to grant a certificate in AWS and GCP provisioners" which sounds like maximum certificate lifetime. This is not going to help people choose a sensible value for this option.)

Second, by default Smallstep only allows an EC2 instance to request one certificate. After it requests a certificate, the instance is not supposed to be able to request any more certificates. Unfortunately, this logic is enforced client-side. Specifically, the server relies on the client to generate a unique ID, which in a non-malicious implementation is derived from the instance ID. Of course, a malicious client can just generate any ID it wants and the server will think the instance has never requested a certificate before.

(Aside: Smallstep claims that the one-certificate-per-instance logic "allows" them to not validate certificate identifiers by default. This doesn't make much sense to me, as an instance which has never requested a certificate before would still have carte blanche ability to request a certificate for any identifier, including those belonging to instances which already have requested certificates. It seems to me that to get the TOFU-like security properties that they desire, they should be enforcing a one-certificate-per-identifier rule rather than one-certificate-per-instance.)

Finally, Smallstep tries to use JWT in a bizarre and futile attempt to limit the lifetime of identity documents to 5 minutes. Instead of simply sending the identity document and signature to the server, the client sticks the identity document and signature inside a JSON Web Token which is authenticated with HMAC, using the signature as the HMAC key - the same signature that is in the payload of the JWT.

This leads to an incredible sequence of code on the server side which first deserializes the JWT payload without verifying the HMAC, and then immediately deserializes the payload again, this time verifying the HMAC using the key just taken out of the unverified payload:

var unsafeClaims awsPayload
if err := jwt.UnsafeClaimsWithoutVerification(&unsafeClaims); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error unmarshaling claims")
}

var payload awsPayload
if err := jwt.Claims(unsafeClaims.Amazon.Signature, &payload); err != nil {
	return nil, errs.Wrap(http.StatusUnauthorized, err, "aws.authorizeToken; error verifying claims")
}

Obviously, this code adds zero security, as an attacker who wants to use an expired JWT can just adjust the expiration date and then recompute the HMAC using the key that is right there in the JWT.

Edited to add: Smallstep explains that they used JWT here because their other cloud provisioners use JWT natively and therefore it was easier to use JWT here as well. It was not intended as a security measure. Indeed, the 5 minute expiration is not documented anywhere, so I wouldn't call this a vulnerability. However, it would be wrong to call it a harmless no-op: the JWT provided the means for an untrusted, client-generated ID to be transmitted to the server, which was accidentally used instead of the trusted ID from the signed instance identity document, causing the vulnerability above. This is why it's so important to avoid unnecessary code and protocol components, especially in security-sensitive contexts.

Timeline

2020-12-17 18:17 UTC: full disclosure and notification to Smallstep
2020-12-17 21:02 UTC: Smallstep merges patch to escape JSON in default X.509 template
2020-12-17 23:52 UTC: Smallstep releases version 0.15.6 fixing the vulnerabilities, plans to address other feedback
2020-12-27 19:04 UTC: Ten days later, Smallstep's blog post and documentation still show examples of unescaped JSON, giving me doubts that they have fully understood the issue

Comments

November 30, 2020

The Lengths People Go To Just To Avoid DNSSEC

Connecting to a website, say example.com, over TLS is a relatively straightforward affair. The client looks up the DNS A/AAAA record for example.com, connects to the IP address over TLS, and confirms that the presented certificate is valid for example.com.

In contrast, connecting to other services, like XMPP or SMTP, over TLS is less straightforward. That's because clients don't directly look up the A/AAAA record for example.com. Instead they look up a SRV record (for XMPP) or an MX record (for SMTP) which contains the hostname of the XMPP or SMTP server. Then they look up the A/AAAA record of that hostname and connect to it. This layer of indirection makes it easy to delegate the operation of certain services to other hosts. For instance, if example.com wants to use Gmail for their email, their MX record would contain aspmx.l.google.com.

This raises the question of which hostname the certificate should certify: the original hostname (example.com), or the hostname listed in the SRV/MX record (aspmx.l.google.com). Both options have problems.

Approach 1, using the original hostname (example.com), is undesirable because there is no automated way for the operator of a service like SMTP or XMPP to obtain certificates for the hostnames which are delegated to them. Google can automatically get a certificate for aspmx.l.google.com because they own google.com, but they can't for example.com. The admins of example.com would have to request the certificate themselves and give it to their SMTP and XMPP providers. Such a manual approach is bound to cause outages as people forget to renew expiring certificates. But there's another problem: the certificate would permit the SMTP and XMPP providers to impersonate all services for example.com. You probably don't want your instant messaging provider to be able to impersonate your website.

Approach 2, using the SRV/MX hostname (aspmx.l.google.com), doesn't have these problems. It's quite easy for the service operator to automatically obtain a certificate for their own domain. Unfortunately, this approach is not secure. Since the DNS lookup for the SRV/MX record is most likely unauthenticated, a man-in-the-middle attacker could intercept the query and return a rogue record that says the SMTP or XMPP service for example.com is handled by a server operated by the attacker. The attacker would have no problem obtaining a valid certificate for their own domain.

Perhaps the most obvious solution to this problem is to just make the DNS lookup authenticated. The IETF has been trying to do this since 1997 with DNSSEC. More than 20 years later, the results are not promising: fewer than 2% of .com domains support DNSSEC, only 25% of Internet users validate DNSSEC even when it is supported, the use of insecure crypto like 1024 bit RSA and SHA-1 is still rampant, and DNSSEC is so hard to deploy correctly that outages are common.

Unsurprisingly, many people would like to avoid the DNSSEC quagmire, which has led to some very interesting workarounds...

POSH

One workaround is POSH, or "PKIX over Secure HTTP", standardized in RFC 7711. With POSH, the owner of example.com publishes a JSON document at https://example.com/.well-known/posh/SERVICE.json containing a list of certificate fingerprints which are allowed to be used for the given SERVICE (e.g. xmpp-server) on example.com. To connect to example.com's XMPP server, a POSH-aware client would first retrieve https://example.com/.well-known/posh/xmpp-server.json, ensuring that the HTTPS server presents a valid, publicly-trusted certificate for example.com. It would then connect to the XMPP server indicated in example.com's XMPP SRV record, and ensure that it presents a certificate whose fingerprint is listed in the JSON document.

POSH solves the problems presented above. It's secure, because the JSON document containing the fingerprints is authenticated by a certificate for example.com, which remains under the control of the owner of example.com. The operator of the XMPP service doesn't need to obtain a publicly-trusted certificate for example.com. To facilitate certificate rotation, POSH supports delegation: the JSON file at example.com can reference the URL of a different JSON file which is hosted by the XMPP operator. This gives the XMPP operator flexibility to rotate certificates at any time without needing to inform their customers to update their JSON documents.

Although POSH was designed to be protocol agnostic, it was only ever used with XMPP, and even then I could only find a few XMPP clients which support it. It's fair to say POSH never caught on.

MTA-STS

A more recent workaround, for server-to-server SMTP only, is MTA-STS, standardized in RFC 8461. The underlying concept of MTA-STS is the same as POSH: the owner of example.com publishes a document over HTTPS with information about how to validate secure SMTP connections to example.com's mail servers. Several details are different. One cosmetic difference is that the document is published at mta-sts.example.com rather than example.com, which simplifies deployment for domain owners who can't easily make changes to their main website. A more fundamental difference is that instead of listing certificate fingerprints, the document lists the hostnames which are allowed in the MX record set. When connecting to an SMTP server for example.com, the client verifies both that it's connecting to a server listed in the MTA-STS document at mta-sts.example.com, and that the server presents a publicly-trusted certificate valid for the SMTP server's hostname.

The amusing thing about MTA-STS is that it basically boils down to duplicating the contents of the MX record in a document that is published over HTTPS, leveraging the WebPKI to authenticate the MX record rather than DNSSEC. It's kind of incredible that this is considered easier than using DNSSEC, despite having more moving parts and requiring duplication. That says way more about DNSSEC being a failure than about MTA-STS being good, and I've written before about how I think MTA-STS will prove hard to deploy in practice. I suggested how DNS providers could alleviate the problems by automating MTA-STS for their customers, but I'm not aware of any DNS providers to do so.

I am happy to report that SSLMate now offers MTA-STS automation as part of Cert Spotter. It's not quite as seamless as what a DNS provider could offer, but it's pretty good: Cert Spotter continuously monitors your domains' MX records and automatically publishes an appropriate MTA-STS policy for you, obtaining and renewing the necessary SSL certificates. All you need to do is publish two CNAME records delegating the MTA-STS-related subdomains (mta-sts and _mta-sts) to SSLMate-operated servers. Cert Spotter updates the policy automatically any time it detects a change to your MX records, ensuring the policy never falls out of sync with your DNS. For transparency, Cert Spotter emails you when this happens, so you can detect unauthorized MX record changes. Cert Spotter will also alert you if any of your MX servers have a TLS or certificate problem that would prevent MTA-STS from working.

Looking forward: SRVName certificates

There is another solution which, if it's ever deployed, will be much nicer than POSH or MTA-STS: SRVName certificates. SRVName certificates authenticate not just a domain name, like normal certificate, but a particular service running on that domain. For example, you could get a certificate that's valid for only SMTP on example.com, or only XMPP on example.com. This solves the security problem of the first approach above: the owner of example.com can give their mail server operator a SRVName certificate that's valid only for SMTP, allowing the mail server to operate an SMTP service for example.com, but not impersonate any other example.com services. Assuming the validation rules are flexible enough, the SMTP service operator could even obtain the SRVName certificate themselves provided the certificate authority validates that they are listed in the MX record for example.com.

Technically speaking, SRVName is a type of subject alternative name (SAN) which can be placed in a certificate, akin to the DNS SAN which certificates use today for authenticating domain names. It's possible for a certificate to contain both SRVName and DNS SANs, and here I use "SRVName certificate" to mean a certificate containing a SRVName SAN.

Unfortunately, there's a major obstacle blocking SRVName certificates: technically-constrained subordinate certificate authorities. A technically-constrained sub-CA is a certificate authority which is restricted to issuing certificates only for namespaces that are enumerated in the sub-CA certificate's name constraints field. For example, an enterprise that needs to issue a large number of certificates might operate a publicly-trusted, technically-constrained sub-CA that is constrained to the enterprise's domains and IP address ranges. Since their sub-CA can only issue certificates for namespaces that they control, they're allowed to operate it under looser security standards than unconstrained publicly-trusted CAs.

The problem is that if a particular type of SAN (in this case, the SRVName SAN) isn't listed in the name constraints as either allowed or denied, the standard says that all instances of that SAN type are allowed by default. Allow-by-default is usually a bad idea when security is concerned, and this case is no different. Clients can't accept SRVName certificates because it would be unsafe: every existing technically-constrained sub-CA that doesn't have SRVName in its name constraints field has unconstrained ability to issue SRVName certificates. Unfortunately, the Baseline Requirements (the rules governing public certificate issuance) only require technically-constrained sub-CAs to have DNS, IP, and Directory Name constraints. Consequentially, there are many existing technically-constrained sub-CAs out there that would need to be revoked and reissued with SRVName constraints before it's safe to deploy SRVName certificates.

Will that ever happen? Who knows. In the meantime, we're stuck with hacks like MTA-STS.

Comments

June 25, 2020

Writing an SNI Proxy in 115 Lines of Go

The very first message sent in a TLS connection is the Client Hello record, in which the client greets the server and tells it, among other things, the server name it wants to connect to. This is called Server Name Indication, or SNI for short, and it's quite handy as it allows many different servers to be co-located on a single IP address.

The server name is sent in plaintext, which is unfortunately really bad for privacy and censorship resistance, but does enable something very useful: a proxy server can read the server name and use it to decide where to route the connection, without having to decrypt the connection. You can leverage this to make many different physical servers accessible from the Internet even if you have only one public IPv4 address: the proxy listens on your public IP address and forwards connections to the appropriate private IP address based on the SNI.

I just finished writing such a proxy server, which I plan to run on my home network's router so that I can easily access my internal servers from anywhere on the Internet, without a VPN or SSH port forwarding. I was pleased by how easy it was to write this proxy server using only Go's standard library. It's a great example of how well-suited Go is for programs involving networking and cryptography.

Let's start with a standard listen/accept loop (right out of the examples for Go's net package):

func main() {
	l, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := l.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handleConnection(conn)
	}
}

Here's a sketch of the handleConnection function, which reads the Client Hello record from the client, dials the backend server indicated by the Client Hello, and then proxies the client to and from the backend. (Note that we dial the backend using the SNI value, which works well with split-horizon DNS where the proxy sees the backend's private IP address and external clients see the proxy's public IP address. If that doesn't work for you, can use more complicated routing logic.)

func handleConnection(clientConn net.Conn) {
	defer clientConn.Close()

	// ... read Client Hello from clientConn ...

	backendConn, err := net.Dial("tcp", net.JoinHostPort(clientHello.ServerName, "443"))
	if err != nil {
		log.Print(err)
		return
	}
	defer backendConn.Close()

	// ... proxy clientConn <==> backendConn ...
}

Let's assume for now we have a convenient function to read a Client Hello record from an io.Reader and return a tls.ClientHelloInfo:

func readClientHello(reader io.Reader) (*tls.ClientHelloInfo, error)

We can't simply call this function from handleConnection, because once the Client Hello is read, the bytes are gone. We need to preserve the bytes and forward them along to the backend, which is expecting a proper TLS connection that starts with a Client Hello record.

What we need to do instead is "peek" at the Client Hello record, and thanks to some simple but powerful abstractions from Go's io package, this can be done with just six lines of code:

func peekClientHello(reader io.Reader) (*tls.ClientHelloInfo, io.Reader, error) {
        peekedBytes := new(bytes.Buffer)
        hello, err := readClientHello(io.TeeReader(reader, peekedBytes))
        if err != nil {
                return nil, nil, err
        }
        return hello, io.MultiReader(peekedBytes, reader), nil
}

What this code does is create a TeeReader, which is a reader that wraps another reader and writes everything that is read to a writer, which in our case is a byte buffer. We pass the TeeReader to readClientHello, so every byte read by readClientHello gets saved to our buffer. Finally, we create a MultiReader which essentially concatenates our buffer with the original reader. Reads from the MultiReader initially come out of the buffer, and when that's exhausted, continue from the original reader. We return the MultiReader to the caller along with the ClientHelloInfo. When the caller reads from the MultiReader it will see a full TLS connection stream, starting with the Client Hello.

Now we just need to implement readClientHello. We could open up the TLS RFCs and learn how to parse a Client Hello record, but it turns out we can let crypto/tls do the work for us, thanks to a callback function in tls.Config called GetConfigForClient:

// GetConfigForClient, if not nil, is called after a ClientHello is
// received from a client.
GetConfigForClient func(*ClientHelloInfo) (*Config, error) // Go 1.8

Roughly, what we need to do is create a TLS server-side connection with a GetConfigForClient callback that saves the ClientHelloInfo passed to it. However, creating a TLS connection requires a full-blown net.Conn, and readClientHello is passed merely an io.Reader. So let's create a type, readOnlyConn, which wraps an io.Reader and satisfies the net.Conn interface:

type readOnlyConn struct {
        reader io.Reader
}
func (conn readOnlyConn) Read(p []byte) (int, error)         { return conn.reader.Read(p) }
func (conn readOnlyConn) Write(p []byte) (int, error)        { return 0, io.ErrClosedPipe }
func (conn readOnlyConn) Close() error                       { return nil }
func (conn readOnlyConn) LocalAddr() net.Addr                { return nil }
func (conn readOnlyConn) RemoteAddr() net.Addr               { return nil }
func (conn readOnlyConn) SetDeadline(t time.Time) error      { return nil }
func (conn readOnlyConn) SetReadDeadline(t time.Time) error  { return nil }
func (conn readOnlyConn) SetWriteDeadline(t time.Time) error { return nil }

readOnlyConn forwards reads to the reader and simulates a broken pipe when written to (as if the client closed the connection before the server could reply). All other operations are a no-op.

Now we're ready to write readClientHello:

func readClientHello(reader io.Reader) (*tls.ClientHelloInfo, error) {
        var hello *tls.ClientHelloInfo

        err := tls.Server(readOnlyConn{reader: reader}, &tls.Config{
                GetConfigForClient: func(argHello *tls.ClientHelloInfo) (*tls.Config, error) {
                        hello = new(tls.ClientHelloInfo)
                        *hello = *argHello
                        return nil, nil
                },
        }).Handshake()

        if hello == nil {
                return nil, err
        }

        return hello, nil
}

Note that Handshake always fails because the readOnlyConn is not a real connection. As long as the Client Hello is successfully read, the failure should only happen after GetConfigForClient is called, so we only care about the error if hello was never set.

Let's put everything together to write the full handleConnection function. I've added deadlines (thanks, Filippo!) and a check that the SNI value ends with .internal.example.com to prevent this from being used as an open proxy. When I deploy this, I will use the DNS suffix of my home network.

func handleConnection(clientConn net.Conn) {
	defer clientConn.Close()

	if err := clientConn.SetReadDeadline(time.Now().Add(5 * time.Second)); err != nil {
		log.Print(err)
		return
	}

	clientHello, clientReader, err := peekClientHello(clientConn)
	if err != nil {
		log.Print(err)
		return
	}

	if err := clientConn.SetReadDeadline(time.Time{}); err != nil {
		log.Print(err)
		return
	}

	if !strings.HasSuffix(clientHello.ServerName, ".internal.example.com") {
		log.Print("Blocking connection to unauthorized backend")
		return
	}

	backendConn, err := net.DialTimeout("tcp", net.JoinHostPort(clientHello.ServerName, "443"), 5*time.Second)
	if err != nil {
		log.Print(err)
		return
	}
	defer backendConn.Close()

        var wg sync.WaitGroup
        wg.Add(2)

        go func() {
                io.Copy(clientConn, backendConn)
                clientConn.(*net.TCPConn).CloseWrite()
                wg.Done()
        }()
        go func() {
                io.Copy(backendConn, clientReader)
                backendConn.(*net.TCPConn).CloseWrite()
                wg.Done()
        }()

        wg.Wait()
}

Here's the complete Go source code - just 115 lines! (Not counting copyright legalese)

Comments

June 18, 2020

Security Review of CFSSL Signer Code

Certificate signing is the most security-sensitive task performed by a certificate authority. The CA has to sign values, like DNS names, that are provided by untrusted sources. The CA must rigorously validate these values before signing them. If an attacker can bypass validation and get untrusted data included in a certificate, the results can be dire. For example, if an attacker can trick a CA into including an arbitrary SAN extension, they can get a certificate for domains they don't control.

Unfortunately, there is a history of CAs including unvalidated information in certificates. A common cause is CAs copying information directly from CSRs instead of from more constrained information sources. Since CSRs can contain both subject identity information and arbitrary certificate extensions, directly ingesting CSRs is extremely error-prone for CAs. For this reason, CAs would be well-advised to extract the public key from the CSR very early in the certificate enrollment process and discard everything else. In a perfect world, CAs would accept standalone public keys from subscribers instead of CSRs. (Before you say "proof-of-possession", see the appendix.)

I decided to review the signing code in CFSSL, the open source PKI toolkit used by Let's Encrypt's Boulder, to see how it stacks up against this advice. Unfortunately, I found that CFSSL copies subject identity information from CSRs by default, has features that are hard to use safely, and uses complicated logic that obfuscates what is included in certificate fields. I recommend that publicly-trusted CAs not use CFSSL.

Update: since publication of this post, Let's Encrypt has begun moving away from CFSSL!

Scope of review

I reviewed the CFSSL Git repository as of commit 6b49beae21ff90a09aea3901741ef02b1057ee65 (the HEAD of master at the time of my review). I reviewed the code in the signer and signer/local packages.

The signing operation

In CFSSL, you sign a certificate by invoking the Sign function on the signer.Signer interface, which has this signature:


Sign(req SignRequest) (cert []byte, err error)

There is only one actual implementation of signer.Signer: local.Signer. (The other implementations, remote.Signer and universal.Signer, are ultimately wrappers around local.Signer.)

Inputs to the signing operation

At a high level, inputs to the signing operation come from three to four places:

The Signer object, which contains:
- The private key and certificate of the CA
- A list of named profiles plus a default profile
- A signature algorithm
I will refer to this object as Signer.
The SignRequest argument, whose relevant fields are:
Hosts []string Request string // The CSR Subject *Subject Profile string CRLOverride string Serial *big.Int Extensions []Extension NotBefore time.Time NotAfter time.Time
I will refer to this object as SignRequest.
The Signer's default certificate profile, represented by an instance of the SigningProfile struct. I will refer to the default profile as defaultProfile.
The effective certificate profile, represented by an instance of the SigningProfile struct. I will refer to the effective profile as profile. If the profile named by SignRequest.Profile exists in Signer, then profile is that profile. If it doesn't exist, then profile equals defaultProfile.

The Sign function takes values from these places and combines them to produce the input to x509.CreateCertificate in Go's standard library. There is overlap - for instance SANs can be specified in the CSR, SignRequest.Hosts, or SignRequest.Extensions. How does Sign decide which source to use when constructing the certificate?

Certificate construction logic

To understand how Sign works, I looked at each certificate field and worked backwards to figure out how Sign decides to populate the field. Below are my findings.

Serial Number

If profile.ClientProvidesSerialNumbers is true: use SignRequest.Serial (error if not set).
Else: generate a random 20 byte serial number.

Not Before

If SignRequest.NotBefore is non-zero: use it.
Else if profile.NotBefore is non-zero: use it.
Else if profile.Backdate is non-zero: use current time - profile.Backdate.
Else: use current time - 5 minutes.

Not After

If SignRequest.NotAfter is non-zero: use it.
Else if profile.NotAfter is non-zero: use it.
Else if profile.Expiry is non-zero: use not before + profile.Expiry.
Else: use not before + defaultProfile.Expiry.

Signature Algorithm

If profile.CSRWhitelist is nil or profile.CSRWhitelist.SignatureAlgorithm is true: Use Signer's signature algorithm.
Else: CFSSL leaves the signature algorithm unspecified and the Go standard library picks a sensible algorithm.

Comments: it's weird how something named CSRWhitelist is used to decide whether to use a value that comes not from the CSR, but from Signer. This is probably because CFSSL's ParseCertificateRequest function gets this field from Signer rather than from the CSR that it is parsing. This sort of indirection and misleading naming makes the code hard to understand.

Public Key

If profile.CSRWhitelist is nil or profile.CSRWhitelist.PublicKey is is true: Use the CSR's public key.
Else: the certificate won't have a public key (this probably causes x509.CreateCertificate to return an error).

Comments: it's unclear why you'd ever want profile.CSRWhitelist.PublicKey to be false. The public key is literally the only piece of information that should be taken from the CSR.

SANs

This one's a doozy...

If profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a SAN extension and SignRequest.Extensions contains a SAN extension, and the SAN OID is present in profile.ExtensionWhitelist: add two SAN extensions to the certificate, one from the CSR and one from SignRequest.Extensions. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a SAN extension: use the SAN extension verbatim from the CSR. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if SignRequest.Extensions contains a SAN extension, and the SAN OID is present in profile.ExtensionWhitelist: use the SAN extension verbatim from SignRequest.Extensions. Note that SignRequest.Hosts is ignored and profile.NameWhitelist is bypassed.
Else if profile.CAConstraint.IsCA is true: the certificate will not contain a SAN extension.
Else if SignRequest.Hosts is non-nil:
1. Use each string in SignRequest.Hosts as follows:
  - If string parses as an IP address: make it an IP Address SAN.
  - Else if string parses as an email address: make it an email address SAN.
  - Else if string parses as a URI, make it a URI SAN.
  - Else: make it a DNS SAN.
2. If profile.NameWhitelist is non-nil: return an error unless the string representation of every DNS, email, and URI SAN matches the profile.NameWhitelist regex (IP address SANs are not checked).
Else if profile.CSRWhitelist is nil and the CSR contains a SAN extension:
1. Copy the DNS names, IP addresses, email addresses, and URIs from the CSR's SAN extension.
2. If profile.NameWhitelist is non-nil: enforce whitelist as described above.
Else if profile.CSRWhitelist is non-nil and the CSR contains a SAN extension:
1. If profile.CSRWhitelist.DNSNames is true: use DNS names from the CSR's SAN extension.
2. If profile.CSRWhitelist.IPAddresses is true: use IP addresses from the CSR's SAN extension.
3. If profile.CSRWhitelist.EmailAddresses is true: use email addresses from the CSR's SAN extension.
4. If profile.CSRWhitelist.URIs is true: use URIs from the CSR's SAN extension.
5. If profile.NameWhitelist is non-nil: enforce whitelist as described above.

Subject

For each supported subject attribute (common name, country, province, locality, organization, organizational unit, serial number):

If the attribute was specified in SignRequest.Subject: use it.
Else if profile.CSRWhitelist is nil or profile.CSRWhitelist.Subject is true: use the attribute from the CSR's subject, if present.

Common name only: if profile.NameWhitelist is non-nil: return an error unless the common name matches the profile.NameWhitelist regex.

Note: SignRequest.Hosts does not override the common name.

Basic Constraints

If SignRequest.Extensions contains a basic constraints extension, and the basic constraints OID is present in profile.ExtensionWhitelist: copy the basic constraints extension verbatim from SignRequest.Extensions.
Else: use the values from profile.CAConstraint.

Comments: given how security-sensitive this extension is, it's a relief that there's no way for the value to come from the CSR. Despite this, there is code earlier in the signing process that looks at the CSR's Basic Constraints extension. First it's extracted from the CSR in ParseCertificateRequest and then it's validated in Sign. This code ultimately has no effect, but it makes the logic harder to follow (and gave me a mild heart attack when I saw it).

Extensions besides SAN and Basic Constraints

For a given extension FOO:

If profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a FOO extension and SignRequest.Extensions contains a FOO extension, and FOO is present in profile.ExtensionWhitelist: add two FOO extensions to the certificate, one from the CSR and one from SignRequest.Extensions. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else if profile.CopyExtensions is true and profile.CSRWhitelist is nil and the CSR contains a FOO extension: copy it verbatim from the CSR. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else if SignRequest.Extensions contains a FOO extension, and FOO is present in profile.ExtensionWhitelist: copy it verbatim to the certificate. Note that fields in SignRequest (like CRLOverride) or profile (like OCSP, CRL, etc.) that would normally control the FOO extension are ignored.
Else: use fields from SignRequest (like CRLOverride) and profile (like OCSP, CRL, etc.) to decide what value the extension should have, if any.

Other comments

By default, CSRWhitelist is nil. This is a bad default, as it means SANs will be copied from the CSR unless SignRequest.Hosts is set. Likewise, any subject attribute not specified in SignRequest.Subject will be copied from the CSR. This is practically impossible to use safely: to avoid including unvalidated subject information you have to specify a value for every attribute in SignRequest.Subject - and if you don't want the attribute included in the final certificate you're out of luck. If CFSSL ever adds support for a new attribute type, you had better update your code to specify a value for the attribute or unvalidated information might slip through. This is exactly the sort of logic that makes it so easy to accidentally issue certificates with "Some-State" in the subject.

If the profile specified by SignRequest.Profile doesn't exist, the default profile is used. This could lead to an unexpected certificate profile being used if a CA deletes a profile from their configuration but there are still references to it elsewhere. Considering the trouble that CAs have with profile management (see the infamous TURKTRUST incident or the CA that discovered they had a whopping 85 buggy profiles), I think it would be much safer if a non-existent profile resulted in an error.

SignRequest.Hosts is untyped - everything is a string and there is no distinction between IP addresses, email addresses, URIs, and DNS names. (Also, Hosts is a misleading name because URIs and email addresses aren't hosts.) CFSSL decides what type of SAN to include based on what the string in Hosts successfully parses as, and assumes it's a DNS name if it doesn't parse as anything else. This could lead to unexpected SAN types in the certificate. Determining if a string was intended to be a URI by trying to parse it is an especially bad idea considering how hellish URIs are to parse, and how much variation there is between different URI parsing implementations. If the user of CFSSL adds a string which they believe to be a valid URI to SignRequest.Hosts, but Go's URI parser rejects it, the URI will end up in a DNS SAN instead.

Variable names are inconsistent and often unhelpful. In Sign, req is used for values from SignRequest and safeTemplate is used for values from the CSR. But in PopulateSubjectFromCSR (which is called by Sign), req is used for values from the CSR, and s is used for values from the SignRequest. This increases the likelihood of accidentally using data from the wrong source.

ParseCertificateRequest blindly and unconditionally copies the extensions from the CSR to the Extensions field of the x509.Certificate template - even if profile.CopyExtensions is false. Fortunately, this field is ignored by x509.CreateCertificate so it's probably harmless. It just means that attacker-controlled input is propagated further through the program, increasing the opportunity for it to be misused.

CopyExtensions is a foot cannon

I am extremely concerned by the presence of the CopyExtensions option. Enabling it practically guarantees misissuance because all extensions (except Basic Constraints) are copied verbatim from the CSR, overriding any value specified in the profile or the SignRequest. In particular, SignRequest.Hosts and profile.NameWhitelist are ignored if the CSR contains a SAN extension. Also, profile.ExtensionWhitelist only applies to extensions specified in SignRequest - not those specified in the CSR. I think it's quite likely that users of CopyExtensions will be surprised when neither of these whitelists are effective.

Lack of documentation

As I showed above, the logic for constructing a certificate is very complicated, and you have to use CFSSL in exactly the right way to avoid copying unvalidated information from CSRs. Unfortunately, documentation is practically non-existent and I could only figure out CFSSL's logic by reading the source code. Obviously, the lack of documentation makes it hard to use CFSSL safely. But the more fundamental problem is that documentation writing wasn't a core part of CFSSL's engineering process. Had documentation been written in tandem with the design and implementation of CFSSL, it would have been evident that incomprehensibility was spiraling out of control. This information could have been fed back into the engineering process and used to redesign or even reject features that made the system too hard to understand. I have personally saved myself many times from releasing overly-complicated software just by writing the documentation for it.

Final thoughts

CFSSL has some nice features, like its friendly command line interface and its certificate bundler for building optimal certificate chains. However, I struggle to see the value provided by its signer package. Its truly useful functionality, like Certificate Transparency submission and pre-issuance linting, could be extracted into standalone libraries. The rest of the signer is just a complicated wrapper around Go's x509.CreateCertificate that obscures what gets included in certificates and will include the wrong thing if you hold it wrong. A long history of misissuance shows us why we need better. If you're a CA, just call x509.CreateCertificate directly - it will be much easier to ensure you are only including validated information in your certificates.

Appendix: Proof-of-Possession and TLS

A common but unfounded objection to discarding everything in a CSR except the public key is that checking the CSR's signature is necessary because it ensures proof-of-possession of the private key. If a CA doesn't verify proof-of-possession, then someone could obtain a certificate for a key which belongs to someone else. (In fact, someone recently got a certificate containing Let's Encrypt's public key.) For TLS, this doesn't matter. (Other protocols, like S/MIME, may be different.) The TLS protocol ensures proof-of-possession every time the certificate is used.

For TLS 1.3, this is easy to see: the server or client has to send a Certificate Verify message which contains a signature from their private key over a transcript of the handshake. The handshake includes their certificate, which is a superset of the information in a CSR. Therefore, the Certificate Verify message proves at least as much as the CSR signature does. In fact it's better, since the proof is fresh and not reliant on a trusted third party doing its job correctly.

In earlier versions of TLS, client certificates are verified in the same way (signing a handshake transcript which includes the certificate). Server certificates are used differently, but ultimately the handshake transcript (which includes the server certificate) is authenticated by a shared secret that is known only to the client and the holder of the certificate private key (provided neither party deliberately sabotages their security). So as with TLS 1.3, private key possession is proven, rendering the CSR signature unnecessary.

Comments

Older Posts Newer Posts

Andrew Ayer

Sections

Blog

How Certificate Transparency Logs Fail and Why It's OK

Security Vulnerabilities in Smallstep PKI Software

JSON Injection in Certificate Templates

Weaknesses in AWS Provisioner

Timeline

The Lengths People Go To Just To Avoid DNSSEC

POSH

MTA-STS

Looking forward: SRVName certificates

Writing an SNI Proxy in 115 Lines of Go

Security Review of CFSSL Signer Code

Scope of review

The signing operation

Inputs to the signing operation

Certificate construction logic

Serial Number

Not Before

Not After

Signature Algorithm

Public Key

SANs

Subject

Basic Constraints

Extensions besides SAN and Basic Constraints

Other comments

CopyExtensions is a foot cannon

Lack of documentation

Final thoughts

Appendix: Proof-of-Possession and TLS