Skip to Content [alt-c]

February 8, 2020

Short Take: Why Trust-On-First-Use Doesn't Work (Even for SSH)

Considering all the progress that has been made over the last decade making SSL certificates on the Web easy, free, automated, and transparent, it's a bit jarring to see someone arguing in 2020 that trust-on-first-use (TOFU) would be better for the Web:

Unpopular opinion. Most people would be better off with a Trust On First Use system for accessing sites. Like SSH, perhaps with some unique (per user) OOB addition to it. Would we really design it this way of starting again?

— Nick Hutton @nickdothutton, Feb 6, 2020

First, be wary of any comparison with SSH, because in the grand scheme of things, very few people use SSH. *nix sysadmins do, obviously. Many, but not all, software developers do. Some people in engineering/science fields might. But that's a drop in the bucket compared to the Web, which basically everyone uses. So just because something appears to work for SSH doesn't mean it will work for the Web.

And I would argue that TOFU actually doesn't work very well for SSH, and the only reason we put up with it is because of SSH's low deployment. SSH server host keys rarely change (which is bad for post-compromise security, so this is nothing to celebrate), but when they do, SSH handles it very poorly. The user gets a big scary message about a possible man-in-the-middle attack. And then what do you think they do? They do this:

Hi all,

It appears that as of midnight last night, SSH and login are working. However, there were a couple students last night who were getting errors such as “REMOTE HOST IDENTIFICATION HAS CHANGED!” or “POSSIBLE DNS SPOOFING DETECTED!” when trying to SSH in.

To fix this, you can run `ssh-keygen -R [REDACTED]` then try to SSH in again. I believe someone else mentioned last night that you could also just delete the entire ~/.ssh/known_hosts file as well to fix the issue, but this seems to be a less destructive solution.

That's from a real email that I once received. I would not be at all surprised if TOFU actually devolves to opportunistic encryption in practice, because users just bypass any man-in-the-middle error they receive.

You could make it really hard to bypass man-in-the-middle errors, but then people would brick their servers, as happened with HTTP public key pinning, which is one of the reasons why that technology is now extinct.

Proponents of TOFU might say that even if TOFU devolves to opportunistic encryption, the man-in-the-middle errors at least make attacks noisy. True, but the errors are seen by people who generally don't know what they mean and even if they did, can't evaluate whether an error is a legitimate key change or an actual attack. In contrast, a PKI with Certificate Transparency (i.e. the system currently deployed on the Web) also makes attacks noisy, but alerts about new certificates go to server operators, who actually know whether a new certificate is legitimate or not. They just need to be monitoring Certificate Transparency logs.

So yes, I do believe we would design the Web this way if starting again.

Comments

February 3, 2020

When Will Your DNS Record Be Published?

When publishing a DNS record through an API, it's often useful to know when the DNS record has been fully published and is visible to DNS resolvers. A perfect example which comes up at SSLMate is automatically validating a certificate request by publishing a DNS record. SSLMate must be sure that the DNS record is visible before it tells the certificate authority to validate it, or the certificate request may fail.

Unfortunately, I know of only one DNS provider that has an API to tell you when a change is published: Route 53. After submitting a DNS change request to Route 53, the API returns a ChangeInfo object which contains a status of either "PENDING" or "INSYNC". You can poll the change until its status becomes "INSYNC", which means the change has taken effect on all Route 53 servers. SSLMate has published a lot of DNS records through Route 53 and this API has never let me down, which makes me happy.

Other DNS providers offer absolutely nothing to help you determine when a DNS change is visible. In these cases, SSLMate can do nothing but sleep for 10-120 seconds (depending on the provider) and hope for the best. Unfortunately, it doesn't help for SSLMate to try to resolve the DNS record to see if the record has been published - modern authoritative DNS services use many different servers, often with anycast or load balancing, so just because SSLMate sees the record doesn't mean that others will.

And then there's Google Cloud DNS, which deserves a special mention because they offer an API that looks very similar to Route 53's: after submitting a change request, the API returns a change object with a status of "pending" that you can poll until the status becomes "done". Sounds perfect! Except if you read the fine print, it says:

A status of "done" means that the request to update the authoritative servers has been sent, but the servers might not be updated yet

Sure enough, I found that it often takes two minutes after a change becomes "done" for it to be fully visible. The change object also contains a very bizarre boolean called "isServing", which is documented as:

If the DNS queries for the zone will be served.

I'm not sure what this means, or why information about the zone's status would be present in a record change object. In my testing I never once saw a value besides false, even long after queries for both the individual record and the zone as a whole were being served.

So the change object API is completely useless, and I don't know why it exists - who cares if the "request to update the authoritative servers has been sent"? That's an internal implementation detail. It only matters to users of the API if the change has been fully applied everywhere. So, SSLMate doesn't use change objects. It sleeps for 2 minutes after adding the record and hopes for the best.

All of this is exasperated when requesting a certificate using the ACME protocol. With ACME, if you tell the server the DNS record is published, but the server doesn't see the record, your certificate order is invalidated. You have to create a new order, and you're given a different DNS record that you have to publish. That means your ACME client could potentially get in a situation where it never makes forward progress, because on each attempt it fails to wait long enough before telling the ACME server to check the record.

SSLMate has a workaround for this when talking to an ACME-using certificate authority such as Let's Encrypt. Instead of publishing the record returned by the ACME server, SSLMate publishes an NS record that delegates the record to a custom-built authoritative DNS server operated by SSLMate. SSLMate's authoritative server returns the record provided by the ACME server. The NS record never changes, so if checking the record fails and SSLMate has to create a new ACME order, it doesn't need to republish a DNS record in the customer's zone; instead it just has to update the record that SSLMate's authoritative server returns, which can be done instantaneously. Therefore, every retry is more likely to succeed than the previous one since more time has elapsed since publishing the NS record. All of this happens completely automatically and transparently to the user of SSLMate, and is one of the ways that SSLMate provides great dependability. (Another benefit is that if the customer's DNS provider doesn't provide an API, they can publish the NS record manually and never have to touch it again, even for renewals.)

Nevertheless, it would be really nice if more DNS providers offered an API like Route 53 to report when a DNS record has been published.

Comments

January 4, 2020

This Is Why You Always Review Your Dependencies, AGPL Edition

Before adding a dependency to one of my software projects, I do some basic vetting of the dependency. Among the things I check are:

  • How is the code licensed?
  • Who are the authors?
  • Are there any serious unresolved issues in the issue tracker?
  • Is there a history of serious bugs in the issue tracker?
  • What kind of code review process is used for pull requests?

Finally, I do a cursory review of the code. I look for anything blatantly insecure or malicious, and try to get a feel for the quality of the code base. I look for "Brown M&Ms" - minor inattention to detail that might indicate a larger problem.

I repeat the above recursively on transitive dependencies as many times as necessary. I also repeat the cursory code review any time I upgrade a dependency.

This is quite a bit of work, but is necessary to avoid falling victim to attacks like event-stream. I was recently reminded of yet another reason to review dependencies, as I reviewed Duo's highly-publicized Go library for WebAuthn, github.com/duo-labs/webauthn.

It started off poorly when I noticed some Brown M&M's: despite being a library, it was logging messages to stdout, and there were several code smells which indicated inexperience with Go. Sure enough, these minor issues foreshadowed a far larger problem: when I started reviewing the transitive dependency github.com/katzenpost/core/crypto/eddsa, I was greeted with an AGPLv3 license header.

This was bad news for most people wanting to use Duo's WebAuthn library. Although Duo had licensed their library under a BSD license, when you linked your application with Duo's library, you'd also be linking with the AGPL-licensed library, creating a "modified" work in the eyes of the (A)GPL, thus subjecting your application to section 13 of the AGPL:

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software.

In other words, if you used github.com/duo-labs/webauthn in a public-facing web app, your web app had to be open source.

The most galling thing about this dependency is that it's redundant with golang.org/x/crypto/ed25519, which is one of Go's quasi-standard "x" libraries. In fact, github.com/duo-labs/webauthn originally used golang.org/x/crypto/ed25519. That changed during a pull request from an external collaborator titled "Consolidate COSE things to their own area". In the process of moving some code from one file to another, this pull request subtly changed the implementation of OKPPublicKeyData.Verify.

Here's the old OKPPublicKeyData.Verify, which uses golang.org/x/crypto/ed25519:

// Verify Octet Key Pair (OKP) Public Key Signature
func (k *OKPPublicKeyData) Verify(data []byte, sig []byte) (bool, error) {
	f := HasherFromCOSEAlg(COSEAlgorithmIdentifier(k.PublicKeyData.Algorithm))
	h := f()
	h.Write(data)
	return ed25519.Verify(k.XCoord, h.Sum(nil), sig), nil
}

Here's the new OKPPublicKeyData.Verify, which uses the AGPL-licensed github.com/katzenpost/core/crypto/eddsa:

// Verify Octet Key Pair (OKP) Public Key Signature
func (k *OKPPublicKeyData) Verify(data []byte, sig []byte) (bool, error) {
	f := HasherFromCOSEAlg(COSEAlgorithmIdentifier(k.PublicKeyData.Algorithm))
	h := f()
	h.Write(data)
	var oKey eddsa.PublicKey
	err := oKey.FromBytes(k.XCoord)
	if err != nil {
		return false, err
	}
	return oKey.Verify(h.Sum(nil), sig), nil
}

There was zero explanation provided for this change. The pull request was reviewed by two Duo employees, who approved and merged it.

Aside: this is why I don't like to accept pull requests that move code around. Even if the new code organization is better, it's usually not worth the time it takes to ensure the pull request isn't doing anything extra.

I filed an issue about the AGPL-licensed dependency, and the developers switched back to using golang.org/x/crypto/ed25519. Nevertheless, I've decided not to use github.com/duo-labs/webauthn. The bulk of the library and its dependencies are to support a WebAuthn misfeature called attestation, which I have less-than-zero desire to use. I just finished writing a vastly simpler, attestation-free library which is less than one tenth the size (I will open source it soon - watch this space). (There's another lesson here, which is that complicated "features" like attestation that serve a minority's use case shouldn't be added to Web standards.) Developing this library is less costly than the liability of using an existing WebAuthn Go library.

This incident reminded me of why I like programming in Go. Go's extensive standard library, along with its quasi-standard "x" libraries, mean that the dependency graph of my projects is typically quite small. The bulk of my trust is consolidated in the Go project, and thanks to their stellar reputation and solid operating procedures, I don't feel a need to review the source code of the Go compiler and standard libraries. Even though I love Rust, I am terrified every time I look at the dependency graph of a typical Rust library: I usually see dozens of transitive dependencies written by Internet randos whom I have zero reason to trust. Vetting all those dependencies takes far too much time, which is why I'm much less productive in Rust than Go.

One final note: as a fan of verifiable data structures like Certificate Transparency, I have to love the new Go checksum database. However, the checksum database does you no good if you don't take the time to review your dependencies. Unfortunately, I've already seen one over-enthusiastic Go user claim that the Go checksum database solves all problems with dependency management. It doesn't. There's no easy way around this basic fact: you have to review your dependencies.

Comments

December 20, 2019

Preventing Server Side Request Forgery in Golang

If your application makes requests to URLs provided by untrusted sources (such as users), you must take care to avoid server side request forgery (SSRF) attacks. Otherwise, an attacker might be able to induce your application to make a request to a service on your server's localhost or internal network. Since the service thinks the request is coming from a trusted source, it might perform a privileged action or return sensitive data that gets relayed by your application back to the attacker. This is particularly a problem when running in EC2, which exposes sensitive credentials over its metadata service, which is accessible over HTTP at a private IP address. SSRF attacks can be serious; one was exploited earlier this year to steal more than 100 million credit applications from Capital One.

One way to prevent SSRF attacks is to validate all addresses before connecting to them. However, you must do the validation at a very low layer to be effective. It's not sufficient to simply block URLs that contain "localhost" or an internal IP address, since an attacker could publish a DNS record under a public domain that resolves to an internal IP address. It's also insufficient to do the DNS lookup yourself and block a URL if the hostname resolves to an unsafe address; an attacker could set up a special DNS server that returns a safe address the first time it's queried, and the target address the second time when your application actually connects to the URL.

Instead, you need to hook deep into your HTTP client's networking stack and check for a safe address right before the HTTP client tries to access it.

Fortunately, Go makes it easy to hook in at just the right place, thanks to the Control field of net.Dialer, introduced in Go 1.11:

// If Control is not nil, it is called after creating the network
// connection but before actually dialing.
//
// Network and address parameters passed to Control method are not
// necessarily the ones passed to Dial. For example, passing "tcp" to Dial
// will cause the Control function to be called with "tcp4" or "tcp6".
Control func(network, address string, c syscall.RawConn) error // Go 1.11

This function is called by Go's standard library after the address has been resolved, but before connecting. The network argument is tcp4, udp4, tcp6, or udp6, and the address argument is an IP address and port number separated by a colon (e.g. 192.0.2.0:80). If the control function returns an error, the dial is aborted.

Here's an example control function that returns an error if the address is not safe. It's quite conservative, permitting only TCP connections to port 80 and 443 on public IP addresses (see here for the implementation of isPublicIPAddress). You may want to customize the control function to suit your application's needs.

func safeSocketControl(network string, address string, conn syscall.RawConn) error {
	if !(network == "tcp4" || network == "tcp6") {
		return fmt.Errorf("%s is not a safe network type", network)
	}

	host, port, err := net.SplitHostPort(address)
	if err != nil {
		return fmt.Errorf("%s is not a valid host/port pair: %s", address, err)
	}

	ipaddress := net.ParseIP(host)
	if ipaddress == nil {
		return fmt.Errorf("%s is not a valid IP address", host)
	}

	if !isPublicIPAddress(ipaddress) {
		return fmt.Errorf("%s is not a public IP address", ipaddress)
	}

	if !(port == "80" || port == "443") {
		return fmt.Errorf("%s is not a safe port number", port)
	}

	return nil
}

Once you have a control function, you can use it to make HTTP requests as follows (the various numbers below match those used by http.DefaultClient):

safeDialer := &net.Dialer{
	Timeout:   30 * time.Second,
	KeepAlive: 30 * time.Second,
	DualStack: true,
	Control:   safeSocketControl,
}

safeTransport := &http.Transport{
	Proxy:                 http.ProxyFromEnvironment,
	DialContext:           safeDialer.DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}

safeClient := &http.Client{
	Transport: safeTransport,
}

resp, err := safeClient.Get(untrustedURL)

The above code examples are in the public domain.

Comments

December 3, 2019

Programmatically Accessing Your Customers' Google Cloud Accounts (While Avoiding the Confused Deputy Problem)

SaaS applications often need to access their customers' cloud resources at providers like Amazon Web Services and Google Cloud Platform. For instance, a monitoring service might require read-only access to their customers' AWS accounts so it can inventory resources. At SSLMate, we request access to our customers' DNS zones so we can publish DNS records to automatically validate the certificates that they request.

Doing this with AWS was easy, thanks to their detailed documentation for precisely this use case. However, when it came to Google Cloud, the only hint of a solution that I could find was buried deep in this FAQ:

How can I access data from my users' Google Cloud Platform project using Cloud APIs?

You can access data from your users' Google Cloud Platform projects by creating a service account to represent your service, and then having your customers grant that service account appropriate access to their cloud data using IAM policies. Note that you might want to create a service account per customer if you need to avoid confused deputy problems.

The first part sounds pretty easy: create a service account which SSLMate uses when making DNS changes and ask customers to grant this account access to Cloud DNS in their Google Cloud project. What isn't as easy, unfortunately, is solving the confused deputy problem. Contrary to the FAQ, this isn't something that we "might" want to do - it's something we absolutely must do. If SSLMate used just a single service account and all of our customers authorized it, then a malicious customer would be able to request a certificate for any of our customer's domains. SSLMate would access the victim's project using the single SSLMate service account and publish the validation record for the attacker's certificate request. This would succeed since the victim had authorized the single SSLMate service account. I can't think of any application that would not also be vulnerable if it did not address the confused deputy problem.

AWS provides a nice solution to the problem. The customer doesn't just authorize SSLMate's AWS account; they authorize the AWS account plus an "external ID", which for SSLMate is the same as their SSLMate customer ID. When SSLMate connects to AWS to add a DNS record, it sends the ID of the customer on whose behalf it is acting. If it doesn't match the authorized external ID, AWS blocks the request.

But with Google Cloud we're stuck creating a different service account for each SSLMate customer. The customer would authorize only that service account, and SSLMate would use it to access their Google Cloud project. Creating all these service accounts isn't hard - there's an API for that - but what's hard is giving each of these service accounts their own key. I'd have to securely store all those keys somewhere, and rotate them periodically, which is a pain.

Fortunately, Google Cloud has an API to generate short-lived OAuth2 access tokens for a service account. SSLMate invokes this API from a single master service account to get an access token for the customer-specific service account, and uses the access token to make the DNS change requests. When it's done, SSLMate discards the credentials. The customer-specific accounts have no long-term keys associated with them; only the single master service account does, making key management significantly easier, and equivalent to AWS.

Here's how you can set this up for your SaaS application...

1. Create the Master Service Account

This service account will be used by your app to create customer-specific service accounts, and to create short-lived access credentials for accessing those accounts.

  1. Create the service account, giving it a name of your choosing.

  2. Assign the service account the following roles:

    • Service Account Admin - this is needed to create customer-specific service accounts
    • Service Account Token Creator - this is needed to get the short-lived access credentials
  3. Create a key for this service account and save a copy. This key will be used by your app.

2. Onboarding a Customer

Your app needs to do this every time you onboard a new customer (or an existing customer wants to integrate their Google Cloud account with your service).

  1. Create a service account for the customer using the IAM API, authenticating with the key for your master service account. I suggest deriving the service account ID from the customer's internal ID. For instance, if the customer ID is 1234, use customer-1234 as the service account ID.

  2. Instruct your customer to authorize this service account as follows:

    1. Visit the IAM page for their project.
    2. Click Add.
    3. In the New Member box, your customer must enter the email address of the service account created in step 1. The email address will look like: SERVICE_ACCOUNT_ID@YOUR_PROJECT_ID.iam.gserviceaccount.com
    4. In the Role box, choose the roles necessary for your app to function (in SSLMate's case, this is "DNS Administrator").
    5. Click Save.
  3. After your customer authorizes the account, I recommend testing the access by making a simple read-only request as described in part 3 below and displaying an error if it doesn't work.

3. Accessing Your Customer's Account

  1. Use the generateAccessToken API to create an access token for the customer-specific service account. Authenticate to this API using the master service account's key. The name parameter must contain the email address of the customer-specific service account, which you can derive from the customer ID if you follow the naming convention suggested above. The scope array must contain the OAuth scopes you need to access. (In SSLMate's case, this is https://www.googleapis.com/auth/ndev.clouddns.readwrite)

  2. Use the token returned by the API call as a Bearer token for accessing your customer's account.

If you're using Go, it's helpful to put the above logic in a type that implements the oauth2.TokenSource interface, which you can pass to oauth2.NewClient to create the http.Client that you use with the Google Cloud API library. Here's a drop-in token source type you can use, and here's an adaptable example of how to use it.

Conclusion

While it's more complicated to set up than AWS, in the end this solution has the same desirable properties as AWS: protection against confused deputy attacks, and just one long term credential for your application. If only Google Cloud's documentation for this was as helpful as AWS's!

Addendum: Why I Didn't Use Three-Legged OAuth

My first attempt at implementing this feature used three-legged OAuth. SSLMate would redirect the customer to Google, they'd click a single button to authorize SSLMate for read/write access to Google Cloud DNS, and then Google would redirect back to SSLMate with an access code. This worked well and provided a great user experience since the customer didn't need to do any configuration. Unfortunately, the access was linked to the customer's Google account, rather than their Google Cloud project. I thought this was a bad idea because if the Google Account ever lost access to the project, it would break the SSLMate integration. The integration needs to continue working even if the employee who set it up leaves the company which owns the Google Cloud project.

Another thing which turned me off three-legged OAuth was that Google wanted to subject my integration to a lengthy review because accessing DNS is considered a "sensitive scope." This sounded like a hassle, and I assume it is only going to get more restrictive in the future as Google tries to stop malicious OAuth apps like last year's viral Google Docs attack. For example, if they decided one day to classify DNS access as a "restricted scope," I would be subjected to a 5 figure security audit.

Consequentially, I ditched three-legged OAuth and turned to the solution described above. The user experience is not quite as nice but it's much more robust. SSLMate still uses three-legged OAuth with other DNS providers which support it, like DNSimple and Digital Ocean.

Comments

Older Posts