mjg59 | Recent Entries

The Free Software Foundation Europe and the Software Freedom Conservancy recently released a statement that they would no longer work with Eben Moglen, chairman of the Software Freedom Law Center. Eben was the general counsel for the Free Software Foundation for over 20 years, and was centrally involved in the development of version 3 of the GNU General Public License. He's devoted a great deal of his life to furthering free software.

But, as described in the joint statement, he's also acted abusively towards other members of the free software community. He's acted poorly towards his own staff. In a professional context, he's used graphically violent rhetoric to describe people he dislikes. He's screamed abuse at people attempting to do their job.

And, sadly, none of this comes as a surprise to me. As I wrote in 2017, after it became clear that Eben's opinions diverged sufficiently from the FSF's that he could no longer act as general counsel, he responded by threatening an FSF board member at an FSF-run event (various members of the board were willing to tolerate this, which is what led to me quitting the board). There's over a decade's evidence of Eben engaging in abusive behaviour towards members of the free software community, be they staff, colleagues, or just volunteers trying to make the world a better place.

When we build communities that tolerate abuse, we exclude anyone unwilling to tolerate being abused[1]. Nobody in the free software community should be expected to deal with being screamed at or threatened. Nobody should be afraid that they're about to have their sexuality outed by a former boss.

But of course there are some that will defend Eben based on his past contributions. There were people who were willing to defend Hans Reiser on that basis. We need to be clear that what these people are defending is not free software - it's the right for abusers to abuse. And in the long term, that's bad for free software.

[1] "Why don't people just get better at tolerating abuse?" is a terrible response to this. Why don't abusers stop abusing? There's fewer of them, and it should be easier.

TPMs contain a set of registers ("Platform Configuration Registers", or PCRs) that are used to track what a system boots. Each time a new event is measured, a cryptographic hash representing that event is passed to the TPM. The TPM appends that hash to the existing value in the PCR, hashes that, and stores the final result in the PCR. This means that while the PCR's value depends on the precise sequence and value of the hashes presented to it, the PCR value alone doesn't tell you what those individual events were. Different PCRs are used to store different event types, but there are still more events than there are PCRs so we can't avoid this problem by simply storing each event separately.

This is solved using the event log. The event log is simply a record of each event, stored in RAM. The algorithm the TPM uses to calculate the PCR values is known, so we can reproduce that by simply taking the events from the event log and replaying the series of events that were passed to the TPM. If the final calculated value is the same as the value in the PCR, we know that the event log is accurate, which means we now know the value of each individual event and can make an appropriate judgement regarding its security.

If any value in the event log is invalid, we'll calculate a different PCR value and it won't match. This isn't terribly helpful - we know that at least one entry in the event log doesn't match what was passed to the TPM, but we don't know which entry. That means we can't trust any of the events associated with that PCR. If you're trying to make a security determination based on this, that's going to be a problem.

PCR 7 is used to track information about the secure boot policy on the system. It contains measurements of whether or not secure boot is enabled, and which keys are trusted and untrusted on the system in question. This is extremely helpful if you want to verify that a system booted with secure boot enabled before allowing it to do something security or safety critical. Unfortunately, if the device gives you an event log that doesn't replay correctly for PCR 7, you now have no idea what the security state of the system is.

We ran into that this week. Examination of the event log revealed an additional event other than the expected ones - a measurement accompanied by the string "Boot Guard Measured S-CRTM". Boot Guard is an Intel feature where the CPU verifies the firmware is signed with a trusted key before executing it, and measures information about the firmware in the process. Previously I'd only encountered this as a measurement into PCR 0, which is the PCR used to track information about the firmware itself. But it turns out that at least some versions of Boot Guard also measure information about the Boot Guard policy into PCR 7. The argument for this is that this is effectively part of the secure boot policy - having a measurement of the Boot Guard state tells you whether Boot Guard was enabled, which tells you whether or not the CPU verified a signature on your firmware before running it (as I wrote before, I think Boot Guard has user-hostile default behaviour, and that enforcing this on consumer devices is a bad idea).

But there's a problem here. The event log is created by the firmware, and the Boot Guard measurements occur before the firmware is executed. So how do we get a log that represents them? That one's fairly simple - the firmware simply re-calculates the same measurements that Boot Guard did and creates a log entry after the fact[1]. All good.

Except. What if the firmware screws up the calculation and comes up with a different answer? The entry in the event log will now not match what was sent to the TPM, and replaying will fail. And without knowing what the actual value should be, there's no way to fix this, which means there's no way to verify the contents of PCR 7 and determine whether or not secure boot was enabled.

But there's still a fundamental source of truth - the measurement that was sent to the TPM in the first place. Inspired by Henri Nurmi's work on sniffing Bitlocker encryption keys, I asked a coworker if we could sniff the TPM traffic during boot. The TPM on the board in question uses SPI, a simple bus that can have multiple devices connected to it. In this case the system flash and the TPM are on the same SPI bus, which made things easier. The board had a flash header for external reprogramming of the firmware in the event of failure, and all SPI traffic was visible through that header. Attaching a logic analyser to this header made it simple to generate a record of that. The only problem was that the chip select line on the header was attached to the firmware flash chip, not the TPM. This was worked around by simply telling the analysis software that it should invert the sense of the chip select line, ignoring all traffic that was bound for the flash and paying attention to all other traffic. This worked in this case since the only other device on the bus was the TPM, but would cause problems in the event of multiple devices on the bus all communicating.

With the aid of this analyser plugin, I was able to dump all the TPM traffic and could then search for writes that included the "0182" sequence that corresponds to the command code for a measurement event. This gave me a couple of accesses to the locality 3 registers, which was a strong indication that they were coming from the CPU rather than from the firmware. One was for PCR 0, and one was for PCR 7. This corresponded to the two Boot Guard events that we expected from the event log. The hash in the PCR 0 measurement was the same as the hash in the event log, but the hash in the PCR 7 measurement differed from the hash in the event log. Replacing the event log value with the value actually sent to the TPM resulted in the event log now replaying correctly, supporting the hypothesis that the firmware was failing to correctly reconstruct the event.

What now? The simple thing to do is for us to simply hard code this fixup, but longer term we'd like to figure out how to reconstruct the event so we can calculate the expected value ourselves. Unfortunately there doesn't seem to be any public documentation on this. Sigh.

[1] What stops firmware on a system with no Boot Guard faking those measurements? TPMs have a concept of "localities", effectively different privilege levels. When Boot Guard performs its initial measurement into PCR 0, it does so at locality 3, a locality that's only available to the CPU. This causes PCR 0 to be initialised to a different initial value, affecting the final PCR value. The firmware can't access locality 3, so can't perform an equivalent measurement, so can't fake the value.

Work involves supporting Windows (there's a lot of specialised hardware design software that's only supported under Windows, so this isn't really avoidable), but also involves git, so I've been working on extending our support for hardware-backed SSH certificates to Windows and trying to glue that into git. In theory this doesn't sound like a hard problem, but in practice oh good heavens.

Git for Windows is built on top of msys2, which in turn is built on top of Cygwin. This is an astonishing artifact that allows you to build roughly unmodified POSIXish code on top of Windows, despite the terrible impedance mismatches inherent in this. One is that until 2017, Windows had no native support for Unix sockets. That's kind of a big deal for compatibility purposes, so Cygwin worked around it. It's, uh, kind of awful. If you're not a Cygwin/msys app but you want to implement a socket they can communicate with, you need to implement this undocumented protocol yourself. This isn't impossible, but ugh.

But going to all this trouble helps you avoid another problem! The Microsoft version of OpenSSH ships an SSH agent that doesn't use Unix sockets, but uses a named pipe instead. So if you want to communicate between Cygwinish OpenSSH (as is shipped with git for Windows) and the SSH agent shipped with Windows, you need something that bridges between those. The state of the art seems to be to use npiperelay with socat, but if you're already writing something that implements the Cygwin socket protocol you can just use npipe to talk to the shipped ssh-agent and then export your own socket interface.

And, amazingly, this all works? I've managed to hack together an SSH agent (using Go's SSH agent implementation) that can satisfy hardware backed queries itself, but forward things on to the Windows agent for compatibility with other tooling. Now I just need to figure out how to plumb it through to WSL. Sigh.

I dug out a computer running Fedora 28, which was released 2018-04-01 - over 5 years ago. Backing up the data and re-installing seemed tedious, but the current version of Fedora is 38, and while Fedora supports updates from N to N+2 that was still going to be 5 separate upgrades. That seemed tedious, so I figured I'd just try to do an update from 28 directly to 38. This is, obviously, extremely unsupported, but what could possibly go wrong?

Running sudo dnf system-upgrade download --releasever=38 didn't successfully resolve dependencies, but sudo dnf system-upgrade download --releasever=38 --allowerasing passed and dnf started downloading 6GB of packages. And then promptly failed, since I didn't have any of the relevant signing keys. So I downloaded the fedora-gpg-keys package from F38 by hand and tried to install it, and got a signature hdr data: BAD, no. of bytes(88084) out of range error. It turns out that rpm doesn't handle cases where the signature header is larger than a few K, and RPMs from modern versions of Fedora. The obvious fix would be to install a newer version of rpm, but that wouldn't be easy without upgrading the rest of the system as well - or, alternatively, downloading a bunch of build depends and building it. Given that I'm already doing all of this in the worst way possible, let's do something different.

The relevant code in the hdrblobRead function of rpm's lib/header.c is:


    int32_t il_max = HEADER_TAGS_MAX;
    int32_t dl_max = HEADER_DATA_MAX;

    if (regionTag == RPMTAG_HEADERSIGNATURES) {
        il_max = 32;
        dl_max = 8192;
    }

which indicates that if the header in question is RPMTAG_HEADERSIGNATURES, it sets more restrictive limits on the size (no, I don't know why). So I installed rpm-libs-debuginfo, ran gdb against librpm.so.8, loaded the symbol file, and then did disassemble hdrblobRead. The relevant chunk ends up being:


   0x000000000001bc81 <+81>:	cmp    $0x3e,%ebx
   0x000000000001bc84 <+84>:	mov    $0xfffffff,%ecx
   0x000000000001bc89 <+89>:	mov    $0x2000,%eax
   0x000000000001bc8e <+94>:	mov    %r12,%rdi
   0x000000000001bc91 <+97>:	cmovne %ecx,%eax

which is basically "If ebx is not 0x3e, set eax to 0xffffffff - otherwise, set it to 0x2000". RPMTAG_HEADERSIGNATURES is 62, which is 0x3e, so I just opened librpm.so.8 in hexedit, went to byte 0x1bc81, and replaced 0x3e with 0xfe (an arbitrary invalid value). This has the effect of skipping the if (regionTag == RPMTAG_HEADERSIGNATURES) code and so using the default limits even if the header section in question is the signatures. And with that one byte modification, rpm from F28 would suddenly install the fedora-gpg-keys package from F38. Success!

But short-lived. dnf now believed packages had valid signatures, but sadly there were still issues. A bunch of packages in F38 had files that conflicted with packages in F28. These were largely Python 3 packages that conflicted with Python 2 packages from F28 - jumping this many releases meant that a bunch of explicit replaces and the like no longer existed. The easiest way to solve this was simply to uninstall python 2 before upgrading, and avoiding the entire transition. Another issue was that some data files had moved from libxcrypt-common to libxcrypt, and removing libxcrypt-common would remove libxcrypt and a bunch of important things that depended on it (like, for instance, systemd). So I built a fake empty package that provided libxcrypt-common and removed the actual package. Surely everything would work now?

Ha no. The final obstacle was that several packages depended on rpmlib(CaretInVersions), and building another fake package that provided that didn't work. I shouted into the void and Bill Nottingham answered - rpmlib dependencies are synthesised by rpm itself, indicating that it has the ability to handle extensions that specific packages are making use of. This made things harder, since the list is hard-coded in the binary. But since I'm already committing crimes against humanity with a hex editor, why not go further? Back to editing librpm.so.8 and finding the list of rpmlib() dependencies it provides. There were a bunch, but I couldn't really extend the list. What I could do is overwrite existing entries. I tried this a few times but (unsurprisingly) broke other things since packages depended on the feature I'd overwritten. Finally, I rewrote rpmlib(ExplicitPackageProvide) to rpmlib(CaretInVersions) (adding an extra '\0' at the end of it to deal with it being shorter than the original string) and apparently nothing I wanted to install depended on rpmlib(ExplicitPackageProvide) because dnf finished its transaction checks and prompted me to reboot to perform the update. So, I did.

And about an hour later, it rebooted and gave me a whole bunch of errors due to the fact that dbus never got started. A bit of digging revealed that I had no /etc/systemd/system/dbus.service, a symlink that was presumably introduced at some point between F28 and F38 but which didn't get automatically added in my case because well who knows. That was literally the only thing I needed to fix up after the upgrade, and on the next reboot I was presented with a gdm prompt and had a fully functional F38 machine.

You should not do this. I should not do this. This was a terrible idea. Any situation where you're binary patching your package manager to get it to let you do something is obviously a bad situation. And with hindsight performing 5 independent upgrades might have been faster. But that would have just involved me typing the same thing 5 times, while this way I learned something. And what I learned is "Terrible ideas sometimes work and so you should definitely act upon them rather than doing the sensible thing", so like I said, you should not do this in case you learn the same lesson.

The phrase "Root of Trust" turns up at various points in discussions about verified boot and measured boot, and to a first approximation nobody is able to give you a coherent explanation of what it means[1]. The Trusted Computing Group has a fairly wordy definition, but (a) it's a lot of words and (b) I don't like it, so instead I'm going to start by defining a root of trust as "A thing that has to be trustworthy for anything else on your computer to be trustworthy".

(An aside: when I say "trustworthy", it is very easy to interpret this in a cynical manner and assume that "trust" means "trusted by someone I do not necessarily trust to act in my best interest". I want to be absolutely clear that when I say "trustworthy" I mean "trusted by the owner of the computer", and that as far as I'm concerned selling devices that do not allow the owner to define what's trusted is an extremely bad thing in the general case)

Let's take an example. In verified boot, a cryptographic signature of a component is verified before it's allowed to boot. A straightforward implementation of a verified boot implementation has the firmware verify the signature on the bootloader or kernel before executing it. In this scenario, the firmware is the root of trust - it's the first thing that makes a determination about whether something should be allowed to run or not[2]. As long as the firmware behaves correctly, and as long as there aren't any vulnerabilities in our boot chain, we know that we booted an OS that was signed with a key we trust.

But what guarantees that the firmware behaves correctly? What if someone replaces our firmware with firmware that trusts different keys, or hot-patches the OS as it's booting it? We can't just ask the firmware whether it's trustworthy - trustworthy firmware will say yes, but the thing about malicious firmware is that it can just lie to us (either directly, or by modifying the OS components it boots to lie instead). This is probably not sufficiently trustworthy!

Ok, so let's have the firmware be verified before it's executed. On Intel this is "Boot Guard", on AMD this is "Platform Secure Boot", everywhere else it's just "Secure Boot". Code on the CPU (either in ROM or signed with a key controlled by the CPU vendor) verifies the firmware[3] before executing it. Now the CPU itself is the root of trust, and, well, that seems reasonable - we have to place trust in the CPU, otherwise we can't actually do computing. We can now say with a reasonable degree of confidence (again, in the absence of vulnerabilities) that we booted an OS that we trusted. Hurrah!

Except. How do we know that the CPU actually did that verification? CPUs are generally manufactured without verification being enabled - different system vendors use different signing keys, so those keys can't be installed in the CPU at CPU manufacture time, and vendors need to do code development without signing everything so you can't require that keys be installed before a CPU will work. So, out of the box, a new CPU will boot anything without doing verification[4], and development units will frequently have no verification.

As a device owner, how do you tell whether or not your CPU has this verification enabled? Well, you could ask the CPU, but if you're doing that on a device that booted a compromised OS then maybe it's just hotpatching your OS so when you do that you just get RET_TRUST_ME_BRO even if the CPU is desperately waving its arms around trying to warn you it's a trap. This is, unfortunately, a problem that's basically impossible to solve using verified boot alone - if any component in the chain fails to enforce verification, the trust you're placing in the chain is misplaced and you are going to have a bad day.

So how do we solve it? The answer is that we can't simply ask the OS, we need a mechanism to query the root of trust itself. There's a few ways to do that, but fundamentally they depend on the ability of the root of trust to provide proof of what happened. This requires that the root of trust be able to sign (or cause to be signed) an "attestation" of the system state, a cryptographically verifiable representation of the security-critical configuration and code. The most common form of this is called "measured boot" or "trusted boot", and involves generating a "measurement" of each boot component or configuration (generally a cryptographic hash of it), and storing that measurement somewhere. The important thing is that it must not be possible for the running OS (or any pre-OS component) to arbitrarily modify these measurements, since otherwise a compromised environment could simply go back and rewrite history. One frequently used solution to this is to segregate the storage of the measurements (and the attestation of them) into a separate hardware component that can't be directly manipulated by the OS, such as a Trusted Platform Module. Each part of the boot chain measures relevant security configuration and the next component before executing it and sends that measurement to the TPM, and later the TPM can provide a signed attestation of the measurements it was given. So, an SoC that implements verified boot should create a measurement telling us whether verification is enabled - and, critically, should also create a measurement if it isn't. This is important because failing to measure the disabled state leaves us with the same problem as before; someone can replace the mutable firmware code with code that creates a fake measurement asserting that verified boot was enabled, and if we trust that we're going to have a bad time.

(Of course, simply measuring the fact that verified boot was enabled isn't enough - what if someone replaces the CPU with one that has verified boot enabled, but trusts keys under their control? We also need to measure the keys that were used in order to ensure that the device trusted only the keys we expected, otherwise again we're going to have a bad time)

So, an effective root of trust needs to:

1) Create a measurement of its verified boot policy before running any mutable code
2) Include the trusted signing key in that measurement
3) Actually perform that verification before executing any mutable code

and from then on we're in the hands of the verified code actually being trustworthy, and it's probably written in C so that's almost certainly false, but let's not try to solve every problem today.

Does anything do this today? As far as I can tell, Intel's Boot Guard implementation does. Based on publicly available documentation I can't find any evidence that AMD's Platform Secure Boot does (it does the verification, but it doesn't measure the policy beforehand, so it seems spoofable), but I could be wrong there. I haven't found any general purpose non-x86 parts that do, but this is in the realm of things that SoC vendors seem to believe is some sort of value-add that can only be documented under NDAs, so please do prove me wrong. And then there are add-on solutions like Titan, where we delegate the initial measurement and validation to a separate piece of hardware that measures the firmware as the CPU reads it, rather than requiring that the CPU do it.

But, overall, the situation isn't great. On many platforms there's simply no way to prove that you booted the code you expected to boot. People have designed elaborate security implementations that can be bypassed in a number of ways.

[1] In this respect it is extremely similar to "Zero Trust"
[2] This is a bit of an oversimplification - once we get into dynamic roots of trust like Intel's TXT this story gets more complicated, but let's stick to the simple case today
[3] I'm kind of using "firmware" in an x86ish manner here, so for embedded devices just think of "firmware" as "the first code executed out of flash and signed by someone other than the SoC vendor"
[4] In the Intel case this isn't strictly true, since the keys are stored in the motherboard chipset rather than the CPU, and so taking a board with Boot Guard enabled and swapping out the CPU won't disable Boot Guard because the CPU reads the configuration from the chipset. But many mobile Intel parts have the chipset in the same package as the CPU, so in theory swapping out that entire package would disable Boot Guard. I am not good enough at soldering to demonstrate that.

(Edit 2023-05-10: This has now launched for a subset of Twitter users. The code that existed to notify users that device identities had changed does not appear to have been enabled - as a result, in its current form, Twitter can absolutely MITM conversations and read your messages)

Elon Musk appeared on an interview with Tucker Carlson last month, with one of the topics being the fact that Twitter could be legally compelled to hand over users' direct messages to government agencies since they're held on Twitter's servers and aren't encrypted. Elon talked about how they were in the process of implementing proper encryption for DMs that would prevent this - "You could put a gun to my head and I couldn't tell you. That's how it should be."

tl;dr - in the current implementation, while Twitter could subvert the end-to-end nature of the encryption, it could not do so without users being notified. If any user involved in a conversation were to ignore that notification, all messages in that conversation (including ones sent in the past) could then be decrypted. This isn't ideal, but it still seems like an improvement over having no encryption at all. More technical discussion follows.

For context: all information about Twitter's implementation here has been derived from reverse engineering version 9.86.0 of the Android client and 9.56.1 of the iOS client (the current versions at time of writing), and the feature hasn't yet launched. While it's certainly possible that there could be major changes in the protocol between now launch, Elon has asserted that they plan to launch the feature this week so it's plausible that this reflects what'll ship.

For it to be impossible for Twitter to read DMs, they need to not only be encrypted, they need to be encrypted with a key that's not available to Twitter. This is what's referred to as "end-to-end encryption", or e2ee - it means that the only components in the communication chain that have access to the unencrypted data are the endpoints. Even if the message passes through other systems (and even if it's stored on other systems), those systems do not have access to the keys that would be needed to decrypt the data.

End-to-end encrypted messengers were initially popularised by Signal, but the Signal protocol has since been incorporated into WhatsApp and is probably much more widely used there. Millions of people per day are sending messages to each other that pass through servers controlled by third parties, but those third parties are completely unable to read the contents of those messages. This is the scenario that Elon described, where there's no degree of compulsion that could cause the people relaying messages to and from people to decrypt those messages afterwards.

But for this to be possible, both ends of the communication need to be able to encrypt messages in a way the other end can decrypt. This is usually performed using AES, a well-studied encryption algorithm with no known significant weaknesses. AES is a form of what's referred to as a symmetric encryption, one where encryption and decryption are performed with the same key. This means that both ends need access to that key, which presents us with a bootstrapping problem. Until a shared secret is obtained, there's no way to communicate securely, so how do we generate that shared secret? A common mechanism for this is something called Diffie Hellman key exchange, which makes use of asymmetric encryption. In asymmetric encryption, an encryption key can be split into two components - a public key and a private key. Both devices involved in the communication combine their private key and the other party's public key to generate a secret that can only be decoded with access to the private key. As long as you know the other party's public key, you can now securely generate a shared secret with them. Even a third party with access to all the public keys won't be able to identify this secret. Signal makes use of a variation of Diffie-Hellman called Extended Triple Diffie-Hellman that has some desirable properties, but it's not strictly necessary for the implementation of something that's end-to-end encrypted.

Although it was rumoured that Twitter would make use of the Signal protocol, and in fact there are vestiges of code in the Twitter client that still reference Signal, recent versions of the app have shipped with an entirely different approach that appears to have been written from scratch. It seems simple enough. Each device generates an asymmetric keypair using the NIST P-256 elliptic curve, along with a device identifier. The device identifier and the public half of the key are uploaded to Twitter using a new API endpoint called /1.1/keyregistry/register. When you want to send an encrypted DM to someone, the app calls /1.1/keyregistry/extract_public_keys with the IDs of the users you want to communicate with, and gets back a list of their public keys. It then looks up the conversation ID (a numeric identifier that corresponds to a given DM exchange - for a 1:1 conversation between two people it doesn't appear that this ever changes, so if you DMed an account 5 years ago and then DM them again now from the same account, the conversation ID will be the same) in a local database to retrieve a conversation key. If that key doesn't exist yet, the sender generates a random one. The message is then encrypted with the conversation key using AES in GCM mode, and the conversation key is then put through Diffie-Hellman with each of the recipients' public device keys. The encrypted message is then sent to Twitter along with the list of encrypted conversation keys. When each of the recipients' devices receives the message it checks whether it already has a copy of the conversation key, and if not performs its half of the Diffie-Hellman negotiation to decrypt the encrypted conversation key. One it has the conversation key it decrypts it and shows it to the user.

What would happen if Twitter changed the registered public key associated with a device to one where they held the private key, or added an entirely new device to a user's account? If the app were to just happily send a message with the conversation key encrypted with that new key, Twitter would be able to decrypt that and obtain the conversation key. Since the conversation key is tied to the conversation, not any given pair of devices, obtaining the conversation key means you can then decrypt every message in that conversation, including ones sent before the key was obtained.

(An aside: Signal and WhatsApp make use of a protocol called Sesame which involves additional secret material that's shared between every device a user owns, hence why you have to do that QR code dance whenever you add a new device to your account. I'm grossly over-simplifying how clever the Signal approach is here, largely because I don't understand the details of it myself. The Signal protocol uses something called the Double Ratchet Algorithm to implement the actual message encryption keys in such a way that even if someone were able to successfully impersonate a device they'd only be able to decrypt messages sent after that point even if they had encrypted copies of every previous message in the conversation)

How's this avoided? Based on the UI that exists in the iOS version of the app, in a fairly straightforward way - each user can only have a single device that supports encrypted messages. If the user (or, in our hypothetical, a malicious Twitter) replaces the device key, the client will generate a notification. If the user pays attention to that notification and verifies with the recipient through some out of band mechanism that the device has actually been replaced, then everything is fine. But, if any participant in the conversation ignores this warning, the holder of the subverted key can obtain the conversation key and decrypt the entire history of the conversation. That's strictly worse than anything based on Signal, where such impersonation would simply not work, but even in the Twitter case it's not possible for someone to silently subvert the security.

So when Elon says Twitter wouldn't be able to decrypt these messages even if someone held a gun to his head, there's a condition applied to that - it's true as long as nobody fucks up. This is clearly better than the messages just not being encrypted at all in the first place, but overall it's a weaker solution than Signal. If you're currently using Twitter DMs, should you turn on encryption? As long as the limitations aren't too limiting, definitely! Should you use this in preference to Signal or WhatsApp? Almost certainly not. This seems like a genuine incremental improvement, but it'd be easy to interpret what Elon says as providing stronger guarantees than actually exist.

Here's an article from a French anarchist describing how his (encrypted) laptop was seized after he was arrested, and material from the encrypted partition has since been entered as evidence against him. His encryption password was supposedly greater than 20 characters and included a mixture of cases, numbers, and punctuation, so in the absence of any sort of opsec failures this implies that even relatively complex passwords can now be brute forced, and we should be transitioning to even more secure passphrases.

Or does it? Let's go into what LUKS is doing in the first place. The actual data is typically encrypted with AES, an extremely popular and well-tested encryption algorithm. AES has no known major weaknesses and is not considered to be practically brute-forceable - at least, assuming you have a random key. Unfortunately it's not really practical to ask a user to type in 128 bits of binary every time they want to unlock their drive, so another approach has to be taken.

This is handled using something called a "key derivation function", or KDF. A KDF is a function that takes some input (in this case the user's password) and generates a key. As an extremely simple example, think of MD5 - it takes an input and generates a 128-bit output, so we could simply MD5 the user's password and use the output as an AES key. While this could technically be considered a KDF, it would be an extremely bad one! MD5s can be calculated extremely quickly, so someone attempting to brute-force a disk encryption key could simply generate the MD5 of every plausible password (probably on a lot of machines in parallel, likely using GPUs) and test each of them to see whether it decrypts the drive.

(things are actually slightly more complicated than this - your password is used to generate a key that is then used to encrypt and decrypt the actual encryption key. This is necessary in order to allow you to change your password without having to re-encrypt the entire drive - instead you simply re-encrypt the encryption key with the new password-derived key. This also allows you to have multiple passwords or unlock mechanisms per drive)

Good KDFs reduce this risk by being what's technically referred to as "expensive". Rather than performing one simple calculation to turn a password into a key, they perform a lot of calculations. The number of calculations performed is generally configurable, in order to let you trade off between the amount of security (the number of calculations you'll force an attacker to perform when attempting to generate a key from a potential password) and performance (the amount of time you're willing to wait for your laptop to generate the key after you type in your password so it can actually boot). But, obviously, this tradeoff changes over time - defaults that made sense 10 years ago are not necessarily good defaults now. If you set up your encrypted partition some time ago, the number of calculations required may no longer be considered up to scratch.

And, well, some of these assumptions are kind of bad in the first place! Just making things computationally expensive doesn't help a lot if your adversary has the ability to test a large number of passwords in parallel. GPUs are extremely good at performing the sort of calculations that KDFs generally use, so an attacker can "just" get a whole pile of GPUs and throw them at the problem. KDFs that are computationally expensive don't do a great deal to protect against this. However, there's another axis of expense that can be considered - memory. If the KDF algorithm requires a significant amount of RAM, the degree to which it can be performed in parallel on a GPU is massively reduced. A Geforce 4090 may have 16,384 execution units, but if each password attempt requires 1GB of RAM and the card only has 24GB on board, the attacker is restricted to running 24 attempts in parallel.

So, in these days of attackers with access to a pile of GPUs, a purely computationally expensive KDF is just not a good choice. And, unfortunately, the subject of this story was almost certainly using one of those. Ubuntu 18.04 used the LUKS1 header format, and the only KDF supported in this format is PBKDF2. This is not a memory expensive KDF, and so is vulnerable to GPU-based attacks. But even so, systems using the LUKS2 header format used to default to argon2i, ~~again not a memory expensive KDF~~which is memory strong, but not designed to be resistant to GPU attack (thanks to the comments pointing out my misunderstanding here). New versions default to argon2id, which is. You want to be using argon2id.

What makes this worse is that distributions generally don't update this in any way. If you installed your system and it gave you pbkdf2 as your KDF, you're probably still using pbkdf2 even if you've upgraded to a system that would use argon2id on a fresh install. Thankfully, this can all be fixed-up in place. But note that if anything goes wrong here you could lose access to all your encrypted data, so before doing anything make sure it's all backed up (and figure out how to keep said backup secure so you don't just have your data seized that way).

First, make sure you're running as up-to-date a version of your distribution as possible. Having tools that support the LUKS2 format doesn't mean that your distribution has all of that integrated, and old distribution versions may allow you to update your LUKS setup without actually supporting booting from it. Also, if you're using an encrypted /boot, stop now - very recent versions of grub2 support LUKS2, but they don't support argon2id, and this will render your system unbootable.

Next, figure out which device under /dev corresponds to your encrypted partition. Run

lsblk

and look for entries that have a type of "crypt". The device above that in the tree is the actual encrypted device. Record that name, and run

sudo cryptsetup luksHeaderBackup /dev/whatever --header-backup-file /tmp/luksheader

and copy that to a USB stick or something. If something goes wrong here you'll be able to boot a live image and run

sudo cryptsetup luksHeaderRestore /dev/whatever --header-backup-file luksheader

to restore it.

(Edit to add: Once everything is working, delete this backup! It contains the old weak key, and someone with it can potentially use that to brute force your disk encryption key using the old KDF even if you've updated the on-disk KDF.)

Next, run

sudo cryptsetup luksDump /dev/whatever

and look for the Version: line. If it's version 1, you need to update the header to LUKS2. Run

sudo cryptsetup convert /dev/whatever --type luks2

and follow the prompts. Make sure your system still boots, and if not go back and restore the backup of your header. Assuming everything is ok at this point, run

sudo cryptsetup luksDump /dev/whatever

again and look for the PBKDF: line in each keyslot (pay attention only to the keyslots, ignore any references to pbkdf2 that come after the Digests: line). If the PBKDF is either "pbkdf2" or "argon2i" you should convert to argon2id. Run the following:

sudo cryptsetup luksConvertKey /dev/whatever --pbkdf argon2id

and follow the prompts. If you have multiple passwords associated with your drive you'll have multiple keyslots, and you'll need to repeat this for each password.

Distributions! You should really be handling this sort of thing on upgrade. People who installed their systems with your encryption defaults several years ago are now much less secure than people who perform a fresh install today. Please please please do something about this.

CPUs can't do anything without being told what to do, which leaves the obvious problem of how do you tell a CPU to do something in the first place. On many CPUs this is handled in the form of a reset vector - an address the CPU is hardcoded to start reading instructions from when power is applied. The address the reset vector points to will typically be some form of ROM or flash that can be read by the CPU even if no other hardware has been configured yet. This allows the system vendor to ship code that will be executed immediately after poweron, configuring the rest of the hardware and eventually getting the system into a state where it can run user-supplied code.

The specific nature of the reset vector on x86 systems has varied over time, but it's effectively always been 16 bytes below the top of the address space - so, 0xffff0 on the 20-bit 8086, 0xfffff0 on the 24-bit 80286, and 0xfffffff0 on the 32-bit 80386. Convention on x86 systems is to have RAM starting at address 0, so the top of address space could be used to house the reset vector with as low a probability of conflicting with RAM as possible.

The most notable thing about x86 here, though, is that when it starts running code from the reset vector, it's still in real mode. x86 real mode is a holdover from a much earlier era of computing. Rather than addresses being absolute (ie, if you refer to a 32-bit address, you store the entire address in a 32-bit or larger register), they are 16-bit offsets that are added to the value stored in a "segment register". Different segment registers existed for code, data, and stack, so a 16-bit address could refer to different actual addresses depending on how it was being interpreted - jumping to a 16 bit address would result in that address being added to the code segment register, while reading from a 16 bit address would result in that address being added to the data segment register, and so on. This is all in order to retain compatibility with older chips, to the extent that even 64-bit x86 starts in real mode with segments and everything (and, also, still starts executing at 0xfffffff0 rather than 0xfffffffffffffff0 - 64-bit mode doesn't support real mode, so there's no way to express a 64-bit physical address using the segment registers, so we still start just below 4GB even though we have massively more address space available).

Anyway. Everyone knows all this. For modern UEFI systems, the firmware that's launched from the reset vector then reprograms the CPU into a sensible mode (ie, one without all this segmentation bullshit), does things like configure the memory controller so you can actually access RAM (a process which involves using CPU cache as RAM, because programming a memory controller is sufficiently hard that you need to store more state than you can fit in registers alone, which means you need RAM, but you don't have RAM until the memory controller is working, but thankfully the CPU comes with several megabytes of RAM on its own in the form of cache, so phew). It's kind of ugly, but that's a consequence of a bunch of well-understood legacy decisions.

Except. This is not how modern Intel x86 boots. It's far stranger than that. Oh, yes, this is what it looks like is happening, but there's a bunch of stuff going on behind the scenes. Let's talk about boot security. The idea of any form of verified boot (such as UEFI Secure Boot) is that a signature on the next component of the boot chain is validated before that component is executed. But what verifies the first component in the boot chain? You can't simply ask the BIOS to verify itself - if an attacker can replace the BIOS, they can replace it with one that simply lies about having done so. Intel's solution to this is called Boot Guard.

But before we get to Boot Guard, we need to ensure the CPU is running in as bug-free a state as possible. So, when the CPU starts up, it examines the system flash and looks for a header that points at CPU microcode updates. Intel CPUs ship with built-in microcode, but it's frequently old and buggy and it's up to the system firmware to include a copy that's new enough that it's actually expected to work reliably. The microcode image is pulled out of flash, a signature is verified, and the new microcode starts running. This is true in both the Boot Guard and the non-Boot Guard scenarios. But for Boot Guard, before jumping to the reset vector, the microcode on the CPU reads an Authenticated Code Module (ACM) out of flash and verifies its signature against a hardcoded Intel key. If that checks out, it starts executing the ACM. Now, bear in mind that the CPU can't just verify the ACM and then execute it directly from flash - if it did, the flash could detect this, hand over a legitimate ACM for the verification, and then feed the CPU different instructions when it reads them again to execute them (a Time of Check vs Time of Use, or TOCTOU, vulnerability). So the ACM has to be copied onto the CPU before it's verified and executed, which means we need RAM, which means the CPU already needs to know how to configure its cache to be used as RAM.

Anyway. We now have an ACM loaded and verified, and it can safely be executed. The ACM does various things, but the most important from the Boot Guard perspective is that it reads a set of write-once fuses in the motherboard chipset that represent the SHA256 of a public key. It then reads the initial block of the firmware (the Initial Boot Block, or IBB) into RAM (or, well, cache, as previously described) and parses it. There's a block that contains a public key - it hashes that key and verifies that it matches the SHA256 from the fuses. It then uses that key to validate a signature on the IBB. If it all checks out, it executes the IBB and everything starts looking like the nice simple model we had before.

Except, well, doesn't this seem like an awfully complicated bunch of code to implement in real mode? And yes, doing all of this modern crypto with only 16-bit registers does sound like a pain. So, it doesn't. All of this is happening in a perfectly sensible 32 bit mode, and the CPU actually switches back to the awful segmented configuration afterwards so it's still compatible with an 80386 from 1986. The "good" news is that at least firmware can detect that the CPU has already configured the cache as RAM and can skip doing that itself.

I'm skipping over some steps here - the ACM actually does other stuff around measuring the firmware into the TPM and doing various bits of TXT setup for people who want DRTM in their lives, but the short version is that the CPU bootstraps itself into a state where it works like a modern CPU and then deliberately turns a bunch of the sensible functionality off again before it starts executing firmware. I'm also missing out the fact that this entire process only kicks off after the Management Engine says it can, which means we're waiting for an entirely independent x86 to boot an entire OS before our CPU even starts pretending to execute the system firmware.

Of course, as mentioned before, on modern systems the firmware will then reprogram the CPU into something actually sensible so OS developers no longer need to care about this[1][2], which means we've bounced between multiple states for no reason other than the possibility that someone wants to run legacy BIOS and then boot DOS on a CPU with like 5 orders of magnitude more transistors than the 8086.

tl;dr why can't my x86 wake up with the ~~gin~~ protected mode already inside it

[1] Ha uh except that on ACPI resume we're going to skip most of the firmware setup code so we still need to handle the CPU being in fucking 16-bit mode because suspend/resume is basically an extremely long reboot cycle

[2] Oh yeah also you probably have multiple cores on your CPU and well bad news about the state most of the cores are in when the OS boots because the firmware never started them up so they're going to come up in 16-bit real mode even if your boot CPU is already in 64-bit protected mode, unless you were using TXT in which case you have a different sort of nightmare that if we're going to try to map it onto real world nightmare concepts is one that involves a lot of teeth. Or, well, that used to be the case, but ACPI 6.4 (released in 2021) provides a mechanism for the OS to ask the firmware to wake the CPU up for it so this is invisible to the OS, but you're still relying on the firmware to actually do the heavy lifting here

Github accidentally committed their SSH RSA private key to a repository, and now a bunch of people's infrastructure is broken because it needs to be updated to trust the new key. This is obviously bad, but what's frustrating is that there's no inherent need for it to be - almost all the technological components needed to both reduce the initial risk and to make the transition seamless already exist.

But first, let's talk about what actually happened here. You're probably used to the idea of TLS certificates from using browsers. Every website that supports TLS has an asymmetric pair of keys divided into a public key and a private key. When you contact the website, it gives you a certificate that contains the public key, and your browser then performs a series of cryptographic operations against it to (a) verify that the remote site possesses the private key (which prevents someone just copying the certificate to another system and pretending to be the legitimate site), and (b) generate an ephemeral encryption key that's used to actually encrypt the traffic between your browser and the site. But what stops an attacker from simply giving you a fake certificate that contains their public key? The certificate is itself signed by a certificate authority (CA), and your browser is configured to trust a preconfigured set of CAs. CAs will not give someone a signed certificate unless they prove they have legitimate ownership of the site in question, so (in theory) an attacker will never be able to obtain a fake certificate for a legitimate site.

This infrastructure is used for pretty much every protocol that can use TLS, including things like SMTP and IMAP. But SSH doesn't use TLS, and doesn't participate in any of this infrastructure. Instead, SSH tends to take a "Trust on First Use" (TOFU) model - the first time you ssh into a server, you receive a prompt asking you whether you trust its public key, and then you probably hit the "Yes" button and get on with your life. This works fine up until the point where the key changes, and SSH suddenly starts complaining that there's a mismatch and something awful could be happening (like someone intercepting your traffic and directing it to their own server with their own keys). Users are then supposed to verify whether this change is legitimate, and if so remove the old keys and add the new ones. This is tedious and risks users just saying "Yes" again, and if it happens too often an attacker can simply redirect target users to their own server and through sheer fatigue at dealing with this crap the user will probably trust the malicious server.

Why not certificates? OpenSSH actually does support certificates, but not in the way you might expect. There's a custom format that's significantly less complicated than the X509 certificate format used in TLS. Basically, an SSH certificate just contains a public key, a list of hostnames it's good for, and a signature from a CA. There's no pre-existing set of trusted CAs, so anyone could generate a certificate that claims it's valid for, say, github.com. This isn't really a problem, though, because right now nothing pays attention to SSH host certificates unless there's some manual configuration.

(It's actually possible to glue the general PKI infrastructure into SSH certificates. Please do not do this)

So let's look at what happened in the Github case. The first question is "How could the private key have been somewhere that could be committed to a repository in the first place?". I have no unique insight into what happened at Github, so this is conjecture, but I'm reasonably confident in it. Github deals with a large number of transactions per second. Github.com is not a single computer - it's a large number of machines. All of those need to have access to the same private key, because otherwise git would complain that the private key had changed whenever it connected to a machine with a different private key (the alternative would be to use a different IP address for every frontend server, but that would instead force users to repeatedly accept additional keys every time they connect to a new IP address). Something needs to be responsible for deploying that private key to new systems as they're brought up, which means there's ample opportunity for it to accidentally end up in the wrong place.

Now, best practices suggest that this should be avoided by simply placing the private key in a hardware module that performs the cryptographic operations, ensuring that nobody can ever get at the private key. The problem faced here is that HSMs typically aren't going to be fast enough to handle the number of requests per second that Github deals with. This can be avoided by using something like a Nitro Enclave, but you're still going to need a bunch of these in different geographic locales because otherwise your front ends are still going to be limited by the need to talk to an enclave on the other side of the planet, and now you're still having to deal with distributing the private key to a bunch of systems.

What if we could have the best of both worlds - the performance of private keys that just happily live on the servers, and the security of private keys that live in HSMs? Unsurprisingly, we can! The SSH private key could be deployed to every front end server, but every minute it could call out to an HSM-backed service and request a new SSH host certificate signed by a private key in the HSM. If clients are configured to trust the key that's signing the certificates, then it doesn't matter what the private key on the servers is - the client will see that there's a valid certificate and will trust the key, even if it changes. Restricting the validity of the certificate to a small window of time means that if a key is compromised an attacker can't do much with it - the moment you become aware of that you stop signing new certificates, and once all the existing ones expire the old private key becomes useless. You roll out a new private key with new certificates signed by the same CA and clients just carry on trusting it without any manual involvement.

Why don't we have this already? The main problem is that client tooling just doesn't handle this well. OpenSSH has no way to do TOFU for CAs, just the keys themselves. This means there's no way to do a git clone ssh://git@github.com/whatever and get a prompt asking you to trust Github's CA. Instead, you need to add a @cert-authority github.com (key) line to your known_hosts file by hand, and since approximately nobody's going to do that there's only marginal benefit in going to the effort to implement this infrastructure. The most important thing we can do to improve the security of the SSH ecosystem is to make it easier to use certificates, and that means improving the behaviour of the clients.

It should be noted that certificates aren't the only approach to handling key migration. OpenSSH supports a protocol for key rotation, basically by allowing the server to provide a set of multiple trusted keys that the client can cache, and then invalidating old ones. Unfortunately this still requires that the "new" private keys be deployed in the same way as the old ones, so any screwup that results in one private key being leaked may well also result in the additional keys being leaked. I prefer the certificate approach.

Finally, I've seen a couple of people imply that the blame here should be attached to whoever or whatever caused the private key to be committed to a repository in the first place. This is a terrible take. Humans will make mistakes, and your systems should be resilient against that. There's no individual at fault here - there's a series of design decisions that made it possible for a bad outcome to occur, and in a better universe they wouldn't have been necessary. Let's work on building that better universe.

In one week from now, Twitter will block free API access. This prevents anyone who has written interesting bot accounts, integrations, or tooling from accessing Twitter without paying for it. A whole number of fascinating accounts will cease functioning, people will no longer be able to use tools that interact with Twitter, and anyone using a free service to do things like find Twitter mutuals who have moved to Mastodon or to cross-post between Twitter and other services will be blocked.

There's a cynical interpretation to this, which is that despite firing 75% of the workforce Twitter is still not profitable and Elon is desperate to not have Twitter go bust and also not to have to tank even more of his Tesla stock to achieve that. But let's go with the less cynical interpretation, which is that API access to Twitter is something that enables bot accounts that make things worse for everyone. Except, well, why would a hostile bot account do that?

To interact with an API you generally need to present some sort of authentication token to the API to prove that you're allowed to access it. It's easy enough to restrict issuance of those tokens to people who pay for the service. But, uh, how do the apps work? They need to be able to communicate with the service to tell it to post tweets, retrieve them, and so on. And the simple answer to that is that they use some hardcoded authentication tokens. And while registering for an API token yourself identifies that you're not using an official client, using the tokens embedded in the clients makes it look like you are. If you want to make it look like you're a human, you're already using tokens ripped out of the official clients.

The Twitter client API keys are widely known. Anyone who's pretending to be a human is using those already and will be unaffected by the shutdown of the free API tier. Services like movetodon.org do get blocked. This isn't an anti-abuse choice. It's one that makes it harder to move to other services. It's one that blocks a bunch of the integrations and accounts that bring value to the platform. It's one that hurts people who follow the rules, without hurting the ones who don't. This isn't an anti-abuse choice, it's about trying to consolidate control of the platform.

After my previous efforts, I wrote up a PKCS#11 module of my own that had no odd restrictions about using non-RSA keys and I tested it. And things looked much better - ssh successfully obtained the key, negotiated with the server to determine that it was present in authorized_keys, and then went to actually do the key verification step. At which point things went wrong - the Sign() method in my PKCS#11 module was never called, and a strange
debug1: identity_sign: sshkey_sign: error in libcrypto
sign_and_send_pubkey: signing failed for ECDSA "testkey": error in libcrypto"
error appeared in the ssh output. Odd. libcrypto was originally part of OpenSSL, but Apple ship the LibreSSL fork. Apple don't include the LibreSSL source in their public source repo, but do include OpenSSH. I grabbed the OpenSSH source and jumped through a whole bunch of hoops to make it build (it uses the macosx.internal SDK, which isn't publicly available, so I had to cobble together a bunch of headers from various places), and also installed upstream LibreSSL with a version number matching what Apple shipped. And everything worked - I logged into the server using a hardware-backed key.

Was the difference in OpenSSH or in LibreSSL? Telling my OpenSSH to use the system libcrypto resulted in the same failure, so it seemed pretty clear this was an issue with the Apple version of the library. The way all this works is that when OpenSSH has a challenge to sign, it calls ECDSA_do_sign(). This then calls ECDSA_do_sign_ex(), which in turn follows a function pointer to the actual signature method. By default this is a software implementation that expects to have the private key available, but you can also register your own callback that will be used instead. The OpenSSH PKCS#11 code does this by calling EC_KEY_set_method(), and as a result calling ECDSA_do_sign() ends up calling back into the PKCS#11 code that then calls into the module that communicates with the hardware and everything works.

Except it doesn't under macOS. Running under a debugger and setting a breakpoint on EC_do_sign(), I saw that we went down a code path with a function called ECDSA_do_sign_new(). This doesn't appear in any of the public source code, so seems to be an Apple-specific patch. I pushed Apple's libcrypto into Ghidra and looked at ECDSA_do_sign() and found something that approximates this:

nid = EC_GROUP_get_curve_name(curve);
if (nid == NID_X9_62_prime256v1) {
  return ECDSA_do_sign_new(dgst,dgst_len,eckey);
}
return ECDSA_do_sign_ex(dgst,dgst_len,NULL,NULL,eckey);

What this means is that if you ask ECDSA_do_sign() to sign something on a Mac, and if the key in question corresponds to the NIST P256 elliptic curve type, it goes down the ECDSA_do_sign_new() path and never calls the registered callback. This is the only key type supported by the Apple Secure Enclave, so I assume it's special-cased to do something with that. Unfortunately the consequence is that it's impossible to use a PKCS#11 module that uses Secure Enclave keys with the shipped version of OpenSSH under macOS. For now I'm working around this with an SSH agent built using Go's agent module, forwarding most requests through to the default session agent but appending hardware-backed keys and implementing signing with them, which is probably what I should have done in the first place.

Working in information security means building controls, developing technologies that ensure that sensitive material can only be accessed by people that you trust. It also means categorising people into "trustworthy" and "untrustworthy", and trying to come up with a reasonable way to apply that such that people can do their jobs without all your secrets being available to just anyone in the company who wants to sell them to a competitor. It means ensuring that accounts who you consider to be threats shouldn't be able to do any damage, because if someone compromises an internal account you need to be able to shut them down quickly.

And like pretty much any security control, this can be used for both good and bad. The technologies you develop to monitor users to identify compromised accounts can also be used to compromise legitimate users who management don't like. The infrastructure you build to push updates to users can also be used to push browser extensions that interfere with labour organisation efforts. In many cases there's no technical barrier between something you've developed to flag compromised accounts and the same technology being used to flag users who are unhappy with certain aspects of management.

If you're asked to build technology that lets you make this sort of decision, think about whether that's what you want to be doing. Think about who can compel you to use it in ways other than how it was intended. Consider whether that's something you want on your conscience. And then think about whether you can meet those requirements in a different way. If they can simply compel one junior engineer to alter configuration, that's very different to an implementation that requires sign-offs from multiple senior developers. Make sure that all such policy changes have to be clearly documented, including not just who signed off on it but who asked them to. Build infrastructure that creates a record of who decided to fuck over your coworkers, rather than just blaming whoever committed the config update. The blame trail should never terminate in the person who was told to do something or get fired - the blame trail should clearly indicate who ordered them to do that.

But most importantly: build security features as if they'll be used against you.

There's a bunch of ways you can store cryptographic keys. The most obvious is to just stick them on disk, but that has the downside that anyone with access to the system could just steal them and do whatever they wanted with them. At the far end of the scale you have Hardware Security Modules (HSMs), hardware devices that are specially designed to self destruct if you try to take them apart and extract the keys, and which will generate an audit trail of every key operation. In between you have things like smartcards, TPMs, Yubikeys, and other platform secure enclaves - devices that don't allow arbitrary access to keys, but which don't offer the same level of assurance as an actual HSM (and are, as a result, orders of magnitude cheaper).

The problem with all of these hardware approaches is that they have entirely different communication mechanisms. The industry realised this wasn't ideal, and in 1994 RSA released version 1 of the PKCS#11 specification. This defines a C interface with a single entry point - C_GetFunctionList. Applications call this and are given a structure containing function pointers, with each entry corresponding to a PKCS#11 function. The application can then simply call the appropriate function pointer to trigger the desired functionality, such as "Tell me how many keys you have" and "Sign this, please". This is both an example of C not just being a programming language and also of you having to shove a bunch of vendor-supplied code into your security critical tooling, but what could possibly go wrong.

(Linux distros work around this problem by using p11-kit, which is a daemon that speaks d-bus and loads PKCS#11 modules for you. You can either speak to it directly over d-bus, or for apps that only speak PKCS#11 you can load a module that just transports the PKCS#11 commands over d-bus. This moves the weird vendor C code out of process, and also means you can deal with these modules without having to speak the C ABI, so everyone wins)

One of my work tasks at the moment is helping secure SSH keys, ensuring that they're only issued to appropriate machines and can't be stolen afterwards. For Windows and Linux machines we can stick them in the TPM, but Macs don't have a TPM as such. Instead, there's the Secure Enclave - part of the T2 security chip on x86 Macs, and directly integrated into the M-series SoCs. It doesn't have anywhere near as many features as a TPM, let alone an HSM, but it can generate NIST curve elliptic curve keys and sign things with them and that's good enough. Things are made more complicated by Apple only allowing keys to be used by the app that generated them, so it's hard for applications to generate keys on behalf of each other. This can be mitigated by using CryptoTokenKit, an interface that allows apps to present tokens to the systemwide keychain. Although this is intended for allowing a generic interface for access to such tokens (kind of like PKCS#11), an app can generate its own keys in the Secure Enclave and then expose them to other apps via the keychain through CryptoTokenKit.

Of course, applications then need to know how to communicate with the keychain. Browsers mostly do so, and Apple's version of SSH can to an extent. Unfortunately, that extent is "Retrieve passwords to unlock on-disk keys", which doesn't help in our case. PKCS#11 comes to the rescue here! Apple ship a module called ssh-keychain.dylib, a PKCS#11 module that's intended to allow SSH to use keys that are present in the system keychain. Unfortunately it's not super well maintained - it got broken when Big Sur moved all the system libraries into a cache, but got fixed up a few releases later. Unfortunately every time I tested it with our CryptoTokenKit provider (and also when I retried with SecureEnclaveToken to make sure it wasn't just our code being broken), ssh would tell me "provider /usr/lib/ssh-keychain.dylib returned no slots" which is not especially helpful. Finally I realised that it was actually generating more debug output, but it was being sent to the system debug logs rather than the ssh debug output. Well, when I say "more debug output", I mean "Certificate []: algorithm is not supported, ignoring it", which still doesn't tell me all that much. So I stuck it in Ghidra and searched for that string, and the line above it was

iVar2 = __auth_stubs::_objc_msgSend(uVar7,"isEqual:",*(undefined8*)__got::_kSecAttrKeyTypeRSA);

with it immediately failing if the key isn't RSA. Which it isn't, since the Secure Enclave doesn't support RSA. Apple's PKCS#11 module appears incapable of making use of keys generated on Apple's hardware.

There's a couple of ways of dealing with this. The first, which is taken by projects like Secretive, is to implement the SSH agent protocol and have SSH delegate key management to that agent, which can then speak to the keychain. But if you want this to work in all cases you need to implement all the functionality in the existing ssh-agent, and that seems like a bunch of work. The second is to implement a PKCS#11 module, which sounds like less work but probably more mental anguish. I'll figure that out tomorrow.

Long-term Linux users may remember that Alan Cox used to write an online diary. This was before the concept of a "Weblog" had really become a thing, and there certainly weren't any expectations around what one was used for - while now blogging tends to imply a reasonably long-form piece on a specific topic, Alan was just sitting there noting small life concerns or particular technical details in interesting problems he'd solved that day. For me, that was fascinating. I was trying to figure out how to get into kernel development, and was trying to read as much LKML as I could to figure out how kernel developers did stuff. But when you see discussion on LKML, you're frequently missing the early stages. If an LKML patch is a picture of an owl, I wanted to know how to draw the owl, and most of the conversations about starting in kernel development were very "Draw two circles. Now draw the rest of the owl". Alan's musings gave me insight into the thought processes involved in getting from "Here's the bug" to "Here's the patch" in ways that really wouldn't have worked in a more long-form medium.

For the past decade or so, as I moved away from just doing kernel development and focused more on security work instead, Twitter's filled a similar role for me. I've seen people just dumping their thought process as they work through a problem, helping me come up with effective models for solving similar problems. I've learned that the smartest people in the field will spend hours (if not days) working on an issue before realising that they misread something back at the beginning and that's helped me feel like I'm not unusually bad at any of this. It's helped me learn more about my peers, about my field, and about myself.

Twitter's now under new ownership that appears to think all the worst bits of Twitter were actually the good bits, so I've mostly bailed to the Fediverse instead. There's no intrinsic length limit on posts there - Mastodon defaults to 500 characters per post, but that's configurable per instance. But even at 500 characters, it means there's more room to provide thoughtful context than there is on Twitter, and what I've seen so far is more detailed conversation and higher levels of meaningful engagement. Which is great! Except it also seems to discourage some of the posting style that I found so valuable on Twitter - if your timeline is full of nuanced discourse, it feels kind of rude to just scream "THIS FUCKING PIECE OF SHIT IGNORES THE HIGH ADDRESS BIT ON EVERY OTHER WRITE" even though that's exactly the sort of content I'm there for.

And, yeah, not everything has to be for me. But I worry that as Twitter's relevance fades for the people I'm most interested in, we're replacing it with something that's not equivalent - something that doesn't encourage just dropping 50 characters or so of your current thought process into a space where it can be seen by thousands of people. And I think that's a shame.

I've written about bearer tokens and how much pain they cause me before, but sadly wishing for a better world doesn't make it happen so I'm making do with what's available. Okta has a feature called Device Trust which allows to you configure access control policies that prevent people obtaining tokens unless they're using a trusted device. This doesn't actually bind the tokens to the hardware in any way, so if a device is compromised or if a user is untrustworthy this doesn't prevent the token ending up on an unmonitored system with no security policies. But it's an incremental improvement, other than the fact that for desktop it's only supported on Windows and MacOS, which really doesn't line up well with my interests.

Obviously there's nothing fundamentally magic about these platforms, so it seemed fairly likely that it would be possible to make this work elsewhere. I spent a while staring at the implementation using Charles Proxy and the Chrome developer tools network tab and had worked out a lot, and then Okta published a paper describing a lot of what I'd just laboriously figured out. But it did also help clear up some points of confusion and clarified some design choices. I'm not going to give a full description of the details (with luck there'll be code shared for that before too long), but here's an outline of how all of this works. Also, to be clear, I'm only going to talk about the desktop support here - mobile is a bunch of related but distinct things that I haven't looked at in detail yet.

Okta's Device Trust (as officially supported) relies on Okta Verify, a local agent. When initially installed, Verify authenticates as the user, obtains a token with a scope that allows it to manage devices, and then registers the user's computer as an additional MFA factor. This involves it generating a JWT that embeds a number of custom claims about the device and its state, including things like the serial number. This JWT is signed with a locally generated (and hardware-backed, using a TPM or Secure Enclave) key, which allows Okta to determine that any future updates from a device claiming the same identity are genuinely from the same device (you could construct an update with a spoofed serial number, but you can't copy the key out of a TPM so you can't sign it appropriately). This is sufficient to get a device registered with Okta, at which point it can be used with Fastpass, Okta's hardware-backed MFA mechanism.

As outlined in the aforementioned deep dive paper, Fastpass is implemented via multiple mechanisms. I'm going to focus on the loopback one, since it's the one that has the strongest security properties. In this mode, Verify listens on one of a list of 10 or so ports on localhost. When you hit the Okta signin widget, choosing Fastpass triggers the widget into hitting each of these ports in turn until it finds one that speaks Fastpass and then submits a challenge to it (along with the URL that's making the request). Verify then constructs a response that includes the challenge and signs it with the hardware-backed key, along with information about whether this was done automatically or whether it included forcing the user to prove their presence. Verify then submits this back to Okta, and if that checks out Okta completes the authentication.

Doing this via loopback from the browser has a bunch of nice properties, primarily around the browser providing information about which site triggered the request. This means the Verify agent can make a decision about whether to submit something there (ie, if a fake login widget requests your creds, the agent will ignore it), and also allows the issued token to be cross-checked against the site that requested it (eg, if g1thub.com requests a token that's valid for github.com, that's a red flag). It's not quite at the same level as a hardware WebAuthn token, but it has many of the anti-phishing properties.

But none of this actually validates the device identity! The entire registration process is up to the client, and clients are in a position to lie. Someone could simply reimplement Verify to lie about, say, a device serial number when registering, and there'd be no proof to the contrary. Thankfully there's another level to this to provide stronger assurances. Okta allows you to provide a CA root[1]. When Okta issues a Fastpass challenge to a device the challenge includes a list of the trusted CAs. If a client has a certificate that chains back to that, it can embed an additional JWT in the auth JWT, this one containing the certificate and signed with the certificate's private key. This binds the CA-issued identity to the Fastpass validation, and causes the device to start appearing as "Managed" in the Okta device management UI. At that point you can configure policy to restrict various apps to managed devices, ensuring that users are only able to get tokens if they're using a device you've previously issued a certificate to.

I've managed to get Linux tooling working with this, though there's still a few drawbacks. The main issue is that the API only allows you to register devices that declare themselves as Windows or MacOS, followed by the login system sniffing browser user agent and only offering Fastpass if you're on one of the officially supported platforms. This can be worked around with an extension that spoofs user agent specifically on the login page, but that's still going to result in devices being logged as a non-Linux OS which makes interpreting the logs more difficult. There's also no ability to choose which bits of device state you log: there's a couple of existing integrations, and otherwise a fixed set of parameters that are reported. It'd be lovely to be able to log arbitrary material and make policy decisions based on that.

This also doesn't help with ChromeOS. There's no real way to automatically launch something that's bound to localhost (you could probably make this work using Crostini but there's no way to launch a Crostini app at login), and access to hardware-backed keys is kind of a complicated topic in ChromeOS for privacy reasons. I haven't tried this yet, but I think using an enterprise force-installed extension and the chrome.enterprise.platformKeys API to obtain a device identity cert and then intercepting requests to the appropriate port range on localhost ought to be enough to do that? But I've literally never written any Javascript so I don't know. Okta supports falling back from the loopback protocol to calling a custom URI scheme, but once you allow that you're also losing a bunch of the phishing protection, so I'd prefer not to take that approach.

Like I said, none of this prevents exfiltration of bearer tokens once they've been issued, and there's still a lot of ecosystem work to do there. But ensuring that tokens can't be issued to unmanaged machines in the first place is still a step forwards, and with luck we'll be able to make use of this on Linux systems without relying on proprietary client-side tooling.

(Time taken to code this implementation: about two days, and under 1000 lines of new code. Time taken to figure out what the fuck to write: rather a lot longer)

[1] There's also support for having Okta issue certificates, but then you're kind of back to the "How do I know this is my device" situation

I have a patchset that makes it safe to use hibernation under Linux even in a secure boot world but it relies on preventing userland from using one of the TPM PCRs[1] that would otherwise be available to it. TPM 2 devices support emulating PCRs in NVRAM, which would obviously be beneficial in not taking away a scare resource, but the docs suggest there's no way to tie a key generation to an NVRAM value which makes this more complicated.

So, why not ask a computer to do this for me? I asked ChatGPT to write me some code that would bind a TPM key to an NVRAM value. After a couple of false starts it told me:To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command to define an NV index with the TPMA_NV_BIND attribute. Woo! That sounds perfect. Only one problem - I can't actually find the TPMA_NV_BIND attribute defined anywhere in the spec. Well, ask the AI again.

Here is the definition of TPMA_NV_BIND from the TPM 2.0 specification:
TPMA_NV_BIND (BIT 17) If SET, the contents of the NV Index are bound to the platform hierarchy.

Ok, that sounds convincing. Where is it in the spec?

The definition of TPMA_NV_BIND is on page 126 of the TPM 2.0 specification, in section 9.7.1 "TPMA_NV".

Huh. Nope. Page 126 says nothing about this. Section 9.7.1 says nothing about this. The TPMA_NV bits are actually described on page 168 of volume 2 of the spec, in section 13.4. Bit 17 is TPMA_NV_OWNERREAD. TPMA_NV_BIND does not appear in the spec anywhere. I told the AI that.

I apologize for the error in my previous response. You are correct that the TPMA_NV_BIND attribute does not appear in the TPM 2.0 specification.

To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command with the TPMA_NV_BIND attribute.

I'm not afraid of being unemployed in the near future.

[1] "Platform Configuration Registers" - a modern TPM has 24 of these. When something security sensitive happens, you generate a hash of that event and pass it to the TPM. The TPM appends that to an existing PCR value and then hashes that concatenated value and sets the PCR to that. This means the PCR value depends not only on the values provided, but also the order they're provided in. Various TPM operations can be made conditional on the PCR values meeting specific criteria.

Update: There's actually a more detailed writeup of this here that I somehow missed. Original entry follows:

Today I had to deal with a system that had an irritating restriction - a firmware configuration option I really wanted to be able to change appeared as a greyed out entry in the configuration menu. Some emails revealed that this was a deliberate choice on the part of the system vendor, so that seemed to be that. Thankfully in this case there was a way around that.

One of the things UEFI introduced was a mechanism to generically describe firmware configuration options, called Visual Forms Representation (or VFR). At the most straightforward level, this lets you define a set of forms containing questions, with each question associated with a value in a variable. Questions can be made dependent upon the answers to other questions, so you can have options that appear or disappear based on how other questions were answered. An example in this language might be something like:
CheckBox Prompt: "Console Redirection", Help: "Console Redirection Enable or Disable.", QuestionFlags: 0x10, QuestionId: 53, VarStoreId: 1, VarStoreOffset: 0x39, Flags: 0x0
In which question 53 asks whether console redirection should be enabled or disabled. Other questions can then rely on the answer to question 53 to influence whether or not they're relevant (eg, if console redirection is disabled, there's no point in asking which port it should be redirected to). As a checkbox, if it's set then the value will be set to 1, and 0 otherwise. But where's that stored? Earlier we have another declaration:
VarStore GUID: EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9, VarStoreId: 1, Size: 0xF4, Name: "Setup"
A UEFI variable called "Setup" and with GUID EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 is declared as VarStoreId 1 (matching the declaration in the question) and is 0xf4 bytes long. The question indicates that the offset for that variable is 0x39. Rewriting Setup-EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 with a modified value in offset 0x39 will allow direct manipulation of the config option.

But how do we get this data in the first place? VFR isn't built into the firmware directly - instead it's turned into something called Intermediate Forms Representation, or IFR. UEFI firmware images are typically in a standardised format, and you can use UEFITool to extract individual components from that firmware. If you use UEFITool to search for "Setup" there's a good chance you'll be able to find the component that implements the setup UI. Running IFRExtractor-RS against it will then pull out any IFR data it finds, and decompile that into something resembling the original VFR. And now you have the list of variables and offsets and the configuration associated with them, even if your firmware has chosen to hide those options from you.

Given that a bunch of these config values may be security relevant, this seems a little concerning - what stops an attacker who has access to the OS from simply modifying these variables directly? UEFI avoids this by having two separate stages of boot, one where the full firmware ("Boot Services") is available, and one where only a subset ("Runtime Services") is available. The transition is triggered by the OS calling ExitBootServices, indicating the handoff from the firmware owning the hardware to the OS owning the hardware. This is also considered a security boundary - before ExitBootServices everything running has been subject to any secure boot restrictions, and afterwards applications can do whatever they want. UEFI variables can be flagged as being visible in both Boot and Runtime Services, or can be flagged as Boot Services only. As long as all the security critical variables are Boot Services only, an attacker should never be able to run untrusted code that could alter them.

In my case, the firmware option I wanted to alter had been enclosed in "GrayOutIf True" blocks. But the questions were still defined and the code that acted on those options was still present, so simply modifying the variables while still inside Boot Services gave me what I wanted. Note that this isn't a given! The presence of configuration options in the IFR data doesn't mean that anything will later actually read and make use of that variable - a vendor may have flagged options as unavailable and then removed the code, but never actually removed the config data. And also please do note that the reason stuff was removed may have been that it doesn't actually work, and altering any of these variables risks bricking your hardware in a way that's extremely difficult to recover. And there's also no requirement that vendors use IFR to describe their configuration, so you may not get any help here anyway.

In summary: if you do this you may break your computer. If you don't break your computer, it might not work anyway. I'm not going to help you try to break your computer. And I didn't come up with any of this, I just didn't find it all written down in one place while I was researching it.

Turns out that linking to several days old public data in order to demonstrate that Elon's jet was broadcasting its tail number in the clear is apparently "posting private information" so for anyone looking for me there I'm actually here

First up: what I'm covering here is probably not relevant for most people. That's ok! Different situations have different threat models, and if what I'm talking about here doesn't feel like you have to worry about it, that's great! Your life is easier as a result. But I have worked in situations where we had to care about some of the scenarios I'm going to describe here, and the technologies I'm going to talk about here solve a bunch of these problems.

So. You run a typical VM in the cloud. Who has access to that VM? Well, firstly, anyone who has the ability to log into the host machine with administrative capabilities. With enough effort, perhaps also anyone who has physical access to the host machine. But the hypervisor also has the ability to inspect what's running inside a VM, so anyone with the ability to install a backdoor into the hypervisor could theoretically target you. And who's to say the cloud platform launched the correct image in the first place? The control plane could have introduced a backdoor into your image and run that instead. Or the javascript running in the web UI that you used to configure the instance could have selected a different image without telling you. Anyone with the ability to get a (cleverly obfuscated) backdoor introduced into quite a lot of code could achieve that. Obviously you'd hope that everyone working for a cloud provider is honest, and you'd also hope that their security policies are good and that all code is well reviewed before being committed. But when you have several thousand people working on various components of a cloud platform, there's always the potential for something to slip up.

Let's imagine a large enterprise with a whole bunch of laptops used by developers. If someone has the ability to push a new package to one of those laptops, they're in a good position to obtain credentials belonging to the user of that laptop. That means anyone with that ability effectively has the ability to obtain arbitrary other privileges - they just need to target someone with the privilege they want. You can largely mitigate this by ensuring that the group of people able to do this is as small as possible, and put technical barriers in place to prevent them from pushing new packages unilaterally.

Now imagine this in the cloud scenario. Anyone able to interfere with the control plane (either directly or by getting code accepted that alters its behaviour) is in a position to obtain credentials belonging to anyone running in that cloud. That's probably a much larger set of people than have the ability to push stuff to laptops, but they have much the same level of power. You'll obviously have a whole bunch of processes and policies and oversights to make it difficult for a compromised user to do such a thing, but if you're a high enough profile target it's a plausible scenario.

How can we avoid this? The easiest way is to take the people able to interfere with the control plane out of the loop. The hypervisor knows what it booted, and if there's a mechanism for the VM to pass that information to a user in a trusted way, you'll be able to detect the control plane handing over the wrong image. This can be achieved using trusted boot. The hypervisor-provided firmware performs a "measurement" (basically a cryptographic hash of some data) of what it's booting, storing that information in a virtualised TPM. This TPM can later provide a signed copy of the measurements on demand. A remote system can look at these measurements and determine whether the system is trustworthy - if a modified image had been provided, the measurements would be different. As long as the hypervisor is trustworthy, it doesn't matter whether or not the control plane is - you can detect whether you were given the correct OS image, and you can build your trust on top of that.

(Of course, this depends on you being able to verify the key used to sign those measurements. On real hardware the TPM has a certificate that chains back to the manufacturer and uniquely identifies the TPM. On cloud platforms you typically have to retrieve the public key via the metadata channel, which means you're trusting the control plane to give you information about the hypervisor in order to verify what the control plane gave to the hypervisor. This is suboptimal, even though realistically the number of moving parts in that part of the control plane is much smaller than the number involved in provisioning the instance in the first place, so an attacker managing to compromise both is less realistic. Still, AWS doesn't even give you that, which does make it all rather more complicated)

Ok, so we can (largely) decouple our trust in the VM from having to trust the control plane. But we're still relying on the hypervisor to provide those attestations. What if the hypervisor isn't trustworthy? This sounds somewhat ridiculous (if you can't run a trusted app on top of an untrusted OS, how can you run a trusted OS on top of an untrusted hypervisor?), but AMD actually have a solution for that. SEV ("Secure Encrypted Virtualisation") is a technology where (handwavily) an encryption key is generated when a new VM is created, and the memory belonging to that VM is encrypted with that key. The hypervisor has no access to that encryption key, and any access to memory initiated by the hypervisor will only see the encrypted content. This means that nobody with the ability to tamper with the hypervisor can see what's going on inside the OS (and also means that nobody with physical access can either, so that's another threat dealt with).

But how do we know that the hypervisor set this up, and how do we know that the correct image was booted? SEV has support for a "Launch attestation", a CPU generated signed statement that it booted the current VM with SEV enabled. But it goes further than that! The attestation includes a measurement of what was booted, which means we don't need to trust the hypervisor at all - the CPU itself will tell us what image we were given. Perfect.

Except, well. There's a few problems. AWS just doesn't have any VMs that implement SEV yet (there are bare metal instances that do, but obviously you're building your own infrastructure to make that work). Google only seem to provide the launch measurement via the logging service - and they only include the parsed out data, not the original measurement. So, we still have to trust (a subset of) the control plane. Azure provides it via a separate attestation service, but again it doesn't seem to provide the raw attestation and so you're still trusting the attestation service. For the newest generation of SEV, SEV-SNP, this is less of a big deal because the guest can provide its own attestation. But Google doesn't offer SEV-SNP hardware yet, and the driver you need for this only shipped in Linux 5.19 and Azure's SEV Ubuntu images only offer up to 5.15 at the moment, so making use of that means you're putting your own image together at the moment.

And there's one other kind of major problem. A normal VM image provides a bootloader and a kernel and a filesystem. That bootloader needs to run on something. That "something" is typically hypervisor-provided "firmware" - for instance, OVMF. This probably has some level of cloud vendor patching, and they probably don't ship the source for it. You're just having to trust that the firmware is trustworthy, and we're talking about trying to avoid placing trust in the cloud provider. Azure has a private beta allowing users to upload images that include their own firmware, meaning that all the code you trust (outside the CPU itself) can be provided by the user, and once that's GA it ought to be possible to boot Azure VMs without having to trust any Microsoft-provided code.

Well, mostly. As AMD admit, SEV isn't guaranteed to be resistant to certain microarchitectural attacks. This is still much more restrictive than the status quo where the hypervisor could just read arbitrary content out of the VM whenever it wanted to, but it's still not ideal. Which, to be fair, is where we are with CPUs in general.

(Thanks to Leonard Cohnen who gave me a bunch of excellent pointers on this stuff while I was digging through it yesterday)

I've been ridiculously burned out for a while now but I'm taking the month off to recover and that's giving me an opportunity to catch up on a lot of stuff. This has included me actually writing some code to work with the Pluton in my Thinkpad Z13. I've learned some more stuff in the process, but based on everything I know I'd still say that in its current form Pluton isn't a threat to free software.

So, first up: by default on the Z13, Pluton is disabled. It's not obviously exposed to the OS at all, which also means there's no obvious mechanism for Microsoft to push out a firmware update to it via Windows Update. The Windows drivers that bind to Pluton don't load in this configuration. It's theoretically possible that there's some hidden mechanism to re-enable it at runtime, but that code doesn't seem to be in Windows at the moment. I'm reasonably confident that "Disabled" is pretty genuinely disabled.

Second, when enabled, Pluton exposes two separate devices. The first of these has an MSFT0101 identifier in ACPI, which is the ID used for a TPM 2 device. The Pluton TPM implementation doesn't work out of the box with existing TPM 2 drivers, though, because it uses a custom start method. TPM 2 devices commonly use something called a "Command Response Buffer" architecture, where a command is written into a buffer, the TPM is told to do a thing, and the response to the command ends up in another buffer. The mechanism to tell the TPM to do a thing varies, and an ACPI table exposed to the OS defines which of those various things should be used for a given TPM. Pluton systems have a mechanism that isn't defined in the existing version of the spec (1.3 rev 8 at the time of writing), so I had to spend a while staring hard at the Windows drivers to figure out how to implement it. The good news is that I now have a patch that successfully gets the existing Linux TPM driver code work correctly with the Pluton implementation.

The second device has an MSFT0200 identifier, and is entirely not a TPM. The Windows driver appears to be a relatively thin layer that simply takes commands from userland and passes them on to the chip - I haven't found any userland applications that make use of this, so it's tough to figure out what functionality is actually available. But what does seem pretty clear from the code I've looked at is that it's a component that only responds when it's asked - if the OS never sends it any commands, it's not able to do anything.

One key point from this recently published Microsoft doc is that the whole "Microsoft can update Pluton firmware" thing does just seem to be the ability for the OS to push new code to the chip at runtime. That means Microsoft can't arbitrarily push new firmware to the chip - the OS needs to be involved. This is unsurprising, but it's nice to see some stronger confirmation of that.

Anyway. tl;dr - Pluton can (now) be used as a regular TPM. Pluton also exposes some additional functionality which is not yet clear, but there's no obvious mechanism for it to compromise user privacy or restrict what users can run on a Free operating system. The Pluton firmware update mechanism appears to be OS mediated, so users who control their OS can simply choose not to opt in to that.

Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at nvidia. Ex-biologist. Content here should not be interpreted as the opinion of my employer. Also on Mastodon and Bluesky.

Page Summary

Defending abuse does not defend free software
Reconstructing an invalid TPM event log
Unix sockets, Cygwin, SSH agents, and sadness
Updating Fedora the unsupported way
Roots of Trust are difficult
Twitter's e2ee DMs are better than nothing
PSA: upgrade your LUKS key derivation function
Booting modern Intel CPUs
We need better support for SSH host certificates
Blocking free API access to Twitter doesn't stop abuse
Further adventures in Apple PKCS#11 land
Build security with the assumption it will be used against your friends
PKCS#11. hardware keystores, and Apple frustrations
Blogging and microblogging
Integrating Linux with Okta Device Trust
Asking ChatGPT to write my security-sensitive code for me
Changing firmware config that doesn't want to be changed
Off Twitter for a bit
Trying to remove the need to trust cloud providers
Quick update on Pluton and Linux

Expand Cut Tags

No cut tags