Avoiding gaps in IOMMU protection at boot
Jan. 28th, 2020 01:26 pmWhen you save a large file to disk or upload a large texture to your graphics card, you probably don't want your CPU to sit there spending an extended period of time copying data between system memory and the relevant peripheral - it could be doing something more useful instead. As a result, most hardware that deals with large quantities of data is capable of Direct Memory Access (or DMA). DMA-capable devices are able to access system memory directly without the aid of the CPU - the CPU simply tells the device which region of memory to copy and then leaves it to get on with things. However, we also need to get data back to system memory, so DMA is bidirectional. This means that DMA-capable devices are able to read and write directly to system memory.
As long as devices are entirely under the control of the OS, this seems fine. However, this isn't always true - there may be bugs, the device may be passed through to a guest VM (and so no longer under the control of the host OS) or the device may be running firmware that makes it actively malicious. The third is an important point here - while we usually think of DMA as something that has to be set up by the OS, at a technical level the transactions are initiated by the device. A device that's running hostile firmware is entirely capable of choosing what and where to DMA.
Most reasonably recent hardware includes an IOMMU to handle this. The CPU's MMU exists to define which regions of memory a process can read or write - the IOMMU does the same but for external IO devices. An operating system that knows how to use the IOMMU can allocate specific regions of memory that a device can DMA to or from, and any attempt to access memory outside those regions will fail. This was originally intended to handle passing devices through to guests (the host can protect itself by restricting any DMA to memory belonging to the guest - if the guest tries to read or write to memory belonging to the host, the attempt will fail), but is just as relevant to preventing malicious devices from extracting secrets from your OS or even modifying the runtime state of the OS.
But setting things up in the OS isn't sufficient. If an attacker is able to trigger arbitrary DMA before the OS has started then they can tamper with the system firmware or your bootloader and modify the kernel before it even starts running. So ideally you want your firmware to set up the IOMMU before it even enables any external devices, and newer firmware should actually do this automatically. It sounds like the problem is solved.
Except there's a problem. Not all operating systems know how to program the IOMMU, and if a naive OS fails to remove the IOMMU mappings and asks a device to DMA to an address that the IOMMU doesn't grant access to then things are likely to explode messily. EFI has an explicit transition between the boot environment and the runtime environment triggered when the OS or bootloader calls ExitBootServices(). Various EFI components have registered callbacks that are triggered at this point, and the IOMMU driver will (in general) then tear down the IOMMU mappings before passing control to the OS. If the OS is IOMMU aware it'll then program new mappings, but there's a brief window where the IOMMU protection is missing - and a sufficiently malicious device could take advantage of that.
The ideal solution would be a protocol that allowed the OS to indicate to the firmware that it supported this functionality and request that the firmware not remove it, but in the absence of such a protocol we're left with non-ideal solutions. One is to prevent devices from being able to DMA in the first place, which means the absence of any IOMMU restrictions is largely irrelevant. Every PCI device has a busmaster bit - if the busmaster bit is disabled, the device shouldn't start any DMA transactions. Clearing that seems like a straightforward approach. Unfortunately this bit is under the control of the device itself, so a malicious device can just ignore this and do DMA anyway. Fortunately, PCI bridges and PCIe root ports should only forward DMA transactions if their busmaster bit is set. If we clear that then any devices downstream of the bridge or port shouldn't be able to DMA, no matter how malicious they are. Linux will only re-enable the bit after it's done IOMMU setup, so we should then be in a much more secure state - we still need to trust that our motherboard chipset isn't malicious, but we don't need to trust individual third party PCI devices.
This patch just got merged, adding support for this. My original version did nothing other than clear the bits on bridge devices, but this did have the potential for breaking devices that were still carrying out DMA at the moment this code ran. Ard modified it to call the driver shutdown code for each device behind a bridge before disabling DMA on the bridge, which in theory makes this safe but does still depend on the firmware drivers behaving correctly. As a result it's not enabled by default - you can either turn it on in kernel config or pass the efi=disable_early_pci_dma kernel command line argument.
In combination with firmware that does the right thing, this should ensure that Linux systems can be protected against malicious PCI devices throughout the entire boot process.
As long as devices are entirely under the control of the OS, this seems fine. However, this isn't always true - there may be bugs, the device may be passed through to a guest VM (and so no longer under the control of the host OS) or the device may be running firmware that makes it actively malicious. The third is an important point here - while we usually think of DMA as something that has to be set up by the OS, at a technical level the transactions are initiated by the device. A device that's running hostile firmware is entirely capable of choosing what and where to DMA.
Most reasonably recent hardware includes an IOMMU to handle this. The CPU's MMU exists to define which regions of memory a process can read or write - the IOMMU does the same but for external IO devices. An operating system that knows how to use the IOMMU can allocate specific regions of memory that a device can DMA to or from, and any attempt to access memory outside those regions will fail. This was originally intended to handle passing devices through to guests (the host can protect itself by restricting any DMA to memory belonging to the guest - if the guest tries to read or write to memory belonging to the host, the attempt will fail), but is just as relevant to preventing malicious devices from extracting secrets from your OS or even modifying the runtime state of the OS.
But setting things up in the OS isn't sufficient. If an attacker is able to trigger arbitrary DMA before the OS has started then they can tamper with the system firmware or your bootloader and modify the kernel before it even starts running. So ideally you want your firmware to set up the IOMMU before it even enables any external devices, and newer firmware should actually do this automatically. It sounds like the problem is solved.
Except there's a problem. Not all operating systems know how to program the IOMMU, and if a naive OS fails to remove the IOMMU mappings and asks a device to DMA to an address that the IOMMU doesn't grant access to then things are likely to explode messily. EFI has an explicit transition between the boot environment and the runtime environment triggered when the OS or bootloader calls ExitBootServices(). Various EFI components have registered callbacks that are triggered at this point, and the IOMMU driver will (in general) then tear down the IOMMU mappings before passing control to the OS. If the OS is IOMMU aware it'll then program new mappings, but there's a brief window where the IOMMU protection is missing - and a sufficiently malicious device could take advantage of that.
The ideal solution would be a protocol that allowed the OS to indicate to the firmware that it supported this functionality and request that the firmware not remove it, but in the absence of such a protocol we're left with non-ideal solutions. One is to prevent devices from being able to DMA in the first place, which means the absence of any IOMMU restrictions is largely irrelevant. Every PCI device has a busmaster bit - if the busmaster bit is disabled, the device shouldn't start any DMA transactions. Clearing that seems like a straightforward approach. Unfortunately this bit is under the control of the device itself, so a malicious device can just ignore this and do DMA anyway. Fortunately, PCI bridges and PCIe root ports should only forward DMA transactions if their busmaster bit is set. If we clear that then any devices downstream of the bridge or port shouldn't be able to DMA, no matter how malicious they are. Linux will only re-enable the bit after it's done IOMMU setup, so we should then be in a much more secure state - we still need to trust that our motherboard chipset isn't malicious, but we don't need to trust individual third party PCI devices.
This patch just got merged, adding support for this. My original version did nothing other than clear the bits on bridge devices, but this did have the potential for breaking devices that were still carrying out DMA at the moment this code ran. Ard modified it to call the driver shutdown code for each device behind a bridge before disabling DMA on the bridge, which in theory makes this safe but does still depend on the firmware drivers behaving correctly. As a result it's not enabled by default - you can either turn it on in kernel config or pass the efi=disable_early_pci_dma kernel command line argument.
In combination with firmware that does the right thing, this should ensure that Linux systems can be protected against malicious PCI devices throughout the entire boot process.