mjg59 | ACPI, kernels and contracts with firmware

ACPI is a complicated specification - the latest version is 980 pages long. But that's because it's trying to define something complicated: an entire interface for abstracting away hardware details and making it easier for an unmodified OS to boot diverse platforms.

Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.

This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.
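As a rough illustration of that exchange, here is a minimal Python sketch of the OS side of an `_OSI` query. The function name `osi` and the table are assumptions made for this example, not the kernel's actual implementation; the strings are the ones Windows XP through Windows 8 are known to answer to.

```python
# Hypothetical sketch of how an OS might answer firmware _OSI queries.
# The table is illustrative only; it is not the kernel's real list.
SUPPORTED_OSI_STRINGS = {
    "Windows 2001",  # Windows XP
    "Windows 2006",  # Windows Vista
    "Windows 2009",  # Windows 7
    "Windows 2012",  # Windows 8
}


def osi(query: str) -> bool:
    """Return True if the OS claims compatibility with the queried interface."""
    # "Linux" is refused on purpose: the name does not pin down any
    # particular set of behaviours that firmware could rely on.
    if query == "Linux":
        return False
    return query in SUPPORTED_OSI_STRINGS
```

Each string the OS answers yes to stands for a whole set of implied behaviours, which is why a bare "Linux" would carry no useful information.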

Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.

In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.

This sounds horrifying, but it's actually important. The existence of well-defined[1] OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version[2] the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.

That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?

Our even grimmer meathook future welcomes ARM to the ACPI world.

Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.

So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.
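The worst case above can be modelled in a few lines of Python. Everything here is hypothetical: the behaviour labels "A" and "B", the system names, and the "revert" kernel entry exist only to show why no single kernel satisfies both machines when the firmware does not say which behaviour it expects.

```python
from typing import List

# Hypothetical model: each kernel exhibits exactly one ACPI behaviour.
# "revert" stands for a later kernel with the 3.21 change backed out.
KERNEL_BEHAVIOUR = {"3.19": "A", "3.21": "B", "revert": "A"}

# The behaviour each system was tested against. Firmware never reports
# this, so the kernel has no way to look it up at runtime.
SYSTEM_EXPECTS = {"system1": "A", "system2": "B"}


def boots(system: str, kernel: str) -> bool:
    """A system works only if the kernel shows the behaviour it was tested with."""
    return KERNEL_BEHAVIOUR[kernel] == SYSTEM_EXPECTS[system]


def kernels_that_boot_everything() -> List[str]:
    """Return the kernels on which every known system works."""
    return [
        kernel
        for kernel in KERNEL_BEHAVIOUR
        if all(boots(system, kernel) for system in SYSTEM_EXPECTS)
    ]
```

In this model `kernels_that_boot_everything()` comes back empty, which is the point at which machine-specific quirks start to look like the only way out.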

Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:

1) Commit to never modifying the ACPI behaviour of Linux.
2) Expose an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bump that whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.

(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.
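One way option (2) could look is sketched below in Python. The integer revision numbers, the `KERNEL_SUPPORTED_REVISIONS` list and the `negotiate` function are all assumptions made for illustration; no such interface exists in the kernel today.

```python
# Hypothetical sketch of option (2). The kernel lists every ACPI behaviour
# revision it can still emulate; the list grows by one on each
# incompatible change, and old entries are the backward compatibility paths.
KERNEL_SUPPORTED_REVISIONS = [1, 2, 3]


def negotiate(firmware_revisions):
    """Pick the newest behaviour revision that both sides understand.

    Returns None when there is no overlap, meaning the kernel no longer
    carries the compatibility path this firmware was tested against
    (or the firmware only knows revisions newer than this kernel).
    """
    common = set(firmware_revisions) & set(KERNEL_SUPPORTED_REVISIONS)
    return max(common) if common else None
```

Firmware tested against revisions 1 and 2 would get revision 2 behaviour from this kernel even after revision 3 exists. The cost is that somebody has to decide when a change is incompatible enough to warrant a new revision.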

[1] Defined by implementation, not defined by specification
[2] Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.

From: (Anonymous)

"Abandon all hope, ye who enter here"... that seems to be the story of your adventures with ACPI.

From: (Anonymous)

We already have this problem with the existing claims of being Windows: if we claim to be Windows 8, and behave like Windows 8, we'll still end up causing potential regressions in behavior and forcing blacklists/whitelists. If we start claiming compatibility with a new version of Windows, we'll change our behavior on existing systems. Sometimes that behavior change will improve things, while other times it'll cause regressions. Not with the same granularity, but in practice there are Linux kernel versions claiming to be Windows 8 with one set of bugs, and Linux kernel versions claiming to be Windows 8 with a different set of bugs, one of which may be closer to the behavior of Windows 8. Fixing those bugs may itself cause regressions.

The claim is that by behaving more closely to Windows, we can get away with this, because Windows manages to work on that hardware. However, we're never going to *exactly* match the behavior of Windows, bug-compatibly.

From: mjg59

What we have on x86 is a well defined target. We won't always succeed in reaching it, but it's possible. If a code change gets us closer to the behaviour of Windows then it's the correct thing to do. What we have right now on ARM is *no* target. How do we decide whether a given code change is correct or not?

From: (Anonymous)

How is that different from the ARM world *without* ACPI?

From: mjg59

Device Tree provides data, not code. It also supports versioning at the individual interface level rather than having a single global version. These don't make breakage impossible, but they do reduce the probability.
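A rough Python sketch of what per-interface versioning can look like: each device node carries its own `compatible` list, ordered from most specific to most generic, and the kernel binds the first entry it recognises. The strings and driver names below are made up for the example, and real kernel matching is more involved than a first-hit lookup.

```python
# Hypothetical driver table: compatible string -> driver name.
# All names here are invented for illustration.
DRIVER_TABLE = {
    "vendor,backlight-v1": "backlight_v1_driver",
    "vendor,uart-v1": "uart_v1_driver",
}


def match_driver(compatible):
    """Return the driver for the first compatible string the kernel knows."""
    for entry in compatible:  # ordered most specific to most generic
        if entry in DRIVER_TABLE:
            return DRIVER_TABLE[entry]
    return None


# A node that advertises a newer interface but names an older fallback.
backlight_node = ["vendor,backlight-v2", "vendor,backlight-v1"]
```

Here `match_driver(backlight_node)` falls back to the v1 driver without affecting how any other device in the tree is matched, which is the difference from a single global version.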

From: (Anonymous)

Your argumentation is flawed. You argue everything is safe when we have a reference implementation and two paragraphs later you argue that reference implementations are fundamentally broken (3.19 problem found in 3.21).

It's *not* a valid assumption to say that the ACPI implementation is ok just because one OS can boot. There should be a formal ACPI specification or at least a large test suite that checks if ACPI implementations behave sanely.

And you are right that we should do it properly on ARM systems.

From: mjg59

Everything's safe when we have a reference implementation that people test against in a useful timeframe, and Linux is not currently that implementation. We churn far too quickly.

ACPI is formally specified, but there's no meaningful way you can write a specification that documents every single possible implicit assumption in an implementation. You could potentially write a Linux-based test suite, but we'd still need someone to run that on every single piece of ACPI-based ARM hardware (including the unreleased ones) towards the end of every kernel cycle.

From: (Anonymous)

I would think the goal would be a FLOSS ACPI test-suite that is written against the 980-page ACPI spec and validates the firmware, which can be run by the ARM hardware developers. Ideally, there should also be a corresponding FLOSS ACPI test-suite that simulates the hardware/firmware and validates the OS (Linux). And when properly written, the suites can be run against each other to develop and validate the tests. If either the OS or firmware takes actions undefined in the ACPI spec, an error would be reported. Is this painful, difficult, and time-consuming? Yes. Can it be split into smaller work units and distributed among a large team? Yes.

From: (Anonymous)

That still won't help. For example how would you verify the "right" thing is done with the backlight in various power states? You'd need some ACPI independent way of finding out what is going on with the backlight, in which case why use ACPI. How would you work out if various things have been put into their most power efficient modes, versus just working correctly modes? Writing such a test suite that was comprehensive would be a humungous effort, and as Matthew points out would still miss all sorts of cases and ordering. For example the firmware may do the right thing if the graphics subsystem completes initialisation before the backlight, but get things wrong if the backlight is done first or concurrently.

There are ACPI test suites which verify the big picture, but it is the little details, hardware and ordering that really matters and is far harder to address. Example test suite https://wiki.linaro.org/LEG/Engineering/test-acpi

From: (Anonymous)

It is called unit testing. You test the backlight all by itself, without initializing the graphics subsystem. You also have coverage metrics that identify all of the input states that can affect the backlight, including any that initializing the gpu may modify, and you test the backlight code against each possible set of inputs: both if the gpu had initialized already, and if it had not.

Properly validating software is possible, but it takes a lot more work than just seeing if Windows boots up on it.
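The approach described in this comment could be sketched roughly as follows in Python. The input names, the stand-in `set_backlight` routine and the invariant being checked are all assumptions for illustration; a real firmware method would have far more inputs, and finding them all is the hard part.

```python
from itertools import product

# Hypothetical inputs that might influence a backlight method.
INPUT_STATES = {
    "gpu_initialised": [False, True],
    "power_state": ["S0", "S3"],
    "lid_open": [False, True],
}


def set_backlight(level, gpu_initialised, power_state, lid_open):
    """Stand-in for the firmware routine under test (purely illustrative)."""
    if power_state != "S0" or not lid_open:
        return 0
    return level


def all_input_combinations():
    """Enumerate every combination of the declared input states."""
    keys = list(INPUT_STATES)
    return [dict(zip(keys, values)) for values in product(*INPUT_STATES.values())]


def gpu_order_dependent_states(level=50):
    """Return the states where the result changes with GPU initialisation."""
    bad = []
    for state in all_input_combinations():
        flipped = dict(state, gpu_initialised=not state["gpu_initialised"])
        if set_backlight(level, **state) != set_backlight(level, **flipped):
            bad.append(state)
    return bad
```

An empty result from `gpu_order_dependent_states()` only says the routine is independent of GPU initialisation across the inputs that were listed. Inputs nobody thought to list remain untested.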

From: (Anonymous)

Ironically, these servers will only ever run Linux, and we have complete control over that platform. Windows is irrelevant to the ARM server. It's a unique position in history and we're not thinking how to both benefit from and also optimize for that case.

What confuses me the most: why is there this fixation with adding complexity to firmware? Surely the kernel code is easier to work with. Is it the kernel process? Skills? Or something else?

From: (Anonymous)

What about the old solution of putting as much knowledge about the hardware into the kernel itself, and only relying on ACPI for the things that the kernel really can't know, like the bus/mapping/whatever address of a given device? I'm sure of course that there is a good reason why that doesn't work if you are going to all this trouble.

From: (Anonymous)

I'm curious why nobody raised the idea of targeting Windows RT. Since Microsoft got in on ARM, it's bound to be targeted by vendors, isn't it?

From: mjg59

RT pre-dates ACPI 5.1, so it's not a terribly useful reference unfortunately.

From: yuhong.wordpress.com

Windows RT is 32-bit while ARM SBSA is 64-bit.

From: yuhong.wordpress.com

Personally I hope that this work can benefit x86 too.

From: (Anonymous)

I don't agree with your assertion that "Windows 8" is a contract. "Contract" implies something formal and well-defined. What happens in the real world is Windows starts doing things one way, which may or may not be spec compliant. When we vendors make a new piece of hardware, we adjust our ACPI tables until Windows stops bluescreening. It really is this brute-force approach. ACPI tables end up becoming bad hacks to work around bad programming in Windows. If the end-user doesn't notice something's wrong, the attitude of most vendors is to say it's not wrong.

In my line of work I don't have the luxury of bad hacks when it comes to firmware. In early development, I can get away with booting Linux without a BIOS or EFI interface. I am incredibly lucky to be allowed to do this, as Linux will complain loudly about mistakes in ACPI. Once Linux shuts up, then we move on to Windows testing. Notice that at no point do we care if we're ACPI compliant. ACPI compliance is so ill-defined that even if we cared, there would be no way to verify if we are indeed compliant.

If you want to get to the heart of the problem, stop being part of it. Stop being part of the RedHat-Linaro-Intel-Microsoft complex which pushes these ill-defined and unduly complex specifications. Notice that, in order to run an OS that can fully utilize the features of the hardware, we haven't truly needed BIOS calls for a long while, and we don't need EFI if we can correctly implement a very limited subset of ACPI. And once you've reduced everything to this simplest scenario, you realize that what's left is just a misdefined devicetree.

Linux on ARM and a few other non-x86 architectures has been doing just fine with a proper devicetree. Why we're still talking about thousand-plus-page firmware interface specifications is beyond me. Everything EFI does and does not do can be done by firmware in a hardware-direct manner, with fewer layers and less complexity. On ARM that firmware is really damn simple. As soon as we bring in EFI and ACPI complexities to this ARM firmware, we'll run into the same set of problems we're having today on x86.

An unrelated point: when you talk about firmware relying on undefined behavior, you're talking about bad programming. You make it sound like you condone this reliance on the undefined, and by induction, that you condone bad programming. I'm certain that's not what you meant.

Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at Aurora. Ex-biologist.

mjg59 on Twitter. Content here should not be interpreted as the opinion of my employer. Also on Mastodon.