Matthew Garrett (mjg59)
ACPI is a complicated specification - the latest version is 980 pages long. But that's because it's trying to define something complicated: an entire interface for abstracting away hardware details and making it easier for an unmodified OS to boot diverse platforms.

Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.

This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.

Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.

In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.

This sounds horrifying, but it's actually important. The existence of well-defined[1] OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version[2] the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.

That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?

Our even grimmer meathook future welcomes ARM to the ACPI world.

Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.

So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.

Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:

1) Commit to never modifying the ACPI behaviour of Linux.
2) Expose an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bump it whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.

(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.

[1] Defined by implementation, not defined by specification
[2] Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.

ACPI spec

Date: 2014-09-17 08:52 am (UTC)
From: (Anonymous)
Your argument is flawed. You argue that everything is safe when we have a reference implementation, and two paragraphs later you argue that reference implementations are fundamentally broken (the 3.19 problem found in 3.21).

It's *not* a valid assumption that an ACPI implementation is OK just because one OS can boot. There should be a formal ACPI specification, or at least a large test suite that checks whether ACPI implementations behave sanely.

And you are right that we should do it properly on ARM systems.

Re: ACPI spec

Date: 2014-09-17 12:52 pm (UTC)
From: (Anonymous)
I would think the goal would be a FLOSS ACPI test suite, written against the 980-page ACPI spec, that validates the firmware and can be run by ARM hardware developers. Ideally, there should also be a corresponding FLOSS ACPI test suite that simulates the hardware/firmware and validates the OS (Linux). When properly written, the suites can be run against each other to develop and validate the tests. If either the OS or the firmware takes an action undefined in the ACPI spec, an error would be reported. Is this painful, difficult, and time-consuming? Yes. Can it be split into smaller work units and distributed among a large team? Yes.

Re: ACPI spec

Date: 2014-09-17 05:52 pm (UTC)
From: (Anonymous)
That still won't help. For example, how would you verify that the "right" thing is done with the backlight in various power states? You'd need some ACPI-independent way of finding out what is going on with the backlight, in which case why use ACPI? How would you work out whether various things have been put into their most power-efficient modes, rather than merely working modes? Writing a comprehensive test suite would be a humongous effort, and as Matthew points out it would still miss all sorts of cases and orderings. For example, the firmware may do the right thing if the graphics subsystem completes initialisation before the backlight, but get things wrong if the backlight is done first or concurrently.

There are ACPI test suites which verify the big picture, but it is the little details, the hardware, and the ordering that really matter, and those are far harder to address. An example test suite: https://wiki.linaro.org/LEG/Engineering/test-acpi

Re: ACPI spec

Date: 2014-11-19 09:12 pm (UTC)
From: (Anonymous)
It is called unit testing. You test the backlight all by itself, without initializing the graphics subsystem. You also have coverage metrics that identify all of the input states that can affect the backlight, including any that initializing the GPU may modify, and you test the backlight code against each possible set of inputs: both when the GPU has already initialized and when it has not.

Properly validating software is possible, but it takes a lot more work than just seeing if Windows boots up on it.
