Implementing support for advanced DPTF policy in Linux
Intel's Dynamic Platform and Thermal Framework (DPTF) is a feature that's becoming increasingly common on highly portable Intel-based devices. The adaptive policy it implements is based around the idea that thermal management of a system is becoming increasingly complicated - the appropriate set of cooling constraints to place on a system may differ based on a whole bunch of criteria (e.g., if a tablet is being held vertically rather than lying on a table, it's probably going to be able to dissipate heat more effectively, so you should impose different constraints). One way of providing these criteria to the OS is to embed them in the system firmware, allowing an OS-level agent to read them and then incorporate OS-level knowledge into a final policy decision.
Unfortunately, while Intel have released some amount of support for DPTF on Linux, they haven't included support for the adaptive policy. And even more annoyingly, many modern laptops run in a heavily conservative thermal state if the OS doesn't support the adaptive policy, meaning that the CPU throttles down extremely quickly and the laptop runs excessively slowly.
It's been a while since I really got stuck into a laptop reverse engineering project, and I don't have much else to do right now, so I've been working on this. It's been a combination of examining what source Intel have released, reverse engineering the Windows code and staring hard at hex dumps until they made some sort of sense. Here's where I am.
There are two main components to the adaptive policy - the adaptive conditions table (APCT) and the adaptive actions table (APAT). The adaptive conditions table contains a list of condition sets, with up to 10 conditions in each condition set. A condition is something like "is the battery above a certain charge", "is this temperature sensor below a certain value", "is the lid open or closed", "is the machine upright or horizontal" and so on. Each condition set is evaluated in turn - if all the conditions evaluate to true, the condition set's target is implemented. If not, we move on to the next condition set. There will typically be a fallback condition set to catch the case where none of the other condition sets evaluate to true.
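The evaluation order described above can be sketched roughly as follows. This is an illustration only, not thermald's code or the firmware's binary layout: the `Condition` and `ConditionSet` classes, the `pick_target` function, the state keys and the target numbers are all hypothetical, and the fallback is modelled as a single condition that is always true, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_CONDITIONS_PER_SET = 10  # limit stated in the text above

State = Dict[str, object]  # hypothetical snapshot of system state


@dataclass
class Condition:
    description: str
    check: Callable[[State], bool]  # returns True when the condition holds


@dataclass
class ConditionSet:
    target: int  # hypothetical target identifier
    conditions: List[Condition] = field(default_factory=list)


def pick_target(condition_sets: List[ConditionSet], state: State) -> Optional[int]:
    """Return the target of the first set whose conditions are all true."""
    for cs in condition_sets:
        if len(cs.conditions) > MAX_CONDITIONS_PER_SET:
            raise ValueError("a condition set holds at most 10 conditions")
        if all(cond.check(state) for cond in cs.conditions):
            return cs.target
    return None  # nothing matched and there was no fallback set


# Hypothetical example data: one real set, then a catch-all fallback.
condition_sets = [
    ConditionSet(target=1, conditions=[
        Condition("battery above 50%", lambda s: s["battery_pct"] > 50),
        Condition("lid open", lambda s: s["lid_open"]),
    ]),
    ConditionSet(target=2, conditions=[
        Condition("fallback, always true", lambda s: True),
    ]),
]
```

With this data, a state with the lid closed fails the first set and falls through to the fallback target.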
The action table contains sets of actions associated with a specific target. Once we've picked a target by evaluating the conditions, we execute the actions that have a corresponding target. Actions are things like "Set the CPU power limit to this value" or "Load a passive policy table". Passive policy tables are simply tables associating sensors with devices and an associated temperature limit. If the limit is exceeded, the associated device should be asked to reduce its heat output until the situation is resolved.
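As a rough sketch of that second stage, assuming the tables have already been parsed into simple records: the `Action` and `PassiveEntry` classes, the action kind strings, the sensor and device names and the numeric values below are all made up for illustration and do not come from the firmware format or from thermald.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Action:
    target: int    # target this action belongs to
    kind: str      # hypothetical label, e.g. "set_cpu_power_limit"
    argument: str  # hypothetical argument, e.g. a limit or a table name


@dataclass
class PassiveEntry:
    sensor: str           # sensor being watched
    device: str           # device to throttle when the sensor is too hot
    limit_celsius: float  # temperature limit for this sensor


def actions_for_target(actions: List[Action], target: int) -> List[Action]:
    """Select the actions whose target matches the one picked earlier."""
    return [a for a in actions if a.target == target]


def devices_to_throttle(table: List[PassiveEntry],
                        readings: Dict[str, float]) -> List[str]:
    """List devices whose associated sensor currently exceeds its limit."""
    over: List[str] = []
    for entry in table:
        temp = readings.get(entry.sensor)  # None if the sensor has no reading
        if temp is not None and temp > entry.limit_celsius:
            if entry.device not in over:
                over.append(entry.device)
    return over


# Hypothetical example data.
actions = [
    Action(1, "set_cpu_power_limit", "15000"),
    Action(1, "load_passive_table", "table_a"),
    Action(2, "set_cpu_power_limit", "8000"),
]
passive_table = [
    PassiveEntry("skin_sensor", "cpu", 45.0),
    PassiveEntry("charger_sensor", "charger", 50.0),
]
```

A real agent would keep asking the listed devices to reduce their heat output until the readings drop back under the limit; this sketch only shows the selection step.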
There are a couple of twists. The first is the OEM conditions. These are conditions that refer to values that are exposed by the firmware and are otherwise entirely opaque - the firmware knows what these mean, but we don't, so conditions that rely on these values are magical. They could be temperature, they could be power consumption, they could be SKU variations. We just don't know. The other is that older versions of the APCT table didn't include a reference to a device - i.e., if you specified a condition based on a temperature, you had no way to express which temperature sensor to use. So, instead, you specified a condition that's greater than 0x10000, which tells the agent to look at the APPC table to extract the device and the appropriate actual condition.
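The indirection for older tables might look something like the sketch below. Only the 0x10000 threshold comes from the description above; how the APPC table is actually keyed and laid out is not documented here, so modelling it as a dictionary from the raw condition value to a (device, condition) pair is purely an assumption, as are the `resolve_condition` name and the example values.

```python
from typing import Dict, Optional, Tuple

INDIRECT_THRESHOLD = 0x10000  # values above this are looked up in APPC

# Assumed shape: raw condition value -> (device name, actual condition)
AppcTable = Dict[int, Tuple[str, int]]


def resolve_condition(raw_condition: int,
                      appc: AppcTable) -> Tuple[Optional[str], int]:
    """Return (device, condition), consulting APPC for indirect values."""
    if raw_condition > INDIRECT_THRESHOLD:
        device, actual_condition = appc[raw_condition]
        return device, actual_condition
    # Direct condition: no device reference is available in the old format.
    return None, raw_condition


# Hypothetical example data.
appc = {0x10001: ("SEN1", 8)}
```

Here `resolve_condition(0x10001, appc)` yields the device and real condition from the lookup, while a small value such as 8 is passed through unchanged with no device attached.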
Intel already have a Linux app called Thermal Daemon that implements a subset of this - you're supposed to run the binary-only dptfxtract against your firmware to parse a few bits of the DPTF tables, and it writes out an XML file that Thermal Daemon makes use of. Unfortunately it doesn't handle most of the more interesting bits of the adaptive performance policy, so I've spent the past couple of days extending it to do so and to remove the proprietary dependency.
My current work is here - it requires a couple of kernel patches (that are in the patches directory), and it only supports a very small subset of the possible conditions. It's also entirely possible that it'll do something inappropriate and cause your computer to melt - none of this is publicly documented, I don't have access to the spec and you're relying on my best guesses in a lot of places. But it seems to behave roughly as expected on the one test machine I have here, so time to get some wider testing?
What happens without thermald or dptf?
(Anonymous) 2020-04-13 02:30 am (UTC)
I have a quick question. What happens if thermald is not installed, or not configured correctly?
I have a brand new laptop that supports this thermal stuff. Out of the box with Linux (no thermald installed, no tweaks to CPU throttling or cooling) it would run to 100 C under load. With thermald configured (using the binary extractor), it rarely crosses 80 C, which makes me think that's the manufacturer expected temperature range.
Yet, how come the default without configuration is less conservative and just shoots up? Is it dangerous to run at full thermal range, without thermald, or are there other safety checks to avoid HW damage?
(I'm asking cause I wonder whether it's safe to test the patch set)
Re: What happens without thermald or dptf?
Testing on Dell Latitude 7270
(Anonymous) 2020-04-13 11:03 am (UTC)
I applied the kernel patches and recompiled thermald from your fork, and everything seems to work fine. Is there specific testing I can do to verify that everything works as it should?
Regards,
Emantor
int3400 thermal
(Anonymous) 2020-04-13 04:36 pm (UTC)
    status = acpi_evaluate_object(priv->adev->handle, "IDSP", NULL, &buf);
    if (ACPI_FAILURE(status))
        return -ENODEV;
I wonder if this means it doesn't have DPTF or they changed something the driver is not yet aware of..
Re: int3400 thermal
sudo acpidump
and upload the output somewhere?
Re: int3400 thermal
(Anonymous) 2020-04-13 05:59 pm (UTC)
http://sprunge.us/k8uiQB
Re: int3400 thermal
segfault at src/thd_engine_adaptive.cpp:400
(Anonymous) 2020-04-13 09:24 pm (UTC)
(gdb) run --no-daemon --adaptive
Starting program: /home/sultan/sultan-projects/thermal_daemon/thermald --no-daemon --adaptive
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7444408 in __memmove_avx_unaligned_erms () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7444408 in __memmove_avx_unaligned_erms () at /usr/lib/libc.so.6
#1 0x0000555555591517 in cthd_engine_adaptive::parse_gddv(char*, int) (size=29715, buf=<optimized out>, this=0x55555560ddd0) at src/thd_engine_adaptive.cpp:400
#2 cthd_engine_adaptive::parse_gddv(char*, int) (this=0x55555560ddd0, buf=<optimized out>, size=29715) at src/thd_engine_adaptive.cpp:352
#3 0x000055555559183e in cthd_engine_adaptive::handle_compressed_gddv(char*, int) (this=0x55555560ddd0, buf=<optimized out>, size=1285) at src/thd_engine_adaptive.cpp:346
#4 0x00005555555927e2 in cthd_engine_adaptive::thd_engine_start(bool) (this=0x55555560ddd0, ignore_cpuid_check=<optimized out>) at src/thd_engine_adaptive.cpp:854
#5 0x0000555555592b22 in thd_engine_create_adaptive_engine(bool) (ignore_cpuid_check=<optimized out>) at src/thd_engine_adaptive.cpp:884
#6 0x00005555555756e4 in main(int, char**) (argc=<optimized out>, argv=<optimized out>) at src/main.cpp:331
Re: segfault at src/thd_engine_adaptive.cpp:400
Re: segfault at src/thd_engine_adaptive.cpp:400
(Anonymous) 2020-04-13 09:32 pm (UTC)(link)Re: segfault at src/thd_engine_adaptive.cpp:400
Re: segfault at src/thd_engine_adaptive.cpp:400
(Anonymous) 2020-04-14 04:15 am (UTC)(link)Re: segfault at src/thd_engine_adaptive.cpp:400
Re: segfault at src/thd_engine_adaptive.cpp:400
no subject
(Anonymous) 2020-04-16 11:34 am (UTC)
Is there anything one can do to make it easier for you to reverse engineer DPTF? I saw that you asked some commenters to provide `acpidump` output. Would it help your work if you got more dumps from affected devices?
There is a big thread over at the Lenovo forums [2] with affected users. I think many of them would be willing to donate such data.
[0] https://www.heise.de/newsticker/meldung/Neue-RKI-Corona-Fall-Studie-Einfluss-der-Kontaktsperre-eher-maessig-4702096.html
[1] https://www.golem.de/news/intel-dptf-garrett-verbessert-thermische-regulierung-von-linux-laptops-2004-147857.html
[2] https://forums.lenovo.com/t5/Other-Linux-Discussions/X1C6-T480s-low-cTDP-and-trip-temperature-in-Linux/td-p/4028489
no subject
Upstream the changes?
(Anonymous) 2020-06-08 02:03 pm (UTC)
I'll see if it fixes my GPU-throttling in Desk-Mode on my T580.
Are you trying to get your work upstreamed? I mean the kernel patches and your changes to thermald?
Re: Upstream the changes?
Applied changes
(Anonymous) 2020-08-06 04:28 pm (UTC)