Matthew Garrett ([personal profile] mjg59) wrote 2016-04-13 12:46 pm

Skylake's power management under Linux is dreadful and you shouldn't buy one until it's fixed

(Edit to add: this issue is restricted to the mobile SKUs. Desktop parts have very different power management behaviour)

Linux 4.5 seems to have got Intel's Skylake platform (ie, 6th-generation Core CPUs) to the point where graphics work pretty reliably, which is great progress (4.4 tended to lose all my windows every so often, especially over suspend/resume). I'm even running Wayland happily. Unfortunately one of the reasons I have a laptop is that I want to be able to do things like use it on battery, and power consumption's an important part of that. Skylake continues the trend from Haswell of moving to an SoC-type model where clock and power domains are shared between components that were previously entirely independent, and so you can't enter deep power saving states unless multiple components all have the correct power management configuration. On Haswell/Broadwell this manifested in the form of Serial ATA link power management being involved in preventing the package from going into deep power saving states - setting that up correctly resulted in a reduction in full-system power consumption of about 40%[1].
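For reference, the SATA link power management policy is visible per host in sysfs on systems that have an AHCI controller. A minimal sketch for listing it, assuming the usual /sys/class/scsi_host/host*/link_power_management_policy layout; the function name and the root parameter are hypothetical and exist only to make the example testable:

```python
from pathlib import Path


def read_sata_lpm_policies(root="/sys/class/scsi_host"):
    """Return {host_name: policy} for every host exposing an LPM policy file.

    Hosts without the file (for example non-AHCI hosts) are skipped, and a
    missing root directory yields an empty dict.
    """
    policies = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return policies
    for host in sorted(root_path.iterdir()):
        policy_file = host / "link_power_management_policy"
        if policy_file.is_file():
            policies[host.name] = policy_file.read_text().strip()
    return policies
```

Calling read_sata_lpm_policies() with no arguments reads the live system; on a machine with no SATA controller at all it simply returns an empty dict.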

I've now got a Skylake platform with a nice shiny NVMe device, so Serial ATA policy isn't relevant (the platform doesn't even expose a SATA controller). The deepest power saving state I can get into is PC3, despite Skylake supporting PC8 - so I'm probably consuming about 40% more power than I should be. And nobody seems to know what needs to be done to fix this. I've found no public documentation on the power management dependencies on Skylake. Turning on everything in Powertop doesn't improve anything. My battery life is pretty poor and the system is pretty warm.
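If you want to check package C-state residency without Powertop, the numbers come from per-package residency counters in MSRs. The following is a rough sketch rather than a definitive tool: the MSR addresses are the ones listed in the Linux kernel's msr-index.h, the helper names are made up for illustration, and it assumes the msr module is loaded, that you are root, and that the residency counters tick at the TSC rate. Not every CPU implements every counter, so unreadable ones are skipped.

```python
import os
import struct
import time

# MSR addresses as defined in the Linux kernel's msr-index.h.
MSR_TSC = 0x10
PKG_RESIDENCY_MSRS = {
    "pc2": 0x60D,
    "pc3": 0x3F8,
    "pc6": 0x3F9,
    "pc7": 0x3FA,
    "pc8": 0x630,
    "pc9": 0x631,
    "pc10": 0x632,
}


def read_msr(address, cpu=0):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def residency_percent(before, after, tsc_before, tsc_after):
    """Share of the sampling window spent in a state, tolerating 64-bit wraparound."""
    mask = (1 << 64) - 1
    elapsed = (tsc_after - tsc_before) & mask
    if elapsed == 0:
        return 0.0
    return 100.0 * ((after - before) & mask) / elapsed


def sample_package_residency(interval=5.0):
    """Sample every readable package counter twice and return {state: percent}."""
    def snapshot():
        values = {}
        for name, address in PKG_RESIDENCY_MSRS.items():
            try:
                values[name] = read_msr(address)
            except OSError:
                pass  # counter not implemented on this CPU
        return read_msr(MSR_TSC), values

    tsc0, first = snapshot()
    time.sleep(interval)
    tsc1, second = snapshot()
    return {
        name: residency_percent(first[name], second[name], tsc0, tsc1)
        for name in first
        if name in second
    }
```

On an idle system, a result where everything deeper than pc3 stays at 0.0 would match the behaviour described above.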

The best thing about this is the following statement from page 64 of the 6th Generation Intel® Processor Datasheet for U-Platforms:

Caution: Long term reliability cannot be assured unless all the Low-Power Idle States are enabled.

which is pretty concerning. Without support for states deeper than PC3, Linux is running in a configuration that Intel imply may trigger premature failure. That's obviously not good. Until this situation is improved, you probably shouldn't buy any Skylake systems if you're planning on running Linux.

[1] These patches never went upstream. Someone reported that they resulted in their SSD throwing errors and I couldn't find anybody with deeper levels of SATA experience who was interested in working on the problem. Intel's AHCI drivers for Windows do the right thing, but I couldn't find anybody at Intel who could get any information from their Windows driver team.

NVMe problems

(Anonymous) 2016-04-13 08:57 pm (UTC)(link)
The Dell XPS 13 had some problems related to NVMe preventing it from entering lower power states.

Relatedly, the NVMe SSD used in quite a few laptops these days has notorious power consumption [1]. If I purchased an XPS 13 I'd likely swap back to a non-NVMe SSD for this reason alone.

[1] http://www.silentpcreview.com/files/images/samsung-950pro/power.gif

[personal profile] edmonds 2016-04-13 09:32 pm (UTC)(link)
By working reliably under 4.5, do you mean you don't have to use any of the i915 module parameter workarounds like enable_rc6=0 ?

(Anonymous) 2016-04-13 10:06 pm (UTC)(link)
Heck, Bay Trail is still a mess for support too.

Nasty issues with Linux 4.2, Skylake and NVMe

(Anonymous) 2016-04-13 10:06 pm (UTC)(link)
Until Linux Mint 18 is released, I thought it would be OK to go with the current release, 17.3. It's OK most of the time, until it breaks in various ways.

Mainboard: ASUS Z170-DELUXE
CPU: Intel Core i3-6320 3.90 GHz (Skylake)
SSD: Samsung 950 Pro 256GB M.2
Linux kernel: 4.2.x

Trying kernel 4.4 (available from Ubuntu) renders an unbootable system due to some strange issues configuring GRUB I think - I just didn't have the time to investigate in detail, although I tried chrooting and updating grub from a live session, without success.

I might as well consider another distro if kernel 4.5 or newer fixes these issues. Sometimes the system freezes so badly that not even the (hardware) reset button works. Other times the NVMe controller simply throws in the towel [1] and I get left with some partially running programs, everything running from RAM, because the storage disappears until I reset the PC.

It's been two and a half terrible months since I got this new PC and, lacking the time to find answers, it's frustrating that I have no idea who to blame. So I guess I might as well blame myself for going with "the latest and greatest" from Intel without properly researching compatibility.

[1] http://pastebin.com/2djDSh3m

SATA PM Patches

(Anonymous) 2016-04-13 10:09 pm (UTC)(link)
I read your blog post about SATA PM when it was fresh, saw your patches on LKML and thought: well, everything is on its way. The Panel Self Refresh and friends that were also mentioned for i915 are slowly getting into mainline, so I thought we might get to a point where power consumption would be in a good state, and that I should spend some time optimizing my Haswell notebook again. I tried but mostly gave up and thought I needed to wait some more time. Now I just learned your SATA patches were never merged, which makes me sad about PM in Linux again :(

I'm just a user, no expert at all, but if there is a possibility of giving mainlining those patches another shot, I would be absolutely thankful.

Keep up the good work, Matthew, it's really appreciated. You solved a lot of problems for us Linux users! :)

Regards!
Wilken Haase
parttime happy linux user

Been (mostly) fixed for me.

[personal profile] gourdcaptain 2016-04-13 10:41 pm (UTC)(link)
If it's the issue involving limiting it to C6 or lower, that actually got fixed two weeks back and is just now being pushed to stable kernels - I'm running 4.4.7 no problems on Skylake hardware (Lenovo Yoga 700 (11-inch) with an Intel Core m5 6Y54 cpu) with all C-States enabled and powertop telling me it's spending time in C10 even. (Before 4.4.7 got released and packaged on Arch (3rd party repo, not main yet), I was running 4.6rc1 and rc2 to get this working.)

https://bugzilla.kernel.org/show_bug.cgi?id=109081 - The bug report of the issue.
Edited 2016-04-13 22:52 (UTC)

My mobile part is seeing pc8

(Anonymous) 2016-04-13 10:42 pm (UTC)(link)
This is a Dell Precision 5510
model name : Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz

          Package   |             Core    |            CPU 0       CPU 4
                    |                     | C0 active   2.2%        0.2%
                    |                     | POLL        0.0%    0.0 ms  0.0%    0.0 ms
                    |                     | C1E-SKL     0.4%    0.3 ms  6.4%    2.9 ms
C2 (pc2)   39.7%    |                     |
C3 (pc3)    1.0%    | C3 (cc3)    0.2%    | C3-SKL      0.5%    0.2 ms  0.0%    0.0 ms
C6 (pc6)    8.7%    | C6 (cc6)   17.4%    | C6-SKL     12.2%    0.9 ms  8.6%   20.1 ms
C7 (pc7)    0.0%    | C7 (cc7)   64.2%    | C7s-SKL     0.0%    0.0 ms  0.0%    0.0 ms
C8 (pc8)   21.7%    |                     | C8-SKL     65.6%    1.8 ms  4.1%    2.7 ms
C9 (pc9)    0.0%    |                     | C9-SKL      0.0%    0.0 ms  0.0%    0.0 ms
C10 (pc10)  0.0%    |                     | C10-SKL    13.6%    6.3 ms 78.5%   25.2 ms
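A table like this can be reduced to "what is the deepest package state actually reached?" with a few lines of parsing. This is only a sketch that assumes the layout exactly as pasted above (package column first, with labels of the form "C8 (pc8)"); the function names are hypothetical, and other powertop versions may format the table differently.

```python
import re

# Matches the package column only, e.g. "C8 (pc8)   21.7%".
_PKG_LINE = re.compile(r"^\s*C\d+ \((pc\d+)\)\s+([\d.]+)%", re.MULTILINE)


def parse_package_residency(text):
    """Return {"pc2": 39.7, ...} from pasted powertop idle statistics."""
    return {name: float(pct) for name, pct in _PKG_LINE.findall(text)}


def deepest_package_state(text):
    """Return (name, percent) for the deepest package state with non-zero residency."""
    used = [
        (int(name[2:]), name, pct)
        for name, pct in parse_package_residency(text).items()
        if pct > 0.0
    ]
    if not used:
        return None
    _, name, pct = max(used)
    return name, pct
```

Fed the table above, deepest_package_state returns ("pc8", 21.7), which is the point of this comment: pc9 and pc10 are listed but never entered.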

Re: Actually been fixed.

[personal profile] edmonds 2016-04-13 10:52 pm (UTC)(link)
No, IIUC, this is related to *R*C6, which is a power saving state on the GPU. Not C6.

I think this is the actual bug report: https://bugs.freedesktop.org/show_bug.cgi?id=94161.

Re: Actually been fixed.

[personal profile] gourdcaptain 2016-04-13 10:58 pm (UTC)(link)
Huh. Haven't had that issue either (and I've used this system for long stretches of moderate use in the week and a half I've had it). Have had random ACPI-related crashes at boot (~50% of the time) unless I increase the wait time in systemd-boot to ten seconds, weirdly enough, but that's more of a lousy BIOS/UEFI issue (given that on successful boots, it logs a bunch of ACPI table errors in dmesg). (Still trying to figure out how to report that, given that the kernel panic messages vary widely between crashes, scroll mostly off the screen, and the system completely freezes up after one without letting me use any of the stuff I read about online to capture it.)

Firmware?

(Anonymous) 2016-04-13 11:17 pm (UTC)(link)
What firmware are you on? IIRC anything before 1.1.7 incorrectly initialized the PCIe links, which broke ASPM and therefore prevented any deep sleep state from being entered.

NVMe power saving is still unimplemented on Linux, but I might get around to that soon if no one beats me.
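To see which ASPM policy the kernel is currently applying, the pcie_aspm module parameter can be read from sysfs; the active policy is the one shown in square brackets. A small sketch, assuming the usual /sys/module/pcie_aspm/parameters/policy path (the helper names are made up for illustration). Note that this reports the kernel's policy only; it says nothing about whether firmware initialized the links correctly.

```python
import re
from pathlib import Path


def active_aspm_policy(text):
    """Extract the bracketed entry from e.g. '[default] performance powersave'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path="/sys/module/pcie_aspm/parameters/policy"):
    """Return the active policy name, or None if the file is not present."""
    policy_file = Path(path)
    if not policy_file.is_file():
        return None
    return active_aspm_policy(policy_file.read_text())
```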

Re: Actually been fixed.

(Anonymous) 2016-04-14 01:08 am (UTC)(link)
This actually reeks of a bunch of microcode and firmware issues that got fixed in the last few months. Ensure you have microcode 0x73 or later; that's actually a good hint that both the microcode and the PCH firmware are not crash-prone buggy crap.

As far as I am concerned, the kernel should refuse to boot on any Skylake box with a BIOS older than 2016 or running a microcode revision earlier than 0x73. That would certainly be a lot more truthful to everyone involved.

If a UEFI update is not available yet from your vendor, ask for your money back. A properly up-to-date UEFI for Skylake with SGX support will have microcode 0x83 or higher. If it has SGX support permanently disabled by UEFI, 0x76 is enough.
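Checking the running microcode revision against the 0x73 figure given above is easy to script, since /proc/cpuinfo on x86 carries a "microcode" line per logical CPU. A minimal sketch; the function names are hypothetical, and the threshold is just the number from this comment, not anything official:

```python
import re

MIN_REVISION = 0x73  # threshold taken from the comment above, not from Intel

_MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    return {int(rev, 16) for rev in _MICROCODE_LINE.findall(cpuinfo_text)}


def meets_minimum(cpuinfo_text, minimum=MIN_REVISION):
    """True only if a revision was found and every CPU is at or above the minimum."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= minimum
```

Typical use would be meets_minimum(open("/proc/cpuinfo").read()).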

Re: SATA PM Patches

(Anonymous) 2016-04-14 01:16 am (UTC)(link)
The issue with SATA ALPM is that enabling that will interact badly with several SSDs, triggering *data destroying* firmware bugs in those SSDs.

And we're not talking el-cheap-o crap SSDs either, several models from Micron (datacenter) and Crucial (consumer) are included, for example... and not all of them have firmware updates that address this.

And yes, such issues do exist under the Intel ISRT Windows drivers. You really can't enable SATA ALPM by default if the device identifies itself as an SSD. I don't know if the issues with HDDs and ODDs are any better, either.

Re: Actually been fixed.

[personal profile] gourdcaptain 2016-04-14 03:13 am (UTC)(link)
0x74 microcode, released 3/15/16, and SGX disabled. Unfortunately, I can't flash newer UEFI if they put one out (still the most recent for it as of this posting) because the updater is Windows only (although I did flash the most recent one before wiping the drive, and have a Clonezilla backup of the Windows install if I absolutely have to). At least hopefully there'll eventually be microcode files I can early boot load.

EDIT: At least this is less bad than when I got a Broadwell i7 5700hq laptop last year and the microcode-based TSX issues were so bad I could only boot Fedora 22 for a month stably (it would crash under any load) until MSI (the ones I bought it from) were the first out with a fixed microcode update. (And their updater actually works from the UEFI loading off a USB stick). Intel's just awful anymore, but not like we really have any alternatives, given how bad AMD CPUs are for a lot of things anymore.

EDIT: Seriously, TSX was properly disabled under Haswell for a year at that point! Why was Broadwell shipping with it enabled and faulty? Did they not even check?
Edited 2016-04-14 03:23 (UTC)

Long term reliability

(Anonymous) 2016-04-14 03:55 am (UTC)(link)
The "Caution: Long term reliability cannot be assured [...]" message is also present in the 4th (https://www-ssl.intel.com/content/www/us/en/processors/core/4th-gen-core-family-mobile-u-y-processor-lines-vol-1-datasheet.html) and 5th (https://www-ssl.intel.com/content/www/us/en/processors/core/5th-gen-core-family-datasheet-vol-1.html) generation mobile (Haswell and Broadwell) datasheets.

How do the macbooks deal with this?

(Anonymous) 2016-04-14 04:21 am (UTC)(link)
They supposedly release source to their (BSD based) kernel... do they leave out the interesting parts like this? If not, maybe it's worth checking.

not only Skylake..

(Anonymous) 2016-04-14 06:33 am (UTC)(link)
there's also a horrible bug in/for Bay Trail, crashing systems left and right: https://bugzilla.kernel.org/show_bug.cgi?id=109051

Re: My mobile part is seeing pc8

[personal profile] mikesart 2016-04-14 07:29 am (UTC)(link)
I've got a Dell XPS 9350 laptop, and if powertop is telling the truth and I'm reading it correctly, it's doing c7 as well. Although my gpu is currently pegged at 100% since I'm still running kernel 4.5 w/ i915.enable_rc6=0 to work around the gpu bug.

powertop and lspci -vvv details here:

http://pastebin.com/vBu5pBq6

I've been shutting it down completely when not in use and this post has certainly convinced me to continue doing that.

EFI updates

(Anonymous) 2016-04-14 07:37 am (UTC)(link)
FYI: Since EFI can directly run Portable Executables (.exe), you can just drop the .exe from your vendor in your EFI system partition and run it from the EFI menu, no need to boot Windows. I've done this on my Dell XPS 13 system multiple times now.

Re: EFI updates

[personal profile] gourdcaptain 2016-04-14 07:42 am (UTC)(link)
That's a thing you can do? I've been digging everywhere on ways to install UEFI updates on this thing, and it hasn't come up. Not that I'm disbelieving you, it just seems amazingly poorly documented. And it's the same EXE update files they have for Windows?

Re: My mobile part is seeing pc8

(Anonymous) 2016-04-14 07:48 am (UTC)(link)
Same for me.

Dell XPS 15 9550, Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz

Re: How do the macbooks deal with this?

(Anonymous) 2016-04-14 08:05 am (UTC)(link)
And there's also ChromiumOS. They keep their kernels relatively close to mainline and upstream stuff regularly (and also push their hw partners to do so). And of course, they care about power consumption.

"Long term reliability"

(Anonymous) 2016-04-14 08:07 am (UTC)(link)
I'm not quite sure what this means, and I'm a bit scared of what it could mean. Could somebody please clarify this for me:

"Long term reliability cannot be assured unless all the Low-Power Idle States are enabled."

Does it mean hardware life? Does it imply the processor will degrade/wear if Low-Power Idle States are not enabled?

4.6-rc2

(Anonymous) 2016-04-14 08:14 am (UTC)(link)
I'm running 4.6-rc2 on Fedora 23 and it seems to have the GPU in RC6 state for a significant portion of time:

http://pastebin.com/f42EpWV2
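For anyone wanting to reproduce the RC6 figure without powertop, i915 exposes a cumulative RC6 residency counter in milliseconds through sysfs. A rough sketch, assuming the GPU is card0 and the counter lives at /sys/class/drm/card0/power/rc6_residency_ms; both the path default and the function names are assumptions for illustration:

```python
import time
from pathlib import Path

RC6_FILE = "/sys/class/drm/card0/power/rc6_residency_ms"  # assumes the GPU is card0


def rc6_percent(ms_before, ms_after, elapsed_seconds):
    """Share of the sampling window the GPU spent in RC6."""
    if elapsed_seconds <= 0:
        return 0.0
    return 100.0 * (ms_after - ms_before) / (elapsed_seconds * 1000.0)


def sample_rc6(interval=5.0, path=RC6_FILE):
    """Read the counter twice, `interval` seconds apart, and return the percentage."""
    counter = Path(path)
    before = int(counter.read_text())
    start = time.monotonic()
    time.sleep(interval)
    after = int(counter.read_text())
    return rc6_percent(before, after, time.monotonic() - start)
```

With i915.enable_rc6=0 the counter should not advance, so sample_rc6() would be expected to return 0.0.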

No real problems here otherwise, can't seem to determine what my NVMe SSD power state is:
http://pastebin.com/uYWiHAAg

Re: SATA PM Patches

(Anonymous) 2016-04-14 08:26 am (UTC)(link)
1. Does this also apply to "firmware defaults" settings?

2. What does Windows do by default?

3. If the problem is with SSDs, then why can't we modify that patch to only apply to !SSD?
