Matthew Garrett ([personal profile] mjg59) wrote 2013-06-02 20:15

Dealing with UEFI non-volatile memory quirks

Since I wrote this, we've made some worthwhile progress on avoiding damaging Samsung hardware. The first problem was that the samsung-laptop driver appeared to be causing the firmware to attempt to write to an area of memory that was marked read-only in the chipset, triggering a Machine Check Exception. That was what generated the pstore output that caused the problem originally. The driver now refuses to load if EFI is enabled, which avoids the problem. It's not ideal, since it's currently the only mechanism we have for certain functionality on Samsung laptops, but there you go.
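As a rough userspace illustration of that "refuse to load under EFI" rule (this is not the driver's code, and the function names are made up for the example), a Linux system booted via EFI exposes a /sys/firmware/efi directory, so the same decision can be sketched like this:

```python
import os


def booted_via_efi(sysfs_efi_dir="/sys/firmware/efi"):
    """Return True if the given sysfs EFI directory exists.

    On Linux this directory is normally present only when the kernel
    was booted through EFI. The parameter exists so the check can be
    exercised against another path.
    """
    return os.path.isdir(sysfs_efi_dir)


def may_load_samsung_laptop(sysfs_efi_dir="/sys/firmware/efi"):
    """Hypothetical policy helper: only allow the driver on non-EFI boots."""
    return not booted_via_efi(sysfs_efi_dir)
```

On an EFI-booted machine may_load_samsung_laptop() would return False, which matches the behaviour described above: the module simply won't load.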

The second problem was that avoiding crashing on boot didn't actually fix the problem in any fundamental way. Even with pstore disabled, it was possible for userspace to fill the nvram and trigger the same problem. Our first approach to this was to prevent any writes to nvram if the UEFI QueryVariableInfo() call reported that more than 50% of the nvram storage space would be used. That was safe, but led to another issue. The nvram storage area is typically implemented as part of the same flash chip as the firmware. Flash isn't arbitrarily accessible - changing the contents of a block typically involves rewriting the entire block. It's impractical to rewrite the entire nvram area on every write, so what actually happens is that deleting variables just results in them being marked as inactive but doesn't actually free up the space. The firmware can later perform some sort of garbage collection to free it up.

This caused us problems, since inactive space that hasn't been garbage collected yet isn't actually available, and as a result firmware implementations tend to count it as used. Say you had 64KB of nvram and wrote 32KB of variables. We'd then refuse to write any more because free space would drop below 50%. So you delete 16KB of the variables you've created and try again. Unfortunately, the firmware still thinks that there's 32KB in use and Linux would still refuse.
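The 64KB example can be played through with a toy model. This is only a sketch: ToyNvram and refused_by_50pct_rule are invented names, and real firmware stores variables in its own vendor-specific format. The point is just that deleting marks an entry inactive without giving the space back.

```python
KB = 1024


class ToyNvram:
    """Toy append-only variable store, loosely modelled on flash behaviour."""

    def __init__(self, total):
        self.total = total
        self.entries = []  # [name, size, active] records, appended in order

    def write(self, name, size):
        self.entries.append([name, size, True])

    def delete(self, name):
        for entry in self.entries:
            if entry[0] == name and entry[2]:
                entry[2] = False  # marked inactive; the space is not reclaimed

    def firmware_used(self):
        # What the firmware reports: active plus not-yet-collected entries.
        return sum(size for _, size, _ in self.entries)

    def active_used(self):
        return sum(size for _, size, active in self.entries if active)

    def garbage_collect(self):
        self.entries = [e for e in self.entries if e[2]]


def refused_by_50pct_rule(store, new_size):
    """True if the write would push reported usage above 50% of the store."""
    return (store.firmware_used() + new_size) * 2 > store.total


store = ToyNvram(64 * KB)
store.write("a", 16 * KB)
store.write("b", 16 * KB)
store.delete("b")  # frees nothing until garbage collection runs
```

At this point only 16KB is actively used, but firmware_used() still says 32KB, so even a 1-byte write is refused until garbage_collect() runs.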

If you were lucky, rebooting would trigger a garbage collection run. If you weren't, it wouldn't. Problematic. Our next approach was to try to account for the space actually actively used by the variables, rather than relying on what the firmware told us via QueryVariableInfo(). This seems simple enough - just add up the size of all the variables and subtract that from the overall size to determine how much of the "used" space is actually just old inactive variables that can be ignored. However, there are still some problems there. The first is that each variable has some additional overhead associated with it, and the size of that overhead varies depending on the system vendor. We had to make a conservative guess, which could cause problems if systems had large numbers of small variables. The second is that the only variables the kernel can see are those that are flagged as runtime-visible. There may also be a significant quantity of nvram used to store variables that are only visible in boot services code. We could work around this by adding up sizes while we're still in boot services code, but on some systems calling QueryVariableInfo() before ExitBootServices() results in later calls to GetNextVariableName() jumping to invalid addresses and crashing the kernel. Not a great approach.
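To make the accounting idea concrete, here's a small sketch of the arithmetic. The function names and the overhead figures (32 and 128 bytes) are assumptions picked for illustration; the real overhead is vendor-specific and not something the OS can query.

```python
KB = 1024


def estimate_active_bytes(variable_sizes, per_variable_overhead):
    """Sum visible variable sizes plus a guessed fixed overhead for each."""
    return sum(size + per_variable_overhead for size in variable_sizes)


def estimate_inactive_bytes(total, remaining, variable_sizes, per_variable_overhead):
    """Estimate how much of the firmware's 'used' figure is uncollected garbage.

    Only runtime-visible variables can be listed here, so anything that is
    visible only to boot services would be misread as inactive space.
    """
    reported_used = total - remaining
    active = estimate_active_bytes(variable_sizes, per_variable_overhead)
    return max(0, reported_used - active)


# 200 small variables of 8 bytes each: the overhead guess dominates the total.
small_vars = [8] * 200
low_guess = estimate_active_bytes(small_vars, per_variable_overhead=32)
high_guess = estimate_active_bytes(small_vars, per_variable_overhead=128)
```

With lots of small variables the result swings from 8000 to 27200 bytes depending purely on the overhead guess, which is why the guess matters so much on those systems.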

Meanwhile, Samsung got back to us and let us know that their systems didn't require more than 5KB of nvram space to be available, which meant we could get rid of the 50% value and replace it with 5KB. The hope was that any system that booted with only 5KB of space available in nvram would trigger a garbage collection run. Unfortunately, it turned out that that wasn't true - some systems will only trigger garbage collection if the OS actually makes an attempt to write a variable that won't otherwise fit.

Hence this patch. The new approach is to ask the firmware how much space is available. If the size of the new variable would reduce this to less than 5KB, we attempt to create a variable bigger than the remaining space. This should cause the firmware to realise that it's out of room and either (depending on implementation) perform a garbage collection run at runtime or set a flag that will cause the system to perform garbage collection on the next reboot. We then call QueryVariableInfo() again to see whether a garbage collection run actually happened, and if so check whether we now have enough space. If so, we go ahead and write the variable. If not, we tell userspace that there's not enough space.
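The flow described above can be sketched against a toy firmware model. To be clear, this is not the kernel patch: FakeFirmware, try_write and the "dummy" variable name are invented for the example, and the toy assumes that a write which doesn't fit always fails (optionally collecting garbage on the way), which real firmware may or may not do.

```python
RESERVE_BYTES = 5 * 1024  # the 5KB figure from the post


class FakeFirmware:
    """Toy variable store: deletions leave inactive space until GC runs."""

    def __init__(self, total, gc_at_runtime=True):
        self.total = total
        self.active = {}   # name -> size of live variables
        self.inactive = 0  # bytes held by deleted, uncollected variables
        self.gc_at_runtime = gc_at_runtime

    def query_remaining(self):
        # Stand-in for the remaining-space figure from QueryVariableInfo().
        return self.total - sum(self.active.values()) - self.inactive

    def set_variable(self, name, size):
        if size > self.query_remaining():
            if self.gc_at_runtime:
                self.inactive = 0  # out of room: collect garbage now
            return False           # toy simplification: this write still fails
        self.active[name] = size
        return True

    def delete_variable(self, name):
        self.inactive += self.active.pop(name)


def try_write(fw, name, size, reserve=RESERVE_BYTES):
    """Write a variable only if at least `reserve` bytes stay free afterwards."""
    remaining = fw.query_remaining()
    if remaining - size >= reserve:
        return fw.set_variable(name, size)
    # Ask for more than is left so the firmware notices it is out of room.
    if fw.set_variable("dummy", remaining + 1):
        fw.delete_variable("dummy")  # unexpected success: clean up again
    after = fw.query_remaining()
    if after > remaining and after - size >= reserve:
        return fw.set_variable(name, size)
    return False  # report "not enough space" to the caller
```

If the model collects garbage at runtime, the second query shows more room and the write goes ahead. If it only does so on reboot (gc_at_runtime=False), the second query is unchanged and the caller gets told there's no space.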

This seems to work in all the situations I've tested, and it should prevent a Samsung from ending up bricked. However, it's firmware, so who knows whether it's going to break things for someone else.

Yeep

(Anonymous) 2013-06-03 15:30 (UTC)
And I remember when I thought UEFI would make our lives simpler and reduce the amount of ridiculous hacks we needed. Sigh.

Re: Yeep

(Anonymous) 2013-06-03 18:45 (UTC)
Ridiculous hacks will always be required on the boundary of an interface, as long as at least one of the sides of the interface is immutable (and every hardware/firmware release is immutable from the perspective of the OS). It seems logical that the amount of hacks required increases with the complexity of the interface, so I wouldn't hold my breath for the future of UEFI. And given that one of EFI's original goals was to put more and more driver software on the "controlled" side of the fence, there is a lot of hackery yet to be explored.

I'm not sure if there is a solution to all this foolishness. The only thing I can think of is a blacklist/compliance check performed by the installer: "Your current firmware is not suited to run this software. Contact your vendor for an upgrade and try again". We'd be doing the manufacturer's quality control, as it were.

Ideally (at least for me), a Linux installer would be able to flash a known-good version of coreboot onto the motherboard before installing. However that would "define" an interface at a much lower level than even the old BIOS, and I shudder when thinking about the hackery required in Coreboot. At least hardware revisions come less frequently than firmware updates.

Re: Yeep

(Anonymous) 2013-06-04 11:09 (UTC)
We're talking about code issues that can cause a device to brick under any OS, not just GNU/Linux.

If the UEFI folks are going to go to all the time and trouble to define an API, they should also go to the time and trouble to define some minimum behaviours/standards that should be followed. i.e. a test suite that the OEMs can't manipulate.

The likes of Samsung should be getting publicly ridiculed much more than they are, and perhaps even regulators getting involved to declare the devices unfit for sale.

> they should also go to the time and trouble to define some minimum behaviours/standards that should be followed

They have. It's called "whatever is needed to get Windows to boot".

Were Samsung willing to issue a BIOS update?

[identity profile] yuhongbao.blogspot.com 2013-06-13 06:44 (UTC)
Were Samsung willing to issue a BIOS update to prevent the bricking?

Re: Were Samsung willing to issue a BIOS update?

(Anonymous) 2013-06-13 21:22 (UTC)
For the Samsung NP510R5E-A01UB, they just released a P08RAN BIOS update. I have not found any reference to it via Google yet. It might have been in response to my diatribe with "voice of the customer" regarding their UEFI issues recently. Not sure I am feeling lucky enough to try Linux yet. Time to do a careful data backup.

Is there a recommended Fedora or CentOS version to try? Kudos to everyone trying to get them to fix this problem!

Re: Samsung P08RAN BIOS update and F19 TC3

(Anonymous) 2013-06-14 05:29 (UTC)
Matthew: Thank you for the image link and all your efforts. I flashed the P08RAN BIOS and w8 continued to run correctly. Covered my eyes and hit the enter key. The F19 TC3 live desktop worked! First time I have seen any Linux boot on this laptop. The wireless worked as well.

My Setup: np510r5e-A01U8, original W8 factory load, both fast boot and secure boot OFF, using UEFI OS setting.

Once I get some sleep, I will try installing the F19 TC3 DVD in an empty area I made on the HD. If you have any advice on things to watch out for, it would be appreciated. Take care. Bitflip10

Re: Samsung P08RAN BIOS update and F19 TC3

(Anonymous) 2013-06-15 01:46 (UTC)
Matthew: Well look at this...I can dual boot w8 and f19 tc3 successfully on my NP510! Did not have to do anything really special. Other people should heed the advice: your mileage might vary, and so may mine over time. Time to start serious testing!

The UEFI Bugzilla 873207 could be pretty confusing for some newer users but an F10 to switch OSs works for me.

I agree with Linus on Gnome 3.x; yuk. But I started on Slackware 2.0 way back when so I am jaded.

I will email you my contact info in case you would like some info or testing. Thank you for the F19 TC3 guidance and your efforts.

Bruce bitflip10

Re: Samsung P08RAN BIOS update and F19 TC3

(Anonymous) 2013-06-21 19:23 (UTC)
jun 21 update: No lockups or boot failures since installation except for a known QEMU/qxl driver issue (common F19 bug 7.3). I did a fresh install of TC5 (not using fedup) several days ago. No change in qxl issue but did fix a number of unrelated issues. xorg-x11-drv-qxl-0.1.1-0.9.20130514git77a1594.fc19 is loaded. Suggested workaround of VGA/VNC has worked. Don't believe QEMU is a sammy issue. Thank you!

Samsung-module

(Anonymous) 2013-06-21 09:34 (UTC)
First of all, thank you very much for your effort, Mr. Garrett! :-) I installed Ubuntu 13.04 on a Samsung Series 5 machine (dmidecode -s system-product-name: 530U3C/530U4C/532U3C) a few days ago and now I am running the 3.10-rc6 kernel. However, I wonder if the 'samsung-laptop' module will remain disabled for these laptops. I tried modprobing it but I get "ERROR: could not insert 'samsung_laptop': No such device". I guess this is an intended safety measure. However, there's some functionality missing without the 'samsung-laptop' module inserted (like rfkill, fan control, etc.). I wonder whether it will ever work. Or maybe it will always be totally unsafe to install the module?

Re: Samsung-module

(Anonymous) 2013-08-01 14:38 (UTC)
In a similar vein to this comment, I have just got a new Samsung Series 9 laptop (has a build stamp of May 2013, so I'm hoping it would be free of the EFI bug) and installed Fedora 19 on it. It took me a fair bit of internet searching until I found that this was the reason that things like the keyboard backlight can't be controlled.

I installed F19 in UEFI mode, but it doesn't boot if I change it to CSM, I guess because there's no standard grub MBR? Are there any other pros/cons of CSM?

For now I will probably stick with UEFI, but if there's a compelling reason to change, I will. I'd really like these nice features to all work 100% though. Is there any way to force samsung-laptop to load? Or any way to check if the EFI bug has been resolved? My laptop is still under warranty so I'm not too worried about the possibility of a brick.

install ubuntu 12.04.02 in a samsung laptop

(Anonymous) 2013-09-24 18:54 (UTC)
Hi,

I would like to install Ubuntu on a Samsung laptop, but now I'm not sure if it's possible without the risk of breaking my laptop. I've used Ubuntu for 5 years and I don't want to use Windows, but I don't want to break my laptop...

What can I do?

The laptop is a Samsung np270e5e with UEFI and Windows 8 preinstalled.