mjg59
I bricked a Samsung laptop today. Unlike most of the reported cases of Samsung laptops refusing to boot, I never booted Linux on it - all experimentation was performed under Windows. It seems that the bug we've been seeing is simultaneously simpler in some ways and more complicated in others than we'd previously realised.

So, some background. The original belief was that the samsung-laptop driver was doing something that caused the system to stop working. This driver was coded to a Samsung specification in order to support certain laptop features that weren't accessible via any standardised mechanism. It works by searching a specific area of memory for a Samsung-specific signature. If it finds it, it follows a pointer to a table that contains various magic values that need to be written in order to trigger some system management code that actually performs the requested change. This is unusual in this day and age, but not unique. The problem is that the magic signature is still present on UEFI systems, but attempting to use the data contained in the table causes problems.
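To make the lookup concrete, here is a minimal Python sketch of the "find a signature, then follow a pointer to a table" pattern, run against an in-memory byte buffer rather than real system memory. The signature string, the 16-bit pointer format and the 4-byte table are all made up for illustration; they are not the values or layout from the Samsung specification.

```python
import struct

# Hypothetical values for illustration only; not the real Samsung signature or layout.
SIGNATURE = b"EXAMPLE_SIG"
POINTER_FORMAT = "<H"  # assumed: 16-bit little-endian offset stored right after the signature


def find_table(region: bytes):
    """Return the bytes of the table the signature points at, or None if absent."""
    pos = region.find(SIGNATURE)
    if pos < 0:
        return None  # no signature found, so there is nothing to drive
    ptr_at = pos + len(SIGNATURE)
    (table_offset,) = struct.unpack_from(POINTER_FORMAT, region, ptr_at)
    # Assumed: the table is 4 bytes of "magic values".
    return region[table_offset:table_offset + 4]


# Build a fake 256-byte memory region: signature at offset 0x40, table at 0x80.
region = bytearray(256)
region[0x40:0x40 + len(SIGNATURE)] = SIGNATURE
struct.pack_into(POINTER_FORMAT, region, 0x40 + len(SIGNATURE), 0x80)
region[0x80:0x84] = bytes([0xAA, 0xBB, 0xCC, 0xDD])

table = find_table(bytes(region))
```

The point of the sketch is that the presence of the signature says nothing about whether the table is safe to use, which is exactly the trap on UEFI systems.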

We're not quite sure what those problems are yet. Originally we assumed that the magic values we wrote were causing the problem, so the samsung-laptop driver was patched to disable it on UEFI systems. Unfortunately, this doesn't actually fix the problem - it just avoids the easiest way of triggering it. It turns out that it wasn't the writes that caused the problem, it was what happened next. Performing the writes triggered a hardware error of some description. The Linux kernel caught and logged this. In the old days, people would often never see these logs - the system would then be frozen and it would be impossible to access the hard drive, so they never got written to disk. There's code in the kernel to make this easier on UEFI systems. Whenever a severe error is encountered, the kernel copies recent messages to the UEFI variable storage space. They're then available to userspace after a reboot, allowing more accurate diagnostics of what caused the crash.
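As a rough illustration of the "available to userspace after a reboot" part: on Linux these saved messages are usually exposed through the pstore filesystem. The sketch below assumes it is mounted at /sys/fs/pstore and that EFI-backed records have names starting with dmesg-efi; both are assumptions about a typical setup, and the function name is my own.

```python
from pathlib import Path


def list_crash_records(pstore_dir="/sys/fs/pstore", prefix="dmesg-efi"):
    """Return (name, size_in_bytes) for each saved record, sorted by name.

    The default directory and filename prefix are assumptions about a
    typical Linux setup, not something guaranteed on every system.
    """
    root = Path(pstore_dir)
    if not root.is_dir():
        return []  # pstore not mounted here, or not a Linux system
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
```

Calling list_crash_records() on a real system will typically need root, since the pstore directory is often not readable by ordinary users.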

That crash dump takes about 10K of UEFI storage space. Microsoft require that Windows 8 systems have at least 64K of storage space available. We only keep one crash dump - if the system crashes again it'll simply overwrite the existing one rather than creating another. This is all completely compatible with the UEFI specification, and Apple actually do something very similar on their hardware. Unfortunately, it turns out that some Samsung laptops will fail to boot if too much of the variable storage space is used. We don't know what "too much" is yet, but writing a bunch of variables from Windows is enough to trigger it. I put some sample code here - it writes out 36 variables each containing a kilobyte of random data. I ran this as an administrator under Windows and then rebooted the system. It never came back.
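For a sense of scale, the numbers above can be put side by side. This is only back-of-the-envelope arithmetic: it assumes a store of exactly the 64K minimum, ignores variable names and per-variable overhead, and ignores whatever the firmware has already stored, so the real fractions on a given machine will differ.

```python
KIB = 1024

MIN_STORE = 64 * KIB      # minimum store size Microsoft requires for Windows 8
CRASH_DUMP = 10 * KIB     # approximate size of one kernel crash dump
TEST_VARS = 36            # number of variables written by the sample code
TEST_VAR_SIZE = 1 * KIB   # payload per variable (names and headers ignored)


def fraction_used(used_bytes, store_bytes=MIN_STORE):
    """Fraction of the variable store taken up by used_bytes."""
    return used_bytes / store_bytes


dump_fraction = fraction_used(CRASH_DUMP)
test_fraction = fraction_used(TEST_VARS * TEST_VAR_SIZE)

print(f"one crash dump:    {dump_fraction:.1%} of a 64K store")
print(f"36 x 1K variables: {test_fraction:.1%} of a 64K store")
```

Neither figure comes close to filling a 64K store, which is part of why the failure is so surprising.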

This is pretty obviously a firmware bug. Writing UEFI variables is expressly permitted by the specification, and there should never be a situation in which an OS can fill the variable store in such a way that the firmware refuses to boot the system. We've seen similar bugs in Intel's reference code in the past, but they were all fixed early last year. For now the safest thing to do is not to use UEFI on any Samsung laptops. Unfortunately, if you're using Windows, that'll require you to reinstall it from scratch.

Thinkoad Bricked... related?

Date: 2013-03-13 03:31 am (UTC)
From: (Anonymous)
This might be totally unrelated and coincidental, but a couple days ago my thinkpad was bricked.

I have an Edge E430 (3254-CTO). I was booting Arch on kernel 3.8.2 (CK patchset) when it stalled. This wasn't unusual, as there have been major changes in LVM2 with Arch that I am still trying to figure out. So I manually scanned for the PV, at which point I had a kernel panic.

The kernel panic was really funky looking with very little info having been dumped (maybe 7-10 short jumbled lines). Though this may have been because I was in the initrd still.

After that I was not able to boot whatsoever. No POST or anything. The fan spins like it is going to do something, but nothing happens after that. I even put the optical drive back in so that I could see if it saw that. It spins like it always does when power is applied, but nothing else.

Since you are the only person who seems to be an authority on this particular problem, I am trying to contact you.

Hopefully I will get a reply from you here. I will check periodically. My computer has been sent for repair, but I should contact you at the very least.

Re: Thinkoad Bricked... related?

Date: 2013-03-13 03:32 am (UTC)
From: (Anonymous)
Wow auto-correct on my Nexus 7 really sucks sometimes.

Re: Thinkoad Bricked... related?

Date: 2013-03-13 02:39 pm (UTC)
From: (Anonymous)
Okay, well I certainly appreciate your response, and am glad you are at least aware that there is potentially a situation.

As an Arch Linux user though, I use the latest stable kernels anyway. 3.7.10 is in the main repository, and I was using a 3.8.2 Linux CK kernel. Shouldn't these kernels have had this patch?

If so, would you recommend I revert to legacy BIOS for the time being? Or is using mainline with UEFI safe?

I still have ~165 days of warranty left... what do you recommend I do?

Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at Nebula. Ex-biologist. @mjg59 on Twitter. Content here should not be interpreted as the opinion of my employer.
