One of the goals of our work at Nebula
is making it as easy as possible for someone to set up a private cloud. In an ideal world that would involve being able to just plug in the controller hardware, wire up a rack of servers and turn everything on, but right now there are cases where the default firmware configuration on the servers doesn't match our desired configuration. Plugging a console into individual servers just to set some BIOS options is (based on personal experience) about as much fun as writing a doctoral thesis on the experience of watching paint dry, so it seemed worth trying to find a way to avoid people having to deal with that. Thankfully, it turns out that the industry has come to a similar set of conclusions. Recent Dell hardware lets you use WS-MAN
, which makes it easy to do things like enable security features
as long as you have an authenticated connection to the iDRAC management system. This actually works out wonderfully - the first time a Dell node sends a PXE request to the Cloud Controller, it can push back some configuration changes, reboot the system and then boot it in a known good configuration. Thanks, Dell!.
HP's slightly trickier. As far as I've been able to work out, the remotely-available programmatic interfaces only provide configuration interfaces to the iLO device itself, along with a range of chassis-level monitoring. Instead, HP provide a couple of utilities that allow the OS to change values. The first of these is called conrep. Conrep is able to modify BIOS settings, but needs an XML file which tells it which configuration values are at which addresses. That's a bit of a pain. Thankfully it's being replaced with a new tool called hprcu which is able to directly query the firmware in order to figure out which configuration options are available and which values have to be stored where in order to set them. You run hprcu once and it spits out a file containing your existing settings. You modify that, feed it back into hprcu, reboot and you're set.
Sounds pretty ideal. What's the problem? The first is that hprcu is currently only shipped as a 32-bit binary, and right now we don't deploy any 32-bit support code on our nodes. It'd be irritating to have to do so just to configure the firmware. The second is that it works by doing raw port IO, and that's not going to be possible from userspace once we've moved to a UEFI Secure Boot setup.
So, obviously, I've been working on reimplementing it. The first step was to figure out what it was actually doing. The first step was to strace it to see if it was using the kernel IPMI interface. strace showed no accesses to /dev/ipmi*. My next thought was that it could be using some sort of shared memory segment with the iLO hardware, so I used MMIOTrace
to dump the memory accesses it made. Turned out there weren't many, and certainly not enough to do anything interesting. Then I went back to the strace logs and saw that it was calling iopl() a bunch, and then I got sad.
iopl() allows a userspace application to change its io privilege levels. The Linux manpage for the iopl() call is amazingly helpful, informing us with a straight face that "This call is necessary to allow 8514-compatible
X servers to run under Linux." What it actually does is grant userspace access to the full range of io ports.
So, what's an io port? The x86 architecture (and some others) provide two ways to communicate with hardware. The one that's mostly used these days is to map the hardware into the same address space as memory (memory-mapped io), but x86 has an entirely separate address range that can be used. This is significantly more limited, as only 65,536 addresses are available for the entire system. Further, you can only read or write a maximum of 4 bytes in a single transaction. However, while slower and less generally useful than memory-mapped io, port io has the advantage that it's much simpler to implement. As a result, a lot of the original PC hardware was intended to be accessed via port io, and a bunch of that's still present. Want to reprogram your real-time clock? Port io. Want to read from the legacy keyboard controller? Port io. For low-bandwidth transactions, it's a completely reasonable way to implement hardware.
Applications perform port io by executing in and out instructions. Since these instructions allow you to do things like, say, hit the keyboard controller directly, userspace is generally forbidden from calling them. Sufficiently privileged applications can call iopl() to raise their privileges and gain access to the full set of io ports. But, since in and out are CPU-level instructions, the actual port io accesses won't show up in strace. So I had to take another approach.
The correct way of doing this would be to LD_PRELOAD something that intercepted iopl(), didn't call it, let the application perform an in or out instruction, caught the fault, looked at the stack frame, called iopl(), performed the access, dumped debug details, dropped iopl() again and restored state. Because that seemed like a bunch of work, I took advantage of the fact that hprcu had separate in and out functions and used gdb. I told gdb to break on every call to in or out, dump the register state and then continue. Then I hacked up a script to parse the register dumps and tell me where the reads and writes were going. Astonishingly, it worked. And then I read the results and got really sad.
PCI setup is hard, if you're a BIOS. You've got PCI devices with large memory windows and you've got to arrange them all somehow and look you've only got 640K of RAM you can access while you're doing this and come on, seriously? So PC-type systems define a port IO interface to performing PCI configuration. Each PCI device has an address made up of a bus, a device and a function. You take a 32-bit integer, set the top bit to indicate that you want PCI access, put the bus in bits 16-23, the device in bits 11-15, the function in bits 8-10 and the configuration register on that device you want to access in bits 0-7. Then you write that to io port 0xcf8. Reads or writes to io port 0xcfc then read or write from that configuration register on that device. This lets the BIOS tell each PCI device where it's going to live in mmio address space without having to actually perform any mmio. Good work, BIOS developers.
This works fine in the BIOS, because nothing else is running while the BIOS is. It even works fine in the kernel, which uses this approach on older hardware that doesn't provide a memory-mapped mechanism to access PCI configuration registers. But it works really badly if you're doing it in userspace, because it shares a problem with all other indexed accesses. You write an address to 0xcf8. You write a value to 0xcfc. What happens if the kernel writes to 0xcf8 before you write to 0xcfc? Your access goes to some other register on some other device and suddenly you've just mapped a PCI device over the top of RAM and well I sure hope you saved all your work.
So, I wasn't thrilled to discover that hprcu was communicating with the iLO by using 0xcf8/0xcfc port io accesses. Not only was it going to stop working once we started deploying UEFI Secure Boot, it had the potential to cause really annoyingly hard to debug problems. Working out how to reimplement it became something of a priority.
By looking at the XML files it generated, and by following the port IO accesses it was performing, I could figure out pretty much how it worked. Some configuration values are stored in the real-time clock CMOS. These were just done in the standard way - write the address to port 0x70, read or write the value to port 0x71. Other configuration values were stored in NVRAM, with accesses going via PCI. This was a little trickier to figure out, because sometimes different NVRAM addresses went to the same PCI address. I finally figured it out - there's a 48 byte window into NVRAM via the PCI configuration registers, and another register which chooses which set of NVRAM is visible. So, take the NVRAM address, divide by 48, write that value to register 0xa6, take the modulus, add it to 0x80 and write the desired value to that address.
So now I knew enough to be able to take an existing XML file and deploy it. Definitely progress, and I could even add support to the kernel. But I still wanted to know how hprcu generated that XML file in the first place. Running strings over the binary showed a bunch of debug output that it never actually printed, but immediately after the help text it also printed SsLlFf:HhAaTtOoDd
. I've lost enough years of my life to this kind of thing to be able to identity that as a getopt format string, and there was that tantalising D option at the end that went entirely unmentioned in the help text. Re-running hprcu with the -D argument gave me huge piles of debug output, including references to a DMI entry and an $RBS table.
DMI is a standard for exposing system information, and one aspect of it is providing data from the firmware to the operating system. The OS can look for a defined signature in a fixed area of memory and then pull out a bunch of data about the hardware, including things like the vendor name, model, serial number, BIOS version and more. Some of these information tables are standardised and some are vendor-specific. hprcu was looking for a vendor-specific table and then scanning for it looking for a known signature. That table turned out to to contain a set of signatures and addresses. The address corresponding to the signature hprcu was contained a bunch of data starting with "$RBS". It also contained a huge quantity of ASCII that looked awfully like the strings in the XML file that hprcu had written. Success!
So, the rest of today was spent on working out the format of this table. This was only marginally less tedious than setting up BIOS settings on 20 servers by hand, but I've made enough progress to be able to figure out how to write a kernel-level driver for this. That's about half-done now - the parsing code is all implemented, I just need to add the sysfs glue and I am nowhere near drunk enough for that right now, so for now you're just getting a format description.
To start with, look for a DMI table with a type of 0xe5. This contains several entries of 4 bytes of signature, 8 bytes of address and 4 bytes of length. Find the entry with a signature of "$CRQ" and map the corresponding address. The first four bytes of the mapped region should be "$RBS". The first record is 21 bytes into the file. Each record starts with a single byte representing the type and two bytes representing the record length. Records of type 1 contain an ascii string that starts 8 bytes into the record, and have a further type number 6 bytes into the record that tells you what type of string it is. 0x05 indicates that it's a "Feature string", which defines the overall classification of the following features. 0x06 is an "Option string", which provides a human readable explanation of what this configuration value corresponds to. 0x60 is an "Optional warning string", which tells the user which further configuration may be required before the configuration change takes effect.
All other types appear to be configuration options. Bytes 3 to 6 provide the little-endian name of the configuration option. Bytes 7 and 8 provide a unique numberical identifier for the configuration option. Byte 13 indicates the type. 0x03 is stored in CMOS. 0x04 is stored in NVRAM. 0x05 appears to be some different kind of CMOS store. I haven't figured out the others. There then follows a set of choices. These vary in length depending on the type - 0x03 are 14 bytes long, 0x04 are 6 bytes long, 0x05 are 5 bytes long. Type 0x03 choices have the choice id at byte 0, the CMOS offset at byte 2, the mask at byte 3 and the value at byte 4. Type 0x04 choices have the choice id at byte 0, the nvram offset at bytes 1 and 2, the mask at byte 3 and the value at byte 4. Type 0x05 choices have the choice id at byte 0, the cmos offset at byte 1, the mask at byte 2 and the value at byte 3. There are optional flags following each choice - if the final byte of the choice isn't 0, there then follow 6 bytes of flags. The first four bytes provide the name of a configuration option (little-endian, again), the fifth byte refers to a choice on that option and the sixth byte indicates what kind of flag. 0x1 appears to indicate that the option is mutually exclusive with that other option. Examples include the embedded serial port and virtual serial port options, where it's impossible to map both to the same address at once. A choice can have multiple flags.
After a configuration option, there'll be a set of option string records. In turn, these correspond to the previous choices - if a configuration option had three choices, there'll be three option string records. I haven't figured out whether there's a stronger way of binding the option string records to the configuration choice yet. The other thing I haven't entirely figured out are the details of configuring the platform to perform a cold reboot on the next power cycle, but the io traces for that don't look too bad.
The aim is to provide a kernel driver that exposes all these configuration options via sysfs, including indicating the current value, the available values and letting the user set a new value. In an ideal world we'd have a wonderfully generic interface to this kind of functionality, but I'm (sadly) not sure that that's possible.
And that's how I spent the past two days.
 Or, indeed, kernels
 And, hence, read your keystrokes
 Which has exactly the same problems as the PCI access, in that if you happen to ask the clock what the time is while hprcu is accessing CMOS, at least one of you is going to get very confused
 Obviously, I would never write kernel code while drunk. You're certainly not running any of it in your enterprise kernel.