[personal profile] mjg59
First, read these slides. Done? Good.

(Edit: Just to clarify - these are not my slides. They're from a presentation Jerome Petazzoni gave at Linuxcon NA earlier this year)

Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment[1].

Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.

I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace[2].

So that seems like kind of a wash - there are maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.

I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input[3]. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there are tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).
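
To make "a sufficiently strong seccomp policy" slightly more concrete, here is a minimal sketch of a default-deny syscall allowlist, expressed in the JSON profile format that later Docker releases accept through `--security-opt seccomp=<file>`. This is an illustration rather than anything from the slides: `build_profile` and `ALLOWED_SYSCALLS` are hypothetical names, and the syscall list is an assumption that is far too short for a real workload.

```python
import json

# Illustrative allowlist only (an assumption): real services need many more syscalls.
ALLOWED_SYSCALLS = [
    "read", "write", "close", "fstat", "mmap", "munmap",
    "brk", "rt_sigreturn", "exit", "exit_group",
]


def build_profile(allowed):
    """Return a default-deny seccomp profile as a plain dict."""
    return {
        # Anything not explicitly allowed fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


profile = build_profile(ALLOWED_SYSCALLS)
print(json.dumps(profile, indent=2))
```

The point of the default-deny shape is that anything not listed (ptrace or keyctl, for instance) fails with an error instead of reaching the corresponding kernel code. That shrinks the reachable attack surface, but as noted above it isn't full coverage.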

But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:

  • Strong auditing and aggressive fuzzing of containers under realistic configurations
  • Support for meaningful nesting of Linux Security Modules in namespaces
  • Introspection of container state and (more difficult) the host OS itself in order to identify compromises

These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.
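
As one small illustration of the introspection item in the list above, a host-side tool might compare the namespaces a supposedly contained process lives in against the host's own, using the links under /proc/<pid>/ns. This is a sketch under assumptions: `parse_ns_link`, `namespaces_of` and `shared_with_host` are hypothetical helpers, the inode numbers are made-up sample data, and a shared namespace is only a hint, since some deployments share the network namespace on purpose.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid):
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def shared_with_host(host_ns, proc_ns):
    """Return namespace types where the process still uses the host's namespace."""
    return sorted(kind for kind, inode in proc_ns.items() if host_ns.get(kind) == inode)


# Made-up sample data standing in for namespaces_of(1) and namespaces_of(<container pid>).
host = {"pid": 4026531836, "mnt": 4026531840, "net": 4026531956}
contained = {"pid": 4026532201, "mnt": 4026532199, "net": 4026531956}
print(shared_with_host(host, contained))  # ['net']
```

This only covers the easy half of the problem (container state as seen from the host). Detecting a compromise of the host OS itself needs a vantage point the attacker can't tamper with, which is the more difficult part.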

[1] Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
[2] There's apparently some support for loading per-namespace AppArmor policies, but that means that the process is no longer confined by the sVirt policy.
[3] To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.

Date: 2014-10-23 08:04 am (UTC)
From: [identity profile] m50d.wordpress.com
The same reasoning about attack surfaces applies to other OSes' container systems, like BSD jails or Solaris zones, right? So have any of them undergone this kind of rigorous security analysis? I'd expect Sun/Oracle or maybe even Joyent to have put some effort in, but maybe I'm being overly optimistic. So are there any audited container-like systems out there, or is every option in the same boat?

Date: 2014-10-23 02:47 pm (UTC)
From: (Anonymous)
You're not being overly optimistic: Sun did extensive analysis when the zones work was being done -- to the point of (somewhat famously) having a company-wide contest (with significant cash prizes) for finding an exploit in zones. Faults aside, Sun had many creative engineers, and many tried to find exploits. In the end, a single exploit was found that was somewhat dubious (it allowed for denial-of-service, but not necessarily privilege escalation), but it was fixed nonetheless -- and this was over a decade ago. In the years since, there has never been a privilege escalation discovered with Solaris zones (or with its descendant technologies in the open source illumos). At Joyent (where I am the CTO) we have run SmartOS containers in multi-tenant production for 8 years; we have built our business on it, and we take its security very seriously!

Use-cases for lightweight containers

Date: 2014-10-23 12:51 pm (UTC)
From: [personal profile] pvanhoof
Right now it's still hard for application developers to start using nspawn and the like.

Firstly, systemd is not universal yet. With Debian having accepted it as its init system, I have high hopes this will happen soon.

Secondly, its lightweight containers are not yet completely fit for delegating a desktop service to. For example, I want to run org.gnome.evolution.dataserver.Calendar4 in a container that is completely separate from the host, yet the host's calendar applet in gnome-shell needs to show the Calendar's contents.

That's because sd-bus isn't a public library yet and kdbus isn't universal either, yet. You'd need both, I learned yesterday, for applications to start using sd_bus_open_system_remote and bus_set_address_system_remote.

I have not figured out how to properly configure D-Bus service activation for service requests on the host to activate and nspawn the container providing the service. But according to Lennart that's also already possible through container socket activation. I just have not figured it out yet.

Re: Use-cases for lightweight containers

Date: 2014-10-23 04:33 pm (UTC)
From: (Anonymous)
Actually it's possible to do basic containerization without relying upon work being delegated to a privileged component such as systemd.

See https://gitorious.org/linted/linted/source/0178ba7e01bbfcae993394af8965a5365ec3816b:src/spawn/spawn.c

Re: Use-cases for lightweight containers

Date: 2014-10-24 08:05 am (UTC)
From: [personal profile] pvanhoof
A lot of things are possible. That doesn't mean they are good ideas.

Application developers want one API to do this stuff, not yet another API for every damn initiative that doesn't like some other initiative. Because that will look like this:

static void use_service_x() {
#elif _HAVE_OPENRC_
system("/etc/init.d/d-bus start");
system("/etc/init.d/some-other-vague-script start");
system("/etc/init.d/workaround-for-silly-idea start");
#elif ...

To be clear, as a software developer myself, this is NOT what we are going to do. What we will do is this:

static void use_service_x() {
/* Hay! We are ignoring everything but systemd */


I hope all those init and container API initiatives and maintainers realize this.

Re: Use-cases for lightweight containers

Date: 2014-10-31 02:46 am (UTC)
From: (Anonymous)
> A lot of things are possible. That doesn't mean they are a good ideas.

> To be clear, as a software developer myself, this is NOT what we are going to do.

You too, you too.

Sure, you can wait until systemd is universally deployed (which it will not be, as it makes several fundamental mistakes), or you can actually solve the problem you were hired to solve and use the solution which works right now and is correct.

Re: Use-cases for lightweight containers

Date: 2014-10-31 08:50 am (UTC)
From: [personal profile] pvanhoof
Right now service activation on D-Bus works by letting D-Bus do it and calling a method on an object using a service name.

I would expect any service activation for services running in lightweight containers to work the same way.

Right now systemd seems to be in the pole position to define the standard for this.

I'm not sure that you or somebody else hired me to solve this particular problem, so I don't know what you mean by that. You can always reply with your business address so that I can send you an invoice if you want to hire me to do something you think is the right way. Please note that this is no guarantee that the other open source communities will accept it as the right thing. It is a guarantee that I will invoice you for it anyway, though.

Re: Use-cases for lightweight containers

Date: 2014-10-23 11:14 pm (UTC)
From: (Anonymous)
Why not just bind mount the dbus socket into the container?

Re: Use-cases for lightweight containers

Date: 2014-10-24 08:37 am (UTC)
From: [personal profile] pvanhoof
Same reason as above: application developers need one consistent API, it must be simple to use and do the right thing (unless you make it blow off your foot).

If application developers must do voodoo ninja magic like bind mounting a UNIX domain socket in /srv/containers/service_x/var/run/dbus/system_bus_socket (which depends on /srv/containers/service_x/etc/d-bus/system.conf), then I predict that within a few years Linux container usage by application developers will be one gigantic flying spaghetti mess.

I recall the mess we had when we had ORBit-2, Bonobo, DCOP, hundreds upon hundreds of reworked copies of XMMS's socket.c IPC code, libICE, IPC systems over X11 that used the same mechanisms the clipboard and drag and drop use (oh my god), and then finally, thank god finally, a standardization on D-Bus. And because it's a standard people like to hate it. But it's easier to ignore hating people than to ignore millions of badly designed and completely different IPC systems. So I like this situation a lot more than what we had in the early 2000s.

I'm really hoping the init and container initiative maintainers avoid it entirely this time. Please do it benevolent dictatorship style and define ONE good way of working and ONE set of APIs. From the flamewar heat systemd maintainers are getting, I conclude they are doing a great job of doing exactly that.

If that means saying no to a huge group of angry people (who aren't coding anything but instead spending entire days fulminating on systemd) and then just going for systemd, then I think that it should mean exactly that.

Afterwards competing implementations can replace systemd by implementing that ONE good way and ONE set of APIs. I guess a bit like how libdbus-1 came first, then implementations like GDBus, QtDbus, and now kdbus and sd-bus in systemd appeared.

For now, bind mounting and other APIs are fine for amusement and experimenting. Something standardized and actually usable should in my opinion also exist (and be the default that everybody uses).
Edited Date: 2014-10-24 08:58 am (UTC)

Re: Use-cases for lightweight containers

Date: 2014-10-23 11:22 pm (UTC)
From: (Anonymous)
Oh, and maybe for the activation bit you can make a .service that is aliased to dbus-yadayada then Wants=systemd-nspawn@mycont.service.

Re: Use-cases for lightweight containers

Date: 2014-10-24 09:27 am (UTC)
From: [personal profile] pvanhoof
That looks better. I'm going to try this! Thanks.

With this I can just call my D-Bus service's object's method using the standardized D-Bus interface and be done with it as the application developer depending on such a D-Bus service.

This way I don't need to care about bind mounting, nor about starting up O-M-F-G scripts that do everything completely differently on each and every distribution to get the container running, or anything completely silly like that.

Imagine the calendar support in gnome-shell having to do all that mess just to get the Calendar4 service of Evolution Data Server up and running. How could it ever know that on system x it's running in a container and on system y it's on the host? And of course, because the sysadmin needs absolute freedom and choice, we application developers have to do it completely differently for both. As if we care about their favorite distribution's init.d design mistakes.

In fact, as an application developer I do not want to care how and where the service runs. Container or on the host: just provide me the interfaces the .service file promises, and bring up the service.

On the other side, as the upstream developer of such a service, I want a standard way to tell downstream how to bring my service up (with or without containers). How to configure cgroups, for example. I want to add this to our make-install target. Downstream can override it, of course, and they probably very often will, but then it's their problem.

All these questions and ideas assume the D-Bus IPC can crossover from the host to the containers. I'm not sure if this is going to be the idea in the first place with kdbus, of course. Though I think it should be like that. But I always think a lot of things. Heh :)

LSMs and containers

Date: 2014-10-23 01:50 pm (UTC)
From: (Anonymous)
I suspect that lack of per-namespace policy isn't the only problem with LSMs and containers. I'm very slowly working on improving this, but I'm not going to do the heavy lifting.

--Andy, who breaks these things for amusement

Namespaced LSM's

Date: 2014-10-23 11:27 pm (UTC)
From: (Anonymous)
> There's apparently some support for loading per-namespace Apparmor policies,
> but that means that the process is no longer confined by the sVirt policy

Would it not be possible to make the namespace handling able to tell when a namespaced policy tries to expand beyond the original (in this case, sVirt) policy, and then just silently deny that expansion (and report it on the host)?
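
The idea in this question can be sketched with plain set arithmetic: the effective policy is the intersection of what the outer policy allows and what the namespaced policy asks for, and anything left over gets reported. This is a toy model under loud assumptions. Real AppArmor and SELinux policies are not flat sets of (path, permission) pairs, and `confine`, the paths and the permission strings here are all hypothetical.

```python
def confine(parent_policy, child_policy):
    """Clamp a namespaced policy to its parent.

    Returns (effective, denied): what the child actually gets, and what it
    asked for beyond the parent's limits.
    """
    effective = parent_policy & child_policy
    denied = child_policy - parent_policy
    return effective, denied


# Hypothetical outer (sVirt-style) policy and a namespaced policy loaded inside it.
svirt = {("/var/lib/images/guest.img", "rw"), ("/dev/kvm", "rw")}
requested = {("/var/lib/images/guest.img", "rw"), ("/etc/shadow", "r")}

effective, denied = confine(svirt, requested)
for path, perm in sorted(denied):
    # Stand-in for reporting the attempted expansion on the host.
    print(f"host audit: denied {perm} on {path}")
```

In this model the namespaced policy can only ever narrow access, never widen it, which is the property the question is asking for. Whether real LSM policy languages can be intersected this cheaply is a separate matter.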

Why no RSS?

Date: 2014-10-24 02:59 pm (UTC)
From: (Anonymous)
I might have forgotten to take my anti-stupid pills this morning, but I would really like to add this blog to my feed reader but I can't find an RSS feed for it anywhere. Am I just missing it?

Re: Why no RSS?

Date: 2014-10-27 01:57 pm (UTC)
From: (Anonymous)
Thanks, not sure why I couldn't find that on my own :/

LSM really??

Date: 2014-10-27 06:41 am (UTC)
From: (Anonymous)
Matthew Garrett, a long time ago I raised the idea of allowing namespace splitting of LSMs, and I pretty much got laughed at. "No one would need that" was very much the response. Now maybe we need it.

Let's be more focused here. LSM modules don't play nicely with each other. We are going to have containers that use AppArmor, SELinux.... Is there any real gain in doing this? Would we be better off just improving cgroup namespace limitations to cover everything SELinux and AppArmor can do? A unified interface for LSMs is required. For example, both SELinux and AppArmor limit the paths applications can access, but each requires a configuration file in a different format. More individual configuration files mean more risk of human error.

LSM was a temporary solution put in place in 2000 because the Linux world could not come to agreement on what security the Linux kernel should provide by default. Since LSM was a temporary measure, we should seriously sit down now and look again at what the default security should be and what can be unified from the user-space point of view.

> (b) performance (several of the grsec options reduce performance).

Running an LSM inside an LSM will cause a performance hit as well.

Capabilities: the weakness in their design is a far bigger issue. CAP_SYS_ADMIN allows far too much. More fine-grained capabilities would reduce how much syscall filtering is required.
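
For anyone wanting to check whether a given process actually holds CAP_SYS_ADMIN, the effective capability set is exposed as a hex bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes it. The helper names are hypothetical, the capability table is deliberately partial, and the two status snippets are sample inputs rather than output captured from a real system (the first mask is the kind of reduced set a container runtime might grant, which is an assumption).

```python
CAP_SYS_ADMIN_BIT = 21  # bit number from linux/capability.h

# Partial table of capability bit numbers; the kernel defines many more.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    7: "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def cap_eff_from_status(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def known_caps(mask):
    """List the capabilities from CAP_NAMES that are set in the mask."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if mask & (1 << bit)]


def has_sys_admin(mask):
    """True if the mask includes CAP_SYS_ADMIN."""
    return bool(mask & (1 << CAP_SYS_ADMIN_BIT))


# Sample inputs: a reduced capability set and a full one.
restricted = cap_eff_from_status("Name:\tsh\nCapEff:\t00000000a80425fb\n")
full = cap_eff_from_status("Name:\tsh\nCapEff:\t0000003fffffffff\n")

print(known_caps(restricted), has_sys_admin(restricted))
print(has_sys_admin(full))
```

A single bit covering so many unrelated operations is exactly the coarseness being complained about here: once that bit is set, the mask tells you nothing about which of those operations the process really needs.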

Container-based deployments can still choose to run individual instances in a hypervisor.

Re: LSM really??

Date: 2014-10-30 03:18 am (UTC)
From: (Anonymous)
If we are going to settle on one LSM, then cgroups is not even in the running. Why add a bunch of extra code to cgroups when AppArmor and SELinux already have all of it?

Re: LSM really??

Date: 2014-10-30 04:46 am (UTC)
From: (Anonymous)
> Why add a bunch of extra code to cgroups when AppArmor and SELinux already have all of it?

The reality here is that AppArmor and SELinux use completely different formats. Docker and other container-based solutions will be better off with something unified.

cgroups are not an LSM; they are more of a standard feature. The reality is that the cgroups filesystem namespace is doing the same kind of stuff as both AppArmor and SELinux when it comes to restricting file access. cgroups, AppArmor and SELinux are stomping all over each other's turf already. A lot of cgroup namespaces are doing all the same stuff.

Most of the code is already in cgroups. It's more a matter of finishing off the functionality.

SELinux role-based security starts stomping over polkit as well.

In fact, I am not saying settle on one LSM. I am saying focus on making sure the security works even if no LSM is loaded at all.

LSM was created because Linux developers could not agree on how to do mandatory access control. At some point we do have to draw a line in the sand, say we have had enough security prototype code, and focus on delivering final-product security that is uniform wherever possible.

Mostly, to get to uniform we will be unable to select one LSM. Instead, feature X will be delivered in way X, and then all LSMs migrate to that method.

Basically, if you say AppArmor and SELinux have it, my question is: exactly how do I use the features all the time without having to know which LSM is loaded?

If you say I have to make sure the right LSM is loaded, or select an LSM, or write configuration for every LSM in existence, this is wrong. I don't have to write different audio output code per sound card. Yet I have to provide different configuration based on which LSM module is loaded.

The reality is we will most likely end up with Docker putting some form of processing wrapper over the LSM configuration files. So let's try to avoid having to be messy.


Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at Google. Ex-biologist. @mjg59 on Twitter. Content here should not be interpreted as the opinion of my employer.
