mjg59 | Linux Container Security

First, read these slides. Done? Good.

(Edit: Just to clarify - these are not my slides. They're from a presentation Jerome Petazzoni gave at Linuxcon NA earlier this year)

Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment[1].
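To make "restricting capabilities" and "using seccomp" a little more concrete, here is a minimal sketch that decodes the `CapEff` and `Seccomp` fields from the text of `/proc/<pid>/status`. The function names `decode_caps` and `summarize_status` are made up for this illustration. The capability order follows the bit numbers in `<linux/capability.h>` as I understand them, so check it against your kernel headers before relying on it.

```python
# Capability names indexed by bit number (assumed to match <linux/capability.h>).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def decode_caps(hex_mask):
    """Turn a hex capability mask (e.g. the CapEff field) into a set of names."""
    mask = int(hex_mask, 16)
    names = set()
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            # Bits newer than this table are reported by number.
            names.add(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit_{bit}")
    return names


def summarize_status(status_text):
    """Extract effective capabilities and seccomp mode from status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "capabilities": decode_caps(fields.get("CapEff", "0")),
        "seccomp": SECCOMP_MODES.get(int(fields.get("Seccomp", "0")), "unknown"),
    }
```

Feeding it the contents of `/proc/self/status` from inside a container and again from the host shows how much smaller the container's set of privileged entry points is. It says nothing about which syscalls a seccomp filter actually blocks.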

Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.

I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace[2].

So that seems like kind of a wash - there's maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.

I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input[3]. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there's tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).
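As a rough illustration of checking how much of /proc, /sys and /dev a container can actually touch, the sketch below asks the kernel whether the current process could read or write a handful of sensitive paths. The watch list and the name `exposed_paths` are assumptions chosen for the example, not a complete audit. `os.access` reports only what the access check says, so a later open may still be refused.

```python
import os

# Hypothetical watch list for illustration; a real audit would cover far more.
SENSITIVE_PATHS = [
    "/proc/sysrq-trigger",
    "/proc/sys/kernel/core_pattern",
    "/proc/kcore",
    "/sys/kernel/uevent_helper",
    "/dev/mem",
]


def exposed_paths(paths=SENSITIVE_PATHS, can_access=os.access):
    """Return (path, mode) pairs the current process appears able to use.

    can_access is injectable so the logic can be exercised without a
    real /proc; it defaults to os.access.
    """
    findings = []
    for path in paths:
        if can_access(path, os.W_OK):
            findings.append((path, "writable"))
        elif can_access(path, os.R_OK):
            findings.append((path, "readable"))
    return findings
```

Running `exposed_paths()` inside a container should ideally return an empty list. Anything it does return is a candidate for a read-only mount or for removal from the container's view.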

But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:

Strong auditing and aggressive fuzzing of containers under realistic configurations
Support for meaningful nesting of Linux Security Modules in namespaces
Introspection of container state and (more difficult) the host OS itself in order to identify compromises

These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.
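As a very modest first step towards the introspection item in the list above, the sketch below reads the namespace links under `/proc/<pid>/ns` and reports which namespaces a container process still shares with a host process. The helper names are hypothetical. This is nowhere near compromise detection; it only shows the kind of state that is cheap to inspect from the host.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid, readlink=os.readlink, listdir=os.listdir):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in listdir(ns_dir):
        _, inode = parse_ns_link(readlink(f"{ns_dir}/{entry}"))
        result[entry] = inode
    return result


def shared_namespaces(host_ns, container_ns):
    """Return the namespace names where both processes hold the same inode."""
    return sorted(
        name for name, inode in container_ns.items()
        if host_ns.get(name) == inode
    )
```

Comparing `namespaces_of(1)` with `namespaces_of(container_pid)` on the host (with enough privilege to read both) lists the namespaces the container has not been separated from, which is a reasonable thing to alert on if the list changes unexpectedly.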

[1] Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
[2] There's apparently some support for loading per-namespace Apparmor policies, but that means that the process is no longer confined by the sVirt policy
[3] To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.


From: m50d.wordpress.com

The same reasoning about attack surfaces applies to other OS' container systems, like BSD jails or Solaris zones, right? So have any of them undergone this kind of rigorous security analysis? I'd expect Sun/Oracle or maybe even Joyent to have put some effort in, but maybe I'm being overly optimistic. So are there any audited container-like systems out there, or is every option in the same boat?

From: (Anonymous)

You're not being overly optimistic: Sun did extensive analysis when the zones work was being done -- to the point of (somewhat famously) having a company-wide contest (with significant cash prizes) for finding an exploit in zones. Faults aside, Sun had many creative engineers, and many tried to find exploits. In the end, a single exploit was found that was somewhat dubious (it allowed for denial-of-service, but not necessarily privilege escalation), but it was fixed nonetheless -- and this was over a decade ago. In the years since, there has never been a privilege escalation discovered with Solaris zones (or with its descendant technologies in the open source illumos). At Joyent (where I am the CTO) we have run SmartOS containers in multi-tenant production for 8 years; we have built our business on it, and we take its security very seriously!

Nobody sees the point in spending their time attacking something that almost no one uses. A.K.A "security through irrelevance".

(reply from suspended user)

Actually it's possible to do basic containerization without relying upon work being delegated to a privileged component such as systemd.

See https://gitorious.org/linted/linted/source/0178ba7e01bbfcae993394af8965a5365ec3816b:src/spawn/spawn.c

> A lot of things are possible. That doesn't mean they are good ideas.

> To be clear, as a software developer myself, this is NOT what we are going to do.

You too, you too.

Sure, you can wait until systemd is universally deployed (which it will not be, as it makes several fundamental mistakes), or you can actually solve the problem you were hired to solve and go with the approach which works right now and is correct.

Why not just bind mount the dbus socket into the container?

Oh, and maybe for the activation bit you can make a .service that is aliased to dbus-yadayada then Wants=systemd-nspawn@mycont.service.

I suspect that lack of per-namespace policy isn't the only problem with LSMs and containers. I'm very slowly working on improving this, but I'm not going to do the heavy lifting.

--Andy, who breaks these things for amusement

> There's apparently some support for loading per-namespace Apparmor policies,
> but that means that the process is no longer confined by the sVirt policy

Would it not be possible to make the namespace handling be able to tell if a namespaced policy tries to expand beyond the original (in this case, sVirt) policy then just silently deny that expansion (and report it in the host)?
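The idea in this question can be sketched as a toy model: treat each policy as a set of permissions, clamp whatever the namespaced policy requests to what the outer policy already allows, and keep the dropped requests for reporting on the host. This is only an illustration of the set logic, not how AppArmor or sVirt actually evaluate policy, and the permission strings and the name `confine` are hypothetical.

```python
def confine(parent_allowed, requested):
    """Clamp a nested policy to its parent.

    Returns (effective, denied): the permissions actually granted, and the
    requested ones that were dropped because the parent does not allow them.
    """
    parent_allowed = frozenset(parent_allowed)
    requested = frozenset(requested)
    return requested & parent_allowed, requested - parent_allowed


# Example: the nested policy asks for one permission the parent lacks.
effective, denied = confine(
    {"read:/etc", "read:/var"},
    {"read:/etc", "write:/etc"},
)
```

Here `effective` holds only `read:/etc`, and `denied` holds `write:/etc`, which is what the host would log. Real policies have ordering, wildcards and transitions, so the intersection is much harder to compute than it is for plain sets.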

I might have forgotten to take my anti-stupid pills this morning, but I would really like to add this blog to my feed reader but I can't find an RSS feed for it anywhere. Am I just missing it?

mjg59

http://mjg59.dreamwidth.org/data/rss

Thanks, not sure why I couldn't find that on my own :/

Matthew Garrett, a long time ago I raised the idea of allowing namespace splitting of LSMs and pretty much got laughed at. "No one would need that" was very much the response. Now maybe we need it.

Let's be more focused here. LSM modules don't play nicely with each other. We are going to have containers that are AppArmor-based, SELinux-based and so on. Is there any real gain to doing this? Would we be better off just improving cgroup namespace limitations to cover everything SELinux and AppArmor can do? A unified interface for LSMs is required. For example, both SELinux and AppArmor limit the paths applications can access, but each requires a configuration file in a different format. More individual configuration files mean more risk of human error.

LSM was a temporary solution put in place in 2000 because Linux developers could not come to agreement on what security the Linux kernel should provide by default. Since LSM was a temporary measure, we should seriously sit down now and look again at what the default security should be and what can be unified from the user-space point of view.

> (b) performance (several of the grsec options reduce performance).
Running a LSM inside a LSM will cause a performance hit as well.

Capabilities: the weakness in their design is a far bigger issue. CAP_SYS_ADMIN allows far too much. More fine-grained capabilities would reduce how much syscall filtering is required.

Container-based deployments can still choose to run individual instances in a hypervisor.

If we are going to settle on one LSM, then cgroups is not even in the running. Why add a bunch of extra code to cgroups when AppArmor and SELinux already have all of it?

> Why add a bunch of extra code to cgroups when AppArmor and SELinux already have all of it?

The reality here is that AppArmor and SELinux use completely different formats. Docker and other container-based solutions will be better off with something unified.

cgroups are not an LSM; they are more of a standard feature. The reality is that the cgroups filesystem namespace is doing the same kind of thing as both AppArmor and SELinux when it comes to restricting file access. cgroups, AppArmor and SELinux are already stomping all over each other's turf. A lot of the cgroup namespaces are doing all the same stuff.

Most of the code is already in cgroups. It's more a matter of finishing off the functionality.

SELinux role-based security starts stomping over polkit as well.

In fact I am not saying settle on one LSM. I am saying focus on making sure that security works even if no LSM is loaded at all.

LSM was created because Linux developers could not agree on how to do mandatory access control. At some point we do have to draw a line in the sand, say we have had enough security prototype code, and focus on delivering final-product security that is uniform wherever possible.

Most likely, to get to something uniform we will be unable to select one LSM. Instead, feature X will be delivered in way X, and then all LSMs will migrate to that method.

Basically, if you say AppArmor and SELinux have it, my question is: exactly how do I use the features all the time without having to know which LSM is loaded?

If you say I have to make sure the right LSM is loaded, or select an LSM, or write configuration for every LSM in existence, this is wrong. I don't have to write different audio output code per sound card, yet I have to provide different configuration depending on which LSM module is loaded.

The reality is we will most likely end up with Docker putting some form of processing wrapper over the LSM configuration files. So let's try to avoid having to be messy.

Matthew Garrett
