[FEATURE REQUEST] Encrypted indexes over encrypted data

Started by Danylo Hlynskyiover 7 years ago6 messages
#1Danylo Hlynskyi
abcz2.uprola@gmail.com

Hi! Haven't found discussions on possible ways to encrypt indexes. Let's
start!

The problem
==========

I'd like to encrypt some columns (first name, last name, SSN, etc...) in a
nondeterministic way. This can be done using `pg_crypto`, but then I loose
full-text-search (and any other indexing) capabilities on these fields.

Blind indexing also isn't a good enough option.

Obviously we don't want create expression-based indexes, that perform
decryption during index build. This will store plaintexts inside index
buffers, and decryption key will be included in database dump.

We don't trust full-disk-encryption or any other transparent encryption,
because of possible SQL injections.

Solution 1 (possibly can be used even now)
========
- perform full-disk encryption
- perform encryption of column
- add decrypting expression-based index with decryption key
- limit ways on disclosing index internals. Ideally if no one except admin
can do that
- limit ways to read index definitions - so it's not possible for
application to uncover decryption key from database itself, it should know
it on it's own.

Solution 2 (feature request)
========
- full-disk encryption is optional
- data column is encrypted
- index is decrypted by construction, but each it's block is encrypted,
even in memory.
- lookups over index do lazy index buffer decrypt and close buffers ASAP
- make every query that has to touch encrypted column or encrypted index
require decryption key. This means, SELECT, DELETE, UPDATE, INSERT, VACUUM,
CLUSTER, CREATE INDEX, pg_dump, pg_restore all should have decryption key
supplied in order to be executed. This also means, that autovacuum daemon
can't work.

What do you think about both solutions? Is it hard to implement soluition 2?

#2Andres Freund
andres@anarazel.de
In reply to: Danylo Hlynskyi (#1)
Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

On August 9, 2018 5:30:26 PM GMT+05:30, Danylo Hlynskyi <abcz2.uprola@gmail.com> wrote:

?Is it hard to implement soluition 2?

Yes.

To the point that I'm fairly certain that an implementation would be considered to costly to maintain (vs benefit) of proposed.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#3Bear Giles
bgiles@coyotesong.com
In reply to: Andres Freund (#2)
Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

There are alternatives. If you know what you want to find, e.g., a search
by username or email address, you can store a strong hash of the value as
an indexed column. By "strong hash" I mean don't just use md5 or sha1, or
even one round with a salt. I can give you more details about how and why
offline.

So you might have a record with:

id serial primary key,
email_hash text not null indexed,
first_name_hash text indexed,
last_name_hash text indexed,
phone_number_hash text indexed ,
'wallet'_containing_all_encrypted_values text

and that allows you to search on email, first name, last name, or phone
number, or some combination on them. But no expressions. The hashing would
be done in your app, not the database. You also probably want to convert
everything to lowercase, maybe remove spaces, etc., before computing the
hash.

You should be prepared to handle multiple matches. It's unlikely that an
email or phone number hash won't be unique but it's safest to always be
prepared for more than one match, decrypt the 'wallet', and then do a final
comparison. That also gives you a bit of protection from an attacker
creating an account and then changing the hash values to match someone
else. You can use that to support very limited expressions, e.g., also keep
a hash on the first three letters of their last name, but that will
compromise your security a bit since it allows an attacker to perform some
statistical analysis on the data.

Finally there's the general advice that hashes (and encrypted values)
should always have a version number of some sort. It could be something as
simple as 3$hash, or it could be a composite column or even a user-defined
type. The # indicates is a lookup into a table, perhaps in your app, that
tells you which hashing algorithm and salt to use. It makes life a lot
easier if the security audit tells you that you need to change your
cipher/salt/key/whatever but you can't do it immediately since you don't
know everything you need in order to do it, e.g., the password that you
need in order to recompute the hash value. With that version number it's
easy to continue to accept the existing password so they can log in, and in
the background you quietly recompute the hash using the new
salt/algorithm/whatever and update their record. I've worked for some
pretty knowledgeable companies that have overlooked this.

On Thu, Aug 9, 2018 at 6:05 AM, Andres Freund <andres@anarazel.de> wrote:

Show quoted text

On August 9, 2018 5:30:26 PM GMT+05:30, Danylo Hlynskyi <
abcz2.uprola@gmail.com> wrote:

?Is it hard to implement soluition 2?

Yes.

To the point that I'm fairly certain that an implementation would be
considered to costly to maintain (vs benefit) of proposed.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#4Nico Williams
nico@cryptonector.com
In reply to: Danylo Hlynskyi (#1)
Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

On Thu, Aug 09, 2018 at 03:00:26PM +0300, Danylo Hlynskyi wrote:

The problem
==========

[...]

We don't trust full-disk-encryption or any other transparent encryption,
because of possible SQL injections.

Can you elaborate on this?

Solution 1 (possibly can be used even now)
========
- perform full-disk encryption
- perform encryption of column
- add decrypting expression-based index with decryption key
- limit ways on disclosing index internals. Ideally if no one except admin
can do that
- limit ways to read index definitions - so it's not possible for
application to uncover decryption key from database itself, it should know
it on it's own.

But... you have to have the decryption key(s) in memory at all times to
enable any write operations. And plaintext as well at various times.

What does this gain you that FDE doesn't?

Solution 2 (feature request)
========
- full-disk encryption is optional
- data column is encrypted
- index is decrypted by construction, but each it's block is encrypted,
even in memory.
- lookups over index do lazy index buffer decrypt and close buffers ASAP
- make every query that has to touch encrypted column or encrypted index
require decryption key. This means, SELECT, DELETE, UPDATE, INSERT, VACUUM,
CLUSTER, CREATE INDEX, pg_dump, pg_restore all should have decryption key
supplied in order to be executed. This also means, that autovacuum daemon
can't work.

Same response.

What do you think about both solutions? Is it hard to implement soluition 2?

They gain little or nothing over doing filesystem encryption in the OS
or even just plain FDE (see below). They are not worthwhile.

You need to define your threat model. What bad actors are you
protecting against? What threats are you protecting against?

/messages/by-id/20180622042337.GL4200@localhost

Here are some threats you might choose to protect against:

1) passive attackers on the wire
2) active attackers on the wire
3a) theft / compromise of storage devices
3b) compromise of decommissioned storage devices
3c) theft of running server
4) compromised backup storage
5) bad / compromised clients
6) bad / compromised DBAs or sysadmins
7) side channel exploits
8) ??

(1) and (2) are taken care of by TLS.

(3a) is taken care of by FDE in controllers, say, or by physical
security.

(3b) is taken care of by proper decommissioning, but FDE helps.

(3c) you can't protect against if you have keys in memory. You could
use client-side crypto, but you'll have more clients to worry about than
servers. Physical security is your best option. (And really, you don't
get any way to protect against law enforcement taking the devices.)

(4) is taken care of by encrypting backups, which requires no changes to
PG to get.

(5) is taken care of (to some degree) by server-side logic (triggers,
...).

(6)... You can't protect against sysadmins, really, nor DBAs, but you
can use crypto on the *client*-side to get some protection. Since the
PG client is very thin and dumb, the PG client can't easily do this.
The idea is to encrypt values and MAC/sign rows to prevent DBAs/
sysadmins seeing sensitive data or tampering with your data.

(7) one deals with by using crypto implementations built with side
channel protection, though, really, this is a very difficult subject in
general, especially since Spectre.

Nico
--

#5Bear Giles
bgiles@coyotesong.com
In reply to: Nico Williams (#4)
Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

Some regulatory standards require all UII, even all PII, information be
encrypted within the database, not just on encrypted media. That's to
reduce exposure even if someone gets access to a live server, e.g., via SQL
Injection. (The perennial #1 risk for software vulnerabilities.)

UII is uniquely identifiable information, e.g., SSN.

PII is personally identifiable information, e.g, email address, phone
number, address. It doesn't have to be enough to uniquely identify the
person, just enough to cut the number of possible individuals down to a
handful. That's a surprising large number of fields, e.g., knowing where
someone was born and their birthdate will get you close to their SSN for
anyone born after the IRS started requiring SSNs for all claimed
dependents. Knowing someone's birth date and city of residence will get you
down to a handful of individuals, often a single individual depending upon
their age and the size of their city. It's remarkably easy to uniquely
identify something like 75% of the population if you have the data from a
couple different sites and some way to correlate the records. (That's why
it's not good enough to just use the sha1 of an email address, etc.)

I know the government required UII encryption in its databases when I last
worked on a government contract, and I think they've required PII
encryption as well for years. I would be verify surprised if HIPAA doesn't
require that as well for PII in addition to the medical info. I definitely
know PCI-DSS requires encryption of all information on the credit card
itself - you can keep the last few digits (I think 6 are allowed but for is
recommended) to facilitate searches. Of course companies could still have
the same information unencrypted in other columns or tables (except for the
CC number itself - and you *never* keep the CVN in any form on threat of
losing your ability to accept credit cards if you're caught) but they were
encouraged to encrypt it as well.

Anyway legal requirements is "#0" on that list. Everything else *might*
happen but depending upon the nature of the data you *will* be audited for
compliance with regulations, either preemptively (e.g., VISA requires
periodic audits of anyone making more than $X in transactions per year) or
after a breach. One of my other past employers did the type of auditing
VISA requires and their promotional material was full of interviews with
former small business owners who lost their business after a breach. It
wasn't due to the loss itself, it's because any breach automatically
requires the strictest auditing for the next (4?) years and that cost far
more than the average independent restaurant, auto repair shop, etc., can
afford. Obviously their business model is (in part) to scare people but
there are plenty of situations where you have to encrypt data within the
database and not just rely on encrypted media.

Show quoted text

Here are some threats you might choose to protect against:

1) passive attackers on the wire
2) active attackers on the wire
3a) theft / compromise of storage devices
3b) compromise of decommissioned storage devices
3c) theft of running server
4) compromised backup storage
5) bad / compromised clients
6) bad / compromised DBAs or sysadmins
7) side channel exploits
8) ??

(1) and (2) are taken care of by TLS.

(3a) is taken care of by FDE in controllers, say, or by physical
security.

(3b) is taken care of by proper decommissioning, but FDE helps.

(3c) you can't protect against if you have keys in memory. You could
use client-side crypto, but you'll have more clients to worry about than
servers. Physical security is your best option. (And really, you don't
get any way to protect against law enforcement taking the devices.)

(4) is taken care of by encrypting backups, which requires no changes to
PG to get.

(5) is taken care of (to some degree) by server-side logic (triggers,
...).

(6)... You can't protect against sysadmins, really, nor DBAs, but you
can use crypto on the *client*-side to get some protection. Since the
PG client is very thin and dumb, the PG client can't easily do this.
The idea is to encrypt values and MAC/sign rows to prevent DBAs/
sysadmins seeing sensitive data or tampering with your data.

(7) one deals with by using crypto implementations built with side
channel protection, though, really, this is a very difficult subject in
general, especially since Spectre.

Nico
--

#6Nico Williams
nico@cryptonector.com
In reply to: Bear Giles (#5)
Re: [FEATURE REQUEST] Encrypted indexes over encrypted data

On Thu, Aug 09, 2018 at 02:34:07PM -0600, Bear Giles wrote:

Some regulatory standards require all UII, even all PII, information be
encrypted within the database, not just on encrypted media. That's to
reduce exposure even if someone gets access to a live server, e.g., via SQL
Injection. (The perennial #1 risk for software vulnerabilities.)

My preference for dealing with SQL Injection is to not provide direct
SQL access, but to use PostgREST exporting a schema that has only PG SQL
functions encapsulating all supported queries. You just can't have
injection with such an access layer because you don't get to send SQL to
the server (because you don't get to send SQL to PostgREST). It really
helps that PostgREST is written in Haskell.

That said, sure, if you have SQL Injection issues, then encrypting in
the database will do provided that there's no transparent way to access
the data (otherwise you've gained nothing). That basically means you're
doing all the crypto on the client.

If you're doing all the crypto on the client, then your options for
indexing are very limited indeed. To avoid offline dictionary attacks
you have to index MAC'ed values, effectively. You can still do free
text indexing, provided you MAC each word. MAC == message
authentication code, really, it's a keyed hash function, typically HMAC,
UMAC, or some such. You could also index ciphertext, provided it has an
authentication tag, but you don't gain anything versus just indexing the
authentication tag.

I know the government required UII encryption in its databases when I last
[...]

Usually regulations are not quite as prescriptive as that, though
there's always a discussion to be had with the regulators/auditors when
you deviate from the norm. You're generally not precluded from having
better solutions than is the norm.

Nico
--