Transparent column encryption

Started by Peter Eisentrautover 4 years ago103 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

I want to present my proof-of-concept patch for the transparent column
encryption feature. (Some might also think of it as automatic
client-side encryption or similar, but I like my name.) This feature
enables the {automatic,transparent} encryption and decryption of
particular columns in the client. The data for those columns then
only ever appears in ciphertext on the server, so it is protected from
the "prying eyes" of DBAs, sysadmins, cloud operators, etc. The
canonical use case for this feature is storing credit card numbers
encrypted, in accordance with PCI DSS, as well as similar situations
involving social security numbers etc. Of course, you can't do any
computations with encrypted values on the server, but for these use
cases, that is not necessary. This feature does support deterministic
encryption as an alternative to the default randomized encryption, so
in that mode you can do equality lookups, at the cost of some
security.

This functionality also exists in other SQL database products, so the
overall concepts weren't invented by me by any means.

Also, this feature has nothing to do with the on-disk encryption
feature being contemplated in parallel. Both can exist independently.

The attached patch has all the necessary pieces in place to make this
work, so you can have an idea how the overall system works. It
contains some documentation and tests to help illustrate the
functionality. But it's missing the remaining 90% of the work,
including additional DDL support, error handling, robust memory
management, protocol versioning, forward and backward compatibility,
pg_dump support, psql \d support, refinement of the cryptography, and
so on. But I think obvious solutions exist to all of those things, so
it isn't that interesting to focus on them for now.

------

Now to the explanation of how it works.

You declare a column as encrypted in a CREATE TABLE statement. The
column value is encrypted by a symmetric key called the column
encryption key (CEK). The CEK is a catalog object. The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK). The CMK is not stored in the database but somewhere
where the client can get to it, for example in a file or in a key
management system. When a server sends rows containing encrypted
column values to the client, it first sends the required CMK and CEK
information (new protocol messages), which the client needs to record.
Then, the client can use this information to automatically decrypt the
incoming row data and forward it in plaintext to the application.

For the CMKs, the catalog object specifies a "provider" and generic
options. Right now, libpq has a "file" provider hardcoded, and it
takes a "filename" option. Via some mechanism to be determined,
additional providers could be loaded and then talk to key management
systems via http or whatever. I have left some comments in the libpq
code where the hook points for this could be.

The general idea would be for an application to have one CMK per area
of secret stuff, for example, for credit card data. The CMK can be
rotated: each CEK can be represented multiple times in the database,
encrypted by a different CMK. (The CEK can't be rotated easily, since
that would require reading out all the data from a table/column and
reencrypting it. We could/should add some custom tooling for that,
but it wouldn't be a routine operation.)

The encryption algorithms are mostly hardcoded right now, but there
are facilities for picking algorithms and adding new ones that will be
expanded. The CMK process uses RSA-OAEP. The CEK process uses
AES-128-CBC right now; a more complete solution should probably
involve some HMAC thrown in.

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption). These are essentially cousins of bytea. For the rest of
the database system below the protocol handling, there is nothing
special about those. For example, encryptedr has no operators at all,
encryptedd has only an equality operator. pg_attribute has a new
column attrealtypid that stores the original type of the data in the
column. This is only used for providing it to clients, so that
higher-level clients can convert the decrypted value to their
appropriate data types in their environments.

Some protocol extensions are required. These should be guarded by
some _pq_... setting, but this is not done in this patch yet. As
mentioned above, extra messages are added for sending the CMKs and
CEKs. In the RowDescription message, I have commandeered the format
field to add a bit that indicates that the field is encrypted. This
could be made a separate field, and there should probably be
additional fields to indicate the algorithm and CEK name, but this was
easiest for now. The ParameterDescription message is extended to
contain format fields for each parameter, for the same purpose.
Again, this could be done differently.

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out). It is required to use
protocol-level prepared statements (i.e., extended query) for this.
The client must first prepare a statement, then describe the statement
to get parameter metadata, which indicates which parameters are to be
encrypted and how. So this will require some care by applications
that want to do this, but, well, they probably should be careful
anyway. In libpq, the existing APIs make this difficult, because
there is no way to pass the result of a describe-statement call back
into execute-statement-with-parameters. I added new functions that do
this, so you then essentially do

res0 = PQdescribePrepared(conn, "");
res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0);

(The name could obviously be improved.) Other client APIs that have a
"statement handle" concept could do this more elegantly and probably
without any API changes.

Another challenge is that the parse analysis must check which
underlying column a parameter corresponds to. This is similar to
resorigtbl and resorigcol in the opposite direction. The current
implementation of this works for the test cases, but I know it has
some problems, so I'll continue working in this. This functionality
is in principle available to all prepared-statement variants, not only
protocol-level. So you can see in the tests that I expanded the
pg_prepared_statements view to show this information as well, which
also provides an easy way to test and debug this functionality
independent of column encryption.

And also, psql doesn't use prepared statements, so writing into
encrypted columns currently doesn't work at all via psql. (Reading
works no problem.) All the test code currently uses custom libpq C
programs. We should think about a way to enable prepared statements
in psql, perhaps something like

INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2'

(\gexec and \gx are already taken.)

------

This is not targeting PostgreSQL 15. But I'd appreciate some feedback
on the direction. As I mentioned above, a lot of the remaining work
is arguably mostly straightforward. Some closer examination of the
issues surrounding the libpq API changes and psql would be useful.
Perhaps there are other projects where that kind of functionality
would also be useful.

Attachments:

v1-0001-Transparent-column-encryption.patchtext/plain; charset=UTF-8; name=v1-0001-Transparent-column-encryption.patchDownload+2196-67
#2Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Transparent column encryption

On Fri, Dec 3, 2021 at 4:32 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

But it's missing the remaining 90% of the work,
including additional DDL support, error handling, robust memory
management, protocol versioning, forward and backward compatibility,
pg_dump support, psql \d support, refinement of the cryptography, and
so on. But I think obvious solutions exist to all of those things, so
it isn't that interesting to focus on them for now.

Right, we wouldn't want to get bogged down at this stage in little
details like, uh, everything.

Some protocol extensions are required. These should be guarded by
some _pq_... setting, but this is not done in this patch yet. As
mentioned above, extra messages are added for sending the CMKs and
CEKs. In the RowDescription message, I have commandeered the format
field to add a bit that indicates that the field is encrypted. This
could be made a separate field, and there should probably be
additional fields to indicate the algorithm and CEK name, but this was
easiest for now. The ParameterDescription message is extended to
contain format fields for each parameter, for the same purpose.
Again, this could be done differently.

I think this is reasonable. I would choose to use an additional bit in
the format field as opposed to a separate field. It is worth
considering whether it makes more sense to extend the existing
ParameterDescription message conditionally on some protocol-level
option, or whether we should instead, say, add ParameterDescription2
or the moral equivalent. As I see it, the latter feels conceptually
simpler, but on the other hand, our wire protocol supposes that we
will never run out of 1-byte codes for messages, so perhaps some
prudence is needed.

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out). It is required to use
protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#1)
Re: Transparent column encryption

On Fri, 2021-12-03 at 22:32 +0100, Peter Eisentraut wrote:

This feature does support deterministic
encryption as an alternative to the default randomized encryption, so
in that mode you can do equality lookups, at the cost of some
security.

+ if (enc_det)
+ memset(iv, ivlen, 0);

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]https://datatracker.ietf.org/doc/html/rfc8452), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

The encryption algorithms are mostly hardcoded right now, but there
are facilities for picking algorithms and adding new ones that will be
expanded. The CMK process uses RSA-OAEP. The CEK process uses
AES-128-CBC right now; a more complete solution should probably
involve some HMAC thrown in.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

This is not targeting PostgreSQL 15. But I'd appreciate some feedback
on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

--Jacob

[1]: https://datatracker.ietf.org/doc/html/rfc8452

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#2)
Re: Transparent column encryption

On 06.12.21 19:28, Robert Haas wrote:

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out). It is required to use
protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.

The client can send something like

PQexec(conn, "INSERT INTO tbl VALUES ('ENCBLOB', 'ENCBLOB')");

and it will work. (See the included test suite where 'ENCBLOB' is
actually computed by pgcrypto.) But that is not transparent encryption.
The client wants to send "INSERT INTO tbl VALUES ('val1', 'val2')" and
have libpq take care of encrypting 'val1' and 'val2' before hitting the
wire. For that you need to use the prepared statement API so that the
values are available separately from the statement. And furthermore the
client needs to know what columns the insert statements is writing to,
so that it can get the CEK for that column. That's what it needs the
parameter description for.

As alluded to, workarounds exist or might be made available to do part
of that work yourself, but that shouldn't be the normal way of using it.

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#3)
Re: Transparent column encryption

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

I'm aware of this and plan to make use of SIV. The current
implementation is just an example.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

I don't know how that is supposed to work. When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those. But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row. You would need some permanent logical row ID for this, I think.
For this scenario, the deterministic encryption mode is perhaps not the
right one.

This is not targeting PostgreSQL 15. But I'd appreciate some feedback
on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

The point is to prevent admins from getting at plaintext data. The
scenario you show is an interesting one but I think it's not meant to be
addressed by this. If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.

If this is a problem, then perhaps a better approach would be to store
parts of the data in a separate database with separate admins.

#6Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#5)
Re: Transparent column encryption

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

I'm aware of this and plan to make use of SIV. The current
implementation is just an example.

Sounds good.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

I don't know how that is supposed to work. When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those. But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row. You would need some permanent logical row ID for this, I think.

Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.

This is not targeting PostgreSQL 15. But I'd appreciate some feedback
on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

The point is to prevent admins from getting at plaintext data. The
scenario you show is an interesting one but I think it's not meant to be
addressed by this. If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.

--Jacob

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jacob Champion (#6)
Re: Transparent column encryption

On 12/7/21 19:02, Jacob Champion wrote:

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

I'm aware of this and plan to make use of SIV. The current
implementation is just an example.

Sounds good.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

I don't know how that is supposed to work. When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those. But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row. You would need some permanent logical row ID for this, I think.

Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.

This is not targeting PostgreSQL 15. But I'd appreciate some feedback
on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

The point is to prevent admins from getting at plaintext data. The
scenario you show is an interesting one but I think it's not meant to be
addressed by this. If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.

IMO it's impossible to solve this attack within TCE, because it requires
ensuring consistency at the row level, but TCE obviously works at column
level only.

I believe TCE can do AEAD at the column level, which protects against
attacks that flipping bits, and similar attacks. It's just a matter of
how the client encrypts the data.

Extending it to protect the whole row seems tricky, because the client
may not even know the other columns, and it's not clear to me how it'd
deal with things like updates of the other columns, hint bits, dropped
columns, etc.

It's probably possible to get something like this (row-level AEAD) by
encrypting enriched data, i.e. not just the card number, but {user ID,
card number} or something like that, and verify that in the webapp. The
problem of course is that the "user ID" is just another column in the
table, and there's nothing preventing the DBA from modifying that too.

So I think it's pointless to try extending this to row-level AEAD.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Tomas Vondra (#7)
Re: Transparent column encryption

On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:

IMO it's impossible to solve this attack within TCE, because it requires
ensuring consistency at the row level, but TCE obviously works at column
level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt? That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

I believe TCE can do AEAD at the column level, which protects against
attacks that flipping bits, and similar attacks. It's just a matter of
how the client encrypts the data.

Right, I think authenticated encryption ciphers (without AD) will be
important to support in practice. I think users are going to want
*some* protection against active attacks.

Extending it to protect the whole row seems tricky, because the client
may not even know the other columns, and it's not clear to me how it'd
deal with things like updates of the other columns, hint bits, dropped
columns, etc.

Covering the entire row automatically probably isn't super helpful in
practice. As you mention later:

It's probably possible to get something like this (row-level AEAD) by
encrypting enriched data, i.e. not just the card number, but {user ID,
card number} or something like that, and verify that in the webapp. The
problem of course is that the "user ID" is just another column in the
table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.

--Jacob

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jacob Champion (#8)
Re: Transparent column encryption

On 12/8/21 00:26, Jacob Champion wrote:

On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:

IMO it's impossible to solve this attack within TCE, because it requires
ensuring consistency at the row level, but TCE obviously works at column
level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

Imagine you do this

UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt?

I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly would the AEAD prevent this?
Wouldn't that be also specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

I believe TCE can do AEAD at the column level, which protects against
attacks that flipping bits, and similar attacks. It's just a matter of
how the client encrypts the data.

Right, I think authenticated encryption ciphers (without AD) will be
important to support in practice. I think users are going to want
*some* protection against active attacks.

Extending it to protect the whole row seems tricky, because the client
may not even know the other columns, and it's not clear to me how it'd
deal with things like updates of the other columns, hint bits, dropped
columns, etc.

Covering the entire row automatically probably isn't super helpful in
practice. As you mention later:

It's probably possible to get something like this (row-level AEAD) by
encrypting enriched data, i.e. not just the card number, but {user ID,
card number} or something like that, and verify that in the webapp. The
problem of course is that the "user ID" is just another column in the
table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

Do any other databases have this capability? How do they do it?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Tomas Vondra (#9)
Re: Transparent column encryption

On Wed, 2021-12-08 at 02:58 +0100, Tomas Vondra wrote:

On 12/8/21 00:26, Jacob Champion wrote:

On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:

IMO it's impossible to solve this attack within TCE, because it requires
ensuring consistency at the row level, but TCE obviously works at column
level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

The way I originally described my request -- "I'd like to be able to
tie an encrypted value to other column (or external) data" -- was not
very clear.

With my proposed model -- where the DBA (and the server) are completely
untrusted, and the DBA needs to be prevented from using the encrypted
value -- I don't think there's a useful way for the client to use
associated data that comes from the server. The client has to know what
the AD should be beforehand, because otherwise the DBA can make it so
the server returns whatever is correct.

Imagine you do this

UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

So with my (hopefully more clear) model above, it wouldn't. The client
would already have the AD, and somehow tell libpq what that data was
for the query.

The rabbit hole I led you down is one where we use the rest of the row
as AD, to try to freeze pieces of it in place. That might(?) have some
useful security properties (if the client defines its use and doesn't
defer to the server). But it's not what I intended to propose and I'd
have to think about that case some more.

In my credit card example, I'm imagining something like (forgive the
contrived syntax):

SELECT address, :{aead(users.credit_card, 'user@example.com')}
FROM users WHERE email = 'user@example.com';

UPDATE users
SET :{aead(users.credit_card, 'user@example.com')} = '1234-...'
WHERE email = 'user@example.com';

The client explicitly links a table's column to its AD for the duration
of the query. This approach can't scale to

SELECT credit_card FROM users;

because in this case the AD for each row is different, but I'd argue
that's ideal for this particular case. The client doesn't need to (and
probably shouldn't) grab everyone's credit card details all at once, so
there's no reason to optimize for it.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt?

I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly would the AEAD prevent this?
Wouldn't that be also specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

Right, exactly. When I mentioned AEAD I had assumed that "allow the
client to define which columns have to be encrypted" was already
planned or in the works; I just misunderstood pieces of Peter's email.
It's that piece where a client would probably have to add details
around AEAD and its use.

That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

I agree that resistance against offline attacks is a useful step
forward (it seems to be a strict improvement over pgcrypto). I have a
feeling that end users will *expect* some protection against online
attacks too, since an evil DBA is going to be well-positioned to do
exactly that.

It's probably possible to get something like this (row-level AEAD) by
encrypting enriched data, i.e. not just the card number, but {user ID,
card number} or something like that, and verify that in the webapp. The
problem of course is that the "user ID" is just another column in the
table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

I'll do some more thinking on the case you're talking about here, where
pieces of the row are transparently tied together.

Do any other databases have this capability? How do they do it?

BigQuery advertises AEAD support. I don't think their model is the same
as ours, though; from the docs it looks like it's essentially pgcrypto,
where you tell the server to encrypt stuff for you.

--Jacob

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jacob Champion (#10)
Re: Transparent column encryption

On 12/9/21 01:12, Jacob Champion wrote:

On Wed, 2021-12-08 at 02:58 +0100, Tomas Vondra wrote:

On 12/8/21 00:26, Jacob Champion wrote:

On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:

IMO it's impossible to solve this attack within TCE, because it requires
ensuring consistency at the row level, but TCE obviously works at column
level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

The way I originally described my request -- "I'd like to be able to
tie an encrypted value to other column (or external) data" -- was not
very clear.

With my proposed model -- where the DBA (and the server) are completely
untrusted, and the DBA needs to be prevented from using the encrypted
value -- I don't think there's a useful way for the client to use
associated data that comes from the server. The client has to know what
the AD should be beforehand, because otherwise the DBA can make it so
the server returns whatever is correct.

True. With untrusted server the additional data would have to come from
some other source. Say, an isolated auth system or so.

Imagine you do this

UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

So with my (hopefully more clear) model above, it wouldn't. The client
would already have the AD, and somehow tell libpq what that data was
for the query.

The rabbit hole I led you down is one where we use the rest of the row
as AD, to try to freeze pieces of it in place. That might(?) have some
useful security properties (if the client defines its use and doesn't
defer to the server). But it's not what I intended to propose and I'd
have to think about that case some more.

OK

In my credit card example, I'm imagining something like (forgive the
contrived syntax):

SELECT address, :{aead(users.credit_card, 'user@example.com')}
FROM users WHERE email = 'user@example.com';

UPDATE users
SET :{aead(users.credit_card, 'user@example.com')} = '1234-...'
WHERE email = 'user@example.com';

The client explicitly links a table's column to its AD for the duration
of the query. This approach can't scale to

SELECT credit_card FROM users;

because in this case the AD for each row is different, but I'd argue
that's ideal for this particular case. The client doesn't need to (and
probably shouldn't) grab everyone's credit card details all at once, so
there's no reason to optimize for it.

Maybe, but it seems like a rather annoying limitation, as it restricts
the client to single-row queries (or at least it looks like that to me).
Yes, it may be fine for some use cases, but I'd bet a DBA who can modify
data can do plenty other things - swapping "old" values, which will have
the right AD, for example.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt?

I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly would the AEAD prevent this?
Wouldn't that be also specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

Right, exactly. When I mentioned AEAD I had assumed that "allow the
client to define which columns have to be encrypted" was already
planned or in the works; I just misunderstood pieces of Peter's email.
It's that piece where a client would probably have to add details
around AEAD and its use.

That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

I agree that resistance against offline attacks is a useful step
forward (it seems to be a strict improvement over pgcrypto). I have a
feeling that end users will *expect* some protection against online
attacks too, since an evil DBA is going to be well-positioned to do
exactly that.

Yeah.

It's probably possible to get something like this (row-level AEAD) by
encrypting enriched data, i.e. not just the card number, but {user ID,
card number} or something like that, and verify that in the webapp. The
problem of course is that the "user ID" is just another column in the
table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

I'll do some more thinking on the case you're talking about here, where
pieces of the row are transparently tied together.

OK. In any case, I think we shouldn't require this capability from the
get go - it's fine to get the simple version done first, which gives us
privacy / protects against passive attacker. And then sometime in the
future improve this further.

Do any other databases have this capability? How do they do it?

BigQuery advertises AEAD support. I don't think their model is the same
as ours, though; from the docs it looks like it's essentially pgcrypto,
where you tell the server to encrypt stuff for you.

Pretty sure it's server-side. The docs say it's for encryption at rest,
all the examples do the encryption/decryption in SQL, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#1)
Re: Transparent column encryption

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption). These are essentially cousins of bytea.

Does that mean someone could go in with psql and select out the data
without any keys and just get a raw bytea-like representation? That
seems like a natural and useful thing to be able to do. For example to
allow dumping a table and loading it elsewhere and transferring keys
through some other channel (perhaps only as needed).

#13Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#12)
Re: Transparent column encryption

On 16.12.21 05:47, Greg Stark wrote:

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption). These are essentially cousins of bytea.

Does that mean someone could go in with psql and select out the data
without any keys and just get a raw bytea-like representation? That
seems like a natural and useful thing to be able to do. For example to
allow dumping a table and loading it elsewhere and transferring keys
through some other channel (perhaps only as needed).

Yes to all of that.

#14Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Tomas Vondra (#11)
Re: Transparent column encryption

On Thu, 2021-12-09 at 11:04 +0100, Tomas Vondra wrote:

On 12/9/21 01:12, Jacob Champion wrote:

The rabbit hole I led you down is one where we use the rest of the row
as AD, to try to freeze pieces of it in place. That might(?) have some
useful security properties (if the client defines its use and doesn't
defer to the server). But it's not what I intended to propose and I'd
have to think about that case some more.

So after thinking about it some more, in the case where the client is
relying on the server to return both the encrypted data and its
associated data -- and you don't trust the server -- then tying even
the entire row together doesn't help you.

I was briefly led astray by the idea that you could include a unique or
primary key column in the associated data, and then SELECT based on
that column -- but a motivated DBA could simply corrupt state so that
the row they wanted got returned regardless of the query. So the client
still has to have prior knowledge.

In my credit card example, I'm imagining something like (forgive the
contrived syntax):

SELECT address, :{aead(users.credit_card, 'user@example.com')}
FROM users WHERE email = 'user@example.com';

UPDATE users
SET :{aead(users.credit_card, 'user@example.com')} = '1234-...'
WHERE email = 'user@example.com';

The client explicitly links a table's column to its AD for the duration
of the query. This approach can't scale to

SELECT credit_card FROM users;

because in this case the AD for each row is different, but I'd argue
that's ideal for this particular case. The client doesn't need to (and
probably shouldn't) grab everyone's credit card details all at once, so
there's no reason to optimize for it.

Maybe, but it seems like a rather annoying limitation, as it restricts
the client to single-row queries (or at least it looks like that to me).
Yes, it may be fine for some use cases, but I'd bet a DBA who can modify
data can do plenty other things - swapping "old" values, which will have
the right AD, for example.

Resurrecting old data doesn't help the DBA read the values, right? I
view that as similar to the "increasing account balance" problem, in
that it's definitely a problem but not one we're trying to tackle here.

(And I'm not familiar with any solutions for resurrections -- other
than having data expire and tying the timestamp into the
authentication, which I think again requires AD. Revoking signed data
is one of those hard problems. Do you know a better way?)

OK. In any case, I think we shouldn't require this capability from the
get go - it's fine to get the simple version done first, which gives us
privacy / protects against passive attacker. And then sometime in the
future improve this further.

Agreed. (And I think the client should be able to enforce encryption in
the first place, before I distract you too much with other stuff.)

--Jacob

#15Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#14)
Re: Transparent column encryption

On 17.12.21 01:41, Jacob Champion wrote:

(And I think the client should be able to enforce encryption in
the first place, before I distract you too much with other stuff.)

Yes, this is a useful point that I have added to my notes.

#16Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#1)
Re: Transparent column encryption

Here is a new version of this patch. See also the original description
quoted below. I have done a significant amount of work on this over the
last few months. Some important news include:

- The cryptography has been improved. It now uses an AEAD scheme, and
for deterministic encryption a proper SIV construction.

- The OpenSSL-specific parts have been moved to a separate file in
libpq. Non-OpenSSL builds compile and work (without functionality, of
course).

- libpq handles multiple CEKs and CMKs, including changing keys on the fly.

- libpq supports a mode to force encryption of certain values.

- libpq supports a flexible configuration system for looking up CMKs,
including support for external key management systems.

- psql has a new \gencr command that allows passing in bind parameters
for (potential) encryption.

- There is some more pg_dump and psql support.

- The new data types for storing encrypted data have been renamed for
clarity.

- Various changes to the protocol compared to the previous patch.

- The patch contains full documentation of the protocol changes,
glossary entries, and more new documentation.

The major pieces that are still missing are:

- DDL support for registering keys

- Protocol versioning or feature flags

Other than that it's pretty complete in my mind.

For interested reviewers, I have organized the patch so that you can
start reading it top to bottom: The documentation comes first, then the
tests, then the code changes. Even some feedback on the first or first
two aspects would be valuable to me.

Old news follows:

Show quoted text

On 03.12.21 22:32, Peter Eisentraut wrote:

I want to present my proof-of-concept patch for the transparent column
encryption feature.  (Some might also think of it as automatic
client-side encryption or similar, but I like my name.)  This feature
enables the {automatic,transparent} encryption and decryption of
particular columns in the client.  The data for those columns then
only ever appears in ciphertext on the server, so it is protected from
the "prying eyes" of DBAs, sysadmins, cloud operators, etc.  The
canonical use case for this feature is storing credit card numbers
encrypted, in accordance with PCI DSS, as well as similar situations
involving social security numbers etc.  Of course, you can't do any
computations with encrypted values on the server, but for these use
cases, that is not necessary.  This feature does support deterministic
encryption as an alternative to the default randomized encryption, so
in that mode you can do equality lookups, at the cost of some
security.

This functionality also exists in other SQL database products, so the
overall concepts weren't invented by me by any means.

Also, this feature has nothing to do with the on-disk encryption
feature being contemplated in parallel.  Both can exist independently.

The attached patch has all the necessary pieces in place to make this
work, so you can have an idea how the overall system works.  It
contains some documentation and tests to help illustrate the
functionality.  But it's missing the remaining 90% of the work,
including additional DDL support, error handling, robust memory
management, protocol versioning, forward and backward compatibility,
pg_dump support, psql \d support, refinement of the cryptography, and
so on.  But I think obvious solutions exist to all of those things, so
it isn't that interesting to focus on them for now.

------

Now to the explanation of how it works.

You declare a column as encrypted in a CREATE TABLE statement.  The
column value is encrypted by a symmetric key called the column
encryption key (CEK).  The CEK is a catalog object.  The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK).  The CMK is not stored in the database but somewhere
where the client can get to it, for example in a file or in a key
management system.  When a server sends rows containing encrypted
column values to the client, it first sends the required CMK and CEK
information (new protocol messages), which the client needs to record.
Then, the client can use this information to automatically decrypt the
incoming row data and forward it in plaintext to the application.

For the CMKs, the catalog object specifies a "provider" and generic
options.  Right now, libpq has a "file" provider hardcoded, and it
takes a "filename" option.  Via some mechanism to be determined,
additional providers could be loaded and then talk to key management
systems via http or whatever.  I have left some comments in the libpq
code where the hook points for this could be.

The general idea would be for an application to have one CMK per area
of secret stuff, for example, for credit card data.  The CMK can be
rotated: each CEK can be represented multiple times in the database,
encrypted by a different CMK.  (The CEK can't be rotated easily, since
that would require reading out all the data from a table/column and
reencrypting it.  We could/should add some custom tooling for that,
but it wouldn't be a routine operation.)

The encryption algorithms are mostly hardcoded right now, but there
are facilities for picking algorithms and adding new ones that will be
expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
AES-128-CBC right now; a more complete solution should probably
involve some HMAC thrown in.

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption).  These are essentially cousins of bytea.  For the rest of
the database system below the protocol handling, there is nothing
special about those.  For example, encryptedr has no operators at all,
encryptedd has only an equality operator.  pg_attribute has a new
column attrealtypid that stores the original type of the data in the
column.  This is only used for providing it to clients, so that
higher-level clients can convert the decrypted value to their
appropriate data types in their environments.

Some protocol extensions are required.  These should be guarded by
some _pq_... setting, but this is not done in this patch yet.  As
mentioned above, extra messages are added for sending the CMKs and
CEKs.  In the RowDescription message, I have commandeered the format
field to add a bit that indicates that the field is encrypted.  This
could be made a separate field, and there should probably be
additional fields to indicate the algorithm and CEK name, but this was
easiest for now.  The ParameterDescription message is extended to
contain format fields for each parameter, for the same purpose.
Again, this could be done differently.

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.
The client must first prepare a statement, then describe the statement
to get parameter metadata, which indicates which parameters are to be
encrypted and how.  So this will require some care by applications
that want to do this, but, well, they probably should be careful
anyway.  In libpq, the existing APIs make this difficult, because
there is no way to pass the result of a describe-statement call back
into execute-statement-with-parameters.  I added new functions that do
this, so you then essentially do

    res0 = PQdescribePrepared(conn, "");
    res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0);

(The name could obviously be improved.)  Other client APIs that have a
"statement handle" concept could do this more elegantly and probably
without any API changes.

Another challenge is that the parse analysis must check which
underlying column a parameter corresponds to.  This is similar to
resorigtbl and resorigcol in the opposite direction.  The current
implementation of this works for the test cases, but I know it has
some problems, so I'll continue working in this.  This functionality
is in principle available to all prepared-statement variants, not only
protocol-level.  So you can see in the tests that I expanded the
pg_prepared_statements view to show this information as well, which
also provides an easy way to test and debug this functionality
independent of column encryption.

And also, psql doesn't use prepared statements, so writing into
encrypted columns currently doesn't work at all via psql.  (Reading
works no problem.)  All the test code currently uses custom libpq C
programs.  We should think about a way to enable prepared statements
in psql, perhaps something like

INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2'

(\gexec and \gx are already taken.)

------

This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.  As I mentioned above, a lot of the remaining work
is arguably mostly straightforward.  Some closer examination of the
issues surrounding the libpq API changes and psql would be useful.
Perhaps there are other projects where that kind of functionality
would also be useful.

Attachments:

v2-0001-Transparent-column-encryption.patchtext/plain; charset=UTF-8; name=v2-0001-Transparent-column-encryption.patchDownload+4380-123
#17Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#16)
Re: Transparent column encryption

Rebased patch, no new functionality.

Show quoted text

On 29.06.22 01:29, Peter Eisentraut wrote:

Here is a new version of this patch.  See also the original description
quoted below.  I have done a significant amount of work on this over the
last few months.  Some important news include:

- The cryptography has been improved.  It now uses an AEAD scheme, and
for deterministic encryption a proper SIV construction.

- The OpenSSL-specific parts have been moved to a separate file in
libpq.  Non-OpenSSL builds compile and work (without functionality, of
course).

- libpq handles multiple CEKs and CMKs, including changing keys on the fly.

- libpq supports a mode to force encryption of certain values.

- libpq supports a flexible configuration system for looking up CMKs,
including support for external key management systems.

- psql has a new \gencr command that allows passing in bind parameters
for (potential) encryption.

- There is some more pg_dump and psql support.

- The new data types for storing encrypted data have been renamed for
clarity.

- Various changes to the protocol compared to the previous patch.

- The patch contains full documentation of the protocol changes,
glossary entries, and more new documentation.

The major pieces that are still missing are:

- DDL support for registering keys

- Protocol versioning or feature flags

Other than that it's pretty complete in my mind.

For interested reviewers, I have organized the patch so that you can
start reading it top to bottom: The documentation comes first, then the
tests, then the code changes.  Even some feedback on the first or first
two aspects would be valuable to me.

Old news follows:

On 03.12.21 22:32, Peter Eisentraut wrote:

I want to present my proof-of-concept patch for the transparent column
encryption feature.  (Some might also think of it as automatic
client-side encryption or similar, but I like my name.)  This feature
enables the {automatic,transparent} encryption and decryption of
particular columns in the client.  The data for those columns then
only ever appears in ciphertext on the server, so it is protected from
the "prying eyes" of DBAs, sysadmins, cloud operators, etc.  The
canonical use case for this feature is storing credit card numbers
encrypted, in accordance with PCI DSS, as well as similar situations
involving social security numbers etc.  Of course, you can't do any
computations with encrypted values on the server, but for these use
cases, that is not necessary.  This feature does support deterministic
encryption as an alternative to the default randomized encryption, so
in that mode you can do equality lookups, at the cost of some
security.

This functionality also exists in other SQL database products, so the
overall concepts weren't invented by me by any means.

Also, this feature has nothing to do with the on-disk encryption
feature being contemplated in parallel.  Both can exist independently.

The attached patch has all the necessary pieces in place to make this
work, so you can have an idea how the overall system works.  It
contains some documentation and tests to help illustrate the
functionality.  But it's missing the remaining 90% of the work,
including additional DDL support, error handling, robust memory
management, protocol versioning, forward and backward compatibility,
pg_dump support, psql \d support, refinement of the cryptography, and
so on.  But I think obvious solutions exist to all of those things, so
it isn't that interesting to focus on them for now.

------

Now to the explanation of how it works.

You declare a column as encrypted in a CREATE TABLE statement.  The
column value is encrypted by a symmetric key called the column
encryption key (CEK).  The CEK is a catalog object.  The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK).  The CMK is not stored in the database but somewhere
where the client can get to it, for example in a file or in a key
management system.  When a server sends rows containing encrypted
column values to the client, it first sends the required CMK and CEK
information (new protocol messages), which the client needs to record.
Then, the client can use this information to automatically decrypt the
incoming row data and forward it in plaintext to the application.

For the CMKs, the catalog object specifies a "provider" and generic
options.  Right now, libpq has a "file" provider hardcoded, and it
takes a "filename" option.  Via some mechanism to be determined,
additional providers could be loaded and then talk to key management
systems via http or whatever.  I have left some comments in the libpq
code where the hook points for this could be.

The general idea would be for an application to have one CMK per area
of secret stuff, for example, for credit card data.  The CMK can be
rotated: each CEK can be represented multiple times in the database,
encrypted by a different CMK.  (The CEK can't be rotated easily, since
that would require reading out all the data from a table/column and
reencrypting it.  We could/should add some custom tooling for that,
but it wouldn't be a routine operation.)

The encryption algorithms are mostly hardcoded right now, but there
are facilities for picking algorithms and adding new ones that will be
expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
AES-128-CBC right now; a more complete solution should probably
involve some HMAC thrown in.

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption).  These are essentially cousins of bytea.  For the rest of
the database system below the protocol handling, there is nothing
special about those.  For example, encryptedr has no operators at all,
encryptedd has only an equality operator.  pg_attribute has a new
column attrealtypid that stores the original type of the data in the
column.  This is only used for providing it to clients, so that
higher-level clients can convert the decrypted value to their
appropriate data types in their environments.

Some protocol extensions are required.  These should be guarded by
some _pq_... setting, but this is not done in this patch yet.  As
mentioned above, extra messages are added for sending the CMKs and
CEKs.  In the RowDescription message, I have commandeered the format
field to add a bit that indicates that the field is encrypted.  This
could be made a separate field, and there should probably be
additional fields to indicate the algorithm and CEK name, but this was
easiest for now.  The ParameterDescription message is extended to
contain format fields for each parameter, for the same purpose.
Again, this could be done differently.

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.
The client must first prepare a statement, then describe the statement
to get parameter metadata, which indicates which parameters are to be
encrypted and how.  So this will require some care by applications
that want to do this, but, well, they probably should be careful
anyway.  In libpq, the existing APIs make this difficult, because
there is no way to pass the result of a describe-statement call back
into execute-statement-with-parameters.  I added new functions that do
this, so you then essentially do

     res0 = PQdescribePrepared(conn, "");
     res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0);

(The name could obviously be improved.)  Other client APIs that have a
"statement handle" concept could do this more elegantly and probably
without any API changes.

Another challenge is that the parse analysis must check which
underlying column a parameter corresponds to.  This is similar to
resorigtbl and resorigcol in the opposite direction.  The current
implementation of this works for the test cases, but I know it has
some problems, so I'll continue working in this.  This functionality
is in principle available to all prepared-statement variants, not only
protocol-level.  So you can see in the tests that I expanded the
pg_prepared_statements view to show this information as well, which
also provides an easy way to test and debug this functionality
independent of column encryption.

And also, psql doesn't use prepared statements, so writing into
encrypted columns currently doesn't work at all via psql.  (Reading
works no problem.)  All the test code currently uses custom libpq C
programs.  We should think about a way to enable prepared statements
in psql, perhaps something like

INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2'

(\gexec and \gx are already taken.)

------

This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.  As I mentioned above, a lot of the remaining work
is arguably mostly straightforward.  Some closer examination of the
issues surrounding the libpq API changes and psql would be useful.
Perhaps there are other projects where that kind of functionality
would also be useful.

Attachments:

v3-0001-Transparent-column-encryption.patchtext/plain; charset=UTF-8; name=v3-0001-Transparent-column-encryption.patchDownload+4407-130
#18Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#17)
Re: Transparent column encryption

Updated patch, to resolve some merge conflicts.

Also, I added some CREATE DDL commands. These aren't fully robust yet,
but they do the basic job, so it makes the test cases easier to write
and read, and they can be referred to in the documentation. (Note that
the corresponding DROP aren't there yet.) I also expanded the
documentation in the DDL chapter to give a complete recipe of how to set
it up and use it.

Attachments:

v4-0001-Transparent-column-encryption.patchtext/plain; charset=UTF-8; name=v4-0001-Transparent-column-encryption.patchDownload+5148-145
#19Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#18)
Re: Transparent column encryption

On 7/12/22 11:29, Peter Eisentraut wrote:

Updated patch, to resolve some merge conflicts.

Thank you for working on this; it's an exciting feature.

The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK).

I'm not yet understanding why the CMK is asymmetric. Maybe you could use
the public key to add ephemeral, single-use encryption keys that no one
but the private key holder could use (after you forget them on your
side, that is). But since the entire column is encrypted with a single
CEK, you would essentially only be able to do that if you created an
entirely new column or table; do I have that right?

I'm used to public keys being safe for... publication, but if I'm
understanding correctly, it's important that the server admin doesn't
get hold of the public key for your CMK, because then they could
substitute their own CEKs transparently and undermine future encrypted
writes. That seems surprising. Am I just missing something important
about RSAES-OAEP?

+#define PG_CEK_AEAD_AES_128_CBC_HMAC_SHA_256   130
+#define PG_CEK_AEAD_AES_192_CBC_HMAC_SHA_384   131
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_384   132
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_512   133

It looks like these ciphersuites were abandoned by the IETF. Are there
existing implementations of them that have been audited/analyzed? Are
they safe (and do we know that the claims made in the draft are
correct)? How do they compare to other constructions like AES-GCM-SIV
and XChacha20-Poly1305?

+-- \gencr
+-- (This just tests the parameter passing; there is no encryption here.)
+CREATE TABLE test_gencr (a int, b text);
+INSERT INTO test_gencr VALUES (1, 'one') \gencr
+SELECT * FROM test_gencr WHERE a = 1 \gencr
+ a |  b
+---+-----
+ 1 | one
+(1 row)
+
+INSERT INTO test_gencr VALUES ($1, $2) \gencr 2 'two'
+SELECT * FROM test_gencr WHERE a IN ($1, $2) \gencr 2 3
+ a |  b
+---+-----
+ 2 | two
+(1 row)

I'd expect \gencr to error out without sending plaintext. I know that
under the hood this is just setting up a prepared statement, but if I'm
using \gencr, presumably I really do want to be encrypting my data.
Would it be a problem to always set force-column-encryption for the
parameters we're given here? Any unencrypted columns could be provided
directly.

Another idle thought I had was that it'd be nice to have some syntax for
providing a null value to \gencr (assuming I didn't overlook it in the
patch). But that brings me to...

+  <para>
+   Null values are not encrypted by transparent column encryption; null values
+   sent by the client are visible as null values in the database.  If the fact
+   that a value is null needs to be hidden from the server, this information
+   needs to be encoded into a nonnull value in the client somehow.
+  </para>

This is a major gap, IMO. Especially with the switch to authenticated
ciphers, because it means you can't sign your NULL values. And having
each client or user that's out there solve this with a magic in-band
value seems like a recipe for pain.

Since we're requiring "canonical" use of text format, and the docs say
there are no embedded or trailing nulls allowed in text values, could we
steal the use of a single zero byte to mean NULL? One additional
complication would be that the client would have to double-check that
we're not writing a NULL into a NOT NULL column, and complain if it
reads one during decryption. Another complication would be that the
client would need to complain if it got a plaintext NULL.

(The need for robust client-side validation of encrypted columns might
be something to expand on in the docs more generally, since before this
feature, it could probably be assumed that the server was buggy if it
sent you unparsable junk in a column.)

+   <para>
+    The <quote>associated data</quote> in these algorithms consists of 4
+    bytes: The ASCII letters <literal>P</literal> and <literal>G</literal>
+    (byte values 80 and 71), followed by the algorithm ID as a 16-bit unsigned
+    integer in network byte order.
+   </para>

Is this AD intended as a placeholder for the future, or does it serve a
particular purpose?

Thanks,
--Jacob

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#19)
Re: Transparent column encryption

On 15.07.22 19:47, Jacob Champion wrote:

The CEK key
material is in turn encrypted by an assymmetric key called the column
master key (CMK).

I'm not yet understanding why the CMK is asymmetric.

I'm not totally sure either. I started to build it that way because
other systems were doing it that way, too. But I have been thinking
about adding a symmetric alternative for the CMKs as well (probably AESKW).

I think there are a couple of reasons why asymmetric keys are possibly
useful for CMKs:

Some other products make use of secure enclaves to do computations on
(otherwise) encrypted values on the server. I don't fully know how that
works, but I suspect that asymmetric keys can play a role in that. (I
don't have any immediate plans for that in my patch. It seems to be a
dying technology at the moment.)

Asymmetric keys gives you some more options for how you set up the keys
at the beginning. For example, you create the asymmetric key pair on
the host where your client program that wants access to the encrypted
data will run. You put the private key in an appropriate location for
run time. You send the public key to another host. On that other host,
you create the CEK, encrypt it with the CMK, and then upload it into the
server (CREATE COLUMN ENCRYPTION KEY). Then you can wipe that second
host. That way, you can be even more sure that the unencrypted CEK
isn't left anywhere. I'm not sure whether this method is very useful in
practice, but it's interesting.

In any case, as I mentioned above, this particular aspect is up for
discussion.

Also note that if you use a KMS (cmklookup "run" method), the actual
algorithm doesn't even matter (depending on details of the KMS setup),
since you just tell the KMS "decrypt this", and the KMS knows by itself
what algorithm to use. Maybe there should be a way to specify "unknown"
in the ckdcmkalg field.

+#define PG_CEK_AEAD_AES_128_CBC_HMAC_SHA_256   130
+#define PG_CEK_AEAD_AES_192_CBC_HMAC_SHA_384   131
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_384   132
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_512   133

It looks like these ciphersuites were abandoned by the IETF. Are there
existing implementations of them that have been audited/analyzed? Are
they safe (and do we know that the claims made in the draft are
correct)? How do they compare to other constructions like AES-GCM-SIV
and XChacha20-Poly1305?

The short answer is, these same algorithms are used in equivalent
products (see MS SQL Server, MongoDB). They even reference the same
exact draft document.

Besides that, here is my analysis for why these are good choices: You
can't use any of the counter modes, because since the encryption happens
on the client, there is no way to coordinate to avoid nonce reuse. So
among mainstream modes, you are basically left with AES-CBC with a
random IV. In that case, even if you happen to reuse an IV, the
possible damage is very contained.

And then, if you want to use AEAD, you combine that with some MAC, and
HMAC is just as good as any for that.

The referenced draft document doesn't really contain any additional
cryptographic insights, it's just a guide on a particular way to put
these two together.

So altogether I think this is a pretty solid choice.

+-- \gencr
+-- (This just tests the parameter passing; there is no encryption here.)
+CREATE TABLE test_gencr (a int, b text);
+INSERT INTO test_gencr VALUES (1, 'one') \gencr
+SELECT * FROM test_gencr WHERE a = 1 \gencr
+ a |  b
+---+-----
+ 1 | one
+(1 row)
+
+INSERT INTO test_gencr VALUES ($1, $2) \gencr 2 'two'
+SELECT * FROM test_gencr WHERE a IN ($1, $2) \gencr 2 3
+ a |  b
+---+-----
+ 2 | two
+(1 row)

I'd expect \gencr to error out without sending plaintext. I know that
under the hood this is just setting up a prepared statement, but if I'm
using \gencr, presumably I really do want to be encrypting my data.
Would it be a problem to always set force-column-encryption for the
parameters we're given here? Any unencrypted columns could be provided
directly.

Yeah, this needs a bit of refinement. You don't want something named
"encr" but it only encrypts some of the time. We could possibly do what
you suggest and make it set the force-encryption flag, or maybe rename
it or add another command that just uses prepared statements and doesn't
promise anything about encryption from its name.

This also ties in with how pg_dump will eventually work. I think by
default pg_dump will just dump things encrypted and set it up so that
COPY writes it back encrypted. But there should probably be a mode that
dumps out plaintext and then uses one of these commands to load the
plaintext back in. What these psql commands need to do also depends on
what pg_dump needs them to do.

+  <para>
+   Null values are not encrypted by transparent column encryption; null values
+   sent by the client are visible as null values in the database.  If the fact
+   that a value is null needs to be hidden from the server, this information
+   needs to be encoded into a nonnull value in the client somehow.
+  </para>

This is a major gap, IMO. Especially with the switch to authenticated
ciphers, because it means you can't sign your NULL values. And having
each client or user that's out there solve this with a magic in-band
value seems like a recipe for pain.

Since we're requiring "canonical" use of text format, and the docs say
there are no embedded or trailing nulls allowed in text values, could we
steal the use of a single zero byte to mean NULL? One additional
complication would be that the client would have to double-check that
we're not writing a NULL into a NOT NULL column, and complain if it
reads one during decryption. Another complication would be that the
client would need to complain if it got a plaintext NULL.

You're already alluding to some of the complications. Also consider
that null values could arise from, say, outer joins. So you could be in
a situation where encrypted and unencrypted null values coexist. And of
course the server doesn't know about the encrypted null values. So how
do you maintain semantics, like for aggregate functions, primary keys,
anything that treats null values specially? How do clients deal with a
mix of encrypted and unencrypted null values, how do they know which one
is real. What if the client needs to send a null value back as a
parameter? All of this would create enormous complications, if they can
be solved at all.

I think a way to look at this is that this column encryption feature
isn't suitable for disguising the existence or absence of data, it can
only disguise the particular data that you know exists.

+   <para>
+    The <quote>associated data</quote> in these algorithms consists of 4
+    bytes: The ASCII letters <literal>P</literal> and <literal>G</literal>
+    (byte values 80 and 71), followed by the algorithm ID as a 16-bit unsigned
+    integer in network byte order.
+   </para>

Is this AD intended as a placeholder for the future, or does it serve a
particular purpose?

It has been recommended that you include the identity of the encryption
algorithm in the AD. This protects the client from having to decrypt
stuff that wasn't meant to be decrypted (in that way).

#21Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#20)
#22Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#18)
#23Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Eisentraut (#22)
#24Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#20)
#25Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Robert Haas (#21)
#26Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#20)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Jacob Champion (#25)
#28Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Robert Haas (#27)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Jacob Champion (#28)
#30Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Robert Haas (#29)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#22)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Masahiko Sawada (#23)
#33Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#30)
#34Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#33)
#35Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#31)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#35)
#37Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#36)
#38Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#37)
#39Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#39)
#41Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#40)
#42Mark Woodward
woodwardm@google.com
In reply to: Peter Eisentraut (#41)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#37)
In reply to: Peter Eisentraut (#43)
#45Frédéric Yhuel
frederic.yhuel@dalibo.com
In reply to: Jehan-Guillaume de Rorthais (#44)
#46Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#43)
#47Peter Eisentraut
peter_e@gmx.net
In reply to: Jehan-Guillaume de Rorthais (#44)
In reply to: Peter Eisentraut (#47)
#49Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#46)
#50Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#49)
#51Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#50)
#52Justin Pryzby
pryzby@telsasoft.com
In reply to: Peter Eisentraut (#50)
#53Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Eisentraut (#51)
#54Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#53)
#55vignesh C
vignesh21@gmail.com
In reply to: Peter Eisentraut (#51)
#56Peter Eisentraut
peter_e@gmx.net
In reply to: Mark Dilger (#53)
#57Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#51)
#58Peter Eisentraut
peter_e@gmx.net
In reply to: vignesh C (#55)
#59Peter Eisentraut
peter_e@gmx.net
In reply to: Justin Pryzby (#52)
#60Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#56)
#61Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#57)
#62Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#61)
#63Peter Eisentraut
peter_e@gmx.net
In reply to: Jacob Champion (#62)
#64Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#63)
#65Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Eisentraut (#58)
#66Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#65)
#67Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#58)
#68Peter Eisentraut
peter_e@gmx.net
In reply to: Mark Dilger (#65)
#69Peter Eisentraut
peter_e@gmx.net
In reply to: Mark Dilger (#66)
#70Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#67)
#71Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#70)
#72Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Eisentraut (#71)
#73Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#71)
#74Peter Eisentraut
peter_e@gmx.net
In reply to: Mark Dilger (#72)
#75Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#73)
#76Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#75)
#77Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#76)
#78Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#77)
#79Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#78)
#80Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#79)
#81Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#80)
#82Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#81)
#83Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#82)
#84Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#83)
#85Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#83)
#87Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#86)
#88Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#87)
#89Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#88)
#90Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#89)
#91Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#90)
#92Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#91)
#93Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#92)
#94Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#92)
#95Peter Eisentraut
peter_e@gmx.net
In reply to: Stephen Frost (#93)
#96Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#78)
#97Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Peter Eisentraut (#96)
#98Peter Eisentraut
peter_e@gmx.net
In reply to: Jelte Fennema-Nio (#97)
#99Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Peter Eisentraut (#98)
#100Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#96)
#101Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Robert Haas (#100)
#102Robert Haas
robertmhaas@gmail.com
In reply to: Jelte Fennema-Nio (#101)
#103Dave Cramer
pg@fastcrypt.com
In reply to: Robert Haas (#100)