Complete data erasure

Started by asaba.takanori@fujitsu.com · almost 6 years ago · 29 messages
#1 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com

Hello hackers,

I want to add a feature to erase data so that it cannot be restored,
because it prevents attackers from stealing data from released data areas.

- Background
International security policies require that measures be taken against the above threat.
One such policy is the "Base Protection Profile for Database Management Systems Version 2.12 (DBMS PP)" [1], which is based on ISO 15408.
If security is improved in this way, PostgreSQL will be more likely to be adopted by security-conscious procurers such as public agencies.

- Feature
This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.
I want users to be able to customize the erasure method for their security policies.

- Implementation
My idea is adding a new parameter erase_command to postgresql.conf.
The command that users set in this parameter is executed just before unlink(path) or ftruncate(fd, 0) is called.
For example, the command is shred on Linux and SDelete on Windows.

When erase_command is set, VACUUM does not truncate a file to a non-zero size,
because it is safer for users to return the entire file to the OS than to return part of it.
Also, there is no standard tool that overwrites only part of a file.
With the above specification, users can easily and safely use this feature with a standard tool that overwrites an entire file, such as shred.
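
To make the idea concrete, here is a minimal standalone C sketch (not actual PostgreSQL code) of how an erase_command template could be expanded and run just before unlink(); the %p placeholder is an assumption modeled on archive_command, and the shred invocation is only an example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Expand a template such as "shred -u -z %p" by substituting the path of
 * the file to be erased for %p, then run it with system().  Returns 1 on
 * success; on failure the caller would keep the file instead of calling
 * unlink() or ftruncate(), so the data area is never handed back to the
 * OS without being overwritten.  This is a sketch, not the server code.
 */
static int
run_erase_command(const char *erase_command, const char *path)
{
    char        cmd[4096];
    const char *placeholder = strstr(erase_command, "%p");

    if (placeholder != NULL)
        snprintf(cmd, sizeof(cmd), "%.*s%s%s",
                 (int) (placeholder - erase_command), erase_command,
                 path, placeholder + 2);
    else
        snprintf(cmd, sizeof(cmd), "%s", erase_command);

    return system(cmd) == 0;    /* nonzero exit status = erasure failed */
}

int
main(void)
{
    /* hypothetical setting and relation file path, for illustration only */
    if (!run_erase_command("shred -u -z %p", "base/16384/16385"))
        fprintf(stderr, "WARNING: erase_command failed; not releasing the file\n");
    return 0;
}

Note that shred -u removes the file after overwriting it, so in this shape the external command takes over both the overwrite and the removal; SDelete plays the same role on Windows.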

Hope to hear your feedback and comments.

[1]: https://www.commoncriteriaportal.org/files/ppfiles/pp0088V2b_pdf.pdf (p. 44, section 8.1.2)

- Threat/Policy
A threat agent may use or manage TSF, bypassing the protection mechanisms of the TSF.

- TOE Security Objectives Addressing the Threat/Policy
The TOE will ensure that any information contained in a protected resource within its Scope of Control
is not inappropriately disclosed when the resource is reallocated.

- Rationale
The above objective diminishes this threat by ensuring that TSF data and user data are not persistent
when resources are released by one user/process and allocated to another user/process.

TOE: Target of Evaluation
TSF: TOE Security Functionality

Regards

--
Takanori Asaba

#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: asaba.takanori@fujitsu.com (#1)
Re: Complete data erasure

Hello, Asaba-san.

At Wed, 15 Jan 2020 01:31:44 +0000, "asaba.takanori@fujitsu.com" <asaba.takanori@fujitsu.com> wrote in

Hello hackers,

I want to add a feature to erase data so that it cannot be restored,
because it prevents attackers from stealing data from released data areas.

- Background
International security policies require that measures be taken against the above threat.
One such policy is the "Base Protection Profile for Database Management Systems Version 2.12 (DBMS PP)" [1], which is based on ISO 15408.
If security is improved in this way, PostgreSQL will be more likely to be adopted by security-conscious procurers such as public agencies.

- Feature
This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.
I want users to be able to customize the erasure method for their security policies.

shred(1) or wipe(1) doesn't seem to contribute to the objective on
journaled or copy-on-write file systems. I'm not sure, but maybe the
same can be true for read-modify-write devices like SSD. I'm not sure
about SDelete, but anyway replacing unlink() with something like
'system("shred")' leads to significant performance degradation.

man 1 wipe says (https://linux.die.net/man/1/wipe) : (shred has a
similar note.)

NOTE ABOUT JOURNALING FILESYSTEMS AND SOME RECOMMENDATIONS (JUNE 2004)
Journaling filesystems (such as Ext3 or ReiserFS) are now being used
by default by most Linux distributions. No secure deletion program
that does filesystem-level calls can sanitize files on such
filesystems, because sensitive data and metadata can be written to the
journal, which cannot be readily accessed. Per-file secure deletion is
better implemented in the operating system.

WAL files contain copies of such sensitive information, which is not
covered by the proposal; neither are temporary files. And if the system
wants to remain recoverable after corruption, it must copy such
WAL files to an archive.

Currently there's a discussion on transparent data encryption
covering all of the above cases, on and off this mailing list.
It is different from device-level encryption mentioned in the man
page. Doesn't that fit the requirement?

/messages/by-id/CALS+J3-57cL=jz_eT9uxiLa8CAh5BE3-HcQvXQBz0ScMjag4Zg@mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: asaba.takanori@fujitsu.com (#1)
Re: Complete data erasure

"asaba.takanori@fujitsu.com" <asaba.takanori@fujitsu.com> writes:

I want to add a feature to erase data so that it cannot be restored,
because it prevents attackers from stealing data from released data areas.

I think this is fairly pointless, unfortunately.

Dropping or truncating tables is as much as we can do without making
unwarranted assumptions about the filesystem's behavior. You say
you want to zero out the files first, but what will that accomplish
on copy-on-write filesystems?

Even granting that zeroing our storage files is worth something,
it's not worth much if there are more copies of the data elsewhere.
And any well-run database is going to have backups, or standby servers,
or both. There's no way for the primary node to force standbys to erase
themselves (and it'd be a different sort of security hazard if there
were).

Also to the point: if your assumption is that an attacker has access
to the storage medium at a sufficiently low level that they can examine
previously-deleted files, what's stopping them from reading the data
*before* it's deleted?

So I think doing anything useful in this area is a bit outside
Postgres' remit. You'd need to be thinking at an operating-system
or hardware level.

regards, tom lane

#4 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#3)
Re: Complete data erasure

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

"asaba.takanori@fujitsu.com" <asaba.takanori@fujitsu.com> writes:

I want to add a feature to erase data so that it cannot be restored,
because it prevents attackers from stealing data from released data areas.

I think this is fairly pointless, unfortunately.

I disagree- it's a feature that's been asked for multiple times and does
have value in some situations.

Dropping or truncating tables is as much as we can do without making
unwarranted assumptions about the filesystem's behavior. You say
you want to zero out the files first, but what will that accomplish
on copy-on-write filesystems?

What about filesystems which are not copy-on-write though?

Even granting that zeroing our storage files is worth something,
it's not worth much if there are more copies of the data elsewhere.

Backups are not our responsibility to worry about, or, at least, it'd be
an independent feature if we wanted to add something like this to
pg_basebackup, and not something the initial feature would need to worry
about.

And any well-run database is going to have backups, or standby servers,
or both. There's no way for the primary node to force standbys to erase
themselves (and it'd be a different sort of security hazard if there
were).

A user can't "force" PG to do anything more than we can "force" a
replica to do something, but a user can issue a request to a primary and
that primary can then pass that request along to the replica as part of
the WAL stream.

Also to the point: if your assumption is that an attacker has access
to the storage medium at a sufficiently low level that they can examine
previously-deleted files, what's stopping them from reading the data
*before* it's deleted?

This argument certainly doesn't make any sense- who said they had access
to the storage medium at a time before the files were deleted? What if
they only had access after the files were zero'd? When you consider the
lifetime of a storage medium, it's certainly a great deal longer than
the length of time that a given piece of sensitive data might reside on
it.

So I think doing anything useful in this area is a bit outside
Postgres' remit. You'd need to be thinking at an operating-system
or hardware level.

I disagree entirely. If the operating system and hardware level provide
a way for this to work, which is actually rather common when you
consider that ext4 is an awfully popular filesystem where this would work
just fine with nearly all traditional hardware underneath it, then we're
just blocking this capability for no justifiable reason.

Thanks,

Stephen

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#4)
Re: Complete data erasure

On Wed, Jan 15, 2020 at 10:23:22AM -0500, Stephen Frost wrote:

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

"asaba.takanori@fujitsu.com" <asaba.takanori@fujitsu.com> writes:

I want to add a feature to erase data so that it cannot be restored,
because it prevents attackers from stealing data from released data areas.

I think this is fairly pointless, unfortunately.

I disagree- it's a feature that's been asked for multiple times and does
have value in some situations.

Dropping or truncating tables is as much as we can do without making
unwarranted assumptions about the filesystem's behavior. You say
you want to zero out the files first, but what will that accomplish
on copy-on-write filesystems?

What about filesystems which are not copy-on-write though?

Even granting that zeroing our storage files is worth something,
it's not worth much if there are more copies of the data elsewhere.

Backups are not our responsibility to worry about, or, at least, it'd be
an independent feature if we wanted to add something like this to
pg_basebackup, and not something the initial feature would need to worry
about.

And any well-run database is going to have backups, or standby servers,
or both. There's no way for the primary node to force standbys to erase
themselves (and it'd be a different sort of security hazard if there
were).

A user can't "force" PG to do anything more than we can "force" a
replica to do something, but a user can issue a request to a primary and
that primary can then pass that request along to the replica as part of
the WAL stream.

Also to the point: if your assumption is that an attacker has access
to the storage medium at a sufficiently low level that they can examine
previously-deleted files, what's stopping them from reading the data
*before* it's deleted?

This argument certainly doesn't make any sense- who said they had access
to the storage medium at a time before the files were deleted? What if
they only had access after the files were zero'd? When you consider the
lifetime of a storage medium, it's certainly a great deal longer than
the length of time that a given piece of sensitive data might reside on
it.

So I think doing anything useful in this area is a bit outside
Postgres' remit. You'd need to be thinking at an operating-system
or hardware level.

I disagree entirely. If the operating system and hardware level provide
a way for this to work, which is actually rather common when you
consider that ext4 is an awfully popular filesystem where this would work
just fine with nearly all traditional hardware underneath it, then we're
just blocking this capability for no justifiable reason.

Not sure. I agree the goal (securely discarding data) is certainly
worthwhile, although I suspect it's just one of many things you'd need to
care about. That is, there's probably a million other things you'd need
to worry about (logs, WAL, CSV files, temp files, ...), so with actually
sensitive data I'd expect people to just dump/load the data into a clean
system and rebuild the old one (zero drives, ...).

But let's assume it makes sense - is this really the right solution? I
think what I'd prefer is encryption + rotation of the keys. Which should
work properly even on COW filesystems, the performance impact is kinda
low and amortized, etc. Of course, we've been discussing built-in encryption
for quite a bit of time.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#5)
Re: Complete data erasure

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Wed, Jan 15, 2020 at 10:23:22AM -0500, Stephen Frost wrote:

I disagree entirely. If the operating system and hardware level provide
a way for this to work, which is actually rather common when you
consider that ext4 is an awfully popular filesystem where this would work
just fine with nearly all traditional hardware underneath it, then we're
just blocking this capability for no justifiable reason.

Not sure. I agree the goal (securely discarding data) is certainly
worthwhile, although I suspect it's just one of many things you'd need to
care about. That is, there's probably a million other things you'd need
to worry about (logs, WAL, CSV files, temp files, ...), so with actually
sensitive data I'd expect people to just dump/load the data into a clean
system and rebuild the old one (zero drives, ...).

Of course there are other things that one would need to worry about, but
saying this isn't useful because PG isn't also scanning your entire
infrastructure for CSV files that have this sensitive information isn't
a sensible argument.

I agree that there are different levels of sensitive data- and for many,
many cases what is being proposed here will work just fine, even if that
level of sensitive data isn't considered "actually sensitive data" by
other people.

But let's assume it makes sense - is this really the right solution? I
think what I'd prefer is encryption + rotation of the keys. Which should
work properly even on COW filesystems, the performance impact is kinda
low and amortized, etc. Of course, we've been discussing built-in encryption
for quite a bit of time.

In some cases that may make sense as an alternative, in other situations
it doesn't. In other words, I disagree that what you're proposing is
just a different implementation of the same thing (which I'd be happy to
discuss), it's an entirely different feature with different trade-offs.

I'm certainly on-board with the idea of having table level and column
level encryption to facilitate an approach like this, but I disagree
strongly that we shouldn't have this simple "just overwrite the data a
few times" because we might, one day, have useful in-core encryption or
even that, even if we had it, that we should tell everyone to use that
instead of providing this capability. I don't uninstall 'shred' when I
install 'gpg' on my system.

Thanks,

Stephen

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#5)
Re: Complete data erasure

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But let's assume it makes sense - is this really the right solution? I
think what I'd prefer is encryption + rotation of the keys. Which should
work properly even on COW filesystems, the performance impact is kinda
low and amortized, etc. Of course, we've been discussing built-in encryption
for quite a bit of time.

Yeah, it seems to me that encrypted storage would solve strictly more
cases than this proposal does, and it has fewer foot-guns.

One foot-gun that had been vaguely bothering me yesterday, but the point
didn't quite crystallize till just now, is that the only place that we
could implement such file zeroing is post-commit in a transaction that
has done DROP TABLE or TRUNCATE. Of course post-commit is an absolutely
horrid place to be doing anything that could fail, since there's no very
good way to recover from an error. It's an even worse place to be doing
anything that could take a long time --- if the user loses patience and
kills the session, now your problems are multiplied.

Right now our risks in that area are confined to leaking files if
unlink() fails, which isn't great but it isn't catastrophic either.
With this proposal, erroring out post-commit becomes a security
failure, if it happens anywhere before we've finished a possibly
large amount of zero-writing. I don't want to go there.

regards, tom lane

#8 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#7)
Re: Complete data erasure

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But let's assume it makes sense - is this really the right solution? I
think what I'd prefer is encryption + rotation of the keys. Which should
work properly even on COW filesystems, the performance impact is kinda
low and amortized, etc. Of course, we've been discussing built-in encryption
for quite a bit of time.

Yeah, it seems to me that encrypted storage would solve strictly more
cases than this proposal does, and it has fewer foot-guns.

I still view that as strictly a different solution and one that
certainly won't fit in all cases, not to mention that it's a great deal
more complicated and we're certainly nowhere near having it.

One foot-gun that had been vaguely bothering me yesterday, but the point
didn't quite crystallize till just now, is that the only place that we
could implement such file zeroing is post-commit in a transaction that
has done DROP TABLE or TRUNCATE. Of course post-commit is an absolutely
horrid place to be doing anything that could fail, since there's no very
good way to recover from an error. It's an even worse place to be doing
anything that could take a long time --- if the user loses patience and
kills the session, now your problems are multiplied.

This is presuming that we make this feature something that can be run in
a transaction and rolled back. I don't think there's been any specific
expression that there is such a requirement, and you bring up good
points that show why providing that functionality would be particularly
challenging, but that isn't really an argument against this feature in
general.

Thanks,

Stephen

#9 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: Kyotaro Horiguchi (#2)
RE: Complete data erasure

Hello, Horiguchi-san

Thank you for comment.

At Wed, 15 Jan 2020 03:46 +0000, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote in

shred(1) or wipe(1) doesn't seem to contribute to the objective on
journaled or copy-on-write file systems. I'm not sure, but maybe the
same can be true for read-modify-write devices like SSD. I'm not sure
about SDelete, but anyway replacing unlink() with something like
'system("shred")' leads to significant performance degradation.

man 1 wipe says (https://linux.die.net/man/1/wipe) : (shred has a
similar note.)

NOTE ABOUT JOURNALING FILESYSTEMS AND SOME RECOMMENDATIONS (JUNE 2004)

Journaling filesystems (such as Ext3 or ReiserFS) are now being used
by default by most Linux distributions. No secure deletion program
that does filesystem-level calls can sanitize files on such
filesystems, because sensitive data and metadata can be written to the
journal, which cannot be readily accessed. Per-file secure deletion is
better implemented in the operating system.

shred can be used with certain modes of journaled file systems.
How about telling users that they must use such a mode
if they set shred as erase_command on a journaled file system?
man 1 shred goes on like this:

In the case of ext3 file systems, the above disclaimer applies (and shred is thus
of limited effectiveness) only in data=journal mode, which journals file data in
addition to just metadata. In both the data=ordered (default) and data=writeback
modes, shred works as usual. Ext3 journaling modes can be changed by adding the
data=something option to the mount options for a particular file system in the
/etc/fstab file, as documented in the mount man page (man mount).

As shown above, shred works as usual in both the data=ordered (default) and data=writeback modes.
I think data=journal mode is not used in many cases because it degrades performance.
Therefore, I think it is enough to indicate that shred cannot be used in data=journal mode.

Regards,

--
Takanori Asaba

#10 Stephen Frost
sfrost@snowman.net
In reply to: asaba.takanori@fujitsu.com (#1)
Re: Complete data erasure

Greetings,

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.

Looking at this fresh, I wanted to point out that I think Tom's right-
we aren't going to be able to reasonably support this kind of data
erasure on a simple DROP TABLE or TRUNCATE.

I want users to be able to customize the erasure method for their security policies.

There's also this- but I think what it means is that we'd probably have
a top-level command that basically is "ERASE TABLE blah;" or similar
which doesn't operate during transaction commit but instead marks the
table as "to be erased" and then perhaps "erasure in progress" and then
"fully erased" (or maybe just back to 'normal' at that point). Making
those updates will require the command to perform its own transaction
management which is why it can't be in a transaction itself but also
means that the data erasure process doesn't need to be done during
commit.
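
For illustration only, a tiny standalone C sketch of that state machine; the ERASE TABLE name and the three states are taken from the paragraph above, while everything else (how the state is persisted, how the transactions are driven) is an assumption:

#include <stdio.h>

/* states a table marked by the hypothetical ERASE TABLE would pass through */
typedef enum
{
    ERASE_PENDING,      /* "to be erased": recorded when the command starts */
    ERASE_IN_PROGRESS,  /* "erasure in progress": files being overwritten   */
    ERASE_DONE          /* "fully erased" (or simply back to normal)        */
} EraseState;

/* stand-in for the actual overwriting of the relation's files */
static int
erase_relation_files(const char *table)
{
    printf("overwriting files of \"%s\" ...\n", table);
    return 1;                   /* 1 = success */
}

/*
 * Each state change is meant to be committed in its own transaction
 * (much as VACUUM manages its own transactions), so the long-running
 * erasure never happens inside a transaction commit, and a crash leaves
 * a recorded state from which the erasure can be resumed or retried.
 */
static void
erase_table(const char *table)
{
    EraseState  state = ERASE_PENDING;      /* commit #1: mark the table  */

    state = ERASE_IN_PROGRESS;              /* commit #2: start the erase */
    if (erase_relation_files(table))
        state = ERASE_DONE;                 /* commit #3: data is gone    */

    printf("\"%s\" finished in state %d\n", table, (int) state);
}

int
main(void)
{
    erase_table("blah");
    return 0;
}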

My idea is adding a new parameter erase_command to postgresql.conf.

Yeah, I don't think that's really a sensible option or even approach.

When erase_command is set, VACUUM does not truncate a file to a non-zero size
because it's safer for users to return the entire file to the OS than to return part of it.

There was discussion elsewhere about preventing VACUUM from doing a
truncate on a file because of the lock it requires and problems with
replicas.. I'm not sure where that ended up, but, in general, I don't
think this feature and VACUUM should really have anything to do with
each other except for the possible case that a user might be told to
configure their system to not allow VACUUM to truncate tables if they
care about this case.

As mentioned elsewhere, you do also have to consider that the sensitive
data will end up in the WAL and on replicas. I don't believe that means
this feature is without use, but it means that users of this feature
will also need to understand and be able to address WAL and replicas
(along with backups and such too, of course).

Thanks,

Stephen

#11 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: Stephen Frost (#10)
RE: Complete data erasure

Hello Stephen,

Thank you for comment.

From: Stephen Frost <sfrost@snowman.net>

Greetings,

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.

Looking at this fresh, I wanted to point out that I think Tom's right-
we aren't going to be able to reasonably support this kind of data
erasure on a simple DROP TABLE or TRUNCATE.

I want users to be able to customize the erasure method for their security policies.

There's also this- but I think what it means is that we'd probably have
a top-level command that basically is "ERASE TABLE blah;" or similar
which doesn't operate during transaction commit but instead marks the
table as "to be erased" and then perhaps "erasure in progress" and then
"fully erased" (or maybe just back to 'normal' at that point). Making
those updates will require the command to perform its own transaction
management which is why it can't be in a transaction itself but also
means that the data erasure process doesn't need to be done during
commit.

My idea is adding a new parameter erase_command to postgresql.conf.

Yeah, I don't think that's really a sensible option or even approach.

I think erase_command can also manage the state of a table.
The exit status of the configured command shows it (0 is "fully erased" or "normal", 1 is "erasure in progress").
erase_command is executed not during a transaction but when unlink() is executed
(for example, after a transaction that has done DROP TABLE).
I think that this corresponds to "to be erased".

When erase_command is set, VACUUM does not truncate a file to a non-zero size,
because it's safer for users to return the entire file to the OS than to return part of it.

There was discussion elsewhere about preventing VACUUM from doing a
truncate on a file because of the lock it requires and problems with
replicas.. I'm not sure where that ended up, but, in general, I don't
think this feature and VACUUM should really have anything to do with
each other except for the possible case that a user might be told to
configure their system to not allow VACUUM to truncate tables if they
care about this case.

I think that if ftruncate(fd, 0) is executed in VACUUM,
the data area allocated to the file is returned to the OS, so that area must be overwritten.

As mentioned elsewhere, you do also have to consider that the sensitive
data will end up in the WAL and on replicas. I don't believe that means
this feature is without use, but it means that users of this feature
will also need to understand and be able to address WAL and replicas
(along with backups and such too, of course).

I see.
I can't think of it right away, but I will deal with it.

Sorry for my late reply.
It takes me time to understand your emails because I'm a beginner.
Please point out any mistakes.

Regards,

--
Takanori Asaba

#12 Stephen Frost
sfrost@snowman.net
In reply to: asaba.takanori@fujitsu.com (#11)
Re: Complete data erasure

Greetings,

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

From: Stephen Frost <sfrost@snowman.net>

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.

Looking at this fresh, I wanted to point out that I think Tom's right-
we aren't going to be able to reasonably support this kind of data
erasure on a simple DROP TABLE or TRUNCATE.

I want users to be able to customize the erasure method for their security policies.

There's also this- but I think what it means is that we'd probably have
a top-level command that basically is "ERASE TABLE blah;" or similar
which doesn't operate during transaction commit but instead marks the
table as "to be erased" and then perhaps "erasure in progress" and then
"fully erased" (or maybe just back to 'normal' at that point). Making
those updates will require the command to perform its own transaction
management which is why it can't be in a transaction itself but also
means that the data erasure process doesn't need to be done during
commit.

My idea is adding a new parameter erase_command to postgresql.conf.

Yeah, I don't think that's really a sensible option or even approach.

I think erase_command can also manage the state of a table.
The exit status of the configured command shows it (0 is "fully erased" or "normal", 1 is "erasure in progress").
erase_command is executed not during a transaction but when unlink() is executed.

I really don't see what the advantage of having this be configurable is.
In addition, an external command's actions wouldn't be put through the
WAL meaning that replicas would have to be dealt with in some other way
beyond regular WAL and that seems like it'd just be ugly.

(for example, after a transaction that has done DROP TABLE)

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

When erase_command is set, VACUUM does not truncate a file to a non-zero size,
because it's safer for users to return the entire file to the OS than to return part of it.

There was discussion elsewhere about preventing VACUUM from doing a
truncate on a file because of the lock it requires and problems with
replicas.. I'm not sure where that ended up, but, in general, I don't
think this feature and VACUUM should really have anything to do with
each other except for the possible case that a user might be told to
configure their system to not allow VACUUM to truncate tables if they
care about this case.

I think that if ftruncate(fd, 0) is executed in VACUUM,
the data area allocated to the file is returned to the OS, so that area must be overwritten.

As I mentioned, there was already talk of making that disallowed in
VACUUM, so that would just need to be configured (and be able to be
configured) when running in an environment which requires this.

As mentioned elsewhere, you do also have to consider that the sensitive
data will end up in the WAL and on replicas. I don't believe that means
this feature is without use, but it means that users of this feature
will also need to understand and be able to address WAL and replicas
(along with backups and such too, of course).

I see.
I can't think of it right away, but I will deal with it.

It's something that will need to be understood and dealt with. A simple
answer might be "the user must also destroy all WAL and all backups in
an approved manner, and ensure all replicas have replayed all WAL or
been destroyed".

Thanks,

Stephen

#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#12)
Re: Complete data erasure

On Tue, Jan 28, 2020 at 02:34:07PM -0500, Stephen Frost wrote:

Greetings,

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

From: Stephen Frost <sfrost@snowman.net>

* asaba.takanori@fujitsu.com (asaba.takanori@fujitsu.com) wrote:

This feature erases the data area just before it is returned to the OS (here, “erase” means overwriting the data area to hide its contents)
because there is a risk that the data will be restored by attackers if it is returned to the OS without being overwritten.
The erase timing is when DROP, VACUUM, TRUNCATE, etc. are executed.

Looking at this fresh, I wanted to point out that I think Tom's right-
we aren't going to be able to reasonably support this kind of data
erasure on a simple DROP TABLE or TRUNCATE.

I want users to be able to customize the erasure method for their security policies.

There's also this- but I think what it means is that we'd probably have
a top-level command that basically is "ERASE TABLE blah;" or similar
which doesn't operate during transaction commit but instead marks the
table as "to be erased" and then perhaps "erasure in progress" and then
"fully erased" (or maybe just back to 'normal' at that point). Making
those updates will require the command to perform its own transaction
management which is why it can't be in a transaction itself but also
means that the data erasure process doesn't need to be done during
commit.

My idea is adding a new parameter erase_command to postgresql.conf.

Yeah, I don't think that's really a sensible option or even approach.

I think erase_command can also manage the state of a table.
The exit status of the configured command shows it (0 is "fully erased" or "normal", 1 is "erasure in progress").
erase_command is executed not during a transaction but when unlink() is executed.

I really don't see what the advantage of having this be configurable is.
In addition, an external command's actions wouldn't be put through the
WAL meaning that replicas would have to be dealt with in some other way
beyond regular WAL and that seems like it'd just be ugly.

(for example, after a transaction that has done DROP TABLE)

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

IMO the best solution would be that the DROP TABLE does everything as
usual, but instead of deleting the relfilenode it moves it to some sort
of queue. And then a background worker would "erase" these relfilenodes
outside the COMMIT.

And yes, we need to do this in a way that works with replicas, i.e. we
need to WAL-log it somehow. And it should be done in a way that works
when the replica is on a different type of filesystem.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14 Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#13)
Re: Complete data erasure

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Jan 28, 2020 at 02:34:07PM -0500, Stephen Frost wrote:

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

IMO the best solution would be that the DROP TABLE does everything as
usual, but instead of deleting the relfilenode it moves it to some sort
of queue. And then a background worker would "erase" these relfilenodes
outside the COMMIT.

That sounds interesting, though I'm a bit worried that it's going to
lead to the same kind of complications and difficulty that we have with
deleted columns- anything that's working with the system tables will
need to see this new "dropped but pending delete" flag. Would we also
rename the table when this happens? Or change the schema it's in?
Otherwise, your typical DROP IF EXISTS / CREATE could end up breaking.

And yes, we need to do this in a way that works with replicas, i.e. we
need to WAL-log it somehow. And it should be done in a way that works
when the replica is on a different type of filesystem.

I agree it should go through WAL somehow (ideally without needing an
actual zero'd or whatever page for every page in the relation), but why
do we care about the filesystem on the replica? We don't have anything
that's really filesystem specific in WAL replay today and I don't see
this as needing to change that..

Thanks,

Stephen

#15 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#14)
Re: Complete data erasure

On Mon, Feb 03, 2020 at 09:07:09AM -0500, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Jan 28, 2020 at 02:34:07PM -0500, Stephen Frost wrote:

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

IMO the best solution would be that the DROP TABLE does everything as
usual, but instead of deleting the relfilenode it moves it to some sort
of queue. And then a background worker would "erase" these relfilenodes
outside the COMMIT.

That sounds interesting, though I'm a bit worried that it's going to
lead to the same kind of complications and difficulty that we have with
deleted columns- anything that's working with the system tables will
need to see this new "dropped but pending delete" flag. Would we also
rename the table when this happens? Or change the schema it's in?
Otherwise, your typical DROP IF EXISTS / CREATE could end up breaking.

That's not really what I meant - let me explain. When I said DROP TABLE
should do everything as usual, that includes catalog changes. I.e. after
the commit there would not be any remaining entries in system catalogs
or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode from
the queue.
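
As a rough sketch of the worker side of this design — assuming, purely for illustration, that the pending relfilenodes are visible to the worker as files in a dedicated directory rather than as rows in the catalog queue described above, and that a single zero-overwrite pass is an acceptable erasure method (which, as noted earlier in the thread, it may not be on journaled or COW filesystems):

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite one queued file with zeros, flush it, then unlink it. */
static int
erase_one_file(const char *path)
{
    char        zeros[8192] = {0};
    struct stat st;
    off_t       done = 0;
    int         fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return -1;
    }

    while (done < st.st_size)
    {
        size_t  chunk = (size_t) (st.st_size - done) < sizeof(zeros)
                        ? (size_t) (st.st_size - done) : sizeof(zeros);
        ssize_t n = write(fd, zeros, chunk);

        if (n <= 0)
        {
            close(fd);
            return -1;          /* leave the entry queued so it is retried */
        }
        done += n;
    }

    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return unlink(path);        /* only now is the space returned to the OS */
}

/* One pass of the background worker over the pending-erasure directory. */
static void
process_erase_queue(const char *queue_dir)
{
    DIR        *dir = opendir(queue_dir);
    struct dirent *de;
    char        path[4096];

    if (dir == NULL)
        return;
    while ((de = readdir(dir)) != NULL)
    {
        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", queue_dir, de->d_name);
        if (erase_one_file(path) != 0)
            fprintf(stderr, "erasure of %s failed, will retry later\n", path);
    }
    closedir(dir);
}

int
main(void)
{
    process_erase_queue("pg_erase_queue");      /* hypothetical directory */
    return 0;
}

A function that simply reports whether the queue is empty could then back the "wait until the data is actually gone" check mentioned just below.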

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.

I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough, I
guess, but maybe not.

And yes, we need to do this in a way that works with replicas, i.e. we
need to WAL-log it somehow. And it should be done in a way that works
when the replica is on a different type of filesystem.

I agree it should go through WAL somehow (ideally without needing an
actual zero'd or whatever page for every page in the relation), but why
do we care about the filesystem on the replica? We don't have anything
that's really filesystem specific in WAL replay today and I don't see
this as needing to change that..

I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on the
primary, you probably also want to erase the data on the replica. But if
the instances use different file systems (COW vs. non-COW, ...) the
exact thing that needs to happen may be different. Or maybe the replica
does not need to do anything, making it noop?

In which case we don't need to WAL-log the exact change for each page,
it might even be fine to not even WAL-log anything except for the final
removal from the queue. I mean, the data is worthless and not used by
anyone at this point, there's no point in replicating it ...

I haven't thought about this very hard. It's not clear what should
happen if we complete the erasure on primary, remove the relfilenode
from the queue, and then restart the replica before it finishes the
local erasure. The queue (if represented by a simple table) will be
replicated, so the replica will forget it still has work to do.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#15)
RE: Complete data erasure

From: Tomas Vondra <tomas.vondra@2ndquadrant.com>

That's not really what I meant - let me explain. When I said DROP TABLE
should do everything as usual, that includes catalog changes. I.e. after
the commit there would not be any remaining entries in system catalogs
or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode from
the queue.

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.

That sounds good. I think we can also follow the way the WAL archiver does its job, instead of using a regular table. That is, when the transaction that performed DROP TABLE commits, it puts the data files in the "trash bin," which is actually a filesystem directory. Or, it just renames the data files in the original directory by appending some suffix such as ".del". Then, the background worker scans the trash bin or the data directory to erase the file content and delete the file.
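
A minimal sketch of the commit-time half of this idea, using the ".del"-suffix variant since it keeps the file in place; the only work done near commit is a rename plus an fsync of the containing directory, so the pending erasure survives a crash, while the erasure itself stays in the background worker (names and paths are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* fsync a directory so that a rename performed inside it is durable */
static int
fsync_dir(const char *dirpath)
{
    int     fd = open(dirpath, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Mark a dropped relation file for erasure by renaming it with a ".del"
 * suffix.  At startup the server would rescan the data directory for
 * "*.del" files (much as it already cleans up temporary files), so a
 * crash between the rename and the erasure loses nothing.
 */
static int
mark_for_erasure(const char *relpath)
{
    char    delpath[4096];
    char    dirpath[4096];
    char   *slash;

    snprintf(delpath, sizeof(delpath), "%s.del", relpath);
    if (rename(relpath, delpath) < 0)
        return -1;

    /* make the rename itself durable by fsyncing the parent directory */
    snprintf(dirpath, sizeof(dirpath), "%s", relpath);
    slash = strrchr(dirpath, '/');
    if (slash != NULL)
        *slash = '\0';
    else
        snprintf(dirpath, sizeof(dirpath), ".");
    return fsync_dir(dirpath);
}

int
main(void)
{
    /* hypothetical relation file path, for illustration only */
    if (mark_for_erasure("base/16384/16385") < 0)
        perror("mark_for_erasure");
    return 0;
}

As Tom points out later in the thread, even this small step can still fail post-commit, so it narrows the error-handling question rather than removing it.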

The trash bin mechanism may also open the door to restoring mistakenly dropped tables, a feature like Oracle's Flashback Drop. The dropping transaction puts the table metadata (system catalog data or DDL) in the trash bin as well as the data file.

I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough, I
guess, but maybe not.

Agreed, because the user should expect the disk space to be available after DROP TABLE has been committed. Can't we really make COMMIT wait for the erasure to complete? Do we have to use an asynchronous erasure method with a background worker? For example, COMMIT performs:

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file. This could take a long time.
4. COMMIT replies success to the client.

What concerns me is that the need to erase and delete the data file would be forgotten if the server crashes during step 3. If so, postmaster can do the job at startup, just like it deletes temporary files (although it delays the startup).

I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on the
primary, you probably also want to erase the data on the replica. But if
the instances use different file systems (COW vs. non-COW, ...) the
exact thing that needs to happen may be different. Or maybe the replica
does not need to do anything, making it noop?

We can guide the use of non-COW file systems on both the primary and standby in the manual.

Regards
Takayuki Tsunakawa

#17 imai.yoshikazu@fujitsu.com
imai.yoshikazu@fujitsu.com
In reply to: tsunakawa.takay@fujitsu.com (#16)
RE: Complete data erasure
#18 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: tsunakawa.takay@fujitsu.com (#16)
Re: Complete data erasure

On Tue, Feb 04, 2020 at 12:53:44AM +0000, tsunakawa.takay@fujitsu.com
wrote:

From: Tomas Vondra <tomas.vondra@2ndquadrant.com>

That's not really what I meant - let me explain. When I said DROP
TABLE should do everything as usual, that includes catalog changes.
I.e. after the commit there would not be any remaining entries in
system catalogs or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode
from the queue.

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.

That sounds good. I think we can also follow the way the WAL archiver
does its job, instead of using a regular table. That is, when the
transaction that performed DROP TABLE commits, it puts the data files
in the "trash bin," which is actually a filesystem directory. Or, it
just renames the data files in the original directory by appending some
suffix such as ".del". Then, the background worker scans the trash bin
or the data directory to erase the file content and delete the file.

Yeah, that could work, I guess.

The trash bin mechanism may also open the door to restoring
mistakenly dropped tables, a feature like Oracle's Flashback Drop. The
dropping transaction puts the table metadata (system catalog data or
DDL) in the trash bin as well as the data file.

That seems like a very different feature, and I doubt this is the right
way to implement that. That would require much more infrastructure than
just moving the file to a separate dir.

I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough,
I guess, but maybe not.

Agreed, because the user should expect the disk space to be available
after DROP TABLE has been committed. Can't we really make COMMIT wait
for the erasure to complete? Do we have to use an asynchronous
erasure method with a background worker? For example, COMMIT performs:

I think it depends how exactly it's implemented. As Tom pointed out in
his message [1], the problem with doing the erasure itself in post-commit is
not being able to handle errors. But if the files are renamed durably,
and the erasure happens in a separate process, that could be OK. The
COMMIT may wait for it or not, that's mostly irrelevant I think.

[1]: /messages/by-id/9104.1579107235@sss.pgh.pa.us

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file. This could take a long time.
4. COMMIT replies success to the client.

I don't think the COMMIT has to wait for (3) - it might, of course, but
for some use cases it may be better to just commit and leave the
bgworker do the work. And then allow checking if it completed.

What concerns me is that the need to erase and delete the data
file would be forgotten if the server crashes during step 3. If so,
postmaster can do the job at startup, just like it deletes temporary
files (although it delays the startup.)

Startup seems like a pretty bad place to do this stuff. There may be a
lot of data to erase, making recovery very long.

I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on
the primary, you probably also want to erase the data on the replica.
But if the instances use different file systems (COW vs. non-COW,
...) the exact thing that needs to happen may be different. Or maybe
the replica does not need to do anything, making it noop?

We can guide the use of non-COW file systems on both the primary and
standby in the manual.

I don't see how that solves the issue. I think it's quite useful to be
able to use different filesystems for primary/replica. And we may even
know how to securely erase data on both, in which case I don't see a
point not to allow such configurations.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#18)
Re: Complete data erasure

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I think it depends how exactly it's implemented. As Tom pointed out in
his message [1], the problem with doing the erasure itself in post-commit is
not being able to handle errors. But if the files are renamed durably,
and the erasure happens in a separate process, that could be OK. The
COMMIT may wait for it or not, that's mostly irrelevant I think.

How is requiring a file rename to be completed post-commit any less
problematic than the other way? You still have a non-negligible
chance of failure.

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file. This could take a long time.
4. COMMIT replies success to the client.

I don't think the COMMIT has to wait for (3) - it might, of course, but
for some use cases it may be better to just commit and leave the
bgworker do the work. And then allow checking if it completed.

This doesn't seem like a path that will lead to success. The fundamental
point here is that COMMIT has to be an atomic action --- or if it isn't,
failure partway through has to lead to a database crash & restart, which
isn't very pleasant, especially if WAL replay of the commit after the
restart re-encounters the same error.

Up to now, we've sort of looked the other way with respect to failures
of file unlinks post-commit, reasoning that the worst that will happen
is disk space leakage from no-longer-referenced files that we failed to
unlink. (Which is bad, certainly, but not catastrophic; it's immaterial
to database semantics.) This patch basically needs to raise the level of
guarantee that exists in this area, or it won't do what it says on the
tin. But I've not seen any indication that we know how to do that in a
workable way.

regards, tom lane

#20 tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tom Lane (#19)
RE: Complete data erasure

From: Tom Lane <tgl@sss.pgh.pa.us>

Up to now, we've sort of looked the other way with respect to failures
of file unlinks post-commit, reasoning that the worst that will happen
is disk space leakage from no-longer-referenced files that we failed to
unlink. (Which is bad, certainly, but not catastrophic; it's immaterial
to database semantics.) This patch basically needs to raise the level of
guarantee that exists in this area, or it won't do what it says on the
tin. But I've not seen any indication that we know how to do that in a
workable way.

Hmm, the error case is a headache. Even if the bgworker does the erasure, it could hit the same error repeatedly when the file system or disk is broken, causing repeated I/O that may hamper performance.

Do we have no good choice but to leave it up to the user to erase the file content like the following?

https://docs.oracle.com/en/database/oracle/oracle-database/19/asoag/general-considerations-of-using-transparent-data-encryption.html#GUID-F02C9CBF-0374-408B-8655-F7531B681D41
--------------------------------------------------
Oracle Database
Advanced Security Guide
7 General Considerations of Using Transparent Data Encryption

Managing Security for Plaintext Fragments

You should remove old plaintext fragments that can appear over time.

Old plaintext fragments may be present for some time until the database overwrites the blocks containing such values. If privileged operating system users bypass the access controls of the database, then they might be able to directly access these values in the data file holding the tablespace.

To minimize this risk:

1. Create a new tablespace in a new data file.

You can use the CREATE TABLESPACE statement to create this tablespace.

2. Move the table containing encrypted columns to the new tablespace. You can use the ALTER TABLE.....MOVE statement.

Repeat this step for all of the objects in the original tablespace.

3. Drop the original tablespace.

You can use the DROP TABLESPACE tablespace INCLUDING CONTENTS KEEP DATAFILES statement. Oracle recommends that you securely delete data files using platform-specific utilities.

4. Use platform-specific and file system-specific utilities to securely delete the old data file. Examples of such utilities include shred (on Linux) and sdelete (on Windows).
--------------------------------------------------

Regards
Takayuki Tsunakawa

#21 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: tsunakawa.takay@fujitsu.com (#16)
Re: Complete data erasure

On Tue, 4 Feb 2020 at 09:53, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

From: Tomas Vondra <tomas.vondra@2ndquadrant.com>

That's not really what I meant - let me explain. When I said DROP TABLE
should do everything as usual, that includes catalog changes. I.e. after
the commit there would not be any remaining entries in system catalogs
or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode from
the queue.

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.

That sounds good. I think we can also follow the way the WAL archiver does its job, instead of using a regular table. That is, when the transaction that performed DROP TABLE commits, it puts the data files in the "trash bin," which is actually a filesystem directory. Or, it just renames the data files in the original directory by appending some suffix such as ".del". Then, the background worker scans the trash bin or the data directory to erase the file content and delete the file.

The trash bin mechanism may also open the door to restoring mistakenly dropped tables, a feature like Oracle's Flashback Drop. The dropping transaction puts the table metadata (system catalog data or DDL) in the trash bin as well as the data file.

I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough, I
guess, but maybe not.

Agreed, because the user should expect the disk space to be available after DROP TABLE has been committed. Can't we really make COMMIT wait for the erasure to complete? Do we have to use an asynchronous erasure method with a background worker? For example, COMMIT performs:

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file. This could take a long time.
4. COMMIT replies success to the client.

What concerns me is that the need to erase and delete the data file would be forgotten if the server crashes during step 3. If so, postmaster can do the job at startup, just like it deletes temporary files (although it delays the startup).

Please note that we need to erase files not only when dropping or
truncating tables but also when aborting a transaction that created
a new table. If users want to be sure the data is actually erased, they
need to wait for rollbacks as well, whether that is a ROLLBACK command
issued by the user or an error during the transaction, etc.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: Stephen Frost (#6)
RE: Complete data erasure

Hello Stephen,

From: Stephen Frost <sfrost@snowman.net>

I disagree- it's a feature that's been asked for multiple times and does
have value in some situations.

I'm rethinking the need for this feature, although I think that it improves security.
You said that this feature has value in some situations.
Could you tell me about those situations?

Regards,

--
Takanori Asaba

#23 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: asaba.takanori@fujitsu.com (#22)
RE: Complete data erasure

Greetings,

From: asaba.takanori@fujitsu.com <asaba.takanori@fujitsu.com>

Hello Stephen,

From: Stephen Frost <sfrost@snowman.net>

I disagree- it's a feature that's been asked for multiple times and does
have value in some situations.

I'm rethinking the need for this feature, although I think that it improves security.
You said that this feature has value in some situations.
Could you tell me about those situations?

Regards,

--
Takanori Asaba

I think that the use case is to ensure that no data remains.
This feature will give users peace of mind.

There is a risk of leakage as long as data remains.
I think that there are some things that users are worried about.
For example, there is a possibility that attackers will decrypt encrypted data, even if it takes years.
Or some users may be concerned about disk management in cloud environments.
These concerns will be resolved if users can erase the data themselves.

I think that this feature is valuable, so I would appreciate your continued cooperation.

Regards,

--
Takanori Asaba

#24 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: Tom Lane (#19)
RE: Complete data erasure

Hello Tom,

From: Tom Lane <tgl@sss.pgh.pa.us>

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I think it depends how exactly it's implemented. As Tom pointed out in
his message [1], the problem with doing the erasure itself in post-commit is
not being able to handle errors. But if the files are renamed durably,
and the erasure happens in a separate process, that could be OK. The
COMMIT may wait for it or not, that's mostly irrelevant I think.

How is requiring a file rename to be completed post-commit any less
problematic than the other way? You still have a non-negligible
chance of failure.

I think that errors of rename(2) listed in [1] cannot occur or can be handled.
What do you think?

[1]: http://man7.org/linux/man-pages/man2/rename.2.html

Regards,

--
Takanori Asaba

#25 asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: asaba.takanori@fujitsu.com (#24)
RE: Complete data erasure

Hello Tom,

From: asaba.takanori@fujitsu.com <asaba.takanori@fujitsu.com>

Hello Tom,

From: Tom Lane <tgl@sss.pgh.pa.us>

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I think it depends on how exactly it's implemented. As Tom pointed out in
his message [1], the problem with doing the erasure itself in post-commit is
not being able to handle errors. But if the files are renamed durably,
and the erasure happens in a separate process, that could be OK. The
COMMIT may wait for it or not, that's mostly irrelevant I think.

How is requiring a file rename to be completed post-commit any less
problematic than the other way? You still have a non-negligible
chance of failure.

I think that the errors of rename(2) listed in [1] either cannot occur or can be handled.
What do you think?

[1] http://man7.org/linux/man-pages/man2/rename.2.html

I have another idea.
How about managing the status of data files the way the WAL archiver does?
For example:

1. Create a status file "...ready" in the transaction that runs DROP TABLE (do not rename the data file).
2. A background worker scans the directory that holds the status files.
3. Rename the status file to "...progress" when erasure of the data file starts.
4. Rename the status file to "...done" when erasure of the data file finishes.

I think this is OK because step 1 is not post-commit and the background worker can handle errors during the erasure.
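
A rough sketch of one pass of such a background worker is below; the "pg_erase_status" directory name and the mapping from status file name to data file are only placeholders.

/*
 * Sketch only: one pass of a background worker that drives erasure through
 * status files, in the style of the WAL archiver.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Placeholder: the real worker would overwrite the data file, fsync and unlink it. */
static int
erase_one_file(const char *datafile)
{
    fprintf(stderr, "would erase %s here\n", datafile);
    return 0;
}

static void
erase_worker_one_pass(void)
{
    const char *statusdir = "pg_erase_status";
    DIR        *dir = opendir(statusdir);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        char   *dot = strrchr(de->d_name, '.');
        char    ready[4096], progress[4096], done[4096], datafile[4096];

        if (dot == NULL || strcmp(dot, ".ready") != 0)
            continue;

        /* "16384.ready" -> data file "16384" (really a full relation path) */
        snprintf(datafile, sizeof(datafile), "%.*s",
                 (int) (dot - de->d_name), de->d_name);
        snprintf(ready, sizeof(ready), "%s/%s", statusdir, de->d_name);
        snprintf(progress, sizeof(progress), "%s/%s.progress", statusdir, datafile);
        snprintf(done, sizeof(done), "%s/%s.done", statusdir, datafile);

        if (rename(ready, progress) < 0)            /* step 3: mark in progress */
            continue;

        if (erase_one_file(datafile) == 0)          /* overwrite + unlink */
            (void) rename(progress, done);          /* step 4: mark done */
        /* on failure the ".progress" file remains, so a later pass can retry */
    }
    closedir(dir);
}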

Regards,

--
Takanori Asaba

#26asaba.takanori@fujitsu.com
asaba.takanori@fujitsu.com
In reply to: asaba.takanori@fujitsu.com (#25)
RE: Complete data erasure

Hello,

I was off the point.
I want to organize the discussion and suggest a feature design.

There are two opinions.
1. COMMIT should not take a long time because errors are more likely to occur.
2. The data area should be released when COMMIT is completed because COMMIT has to be an atomic action.

These opinions are correct.
But it is difficult to satisfy them at the same time.
So I suggest that users have the option to choose.
DROP TABLE works in one of the following two patterns:

1. Rename the data file to "...del" instead of calling ftruncate(fd,0).
After that, a bgworker scans the directory and runs erase_command.
(erase_command is a command set by the user, like archive_command;
for example, shred on Linux.)

2. Run erase_command on the data file immediately before ftruncate(fd,0).
Wait until it completes, then reply COMMIT to the client.
After that, processing is the same as normal.

If erase_command fails, the server issues a WARNING and does not request the unlink from the checkpointer.
It's not a security failure, because the risk only arises once the data area is returned to the OS.
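
To make the invocation concrete, erase_command could be expanded and run in the same way as archive_command, substituting the file path and checking the exit status. A rough sketch follows; the %p placeholder is my assumption, not a settled design.

/*
 * Sketch only: invoking erase_command on a data file, following the
 * archive_command convention of substituting %p with the file path.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
run_erase_command(const char *erase_command, const char *path)
{
    char        cmd[8192];
    char       *dst = cmd;
    const char *src;
    int         rc;

    /* Build the command line, replacing %p with the target path. */
    for (src = erase_command; *src && dst < cmd + sizeof(cmd) - 1; src++)
    {
        if (src[0] == '%' && src[1] == 'p')
        {
            size_t  room = (size_t) (cmd + sizeof(cmd) - 1 - dst);
            size_t  n = strlen(path) < room ? strlen(path) : room;

            memcpy(dst, path, n);
            dst += n;
            src++;                  /* skip the 'p' */
        }
        else
            *dst++ = *src;
    }
    *dst = '\0';

    /* e.g. "shred -z %p" becomes "shred -z base/12345/16384" */
    rc = system(cmd);

    /* Only exit status 0 counts as a successful erasure. */
    return (rc == 0) ? 0 : -1;
}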

I will implement pattern 2 first because it matches the expected user experience more closely than pattern 1.
The following concern about this method has been pointed out.

From Stephen

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

I think it means that errors from external commands can't be handled.
If so, it's not a problem, because I have determined the behavior after an error.
Are there any other problems?

Regards,

--
Takanori Asaba

#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: asaba.takanori@fujitsu.com (#26)
Re: Complete data erasure

On Fri, Apr 10, 2020 at 08:23:32AM +0000, asaba.takanori@fujitsu.com wrote:

Hello,

I was off the point.
I want to organize the discussion and suggest a feature design.

There are two opinions.
1. COMMIT should not take a long time because errors are more likely to occur.

I don't think it's a matter of commit duration but a question of what to do
in response to errors in the data erasure code - which is something we
can't really rule out if we allow custom scripts to perform the erasure.
If the erasure took very long but couldn't possibly fail, it'd be much
easier to handle than fast erasure failing often.

The difficulty of error-handling is why adding new stuff to commit may
be tricky. Which is why I proposed not to do the failure-prone code in
commit itself, but move it to a separate process.

2. The data area should be released when COMMIT is completed because COMMIT has to be an atomic action.

I don't think "commit is atomic" really implies "data should be released
at commit". This is precisely what makes the feature extremely hard to
implement, IMHO.

Why wouldn't it be acceptable to do something like this?

BEGIN;
...
DROP TABLE x ERASE;
...
COMMIT; <-- Don't do data erasure, just add "x" to queue.

-- wait for another process to complete the erasure
SELECT pg_wait_for_erasure();

That means we're not running any custom commands / code during commit,
which should (hopefully) make it easier to handle errors.

These opinions are correct.
But it is difficult to satisfy them at the same time.
So I suggest that users have the option to choose.
DROP TABLE works in one of the following two patterns:

1. Rename the data file to "...del" instead of calling ftruncate(fd,0).
After that, a bgworker scans the directory and runs erase_command.
(erase_command is a command set by the user, like archive_command;
for example, shred on Linux.)

2. Run erase_command on the data file immediately before ftruncate(fd,0).
Wait until it completes, then reply COMMIT to the client.
After that, processing is the same as normal.

If erase_command fails, the server issues a WARNING and does not request the unlink from the checkpointer.
It's not a security failure, because the risk only arises once the data area is returned to the OS.

I think it was already discussed why doing file renames and other
expensive stuff that could fail is a bad idea. And I'm not sure just
ignoring erase_command failures (because that's what WARNING does) is
really appropriate for this feature.

I will implement pattern 2 first because it matches the expected user experience more closely than pattern 1.
The following concern about this method has been pointed out.

From Stephen

We certainly can't run external commands during transaction COMMIT, so
this can't be part of a regular DROP TABLE.

I think it means that errors from external commands can't be handled.
If so, it's not a problem, because I have determined the behavior after an error.
Are there any other problems?

I'm not sure what you mean by "determined the behavior after an error"? You
essentially propose to just print a warning and be done with it. But
that means we can simply leave data files with sensitive data on the
disk, which seems ... not great.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#27)
Re: Complete data erasure

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I don't think "commit is atomic" really implies "data should be released
at commit". This is precisely what makes the feature extremely hard to
implement, IMHO.

Why wouldn't it be acceptable to do something like this?

BEGIN;
...
DROP TABLE x ERASE;
...
COMMIT; <-- Don't do data erasure, just add "x" to queue.

-- wait for another process to complete the erasure
SELECT pg_wait_for_erasure();

That means we're not running any custom commands / code during commit,
which should (hopefully) make it easier to handle errors.

Yeah, adding actions-that-could-fail to commit is a very hard sell,
so something like this API would probably have a better chance.

However ... the whole concept of erasure being a committable action
seems basically misguided from here. Consider this scenario:

begin;

create table full_o_secrets (...);

... manipulate secret data in full_o_secrets ...

drop table full_o_secrets erase;

... do something that unintentionally fails, causing xact abort ...

commit;

Now what? Your secret data is all over the disk and you have *no*
recourse to get rid of it; that's true even at a very low level,
because we unlinked the file when rolling back the transaction.
If the error occurred before getting to "drop table full_o_secrets
erase" then there isn't even any way in principle for the server
to know that you might not be happy about leaving that data lying
around.

And I haven't even spoken of copies that may exist in WAL, or
have been propagated to standby servers by now.

I have no idea what an actual solution that accounted for those
problems would look like. But as presented, this is a toy feature
offering no real security gain, if you ask me.

regards, tom lane

#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#28)
Re: Complete data erasure

On Sat, Apr 11, 2020 at 01:56:10PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I don't think "commit is atomic" really implies "data should be released
at commit". This is precisely what makes the feature extremely hard to
implement, IMHO.

Why wouldn't it be acceptable to do something like this?

BEGIN;
...
DROP TABLE x ERASE;
...
COMMIT; <-- Don't do data erasure, just add "x" to queue.

-- wait for another process to complete the erasure
SELECT pg_wait_for_erasure();

That means we're not running any custom commands / code during commit,
which should (hopefully) make it easier to handle errors.

Yeah, adding actions-that-could-fail to commit is a very hard sell,
so something like this API would probably have a better chance.

However ... the whole concept of erasure being a committable action
seems basically misguided from here. Consider this scenario:

begin;

create table full_o_secrets (...);

... manipulate secret data in full_o_secrets ...

drop table full_o_secrets erase;

... do something that unintentionally fails, causing xact abort ...

commit;

Now what? Your secret data is all over the disk and you have *no*
recourse to get rid of it; that's true even at a very low level,
because we unlinked the file when rolling back the transaction.
If the error occurred before getting to "drop table full_o_secrets
erase" then there isn't even any way in principle for the server
to know that you might not be happy about leaving that data lying
around.

And I haven't even spoken of copies that may exist in WAL, or
have been propagated to standby servers by now.

I have no idea what an actual solution that accounted for those
problems would look like. But as presented, this is a toy feature
offering no real security gain, if you ask me.

Yeah, unfortunately the feature as proposed has these weaknesses.

This is why I proposed that a solution based on encryption and throwing
away a key might be more reliable - if you don't have a key, who cares
if the encrypted data file (or parts of it) is still on disk?
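
To make the throwing-away-the-key idea a bit more concrete, here is a minimal sketch of just the key-management side; the per-relation key file is purely hypothetical, and encrypting the data file under that key is assumed to happen elsewhere.

/*
 * Sketch only: the key-destruction half of a crypto-erase scheme.  Each
 * relation's data file is assumed to be encrypted under its own key.
 */
#include <fcntl.h>
#include <unistd.h>

#define KEY_LEN 32                  /* e.g. an AES-256 key */

/* Create a fresh random key for a new relation and store it durably. */
static int
create_relation_key(const char *keypath)
{
    unsigned char key[KEY_LEN];
    int         rnd = open("/dev/urandom", O_RDONLY);
    int         fd = open(keypath, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (rnd < 0 || fd < 0 ||
        read(rnd, key, sizeof(key)) != (ssize_t) sizeof(key) ||
        write(fd, key, sizeof(key)) != (ssize_t) sizeof(key) ||
        fsync(fd) < 0)
        return -1;                  /* real code must report and clean up */

    close(rnd);
    close(fd);
    return 0;
}

/*
 * "Erase" the relation by destroying its key: overwrite the small key file,
 * fsync it, then unlink it.  The (much larger) encrypted data file can then
 * be dropped without overwriting it, since it is unreadable without the key.
 */
static int
destroy_relation_key(const char *keypath)
{
    unsigned char zeros[KEY_LEN] = {0};
    int         fd = open(keypath, O_WRONLY);

    if (fd < 0 ||
        pwrite(fd, zeros, sizeof(zeros), 0) != (ssize_t) sizeof(zeros) ||
        fsync(fd) < 0)
        return -1;

    close(fd);
    return unlink(keypath);
}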

It has issues too, though - a query might need a temporary file to do a
sort, hash join spills to disk, or something like that. And those won't
be encrypted without some executor changes (e.g. we might propagate
"needs erasure" to temp files, and do erasure when necessary).

I suspect a perfect solution would be so complex that it's not feasible in
practice, especially in v1. So maybe the best thing we can do is to
document those limitations, but I'm not sure where to draw the line
between acceptable and unacceptable limitations.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services