Orphaned relations after crash/sigkill during CREATE TABLE

Started by Jason Myersover 5 years ago9 messagesgeneral
Jump to latest
#1Jason Myers
j.myers@brstrat.com

Postgres 12.4

I was directed in slack to mention here that we're being impacted by
Postgres leaving orphaned pages in /base/<db> after a crash while a CREATE
TABLE is being run in transaction.

The issue is the same as the reproduction steps listed here [1]https://github.com/bdrouvot/pg_orphaned#example-1, that is:

- Start a CREATE TABLE transaction for a large table
- Terminate the process via kill -9
- Pages are left in /base that have no filenode references anymore, such
that `pg_database_size()` grows to be very large while total table+index
size remains constant

However in our particular case, we're using a managed/cloud Postgres server
and our `CREATE TABLE` transaction was being terminated by the OOM killer.
Using a managed service, we don't have filesystem access to go and clear
out these orphaned pages. This caused our total db size to grow from 40GB
of table+index data to 4TB on-disk (but still only 40GB of table+index
data, the other ~3.95TB being orphaned CREATE TABLE pages)

I realize (per a blog post from Robert Haas [2]https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html and from slack
conversation) that this is a known issue, but was directed here from slack
to just mention that we were impacted by it, and have no resolution due to
not having filesystem access, and not having a method internally to
Postgres to deal with these orphaned relations. (Our recourse currently is
to do something like a pg_dump/pg_restore onto a clean instance in order to
escape the orphaned files)

-Jason

[1]: https://github.com/bdrouvot/pg_orphaned#example-1
[2]: https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html
https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jason Myers (#1)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On 8/18/20 12:35 PM, Jason Myers wrote:

Postgres 12.4

I was directed in slack to mention here that we're being impacted by
Postgres leaving orphaned pages in /base/<db> after a crash while a
CREATE TABLE is being run in transaction.

The issue is the same as the reproduction steps listed here [1], that is:

- Start a CREATE TABLE transaction for a large table
- Terminate the process via kill -9
- Pages are left in /base that have no filenode references anymore, such
that `pg_database_size()` grows to be very large while total table+index
size remains constant

So from [1] you are using CREATE TABLE AS. Have you tried with:

BEGIN;
CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT into some_table SELECT * FROM other_table;
COMMIT;

The above populates the table. Have not tested but I'm going to assume
if you kill the above the problem would not happen or would be fixable
by DELETE FROM some_table/TRUNCATE some_table;

However in our particular case, we're using a managed/cloud Postgres
server and our `CREATE TABLE` transaction was being terminated by the
OOM killer.  Using a managed service, we don't have filesystem access to
go and clear out these orphaned pages.  This caused our total db size to
grow from 40GB of table+index data to 4TB on-disk (but still only 40GB
of table+index data, the other ~3.95TB being orphaned CREATE TABLE pages)

I realize (per a blog post from Robert Haas [2] and from slack
conversation) that this is a known issue, but was directed here from
slack to just mention that we were impacted by it, and have no
resolution due to not having filesystem access, and not having a method
internally to Postgres to deal with these orphaned relations.  (Our
recourse currently is to do something like a pg_dump/pg_restore onto a
clean instance in order to escape the orphaned files)

-Jason

[1] https://github.com/bdrouvot/pg_orphaned#example-1
[2]
https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Jason Myers
j.myers@brstrat.com
In reply to: Adrian Klaver (#2)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver <adrian.klaver@aklaver.com>

wrote:

Have you tried with:

BEGIN;
CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT into some_table SELECT * FROM other_table;
COMMIT;

The above populates the table

Thanks -- we were indeed creating and populating the new table all in a
single transaction.

I'll see if we can split this into two transactions so that the table
structure is committed quickly. I think you're right that this would
mostly sidestep the issue.

-Jason

p.s. Apologies if this is formatted wrong, this is my first mailing list
post.

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jason Myers (#3)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On 8/18/20 1:19 PM, Jason Myers wrote:

On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver

<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

Have you tried with:
BEGIN;
CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT into some_table SELECT * FROM other_table;
COMMIT;

The above populates the table

Thanks -- we were indeed creating and populating the new table all in a
single transaction.

I'll see if we can split this into two transactions so that the table
structure is committed quickly.  I think you're right that this would
mostly sidestep the issue.

Assuming the table you are pulling from is fairly static, it could also
allow you to regulate the amount of data you transfer at any one time
into the new table.

-Jason

p.s. Apologies if this is formatted wrong, this is my first mailing list
post.

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Jason Myers
j.myers@brstrat.com
In reply to: Adrian Klaver (#2)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

So from [1] you are using CREATE TABLE AS. Have you tried with:

BEGIN;
CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT into some_table SELECT * FROM other_table;
COMMIT;

The above populates the table. Have not tested but I'm going to assume
if you kill the above the problem would not happen or would be fixable
by DELETE FROM some_table/TRUNCATE some_table;

I was able to implement this, which creates the table quickly in a first
transaction and populates it in a second transaction.

However we were still seeing orphaned files on crash, and I believe I
tracked it down to subsequent CREATE INDEX statements also creating these
orphaned files (if they are running during a crash).

Is that issue known as well? I don't believe I can use the same trick to
sidestep that one...

-Jason

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jason Myers (#5)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On 8/20/20 1:37 PM, Jason Myers wrote:

On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

So from [1] you are using CREATE TABLE AS. Have you tried with:

BEGIN;
CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT into some_table SELECT * FROM other_table;
COMMIT;

The above populates the table. Have not tested but I'm going to assume
if you kill the above the problem would not happen or would be fixable
by DELETE FROM some_table/TRUNCATE some_table;

I was able to implement this, which creates the table quickly in a first
transaction and populates it in a second transaction.

However we were still seeing orphaned files on crash, and I believe I
tracked it down to subsequent CREATE INDEX statements also creating
these orphaned files (if they are running during a crash).

If the crashes are still being caused by the OOM killer then it looks to
me you need a more capable cloud instance.

Can you partition the tables to break the work into smaller units?

Is that issue known as well?  I don't believe I can use the same trick
to sidestep that one...

-Jason

--
Adrian Klaver
adrian.klaver@aklaver.com

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jason Myers (#5)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

Jason Myers <j.myers@brstrat.com> writes:

However we were still seeing orphaned files on crash, and I believe I
tracked it down to subsequent CREATE INDEX statements also creating these
orphaned files (if they are running during a crash).
Is that issue known as well? I don't believe I can use the same trick to
sidestep that one...

Yeah, it's entirely intentional that we don't try to clean up orphaned
disk files after a database crash. There's a long discussion of this and
related topics in src/backend/access/transam/README. What that says about
why not is that such files' contents might be useful for forensic analysis
of the crash, and anyway "Orphan files are harmless --- at worst they
waste a bit of disk space". A point not made in that text, but true
anyway, is that it'd also be quite expensive to search a large database
for orphaned files, so people would likely not want to pay that price
on the way to getting their database back up.

There might be value in a user-invokable tool that runs in an existing
non-crashed database and looks for orphan files, but I'm not aware that
anyone has written one. (Race conditions against concurrent table
creation would be a problem; but probably that can be finessed somehow,
maybe by noting the file's creation time.)

In the meantime I've got to say that routinely kill 9'ing database
processes just doesn't seem like a very good idea. Yeah, we do our best
to ensure that there won't be data loss, but you're really doubling down
on a hard assumption that Postgres contains zero bugs when you operate
that way. I'd suggest reconfiguring things to avoid the OOM kill hazard;
or if your cloud provider makes that effectively impossible, maybe you
need another provider. But on most systems I'd think you could use ulimit
or the like even if you don't have root privileges.

regards, tom lane

#8Jason Myers
j.myers@brstrat.com
In reply to: Tom Lane (#7)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On Thu, Aug 20, 2020 at 5:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jason Myers <j.myers@brstrat.com> writes:

However we were still seeing orphaned files on crash, and I believe I
tracked it down to subsequent CREATE INDEX statements also creating these
orphaned files (if they are running during a crash).
Is that issue known as well? I don't believe I can use the same trick to
sidestep that one...

Yeah, it's entirely intentional that we don't try to clean up orphaned
disk files after a database crash. There's a long discussion of this and
related topics in src/backend/access/transam/README. What that says about
why not is that such files' contents might be useful for forensic analysis
of the crash, and anyway "Orphan files are harmless --- at worst they
waste a bit of disk space". A point not made in that text, but true
anyway, is that it'd also be quite expensive to search a large database
for orphaned files, so people would likely not want to pay that price
on the way to getting their database back up.

There might be value in a user-invokable tool that runs in an existing
non-crashed database and looks for orphan files, but I'm not aware that
anyone has written one. (Race conditions against concurrent table
creation would be a problem; but probably that can be finessed somehow,
maybe by noting the file's creation time.)

In the meantime I've got to say that routinely kill 9'ing database
processes just doesn't seem like a very good idea. Yeah, we do our best
to ensure that there won't be data loss, but you're really doubling down
on a hard assumption that Postgres contains zero bugs when you operate
that way. I'd suggest reconfiguring things to avoid the OOM kill hazard;
or if your cloud provider makes that effectively impossible, maybe you
need another provider. But on most systems I'd think you could use ulimit
or the like even if you don't have root privileges.

regards, tom lane

Understood, thanks for the reply.

-Jason

#9Jeremy Schneider
schneider@ardentperf.com
In reply to: Tom Lane (#7)
Re: Orphaned relations after crash/sigkill during CREATE TABLE

On 8/20/20 14:46, Tom Lane wrote:

There might be value in a user-invokable tool that runs in an existing
non-crashed database and looks for orphan files, but I'm not aware that
anyone has written one. (Race conditions against concurrent table
creation would be a problem; but probably that can be finessed somehow,
maybe by noting the file's creation time.)

Bertrand Drouvot just put out a tool to list potentially orphaned files
a couple months ago.

https://github.com/bdrouvot/pg_orphaned

Implemented as an extension, doesn't remove the files, but useful
nonetheless.

-Jeremy

--
http://about.me/jeremy_schneider