Why does copy_relation_data only use WAL when WAL archiving is enabled?
If I run the database in non-archiving mode and execute the following
command:
alter table t set tablespace tblspc1;
Isn't it possible that the "new t" can't be recovered?
Jacky Leng wrote:
If I run the database in non-archiving mode and execute the following
command:
alter table t set tablespace tblspc1;
Isn't it possible that the "new t" can't be recovered?
No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs
the new relation file.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
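For illustration, here is a minimal C sketch of the copy-and-fsync pattern being discussed: copy the relation block by block, WAL-log each page only when archiving is on, and otherwise rely on an immediate sync of the new file. The helper names (read_block, write_block, log_new_page, fsync_new_relation) are simplified placeholders, not the real storage-manager API.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical stand-ins for the real storage-manager calls. */
static bool wal_archiving_enabled = false;

static int  relation_block_count(void)               { return 3; }
static void read_block(int blkno, char *buf)          { memset(buf, blkno, BLCKSZ); }
static void write_block(int blkno, const char *buf)   { (void) blkno; (void) buf; }
static void log_new_page(int blkno, const char *buf)  { printf("WAL: new page %d\n", blkno); (void) buf; }
static void fsync_new_relation(void)                  { printf("smgrimmedsync: fsync the new file\n"); }

/*
 * Sketch of the copy_relation_data idea: when archiving is off, no WAL is
 * written and crash safety rests entirely on the final fsync -- which is
 * exactly the assumption the rest of the thread pokes at.
 */
static void copy_relation_data_sketch(void)
{
    char buf[BLCKSZ];
    int  nblocks = relation_block_count();

    for (int blkno = 0; blkno < nblocks; blkno++)
    {
        read_block(blkno, buf);
        if (wal_archiving_enabled)
            log_new_page(blkno, buf);   /* XLOG_HEAP_NEWPAGE-style record */
        write_block(blkno, buf);
    }
    fsync_new_relation();
}

int main(void)
{
    copy_relation_data_sketch();
    return 0;
}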
Jacky Leng wrote:
If I run the database in non-archiving mode and execute the following
command:
alter table t set tablespace tblspc1;
Isn't it possible that the "new t" can't be recovered?
No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs
the new relation file.
Usually that's true, but how about this situation:
* First, do the following series of steps:
* Create two tablespaces SPC1, SPC2;
* Create table T1 in SPC1 and insert some values into it; suppose T1's
oid/relfilenode is OID1;
* Drop table T1;----------OID1 is released in pg_class and can be
reused.
* Do anything that will make the next oid allocated from pg_class be
OID1, e.g. insert many, many tuples into a relation WITH OIDS;
* Create table T2 in SPC2, and insert some values into it; its
oid/relfilenode is OID1;
* Alter table T2 set tablespace SPC1;---------T2 goes to SPC1 and uses
the same file name as the old T1;
* Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
* Kill the database abnormally;
* Restart the database;
Let's analyze what will happen during the recovery process:
* When T1 is re-created, it finds that its file is already
there--actually this file is T2's;
* "T1"'s file (actually T2's) is re-dropped;
* ....
* T2 is re-created, and finds that its file has disappeared, so it
re-creates one;
* As copy_relation_data didn't record any xlog for T2's ALTER TABLE SET
TABLESPACE operation, after recovery we'll find that T2 is empty!!!
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
Yes, it's really forced out.
But if there's no checkpoint, the recovery process will begin from
a point before T1 was created, and as T1 was later dropped, replay will
remove T2's file!
Simon Riggs wrote:
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there are other commands that rely on immediate fsync as well.
Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Forgot to attach the script I promised.
You need to set $PGDATA before running the script. And psql, pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run "SELECT * FROM t2". There should be one row in the
table, but it's empty.
Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there are other commands that rely on immediate fsync as well.
Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I wrote:
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
Ok, here's one idea:
Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.
Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.
Any better ideas?
Do we care enough about this to fix this? Enough to backpatch? The
probability of this happening is pretty small, but the consequences are
really bad, so my vote is "yes" and "yes".
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
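For readers following the design discussion before the full patch appears below, here is a minimal C sketch of the deferred-unlink idea, assuming hypothetical names (schedule_unlink, checkpoint_begin, checkpoint_end) and a plain linked list; the patch posted later in the thread is the authoritative version.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical pending-unlink bookkeeping, not the real md.c structures. */
typedef struct PendingUnlink
{
    char                  path[64];
    unsigned int          cycle;   /* value of unlink_cycle when queued */
    struct PendingUnlink *next;
} PendingUnlink;

static PendingUnlink *pending_unlinks = NULL;
static unsigned int   unlink_cycle = 0;

/* On commit of DROP TABLE: truncate the file but keep the name reserved. */
static void schedule_unlink(const char *path)
{
    PendingUnlink *e = malloc(sizeof(PendingUnlink));

    snprintf(e->path, sizeof(e->path), "%s", path);
    e->cycle = unlink_cycle;
    e->next = pending_unlinks;
    pending_unlinks = e;

    truncate(path, 0);              /* release the space immediately */
}

/* Called before the checkpoint's REDO pointer is determined. */
static void checkpoint_begin(void)
{
    unlink_cycle++;                 /* later requests wait one more cycle */
}

/* Called after the checkpoint record has been written and flushed. */
static void checkpoint_end(void)
{
    PendingUnlink **prev = &pending_unlinks;

    while (*prev)
    {
        PendingUnlink *e = *prev;

        if (e->cycle < unlink_cycle) /* queued before this checkpoint began */
        {
            unlink(e->path);        /* safe now: the relfilenode may be reused */
            *prev = e->next;
            free(e);
        }
        else
            prev = &e->next;
    }
}

int main(void)
{
    schedule_unlink("base/1/16384");
    checkpoint_begin();
    checkpoint_end();
    return 0;
}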
Heikki Linnakangas wrote:
I wrote:
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
Ok, here's one idea:
Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.
Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.
Any better ideas?
Couldn't we fix this by forcing a checkpoint before we commit the transaction
that created the new pg_class entry for the clustered table? Or rather, more
generally, before committing a transaction that created a new non-temporary
relfilenode but didn't WAL-log any subsequent inserts.
That's of course a rather sledgehammer-like approach to this problem - but at
least for the back-branches the fix would be less intrusive...
regards, Florian Pflug
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there are other commands that rely on immediate fsync as well.
Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.
Unfortunately I don't see any easy way to fix it.
So, what you are saying is that re-using relfilenodes can cause problems
during recovery in any command that alters the relfilenode of a
relation?
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the above
series--although not very likely;
That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.
There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there are other commands that rely on immediate fsync as well.
Attached is a shell script that demonstrates the problem on CVS HEAD with
CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is
dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to
get the same relfilenode that T1 had. Then we crash the server, forcing a
WAL replay. After that, T2 is empty. Oops.
Unfortunately I don't see any easy way to fix it.
So, what you are saying is that re-using relfilenodes can cause problems
during recovery in any command that alters the relfilenode of a relation?
From what I understand, I'd say that creating a relfilenode *and* subsequently
inserting data without WAL-logging causes the problem. If the relfilenode was
recently deleted, the inserts might be effectively undone upon recovery (because
we first replay the delete), but later *not* redone (because we didn't WAL-log
the inserts).
That brings me to another idea for a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion.
We could make relfilenodes globally unique if we added the xid and epoch of the
creating transaction to the filename. Those are 64 bits, so if we encode them
in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13.
regards, Florian Pflug
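As a sanity check of the arithmetic above: 36^13 is about 1.7e20, which is larger than 2^64, so 13 base-36 digits do cover a 64-bit value (Heikki points out further down that the full key would actually be xid + epoch + oid = 96 bits). A small C sketch, with an illustrative function name only:

#include <stdint.h>
#include <stdio.h>

/* Encode a 64-bit value as exactly 13 base-36 digits (0-9, A-Z). */
static void encode_base36_64(uint64_t v, char out[14])
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    out[13] = '\0';
    for (int i = 12; i >= 0; i--)
    {
        out[i] = digits[v % 36];
        v /= 36;
    }
}

int main(void)
{
    char buf[14];

    /* 36^13 = 170581728179578208256, comfortably above 2^64 - 1. */
    encode_base36_64(UINT64_MAX, buf);
    printf("2^64 - 1 in base 36: %s\n", buf);
    return 0;
}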
Simon Riggs wrote:
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.
Or
Performing non-WAL logged operations on a relation file leads to a
truncated file on recovery, if the relfilenode of that file used to
belong to a relation that was dropped after the last checkpoint.
Happy?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Florian G. Pflug wrote:
Heikki Linnakangas wrote:
I wrote:
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
Ok, here's one idea:
Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.
Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.
Any better ideas?
Couldn't we fix this by forcing a checkpoint before we commit the
transaction that created the new pg_class entry for the clustered table?
Or rather, more generally, before committing a transaction that created
a new non-temporary relfilenode but didn't WAL-log any subsequent inserts.
Yes, that would work. As a small optimization, you could set a flag in
shared mem whenever you delete a rel file, and skip the checkpoint when
that flag isn't set.
That's of course a rather sledgehammer-like approach to this problem -
but at least for the back-branches the fix would be less intrusive...
Too much of a sledgehammer IMHO.
BTW, CREATE INDEX is also vulnerable. And in 8.3, COPY to a table
created/truncated in the same transaction.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.
OK, thanks.
I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
look deterministic to me, or at least isn't meant to be.
GetNewObjectId() should be cycling around, so although the oid index
scan using SnapshotDirty won't see committed-deleted rows, that shouldn't
matter for 2^32 oids. So what gives?
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.
OK, thanks.
I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
look deterministic to me, or at least isn't meant to be.
GetNewObjectId() should be cycling around, so although the oid index
scan using SnapshotDirty won't see committed-deleted rows, that shouldn't
matter for 2^32 oids. So what gives?
I don't think you still quite understand what's happening. GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
I don't think you still quite understand what's happening. GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.
Simon has a legitimate objection; not that there's no bug, but that the
probability of getting bitten is exceedingly small. The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice. The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.
I think we should think about ways to fix this, but I don't feel a need
to try to backpatch a solution.
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
I think the objection about leaking files on crash is wrong. We'd
have the replay of the deletion to fix things up --- it could probably
delete the file immediately, and if not could certainly put it back
on the fsync request queue.
regards, tom lane
On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.
OK, thanks.
I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
look deterministic to me, or at least isn't meant to be.
GetNewObjectId() should be cycling around, so although the oid index
scan using SnapshotDirty won't see committed-deleted rows, that shouldn't
matter for 2^32 oids. So what gives?
I don't think you still quite understand what's happening.
Clearly. It's not a problem to admit that.
GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.
Which calls GetNewOid() in all cases, AFAICS.
How does the reuse you say is happening come about? Seems like the bug
is in the reuse, not in how we cope with potential reuse.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Tom Lane wrote:
Simon has a legitimate objection; not that there's no bug, but that the
probability of getting bitten is exceedingly small.
Oh, if that's what he meant, he's right.
The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice. The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.
Yeah, it's unlikely to happen, but the consequences are horrible.
Note that it's not just DROP TABLE that's a problem, but anything that
uses smgrscheduleunlink, including CLUSTER and REINDEX.
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
Ok, I'll write a patch to do that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
Simon Riggs wrote:
If you've got a better problem statement it would be good to get that
right first before we discuss solutions.
Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.
OK, thanks.
I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
look deterministic to me, or at least isn't meant to be.
GetNewObjectId() should be cycling around, so although the oid index
scan using SnapshotDirty won't see committed-deleted rows, that shouldn't
matter for 2^32 oids. So what gives?
I don't think you still quite understand what's happening.
Clearly. It's not a problem to admit that.
GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.
Which calls GetNewOid() in all cases, AFAICS.
How does the reuse you say is happening come about? Seems like the bug
is in the reuse, not in how we cope with potential reuse.
After a table is dropped, the dropping transaction has been committed,
and the relation file has been deleted, there's nothing preventing the
reuse. There's no trace of that relfilenode in the system (except in the
WAL, which we never look into except on WAL replay). There's a dead row
in pg_class with that relfilenode, but even that could be vacuumed away
(not that it matters because we don't examine that).
Now the problem is that there's a record in the WAL to delete a relation
file with that relfilenode. If that relfilenode was reused, we delete
the contents of the new relation file when we replay that WAL record.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
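To see why nothing stands in the way of the reuse, here is a rough C sketch of the kind of collision check relfilenode allocation performs; the helper names are hypothetical and this is not the actual GetNewRelFileNode code. The point is that the check only probes the filesystem, so once the dropped relation's file has been unlink()ed there is nothing left to collide with, even though the WAL still contains a pending deletion of that relfilenode.

#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for OID allocation and path construction. */
static unsigned int next_oid = 16384;

static unsigned int get_new_oid(void) { return next_oid++; }

static void relation_path(unsigned int relfilenode, char *buf, size_t len)
{
    snprintf(buf, len, "base/1/%u", relfilenode);
}

/*
 * Sketch of the allocation loop: accept a candidate OID as a relfilenode
 * only if no file with that name exists on disk.
 */
static unsigned int get_new_relfilenode_sketch(void)
{
    char path[64];

    for (;;)
    {
        unsigned int candidate = get_new_oid();

        relation_path(candidate, path, sizeof(path));
        if (access(path, F_OK) != 0)   /* no such file: looks free on disk */
            return candidate;
    }
}

int main(void)
{
    printf("allocated relfilenode %u\n", get_new_relfilenode_sketch());
    return 0;
}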
On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:
The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice. The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.
Yeah, it's unlikely to happen, but the consequences are horrible.
When is this going to happen?
We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert
2^32 large objects, or create 2^32 tables, or any combination of the
above all within one checkpoint duration *and* exactly hit the exact
same relation.
That's a weird and huge application, a very fast server and an unlucky
DBA hitting the exact OID to be reused, and then the server has to crash
for us to ever notice.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:
The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice. The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.
Yeah, it's unlikely to happen, but the consequences are horrible.
When is this going to happen?
We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert
2^32 large objects, or create 2^32 tables, or any combination of the
above all within one checkpoint duration *and* exactly hit the exact
same relation.
The consumption of the OIDs doesn't need to happen within one checkpoint
duration. As long as the DROP and the reuse happens in the same
checkpoint cycle, you're screwed.
Granted, you're not likely to ever experience OID wrap-around unless
you have a heavily used user table with OIDs, or create/drop temp tables
a lot.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
How about just using WAL even when WAL archiving is disabled?
It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the
right thing for us; look at "heap_xlog_newpage": XLogReadBuffer
with init=true will extend the relation and rebuild the block correctly.
Someone may say that it's not worth recording xlog for operations
such as copy_relation_data, but these operations shouldn't happen
frequently.
You need to set $PGDATA before running the script. And psql,pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run "SELECT * FROM t2". There should be one row in the
table, but it's empty.
I've tried this script, and surprisingly found that T2 is not empty, just
as it should be.
Then I looked at how CLUSTER is done, and found that
You need to set $PGDATA before running the script. And psql,pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run "SELECT * FROM t2". There should be one row in the
table, but it's empty.
I've tried this script on "postgres (PostgreSQL) 8.3devel", and found that
T2 is not empty after recovery (just as it should be)---but the latest
version acts just like what you said.
Then I looked at how CLUSTER is done, and found that:
In "postgres (PostgreSQL) 8.3devel", unlike AlterTableSetTablespace (which
copies the whole relation block-by-block, and doesn't use WAL under
non-archiving mode), CLUSTER copies the relation row-by-row
(simple_heap_insert), which always uses WAL regardless of archiving mode.
As the WAL records exist, recovery copes with everything correctly.
The latest version acts differently, probably because it removed the WAL
logging of CLUSTER under non-archiving mode.
So the conclusion is: we can replace the WAL mechanism with smgrimmedsync
only if relfilenodes are not allowed to be reused; but that's not true, so
we should keep WAL.
Is that right?
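For contrast with the block-copy sketch earlier in the thread, here is a rough C sketch of the row-by-row behaviour described above. simple_heap_insert is the real routine the message refers to; everything else (scan_next_tuple, wal_logged_insert) is made up for illustration.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the real heap-access API. */
typedef struct { int id; } Tuple;

static bool scan_next_tuple(Tuple *tup)
{
    static int n = 0;

    tup->id = n;
    return n++ < 3;
}

static void wal_logged_insert(const Tuple *t)
{
    /* stands in for simple_heap_insert, which always emits a WAL record */
    printf("insert tuple %d (WAL record written)\n", t->id);
}

/*
 * Sketch of the 8.3devel CLUSTER behaviour described above: each row of the
 * old heap is re-inserted into the new heap through the ordinary,
 * always-WAL-logged insert path, so recovery can rebuild the new file even
 * if replay first clobbered it.
 */
static void cluster_copy_rows_sketch(void)
{
    Tuple tup;

    while (scan_next_tuple(&tup))
        wal_logged_insert(&tup);
}

int main(void)
{
    cluster_copy_rows_sketch();
    return 0;
}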
Sorry, I sent that mail by mistake just now.
Jacky Leng wrote:
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
How about just using WAL even when WAL archiving is disabled?
It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the
right thing for us; look at "heap_xlog_newpage": XLogReadBuffer
with init=true will extend the relation and rebuild the block correctly.
Someone may say that it's not worth recording xlog for operations
such as copy_relation_data, but these operations shouldn't happen
frequently.
Always using WAL would fix the problem, but it's a big performance hit.
WAL-logging doubles the amount of write I/O required.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Tom Lane wrote:
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
Ok, I'll write a patch to do that.
There's a small problem with that: DROP TABLESPACE checks that the
tablespace directory is empty, and fails if it sees one of those empty
files. You also run into that problem if you
1. BEGIN; CREATE TABLE; -- no commit
2. crash+restart
3. DROP TABLESPACE
because we leave behind the stale file created by CREATE TABLE.
The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.
That still won't fix the problem with files created by a crashed
transaction. For that we had a plan a long time ago: after recovery,
scan the data directory for any files that don't have a live row in pg_class,
and write a message to the log for each so that the DBA can delete them
(deleting them automatically was considered too dangerous). That's
probably 8.4 material, though.
Thoughts?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.
This is getting too Rube Goldbergian for my tastes. What if we just
make DROP TABLESPACE force a checkpoint before proceeding?
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.
This is getting too Rube Goldbergian for my tastes. What if we just
make DROP TABLESPACE force a checkpoint before proceeding?
True, that would work. DROP TABLESPACE should be uncommon enough that
the performance hit is ok. We only need to checkpoint if the directory
isn't empty, though I think that's the case more often than not; you're
most likely to drop a tablespace right after dropping all relations in it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Tom Lane wrote:
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
Ok, I'll write a patch to do that.
What is the argument against making relfilenodes globally unique by adding the
xid and epoch of the creating transaction to the filename? Those 64 bits could
be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a
relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem
should support IMHO.
regards, Florian Pflug
Florian G. Pflug wrote:
Heikki Linnakangas wrote:
Tom Lane wrote:
I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.
Ok, I'll write a patch to do that.
What is the argument against making relfilenodes globally unique by
adding the xid and epoch of the creating transaction to the filename?
Those 64 bits could be stuffed into 13 bytes by base-36 encoding
(A-Z,0-9). The maximum length of a relfilenode would then be 10 + 1 + 13
= 24, which any reasonable filesystem should support IMHO.
The size would be xid + epoch + oid = 96 bits, not 64 bits.
That would work, but sounds like a much bigger change.
sizeof(RelFileNode) would increase from 12 to 20, so any data structure
that deals with RelFileNodes would take more memory. The hash function in
the buffer manager would get more expensive. I remember seeing that showing
up in oprofile sometimes, but it'd need to be benchmarked to see if it
really matters.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
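For reference, a small C sketch of the size arithmetic: the existing RelFileNode is three OIDs (12 bytes), and a variant extended with a 32-bit xid and a 32-bit epoch comes to the 20 bytes mentioned above. The extended struct is hypothetical, named here only for illustration.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
typedef uint32_t TransactionId;

/* Current layout: tablespace, database and relation OIDs = 12 bytes. */
typedef struct RelFileNode
{
    Oid spcNode;
    Oid dbNode;
    Oid relNode;
} RelFileNode;

/* Hypothetical extended layout carrying the creating xid and epoch. */
typedef struct ExtendedRelFileNode
{
    Oid           spcNode;
    Oid           dbNode;
    Oid           relNode;
    TransactionId xid;
    uint32_t      epoch;
} ExtendedRelFileNode;

int main(void)
{
    printf("sizeof(RelFileNode)         = %zu\n", sizeof(RelFileNode));
    printf("sizeof(ExtendedRelFileNode) = %zu\n", sizeof(ExtendedRelFileNode));
    return 0;
}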
"Florian G. Pflug" <fgp@phlo.org> writes:
What is the argument against making relfilenodes globally unique by adding the
xid and epoch of the creating transaction to the filename?
1. Zero chance of ever backpatching. (I know I said I wasn't excited
about that, but it's still a strike against a proposed fix.)
2. Adds new fields to RelFileNode, which will be a major code change,
and possibly a noticeable performance hit (bigger hashtable keys).
3. Adds new columns to pg_class, which is a real PITA ...
4. Breaks oid2name and all similar code that knows about relfilenode.
regards, tom lane
Tom Lane wrote:
"Florian G. Pflug" <fgp@phlo.org> writes:
What is the argument against making relfilenodes globally unique by adding
the xid and epoch of the creating transaction to the filename?
1. Zero chance of ever backpatching. (I know I said I wasn't excited about
that, but it's still a strike against a proposed fix.)
2. Adds new fields to RelFileNode, which will be a major code change, and
possibly a noticeable performance hit (bigger hashtable keys).
3. Adds new columns to pg_class, which is a real PITA ...
4. Breaks oid2name and all similar code that knows about relfilenode.
Ah, Ok. I was under the impression that relfilenode in pg_class is a string of
some kind. In that case only GetNewRelFileNode would have needed patching...
But that is obviously not the case, as I realized now :-(
Thanks for setting me straight ;-)
regards, Florian Pflug
Tom Lane wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.
This is getting too Rube Goldbergian for my tastes. What if we just
make DROP TABLESPACE force a checkpoint before proceeding?
Patch attached.
The scenario we're preventing is still possible if for some reason the
latest checkpoint record is damaged, and we start recovery from the
previous checkpoint record. I think the probability of that happening,
together with the OID wrap-around and hitting the relfilenode of a
recently deleted file with a new one, is low enough to not worry about.
If we cared, we could fix it by letting the files linger for two
checkpoint cycles instead of one.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments: avoid-premature-relfilenode-reuse-1.patch (text/x-diff)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.286
diff -c -r1.286 xlog.c
*** src/backend/access/transam/xlog.c 12 Oct 2007 19:39:59 -0000 1.286
--- src/backend/access/transam/xlog.c 18 Oct 2007 20:16:56 -0000
***************
*** 45,50 ****
--- 45,51 ----
#include "storage/fd.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
+ #include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/pg_locale.h"
***************
*** 5668,5673 ****
--- 5669,5680 ----
checkPoint.time = time(NULL);
/*
+ * Let the md storage manager prepare for the checkpoint before
+ * we determine the REDO pointer.
+ */
+ mdcheckpointbegin();
+
+ /*
* We must hold WALInsertLock while examining insert state to determine
* the checkpoint REDO pointer.
*/
***************
*** 5887,5892 ****
--- 5894,5904 ----
END_CRIT_SECTION();
/*
+ * Let the md storage manager do its post-checkpoint work.
+ */
+ mdcheckpointend();
+
+ /*
* Delete old log files (those no longer needed even for previous
* checkpoint).
*/
Index: src/backend/commands/tablespace.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/tablespace.c,v
retrieving revision 1.49
diff -c -r1.49 tablespace.c
*** src/backend/commands/tablespace.c 1 Aug 2007 22:45:08 -0000 1.49
--- src/backend/commands/tablespace.c 18 Oct 2007 20:31:53 -0000
***************
*** 460,472 ****
LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
/*
! * Try to remove the physical infrastructure
*/
if (!remove_tablespace_directories(tablespaceoid, false))
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("tablespace \"%s\" is not empty",
! tablespacename)));
/* Record the filesystem change in XLOG */
{
--- 460,488 ----
LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
/*
! * Try to remove the physical infrastructure.
*/
if (!remove_tablespace_directories(tablespaceoid, false))
! {
! /*
! * There can be lingering empty files in the directories, left behind
! * by for example DROP TABLE, that have been scheduled for deletion
! * at next checkpoint (see comments in mdunlink() for details). We
! * could just delete them immediately, but we can't tell them apart
! * from important data files that we mustn't delete. So instead, we
! * force a checkpoint which will clean out any lingering files, and
! * try again.
! */
! RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
! if (!remove_tablespace_directories(tablespaceoid, false))
! {
! /* Still not empty, the files must be important then */
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("tablespace \"%s\" is not empty",
! tablespacename)));
! }
! }
/* Record the filesystem change in XLOG */
{
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.129
diff -c -r1.129 md.c
*** src/backend/storage/smgr/md.c 3 Jul 2007 14:51:24 -0000 1.129
--- src/backend/storage/smgr/md.c 18 Oct 2007 21:11:43 -0000
***************
*** 34,39 ****
--- 34,40 ----
/* special values for the segno arg to RememberFsyncRequest */
#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
+ #define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
***************
*** 113,118 ****
--- 114,123 ----
* table remembers the pending operations. We use a hash table mostly as
* a convenient way of eliminating duplicate requests.
*
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after next checkpoint, but we use a linked list instead of hash
+ * table, because we don't expect there to be any duplicates.
+ *
* (Regular backends do not track pending operations locally, but forward
* them to the bgwriter.)
*/
***************
*** 131,139 ****
--- 136,152 ----
CycleCtr cycle_ctr; /* mdsync_cycle_ctr when request was made */
} PendingOperationEntry;
+ typedef struct
+ {
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* mdunlink_cycle_ctr when request was made */
+ } PendingUnlinkEntry;
+
static HTAB *pendingOpsTable = NULL;
+ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
+ static CycleCtr mdunlink_cycle_ctr = 0;
typedef enum /* behavior for mdopen & _mdfd_getseg */
***************
*** 146,151 ****
--- 159,165 ----
/* local routines */
static MdfdVec *mdopen(SMgrRelation reln, ExtensionBehavior behavior);
static void register_dirty_segment(SMgrRelation reln, MdfdVec *seg);
+ static void register_unlink(RelFileNode rnode);
static MdfdVec *_fdvec_alloc(void);
#ifndef LET_OS_MANAGE_FILESIZE
***************
*** 188,193 ****
--- 202,208 ----
100L,
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+ pendingUnlinks = NIL;
}
}
***************
*** 257,267 ****
--- 272,300 ----
* If isRedo is true, it's okay for the relation to be already gone.
* Also, any failure should be reported as WARNING not ERROR, because
* we are usually not in a transaction anymore when this is called.
+ *
+ * If isRedo is false, the relation is actually just truncated to release
+ * the space, and left to linger as an empty file until the next checkpoint.
+ * This is to avoid reusing the same relfilenode for a new relation file,
+ * until the commit record containing the deletion has been flushed out.
+ *
+ * The scenario we're trying to protect from is this:
+ * After crash, we replay the commit WAL record of a transaction that
+ * deleted a relation, like DROP TABLE. But before the crash, after deleting
+ * the old relation, we created a new one, and it happened to get the same
+ * relfilenode as the deleted relation (OID must've wrapped around for
+ * that to happen). Now replaying the deletion of the old relation
+ * deletes the new one instead, because they had the same relfilenode.
+ * Normally the new relation would be recreated later in the WAL replay, but
+ * if we relied on fsyncing the file after populating it, like CLUSTER and
+ * CREATE INDEX do, for example, the contents of the file would be lost
+ * forever.
*/
void
mdunlink(RelFileNode rnode, bool isRedo)
{
char *path;
+ int ret;
/*
* We have to clean out any pending fsync requests for the doomed relation,
***************
*** 271,278 ****
path = relpath(rnode);
! /* Delete the first segment, or only segment if not doing segmenting */
! if (unlink(path) < 0)
{
if (!isRedo || errno != ENOENT)
ereport(WARNING,
--- 304,318 ----
path = relpath(rnode);
! /*
! * Delete or truncate the first segment, or only segment if not doing
! * segmenting
! */
! if (!isRedo)
! ret = truncate(path, 0);
! else
! ret = unlink(path);
! if (ret < 0)
{
if (!isRedo || errno != ENOENT)
ereport(WARNING,
***************
*** 316,321 ****
--- 356,364 ----
#endif
pfree(path);
+
+ if (!isRedo)
+ register_unlink(rnode);
}
/*
***************
*** 1096,1114 ****
}
}
/*
* RememberFsyncRequest() -- callback from bgwriter side of fsync request
*
! * We stuff the fsync request into the local hash table for execution
* during the bgwriter's next checkpoint.
*
* The range of possible segment numbers is way less than the range of
* BlockNumber, so we can reserve high values of segno for special purposes.
! * We define two: FORGET_RELATION_FSYNC means to cancel pending fsyncs for
! * a relation, and FORGET_DATABASE_FSYNC means to cancel pending fsyncs for
! * a whole database. (These are a tad slow because the hash table has to be
! * searched linearly, but it doesn't seem worth rethinking the table structure
! * for them.)
*/
void
RememberFsyncRequest(RelFileNode rnode, BlockNumber segno)
--- 1139,1268 ----
}
}
+
+ /*
+ * register_unlink() -- Schedule a relation to be deleted after next checkpoint
+ */
+ static void
+ register_unlink(RelFileNode rnode)
+ {
+ if (pendingOpsTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(rnode, UNLINK_RELATION_REQUEST);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /*
+ * Notify the bgwriter about it. If we fail to queue the revoke
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we just leave the file orphaned instead?
+ */
+ while (!ForwardFsyncRequest(rnode, UNLINK_RELATION_REQUEST))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+ }
+
+ /*
+ * mdcheckpointbegin() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started and those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* RedoRecPtr is determined.
+ */
+ void
+ mdcheckpointbegin()
+ {
+ PendingUnlinkEntry *entry;
+ ListCell *cell;
+
+ /*
+ * Just in case the prior checkpoint failed, stamp all entries in
+ * the list with the current cycle counter. Anything that's in the
+ * list at the start of checkpoint can surely be deleted after the
+ * checkpoint is finished, regardless of when the request was made.
+ */
+ foreach(cell, pendingUnlinks)
+ {
+ entry = lfirst(cell);
+ entry->cycle_ctr = mdunlink_cycle_ctr;
+ }
+
+ /*
+ * Any unlink requests arriving after this point will be assigned the
+ * next cycle counter, and won't be unlinked until next checkpoint.
+ */
+ mdunlink_cycle_ctr++;
+ }
+
+ /*
+ * mdcheckpointend() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+ void
+ mdcheckpointend()
+ {
+ while(pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new
+ * we've reached the end of old entries.
+ */
+ if (entry->cycle_ctr == mdsync_cycle_ctr)
+ break;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
+
+ /* Unlink the file */
+ path = relpath(entry->rnode);
+ if (unlink(path) < 0)
+ {
+ /*
+ * ENOENT shouldn't happen either, but it doesn't really matter
+ * because we would've deleted it now anyway.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove relation %u/%u/%u: %m",
+ entry->rnode.spcNode,
+ entry->rnode.dbNode,
+ entry->rnode.relNode)));
+ }
+ pfree(path);
+
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ }
+ }
+
/*
* RememberFsyncRequest() -- callback from bgwriter side of fsync request
*
! * We stuff normal fsync requests into the local hash table for execution
* during the bgwriter's next checkpoint.
*
* The range of possible segment numbers is way less than the range of
* BlockNumber, so we can reserve high values of segno for special purposes.
! * We define three:
! * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation
! * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
! * - UNLINK_RELATION_REQUEST is a request to delete a lingering file after next
! * checkpoint. These are stuffed into a linked list separate from
! * the normal fsync requests.
! *
! * (Handling the FORGET_* requests is a tad slow because the hash table has to
! * be searched linearly, but it doesn't seem worth rethinking the table
! * structure for them.)
*/
void
RememberFsyncRequest(RelFileNode rnode, BlockNumber segno)
***************
*** 1147,1152 ****
--- 1301,1322 ----
}
}
}
+ else if (segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt;
+ PendingUnlinkEntry *entry;
+
+ oldcxt = MemoryContextSwitchTo(MdCxt);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ memcpy(&entry->rnode, &rnode, sizeof(RelFileNode));
+ entry->cycle_ctr = mdunlink_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
else
{
/* Normal case: enter a request to fsync this segment */
Index: src/include/storage/smgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/smgr.h,v
retrieving revision 1.59
diff -c -r1.59 smgr.h
*** src/include/storage/smgr.h 5 Sep 2007 18:10:48 -0000 1.59
--- src/include/storage/smgr.h 18 Oct 2007 09:52:43 -0000
***************
*** 110,115 ****
--- 110,118 ----
extern void ForgetRelationFsyncRequests(RelFileNode rnode);
extern void ForgetDatabaseFsyncRequests(Oid dbid);
+ extern void mdcheckpointbegin(void);
+ extern void mdcheckpointend(void);
+
/* smgrtype.c */
extern Datum smgrout(PG_FUNCTION_ARGS);
extern Datum smgrin(PG_FUNCTION_ARGS);
Here's an updated version of the patch. There was a bogus assertion in
the previous one, comparing against mdsync_cycle_ctr instead of
mdunlink_cycle_ctr.
Heikki Linnakangas wrote:
Tom Lane wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.
This is getting too Rube Goldbergian for my tastes. What if we just
make DROP TABLESPACE force a checkpoint before proceeding?
Patch attached.
The scenario we're preventing is still possible if for some reason the
latest checkpoint record is damaged, and we start recovery from the
previous checkpoint record. I think the probability of that happening,
together with the OID wrap-around and hitting the relfilenode of a
recently deleted file with a new one, is low enough to not worry about.
If we cared, we could fix it by letting the files linger for two
checkpoint cycles instead of one.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments: avoid-premature-relfilenode-reuse-2.patch (text/x-diff)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.286
diff -c -r1.286 xlog.c
*** src/backend/access/transam/xlog.c 12 Oct 2007 19:39:59 -0000 1.286
--- src/backend/access/transam/xlog.c 18 Oct 2007 20:16:56 -0000
***************
*** 45,50 ****
--- 45,51 ----
#include "storage/fd.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
+ #include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/pg_locale.h"
***************
*** 5668,5673 ****
--- 5669,5680 ----
checkPoint.time = time(NULL);
/*
+ * Let the md storage manager prepare for the checkpoint before
+ * we determine the REDO pointer.
+ */
+ mdcheckpointbegin();
+
+ /*
* We must hold WALInsertLock while examining insert state to determine
* the checkpoint REDO pointer.
*/
***************
*** 5887,5892 ****
--- 5894,5904 ----
END_CRIT_SECTION();
/*
+ * Let the md storage manager do its post-checkpoint work.
+ */
+ mdcheckpointend();
+
+ /*
* Delete old log files (those no longer needed even for previous
* checkpoint).
*/
Index: src/backend/commands/tablespace.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/tablespace.c,v
retrieving revision 1.49
diff -c -r1.49 tablespace.c
*** src/backend/commands/tablespace.c 1 Aug 2007 22:45:08 -0000 1.49
--- src/backend/commands/tablespace.c 18 Oct 2007 20:31:53 -0000
***************
*** 460,472 ****
LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
/*
! * Try to remove the physical infrastructure
*/
if (!remove_tablespace_directories(tablespaceoid, false))
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("tablespace \"%s\" is not empty",
! tablespacename)));
/* Record the filesystem change in XLOG */
{
--- 460,488 ----
LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
/*
! * Try to remove the physical infrastructure.
*/
if (!remove_tablespace_directories(tablespaceoid, false))
! {
! /*
! * There can be lingering empty files in the directories, left behind
! * by for example DROP TABLE, that have been scheduled for deletion
! * at next checkpoint (see comments in mdunlink() for details). We
! * could just delete them immediately, but we can't tell them apart
! * from important data files that we mustn't delete. So instead, we
! * force a checkpoint which will clean out any lingering files, and
! * try again.
! */
! RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
! if (!remove_tablespace_directories(tablespaceoid, false))
! {
! /* Still not empty, the files must be important then */
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("tablespace \"%s\" is not empty",
! tablespacename)));
! }
! }
/* Record the filesystem change in XLOG */
{
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.129
diff -c -r1.129 md.c
*** src/backend/storage/smgr/md.c 3 Jul 2007 14:51:24 -0000 1.129
--- src/backend/storage/smgr/md.c 20 Oct 2007 14:10:08 -0000
***************
*** 34,39 ****
--- 34,40 ----
/* special values for the segno arg to RememberFsyncRequest */
#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
+ #define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
***************
*** 113,118 ****
--- 114,123 ----
* table remembers the pending operations. We use a hash table mostly as
* a convenient way of eliminating duplicate requests.
*
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after next checkpoint, but we use a linked list instead of hash
+ * table, because we don't expect there to be any duplicates.
+ *
* (Regular backends do not track pending operations locally, but forward
* them to the bgwriter.)
*/
***************
*** 131,139 ****
--- 136,152 ----
CycleCtr cycle_ctr; /* mdsync_cycle_ctr when request was made */
} PendingOperationEntry;
+ typedef struct
+ {
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* mdunlink_cycle_ctr when request was made */
+ } PendingUnlinkEntry;
+
static HTAB *pendingOpsTable = NULL;
+ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
+ static CycleCtr mdunlink_cycle_ctr = 0;
typedef enum /* behavior for mdopen & _mdfd_getseg */
***************
*** 146,151 ****
--- 159,165 ----
/* local routines */
static MdfdVec *mdopen(SMgrRelation reln, ExtensionBehavior behavior);
static void register_dirty_segment(SMgrRelation reln, MdfdVec *seg);
+ static void register_unlink(RelFileNode rnode);
static MdfdVec *_fdvec_alloc(void);
#ifndef LET_OS_MANAGE_FILESIZE
***************
*** 188,193 ****
--- 202,208 ----
100L,
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+ pendingUnlinks = NIL;
}
}
***************
*** 257,267 ****
--- 272,300 ----
* If isRedo is true, it's okay for the relation to be already gone.
* Also, any failure should be reported as WARNING not ERROR, because
* we are usually not in a transaction anymore when this is called.
+ *
+ * If isRedo is false, the relation is actually just truncated to release
+ * the space, and left to linger as an empty file until the next checkpoint.
+ * This is to avoid reusing the same relfilenode for a new relation file,
+ * until the commit record containing the deletion has been flushed out.
+ *
+ * The scenario we're trying to protect from is this:
+ * After crash, we replay the commit WAL record of a transaction that
+ * deleted a relation, like DROP TABLE. But before the crash, after deleting
+ * the old relation, we created a new one, and it happened to get the same
+ * relfilenode as the deleted relation (OID must've wrapped around for
+ * that to happen). Now replaying the deletion of the old relation
+ * deletes the new one instead, because they had the same relfilenode.
+ * Normally the new relation would be recreated later in the WAL replay, but
+ * if we relied on fsyncing the file after populating it, like CLUSTER and
+ * CREATE INDEX do, for example, the contents of the file would be lost
+ * forever.
*/
void
mdunlink(RelFileNode rnode, bool isRedo)
{
char *path;
+ int ret;
/*
* We have to clean out any pending fsync requests for the doomed relation,
***************
*** 271,278 ****
path = relpath(rnode);
! /* Delete the first segment, or only segment if not doing segmenting */
! if (unlink(path) < 0)
{
if (!isRedo || errno != ENOENT)
ereport(WARNING,
--- 304,318 ----
path = relpath(rnode);
! /*
! * Delete or truncate the first segment, or only segment if not doing
! * segmenting
! */
! if (!isRedo)
! ret = truncate(path, 0);
! else
! ret = unlink(path);
! if (ret < 0)
{
if (!isRedo || errno != ENOENT)
ereport(WARNING,
***************
*** 316,321 ****
--- 356,364 ----
#endif
pfree(path);
+
+ if (!isRedo)
+ register_unlink(rnode);
}
/*
***************
*** 1096,1114 ****
}
}
/*
* RememberFsyncRequest() -- callback from bgwriter side of fsync request
*
! * We stuff the fsync request into the local hash table for execution
* during the bgwriter's next checkpoint.
*
* The range of possible segment numbers is way less than the range of
* BlockNumber, so we can reserve high values of segno for special purposes.
! * We define two: FORGET_RELATION_FSYNC means to cancel pending fsyncs for
! * a relation, and FORGET_DATABASE_FSYNC means to cancel pending fsyncs for
! * a whole database. (These are a tad slow because the hash table has to be
! * searched linearly, but it doesn't seem worth rethinking the table structure
! * for them.)
*/
void
RememberFsyncRequest(RelFileNode rnode, BlockNumber segno)
--- 1139,1268 ----
}
}
+
+ /*
+ * register_unlink() -- Schedule a relation to be deleted after next checkpoint
+ */
+ static void
+ register_unlink(RelFileNode rnode)
+ {
+ if (pendingOpsTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(rnode, UNLINK_RELATION_REQUEST);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /*
+ * Notify the bgwriter about it. If we fail to queue the revoke
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we just leave the file orphaned instead?
+ */
+ while (!ForwardFsyncRequest(rnode, UNLINK_RELATION_REQUEST))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+ }
+
+ /*
+ * mdcheckpointbegin() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started and those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* RedoRecPtr is determined.
+ */
+ void
+ mdcheckpointbegin()
+ {
+ PendingUnlinkEntry *entry;
+ ListCell *cell;
+
+ /*
+ * Just in case the prior checkpoint failed, stamp all entries in
+ * the list with the current cycle counter. Anything that's in the
+ * list at the start of checkpoint can surely be deleted after the
+ * checkpoint is finished, regardless of when the request was made.
+ */
+ foreach(cell, pendingUnlinks)
+ {
+ entry = lfirst(cell);
+ entry->cycle_ctr = mdunlink_cycle_ctr;
+ }
+
+ /*
+ * Any unlink requests arriving after this point will be assigned the
+ * next cycle counter, and won't be unlinked until next checkpoint.
+ */
+ mdunlink_cycle_ctr++;
+ }
+
+ /*
+ * mdcheckpointend() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+ void
+ mdcheckpointend()
+ {
+ while(pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new
+ * we've reached the end of old entries.
+ */
+ if (entry->cycle_ctr == mdsync_cycle_ctr)
+ break;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == mdunlink_cycle_ctr);
+
+ /* Unlink the file */
+ path = relpath(entry->rnode);
+ if (unlink(path) < 0)
+ {
+ /*
+ * ENOENT shouldn't happen either, but it doesn't really matter
+ * because we would've deleted it now anyway.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove relation %u/%u/%u: %m",
+ entry->rnode.spcNode,
+ entry->rnode.dbNode,
+ entry->rnode.relNode)));
+ }
+ pfree(path);
+
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ }
+ }
+
/*
* RememberFsyncRequest() -- callback from bgwriter side of fsync request
*
! * We stuff normal fsync requests into the local hash table for execution
* during the bgwriter's next checkpoint.
*
* The range of possible segment numbers is way less than the range of
* BlockNumber, so we can reserve high values of segno for special purposes.
! * We define three:
! * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation
! * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
! * - UNLINK_RELATION_REQUEST is a request to delete a lingering file after next
! * checkpoint. These are stuffed into a linked list separate from
! * the normal fsync requests.
! *
! * (Handling the FORGET_* requests is a tad slow because the hash table has to
! * be searched linearly, but it doesn't seem worth rethinking the table
! * structure for them.)
*/
void
RememberFsyncRequest(RelFileNode rnode, BlockNumber segno)
***************
*** 1147,1152 ****
--- 1301,1322 ----
}
}
}
+ else if (segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt;
+ PendingUnlinkEntry *entry;
+
+ oldcxt = MemoryContextSwitchTo(MdCxt);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ memcpy(&entry->rnode, &rnode, sizeof(RelFileNode));
+ entry->cycle_ctr = mdunlink_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
else
{
/* Normal case: enter a request to fsync this segment */
Index: src/include/storage/smgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/smgr.h,v
retrieving revision 1.59
diff -c -r1.59 smgr.h
*** src/include/storage/smgr.h 5 Sep 2007 18:10:48 -0000 1.59
--- src/include/storage/smgr.h 18 Oct 2007 09:52:43 -0000
***************
*** 110,115 ****
--- 110,118 ----
extern void ForgetRelationFsyncRequests(RelFileNode rnode);
extern void ForgetDatabaseFsyncRequests(Oid dbid);
+ extern void mdcheckpointbegin(void);
+ extern void mdcheckpointend(void);
+
/* smgrtype.c */
extern Datum smgrout(PG_FUNCTION_ARGS);
extern Datum smgrin(PG_FUNCTION_ARGS);
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
Here's an updated version of the patch. There was a bogus assertion in
the previous one, comparing against mdsync_cycle_ctr instead of
mdunlink_cycle_ctr.
Applied with minor corrections.
I'm not sure whether we should consider back-patching this. Thoughts?
regards, tom lane