Unlogged relation copy is not fsync'd
I noticed another missing fsync() with unlogged tables, similar to the
one at [1]. RelationCopyStorage does this:
/*
* When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is that since we're copying outside shared buffers, a CHECKPOINT
* occurring during the copy has no way to flush the previously written
* data to disk (indeed it won't know the new rel even exists). A crash
* later on would replay WAL from the checkpoint, therefore it wouldn't
* replay our earlier WAL entries. If we do not fsync those pages here,
* they might still not be on disk when the crash occurs.
*/
if (use_wal || copying_initfork)
smgrimmedsync(dst, forkNum);
That 'copying_initfork' condition is wrong. The first hint of that is
that 'use_wal' is always true for an init fork. I believe this was meant
to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise,
this bad thing can happen:
1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power
When you recover, the unlogged table is not reset, because it was a
clean postgres shutdown. But the part of the file that was copied after
the checkpoint at step 3 was never fsync'd, so part of the file is lost.
I was able to reproduce with a VM that I kill to simulate step 5.
This is the same scenario that the smgrimmedsync() call above protects
from for WAL-logged relations. But we got the condition wrong: instead
of worrying about the init fork, we need to call smgrimmedsync() for all
*other* forks of an unlogged relation.
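To make the failure mode concrete, here is a deliberately simplified, hypothetical model of the scenario (standalone C, not PostgreSQL code): "disk" is what survives a power loss, "cache" is the OS page cache, and a checkpoint fsyncs the file only when it has a pending sync request for it (here, the one smgrcreate() registered when the file was created). The names and page counts are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

static bool cache[NPAGES];      /* written via smgrwrite(), maybe not durable */
static bool disk[NPAGES];      /* what survives a power loss */
static bool sync_requested;    /* does the checkpointer know about the file? */

/* A checkpoint fsyncs the file only if a sync request is pending. */
static void
checkpoint(void)
{
    if (sync_requested)
    {
        memcpy(disk, cache, sizeof(disk));
        sync_requested = false;
    }
}

/* Model of RelationCopyStorage() for an unlogged rel's main fork. */
static void
copy_rel(bool fixed_condition)
{
    for (int i = 0; i < NPAGES; i++)
    {
        cache[i] = true;        /* the copy bypasses shared buffers */
        if (i == NPAGES / 2)
            checkpoint();       /* step 3: checkpoint during the copy */
    }

    /*
     * Buggy condition: (use_wal || copying_initfork) -- both false for the
     * main fork of an unlogged rel, so smgrimmedsync() is skipped.
     * Fixed condition: (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED).
     */
    if (fixed_condition)
        memcpy(disk, cache, sizeof(disk));  /* smgrimmedsync() */
}

/* Run steps 1-5 and count the pages lost after the power failure. */
static int
lost_pages_after_copy(bool fixed_condition)
{
    memset(cache, 0, sizeof(cache));
    memset(disk, 0, sizeof(disk));
    sync_requested = true;      /* smgrcreate() registered a dirty segment */
    copy_rel(fixed_condition);
    checkpoint();               /* step 4: clean-shutdown checkpoint */
    /* step 5: power loss -- 'cache' is gone, only 'disk' remains */
    int lost = 0;
    for (int i = 0; i < NPAGES; i++)
        if (!disk[i])
            lost++;
    return lost;
}
```

With the buggy condition, the mid-copy checkpoint flushes the pages written so far and consumes the sync request, so everything copied after it (3 of 8 pages here) is gone after the power loss; with the corrected condition nothing is lost.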
Fortunately the fix is trivial, see attached. I suspect we have similar
problems in a few other places, though. end_heap_rewrite(), _bt_load(),
and gist_indexsortbuild() have similar-looking smgrimmedsync() calls,
with no consideration for unlogged relations at all. I haven't tested
those, but they look wrong to me.
I'm also attaching the scripts I used to reproduce this, although they
will require some manual fiddling to run.
[1]: /messages/by-id/d47d8122-415e-425c-d0a2-e0160829702d@iki.fi
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
fix-unlogged-rel-copy-fsync.patch (text/x-patch)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..ebb80fa822 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,7 +533,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* replay our earlier WAL entries. If we do not fsync those pages here,
* they might still not be on disk when the crash occurs.
*/
- if (use_wal || copying_initfork)
+ if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
smgrimmedsync(dst, forkNum);
}
On Fri, Aug 25, 2023 at 03:47:27PM +0300, Heikki Linnakangas wrote:
That 'copying_initfork' condition is wrong. The first hint of that is that
'use_wal' is always true for an init fork. I believe this was meant to check
for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise, this bad thing
can happen:
1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power
When you recover, the unlogged table is not reset, because it was a clean
postgres shutdown. But the part of the file that was copied after the
checkpoint at step 3 was never fsync'd, so part of the file is lost. I was
able to reproduce with a VM that I kill to simulate step 5.
Oops.
The comment at the top of smgrimmedsync() now looks incorrect to me.
When copying the data from something other than an init fork, the
relation pages are not WAL-logged for an unlogged relation.
Fortunately the fix is trivial, see attached. I suspect we have similar
problems in a few other places, though. end_heap_rewrite(), _bt_load(), and
gist_indexsortbuild() have similar-looking smgrimmedsync() calls, with
consideration for unlogged relations at all. I haven't tested those, but
they look wrong to me.
Oops ^ N. And end_heap_rewrite() directly mentions
RelationCopyStorage().
--
Michael
On Fri, Aug 25, 2023 at 8:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I noticed another missing fsync() with unlogged tables, similar to the
one at [1].
[...]
That 'copying_initfork' condition is wrong. The first hint of that is
that 'use_wal' is always true for an init fork. I believe this was meant
to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'.
[...]
Fortunately the fix is trivial, see attached. I suspect we have similar
problems in a few other places, though. end_heap_rewrite(), _bt_load(),
and gist_indexsortbuild() have similar-looking smgrimmedsync() calls,
with no consideration for unlogged relations at all. I haven't tested
those, but they look wrong to me.
The patch looks reasonable to me. Is this [1] case in hash index build
that I reported but didn't take the time to reproduce similar?
- Melanie
[1]: /messages/by-id/CAAKRu_bPc81M121pOEU7W=+wSWEebiLnrie4NpaFC+kWATFtSA@mail.gmail.com
On Fri, Aug 25, 2023 at 8:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power
When you recover, the unlogged table is not reset, because it was a
clean postgres shutdown. But the part of the file that was copied after
the checkpoint at step 3 was never fsync'd, so part of the file is lost.
I was able to reproduce with a VM that I kill to simulate step 5.
This is the same scenario that the smgrimmedsync() call above protects
from for WAL-logged relations. But we got the condition wrong: instead
of worrying about the init fork, we need to call smgrimmedsync() for all
*other* forks of an unlogged relation.
Fortunately the fix is trivial, see attached.
The general rule throughout the system is that the init-fork of an
unlogged relation is treated the same as a permanent relation: it is
WAL-logged and fsyncd. But the other forks of an unlogged relation are
neither WAL-logged nor fsync'd ... except in the case of a clean
shutdown, when we fsync even that data.
In other words, somehow it feels like we ought to be trying to defer
the fsync here until a clean shutdown actually occurs, instead of
performing it immediately. Admittedly, the bookkeeping seems like a
problem, so maybe this is the best we can do, but it's clearly worse
than what we do in other cases.
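The bookkeeping Robert describes could look something like the following hypothetical sketch (a toy model, not a proposal-quality design and not PostgreSQL code): remember each unlogged-rel fork with unflushed writes in a registry that ordinary checkpoints ignore, and fsync the registered entries only at the clean-shutdown checkpoint. All names here are invented.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RELS 16

/* Hypothetical registry of unlogged-rel forks with unflushed writes. */
static int  dirty_unlogged[MAX_RELS];
static int  ndirty;
static bool durable[MAX_RELS];      /* per-relation durability, for the model */

/* Called instead of smgrimmedsync(): defer the flush, just remember it. */
static void
remember_dirty_unlogged(int relid)
{
    assert(ndirty < MAX_RELS);
    dirty_unlogged[ndirty++] = relid;
}

/* Ordinary checkpoints ignore the registry entirely. */
static void
checkpoint(void)
{
}

/* The clean-shutdown checkpoint flushes every deferred entry. */
static void
shutdown_checkpoint(void)
{
    for (int i = 0; i < ndirty; i++)
        durable[dirty_unlogged[i]] = true;  /* fsync the fork */
    ndirty = 0;
}
```

The hard part in the real system, as noted above, is the bookkeeping: the registry would have to survive across backends and handle drops, truncations, and tablespace moves of the registered relations.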
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 05, 2023 at 02:20:18PM -0400, Robert Haas wrote:
The general rule throughout the system is that the init-fork of an
unlogged relation is treated the same as a permanent relation: it is
WAL-logged and fsyncd. But the other forks of an unlogged relation are
neither WAL-logged nor fsync'd ... except in the case of a clean
shutdown, when we fsync even that data.
In other words, somehow it feels like we ought to be trying to defer
the fsync here until a clean shutdown actually occurs, instead of
performing it immediately. Admittedly, the bookkeeping seems like a
problem, so maybe this is the best we can do, but it's clearly worse
than what we do in other cases.
That's where we usually rely on RegisterSyncRequest() and
register_dirty_segment(), so that the flush of the dirty segments can
happen when it should, but we don't go through shared buffers
when copying all the forks of a relation file across tablespaces.
--
Michael
On 04/09/2023 16:59, Melanie Plageman wrote:
The patch looks reasonable to me. Is this [1] case in hash index build
that I reported but didn't take the time to reproduce similar?
[1]: /messages/by-id/CAAKRu_bPc81M121pOEU7W=+wSWEebiLnrie4NpaFC+kWATFtSA@mail.gmail.com
Yes, I think you're right. Any caller of smgrwrite() must either:
a) Call smgrimmedsync(), after smgrwrite() and before the relation is
visible to other transactions. Regardless of the 'skipFsync' parameter!
I don't think this is ever completely safe unless it's a new relation.
Like you pointed out with the hash index case. Or:
b) Hold the buffer lock, so that if a checkpoint happens, it cannot
"race past" the page without seeing the sync request.
The comment on smgrwrite() doesn't make this clear. It talks about
'skipFsync', and gives the impression that as long as you pass
'skipFsync=false', you don't need to worry about fsyncing. But that is
not true. Even in these bulk loading cases where we currently call
smgrimmedsync(), we would *still* need to call smgrimmedsync() if we
used 'skipFsync=false'.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 05/09/2023 21:20, Robert Haas wrote:
In other words, somehow it feels like we ought to be trying to defer
the fsync here until a clean shutdown actually occurs, instead of
performing it immediately.
+1
Admittedly, the bookkeeping seems like a problem, so maybe this is
the best we can do, but it's clearly worse than what we do in other
cases.
I think we can do it, I have been working on a patch along those lines
on the side. But I want to focus on a smaller, backpatchable fix in this
thread.
Thinking about this some more, I think this is still not 100% correct,
even with the patch I posted earlier:
/*
* When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is that since we're copying outside shared buffers, a CHECKPOINT
* occurring during the copy has no way to flush the previously written
* data to disk (indeed it won't know the new rel even exists). A crash
* later on would replay WAL from the checkpoint, therefore it wouldn't
* replay our earlier WAL entries. If we do not fsync those pages here,
* they might still not be on disk when the crash occurs.
*/
if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
smgrimmedsync(dst, forkNum);
Let's walk through each case one by one:
1. Temporary table. No fsync() needed. This is correct.
2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL,
and also because we bypassed the buffer manager. Correct.
3. Unlogged rel, init fork. Needs to be fsync'd, even though we
WAL-logged it, because we bypassed the buffer manager. Like the comment
explains. This is now correct with the patch.
4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
WAL-logged it, because we bypassed the buffer manager. Like the comment
explains. Correct.
5. Permanent rel, use_wal == false. We skip fsync, because it means that
it's a new relation, so we have a sync request pending for it. (An
assertion for that would be nice). At the end of transaction, in
smgrDoPendingSyncs(), we will either fsync it or we will WAL-log all the
pages if it was a small relation. I believe this is *wrong*. If
smgrDoPendingSyncs() decides to WAL-log the pages, we have the same race
condition that's explained in the comment, because we bypassed the
buffer manager:
1. RelationCopyStorage() copies some of the pages.
2. Checkpoint happens, which fsyncs the relation (smgrcreate()
registered a dirty segment when the relation was created)
3. RelationCopyStorage() copies the rest of the pages.
4. smgrDoPendingSyncs() WAL-logs all the pages.
5. Another checkpoint happens. It does *not* fsync the relation.
6. Crash.
WAL replay will not see the WAL-logged pages, because they were
WAL-logged before the last checkpoint. And the contents were not fsync'd
either.
In other words, we must do smgrimmedsync() here for permanent relations,
even if use_wal==false, because we bypassed the buffer manager. Same
reason we need to do it with use_wal==true.
For a backportable fix, I think we should change this to only exempt
temporary tables, and call smgrimmedsync() for all other cases. Maybe
would be safe to skip it in some cases, but it feels too dangerous to be
clever here. The other similar callers of smgrimmedsync() in nbtsort.c,
gistbuild.c, and rewriteheap.c need similar treatment.
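As a sanity check on that conclusion, the case-by-case walkthrough above can be encoded as a small standalone C table (again a model, not PostgreSQL code): needs_sync records the verdict argued for each case in this message, and the proposed backportable rule "sync everything except temporary rels" is compared against the earlier patched condition, which misses exactly case 5.

```c
#include <assert.h>
#include <stdbool.h>

enum persistence { TEMP, UNLOGGED, PERMANENT };

struct rel_case
{
    enum persistence persistence;
    bool use_wal;       /* did RelationCopyStorage() WAL-log the pages? */
    bool needs_sync;    /* verdict from the case-by-case walkthrough */
};

static const struct rel_case cases[] = {
    {TEMP,      false, false},  /* 1: temp rel, nothing to protect        */
    {UNLOGGED,  false, true},   /* 2: unlogged main fork, skipped WAL     */
    {UNLOGGED,  true,  true},   /* 3: unlogged init fork, WAL-logged      */
    {PERMANENT, true,  true},   /* 4: permanent rel, use_wal              */
    {PERMANENT, false, true},   /* 5: permanent rel, wal_level=minimal    */
};

/* The earlier patch: sync when WAL-logged or unlogged. */
static bool
patched_rule(const struct rel_case *c)
{
    return c->use_wal || c->persistence == UNLOGGED;
}

/* Proposed backportable rule: sync everything except temp rels. */
static bool
backportable_rule(const struct rel_case *c)
{
    return c->persistence != TEMP;
}
```

The backportable rule agrees with the verdict in all five cases; the patched rule disagrees only in case 5, which is precisely the smgrDoPendingSyncs() race described above.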
--
Heikki Linnakangas
Neon (https://neon.tech)
On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
On 05/09/2023 21:20, Robert Haas wrote:
Thinking about this some more, I think this is still not 100% correct, even
with the patch I posted earlier:
/*
* When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is that since we're copying outside shared buffers, a CHECKPOINT
* occurring during the copy has no way to flush the previously written
* data to disk (indeed it won't know the new rel even exists). A crash
* later on would replay WAL from the checkpoint, therefore it wouldn't
* replay our earlier WAL entries. If we do not fsync those pages here,
* they might still not be on disk when the crash occurs.
*/
if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
smgrimmedsync(dst, forkNum);
Let's walk through each case one by one:
1. Temporary table. No fsync() needed. This is correct.
2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
also because we bypassed the buffer manager. Correct.
Agreed.
3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
it, because we bypassed the buffer manager. Like the comment explains. This
is now correct with the patch.
This has two subtypes:
3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
you wrote.
3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
AddPendingSync(). (RelationCreateStorage() could start calling
AddPendingSync() for this case. I think we chose not to do that because there
will never be additional writes to the init fork, and smgrDoPendingSyncs()
would send the fork to FlushRelationsAllBuffers() even though the fork will
never appear in shared buffers. On the other hand, grouping the sync with the
other end-of-xact syncs could improve efficiency for some filesystems. Also,
the number of distinguishable cases is unpleasantly high.)
4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
WAL-logged it, because we bypassed the buffer manager. Like the comment
explains. Correct.
5. Permanent rel, use_wal == false. We skip fsync, because it means that
it's a new relation, so we have a sync request pending for it. (An assertion
for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
we will either fsync it or we will WAL-log all the pages if it was a small
relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
WAL-log the pages, we have the same race condition that's explained in the
comment, because we bypassed the buffer manager:
1. RelationCopyStorage() copies some of the pages.
2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
dirty segment when the relation was created)
3. RelationCopyStorage() copies the rest of the pages.
4. smgrDoPendingSyncs() WAL-logs all the pages.
smgrDoPendingSyncs() delegates to log_newpage_range(). log_newpage_range()
loads each page into the buffer manager and calls MarkBufferDirty() on each.
Hence, ...
5. Another checkpoint happens. It does *not* fsync the relation.
... I think this will fsync the relation. No?
On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:
/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. Recovery may remove it while
* replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrlocator, INIT_FORKNUM);
}
XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint. XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08). In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks. Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.
On Fri, Sep 15, 2023 at 7:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Thinking about this some more, I think this is still not 100% correct,
even with the patch I posted earlier:
This is marked as needing review, but that doesn't appear to be
correct, because there's this comment, indicating that the patch
requires re-work, and there are also two emails from Noah on the thread
providing further feedback. So it seems this has been waiting for
Heikki or someone else to have time to work on it for the last 3 months.
Hence, marking RwF for now; if someone gets back to it, please reopen.
--
Robert Haas
EDB: http://www.enterprisedb.com