data corruption hazard in reorderbuffer.c
Hackers,
I believe there is a hazard in reorderbuffer.c if a call to write() buffers data rather than flushing it to disk, only for the flush to fail later during close(). The relevant code is in ReorderBufferSerializeTXN(), which iterates over the changes for a transaction, opening transient files as needed, writing the changes to those files with ReorderBufferSerializeChange(), and closing the files when finished with CloseTransientFile(), whose return code is ignored. If ReorderBufferSerializeChange() were fsync()ing the writes, then ignoring failures on close() would likely be ok, or at least in line with what we do elsewhere. But as far as I can see, no call to sync the file is ever made.
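For anyone reading along without the source open, the pattern looks roughly like this (a simplified paraphrase of the spill loop, not the actual code; the open flags and helper calls are reproduced from memory):

/*
 * Simplified paraphrase of ReorderBufferSerializeTXN(), shown only to
 * illustrate where close()'s return value is dropped; not the real source.
 */
static void
serialize_txn_sketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	int			fd = -1;
	XLogSegNo	curOpenSegNo = 0;
	char		path[MAXPGPATH];
	dlist_mutable_iter iter;

	dlist_foreach_modify(iter, &txn->changes)
	{
		ReorderBufferChange *change;

		change = dlist_container(ReorderBufferChange, node, iter.cur);

		/* switch to a new segment file when the change's LSN crosses one */
		if (fd == -1 ||
			!XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size))
		{
			if (fd != -1)
				CloseTransientFile(fd);		/* return value ignored: a deferred
											 * write error is silently lost */

			XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
			/* the real code builds 'path' from the slot name and segment */
			fd = OpenTransientFile(path,
								   O_CREAT | O_WRONLY | O_APPEND | PG_BINARY);
			if (fd < 0)
				ereport(ERROR,
						(errcode_for_file_access(),
						 errmsg("could not open file \"%s\": %m", path)));
		}

		/* write()s the change to fd, but never fsync()s it */
		ReorderBufferSerializeChange(rb, txn, fd, change);
	}

	if (fd != -1)
		CloseTransientFile(fd);				/* return value ignored here too */
}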
I expect a failure here could result in a partially written change in the transient file, perhaps preceded or followed by additional complete or partial changes. Perhaps even worse, a failure could result in a change not being written at all (rather than partially), potentially with preceding and following changes written intact, with no indication that one change is missing.
Upon testing, both of these expectations appear to be true. Skipping some writes but not others easily creates a variety of failures, and for brevity I won't post a patch to demonstrate that here. The following code change causes whole changes, rather than partial ones, to be either written or skipped. After applying this change and running the tests in contrib/test_decoding/, "test toast" shows that an UPDATE command has been silently skipped.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7378beb684..a6c60b81c9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -108,6 +108,7 @@
#include "utils/rel.h"
#include "utils/relfilenodemap.h"
+static bool lose_writes_until_close = false;
/* entry for a hash table we use to map from xid to our transaction state */
typedef struct ReorderBufferTXNByIdEnt
@@ -3523,6 +3524,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Size spilled = 0;
Size size = txn->size;
+ lose_writes_until_close = false;
+
elog(DEBUG2, "spill %u changes in XID %u to disk",
(uint32) txn->nentries_mem, txn->xid);
@@ -3552,7 +3555,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
char path[MAXPGPATH];
if (fd != -1)
+ {
CloseTransientFile(fd);
+ lose_writes_until_close = !lose_writes_until_close;
+ }
XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
@@ -3600,6 +3606,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
if (fd != -1)
CloseTransientFile(fd);
+
+ lose_writes_until_close = false;
}
/*
@@ -3790,7 +3798,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
- if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
+
+ if (lose_writes_until_close)
+ ; /* silently do nothing with buffer contents */
+ else if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
{
int save_errno = errno;
The attached patch catches the failures on close(), but to fix this properly, we should call pg_fsync() before close().
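Concretely, I'm thinking of something along these lines at each spot where the file gets closed (an untested sketch; using data_sync_elevel() here is my guess at the right convention):

if (fd != -1)
{
	/* force the kernel to flush its buffered writes and surface any error */
	if (pg_fsync(fd) != 0)
		ereport(data_sync_elevel(ERROR),
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", path)));

	/* and still check close(), rather than discarding its result */
	if (CloseTransientFile(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", path)));
}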
Any thoughts on this? All the extra fsync activity seems like it would add a fair amount of filesystem burden, though I admit I have not benchmarked that yet. Perhaps we could do something like Thomas's work for commit dee663f7843, where closing files is delayed so that fewer syncs are needed? I'm not sure how much that would help here, and I would like feedback before pursuing anything of that sort.
The relevant code exists back as far as the 9_4_STABLE branch, where commit b89e151054a from 2014 first appears.
Attachments:
v1-0001-No-longer-ignoring-failures-on-file-close.patch
From a66fede89c21bf6c6737581d1baf177047ad8a83 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 15 Jul 2021 12:10:54 -0700
Subject: [PATCH v1] No longer ignoring failures on file close
When writing the changes from transactions as part of logical
replication, the transient files to which transaction changes are
written were being closed without checking the return value from
close(). Fix that.
The return code from write() was already being handled reasonably,
but on filesystems which buffer write()s, the attempt to flush the
buffer during the close() operation may fail, with close() returning
-1 and setting errno. This was not checked, meaning that some or
all of the changes for a transaction could be partially or entirely
missing, resulting in garbled data that cannot be replayed on the
subscriber side, or worse still, silently missing changes.
---
src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7378beb684..3d13c42098 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3552,7 +3552,11 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
char path[MAXPGPATH];
if (fd != -1)
- CloseTransientFile(fd);
+ if (CloseTransientFile(fd))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file: %m")));
+
XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
@@ -3599,7 +3603,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
- CloseTransientFile(fd);
+ if (CloseTransientFile(fd))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file: %m")));
}
/*
@@ -3794,7 +3801,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
int save_errno = errno;
- CloseTransientFile(fd);
+ (void) CloseTransientFile(fd); /* no error checking */
/* if write didn't set errno, assume problem is no disk space */
errno = save_errno ? save_errno : ENOSPC;
--
2.21.1 (Apple Git-122.3)
On Jul 15, 2021, at 1:03 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
Skipping some writes while not others easily creates a variety of failures, and for brevity I won't post a patch to demonstrate that here.
If anybody is curious, one common error I see when simulating a close() skipping partial changes rather than whole ones looks like:
ERROR: got sequence entry 31 for toast chunk 16719 instead of seq 21
(where the exact numbers differ, of course). This symptom has shown up in at least two unsolved user bug reports specifically mentioning replication ([1] and [2] below). That doesn't prove a connection between those reports and this issue, but it makes me wonder.
[1]: /messages/by-id/CA+_m4OBs2aPkjqd1gxnx2ykuTJogYCfq8TZgr1uPP3ZtBTyvew@mail.gmail.com
[2]: /messages/by-id/3bd6953d-3da6-040c-62bf-9522808d5c2f@2ndquadrant.com
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jul 15, 2021, at 1:51 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
one common error I see
Another common error is of the form
ERROR: could not map filenode "base/16384/16413" to relation OID
resulting from a DDL statement having not been written correctly, I think. This, too, has shown up fairly recently in [1], against a 12.3 database. A similar-looking bug was reported a couple of years earlier in [2], for which Andres pushed the fix e9edc1ba0be21278de8f04a068c2fb3504dc03fc and backpatched it as far back as 9.2. Postgres 12.3 was released quite a bit more recently than that, so it appears the bug report from [1] occurred despite having the fix from [2].
[1]: https://www.postgresql-archive.org/BUG-16812-Logical-decoding-error-td6170022.html
[2]: /messages/by-id/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I think it's mostly futile to list all the possible issues this might
have caused - if you skip arbitrary decoded changes, that can trigger
pretty much any bug in reorder buffer. But those bugs can be triggered
by various other issues, of course.
It's hard to say what was the cause, but the "logic" bugs are probably
permanent, while the issues triggered by I/O probably disappear after a
restart?
That being said, I agree this seems like an issue and we should not
ignore I/O errors. I'd bet other places using transient files (like
sorting or hashagg spilling to disk) has the same issue, although in
that case the impact is likely limited to a single query.
I wonder if sync before the close is an appropriate solution, though. It
seems rather expensive, and those files are meant to be "temporary"
(i.e. we don't keep them over restart). So maybe we could ensure the
consistency in a cheaper way - perhaps tracking some sort of checksum
for each file, or something like that?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jul 15, 2021, at 3:32 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I think it's mostly futile to list all the possible issues this might have caused - if you skip arbitrary decoded changes, that can trigger pretty much any bug in reorder buffer. But those bugs can be triggered by various other issues, of course.
I thought that at first, too, but when I searched for bug reports with the given error messages, they all had to do with logical replication or logical decoding. That seems a bit fishy to me. If these bugs were coming from all over the system, why would that be so?
It's hard to say what was the cause, but the "logic" bugs are probably permanent, while the issues triggered by I/O probably disappear after a restart?
If you mean "logic" bugs like passing an invalid file descriptor to close(), then yes, those would be permanent, but I don't believe we have any bugs of that kind.
The consequences of I/O errors are not going to go away after restart. On the subscriber side, if logical replication has replayed less than a full transaction worth of changes without raising an error, the corruption will be permanent.
That being said, I agree this seems like an issue and we should not ignore I/O errors.
I agree.
I'd bet other places using transient files (like sorting or hashagg spilling to disk) have the same issue, although in that case the impact is likely limited to a single query.
I'm not so convinced. In most places, the result is ignored for close()-type operations only when the prior operation failed and we're closing the handle in preparation for raising an error. There are a small number of other places, such as pg_import_system_collations and perform_base_backup, which ignore the result from closing a handle, but those are reading from the handle, not writing to it, so the situation is not comparable.
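To illustrate, the accepted pattern I'm referring to looks roughly like this (a generic sketch of the error path, not code lifted from any particular file):

/* close()'s result may be ignored only because we are already on our
 * way to raising an error for the failed write */
if (write(fd, buf, len) != len)
{
	int			save_errno = errno;

	(void) CloseTransientFile(fd);	/* no check needed; we error out anyway */

	/* if write didn't set errno, assume the problem is no disk space */
	errno = save_errno ? save_errno : ENOSPC;
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not write to file \"%s\": %m", path)));
}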
I believe the oversight in reorderbuffer.c really is a special case.
I wonder if sync before the close is an appropriate solution, though. It seems rather expensive, and those files are meant to be "temporary" (i.e. we don't keep them over restart). So maybe we could ensure the consistency in a cheaper way - perhaps tracking some sort of checksum for each file, or something like that?
I'm open to suggestions of what to do, but thinking of these files as temporary may be what misleads developers into believing they don't have to be treated carefully. The code was first committed in 2014 and, as far as I am aware, nobody else has complained about this before. In part that might be because CloseTransientFile() is less familiar than close() to most developers, and they may assume that it contains its own error handling, not realizing that it returns an error code just like regular close() does.
The point of the reorder buffer is to sort the changes from a single transaction so that they can be replayed in order. The files used for the sorting are temporary, but there is nothing temporary about the failure to include all the changes in the files, as that will have permanent consequences for whoever replays them. If lucky, the attempt to replay the changes will abort because they don't make sense, but I've demonstrated to my own satisfaction that nothing prevents silent data loss if the failure to write changes happens to destroy a complete rather than partial change.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/16/21 1:07 AM, Mark Dilger wrote:
On Jul 15, 2021, at 3:32 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I think it's mostly futile to list all the possible issues this might have caused - if you skip arbitrary decoded changes, that can trigger pretty much any bug in reorder buffer. But those bugs can be triggered by various other issues, of course.
I thought that at first, too, but when I searched for bug reports with the given error messages, they all had to do with logical replication or logical decoding. That seems a bit fishy to me. If these bugs were coming from all over the system, why would that be so?
No, I'm not suggesting those bugs are coming from all over the system.
The theory is that there's a bug in logical decoding / reorder buffer,
with the same symptoms. So it'd affect systems with logical replication.
It's hard to say what was the cause, but the "logic" bugs are probably permanent, while the issues triggered by I/O probably disappear after a restart?
If you mean "logic" bugs like passing an invalid file descriptor to close(), then yes, those would be permanent, but I don't believe we have any bugs of that kind.
No, by logic bugs I mean cases like failing to update the snapshot, losing track of relfilenode changes, and so on, due to failing to consider some case. As opposed to "I/O errors", where the problem is caused by external events.
The consequences of I/O errors are not going to go away after
restart. On the subscriber side, if logical replication has replayed
less than a full transaction worth of changes without raising an
error, the corruption will be permanent.
True, good point. I was thinking about the "could not map relfilenode"
error, which forces the decoding to restart, and on the retry it's
unlikely to hit the same I/O error, so it'll succeed. But you're right
that if it reaches the subscriber and gets committed (e.g. missing a
row), it's permanent.
That being said, I agree this seems like an issue and we should not ignore I/O errors.
I agree.
I'd bet other places using transient files (like sorting or hashagg spilling to disk) have the same issue, although in that case the impact is likely limited to a single query.
I'm not so convinced. In most places, the result is ignored for close()-type operations only when the prior operation failed and we're closing the handle in preparation for raising an error. There are a small number of other places, such as pg_import_system_collations and perform_base_backup, which ignore the result from closing a handle, but those are reading from the handle, not writing to it, so the situation is not comparable.
I believe the oversight in reorderbuffer.c really is a special case.
Hmm, you might be right. I only briefly looked at buffile.c and
tuplestore.c yesterday, and I wasn't sure about the error checking. But
maybe that works fine, now that I look at it, because we don't really
use the files after close().
I wonder if sync before the close is an appropriate solution, though. It seems rather expensive, and those files are meant to be "temporary" (i.e. we don't keep them over restart). So maybe we could ensure the consistency in a cheaper way - perhaps tracking some sort of checksum for each file, or something like that?
I'm open to suggestions of what to do, but thinking of these files as temporary may be what misleads developers into believing they don't have to be treated carefully. The code was first committed in 2014 and, as far as I am aware, nobody else has complained about this before. In part that might be because CloseTransientFile() is less familiar than close() to most developers, and they may assume that it contains its own error handling, not realizing that it returns an error code just like regular close() does.
I don't think anyone thinks corruption of temporary files would not be an issue, but making them fully persistent (which is what fsync would do) just to ensure a piece is not lost seems like overkill to me.
And the file may get corrupted even with fsync(), so I'm wondering if
there's a way to ensure integrity without the fsync (so cheaper).
The scheme I was thinking about is a simple checksum for each BufFile.
When writing a buffer to disk, update the crc32c checksum, and then
check it after reading the file (and error-out if it disagrees).
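Roughly like this, i.e. reusing the usual CRC32C machinery (just a sketch of the idea, not wired into BufFile; the struct and function names are made up):

#include "postgres.h"
#include "port/pg_crc32c.h"

/* hypothetical per-spill-file checksum state; names are made up */
typedef struct SpillChecksum
{
	pg_crc32c	crc;
} SpillChecksum;

static inline void
spill_checksum_init(SpillChecksum *cs)
{
	INIT_CRC32C(cs->crc);
}

/* call for every buffer written to, or later read back from, the file */
static inline void
spill_checksum_update(SpillChecksum *cs, const void *buf, Size len)
{
	COMP_CRC32C(cs->crc, buf, len);
}

/*
 * The writer finalizes its checksum with FIN_CRC32C() and keeps the value;
 * after reading the file back, compare and error out on a mismatch.
 */
static inline void
spill_checksum_verify(SpillChecksum *cs, pg_crc32c expected)
{
	FIN_CRC32C(cs->crc);
	if (!EQ_CRC32C(cs->crc, expected))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("checksum mismatch in logical decoding spill file")));
}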
The point of the reorder buffer is to sort the changes from a single
transaction so that they can be replayed in order. The files used
for the sorting are temporary, but there is nothing temporary about
the failure to include all the changes in the files, as that will
have permanent consequences for whoever replays them. If lucky, the
attempt to replay the changes will abort because they don't make
sense, but I've demonstrated to my own satisfaction that nothing
prevents silent data loss if the failure to write changes happens to
destroy a complete rather than partial change.
Sure, I agree with all of that.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company