BRIN summarization vs. WAL logging
Hi,
In a thread about sequences and sync replication [1], I've explained
that the issue we're observing is due to not waiting for WAL at commit
if the transaction only did nextval(). In that case we don't flush WAL
in RecordTransactionCommit, we don't wait for the sync replica, etc., so
the WAL may get lost in case of a crash.
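For context, the commit-time decision is roughly this (a simplified
paraphrase of RecordTransactionCommit, not the exact code):

#include "postgres.h"
#include "access/xact.h"
#include "access/xlog.h"

/* simplified paraphrase of the commit-time flush decision */
static void
commit_flush_sketch(bool markXidCommitted, bool wrote_xlog)
{
	if (wrote_xlog && markXidCommitted &&
		synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
	{
		/* durable commit: flush the WAL and wait for the sync standby */
		XLogFlush(XactLastRecEnd);
		/* SyncRepWaitForLSN(XactLastRecEnd, true); */
	}
	else if (wrote_xlog)
	{
		/*
		 * No XID assigned (e.g. the xact only did nextval): the WAL is
		 * merely handed to the walwriter, we return to the client before
		 * it's durable, and we never wait for the sync replica.
		 */
		XLogSetAsyncXactLSN(XactLastRecEnd);
	}
}

So with nothing but nextval() the first branch is never taken, and the
commit effectively behaves like synchronous_commit=off.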
As I explained in the other thread, there are various other cases where
a transaction generates WAL but does not have an XID, which is enough to
skip the flush/wait at transaction commit. Some of those cases are
probably fine (e.g. there are comments explaining why this is fine for
the PRUNE record).
But other cases (discovered by running regression tests with extra
logging) looked a bit suspicious - particularly those that write
multiple WAL messages, because what if we lose just some of those?
So I looked at two cases related to BRIN, mostly because those were
fairly simple, and I think at least the brin_summarize_range() is
somewhat broken.
1) brin_desummarize_range()
This is pretty simple, because this function generates a single WAL
record, without waiting for it to be flushed:
DESUMMARIZE pagesPerRange 1, heapBlk 0, page offset 9, blkref #0: ...
But if the cluster/VM/... crashes right after you ran the function (and
it completed just fine, possibly even in an explicit transaction), that
change will get lost. It's not really serious data corruption/loss, and
you can simply run it again, but IMHO it's rather surprising.
Of course, most people are unlikely to run brin_desummarize_range() very
often, so maybe it's acceptable? But then again - if we expect this to
be a very rare operation, why skip the WAL flush at all?
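FWIW adding the flush would be simple - something like this (only a
sketch, argument names made up, and the bookkeeping around the real code
in brin_revmap.c hand-waved away):

#include "postgres.h"
#include "access/brin_xlog.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: emit the DESUMMARIZE record as we do now, but flush it
 * before reporting success, so a crash right after the function returns
 * can't undo the change.
 */
static void
desummarize_wal_sketch(xl_brin_desummarize *xlrec, Buffer revmapbuf, Buffer regbuf)
{
	XLogRecPtr	recptr;

	XLogBeginInsert();
	XLogRegisterData((char *) xlrec, SizeOfBrinDesummarize);
	XLogRegisterBuffer(0, revmapbuf, 0);
	XLogRegisterBuffer(1, regbuf, REGBUF_STANDARD);
	recptr = XLogInsert(RM_BRIN_ID, XLOG_BRIN_DESUMMARIZE);

	PageSetLSN(BufferGetPage(revmapbuf), recptr);
	PageSetLSN(BufferGetPage(regbuf), recptr);

	/* ... END_CRIT_SECTION(), release the buffer locks ... */

	XLogFlush(recptr);			/* make the change durable right away */
}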
2) brin_summarize_range()
Now, the issue I think is more serious, more likely to happen, and
harder to fix. When summarizing a range, we write two WAL records:
INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2
So, what happens if we lose the second WAL record, e.g. due to a crash?
To experiment with this, I wrote a trivial patch (attached) that allows
crashing on WAL message of certain type by simply setting a GUC.
Now, consider this example:
create table t (a int);
insert into t select i from generate_series(1,5000) s(i);
create index on t using brin (a);
select brin_desummarize_range('t_a_idx', 1);
set crash_on_wal_message = 'SAMEPAGE_UPDATE';
select brin_summarize_range('t_a_idx', 5);
PANIC: crashing before 'SAMEPAGE_UPDATE' WAL message
server closed the connection unexpectedly
...
After recovery, this is what we have:
select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx');
... | allnulls | hasnulls | placeholder | value
... -+----------+----------+-------------+-------
... | t | f | t |
(1 row)
So the BRIN tuple is still marked as placeholder, which is a problem
because that means we'll always consider it as matching, making the
bitmap index scan less efficient. And we'll *never* fix this, because
just summarizing the range does nothing:
select brin_summarize_range('t_a_idx', 5);
brin_summarize_range
----------------------
0
(1 row)
So it's still marked as placeholder, and to fix it you have to
explicitly desummarize the range first.
The reason for this seems obvious - only the process that created the
placeholder tuple is expected to mark it as "placeholder=false", but
that change is described by two separate WAL records. And if we lose the
update, the tuple stays marked as a placeholder forever.
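To spell out the sequence, summarization does roughly this (simplified;
brin_doinsert/brin_doupdate are the real function names, the rest is
hand-waved):

/*
 * Rough sketch of summarize_range() (simplified from brin.c):
 *
 *   1. brin_doinsert()  - insert a *placeholder* tuple for the range,
 *                         emitting XLOG_BRIN_INSERT
 *   2. scan the heap pages of the range and build the real summary;
 *      concurrent inserters union their values into the placeholder
 *   3. brin_doupdate()  - replace the placeholder with the real tuple,
 *                         emitting XLOG_BRIN_SAMEPAGE_UPDATE (when it
 *                         fits on the same page)
 *
 * Steps 1 and 3 are two independent WAL records with nothing tying them
 * together, so replay may apply the first and never see the second.
 */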
Of course, this requires a crash while something is summarizing ranges.
But consider that summarization is often done by autovacuum, so it's not
just about hitting this from a manually-executed brin_summarize_range().
I'm not quite sure what to do about this. Just doing XLogFlush() does
not really fix it - it makes the problem less likely, but the root cause
is that the change is described by multiple WAL messages that are not
linked together in any way. We may lose the last message without
noticing, and the flush does not fix that.
I didn't look at the other cases mentioned in [1], but I wouldn't be
surprised if some had a similar issue (e.g. the GIN pending list cleanup
seems like another candidate).
regards
[1]: /messages/by-id/0f827a71-a01b-bcf9-fe77-3047a9d4a93c@enterprisedb.com
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
trigger-wal-crash.patch (text/x-patch)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 17257919dbf..daed8ba3bc8 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -190,6 +190,8 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
xlrec.offnum = oldoff;
+ maybe_crash_on_wal("SAMEPAGE_UPDATE");
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBrinSamepageUpdate);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c9516e03fae..562192dc514 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -84,6 +84,8 @@ bool XactDeferrable;
int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+char *crash_on_wal_message;
+
/*
* CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
* transaction. Currently, it is used in logical decoding. It's possible
@@ -6157,3 +6159,17 @@ xact_redo(XLogReaderState *record)
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
+
+void
+maybe_crash_on_wal(char *msg)
+{
+ /* GUC not set or not the right message */
+ if (strcmp(crash_on_wal_message, msg) != 0)
+ return;
+
+ /* flush whatever was generated in this xact so far */
+ XLogFlush(XactLastRecEnd);
+
+ elog(PANIC, "crashing before '%s' WAL message", msg);
+}
+
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c94f09c645..37bf20e7f04 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -139,6 +139,7 @@ extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool ignore_invalid_pages;
extern bool synchronize_seqscans;
+extern char *crash_on_wal_message;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -4070,6 +4071,17 @@ static struct config_string ConfigureNamesString[] =
check_default_tablespace, NULL, NULL
},
+ {
+ {"crash_on_wal_message", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("crash on this WAL message"),
+ NULL,
+ GUC_IS_NAME
+ },
+ &crash_on_wal_message,
+ "",
+ NULL, NULL, NULL
+ },
+
{
{"temp_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the tablespace(s) to use for temporary tables and sort files."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 02276d3edd5..fa13c60da32 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -487,4 +487,6 @@ extern void CancelBackup(void);
/* in executor/nodeHash.c */
extern size_t get_hash_memory_limit(void);
+extern void maybe_crash_on_wal(char *msg);
+
#endif /* MISCADMIN_H */
On Tue, Jan 25, 2022 at 10:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
1) brin_desummarize_range()
But if the cluster/VM/... crashes right after you ran the function (and
it completed just fine, possibly even in an explicit transaction), that
change will get lost. It's not really serious data corruption/loss, and
you can simply run it again, but IMHO it's rather surprising.
I don't have a big problem with inserting an XLogFlush() here, but I
also don't think there's a hard and fast rule that maintenance
operations have to have full transactional behavior. Because that's
violated in various different ways by various different DDL commands.
CREATE INDEX CONCURRENTLY can leave detritus around if it fails. At
least one variant of ALTER TABLE does a non-MVCC-aware rewrite of the
table contents. COPY FREEZE violates MVCC. Concurrent partition attach
and detach aren't fully serializable with concurrent transactions.
Years ago TRUNCATE didn't have MVCC semantics. We often make small
compromises in these areas for implementation simplicity or other
benefits.
2) brin_summarize_range()
Now, the issue I think is more serious, more likely to happen, and
harder to fix. When summarizing a range, we write two WAL records:
INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2
So, what happens if we lose the second WAL record, e.g. due to a crash?
Ouch. As you say, XLogFlush() won't fix that. The WAL logging scheme
needs to be redesigned.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2022-Jan-26, Robert Haas wrote:
On Tue, Jan 25, 2022 at 10:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
2) brin_summarize_range()
Now, the issue I think is more serious, more likely to happen, and
harder to fix. When summarizing a range, we write two WAL records:
INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2
So, what happens if we lose the second WAL record, e.g. due to a crash?
Ouch. As you say, XLogFlush() won't fix that. The WAL logging scheme
needs to be redesigned.
I'm not sure what a good fix is. I was thinking that maybe if a
placeholder tuple is found during index scan, and the corresponding
process is no longer running, then the index scanner would remove the
placeholder tuple, making the range unsummarized again. However, how
would we know that the process is gone?
Another idea is to use WAL's rm_cleanup: during replay, remember that a
placeholder tuple was seen, then remove the info if we see an update
from the originating process that replaces the placeholder tuple with a
real one; at cleanup time, if the list of remembered placeholder tuples
is nonempty, remove them.
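Roughly something like this is what I have in mind (a hand-wavy sketch,
all names invented; nothing of the sort exists in brin_xlog.c today):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "storage/block.h"
#include "storage/off.h"
#include "storage/relfilenode.h"

/* a placeholder tuple we saw inserted during replay (hypothetical) */
typedef struct PendingPlaceholder
{
	RelFileNode node;			/* index the tuple lives in */
	BlockNumber blkno;
	OffsetNumber offnum;
} PendingPlaceholder;

static List *pending_placeholders = NIL;

/* would be called from brin_xlog_insert() for placeholder tuples */
static void
remember_placeholder(RelFileNode node, BlockNumber blkno, OffsetNumber offnum)
{
	PendingPlaceholder *p = palloc(sizeof(PendingPlaceholder));

	p->node = node;
	p->blkno = blkno;
	p->offnum = offnum;
	pending_placeholders = lappend(pending_placeholders, p);
}

/* would be called from brin_xlog_samepage_update(): placeholder resolved */
static void
forget_placeholder(RelFileNode node, BlockNumber blkno, OffsetNumber offnum)
{
	/* find and delete the matching entry, if any */
}

/* rm_cleanup: recovery is ending, whatever is still listed was orphaned */
static void
brin_xlog_cleanup(void)
{
	/* remove (or de-summarize) each remembered placeholder tuple */
}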
(I vaguely recall we used the WAL rm_cleanup mechanism for something
like this, but we no longer do AFAICS.)
... Oh, but if there is a promotion involved, we may end up with a
placeholder insertion before the promotion and the update afterwards.
That would probably not be handled well.
One thing not completely clear to me is whether this only affects
placeholder tuples. Is it possible to have this problem with regular
BRIN tuples? I think it isn't.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On 1/26/22 19:14, Alvaro Herrera wrote:
On 2022-Jan-26, Robert Haas wrote:
On Tue, Jan 25, 2022 at 10:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
2) brin_summarize_range()
Now, the issue I think is more serious, more likely to happen, and
harder to fix. When summarizing a range, we write two WAL records:
INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2
So, what happens if we lose the second WAL record, e.g. due to a crash?
Ouch. As you say, XLogFlush() won't fix that. The WAL logging scheme
needs to be redesigned.
I'm not sure what a good fix is. I was thinking that maybe if a
placeholder tuple is found during index scan, and the corresponding
process is no longer running, then the index scanner would remove the
placeholder tuple, making the range unsummarized again. However, how
would we know that the process is gone?
Another idea is to use WAL's rm_cleanup: during replay, remember that a
placeholder tuple was seen, then remove the info if we see an update
from the originating process that replaces the placeholder tuple with a
real one; at cleanup time, if the list of remembered placeholder tuples
is nonempty, remove them.
(I vaguely recall we used the WAL rm_cleanup mechanism for something
like this, but we no longer do AFAICS.)
... Oh, but if there is a promotion involved, we may end up with a
placeholder insertion before the promotion and the update afterwards.
That would probably not be handled well.
Right, I think we'd miss those. And can't that happen for regular
recovery too? If the placeholder tuple is before the redo LSN, we'd miss
it too, right? Unless something prevents that.
I think the root cause is that the two WAL records are not linked
together, so we fail to ensure the atomicity of the operation. Trying to
fix this by tracking placeholder tuples seems like merely treating the
symptoms.
The easiest solution would be to link the records by XID, but of course
that goes against the whole placeholder tuple idea - no one could modify
the placeholder tuple in between. Maybe that's a price we have to pay.
One thing not completely clear to me is whether this only affects
placeholder tuples. Is it possible to have this problem with regular
BRIN tuples? I think it isn't.
Yeah. I've been playing with this for a while, trying to cause issues
with non-placeholder tuples. But I think that's fine - once the tuple
becomes non-placeholder, all subsequent updates are part of the
transaction that modifies the table. So if that fails, we don't update
the index tuple.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company