Snapbuild woes followup
In SnapBuildFinalSnapshot(), we have this comment:
/*
* c) transition from START to BUILDING_SNAPSHOT.
*
* In START state, and a xl_running_xacts record with running xacts is
* encountered. In that case, switch to BUILDING_SNAPSHOT state, and
* record xl_running_xacts->nextXid. Once all running xacts have finished
* (i.e. they're all >= nextXid), we have a complete catalog snapshot. It
* might look that we could use xl_running_xact's ->xids information to
* get there quicker, but that is problematic because transactions marked
* as running, might already have inserted their commit record - it's
* infeasible to change that with locking.
*/
This was added in commit 955a684e040; before that we did in fact use the
xl_running_xacts list of XIDs, but it was buggy. Commit 955a684e040
fixed that by waiting for *two* xl_running_xacts, such that the second
xl_running_xact doesn't contain any of the XIDs from the first one.
To fix the case mentioned in that comment, where a transaction listed in
xl_running_xacts is in fact already committed or aborted, wouldn't it be
more straightforward to check each XID to see whether it is in fact already
committed or aborted? The CLOG is up-to-date here, I believe.
I bumped into this, after I noticed that read_local_xlog_page() has a
pretty busy polling loop, with just 1 ms delay to keep it from hogging
CPU. I tried to find the call sites where we might get into that loop,
and found the snapshot building code to do that: the
'delayed_startup' regression test in contrib/test_decoding exercises it.
In a primary server, SnapBuildWaitSnapshot() inserts a new running-xacts
record, and then read_local_xlog_page() will poll until that record has
been flushed. We could add an explicit XLogFlush() there to avoid the
wait. However, if I'm reading the code correctly, in a standby server,
we can't write a new running-xacts record so we just wait for one that's
created periodically by the bgwriter in the primary. That can take several
seconds. Or indefinitely, if the standby isn't connected to the primary
at the moment. Would be nice to not poll.
- Heikki
P.S. There's this in SnapBuildNextPhaseAt():
/*
* For backward compatibility reasons this has to be stored in the wrongly
* named field. Will be fixed in next major version.
*/
return builder->was_running.was_xmax;
We could fix that now... Andres, what did you have in mind for a proper
name?
P.P.S. Two thinkos in comments in snapbuild.c: s/write for/wait for/.
Hi,
Thomas, CCing you because of the condvar bit below.
On 2021-01-25 19:28:33 +0200, Heikki Linnakangas wrote:
In SnapBuildFinalSnapshot(), we have this comment:
/*
* c) transition from START to BUILDING_SNAPSHOT.
*
* In START state, and a xl_running_xacts record with running xacts is
* encountered. In that case, switch to BUILDING_SNAPSHOT state, and
* record xl_running_xacts->nextXid. Once all running xacts have finished
* (i.e. they're all >= nextXid), we have a complete catalog snapshot. It
* might look that we could use xl_running_xact's ->xids information to
* get there quicker, but that is problematic because transactions marked
* as running, might already have inserted their commit record - it's
* infeasible to change that with locking.
*/
This was added in commit 955a684e040, before that we did in fact use the
xl_running_xacts list of XIDs, but it was buggy. Commit 955a684e040 fixed
that by waiting for *two* xl_running_xacts, such that the second
xl_running_xact doesn't contain any of the XIDs from the first one.
Not really just that, but we also just don't believe ->xids to be
consistent for visibility purposes...
To fix the case mentioned in that comment, where a transaction listed in
xl_running_xacts is in fact already committed or aborted, wouldn't it be
more straightforward to check each XID, if they are in fact already
committed or aborted? The CLOG is up-to-date here, I believe.
Well, we can't *just* go to the clog, since that will show transactions
as committed/aborted even when they are not yet visible, due to still being in
the procarray. And I don't think it's easy to figure out how to
crosscheck between clog and procarray in this instance (since we don't
have the past procarray). This is different in the recovery path because
there we know that changes to the procarray / knownassignedxids
machinery are only done by one backend.
I bumped into this, after I noticed that read_local_xlog_page() has a pretty
busy polling loop, with just 1 ms delay to keep it from hogging CPU.
Hm - but why is that really related to the initial snapshot building
logic? Logical decoding constantly waits for WAL outside of that too,
no?
ISTM that we should improve the situation substantially in a fairly easy
way. Like:
1) Improve ConditionVariableBroadcast() so it doesn't take the spinlock
if there are no waiters - afaict that's pretty trivial.
2) Replace WalSndWakeup() with ConditionVariableBroadcast().
3) Replace places that need to wait for new WAL to be written with a
call to function doing something like
XLogRecPtr flushed_to = GetAppropriateFlushRecPtr(); // works for both normal / recovery
if (flush_requirement <= flushed_to)
break;
ConditionVariablePrepareToSleep(&XLogCtl->flush_progress_cv);
while (true)
{
flushed_to = GetAppropriateFlushRecPtr();
if (flush_requirement <= flushed_to)
break;
ConditionVariableSleep(&XLogCtl->flush_progress_cv);
}
this should end up being more efficient than the current WalSndWakeup()
mechanism because we'll only wake up the processes that need to be woken
up, rather than checking/setting each walsenders latch.
P.S. There's this in SnapBuildNextPhaseAt():
/*
* For backward compatibility reasons this has to be stored in the wrongly
* named field. Will be fixed in next major version.
*/
return builder->was_running.was_xmax;
We could fix that now... Andres, what did you have in mind for a proper
name?
next_phase_at seems like it'd do the trick?
P.P.S. Two thinkos in comments in snapbuild.c: s/write for/wait for/.
Will push a fix.
Greetings,
Andres Freund
Hi,
On 2021-01-25 12:00:08 -0800, Andres Freund wrote:
/*
* For backward compatibility reasons this has to be stored in the wrongly
* named field. Will be fixed in next major version.
*/
return builder->was_running.was_xmax;
We could fix that now... Andres, what did you have in mind for a proper
name?
next_phase_at seems like it'd do the trick?
See attached patch...
Attachments:
0001-Remove-backwards-compat-ugliness-in-snapbuild.c.patch (text/x-diff; charset=us-ascii)
From 5c7d735d71beae1f6e6616f1a160543fae3ac91b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 25 Jan 2021 12:47:04 -0800
Subject: [PATCH] Remove backwards compat ugliness in snapbuild.c.
In 955a684e040 we fixed a bug in initial snapshot creation. In the
course of which several members of struct SnapBuild were obsoleted. As
SnapBuild is serialized to disk we couldn't change the memory layout.
Unfortunately I subsequently forgot about removing the backward compat
gunk, but luckily Heikki just reminded me.
This commit bumps SNAPBUILD_VERSION, therefore breaking existing
slots (which is fine in a major release).
Author: Andres Freund
Reminded-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c94be044-818f-15e3-1ad3-7a7ae2dfed0a@iki.fi
---
src/backend/replication/logical/snapbuild.c | 103 ++++----------------
1 file changed, 18 insertions(+), 85 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e903e561afc..752cf2d7dbc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -189,24 +189,11 @@ struct SnapBuild
ReorderBuffer *reorder;
/*
- * Outdated: This struct isn't used for its original purpose anymore, but
- * can't be removed / changed in a minor version, because it's stored
- * on-disk.
+ * TransactionId at which the next phase of initial snapshot building will
+ * happen. InvalidTransactionId if not known (i.e. SNAPBUILD_START), or
+ * when no next phase necessary (SNAPBUILD_CONSISTENT).
*/
- struct
- {
- /*
- * NB: This field is misused, until a major version can break on-disk
- * compatibility. See SnapBuildNextPhaseAt() /
- * SnapBuildStartNextPhaseAt().
- */
- TransactionId was_xmin;
- TransactionId was_xmax;
-
- size_t was_xcnt; /* number of used xip entries */
- size_t was_xcnt_space; /* allocated size of xip */
- TransactionId *was_xip; /* running xacts array, xidComparator-sorted */
- } was_running;
+ TransactionId next_phase_at;
/*
* Array of transactions which could have catalog changes that committed
@@ -272,34 +259,6 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
-/*
- * Return TransactionId after which the next phase of initial snapshot
- * building will happen.
- */
-static inline TransactionId
-SnapBuildNextPhaseAt(SnapBuild *builder)
-{
- /*
- * For backward compatibility reasons this has to be stored in the wrongly
- * named field. Will be fixed in next major version.
- */
- return builder->was_running.was_xmax;
-}
-
-/*
- * Set TransactionId after which the next phase of initial snapshot building
- * will happen.
- */
-static inline void
-SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
-{
- /*
- * For backward compatibility reasons this has to be stored in the wrongly
- * named field. Will be fixed in next major version.
- */
- builder->was_running.was_xmax = at;
-}
-
/*
* Allocate a new snapshot builder.
*
@@ -728,7 +687,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
* we got into the SNAPBUILD_FULL_SNAPSHOT state.
*/
if (builder->state < SNAPBUILD_CONSISTENT &&
- TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder)))
+ TransactionIdPrecedes(xid, builder->next_phase_at))
return false;
/*
@@ -945,7 +904,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
*/
if (builder->state == SNAPBUILD_START ||
(builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
- TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
+ TransactionIdPrecedes(xid, builder->next_phase_at)))
{
/* ensure that only commits after this are getting replayed */
if (builder->start_decoding_at <= lsn)
@@ -1267,7 +1226,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
Assert(TransactionIdIsNormal(builder->xmax));
builder->state = SNAPBUILD_CONSISTENT;
- SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+ builder->next_phase_at = InvalidTransactionId;
ereport(LOG,
(errmsg("logical decoding found consistent point at %X/%X",
@@ -1299,7 +1258,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
else if (builder->state == SNAPBUILD_START)
{
builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
- SnapBuildStartNextPhaseAt(builder, running->nextXid);
+ builder->next_phase_at = running->nextXid;
/*
* Start with an xmin/xmax that's correct for future, when all the
@@ -1331,11 +1290,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
* be decoded. Switch to FULL_SNAPSHOT.
*/
else if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
- TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
+ TransactionIdPrecedesOrEquals(builder->next_phase_at,
running->oldestRunningXid))
{
builder->state = SNAPBUILD_FULL_SNAPSHOT;
- SnapBuildStartNextPhaseAt(builder, running->nextXid);
+ builder->next_phase_at = running->nextXid;
ereport(LOG,
(errmsg("logical decoding found initial consistent point at %X/%X",
@@ -1356,11 +1315,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
* collected. Switch to CONSISTENT.
*/
else if (builder->state == SNAPBUILD_FULL_SNAPSHOT &&
- TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
+ TransactionIdPrecedesOrEquals(builder->next_phase_at,
running->oldestRunningXid))
{
builder->state = SNAPBUILD_CONSISTENT;
- SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+ builder->next_phase_at = InvalidTransactionId;
ereport(LOG,
(errmsg("logical decoding found consistent point at %X/%X",
@@ -1463,7 +1422,7 @@ typedef struct SnapBuildOnDisk
offsetof(SnapBuildOnDisk, version)
#define SNAPBUILD_MAGIC 0x51A1E001
-#define SNAPBUILD_VERSION 2
+#define SNAPBUILD_VERSION 3
/*
* Store/Load a snapshot from disk, depending on the snapshot builder's state.
@@ -1508,6 +1467,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
if (builder->state < SNAPBUILD_CONSISTENT)
return;
+ /* consistent snapshots have no next phase */
+ Assert(builder->next_phase_at == InvalidTransactionId);
+
/*
* We identify snapshots by the LSN they are valid for. We don't need to
* include timelines in the name as each LSN maps to exactly one timeline
@@ -1596,9 +1558,6 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
&ondisk->builder,
sizeof(SnapBuild));
- /* there shouldn't be any running xacts */
- Assert(builder->was_running.was_xcnt == 0);
-
/* copy committed xacts */
sz = sizeof(TransactionId) * builder->committed.xcnt;
memcpy(ondisk_c, builder->committed.xip, sz);
@@ -1801,34 +1760,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
}
COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
- /* restore running xacts (dead, but kept for backward compat) */
- sz = sizeof(TransactionId) * ondisk.builder.was_running.was_xcnt_space;
- ondisk.builder.was_running.was_xip =
- MemoryContextAllocZero(builder->context, sz);
- pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
- readBytes = read(fd, ondisk.builder.was_running.was_xip, sz);
- pgstat_report_wait_end();
- if (readBytes != sz)
- {
- int save_errno = errno;
-
- CloseTransientFile(fd);
-
- if (readBytes < 0)
- {
- errno = save_errno;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m", path)));
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read file \"%s\": read %d of %zu",
- path, readBytes, sz)));
- }
- COMP_CRC32C(checksum, ondisk.builder.was_running.was_xip, sz);
-
/* restore committed xacts information */
sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
@@ -1890,6 +1821,8 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
if (TransactionIdPrecedes(ondisk.builder.xmin, builder->initial_xmin_horizon))
goto snapshot_not_interesting;
+ /* consistent snapshots have no next phase */
+ Assert(ondisk.builder.next_phase_at == InvalidTransactionId);
/* ok, we think the snapshot is sensible, copy over everything important */
builder->xmin = ondisk.builder.xmin;
--
2.29.2.540.g3cf59784d4
On Tue, Jan 26, 2021 at 2:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-01-25 12:00:08 -0800, Andres Freund wrote:
/*
* For backward compatibility reasons this has to be stored in the wrongly
* named field. Will be fixed in next major version.
*/
return builder->was_running.was_xmax;
We could fix that now... Andres, what did you have in mind for a proper
name?
next_phase_at seems like it'd do the trick?
The new proposed name sounds good to me.
See attached patch...
LGTM.
--
With Regards,
Amit Kapila.
On 2021-Jan-25, Andres Freund wrote:
See attached patch...
Looks good to me.
I was wondering if there would be a point in using a FullTransactionId
instead of an unadorned one. I don't know what's the true risk of
an Xid wraparound occurring here, but it seems easier to reason about.
But then that's probably a larger change to make all of snapbuild use
FullTransactionIds, so not for this patch to worry about.
--
Álvaro Herrera 39°49'30"S 73°17'W
Hi,
On 2021-01-29 14:04:47 +0530, Amit Kapila wrote:
On Tue, Jan 26, 2021 at 2:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-01-25 12:00:08 -0800, Andres Freund wrote:
/*
* For backward compatibility reasons this has to be stored in the wrongly
* named field. Will be fixed in next major version.
*/
return builder->was_running.was_xmax;
We could fix that now... Andres, what did you have in mind for a proper
name?
next_phase_at seems like it'd do the trick?
The new proposed name sounds good to me.
And pushed.
See attached patch...
LGTM.
Thanks for looking over - should have added your name to reviewed-by,
sorry...
Greetings,
Andres Freund