[PATCH] Full support for index LP_DEAD hint bits on standby
Hello, hackers.
[ABSTRACT]
Execution of queries on a hot standby is one of the most popular ways to
scale application workload. Most modern Postgres installations
have two standby nodes for high-availability support, so utilization
of the replicas' CPU seems to be a reasonable idea.
At the same time, some queries (index scans) can be much slower on a
hot standby than on the primary. This happens because the
LP_DEAD index hint bit mechanism is ignored in index scans during
recovery. This is done for good reasons, of course [1]:
* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.
Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.
The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.
This thread is a continuation (and partly a copy-paste) of an old
previous one [4].
[PROBLEM]
The standby itself can set and read hint bits during recovery. Such
bits are even correct according to standby visibility rules. But the
problem is the full-page-write (FPW) WAL records coming from the primary.
Such WAL records could bring hint bits that are invalid according to the
standby's xmin.
So, if we can be sure the scan doesn't see any invalid hint bit from the
primary, the problem is solved. And we will even be able to allow the
standby to set its LP_DEAD bits itself.
The idea is simple: let us WAL-log the hint bits' horizon before the FPW
somehow. It could cause a lot of additional WAL records, however...
But there are ways to avoid it:
1) Send only one `latestRemovedXid` for all tuples marked as dead
during a page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory, and
optimistically skip WAL records with older xid values [5].
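Optimization (2) can be sketched as a minimal self-contained model (all names here are invented for illustration; the real patch would track this per database in shared memory, with proper locking and xid wrap-around handling):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for Postgres' 32-bit transaction id. */
typedef uint32_t TransactionId;

/* Latest horizon already WAL-logged (per database in the real patch). */
static TransactionId latest_logged_horizon = 0;

/*
 * Returns true if a new XLOG_INDEX_HINT_BITS_HORIZON record must be
 * emitted, i.e. the xid of the just-killed tuples is newer than anything
 * logged so far.  Standbys already conflict on the previously logged
 * (newer-or-equal) horizon, so older values can be skipped.
 */
static bool
horizon_record_needed(TransactionId latest_removed_xid)
{
    if (latest_removed_xid <= latest_logged_horizon)
        return false;           /* already covered by an earlier record */
    latest_logged_horizon = latest_removed_xid;
    return true;
}
```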
Such WAL records would cause a lot of recovery conflicts on standbys.
But we can be tricky here: use hint bits only if
hot_standby_feedback is enabled and effective on the standby. If HSF is
effective, conflicts are not possible. If HSF is off, the
standby ignores both the hint bits and the additional conflict resolution. The
major point here is that HSF is just an optimization and has nothing to do
with MVCC correctness.
[DETAILS]
The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define a horizon of xmin required for
standbys snapshot to use LP_DEAD bits for an index scan.
`table_index_fetch_tuple` now returns a `latest_removed_xid` value
in addition to `all_dead`. This value is used to advance
`killedLatestRemovedXid` at the time of updating `killedItems` (see
`IndexHintBitAdvanceLatestRemovedXid`).
The primary sends the value of `killedLatestRemovedXid` in
XLOG_INDEX_HINT_BITS_HORIZON before it marks the page dirty after setting
LP_DEAD bits on the index page (by calling
`MarkBufferDirtyIndexHint`).
The new WAL record is always sent before a possible FPW. Such a
record needs to be sent only if its `latestRemovedXid` is newer than the
one previously sent for the current database (see
`LogIndexHintBitsHorizonIfNeeded`).
There is a new flag in the PGPROC structure -
`indexIgnoreKilledTuples`. If the flag is set to true, standby
queries are going to use LP_DEAD bits in index scans. In such a case the
snapshot is required to satisfy the new horizon pushed by
XLOG_INDEX_HINT_BITS_HORIZON records.
It is safe to set `indexIgnoreKilledTuples` to any value from the
perspective of correctness. But a `true` value could cause recovery
conflicts. It is a kind of compromise: use LP_DEAD bits but be
prepared for XLOG_INDEX_HINT_BITS_HORIZON conflicts, or vice versa.
What is the right way to decide on this compromise? It
is pretty simple: if `hot_standby_feedback` is on and the primary has
confirmed that the feedback is received, then set
`indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`).
While the feedback works as expected, the query will never be
canceled by XLOG_INDEX_HINT_BITS_HORIZON.
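The snapshot-time decision above can be sketched as a tiny model (function name and inputs are invented for illustration; the real check would inspect the walreceiver/feedback state):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Use LP_DEAD bits in index scans (and accept possible
 * XLOG_INDEX_HINT_BITS_HORIZON conflicts) only when hot_standby_feedback
 * is enabled AND the primary has confirmed that our xmin is honored.
 * Otherwise ignore both the hint bits and the conflict resolution.
 */
static bool
snapshot_index_ignore_killed_tuples(bool hot_standby_feedback,
                                    bool feedback_confirmed_by_primary)
{
    return hot_standby_feedback && feedback_confirmed_by_primary;
}
```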
To support cascading standby setups (with a possible break of feedback
chain in the middle) – an additional byte was added to the keep-alive
message of the feedback protocol. This byte is used to make sure our
xmin is honored by primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.
So, this way, it is safe to use LP_DEAD bits received from the primary
when we want to.
And, as a result, it is safe to set LP_DEAD bits on standby.
Even if:
* the primary changes vacuum_defer_cleanup_age
* standby restarted
* standby promoted to the primary
* base backup taken from standby
* standby is serving queries during recovery
– nothing could go wrong here.
Because `HeapTupleIsSurelyDead` (and, as a result, index LP_DEAD) requires
the *heap* hint bits to be already set on the standby. So, the same code that
decides to set hint bits on the heap (which standby has already been doing
for a long time) also decides for the index.
[EVALUATION]
It is not possible to find an ideal performance test for this kind of
optimization.
But there is a possible example in the attachment. It is a standard
pgbench schema with an additional index on balance and random balance
values.
On the primary, the test does the following:
1) transfer some money from one random account among the top 100 richest to
one random account among the top 100 poorest.
2) calculate the amount of money in the top 10 richest and top 10 poorest
accounts (and include an additional field to avoid an index-only scan).
On the standby, only step 2 is used.
The patched version is about 9x faster for standby queries: roughly 4192
TPS versus 455 TPS unpatched on my system. There is no visible difference
on the primary.
To estimate the additional amount of WAL logs, I have checked records
in WAL-segments during different conditions:
(pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l)
- hot_standby_feedback=off - 5181 of 226274 records ~2%
- hot_standby_feedback=on (without load on standby) - 70 of 202594
records ~ 0.03%
- hot_standby_feedback=on (with load on standby) - 17 of 70504 records ~ 0.02%
So, with HSF=on (which is the default value) the WAL increase is not
significant. Also, for HSF=off it should be possible to radically
reduce the number of additional WAL records by using `latestRemovedXid`
from other records (like Heap2/CLEAN) in the "send only newer xid"
optimization (I have skipped it for now for simplicity).
[CONCLUSION]
The only price we pay is a few additional WAL records and some
moderate additional code complexity. But support for hint bits on
standby is a huge advantage for many workloads. I was able to get more
than a 900% performance boost (and it is not surprising: index hint
bits are just a great optimization). And it works for almost all index
types out of the box.
Another major thing here – everything is based on old, well-tested
mechanics: query cancelation because of snapshot conflicts, setting
heap hint bits on standby, hot standby feedback.
[REFERENCES]
[1]: /messages/by-id/7067.1529246768@sss.pgh.pa.us
[2]: /messages/by-id/12843.1529331619@sss.pgh.pa.us
[3]: /messages/by-id/20170428133818.24368.33533@wrigleys.postgresql.org
[4]: /messages/by-id/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ@mail.gmail.com
[5]: /messages/by-id/CANtu0oigC0+H0UkxktyovdLLU67ikM0+Dw3J4EQqiDDeGhcwsQ@mail.gmail.com
Hello, everyone.
Oh, I just realized that it seems like I was too naive to allow
standby to set LP_DEAD bits this way.
There is a possible consistency problem in the case of low
minRecoveryPoint value (because hint bits do not move PageLSN
forward).
Something like this:
LSN=10 STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10)
<-----------minRecoveryPoint will go here
LSN=20 STANDBY DELETES ROW FROM HEAP, INDEX UNTOUCHED (index_lsn=10)
REPLICA SCANS INDEX AND SET hint bits (index_lsn=10)
INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)
CRASH
On crash recovery, a standby will be able to handle queries after
LSN=10. But the index page contains hints bits from the future
(LSN=20).
So, need to think here.
Thanks,
Michail.
Hello, hackers.
I think I was able to fix the issue related to minRecoveryPoint and crash
recovery. To make sure standby will be consistent after crash recovery, we
need to take the current value of minRecoveryPoint into account while
setting LP_DEAD hints (almost the same way as it is done for *heap* hint
bits already).
I have introduced new structure IndexHintBitsData:
-------
/* guaranteed not visible for all backends */
bool all_dead;
/* latest removed xid if known */
TransactionId latest_removed_xid;
/* lsn of page where dead tuple located */
XLogRecPtr page_lsn;
-------
This structure is filled by the `heap_hot_search_buffer` function. Afterwards,
we decide whether to set `kill_prior_tuple` depending on its content
(calling `IsMarkBufferDirtyIndexHintAllowed`).
For primary - it is always safe to set LP_DEAD in index if `all_dead` ==
true.
In the case of standby, we need to check `latest_removed_xid` (if
available) first. If the commit LSN of the latest removed xid is already
lower than minRecoveryPoint (`XLogNeedsFlush`), it is safe to set
`kill_prior_tuple`.
Sometimes we are not sure about the latest removed xid - the heap record could
have been marked dead by an XLOG_HEAP2_CLEAN record, for example. In such a case
we check the LSN of the *heap* page containing the tuple (the LSN could have
been updated by other transactions already - but it does not matter in that
situation). If the page LSN is lower than minRecoveryPoint, it is safe to set
LP_DEAD in the index too. Otherwise, just leave the index tuple alive.
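That decision can be modeled with a small self-contained sketch (names and the raw LSN comparison are invented for illustration; the real code would use XLogNeedsFlush() against the commit record's LSN instead):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for Postgres types. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * On standby, setting LP_DEAD is safe only if the relevant LSN is already
 * covered by minRecoveryPoint, so the hint cannot "come from the future"
 * after crash recovery.  Prefer the commit LSN of the removing transaction;
 * if it is unknown (e.g. the tuple was pruned by XLOG_HEAP2_CLEAN), fall
 * back to the heap page LSN.
 */
static bool
index_lp_dead_allowed(XLogRecPtr commit_lsn_of_remover,  /* may be invalid */
                      XLogRecPtr heap_page_lsn,
                      XLogRecPtr min_recovery_point)
{
    XLogRecPtr lsn = (commit_lsn_of_remover != InvalidXLogRecPtr)
        ? commit_lsn_of_remover
        : heap_page_lsn;

    return lsn <= min_recovery_point;
}
```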
So, to bring it all together:
* Normal operation, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPI because
of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
It is safe for standby to set its index hint bits because
`ComputeXidHorizons` honors other read-only procs xmin and lowest xid on
primary (`KnownAssignedXidsGetOldestXmin`).
* Normal operation, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
* Crash recovery, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPW because
XLOG_INDEX_HINT_BITS_HORIZON is always logged before the FPI, and the commit
record of the transaction that removed the tuple is logged before
XLOG_INDEX_HINT_BITS_HORIZON. So, if an FPI with hints was flushed (and taken
into account by minRecoveryPoint), both the transaction-remover and the
horizon records are replayed before the reading queries.
It is safe for standby to use its own hint bits because they can be set
only if the commit record of the transaction-remover is lower than
minRecoveryPoint or the LSN of the heap page with the removed tuples is lower
than minRecoveryPoint.
* Crash recovery, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
So, now it seems correct to me.
Another interesting point here - the position of minRecoveryPoint now affects
performance a lot. It happens already (because of *heap* hint bits),
but after the patch it is even more noticeable. Is there any reason to keep
minRecoveryPoint at a low value?
Rebased and updated patch in attachment.
Will be happy if someone could recheck my ideas or even the code :)
Thanks a lot,
Michail.
Hello, everyone.
After some correspondence with Peter Geoghegan (1) and his ideas, I
have reworked the patch a lot and now it is much more simple with even
better performance (no new WAL or conflict resolution, hot standby
feedback is unrelated).
The idea is pretty simple now: let's mark pages whose LP_DEAD hints are
"standby-safe" with a bit in btpo_flags (BTP_LP_SAFE_ON_STANDBY, and
similar flags for gist and hash).
If the standby wants to set LP_DEAD, it checks BTP_LP_SAFE_ON_STANDBY on
the page first; if it is not set, all "primary" hints are removed
first, and then the flag is set (with a memory barrier to avoid memory
ordering issues in concurrent scans).
Also, the standby checks BTP_LP_SAFE_ON_STANDBY before trusting
tuples marked LP_DEAD during a scan.
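The flag protocol can be sketched as a minimal self-contained model (the flag value, ModelPage, and function names are invented for illustration; the real code would operate on actual index pages and issue pg_memory_barrier()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value; the real patch picks a free btpo_flags bit. */
#define BTP_LP_SAFE_ON_STANDBY 0x0400

typedef struct
{
    uint16_t btpo_flags;
    uint16_t dead_items;        /* bitmask standing in for per-item LP_DEAD */
} ModelPage;

/*
 * Standby wants to kill items: if the page still carries untrusted
 * "primary" hints, wipe them first and claim the page as standby-safe.
 */
static void
standby_mark_items_dead(ModelPage *page, uint16_t new_dead)
{
    if (!(page->btpo_flags & BTP_LP_SAFE_ON_STANDBY))
    {
        page->dead_items = 0;   /* discard primary LP_DEAD bits */
        page->btpo_flags |= BTP_LP_SAFE_ON_STANDBY;
    }
    page->dead_items |= new_dead;
}

/* Standby scan: LP_DEAD bits may be trusted only on standby-safe pages. */
static bool
standby_may_ignore_killed(const ModelPage *page)
{
    return (page->btpo_flags & BTP_LP_SAFE_ON_STANDBY) != 0;
}
```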
Of course, it is not so easy. If the standby was promoted (or the primary was
restored from a standby backup), it is still possible to receive an FPI
with such a flag set in the WAL. So, the main problem is still there.
But we can just clear this flag while applying the FPI, because the page
remains dirty after that anyway! It should not cause any checksum,
consistency, or pg_rewind issues, as explained in (2).
Semantically it is the same as setting the hint bit one millisecond after the
FPI was applied (while the page still remains dirty after FPI replay) - and
standby already does it with *heap* hint bits.
Also, the TAP test attached to (2) shows how easy it already is to flush a
hint bit set by the standby and end up with a checksum different from
the primary's.
If standby was promoted (or restored from standby backup) it is safe
to use LP_DEAD with or without BTP_LP_SAFE_ON_STANDBY on a page. But
for accuracy BTP_LP_SAFE_ON_STANDBY is cleared by primary if found.
Also, we should take into account minRecoveryPoint as described in (3)
to avoid consistency issues during crash recovery (see
IsIndexLpDeadAllowed).
Also, as far as I know, there is no practical reason to keep
minRecoveryPoint at a low value. So, there is an optional patch that
moves minRecoveryPoint forward on each xl_running_xacts record (to allow the
standby to set hint bits and LP_DEADs more aggressively). That is about
every 15 s.
There are some graphics showing performance testing results on my PC
in the attachment (test is taken from (4)). Each test was running for
10 minutes.
Additional primary performance is probably just measurement error. But
standby performance gain is huge.
Feel free to ask if you need more proof about correctness.
Thanks,
Michail.
[1]: /messages/by-id/CAH2-Wz=-BoaKgkN-MnKj6hFwO1BOJSA+yLMMO+LRZK932fNUXA@mail.gmail.com
[2]: /messages/by-id/CANtu0oiAtteJ+MpPonBg6WfEsJCKrxuLK15P6GsaGDcYGjefVQ@mail.gmail.com
[3]: /messages/by-id/CANtu0oh28mX5gy5jburH+n1mcczK5_dCQnhbBnCM=Pfqh-A26Q@mail.gmail.com
[4]: /messages/by-id/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7=9Y92+US_WHGOoQevg@mail.gmail.com
I'm trying to review the patch, but not sure if I understand this problem,
please see my comment below.
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
Oh, I just realized that it seems like I was too naive to allow
standby to set LP_DEAD bits this way.
There is a possible consistency problem in the case of low
minRecoveryPoint value (because hint bits do not move PageLSN
forward). Something like this:
LSN=10 STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10)
<-----------minRecoveryPoint will go here
LSN=20 STANDBY DELETES ROW FROM HEAP, INDEX UNTOUCHED (index_lsn=10)
Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by
replaying the commit record. And if the standby happens to crash before the
commit record could be replayed, no query should see the deletion and thus no
hint bit should be set in the index.
REPLICA SCANS INDEX AND SET hint bits (index_lsn=10)
INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)
CRASH
On crash recovery, a standby will be able to handle queries after
LSN=10. But the index page contains hint bits from the future
(LSN=20).
So, need to think here.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello, Antonin.
I'm trying to review the patch, but not sure if I understand this problem,
please see my comment below.
Thanks a lot for your attention. It is strongly recommended to look at
version N3 (1) because it is a much more elegant, easy, and reliable
solution :) But the minRecoveryPoint-related issue affects it anyway.
Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by
replaying the commit record. And if the standby happens to crash before the
commit record could be replayed, no query should see the deletion and thus no
hint bit should be set in the index.
minRecoveryPoint is not affected by replaying the commit record in
most cases. It is updated in a lazy way, something like this:
minRecoveryPoint = max LSN of flushed pages. Version 3 of the patch
contains a code_optional.patch to move minRecoveryPoint more
aggressively to get additional performance on standby (based on
Peter's answer in (2)).
So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS
NEW ROW IN INDEX”; it is just a random event.
Thanks,
Michail.
[1]: /messages/by-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH=cMXZSf+nzvg@mail.gmail.com
[2]: /messages/by-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u=k6igAg@mail.gmail.com (“Also, btw, do you know any reason to keep minRecoveryPoint at a low value?”)
(“Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?”)
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
Hello, Antonin.
I'm trying to review the patch, but not sure if I understand this problem,
please see my comment below.
Thanks a lot for your attention. It is strongly recommended to look at
version N3 (1) because it is a much more elegant, easy, and reliable
solution :) But the minRecoveryPoint-related issue affects it anyway.
Indeed I'm reviewing (1), but I wanted to discuss this particular question in
context, so I replied here.
Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by
replaying the commit record. And if the standby happens to crash before the
commit record could be replayed, no query should see the deletion and thus no
hint bit should be set in the index.
minRecoveryPoint is not affected by replaying the commit record in
most cases. It is updated in a lazy way, something like this:
minRecoveryPoint = max LSN of flushed pages. Version 3 of the patch
contains a code_optional.patch to move minRecoveryPoint more
aggressively to get additional performance on standby (based on
Peter's answer in (2)).
So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS
NEW ROW IN INDEX” it is just a random event.
Michail.
Sorry, I missed the fact that your example can be executed inside BEGIN - END
block, in which case minRecoveryPoint won't advance after each command.
I'll continue my review by replying to (1)
[1]: /messages/by-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH=cMXZSf+nzvg@mail.gmail.com
[2]: /messages/by-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u=k6igAg@mail.gmail.com
(“Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?”)
I'm not an expert in this area (I'm reviewing this patch also to learn more
about recovery and replication), but after a brief research I think that
postgres tries not to update the control file too frequently, see comments in
UpdateMinRecoveryPoint(). I don't know if what you do in code_optional.patch
would be a problem. Actually I think that a commit record should be replayed
more often than XLOG_RUNNING_XACTS, shouldn't it?
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello,
Antonin.
Sorry, I missed the fact that your example can be executed inside BEGIN - END
block, in which case minRecoveryPoint won't advance after each command.
No, the block is not executed as a single transaction; all commands
are separate transactions (see below)
Actually I think that a commit record should be replayed
more often than XLOG_RUNNING_XACTS, shouldn't it?
Yes, but replaying commit records DOES NOT affect minRecoveryPoint in
almost all cases.
UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit
calls XLogFlush only in two cases:
* DropRelationFiles is called (some relations are dropped)
* ForceSyncCommit was used on the primary - a few “heavy” commands, like
DropTableSpace, CreateTableSpace, movedb, etc.
But a “regular” commit record is replayed without XLogFlush and, as a
result, without UpdateMinRecoveryPoint.
So, in practice, minRecoveryPoint is updated in an “async” way
by the checkpointer. This is why it makes sense to update it on
XLOG_RUNNING_XACTS.
Thanks,
Michail.
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
Sorry, I missed the fact that your example can be executed inside BEGIN - END
block, in which case minRecoveryPoint won't advance after each command.
No, the block is not executed as a single transaction, all commands
are separated transactions (see below)
Actually I think that a commit record should be replayed
more often than XLOG_RUNNING_XACTS, shouldn't it?
Yes, but replaying commit records DOES NOT affect minRecoveryPoint in
almost all cases.
UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit
calls XLogFlush only in two cases:
* DropRelationFiles is called (some relation are dropped)
* If ForceSyncCommit was used on primary - few “heavy” commands, like
DropTableSpace, CreateTableSpace, movedb, etc.
But “regular” commit record is replayed without XLogFlush and, as
result, without UpdateMinRecoveryPoint.
result, without UpdateMinRecoveryPoint.
ok, I missed this. Thanks for the explanation.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
After some correspondence with Peter Geoghegan (1) and his ideas, I
have reworked the patch a lot and now it is much more simple with even
better performance (no new WAL or conflict resolution, hot standby
feedback is unrelated).
My review that started in [1] continues here.
(Please note that code.patch does not apply to the current master branch.)
I think I understand your approach now and couldn't find a problem by reading
the code. What I consider worth improving is documentation, both code comments
and nbtree/README. Especially for the problem discussed in [1] it should be
explained what would happen if kill_prior_tuple_min_lsn was not checked.
Also, in IsIndexLpDeadAllowed() you say that invalid
deadness->latest_removed_xid means the following:
/*
* Looks like it is tuple cleared by heap_page_prune_execute,
* we must be sure if LSN of XLOG_HEAP2_CLEAN (or any subsequent
* updates) less than minRecoveryPoint to avoid MVCC failure
* after crash recovery.
*/
However I think there's one more case: if heap_hot_search_buffer() considers
all tuples in the chain to be "surely dead", but
HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason:
/*
* Ignore tuples inserted by an aborted transaction or if the tuple was
* updated/deleted by the inserting transaction.
*
* Look for a committed hint bit, or if no xmin bit is set, check clog.
*/
I think that the dead tuples produced this way should never be visible on the
standby (and even if they were, they would change the page LSN so your
algorithm would treat them correctly) so I see no correctness problem. But it
might be worth explaining better the meaning of invalid "latest_removed_xid"
in comments.
In the nbtree/README, you say
"... if the commit record of latestRemovedXid is more ..."
but it's not clear to me what "latestRemovedXid" is. If you mean the
scan->kill_prior_tuple_min_lsn field, you probably need more words to explain
it.
* IsIndexLpDeadAllowed()
/* It all always allowed on primary if *all_dead. */
should probably be
/* It is always allowed on primary if *all_dead. */
* gistkillitems()
As the function is only called if (so->numKilled > 0), I think both
"killedsomething" and "dirty" variables should always have the same value, so
one variable should be enough. Assert(so->numKilled) would be appropriate in
that case.
The situation is similar for btree and hash indexes.
doc.patch:
"+applying the fill page write."
[1]: /messages/by-id/61470.1620647290@antos
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello,
Antonin.
My review that started in [1] continues here.
Thanks a lot for the review.
(Please note that code.patch does not apply to the current master branch.)
Rebased.
Especially for the problem discussed in [1] it should be
explained what would happen if kill_prior_tuple_min_lsn was not checked.
Updated the README, hope it is better now. Also, added a few details related
to the flushing of hint bits.
However I think there's one more case: if heap_hot_search_buffer() considers
all tuples in the chain to be "surely dead", but
HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason:
Yes, good catch, missed it.
I think that the dead tuples produced this way should never be visible on the
standby (and even if they were, they would change the page LSN so your
algorithm would treat them correctly) so I see no correctness problem. But it
might be worth explaining better the meaning of invalid "latest_removed_xid"
in comments.
Added additional comment.
but it's not clear to me what "latestRemovedXid" is. If you mean the
scan->kill_prior_tuple_min_lsn field, you probably need more words to explain
it.
Hope it is better now.
should probably be
/* It is always allowed on primary if *all_dead. */
Fixed.
As the function is only called if (so->numKilled > 0), I think both
"killedsomething" and "dirty" variables should always have the same value, so
one variable should be enough. Assert(so->numKilled) would be appropriate in
that case.
Fixed, but only partly, because I have added additional checks for a
long transaction in the case of a promoted server.
"+applying the fill page write."
Fixed.
Updated version in attach.
Thanks a lot,
Michail.
Attachments:
v2-0001-code.patch
v2-0002-code-optional.patch
v2-0003-test.patch
v2-0004-doc.patch
Hello.
Added a check for standby promotion with the long transaction to the
test (code and docs are unchanged).
Thanks,
Michail.
Attachments:
v3-0001-code.patch
v3-0002-code-optional.patch
v3-0003-test.patch
v3-0004-docs.patch
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
Hello.
Added a check for standby promotion with the long transaction to the
test (code and docs are unchanged).
I'm trying to continue the review, sorry for the delay. Following are a few
questions about the code:
* Does the masking need to happen in the AM code, e.g. _bt_killitems()? I'd
expect that the RmgrData.rm_fpi_mask can do all the work.
Maybe you're concerned about clearing the "LP-safe-on-standby" bits after
promotion, but I wouldn't consider this a problem: once the standby is
allowed to set the hint bits (i.e. minRecoveryPoint is high enough, see
IsIndexLpDeadAllowed() -> XLogNeedsFlush()), promotion shouldn't break
anything because it should not allow minRecoveryPoint to go backwards.
* How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps
a boolean argument can be added to distinguish the purpose of the masking.
* Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags
to LP_UNUSED unconditionally, which IMO should only be done by VACUUM. I
think you only need to revert the effect of prior ItemIdMarkDead(), so you
only need to change the status LP_DEAD to LP_NORMAL if the tuple still has
storage. (And maybe add an assertion to ItemIdMarkDead() confirming that
it's only used for LP_NORMAL items?)
As far as I understand, the current code only uses mask_lp_flags() during
WAL consistency check on copies of pages which don't eventually get written
to disk.
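The suggestion above can be sketched as a minimal self-contained model (ModelItemId is an invented stand-in for Postgres' ItemIdData; the LP_* state values match itemid.h):

```c
#include <assert.h>

/* Line-pointer states, as in Postgres' itemid.h. */
#define LP_UNUSED   0           /* unused, should have lp_len = 0 */
#define LP_NORMAL   1           /* used, has storage */
#define LP_REDIRECT 2           /* HOT redirect */
#define LP_DEAD     3           /* dead, may or may not have storage */

/* Simplified stand-in for ItemIdData (lp_off omitted). */
typedef struct
{
    unsigned lp_flags : 2;
    unsigned lp_len   : 15;
} ModelItemId;

/*
 * Masking should only undo a prior ItemIdMarkDead(): LP_DEAD with storage
 * becomes LP_NORMAL again.  It must never produce LP_UNUSED - reclaiming
 * the slot is VACUUM's job.
 */
static void
mask_lp_dead(ModelItemId *itemid)
{
    if (itemid->lp_flags == LP_DEAD && itemid->lp_len != 0)
        itemid->lp_flags = LP_NORMAL;
}
```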
* IsIndexLpDeadAllowed()
** is bufmgr.c the best location for this function?
** the header comment should explain the minLsn argument.
** comment
/* It is always allowed on primary if *all_dead. */
should probably be
/* It is always allowed on primary if ->all_dead. */
* comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14.
On regression tests:
* Is the purpose of the repeatable read (RR) snapshot to test that
heap_hot_search_buffer() does not set deadness->all_dead if some transaction
can still see a tuple of the chain? If so, I think the RR snapshot does not
have to be used in the tests because this patch does not really affect the
logic: heap_hot_search_buffer() only sets deadness->all_dead to false, just
like it sets *all_dead in the current code. Besides that,
IsIndexLpDeadAllowed() too can avoid setting of the LP_DEAD flag on an index
tuple (at least until the commit record of the deleting/updating transaction
gets flushed to disk), so it can hide the behaviour of
heap_hot_search_buffer().
* Unless I miss something, the tests check that the hint bits are not
propagated from primary (or they are propagated but marked non-safe),
however there's no test to check that standby does set the hint bits itself.
* I'm also not sure if promotion needs to be tested. What's specific about the
promoted cluster from the point of view of this feature? The only thing I
can think of is clearing of the "LP-safe-on-standby" bits, but, as I said
above, I'm not sure if the tests ever let standby to set those bits before
the promotion.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello, Antonin.
I'm trying to continue the review, sorry for the delay. Following are a few
question about the code:
Thanks for the review :) And sorry for the delay too :)
* Does the masking need to happen in the AM code, e.g. _bt_killitems()?
I'd expect that the RmgrData.rm_fpi_mask can do all the work.
RmgrData.rm_fpi_mask clears only the single BTP_LP_SAFE_ON_STANDBY bit
to indicate that the hint bits are not safe to be used on standby.
Why not clear the LP_DEAD bits in rm_fpi_mask? Because we could
get such bits in multiple ways:
* the standby was created from a base backup of the primary
* some pages were changed by pg_rewind
* the standby was upgraded to a version having this feature (so old
pages still contain LP_DEAD)
So, the AM code needs to know when and why to clear the LP_DEAD bits if
BTP_LP_SAFE_ON_STANDBY is not set.
Also, the important moment here is pg_memory_barrier() usage.
* How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps
a boolean argument can be added to distinguish the purpose of the masking.
I have tried this way, but the code looked dirty and complicated.
Also, the separate rm_fpi_mask provides some semantics to the function.
* Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags
to LP_UNUSED unconditionally, which IMO should only be done by VACUUM.
Oh, good catch. I made mask_lp_dead for this. Also, added such a
situation to the test.
** is bufmgr.c the best location for this function?
Moved to indexam.c and made static (is_index_lp_dead_allowed).
should probably be
/* It is always allowed on primary if ->all_dead. */
Fixed.
* comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14.
Fixed.
* Is the purpose of the repeatable read (RR) snapshot to test that
heap_hot_search_buffer() does not set deadness->all_dead if some transaction
can still see a tuple of the chain?
The main purpose is to test the xactStartedInRecovery logic after the promotion.
For example:
if (scan->xactStartedInRecovery && !RecoveryInProgress())
* Unless I miss something, the tests check that the hint bits are not
propagated from primary (or they are propagated but marked non-safe),
however there's no test to check that standby does set the hint bits itself.
It is tested on a different standby, see
is(hints_num($node_standby_2), qq(10), 'index hint bits already
set on second standby 2');
Also, I added checks for BTP_LP_SAFE_ON_STANDBY to make sure
everything in the test goes by scenario.
Thanks a lot,
Michail.
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
* Is the purpose of the repeatable read (RR) snapshot to test that
heap_hot_search_buffer() does not set deadness->all_dead if some transaction
can still see a tuple of the chain?
The main purpose is to test xactStartedInRecovery logic after the promotion.
For example -
if (scan->xactStartedInRecovery && !RecoveryInProgress())
I understand that the RR snapshot is used to check the MVCC behaviour, however
this comment seems to indicate that the RR snapshot should also prevent the
standby from setting the hint bits.
# Make sure previous queries not set the hints on standby because
# of RR snapshot
I can imagine that on the primary, but I don't think that the backend that
checks visibility on standby also checks other snapshots/backends. And it
didn't work when I ran the test manually, although I could have missed
something.
A few more notes regarding the tests:
* 026_standby_index_lp_dead.pl should probably be renamed to
027_standby_index_lp_dead.pl (026_* was created in the master branch
recently)
* The test fails, although I have configured the build with
--enable-tap-tests.
BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5.
t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200)
I suspect the testing infrastructure changed recently.
* The messages like this
is(hints_num($node_standby_1), qq(10),
'hints are set on standby1 because FPI but marked as non-safe');
say that the hints are "marked as non-safe", but the hints_num() function
does not seem to check that.
* wording:
is(hints_num($node_standby_2), qq(10), 'index hint bits already set on second standby 2');
->
is(hints_num($node_standby_2), qq(10), 'index hint bits already set on standby 2');
And a few more notes on the code:
* There's an extra semicolon in mask_lp_dead():
bufmask.c:148:38: warning: for loop has empty body [-Wempty-body]
offnum = OffsetNumberNext(offnum));
^
bufmask.c:148:38: note: put the semicolon on a separate line to silence this warning
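For readers unfamiliar with this warning: a stray semicolon after a for-loop header silently turns the intended body into a no-op. A minimal standalone reproduction of the bug class (not the actual bufmask.c code):

```c
#include <assert.h>

/* BUG: the trailing ';' on the for header is the loop's (empty) body;
 * the block below it executes exactly once, after the loop finishes. */
int
count_with_bug(int n)
{
    int         count = 0;
    int         i;

    for (i = 0; i < n; i++);
    {
        count++;
    }
    return count;               /* always 1, regardless of n */
}

/* Fixed: the increment is the actual loop body. */
int
count_fixed(int n)
{
    int         count = 0;

    for (int i = 0; i < n; i++)
        count++;
    return count;
}
```

This is exactly what -Wempty-body catches, which is why the compiler suggests putting the semicolon on a separate line when an empty body is intentional.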
* the header comment of heap_hot_search_buffer() still says "*all_dead"
whereas I'd expect "->all_dead".
The same for "*page_lsn".
* I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the
IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values,
e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in
index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and
rename the function is_index_lp_dead_allowed() to
is_index_lp_dead_maybe_allowed()?
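The suggested refactoring can be sketched as follows; this is only an illustration of the proposed boolean shape, built around the promotion check quoted earlier in the thread, not the patch's actual code (the parameter names are hypothetical).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the suggested two-outcome design: instead of a
 * three-value enum, the function answers "may we possibly set LP_DEAD?",
 * and the caller keeps a kill_maybe_allowed flag for the LSN-dependent
 * case that still has to be resolved against the page LSN.
 */
static bool
is_index_lp_dead_maybe_allowed(bool xact_started_in_recovery,
                               bool recovery_in_progress)
{
    /*
     * The promotion case from the thread: a transaction that started
     * during recovery cannot trust hint-bit rules once recovery has
     * ended, so setting LP_DEAD is not allowed for it.
     */
    if (xact_started_in_recovery && !recovery_in_progress)
        return false;

    /* Otherwise it is at least "maybe" allowed. */
    return true;
}
```

The point of the rename is that a "maybe" answer honestly signals that the caller still owes a page-LSN check before killing items.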
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello, Antonin.
Thanks for pushing it forward.
I understand that the RR snapshot is used to check the MVCC behaviour, however
this comment seems to indicate that the RR snapshot should also prevent the
standby from setting the hint bits.
# Make sure previous queries not set the hints on standby because
# of RR snapshot
I can imagine that on the primary, but I don't think that the backend that
checks visibility on standby also checks other snapshots/backends. And it
didn't work when I ran the test manually, although I could have missed
something.
Yes, it does check - see ComputeXidHorizons for details. It is
the main part of the correctness of the whole feature. I added some
details about it to the test.
* 026_standby_index_lp_dead.pl should probably be renamed to
027_standby_index_lp_dead.pl (026_* was created in the master branch
recently)
Done.
BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5.
t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Fixed.
* The messages like this
Fixed.
* There's an extra semicolon in mask_lp_dead():
Oh, it really was a serious error (the loop body was empty) :) Fixed.
* the header comment of heap_hot_search_buffer() still says "*all_dead"
whereas I'd expect "->all_dead".
The same for "*page_lsn".
I was trying to mimic the existing comment style (it has said “*tid”
since 2007). So, I think it is better to keep the whole function
comment in the same style.
* I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the
IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values,
e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in
index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and
rename the function is_index_lp_dead_allowed() to
is_index_lp_dead_maybe_allowed()?
Yes, it looks better this way. Done. Also, I have added some checks
for “maybe” LSN-related logic to the test.
Thanks a lot,
Michail.
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
I understand that the RR snapshot is used to check the MVCC behaviour, however
this comment seems to indicate that the RR snapshot should also prevent the
standb from setting the hint bits.
# Make sure previous queries not set the hints on standby because
# of RR snapshot
I can imagine that on the primary, but I don't think that the backend that
checks visibility on standby does checks other snapshots/backends. And it
didn't work when I ran the test manually, although I could have missed
something.
Yes, it does check - see ComputeXidHorizons for details. It is
the main part of the correctness of the whole feature. I added some
details about it to the test.
Ah, ok. I thought that only KnownAssignedXids is used on standby, but that
would ignore the RR snapshot. It wasn't clear to me when the xmin of the
hot-standby backends is set, now I think it's done by GetSnapshotData().
* I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the
IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values,
e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in
index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and
rename the function is_index_lp_dead_allowed() to
is_index_lp_dead_maybe_allowed()?Yes, this way it is looks better. Done. Also, I have added some checks
for “maybe” LSN-related logic to the test.
Attached is a proposal for a minor addition that would make sense to me, add
it if you think it's appropriate.
I think I've said enough, changing the status to "ready for committer" :-)
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
test_proposal.patch (text/x-diff, +2/-1)
I have changed approach, so it is better to start from this email:
/messages/by-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH=cMXZSf+nzvg@mail.gmail.com
Woo-hoo :)
Attached is a proposal for a minor addition that would make sense to me, add
it if you think it's appropriate.
Yes, I'll add to the patch.
I think I've said enough, changing the status to "ready for committer" :-)
Thanks a lot for your help and attention!
Best regards,
Michail.
Hello.
Attached is a proposal for a minor addition that would make sense to me, add
it if you think it's appropriate.
Added. Also, I updated the documentation a little.
I have changed approach, so it is better to start from this email:
Oops, I was thinking the comments feature in the commitfest app works
in a different way :)
Best regards,
Michail.