Thoughts on "killed tuples" index hint bits support on standby
Hello, hackers.
Currently, hint bits in index pages (for dead tuples) are set and taken into
account only on the primary server; a standby simply ignores them. This is
done for good reasons, of course (see RelationGetIndexScan and [1]):
* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.
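For reference, the current behavior boils down to roughly this (simplified
from RelationGetIndexScan() in genam.c and index_fetch_heap() in indexam.c):

    /* RelationGetIndexScan(): never trust index hint bits during recovery */
    scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
    scan->ignore_killed_tuples = !scan->xactStartedInRecovery;

    /* index_fetch_heap(): and never ask the index AM to set them either */
    if (!scan->xactStartedInRecovery)
        scan->kill_prior_tuple = all_dead;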
Also, according to [2] and cases like [3], it seems to be a good idea to
support "ignore_killed_tuples" on standby.
I think I know a way to support it correctly with a reasonable amount of changes.
The first thing we need to consider is that checksums and wal_log_hints are
widely used these days. So, at any moment the primary could send an FPW page
with new "killed tuples" hints and overwrite hints set by the standby.
Moreover, it is not possible to distinguish whether hints were set by the
primary or by the standby.
And this is where hot_standby_feedback comes into play. The primary takes the
xmin of hot_standby_feedback replicas into account (via RecentGlobalXmin) when
setting "killed tuples" bits. So, if hot_standby_feedback has been enabled on
a standby for a while, that standby could safely trust the hint bits from the
primary.
Also, the standby could set its own hints using the xmin it sends to the
primary during feedback (but without marking the page as dirty).
Of course, it is not that simple; there are a few things and corner cases to
take care of:
* It looks like RecentGlobalXmin could move backwards if a new replica with a
lower xmin connects (or if some replica is switched to
hot_standby_feedback=on). We must ensure RecentGlobalXmin moves strictly
forward (see the sketch after this list).
* hot_standby_feedback could be enabled on the fly. In such a case we need to
distinguish transactions for which it is safe or unsafe to use the hints. The
standby could receive a fresh RecentGlobalXmin in response to a feedback
message; all standby transactions with xmin >= RecentGlobalXmin are then safe
to use the hints.
* hot_standby_feedback could be disabled on the fly. In that situation the
standby needs to keep sending feedback while canceling all queries running
with ignore_killed_tuples=true. Once all such queries are canceled, the
feedback is no longer needed and can be disabled.
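The first point could be handled with a simple monotonic clamp, something
like this (just a sketch; the variable and function names are invented, and
locking is ignored):

    /* Sketch: only ever advance the horizon we trust for index hint bits. */
    static TransactionId trustedKillHorizon = InvalidTransactionId;

    static void
    AdvanceTrustedKillHorizon(TransactionId newHorizon)
    {
        if (!TransactionIdIsValid(trustedKillHorizon) ||
            TransactionIdFollows(newHorizon, trustedKillHorizon))
            trustedKillHorizon = newHorizon;
        /* else: a lagging replica connected -- keep the newer value we had */
    }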
Could someone validate my thoughts, please? If the idea is mostly correct, I
could try to implement and test it.
[1]: /messages/by-id/7067.1529246768@sss.pgh.pa.us
[2]: /messages/by-id/12843.1529331619@sss.pgh.pa.us
[3]: /messages/by-id/20170428133818.24368.33533@wrigleys.postgresql.org
Hi,
On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> The first thing we need to consider is that checksums and wal_log_hints are
> widely used these days. So, at any moment the primary could send an FPW page
> with new "killed tuples" hints and overwrite hints set by the standby.
> Moreover, it is not possible to distinguish whether hints were set by the
> primary or by the standby.
Note that the FPIs are only going to be sent for the first write to a
page after a checkpoint. I don't think you're suggesting we rely on them
for correctness (this far into the email at least), but it still seems
worthwhile to point out.
> And this is where hot_standby_feedback comes into play. The primary takes
> the xmin of hot_standby_feedback replicas into account (via
> RecentGlobalXmin) when setting "killed tuples" bits. So, if
> hot_standby_feedback has been enabled on a standby for a while, that standby
> could safely trust the hint bits from the primary.
Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a cascading
replica setup that doesn't report hot_standby_feedback upstream. Also,
hot_standby_feedback only takes effect while a streaming connection is
active; if that is even temporarily interrupted, the primary will lose all
knowledge of the standby's horizon - unless replication slots are in use,
that is.
Additionally, we also need to handle the case where the replica has
restarted and is recovering using local WAL and/or archive-based recovery.
In that case the standby could already have sent a *newer* horizon as
feedback upstream, but currently again have an older view. It is entirely
possible that the standby is consistent and queryable in such a state (if
nothing forced disk writes during WAL replay, minRecoveryLSN will not be set
to something too new).
> Also, the standby could set its own hints using the xmin it sends to the
> primary during feedback (but without marking the page as dirty).
We do something similar for heap hint bits already...
> Of course, it is not that simple; there are a few things and corner cases
> to take care of:
> * It looks like RecentGlobalXmin could move backwards if a new replica with
> a lower xmin connects (or if some replica is switched to
> hot_standby_feedback=on). We must ensure RecentGlobalXmin moves strictly
> forward.
I'm not sure this is a problem. If that happens, we cannot rely on the
different xmin horizon anyway, because action may already have been taken
based on the old RecentGlobalXmin. Thus we need to be safe against that
anyway.
> * hot_standby_feedback could be enabled on the fly. In such a case we need
> to distinguish transactions for which it is safe or unsafe to use the
> hints. The standby could receive a fresh RecentGlobalXmin in response to a
> feedback message; all standby transactions with xmin >= RecentGlobalXmin
> are then safe to use the hints.
> * hot_standby_feedback could be disabled on the fly. In that situation the
> standby needs to keep sending feedback while canceling all queries running
> with ignore_killed_tuples=true. Once all such queries are canceled, the
> feedback is no longer needed and can be disabled.
I don't think we can rely on hot_standby_feedback at all. We can use it to
avoid unnecessary cancellations, etc., and even assume it's set up
reasonably for some configurations, but there always needs to be an
independent correctness backstop.
I think it might be more promising to improve the kill logic based on the
WAL-logged horizons from the primary. All we need to do, I think, is to use
a more conservatively computed RecentGlobalXmin when determining whether
tuples are dead. We already regularly log an xl_running_xacts; adding
information about the primary's horizon to that, and stashing it in shared
memory on the standby, shouldn't be too hard. Then we can make correct,
albeit likely overly pessimistic, visibility determinations about tuples,
and go on to set LP_DEAD.
There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on the new
xl_running_xacts field, as that'd make the conflict rate go through the roof
- but I think we could safely use the logical minimum of the local
RecentGlobalXmin and the primary's.
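A minimal sketch of that "logical minimum" (assuming the primary's horizon
has been stashed in shared memory by the xl_running_xacts redo code; the
accessor name here is made up):

    /* Sketch: pick the older, i.e. more conservative, of the two horizons. */
    static TransactionId
    GetConservativeKillHorizon(void)
    {
        TransactionId local = RecentGlobalXmin;
        TransactionId primary = GetPrimaryHorizonFromSharedMemory(); /* hypothetical */

        if (!TransactionIdIsValid(primary))
            return local;
        return TransactionIdPrecedes(primary, local) ? primary : local;
    }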
That should allow us to set additional LP_DEAD safely, I believe. We
could even rely on those LP_DEAD bits. But:
I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
skip over entries during scans, however. The bits set as described above
would be safe, but we can also see LP_DEAD set by the primary (and even by
upstream cascading standbys, at least in the case of new base backups taken
from them), due to them not being WAL logged. As we don't WAL log, there is
no conflict associated with the LP_DEADs being set. My gut feeling is that
it's going to be very hard to get around this without adding WAL logging for
_bt_killitems et al (including an interface for kill_prior_tuple to report
the used horizon to the index).
I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page. We don't need to store the
level for leaf pages, because we have BTP_LEAF, so we could make space for
that (potentially signalled by a new BTP flag). Obviously we have to be
careful with storing xids in the index, due to potential wraparound danger -
but I think such a page would have to be vacuumed anyway before a potential
wraparound. I think we could safely unset the xid during nbtree single-page
cleanup, and vacuum, by making sure no LP_DEAD entries survive, and by
including the horizon in the generated WAL record.
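To make that a bit more concrete, here is a sketch (the BTP_KILL_HORIZON
flag and the helper are invented for illustration; the code reuses the
existing btpo level/xact union):

    #define BTP_KILL_HORIZON    (1 << 8)    /* hypothetical new flag */

    /* Sketch: remember the newest horizon used to set LP_DEAD on a leaf page */
    static void
    _bt_stamp_kill_horizon(Page page, TransactionId horizon)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        Assert(P_ISLEAF(opaque));   /* leaf pages don't need btpo.level */
        if ((opaque->btpo_flags & BTP_KILL_HORIZON) == 0 ||
            TransactionIdFollows(horizon, opaque->btpo.xact))
        {
            opaque->btpo.xact = horizon;
            opaque->btpo_flags |= BTP_KILL_HORIZON;
        }
    }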
That still doesn't really fully allow us to set LP_DEAD on standbys,
however - but it'd allow us to take the primary's LP_DEADs into account on a
standby. I think we'd run into torn-page issues if we were to do so without
WAL logging, because we'd rely on the LP_DEAD bits and BTPageOpaqueData.xact
being in sync. I *think* we might be safe to do so *iff* the page's LSN
indicates that there has been a WAL record covering it since the last redo
location.
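That last condition might look roughly like this (a sketch, assuming
GetRedoRecPtr() is an acceptable stand-in for "the last redo location"):

    /* Sketch: the page has been WAL-logged (e.g. via an FPI) since the last
     * checkpoint's redo pointer, so a torn write would be repaired by replay
     * and LP_DEAD plus btpo.xact stay in sync. */
    static bool
    page_covered_since_redo(Page page)
    {
        return PageGetLSN(page) > GetRedoRecPtr();
    }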
I only had my first cup of coffee for the day, so I might also just be
entirely off base here.
Greetings,
Andres Freund
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we can rely on hot_standby_feedback at all. We can use it to
> avoid unnecessary cancellations, etc., and even assume it's set up
> reasonably for some configurations, but there always needs to be an
> independent correctness backstop.
+1
> I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
> skip over entries during scans, however. The bits set as described above
> would be safe, but we can also see LP_DEAD set by the primary (and even by
> upstream cascading standbys, at least in the case of new base backups taken
> from them), due to them not being WAL logged. As we don't WAL log, there is
> no conflict associated with the LP_DEADs being set. My gut feeling is that
> it's going to be very hard to get around this without adding WAL logging
> for _bt_killitems et al (including an interface for kill_prior_tuple to
> report the used horizon to the index).
I agree.
What about calling _bt_vacuum_one_page() more often than strictly
necessary to avoid a page split on the primary? The B-Tree
deduplication patch sometimes does that, albeit for completely
unrelated reasons. (We don't want to have to unset an LP_DEAD bit in
the case when a new/incoming duplicate tuple has a TID that overlaps
with the posting list range of some existing duplicate posting list
tuple.)
I have no idea how you'd determine that it was time to call
_bt_vacuum_one_page(). Seems worth considering.
> I'm wondering if we could recycle BTPageOpaqueData.xact to store the
> horizon applying to killed tuples on the page. We don't need to store the
> level for leaf pages, because we have BTP_LEAF, so we could make space for
> that (potentially signalled by a new BTP flag). Obviously we have to be
> careful with storing xids in the index, due to potential wraparound danger
> - but I think such a page would have to be vacuumed anyway before a
> potential wraparound.
You would think that, but unfortunately we don't currently do it that
way. We store XIDs in deleted leaf pages that can sometimes be missed
until the next wraparound.
We need to do something like commit
6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's
somewhere on my TODO list.
> I think we could safely unset the xid during nbtree single-page cleanup,
> and vacuum, by making sure no LP_DEAD entries survive, and by including
> the horizon in the generated WAL record.
>
> That still doesn't really fully allow us to set LP_DEAD on standbys,
> however - but it'd allow us to take the primary's LP_DEADs into account on
> a standby. I think we'd run into torn-page issues if we were to do so
> without WAL logging, because we'd rely on the LP_DEAD bits and
> BTPageOpaqueData.xact being in sync. I *think* we might be safe to do so
> *iff* the page's LSN indicates that there has been a WAL record covering
> it since the last redo location.
That sounds like a huge mess.
--
Peter Geoghegan
Hello again.
Andres, Peter, thanks for your comments.
Some of the issues you mentioned (reporting feedback to another cascading
standby, processing queries after a restart when a newer xid was already
reported) could be fixed within the proposed design, but your intention to
have an "independent correctness backstop" is the right thing to do.
So, I was thinking about another approach which is:
* still not too tricky to implement
* easy to understand
* does not rely on hot_standby_feedback for correctness, but only for efficiency
* could be used with any kind of index
* does not generate a lot of WAL
Let's add a new type of WAL record meaning "some index killed-tuple hint
bits were set according to RecentGlobalXmin=x" (without specifying the page
or even the relation). Let's call 'x' 'LastKilledIndexTuplesXmin' and track
it in standby memory. It is sent only when wal_log_hints=true; if the hints
cause an FPW, it is sent before the FPW record.
Also, such a WAL record does not have to be written every time the primary
marks an index tuple as dead - only when 'LastKilledIndexTuplesXmin' changes
(moves forward).
On the standby, such a record is used to cancel queries. If a transaction
runs with "ignore_killed_tuples==true" (set at snapshot creation) and its
xid is less than the LastKilledIndexTuplesXmin just received, the query is
simply canceled (because it could rely on an invalid hint bit). So,
technically it would already be correct to use the hints received from the
primary to skip tuples according to MVCC, but "the conflict rate goes
through the roof".
To avoid any real conflicts, the standby sets
    ignore_killed_tuples = (hot_standby_feedback is on)
                           AND (wal_log_hints is on at the primary)
                           AND (the new standby snapshot's xid >= the last
                                LastKilledIndexTuplesXmin received)
                           AND (hot_standby_feedback is reported directly
                                to the primary).
So, the hot_standby_feedback loop effectively eliminates any conflicts
(because LastKilledIndexTuplesXmin is effectively RecentGlobalXmin in that
case). But if the feedback is broken for some reason, the query cancellation
logic keeps everything safe.
For correctness, LastKilledIndexTuplesXmin (and, as a consequence,
RecentGlobalXmin) must only move forward.
To set killed bits on the standby, we should check tuple visibility against
the last LastKilledIndexTuplesXmin received. This mirrors how the primary
sets these bits according to its own state - so it is even safe to transfer
them to another standby.
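A sketch of that standby-side check (the horizon accessor is hypothetical;
HeapTupleIsSurelyDead() is the existing check the primary already uses when
deciding to set LP_DEAD):

    /* Sketch: an index tuple may be marked LP_DEAD on a standby only if it is
     * surely dead under the horizon the primary has already published. */
    static bool
    StandbyIndexTupleKillable(HeapTuple htup)
    {
        TransactionId horizon = GetLastKilledIndexTuplesXmin(); /* hypothetical */

        return TransactionIdIsValid(horizon) &&
               HeapTupleIsSurelyDead(htup, horizon);
    }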
Does it look better now?
Thanks, Michail.
Hi Michail,
On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Some of the issues you mentioned (reporting feedback to another cascading
> standby, processing queries after a restart when a newer xid was already
> reported) could be fixed within the proposed design, but your intention to
> have an "independent correctness backstop" is the right thing to do.
Attached is a very rough POC patch of my own, which makes item
deletion occur "non-opportunistically" in unique indexes. The idea is
that we exploit the uniqueness property of unique indexes to identify
"version churn" from non-HOT updates. If any single value on a leaf
page has several duplicates, then there is a good chance that we can
safely delete some of them. It's worth going to the heap to check
whether that's safe selectively, at the point where we'd usually have
to split the page. We only have to free one or two items to avoid
splitting the page. If we can avoid splitting the page immediately, we
may well avoid splitting it indefinitely, or forever.
This approach seems to be super effective. It can leave the PK on
pgbench_accounts in pristine condition (no page splits) after many
hours with a pgbench-like workload that makes all updates non-HOT
updates. Even with many clients, and a skewed distribution. Provided
the index isn't tiny to begin with, we can always keep up with
controlling index bloat -- once the client backends themselves begin
to take active responsibility for garbage collection, rather than just
treating it as a nice-to-have. I'm pretty sure that I'm going to be
spending a lot of time developing this approach, because it really
works.
This seems fairly relevant to what you're doing. It makes almost all
index cleanup occur using the new delete infrastructure for some of
the most interesting workloads where deletion takes place in client
backends. In practice, a standby will almost be in the same position
as the primary in a workload that this approach really helps with,
since setting the LP_DEAD bit itself doesn't really need to happen (we
can go straight to deleting the items in the new deletion path).
To address the questions you've asked: I don't really like the idea of
introducing new rules around tuple visibility and WAL logging to set
more LP_DEAD bits like this at all. It seems very complicated. I
suspect that we'd be better off introducing ways of making the actual
deletes occur sooner on the primary, possibly much sooner, avoiding
any need for special infrastructure on the standby. This is probably
not limited to the special unique index case that my patch focuses on
-- we can probably push this general approach forward in a number of
different ways. I just started with unique indexes because that seemed
most promising. I have only worked on the project for a few days. I
don't really know how it will evolve.
--
Peter Geoghegan
Attachment: 0001-Non-opportunistically-delete-B-Tree-items.patch (application/octet-stream)
From 7b5f22da46d5d9f917cf3c676e55e9915e92c412 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 20 Mar 2020 16:41:52 -0700
Subject: [PATCH] Non-opportunistically delete B-Tree items.
Repurpose deduplication infrastructure to delete items in unique indexes
at the point where we'd usually have to split the page, even when they
don't have their LP_DEAD bits set. Testing has shown that this is
almost completely effective at preventing "version bloat" from non-HOT
updates in unique indexes, provided there are no long running
transactions.
Note that INCLUDE indexes support the optimization.
---
src/include/access/nbtree.h | 2 +-
src/backend/access/nbtree/nbtdedup.c | 322 +++++++++++++++++++++++++-
src/backend/access/nbtree/nbtinsert.c | 5 +-
3 files changed, 324 insertions(+), 5 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 5f67fc04e0..9847118542 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1039,7 +1039,7 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
IndexTuple newitem, Size newitemsz,
- bool checkingunique);
+ bool checkingunique, bool allequalimage);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
OffsetNumber baseoff);
extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index b20faf693d..8c601cf681 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,8 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
@@ -23,6 +25,8 @@ static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
+static bool _bt_dedup_vacuum_one_page(Relation rel, Buffer buffer,
+ Relation heapRel, Size itemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
@@ -54,7 +58,8 @@ static bool _bt_posting_valid(IndexTuple posting);
*/
void
_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz, bool checkingunique)
+ IndexTuple newitem, Size newitemsz, bool checkingunique,
+ bool allequalimage)
{
OffsetNumber offnum,
minoff,
@@ -109,6 +114,32 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
maxoff = PageGetMaxOffsetNumber(page);
}
+ /*
+ * Before we actually deduplicate, see if we can non-opportunistically
+ * free some duplicate if this is special checkingunique case
+ */
+ if (checkingunique)
+ {
+ if (IsSystemRelation(rel))
+ return;
+ if (_bt_dedup_vacuum_one_page(rel, buf, heapRel, newitemsz))
+ return;
+
+ /*
+ * Reconsider number of items on page, in case
+ * _bt_dedup_vacuum_one_page() managed to delete an item or two
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Can't actually deduplicate, so give up and split page (only here to see
+ * if we can delete non-opportunistically)
+ */
+ if (!allequalimage)
+ return;
+
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -794,6 +825,295 @@ _bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
return nposting;
}
+static void
+_bt_dedup_vacuum_finish_pending(BTDedupState state)
+{
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+ if (state->nitems == 1)
+ {
+ /* Don't merge */
+ }
+ else
+ {
+ /* Save final number of items for posting list */
+ state->intervals[state->nintervals].nitems = state->nitems;
+ /* Increment nintervals, since we wrote a new posting list tuple */
+ state->nintervals++;
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->phystupsize = 0;
+}
+
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_intervalcmp(const void *arg1, const void *arg2)
+{
+ BTDedupInterval *inter1 = (BTDedupInterval *) arg1;
+ BTDedupInterval *inter2 = (BTDedupInterval *) arg2;
+
+ /* Note: This sorts intervals in descending order */
+ if (inter1->nitems > inter2->nitems)
+ return -1;
+ if (inter1->nitems < inter2->nitems)
+ return 1;
+
+ /* Now tiebreak on itemoffsetnumber order, asc */
+ if (inter1->baseoff > inter2->baseoff)
+ return 1;
+ if (inter1->baseoff < inter2->baseoff)
+ return -1;
+
+ return 0;
+
+ return 0;
+}
+
+static int
+_bt_offsetnumber_cmp(const void *arg1, const void *arg2)
+{
+ OffsetNumber *inter1 = (OffsetNumber *) arg1;
+ OffsetNumber *inter2 = (OffsetNumber *) arg2;
+
+ if (*inter1 > *inter2)
+ return 1;
+ if (*inter1 < *inter2)
+ return -1;
+
+ return 0;
+}
+
+static bool
+_bt_dedup_delete_item(Relation heapRel, Snapshot snapshot, IndexTuple itup,
+ int *heapaccesses)
+{
+ bool all_dead;
+
+ all_dead = false;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ table_index_fetch_tuple_check(heapRel, &itup->t_tid,
+ snapshot, &all_dead);
+ (*heapaccesses)++;
+ }
+ else
+ {
+ for (int k = 0; k < BTreeTupleGetNPosting(itup); k++)
+ {
+ table_index_fetch_tuple_check(heapRel,
+ BTreeTupleGetPostingN(itup, k),
+ snapshot, &all_dead);
+ (*heapaccesses)++;
+ if (!all_dead)
+ break;
+ }
+ }
+
+ return all_dead;
+}
+
+/*
+ * See if duplicate unique index tuples in a unique index are eligible to be
+ * deleted, even though they don't have their LP_DEAD bit set already.
+ * Concentrate on groups of duplicates, since we're most likely to be
+ * successful there. Give up if we have to access more than a few heap pages
+ * before we can free enough space to avoid a page split.
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ *
+ * FIXME: Be less eager with tuples that contain NULLs, since they can be
+ * duplicates without that signaling anything about version churn. Just
+ * because we're checkingunique (which implies that incoming newitem isn't
+ * a NULL) doesn't mean there aren't lots of other NULLs on the page.
+ */
+//#define DEBUG
+static bool
+_bt_dedup_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size itemsz)
+{
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ int heapaccesses = 0;
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Size freespace;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool finished = false;
+ bool success = false;
+ BTDedupState state;
+ SnapshotData SnapshotDirty;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ /* Final "posting list" size should not restrict anything */
+ state->maxpostingsize = BLCKSZ;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->phystupsize = 0;
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+
+ }
+ else
+ {
+ _bt_dedup_vacuum_finish_pending(state);
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+ /* Handle the last item */
+ _bt_dedup_vacuum_finish_pending(state);
+
+ if (state->nintervals == 0)
+ {
+ pfree(state->htids);
+ pfree(state);
+ return false;
+ }
+
+ /* Access items in order of nitems, then asc page offset number order */
+ qsort(state->intervals, state->nintervals, sizeof(BTDedupInterval),
+ _bt_intervalcmp);
+
+ //#define DEBUG
+ freespace = PageGetFreeSpace(page);
+ InitDirtySnapshot(SnapshotDirty);
+ for (int i = 0; i < state->nintervals; i++)
+ {
+ BTDedupInterval interval = state->intervals[i];
+ bool killed_in_interval = false;
+
+#ifdef DEBUG
+ if (interval.nitems > 2)
+ elog(WARNING, "item %d base off %u nitems %d", i,
+ interval.baseoff, interval.nitems);
+#endif
+ Assert(interval.nitems > 0);
+ /* Iterate through tuples of given value */
+ for (int j = 0; j < interval.nitems; j++)
+ {
+ OffsetNumber ioff = interval.baseoff + j;
+ ItemId itemid = PageGetItemId(page, ioff);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (killed_in_interval && j == 1 && interval.nitems == 2)
+ {
+ /*
+ * If we already killed first item in interval, and there are
+ * only two items, then assume that this isn't going to be
+ * another kill -- move on to next interval
+ */
+ break;
+ }
+
+ if (_bt_dedup_delete_item(heapRel, &SnapshotDirty, itup,
+ &heapaccesses))
+ {
+ /* Delete item */
+ ItemIdMarkDead(itemid);
+ deletable[ndeletable++] = ioff;
+ freespace += ItemIdGetLength(itemid);
+ killed_in_interval = true;
+
+ if (freespace >= itemsz)
+ {
+ /*
+ * We successfully avoided a page split, but we may still
+ * want to get a few more items to avoid excessive WAL
+ * logging.
+ *
+ * Don't bother with this once we're passed the point
+ * where intervals have more than 2 items -- that suggests
+ * that there is no strong correlation.
+ */
+ success = true;
+ if (interval.nitems == 2 || ndeletable >= 3)
+ {
+ /*
+ * Okay, we have enough stuff to free -- don't want to
+ * do too much under leaf page buffer lock
+ */
+ finished = true;
+ break;
+ }
+ }
+ }
+
+ if (heapaccesses >= 10)
+ {
+ finished = true;
+ break;
+ }
+ }
+
+ if (finished)
+ break;
+ if (i >= 5)
+ break;
+ }
+
+ if (ndeletable > 0)
+ {
+ /* Have to give array to _bt_delitems_delete in asc order */
+ qsort(deletable, ndeletable, sizeof(OffsetNumber),
+ _bt_offsetnumber_cmp);
+ MarkBufferDirtyHint(buffer, true);
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+#ifdef DEBUG
+ elog(WARNING, "succeeded with block %u, heapaccesses %d, ndeletable %u, first offset %u",
+ BufferGetBlockNumber(buffer), heapaccesses, ndeletable,
+ deletable[0]);
+ }
+ else
+ {
+ elog(WARNING, "failed with block %u, heapaccesses %d, ndeletable %u",
+ BufferGetBlockNumber(buffer), heapaccesses, ndeletable);
+#endif
+ }
+
+ pfree(state->htids);
+ pfree(state);
+
+ return success;
+}
+
/*
* Verify posting list invariants for "posting", which must be a posting list
* tuple. Used within assertions.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1675298f73..db9b7e9981 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -892,13 +892,12 @@ _bt_findinsertloc(Relation rel,
uniquedup = true;
}
- if (itup_key->allequalimage && BTGetDeduplicateItems(rel) &&
- (!checkingunique || uniquedup) &&
+ if (BTGetDeduplicateItems(rel) && (!checkingunique || uniquedup) &&
PageGetFreeSpace(page) < insertstate->itemsz)
{
_bt_dedup_one_page(rel, insertstate->buf, heapRel,
insertstate->itup, insertstate->itemsz,
- checkingunique);
+ checkingunique, itup_key->allequalimage);
insertstate->bounds_valid = false;
}
}
--
2.25.1
Hello, Peter.
Thanks for your feedback.
> Attached is a very rough POC patch of my own, which makes item
> deletion occur "non-opportunistically" in unique indexes. The idea is
> that we exploit the uniqueness property of unique indexes to identify
> "version churn" from non-HOT updates. If any single value on a leaf
> page has several duplicates, then there is a good chance that we can
> safely delete some of them. It's worth going to the heap to check
> whether that's safe selectively, at the point where we'd usually have
> to split the page. We only have to free one or two items to avoid
> splitting the page. If we can avoid splitting the page immediately, we
> may well avoid splitting it indefinitely, or forever.
Yes, it is a brilliant idea to use uniqueness to avoid bloating the index. I am
not able to understand all the code now, but I’ll check the patch in more
detail later.
> This seems fairly relevant to what you're doing. It makes almost all
> index cleanup occur using the new delete infrastructure for some of
> the most interesting workloads where deletion takes place in client
> backends. In practice, a standby will almost be in the same position
> as the primary in a workload that this approach really helps with,
> since setting the LP_DEAD bit itself doesn't really need to happen (we
> can go straight to deleting the items in the new deletion path).
> This is probably
> not limited to the special unique index case that my patch focuses on
> -- we can probably push this general approach forward in a number of
> different ways. I just started with unique indexes because that seemed
> most promising. I have only worked on the project for a few days. I
> don't really know how it will evolve.
Yes, it is relevant, but I think the two approaches are «located in
different planes» and complement each other. Most indexes are not unique
these days, and most standbys (and even primaries) have long snapshots (up
to minutes or hours), so multiple versions of index records are still
required for them. Even if we could avoid multiple versions somehow, it
could lead to a very high rate of query cancellations on the standby.
> To address the questions you've asked: I don't really like the idea of
> introducing new rules around tuple visibility and WAL logging to set
> more LP_DEAD bits like this at all. It seems very complicated.
I don't think it is too complicated. I have polished the idea a little and
now it even looks elegant to me :) I'll try to explain the concept briefly
(there are no new visibility rules and no changes to set more LP_DEAD bits
than today - everything is based on well-tested mechanics):
1) There is already a kind of xmin horizon that the primary pushes to the
standby. All standby snapshots are required to satisfy this horizon in order
to access the heap and indexes. This is enforced by
*ResolveRecoveryConflictWithSnapshot* and the corresponding WAL records (for
example, XLOG_BTREE_DELETE).
2) We could introduce a new WAL record (named XLOG_INDEX_HINT in the patch
for now) to define the xmin horizon a standby snapshot must satisfy in order
to use LP_DEAD bits in indexes.
3) The primary sends XLOG_INDEX_HINT when it sets an LP_DEAD bit on an index
page (but before the possible FPW caused by the hints) by calling
*LogIndexHintIfNeeded* (a sketch of that logic follows after this list).
Such a record only needs to be sent if the new xmin value is greater than
the one sent before. I ran tests to estimate the amount of new WAL - it is
really small (especially compared to the FPW writes caused by setting
LP_DEAD bits).
4) The new XLOG_INDEX_HINT record contains only a database id and the value
of *latestIndexHintXid* (the new horizon position). For simplicity, the
primary could just set the horizon to *RecentGlobalXmin*. But for now the
patch extracts the horizon value from the heap in *HeapTupleIsSurelyDead* to
reduce the number of XLOG_INDEX_HINT records even more.
5) There is a new field in the PGPROC structure - *indexIgnoreKilledTuples*.
If it is set to true, standby queries are going to use LP_DEAD bits in index
scans. In that case the snapshot is required to satisfy the new LP_DEAD
horizon pushed by XLOG_INDEX_HINT records. This is done by the same
mechanism as used for the heap - *ResolveRecoveryConflictWithSnapshot*.
6) The major thing here is that it is safe, from a correctness perspective,
to set *indexIgnoreKilledTuples* to either 'true' or 'false'. It is just a
performance trade-off - use LP_DEAD bits but be subject to the
XLOG_INDEX_HINT horizon, or vice versa.
7) How do we make the right call on this trade-off? It is pretty simple - if
hot_standby_feedback is on and the primary has confirmed that our feedback
was received, set *indexIgnoreKilledTuples* to 'true': while the feedback is
working as expected, a query will never be canceled by the XLOG_INDEX_HINT
horizon!
8) To support cascading standby setups (where the feedback chain may be
broken), an additional byte is added to the 'keepalive' message of the
feedback protocol.
9) So, at this point we are safe to use the LP_DEAD bits received from the
primary whenever we want to.
10) What about setting LP_DEAD bits on the standby? The main thing here is
that *RecentGlobalXmin* on the standby is always lower than the
XLOG_INDEX_HINT horizon by definition - the standby is always behind the
primary. So, if something looks dead on the standby, it is definitely dead
on the primary.
11) Even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
- nothing can go wrong here.
That is because *HeapTupleIsSurelyDead* (and, as a result, the index LP_DEAD
bit) requires the *HEAP* hint bits to already be set on the standby. So, the
same code decides whether to set hint bits in the heap (which standbys have
been doing for a long time already) and in the index.
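For item 3, the primary-side logic boils down to something like this (only a
sketch: the shared-memory bookkeeping, a per-database structure with locking
in the actual POC, is reduced to a single variable here):

    /* Sketch: emit XLOG_INDEX_HINT only when the horizon actually advances. */
    void
    LogIndexHintIfNeeded(Relation rel, TransactionId horizon)
    {
        if (!XLogHintBitIsNeeded())     /* checksums / wal_log_hints disabled */
            return;
        if (!TransactionIdIsValid(horizon))
            return;

        if (TransactionIdFollows(horizon, latestIndexHintXid))
        {
            xl_index_hint xlrec;

            xlrec.dbId = rel->rd_node.dbNode;
            xlrec.latestIndexHintXid = horizon;

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec, sizeof(xl_index_hint));
            XLogInsert(RM_STANDBY_ID, XLOG_INDEX_HINT);

            latestIndexHintXid = horizon;   /* shared state in the real patch */
        }
    }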
So, the only price we pay is a few additional bytes of WAL and some moderate
additional code complexity. But support for index hint bits on standby is a
huge advantage for many workloads. I was able to get a more than 1000%
performance boost (which is not surprising - index hint bits are just a
great optimization). And it works for almost all index types out of the box.
Another major thing here is that everything is based on old, well-tested
mechanics: query cancellation due to snapshot conflicts, and setting heap
hint bits on standby.
Most of the patch consists of technical changes to support the new query
cancellation counters, the new WAL record, the new PGPROC field and so on.
There are some places I am not sure about yet, and the naming is bad - it is
still a POC.
Thanks,
Michail.
Attachment: standby-hint-bits-poc.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 08353cb343..f937b8df14 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1100,6 +1100,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to execute <function>txid_status</function> or update
the oldest transaction id available to it.</entry>
</row>
+ <row>
+ <entry><literal>IndexHintXMinDbHashLock</literal></entry>
+ <entry>Waiting to read or update information about the latest index hint LSN.</entry>
+ </row>
<row>
<entry><literal>clog</literal></entry>
<entry>Waiting for I/O on a clog (transaction status) buffer.</entry>
@@ -2722,6 +2726,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Number of queries in this database that have been canceled due to
old snapshots</entry>
</row>
+ <row>
+ <entry><structfield>confl_indexhint</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of queries in this database that have been canceled due to
+ too new index hint bits</entry>
+ </row>
<row>
<entry><structfield>confl_bufferpin</structfield></entry>
<entry><type>bigint</type></entry>
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index a4edcc77e9..3ea5fa15e9 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -22,6 +22,7 @@
#include "pgstat.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/standby.h"
#include "utils/float.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -87,6 +88,7 @@ gistkillitems(IndexScanDesc scan)
if (killedsomething)
{
GistMarkPageHasGarbage(page);
+ LogIndexHintIfNeeded(scan->indexRelation, so->killedItemsXmax);
MarkBufferDirtyHint(buffer, true);
}
@@ -666,8 +668,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
MemoryContextSwitchTo(oldCxt);
}
if (so->numKilled < MaxIndexTuplesPerPage)
+ {
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
+
+ if (!TransactionIdFollowsOrEquals(so->killedItemsXmax, scan->kill_prior_tuple_xmax))
+ so->killedItemsXmax = scan->kill_prior_tuple_xmax;
+ }
}
/* continuing to return tuples from a leaf page */
scan->xs_heaptid = so->pageData[so->curPageData].heapPtr;
@@ -703,8 +710,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
MemoryContextSwitchTo(oldCxt);
}
if (so->numKilled < MaxIndexTuplesPerPage)
+ {
so->killedItems[so->numKilled++] =
so->pageData[so->curPageData - 1].offnum;
+
+ if (!TransactionIdFollowsOrEquals(so->killedItemsXmax, scan->kill_prior_tuple_xmax))
+ so->killedItemsXmax = scan->kill_prior_tuple_xmax;
+ }
}
/* find and process the next index page */
do
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index b8aa77f70f..b4e69b44c9 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -107,6 +107,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
}
so->killedItems = NULL; /* until needed */
+ so->killedItemsXmax = InvalidTransactionId;
so->numKilled = 0;
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 4871b7ff4d..9bbb6f40c5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -308,7 +308,12 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
palloc(MaxIndexTuplesPerPage * sizeof(int));
if (so->numKilled < MaxIndexTuplesPerPage)
+ {
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
+
+ if (!TransactionIdFollowsOrEquals(so->killedItemsXmax, scan->kill_prior_tuple_xmax))
+ so->killedItemsXmax = scan->kill_prior_tuple_xmax;
+ }
}
/*
@@ -377,6 +382,7 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL;
so->numKilled = 0;
+ so->killedItemsXmax = InvalidTransactionId;
scan->opaque = so;
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9efc8016bc..82154c43a9 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -18,6 +18,7 @@
#include "access/reloptions.h"
#include "access/relscan.h"
#include "storage/buf_internals.h"
+#include "storage/standby.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
@@ -625,6 +626,7 @@ _hash_kill_items(IndexScanDesc scan)
if (killedsomething)
{
opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ LogIndexHintIfNeeded(rel, so->killedItemsXmax);
MarkBufferDirtyHint(buf, true);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index db6fad76bc..8f8cb99075 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1527,7 +1527,7 @@ heap_fetch(Relation relation,
bool
heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call)
+ bool *all_dead, bool first_call, TransactionId *all_dead_xmax)
{
Page dp = (Page) BufferGetPage(buffer);
TransactionId prev_xmax = InvalidTransactionId;
@@ -1539,7 +1539,10 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* If this is not the first call, previous call returned a (live!) tuple */
if (all_dead)
+ {
*all_dead = first_call;
+ if (all_dead_xmax) *all_dead_xmax = InvalidTransactionId;
+ }
blkno = ItemPointerGetBlockNumber(tid);
offnum = ItemPointerGetOffsetNumber(tid);
@@ -1553,6 +1556,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
for (;;)
{
ItemId lp;
+ TransactionId xmax = InvalidTransactionId;
/* check for bogus TID */
if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
@@ -1571,6 +1575,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
at_chain_start = false;
continue;
}
+ else if (ItemIdIsDead(lp) && all_dead && all_dead_xmax)
+ {
+ if (ItemIdHasStorage(lp))
+ {
+ Assert((HeapTupleHeader)PageGetItem(dp, lp) != FrozenTransactionId);
+ *all_dead_xmax = HeapTupleHeaderGetXmin((HeapTupleHeader)PageGetItem(dp, lp));
+ }
+ else
+ *all_dead_xmax = InvalidTransactionId;
+ }
/* else must be end of chain */
break;
}
@@ -1635,9 +1649,13 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* Note: if you change the criterion here for what is "dead", fix the
* planner's get_actual_variable_range() function to match.
*/
- if (all_dead && *all_dead &&
- !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
- *all_dead = false;
+ if (all_dead && *all_dead)
+ {
+ if (!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin, &xmax))
+ *all_dead = false;
+ else if (all_dead_xmax && !TransactionIdFollowsOrEquals(*all_dead_xmax, xmax))
+ *all_dead_xmax = xmax;
+ }
/*
* Check to see if HOT chain continues past this tuple; if so fetch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3fa4b766db..6e97b0e9f5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -112,7 +112,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot,
- bool *call_again, bool *all_dead)
+ bool *call_again, bool *all_dead,
+ TransactionId *all_dead_xmax)
{
IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
@@ -145,7 +146,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
snapshot,
&bslot->base.tupdata,
all_dead,
- !*call_again);
+ !*call_again,
+ all_dead_xmax);
bslot->base.tupdata.t_self = *tid;
LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
@@ -2140,7 +2142,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
ItemPointerSet(&tid, page, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
- &heapTuple, NULL, true))
+ &heapTuple, NULL, true, NULL))
hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
}
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..84ee006d40 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1418,7 +1418,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
* if the tuple is removable.
*/
bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin, TransactionId *TupleXmax)
{
HeapTupleHeader tuple = htup->t_data;
@@ -1459,7 +1459,12 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
return false;
/* Deleter committed, so tuple is dead if the XID is old enough. */
- return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+ if (TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
+ {
+ if (TupleXmax) *TupleXmax = HeapTupleHeaderGetRawXmax(tuple);
+ return true;
+ }
+ else return false;
}
/*
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..03fbf537e7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/proc.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -115,8 +116,23 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
* should not be altered by index AMs.
*/
scan->kill_prior_tuple = false;
- scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
- scan->ignore_killed_tuples = !scan->xactStartedInRecovery;
+ scan->kill_prior_tuple_xmax = InvalidTransactionId;
+ if (!RecoveryInProgress())
+ {
+ scan->ignore_killed_tuples = true;
+ }
+ else
+ {
+ if (LWLockConditionalAcquire(ProcArrayLock, LW_SHARED)) // TODO: is it really required?
+ {
+ scan->ignore_killed_tuples = MyProc->indexIgnoreKilledTuples;
+ LWLockRelease(ProcArrayLock);
+ }
+ else
+ {
+ scan->ignore_killed_tuples = false;
+ }
+ }
scan->opaque = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index a5210d0b34..b3b6e78702 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -301,6 +301,7 @@ index_rescan(IndexScanDesc scan,
table_index_fetch_reset(scan->xs_heapfetch);
scan->kill_prior_tuple = false; /* for safety */
+ scan->kill_prior_tuple_xmax = InvalidTransactionId;
scan->xs_heap_continue = false;
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
@@ -378,6 +379,7 @@ index_restrpos(IndexScanDesc scan)
table_index_fetch_reset(scan->xs_heapfetch);
scan->kill_prior_tuple = false; /* for safety */
+ scan->kill_prior_tuple_xmax = InvalidTransactionId;
scan->xs_heap_continue = false;
scan->indexRelation->rd_indam->amrestrpos(scan);
@@ -525,6 +527,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* Reset kill flag immediately for safety */
scan->kill_prior_tuple = false;
+ scan->kill_prior_tuple_xmax = InvalidTransactionId;
scan->xs_heap_continue = false;
/* If we're out of index entries, we're done */
@@ -567,10 +570,11 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
{
bool all_dead = false;
bool found;
+ TransactionId kill_prior_tuple_xmax = InvalidTransactionId;
found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
scan->xs_snapshot, slot,
- &scan->xs_heap_continue, &all_dead);
+ &scan->xs_heap_continue, &all_dead, &kill_prior_tuple_xmax);
if (found)
pgstat_count_heap_fetch(scan->indexRelation);
@@ -578,12 +582,13 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
/*
* If we scanned a whole HOT chain and found only dead tuples, tell index
* AM to kill its entry for that TID (this will take effect in the next
- * amgettuple call, in index_getnext_tid). We do not do this when in
- * recovery because it may violate MVCC to do so. See comments in
- * RelationGetIndexScan().
+ * amgettuple call, in index_getnext_tid).
*/
- if (!scan->xactStartedInRecovery)
+ if (scan->ignore_killed_tuples)
+ {
scan->kill_prior_tuple = all_dead;
+ scan->kill_prior_tuple_xmax = kill_prior_tuple_xmax;
+ }
return found;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..290d476926 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "storage/smgr.h"
+#include "storage/standby.h"
/* Minimum tree height for application of fastpath optimization */
#define BTREE_FASTPATH_MIN_LEVEL 2
@@ -424,6 +425,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
{
ItemPointerData htid;
bool all_dead;
+ TransactionId allHeadHtupAmax;
if (_bt_compare(rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
@@ -451,7 +453,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
*/
else if (table_index_fetch_tuple_check(heapRel, &htid,
&SnapshotDirty,
- &all_dead))
+ &all_dead, &allHeadHtupAmax))
{
TransactionId xwait;
@@ -508,7 +510,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
*/
htid = itup->t_tid;
if (table_index_fetch_tuple_check(heapRel, &htid,
- SnapshotSelf, NULL))
+ SnapshotSelf, NULL, NULL))
{
/* Normal case --- it's still live */
}
@@ -579,6 +581,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* Mark buffer with a dirty hint, since state is not
* crucial. Be sure to mark the proper buffer dirty.
*/
+ LogIndexHintIfNeeded(rel, allHeadHtupAmax);
if (nbuf != InvalidBuffer)
MarkBufferDirtyHint(nbuf, true);
else
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..a0fcb27bcf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -266,7 +266,12 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
so->killedItems = (int *)
palloc(MaxIndexTuplesPerPage * sizeof(int));
if (so->numKilled < MaxIndexTuplesPerPage)
+ {
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
+
+ if (!TransactionIdFollowsOrEquals(so->killedItemsXmax, scan->kill_prior_tuple_xmax))
+ so->killedItemsXmax = scan->kill_prior_tuple_xmax;
+ }
}
/*
@@ -372,6 +377,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
+ so->killedItemsXmax = InvalidTransactionId;
so->numKilled = 0;
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..5321b72f29 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -28,6 +28,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "storage/standby.h"
typedef struct BTSortArrayContext
@@ -1794,6 +1795,8 @@ _bt_killitems(IndexScanDesc scan)
if (killedsomething)
{
opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+ LogIndexHintIfNeeded(scan->indexRelation, so->killedItemsXmax);
MarkBufferDirtyHint(so->currPos.buf, true);
}
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 1ce2664014..6a265a1afb 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -36,6 +36,14 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
appendStringInfoString(buf, "; subxid ovf");
}
+static void
+standby_desc_index_hint(StringInfo buf, xl_index_hint* xlrec)
+{
+ appendStringInfo(buf, "latestIndexHintXid %u db %u",
+ xlrec->latestIndexHintXid,
+ xlrec->dbId);
+}
+
void
standby_desc(StringInfo buf, XLogReaderState *record)
{
@@ -66,6 +74,12 @@ standby_desc(StringInfo buf, XLogReaderState *record)
xlrec->dbId, xlrec->tsId,
xlrec->relcacheInitFileInval);
}
+ else if (info == XLOG_INDEX_HINT)
+ {
+ xl_index_hint* xlrec = (xl_index_hint*)rec;
+
+ standby_desc_index_hint(buf, xlrec);
+ }
}
const char *
@@ -84,6 +98,9 @@ standby_identify(uint8 info)
case XLOG_INVALIDATIONS:
id = "INVALIDATIONS";
break;
+ case XLOG_INDEX_HINT:
+ id = "INDEX_HINT";
+ break;
}
return id;
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..0d84ddcd5a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -201,7 +201,8 @@ bool
table_index_fetch_tuple_check(Relation rel,
ItemPointer tid,
Snapshot snapshot,
- bool *all_dead)
+ bool *all_dead,
+ TransactionId *all_dead_xmax)
{
IndexFetchTableData *scan;
TupleTableSlot *slot;
@@ -211,7 +212,7 @@ table_index_fetch_tuple_check(Relation rel,
slot = table_slot_create(rel, NULL);
scan = table_index_fetch_begin(rel);
found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
- all_dead);
+ all_dead, all_dead_xmax);
table_index_fetch_end(scan);
ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5adf956f41..be55c3bdd0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -477,6 +477,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
+ proc->indexIgnoreKilledTuples = true;
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
SHMQueueInit(&(proc->myProcLocks[i]));
/* subxid data must be filled later by GXactLoadSubxactData */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..789a82c201 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,7 @@ CREATE VIEW pg_stat_database_conflicts AS
pg_stat_get_db_conflict_tablespace(D.oid) AS confl_tablespace,
pg_stat_get_db_conflict_lock(D.oid) AS confl_lock,
pg_stat_get_db_conflict_snapshot(D.oid) AS confl_snapshot,
+ pg_stat_get_db_conflict_indexhint(D.oid) AS confl_indexhint,
pg_stat_get_db_conflict_bufferpin(D.oid) AS confl_bufferpin,
pg_stat_get_db_conflict_startup_deadlock(D.oid) AS confl_deadlock
FROM pg_database D;
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 20dee11909..3653ddd7ae 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -112,7 +112,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
bool call_again = false;
if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
- &call_again, NULL))
+ &call_again, NULL, NULL))
{
/*
* All rows referenced by the index entry are dead, so skip the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509a79..9f6e26d6f0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4695,6 +4695,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->n_conflict_tablespace = 0;
dbentry->n_conflict_lock = 0;
dbentry->n_conflict_snapshot = 0;
+ dbentry->n_conflict_indexhint = 0;
dbentry->n_conflict_bufferpin = 0;
dbentry->n_conflict_startup_deadlock = 0;
dbentry->n_temp_files = 0;
@@ -6319,6 +6320,9 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
dbentry->n_conflict_snapshot++;
break;
+ case PROCSIG_RECOVERY_CONFLICT_INDEXHINT:
+ dbentry->n_conflict_indexhint++;
+ break;
case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
dbentry->n_conflict_bufferpin++;
break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a651..39ad9b53f6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -330,6 +330,7 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
break;
case XLOG_STANDBY_LOCK:
+ case XLOG_INDEX_HINT:
break;
case XLOG_INVALIDATIONS:
{
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5dd..d1ac98553d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -576,6 +576,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
#endif
MyPgXact->xmin = snap->xmin;
+ MyProc->indexIgnoreKilledTuples = true;
/* allocate in transaction context */
newxip = (TransactionId *)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a5e85d32f3..5e662dce55 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -227,6 +227,9 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_STREAMING;
+ /* initially true so we always send at least one feedback message */
+ walrcv->sender_has_standby_xmin = true;
+ walrcv->sender_reports_feedback_to_master = false;
/* Fetch information required to start streaming */
walrcv->ready_to_display = false;
@@ -833,6 +836,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ bool senderReportsFeedbackToMaster;
resetStringInfo(&incoming_message);
@@ -862,7 +866,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char) + sizeof(char);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -872,8 +876,11 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
/* read the fields */
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
+ senderReportsFeedbackToMaster = pq_getmsgbyte(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ WalRcv->sender_reports_feedback_to_master = senderReportsFeedbackToMaster;
+
ProcessWalSndrMessage(walEnd, sendTime);
/* If the primary requested a reply, send one immediately */
@@ -1155,15 +1162,13 @@ XLogWalRcvSendHSFeedback(bool immed)
catalog_xmin;
static TimestampTz sendTime = 0;
- /* initially true so we always send at least one feedback message */
- static bool master_has_standby_xmin = true;
/*
* If the user doesn't want status to be reported to the master, be sure
* to exit before doing anything at all.
*/
if ((wal_receiver_status_interval <= 0 || !hot_standby_feedback) &&
- !master_has_standby_xmin)
+ !WalRcv->sender_has_standby_xmin)
return;
/* Get current timestamp. */
@@ -1248,9 +1253,9 @@ XLogWalRcvSendHSFeedback(bool immed)
pq_sendint32(&reply_message, catalog_xmin_epoch);
walrcv_send(wrconn, reply_message.data, reply_message.len);
if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
- master_has_standby_xmin = true;
+ WalRcv->sender_has_standby_xmin = true;
else
- master_has_standby_xmin = false;
+ WalRcv->sender_has_standby_xmin = false;
}
/*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b9d0..cd641c1d27 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1906,6 +1906,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
SpinLockAcquire(&slot->mutex);
MyPgXact->xmin = InvalidTransactionId;
+ MyProc->indexIgnoreKilledTuples = true;
/*
* For physical replication we don't need the interlock provided by xmin
@@ -2092,6 +2093,8 @@ ProcessStandbyHSFeedbackMessage(void)
else
MyPgXact->xmin = feedbackXmin;
}
+
+ WalSndKeepalive(false);
}
/*
@@ -3374,12 +3377,14 @@ static void
WalSndKeepalive(bool requestReply)
{
elog(DEBUG2, "sending replication keepalive");
+ bool am_report_feedback_to_master = !am_cascading_walsender || (WalRcv->sender_has_standby_xmin && WalRcv->sender_reports_feedback_to_master);
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
pq_sendint64(&output_message, sentPtr);
pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendbyte(&output_message, am_report_feedback_to_master ? 1 : 0);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
/* ... and send it wrapped in CopyData */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..a97a92e7fb 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -263,6 +263,7 @@ CreateSharedMemoryAndSemaphores(void)
BTreeShmemInit();
SyncScanShmemInit();
AsyncShmemInit();
+ StandByShmemInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4a5b26c23d..f4f4ec40b2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -62,6 +62,7 @@
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "replication/walreceiver.h"
#define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
@@ -433,6 +434,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->indexIgnoreKilledTuples = true;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
pgxact->delayChkpt = false; /* be sure this is cleared in abort */
@@ -455,6 +457,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->indexIgnoreKilledTuples = true;
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
pgxact->delayChkpt = false; /* be sure this is cleared in abort */
@@ -611,6 +614,7 @@ ProcArrayClearTransaction(PGPROC *proc)
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->indexIgnoreKilledTuples = true;
proc->recoveryConflictPending = false;
/* redundant, but just in case */
@@ -1707,7 +1711,15 @@ GetSnapshotData(Snapshot snapshot)
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
if (!TransactionIdIsValid(MyPgXact->xmin))
+ {
MyPgXact->xmin = TransactionXmin = xmin;
+ /*
+ * Always use and set LP_DEAD bits on the primary. On a standby, do so only
+ * if hot_standby_feedback is enabled (to avoid unnecessary cancellations).
+ */
+ Assert(!RecoveryInProgress() || WalRcv);
+ MyProc->indexIgnoreKilledTuples = !RecoveryInProgress() || (WalRcv->sender_reports_feedback_to_master && WalRcv->sender_has_standby_xmin);
+ }
LWLockRelease(ProcArrayLock);
@@ -2578,7 +2590,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
* this array sufficiently often that we use malloc for the result.
*/
VirtualTransactionId *
-GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
+GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, bool onlyIndexIgnoreKilledTuples)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
@@ -2617,6 +2629,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
{
/* Fetch xmin just once - can't change on us, but good coding */
TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+ bool indexIgnoreKilledTuples = proc->indexIgnoreKilledTuples;
/*
* We ignore an invalid pxmin because this means that backend has
@@ -2627,7 +2640,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
* test here.
*/
if (!TransactionIdIsValid(limitXmin) ||
- (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin)))
+ (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin) && (!onlyIndexIgnoreKilledTuples || indexIgnoreKilledTuples)))
{
VirtualTransactionId vxid;
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 65d3946386..a2d7630dbf 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -558,6 +558,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_INDEXHINT))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_INDEXHINT);
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 3090e57fa4..9a3dd4df85 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -41,6 +41,7 @@ int max_standby_archive_delay = 30 * 1000;
int max_standby_streaming_delay = 30 * 1000;
static HTAB *RecoveryLockLists;
+static HTAB *IndexHintXMinDb;
static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
ProcSignalReason reason);
@@ -57,6 +58,12 @@ typedef struct RecoveryLockListsEntry
List *locks;
} RecoveryLockListsEntry;
+typedef struct IndexHintXMinDbEntry
+{
+ Oid dbOid;
+ TransactionId oldestXMin;
+} IndexHintXMinDbEntry;
+
/*
* InitRecoveryTransactionEnvironment
* Initialize tracking of in-progress transactions in master
@@ -290,6 +297,43 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
}
+static bool
+IsNewerIndexHintXid(Oid dbOid, TransactionId latestIndexHintXid)
+{
+ bool found, result;
+ IndexHintXMinDbEntry* entry;
+
+ LWLockAcquire(IndexHintXMinDbHashLock, LW_SHARED);
+ entry = (IndexHintXMinDbEntry*)hash_search(IndexHintXMinDb,
+ &dbOid,
+ HASH_FIND,
+ &found);
+ result = !found || TransactionIdPrecedes(entry->oldestXMin, latestIndexHintXid);
+ LWLockRelease(IndexHintXMinDbHashLock);
+
+ return result;
+}
+
+static void
+UpsertLatestIndexHintXid(Oid dbOid, TransactionId latestIndexHintXid)
+{
+
+ bool found;
+ IndexHintXMinDbEntry* entry;
+
+ LWLockAcquire(IndexHintXMinDbHashLock, LW_EXCLUSIVE);
+
+ entry = (IndexHintXMinDbEntry*)hash_search(IndexHintXMinDb,
+ &dbOid,
+ HASH_ENTER,
+ &found);
+
+ if (!found || TransactionIdPrecedes(entry->oldestXMin, latestIndexHintXid))
+ entry->oldestXMin = latestIndexHintXid;
+
+ LWLockRelease(IndexHintXMinDbHashLock);
+}
+
void
ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
{
@@ -308,10 +352,12 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
return;
backends = GetConflictingVirtualXIDs(latestRemovedXid,
- node.dbNode);
+ node.dbNode, false);
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ UpsertLatestIndexHintXid(node.dbNode, latestRemovedXid);
}
void
@@ -337,7 +383,7 @@ ResolveRecoveryConflictWithTablespace(Oid tsid)
* We don't wait for commit because drop tablespace is non-transactional.
*/
temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
- InvalidOid);
+ InvalidOid, false);
ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
}
@@ -827,6 +873,22 @@ standby_redo(XLogReaderState *record)
xlrec->dbId,
xlrec->tsId);
}
+ else if (info == XLOG_INDEX_HINT)
+ {
+ xl_index_hint* xlrec = (xl_index_hint*)XLogRecGetData(record);
+ if (InHotStandby)
+ {
+ if (IsNewerIndexHintXid(xlrec->dbId, xlrec->latestIndexHintXid))
+ {
+ VirtualTransactionId* backends = GetConflictingVirtualXIDs(xlrec->latestIndexHintXid,
+ xlrec->dbId,
+ true);
+ ResolveRecoveryConflictWithVirtualXIDs(backends, PROCSIG_RECOVERY_CONFLICT_INDEXHINT);
+
+ UpsertLatestIndexHintXid(xlrec->dbId, xlrec->latestIndexHintXid);
+ }
+ }
+ }
else
elog(PANIC, "standby_redo: unknown op code %u", info);
}
@@ -1094,3 +1156,46 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
nmsgs * sizeof(SharedInvalidationMessage));
XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
}
+
+void
+LogIndexHint(Oid dbId, TransactionId latestIndexHintXid)
+{
+ xl_index_hint xlrec;
+
+ xlrec.dbId = dbId;
+ xlrec.latestIndexHintXid = latestIndexHintXid;
+
+ XLogBeginInsert();
+ XLogRegisterData((char*)&xlrec, sizeof(xl_index_hint));
+
+ XLogInsert(RM_STANDBY_ID, XLOG_INDEX_HINT);
+}
+
+extern void LogIndexHintIfNeeded(Relation rel, TransactionId oldestXmin)
+{
+ if (!RecoveryInProgress() && XLogStandbyInfoActive() && TransactionIdIsValid(oldestXmin)) {
+ if (IsNewerIndexHintXid(rel->rd_node.dbNode, oldestXmin))
+ {
+ LogIndexHint(rel->rd_node.dbNode, oldestXmin);
+ UpsertLatestIndexHintXid(rel->rd_node.dbNode, oldestXmin);
+ }
+ }
+}
+
+void
+StandByShmemInit(void)
+{
+ HASHCTL info;
+
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(IndexHintXMinDbEntry);
+
+ LWLockAcquire(IndexHintXMinDbHashLock, LW_EXCLUSIVE);
+
+ IndexHintXMinDb = ShmemInitHash("IndexHintXMinDb",
+ 64, 64,
+ &info, HASH_ELEM | HASH_BLOBS);
+
+ LWLockRelease(IndexHintXMinDbHashLock);
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..d13ac7a512 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
CLogTruncationLock 44
+IndexHintXMinDbHashLock 45
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f72ea..81a6e4b49c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -389,6 +389,7 @@ InitProcess(void)
MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
+ MyProc->indexIgnoreKilledTuples = true;
MyProc->pid = MyProcPid;
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
@@ -573,6 +574,7 @@ InitAuxiliaryProcess(void)
MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
+ MyProc->indexIgnoreKilledTuples = true;
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
MyProc->roleId = InvalidOid;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0a6f80963b..4e4bdb4420 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2441,6 +2441,9 @@ errdetail_recovery_conflict(void)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
errdetail("User query might have needed to see row versions that must be removed.");
break;
+ case PROCSIG_RECOVERY_CONFLICT_INDEXHINT:
+ errdetail("User query might have see index hint bits that are newer.");
+ break;
case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
errdetail("User transaction caused buffer deadlock with recovery.");
break;
@@ -2909,6 +2912,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_INDEXHINT:
/*
* If we aren't in a transaction any longer then ignore.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7e6a3c1774..b25dc41553 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1469,6 +1469,21 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
PG_RETURN_INT64(result);
}
+Datum
+pg_stat_get_db_conflict_indexhint(PG_FUNCTION_ARGS)
+{
+ Oid dbid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatDBEntry* dbentry;
+
+ if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+ result = 0;
+ else
+ result = (int64)(dbentry->n_conflict_indexhint);
+
+ PG_RETURN_INT64(result);
+}
+
Datum
pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
{
@@ -1512,6 +1527,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
result = (int64) (dbentry->n_conflict_tablespace +
dbentry->n_conflict_lock +
dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_indexhint +
dbentry->n_conflict_bufferpin +
dbentry->n_conflict_startup_deadlock);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..fcac777d59 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1016,6 +1016,7 @@ SnapshotResetXmin(void)
if (pairingheap_is_empty(&RegisteredSnapshots))
{
+ MyProc->indexIgnoreKilledTuples = true;
MyPgXact->xmin = InvalidTransactionId;
return;
}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..d6630b25c2 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -167,6 +167,7 @@ typedef struct GISTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
OffsetNumber *killedItems; /* offset numbers of killed items */
int numKilled; /* number of currently stored items */
+ TransactionId killedItemsXmax;
BlockNumber curBlkno; /* current number of block */
GistNSN curPageLSN; /* pos in the WAL stream when page was read */
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9fc0696096..7b2fd9f84a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -179,6 +179,7 @@ typedef struct HashScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ TransactionId killedItemsXmax;
/*
* Identify all the matching items on a page and save them in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 47fda28daa..8c04800194 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -127,7 +127,7 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
HeapTuple tuple, Buffer *userbuf);
extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
- bool *all_dead, bool first_call);
+ bool *all_dead, bool first_call, TransactionId* all_dead_xmax);
extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid);
extern void setLastTid(const ItemPointer tid);
@@ -208,7 +208,7 @@ extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin, TransactionId *TupleXmax);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..f7e218eea9 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -644,6 +644,7 @@ typedef struct BTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ TransactionId killedItemsXmax;
/*
* If we are doing an index-only scan, these are the tuple storage
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6f0258831f..a5c3f9fcd6 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -110,10 +110,10 @@ typedef struct IndexScanDescData
bool xs_temp_snap; /* unregister snapshot at scan end? */
/* signaling to index AM about killing index tuples */
- bool kill_prior_tuple; /* last-returned tuple is dead */
- bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool kill_prior_tuple; /* last-returned tuple is dead */
+ TransactionId kill_prior_tuple_xmax; /* last-returned tuple xmax */
+ bool ignore_killed_tuples; /* enables killing tuples and
+ * prevents seeing killed tuples */
/* index access method's private state */
void *opaque; /* access-method-specific info */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5a663975b9..31a40929de 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -302,7 +302,8 @@ typedef struct TableAmRoutine
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot,
- bool *call_again, bool *all_dead);
+ bool *call_again, bool *all_dead,
+ TransactionId* all_dead_xmax);
/* ------------------------------------------------------------------------
@@ -1014,12 +1015,12 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot,
- bool *call_again, bool *all_dead)
+ bool *call_again, bool *all_dead, TransactionId *all_dead_xmax)
{
return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
slot, call_again,
- all_dead);
+ all_dead, all_dead_xmax);
}
/*
@@ -1031,7 +1032,8 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
extern bool table_index_fetch_tuple_check(Relation rel,
ItemPointer tid,
Snapshot snapshot,
- bool *all_dead);
+ bool *all_dead,
+ TransactionId *all_dead_xmax);
/* ------------------------------------------------------------------------
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 226c904c04..76a59a98af 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5320,6 +5320,11 @@
proname => 'pg_stat_get_db_conflict_snapshot', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => 'oid',
prosrc => 'pg_stat_get_db_conflict_snapshot' },
+{ oid => '8469',
+ descr => 'statistics: recovery conflicts in database caused by index hint bits',
+ proname => 'pg_stat_get_db_conflict_indexhint', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => 'oid',
+ prosrc => 'pg_stat_get_db_conflict_indexhint' },
{ oid => '3068',
descr => 'statistics: recovery conflicts in database caused by shared buffer pin',
proname => 'pg_stat_get_db_conflict_bufferpin', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb6013f0..029ecd43a0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -603,6 +603,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_conflict_tablespace;
PgStat_Counter n_conflict_lock;
PgStat_Counter n_conflict_snapshot;
+ PgStat_Counter n_conflict_indexhint;
PgStat_Counter n_conflict_bufferpin;
PgStat_Counter n_conflict_startup_deadlock;
PgStat_Counter n_temp_files;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..399855be3e 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -148,6 +148,10 @@ typedef struct
* store semantics, so use sig_atomic_t.
*/
sig_atomic_t force_reply; /* used as a bool */
+
+ sig_atomic_t sender_has_standby_xmin; /* used as a bool */
+
+ sig_atomic_t sender_reports_feedback_to_master; /* used as a bool */
} WalRcvData;
extern WalRcvData *WalRcv;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d21780108b..c6c5d4079b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -125,6 +125,7 @@ struct PGPROC
* though not required. Accessed without lock, if needed.
*/
bool recoveryConflictPending;
+ bool indexIgnoreKilledTuples;
/* Info about LWLock the process is currently waiting for, if any. */
bool lwWaiting; /* true if waiting for an LW lock */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..5a125fa604 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -103,7 +103,8 @@ extern bool IsBackendPid(int pid);
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
bool excludeXmin0, bool allDbs, int excludeVacuum,
int *nvxids);
-extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
+extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
+ bool onlyIndexIgnoreKilledTuples);
extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
extern bool MinimumActiveBackends(int min);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 90607df106..33dcb62e90 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -39,6 +39,7 @@ typedef enum
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_INDEXHINT,
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index cfbe426e5a..0d2d5c6af9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -18,6 +18,7 @@
#include "storage/procsignal.h"
#include "storage/relfilenode.h"
#include "storage/standbydefs.h"
+#include "utils/relcache.h"
/* User-settable GUC parameters */
extern int vacuum_defer_cleanup_age;
@@ -81,9 +82,12 @@ typedef struct RunningTransactionsData
typedef RunningTransactionsData *RunningTransactions;
+extern void StandByShmemInit(void);
extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid);
extern void LogAccessExclusiveLockPrepare(void);
+extern void LogIndexHintIfNeeded(Relation rel, TransactionId oldestXmin);
+
extern XLogRecPtr LogStandbySnapshot(void);
extern void LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInitFileInval);
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index 4876d2eeea..97e1f4fb6d 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -34,6 +34,7 @@ extern void standby_desc_invalidations(StringInfo buf,
#define XLOG_STANDBY_LOCK 0x00
#define XLOG_RUNNING_XACTS 0x10
#define XLOG_INVALIDATIONS 0x20
+#define XLOG_INDEX_HINT 0x30
typedef struct xl_standby_locks
{
@@ -71,4 +72,11 @@ typedef struct xl_invalidations
#define MinSizeOfInvalidations offsetof(xl_invalidations, msgs)
+typedef struct xl_index_hint
+{
+ TransactionId latestIndexHintXid;
+ Oid dbId;
+} xl_index_hint;
+
+
#endif /* STANDBYDEFS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..7be8e99303 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1845,6 +1845,7 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace,
pg_stat_get_db_conflict_lock(d.oid) AS confl_lock,
pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot,
+ pg_stat_get_db_conflict_indexhint(d.oid) AS confl_indexhint,
pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
FROM pg_database d;
On Wed, Apr 8, 2020 at 5:23 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Yes, it is a brilliant idea to use uniqueness to avoid bloating the index. I am
not able to understand all the code now, but I’ll check the patch in more
detail later.
The design is probably simpler than you imagine. The algorithm tries
to be clever about unique indexes in various ways, but don't get
distracted by those details. At a high level we're merely performing
atomic actions that can already happen, and already use the same high
level rules.
There is LP_DEAD bit setting that's similar to what often happens in
_bt_check_unique() already, plus there is a new way in which we can
call _bt_delitems_delete(). Importantly, we're only changing the time
and the reasons for doing these things.
Yes, it is relevant, but I think it is «located in a different plane» and they
complement each other. Because most indexes are not unique these days
and most of the standbys (and even primaries) have long snapshots (up to
minutes, hours) – so, multiple versions of index records are still required for
them. Even if we could avoid multiple versions somehow - it could lead to a very
high rate of query cancellations on standby.
I admit that I didn't really understand what you were trying to do
initially. I now understand a little better.
I think that there is something to be said for calling
_bt_delitems_delete() more frequently on the primary, not necessarily
just for the special case of unique indexes. That might also help on
standbys, at the cost of more recovery conflicts. I admit that it's
unclear how much that would help with the cases that you seem to
really care about. I'm not going to argue that the kind of thing I'm
talking about (actually doing deletion more frequently on the primary)
is truly a replacement for your patch, even if it was generalized
further than my POC patch -- it is not a replacement. As best, it is a
simpler way of "sending the LP_DEAD bits on the primary to the standby
sooner". Even still, I cannot help but wonder how much value there is
in just doing this much (or figuring out some way to make LP_DEAD bits
from the primary usable on the standby). That seems like a far less
risky project, even if it is less valuable to some users.
Let me make sure I understand your position:
You're particularly concerned about cases where there are relatively
few page splits, and the standby has to wait for VACUUM to run on the
primary before dead index tuples get cleaned up. The primary itself
probably has no problem with setting LP_DEAD bits to avoid having
index scans visiting the heap unnecessarily. Or maybe the queries are
different on the standby anyway, so it matters to the standby that
certain index pages get LP_DEAD bits set quickly, though not to the
primary (at least not for the same pages). Setting the LP_DEAD bits on
the standby (in about the same way as we can already on the primary)
is a "night and day" level difference. And we're willing to account
for FPIs on the primary (and the LP_DEAD bits set there) just to be
able to also set LP_DEAD bits on the standby.
Right?
--
Peter Geoghegan
Hello, Peter.
Let me make sure I understand your position:
You're particularly concerned about cases where there are relatively
few page splits, and the standby has to wait for VACUUM to run on the
primary before dead index tuples get cleaned up. The primary itself
probably has no problem with setting LP_DEAD bits to avoid having
index scans visiting the heap unnecessarily. Or maybe the queries are
different on the standby anyway, so it matters to the standby that
certain index pages get LP_DEAD bits set quickly, though not to the
primary (at least not for the same pages). Setting the LP_DEAD bits on
the standby (in about the same way as we can already on the primary)
is a "night and day" level difference.
Right?
Yes, exactly.
My initial attempts were too naive (first and second letter) - but you and
Andres gave me some hints on how to make it reliable.
The main goal is to make the standby able to use and set LP_DEAD almost
as the primary does. Of course, the standby could receive LP_DEAD bits with an FPI from
the primary at any moment - so, some kind of cancellation logic is required. Also,
we should keep the frequency of query cancellation at the same level - for that
reason, LP_DEAD bits are better used only by standbys with
hot_standby_feedback enabled. So, I am just repeating myself from the previous
letter here.
And we're willing to account
for FPIs on the primary (and the LP_DEAD bits set there) just to be
able to also set LP_DEAD bits on the standby.
Yes, metaphorically speaking - the master sends a WAL record with a letter:
"Attention, it is possible to receive an FPI from me with LP_DEAD set for a tuple
with xmax=ABCD, so, if you are using LP_DEAD - your xmin should be greater, or you
should cancel yourself". And such a letter is required only if this horizon has
moved forward.
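That "letter" is the XLOG_INDEX_HINT record in the patch above; condensed from the standby_redo() hunk, the standby-side reaction is roughly (names as in the patch, details omitted):

    /* replaying an XLOG_INDEX_HINT record on the standby (sketch) */
    if (InHotStandby &&
        IsNewerIndexHintXid(xlrec->dbId, xlrec->latestIndexHintXid))
    {
        /*
         * Cancel only backends that actually rely on LP_DEAD bits
         * (indexIgnoreKilledTuples) and whose xmin is older than the
         * advertised horizon.
         */
        VirtualTransactionId *backends =
            GetConflictingVirtualXIDs(xlrec->latestIndexHintXid,
                                      xlrec->dbId,
                                      true);    /* onlyIndexIgnoreKilledTuples */

        ResolveRecoveryConflictWithVirtualXIDs(backends,
                                               PROCSIG_RECOVERY_CONFLICT_INDEXHINT);
        UpsertLatestIndexHintXid(xlrec->dbId, xlrec->latestIndexHintXid);
    }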
And... Looks like it works - queries are much faster, results look correct,
additional WAL traffic is low, cancellations are at the same level... As far as I
can see - the basic concept is correct and effective (but of course, I
could be missing something).
The patch is hard to review as a whole - I'll try to split it into several patches
later. And of course, a lot of polishing is required (and there are a few places
I am not sure about yet).
Thanks,
Michail.
Hello, hackers.
Sorry for necroposting, but if someone is interested - I hope the patch is
ready now and available in the other thread (1).
Thanks,
Michail.
[1]: /messages/by-id/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7=9Y92+US_WHGOoQevg@mail.gmail.com
On Wed, Jan 27, 2021 at 11:30 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Sorry for necroposting, but if someone is interested - I hope the patch is ready now and available in the other thread (1).
I wonder if it would help to not actually use the LP_DEAD bit for
this. Instead, you could use the currently-unused-in-indexes
LP_REDIRECT bit. The general idea here is that this status bit is
interpreted as a "recently dead" bit in nbtree. It is always something
that is true *relative* to some *ephemeral* coarse-grained definition
of recently dead. Whether or not "recently dead" means "dead to my
particular MVCC snapshot" can be determined using some kind of
in-memory state that won't survive a crash (or a per-index-page
epoch?). We still have LP_DEAD bits in this world, which work in the
same way as now (and so unambiguously indicate that the index tuple is
dead-to-all, at least on the primary). I think that this idea of a
"recently dead" bit might be able to accomplish a few goals all at
once, including your specific goal of setting "hint bits" in indexes.
The issue here is that it would also be nice to use a "recently dead"
bit on the primary, but if you do that then maybe you're back to the
original problem. OTOH, I also think that we could repurpose the
LP_NORMAL bit in index AMs, so we could potentially have 3 different
definitions of dead-ness without great difficulty!
Perhaps this doesn't seem all that helpful, since I am expanding scope
here for a project that is already very difficult. And maybe it just
isn't helpful -- I don't know. But it is at least possible that
expanding scope could actually *help* your case a lot, even if you
only ever care about your original goal. My personal experience with
nbtree patches is just that: sometimes *expanding* scope actually
makes things *easier*, not harder. This is sometimes called "The
Inventor's Paradox":
https://en.wikipedia.org/wiki/Inventor%27s_paradox
Consider the example of my work on nbtree in PostgreSQL 12. It was
actually about 6 or 7 enhancements, not just 1 big enhancement -- it
is easy to forget that now. I think that it was *necessary* to add at
least 5 of these enhancements at the same time (maybe 2 or so really
were optional). This is deeply counter-intuitive, but still seems to
be true in my case. The specific reasons why I believe that this is
true of the PostgreSQL 12 work are complicated, but it boils down to
this: the ~5 related-though-distinct enhancements were individually
not all that compelling (certainly not compelling enough to make
on-disk changes for), but were much greater than the sum of their
parts when considered together. Maybe I got lucky there.
More generally, the nbtree stuff that I worked on in 12, 13, and now
14 actually feels like one big project to me. I will refrain from
explaining exactly why that is right now, but you might be very
surprised at how closely related it all is. I didn't exactly plan it
that way, but trying to see things in the most general terms turned
out to be a huge asset to me. If there are several independent reasons
to move in one particular direction all at once, you can generally
afford to be wrong about a lot of things without it truly harming
anybody. Plus it's easy to see that you will not frustrate future work
that is closely related but distinct when that future work is
*directly enabled* by what you're doing.
What's more, the extra stuff I'm talking about probably has a bunch of
other benefits on the primary, if done well. Consider how the deletion
stuff with LP_DEAD bits now considers "extra" index tuples to delete
when they're close by. We could even do something like that with these
LP_REDIRECT/recently dead bits on the primary.
I understand that it's hard to have a really long term outlook. I
think that that's almost necessary when working on a project like
this, though.
Don't forget that this works both ways. Maybe a person that wanted to
do this "recently dead" stuff (which sounds really interesting to me
right now) would similarly be compelled to consider the bigger
picture, including the question of setting hint bits on standbys --
this other person had better not make that harder in the future, for
the same reasons (obviously what you want to do here makes sense, it
just isn't everything to everybody). This is yet another way in which
expanding scope can help.
--
Peter Geoghegan
On Wed, Jan 27, 2021 at 5:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
The issue here is that it would also be nice to use a "recently dead"
bit on the primary, but if you do that then maybe you're back to the
original problem. OTOH, I also think that we could repurpose the
LP_NORMAL bit in index AMs, so we could potentially have 3 different
definitions of dead-ness without great difficulty!
To be clear, what I mean is that you currently have two bits in line
pointers. In an nbtree leaf page we only currently use one -- the
LP_DEAD bit. But bringing the second bit into it allows us to have a
representation for two additional states (not just one), since of
course the meaning of the second bit can be interpreted using the
LP_DEAD bit. You could, for example, have encodings for each of the
following distinct per-LP states, while still preserving on-disk
compatibility:
"Not known to be dead in any sense" (0), "Unambiguously dead to all"
(what we now simply call LP_DEAD), "recently dead on standby"
(currently-spare bit is set), and "recently dead on primary" (both
'lp_flags' bits set).
Applying FPIs on the standby would have to be careful to preserve a
standby-only bit. I'm probably not thinking of every subtlety, but
"putting all of the pieces of the puzzle together for further
consideration" is likely to be a useful exercise.
--
Peter Geoghegan
Hello, Peter.
I wonder if it would help to not actually use the LP_DEAD bit for
this. Instead, you could use the currently-unused-in-indexes
LP_REDIRECT bit.
Hm… Sounds very promising - an additional bit is a lot in this situation.
Whether or not "recently dead" means "dead to my
particular MVCC snapshot" can be determined using some kind of
in-memory state that won't survive a crash (or a per-index-page
epoch?).
Do you have any additional information about this idea? (maybe some
thread). What kind of “in-memory state that won't survive a crash” do you
mean, and how would it deal with bits already flushed before a crash?
"Not known to be dead in any sense" (0), "Unambiguously dead to all"
(what we now simply call LP_DEAD), "recently dead on standby"
(currently-spare bit is set), and "recently dead on primary" (both
'lp_flags' bits set).
Hm. What about this way:
10 - dead to all on standby (LP_REDIRECT)
11 - dead to all on primary (LP_DEAD)
01 - future “recently DEAD” on primary (LP_NORMAL)
In such a case the standby could just always ignore all LP_DEAD bits from the
primary (the standby will lose its own hints after an FPI - but that is not a big
deal). So, we don’t need any conflict resolution (or any additional WAL
records). Also, the hot_standby_feedback-related stuff is not required anymore.
All we need to do (leaving out the details, of course) is correctly check whether it is
safe to set LP_REDIRECT on the standby according to `minRecoveryPoint` (to
avoid consistency issues during crash recovery). Or, probably, introduce
some kind of `indexHintMinRecoveryPoint`.
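A rough sketch of such a check, just to illustrate the idea (the helper name is made up; during recovery, XLogNeedsFlush() performs exactly the comparison against minRecoveryPoint):

    /*
     * Hypothetical helper - one possible shape of the check: only allow a
     * standby-only index hint if the page, as currently replayed, is already
     * covered by minRecoveryPoint. Whether the page LSN is the right thing
     * to compare against is part of the open question above.
     */
    static bool
    StandbySafeToSetIndexHint(Page page)
    {
        return !XLogNeedsFlush(PageGetLSN(page));
    }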
Also, looks like both GIST and HASH indexes do not use LP_REDIRECT.
So, it will remove more than 80% of the current patch complexity!
Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
It already blocks the standby from setting hint bits in the *heap*. And,
probably, it will block the standby from setting *index* hint bits aggressively.
Thanks a lot,
Michail.
On Thu, Jan 28, 2021 at 10:16 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
I wonder if it would help to not actually use the LP_DEAD bit for
this. Instead, you could use the currently-unused-in-indexes
LP_REDIRECT bit.
Hm… Sounds very promising - an additional bit is a lot in this situation.
Yeah, it would help a lot. But those bits are precious. So it makes
sense to think about what to do with both of them in index AMs at the
same time. Otherwise we risk missing some important opportunity.
Whether or not "recently dead" means "dead to my
particular MVCC snapshot" can be determined using some kind of
in-memory state that won't survive a crash (or a per-index-page
epoch?).
Do you have any additional information about this idea? (maybe some thread). What kind of “in-memory state that won't survive a crash” do you mean, and how would it deal with bits already flushed before a crash?
Honestly, that part wasn't very well thought out. A lot of things might work.
Some kind of "recently dead" bit is easier on the primary. If we have
recently dead bits set on the primary (using a dedicated LP bit for
original execution recently-dead-ness), then we wouldn't even
necessarily have to change anything about index scans/visibility at
all. There would still be a significant benefit if we simply used the
recently dead bits when considering which heap blocks nbtree simple
deletion will visit inside _bt_deadblocks() -- in practice there would
probably be no real downside from assuming that the recently dead bits
are now fully dead (it would sometimes be wrong, but not enough to
matter, probably only when there is a snapshot held for way way too
long).
Deletion in indexes can work well while starting off with only an
*approximate* idea of which index tuples will be safe to delete --
this is a high level idea behind my recent commit d168b666823. It
seems very possible that that could be pushed even further here on the
primary.
On standbys (which set standby recently dead bits) it will be
different, because you need "index hint bits" set that are attuned to
the workload on the standby, and because you don't ever use the bit to
help with deleting anything on the standby (that all happens during
original execution).
BTW, what happens when the page splits on the primary? Does your
patch "move over" the LP_DEAD bits to each half of the split?
Hm. What about this way:
10 - dead to all on standby (LP_REDIRECT)
11 - dead to all on primary (LP_DEAD)
01 - future “recently DEAD” on primary (LP_NORMAL)
Not sure.
Also, looks like both GIST and HASH indexes do not use LP_REDIRECT.
Right -- if we were to do this, the idea would be that it would apply
to all index AMs that currently have (or will ever have) something
like the LP_DEAD bit stuff. The GiST and hash support for index
deletion is directly based on the original nbtree version, and there
is no reason why we cannot eventually do all this stuff in at least
those three AMs.
There are already some line-pointer level differences in index AMs:
LP_DEAD items have storage in index AMs, but not in heapam. This
all-table-AMs/all-index-AMs divide in how item pointers work would be
preserved.
Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
Not offhand.
--
Peter Geoghegan
Hello, Peter.
Yeah, it would help a lot. But those bits are precious. So it makes
sense to think about what to do with both of them in index AMs at the
same time. Otherwise we risk missing some important opportunity.
Hm. I was trying to "expand the scope" as you said and got an idea... Why
we even should do any conflict resolution for hint bits? Or share precious
LP bits?
The only way the standby could get an "invalid" hint bit is via an FPI from the
primary. We could just use the current *btree_mask* infrastructure and
clear all "probably invalid" bits from the primary on the standby while
applying the FPI (in `XLogReadBufferForRedoExtended`)!
For binary compatibility, we could use one of the `btpo_flags` bits to mark the
page as "primary bits masked". The same approach would work for hash\gist too.
No conflicts, only the LP_DEAD bit is used by the standby, `btpo_flags` has many
free bits for now, it is easy to implement, and the page content of primary\standby
already differs in these bits...
Looks like an easy and effective solution to me.
What do you think?
Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?
Not offhand.
If so, it looks like it is not a bad idea to move minRecoveryPoint forward
from time to time (for more aggressive standby index hint bits). On each
`xl_running_xacts` (about every 15s), for example.
BTW, what happens when the page splits on the primary? Does your
patch "move over" the LP_DEAD bits to each half of the split?
That part is not changed in any way.
Thanks,
Michail.
On Sat, Jan 30, 2021 at 9:11 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Yeah, it would help a lot. But those bits are precious. So it makes
sense to think about what to do with both of them in index AMs at the
same time. Otherwise we risk missing some important opportunity.
Hm. I was trying to "expand the scope" as you said and got an idea... Why should we even do any conflict resolution for hint bits? Or share precious LP bits?
What does it mean today when an LP_DEAD bit is set on a standby? I
don't think that it means nothing at all -- at least not if you think
about it in the most general terms.
In a way it actually means something like "recently dead" even today,
at least in one narrow specific sense: recovery may end, and then we
actually *will* do *something* with the LP_DEAD bit, without directly
concerning ourselves with when or how each LP_DEAD bit was set.
During recovery, we will probably always have to consider the
possibility that LP_DEAD bits that get set on the primary may be
received by a replica through some implementation detail (e.g. LP_DEAD
bits are set in FPIs we replay, maybe even some other thing that
neither of us have thought of). We can't really mask LP_DEAD bits from
the primary in recovery anyway, because of stuff like page-level
checksums. I suspect that it just isn't useful to fight against that.
These preexisting assumptions are baked into everything already.
Why should we assume that *anybody* understands all of the ways in
which that is true?
Even today, "LP_DEAD actually means a limited kind of 'recently dead'
during recovery + hot standby" is something that is true (as I just
went into), but at the same time also has a fuzzy definition. My gut
instinct is to be conservative about that. As I suggested earlier, you
could invent some distinct kind of "recently dead" that achieves your
goals in a way that is 100% orthogonal.
The two unused LP dead bits (unused in indexes, though not tables) are
only precious if we assume that somebody will eventually use them for
something -- if nobody ever uses them then they're 100% useless. The
number of possible projects that might end up using the two bits for
something is not that high -- certainly not more than 3 or 4. Besides,
it is always a good idea to keep the on-disk format as simple and
explicit as possible -- it should be easy to analyze forensically in
the event of bugs or some other kind of corruption, especially by
using tools like amcheck.
As I said in my earlier email, we can even play tricks during page
deletion by treating certain kinds of "recently dead" bits as
approximate things. As a further example, we can "rely" on the
"dead-to-all but only on standby" bits when recovery ends, during a
subsequent write transaction. We can do so by simply using them in
_bt_deadblocks() as if they were LP_DEAD (we'll recheck heap tuples in
heapam.c instead).
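For illustration, the candidate-collection loop on the primary might look like this (ItemIdIsRecentlyDead() is hypothetical, and the surrounding locals are the usual ones for such a page scan) -- anything that turns out not to be dead is filtered out against the heap later:

    /* sketch: treat the hypothetical "recently dead" state like LP_DEAD when
     * gathering deletion candidates on a leaf page */
    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid) || ItemIdIsRecentlyDead(itemid))
            deletable[ndeletable++] = offnum;
    }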
The only way the standby could get an "invalid" hint bit is via an FPI from the primary. We could just use the current *btree_mask* infrastructure and clear all "probably invalid" bits from the primary on the standby while applying the FPI (in `XLogReadBufferForRedoExtended`)!
I don't like that idea. Apart from what I said already, you're
assuming that setting LP_DEAD bits in indexes on the primary won't
eventually have some value on the standby after it is promoted and can
accept writes -- they really do have meaning and value on standbys.
Plus you'll introduce new overhead for this process during replay,
which creates significant overhead -- often most leaf pages have some
LP_DEAD bits set during recovery.
For binary compatibility, we could use one of the `btpo_flags` bits to mark the page as "primary bits masked". The same approach would work for hash\gist too.
I agree that there are plenty of btpo_flags bits. However, I have my
doubts about this.
Why shouldn't this break page-level checksums (or wal_log_hints) in
some way? What about pg_rewind, some eventual implementation of
incremental backups, etc? I suspect that it will be necessary to
invent some entirely new concept that is like a hint bit, but works on
standbys (without breaking anything else).
No conflicts, only the LP_DEAD bit is used by the standby, `btpo_flags` has many free bits for now, it is easy to implement, and the page content of primary\standby already differs in these bits...
Looks like an easy and effective solution to me.
Note that the BTP_HAS_GARBAGE flag (which goes in btpo_flags) was
deprecated in commit cf2acaf4. It was never a reliable indicator of
whether or not some LP_DEAD bits are set in the page. And I think that
it never could be made to work like that.
Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
Not offhand.
If so, it looks like it is not a bad idea to move minRecoveryPoint forward from time to time (for more aggressive standby index hint bits). On each `xl_running_xacts` (about every 15s), for example.
It's necessary for restartpoints (i.e. the standby equivalent of a
checkpoint) to do that in order to ensure that we won't break
checksums. This whole area is scary to me:
/messages/by-id/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
Since, as I said, it's already true that LP_DEAD bits on standbys are
some particular kind of "recently dead bit", even today, any design
that uses LP_DEAD bits in some novel new way (also on the standby) is
very hard to test. It might easily have very subtle bugs -- obviously
a recently dead bit relates to a tuple pointing to a logical row that
is bound to become dead soon enough. The difference between dead and
recently dead is already blurred, and I would rather not go there.
If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path
from the XLogNeedsFlush() function. Maybe you can do some kind of
rudimentary "masking" for the in recovery case at the point that we
*write out* a buffer (*not* at the point hint bits are initially set)
-- maybe this could happen near to or inside FlushBuffer(), and maybe
only when checksums are enabled? I'm unsure.
BTW, what happens when the page splits on the primary? Does your
patch "move over" the LP_DEAD bits to each half of the split?
That part is not changed in any way.
Maybe it's okay to assume that it's no loss to throw away hint bits
set on the standby, because in practice deletion on the primary will
usually do the right thing anyway. But you never said that. I think
that you should take an explicit position on this question -- make it
a formal part of your overall design.
--
Peter Geoghegan
On Sat, Jan 30, 2021 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path
from the XLogNeedsFlush() function. Maybe you can do some kind of
rudimentary "masking" for the in recovery case at the point that we
*write out* a buffer (*not* at the point hint bits are initially set)
-- maybe this could happen near to or inside FlushBuffer(), and maybe
only when checksums are enabled? I'm unsure.
I should point out that hint bits in heap pages are really not like
LP_DEAD bits in indexes -- if they're similar at all then the
similarity is only superficial/mechanistic. In fact, the term "hint
bits in indexes" does not seem useful at all to me, for this reason.
Heap hint bits indicate whether or not the xmin or xmax in a heap
tuple header committed or aborted. We cache the commit or abort status
of one particular XID in the heap tuple header. Notably, this
information alone tells us nothing about whether or not the tuple
should be visible to any given MVCC snapshot. Except perhaps in cases
involving aborted transactions -- but that "exception" is just a
limited special case (and less than 1% of transactions are aborted in
almost all environments anyway).
In contrast, a set LP_DEAD bit in an index is all the information we
need to know that the tuple is dead, and can be ignored completely
(except during hot standby, where at least today we assume nothing
about the status of the tuple, since that would be unsafe). Generally
speaking, the index page LP_DEAD bit is "direct" visibility
information about the tuple, not information about XIDs that are
stored in the tuple header. So a set LP_DEAD bit in an index is
actually like an LP_DEAD-set line pointer in the heap (that's the
closest equivalent in the heap, by far). It's also like a frozen heap
tuple (except it's dead-to-all, not live-to-all).
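That "ignored completely" behavior is the whole mechanism on the read side -- simplified from the way nbtree walks a leaf page today (currently ignore_killed_tuples is false for transactions started during recovery, which is exactly the limitation being discussed):

    /* simplified from the nbtree leaf-page scan: a set LP_DEAD bit alone is
     * enough to skip the entry, with no heap access and no XID lookups */
    itemid = PageGetItemId(page, offnum);
    if (scan->ignore_killed_tuples && ItemIdIsDead(itemid))
    {
        offnum = OffsetNumberNext(offnum);
        continue;
    }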
The difference may be subtle, but it's important here -- it justifies
inventing a whole new type of LP_DEAD-style status bit that gets set
only on standbys. Even today, heap tuples can have hint bits
"independently" set on standbys, subject to certain limitations needed
to avoid breaking things like data page checksums. Hint bits are
ultimately just a thing that remembers the status of transactions that
are known committed or aborted, and so can be set immediately after
the relevant xact commits/aborts (at least on the primary, during
original execution). A long-held MVCC snapshot is never a reason to
not set a hint bit in a heap tuple header (during original execution
or during hot standby/recovery). Of course, a long-held MVCC snapshot
*is* often a reason why we cannot set an LP_DEAD bit in an index.
Conclusion: The whole minRecoveryPoint question that you're trying to
answer to improve things for your patch is just the wrong question.
Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
would be useful to set "true hint bits" on standbys earlier, and maybe
thinking about minRecoveryPoint would help with that problem, but that
doesn't help your index-related patch -- because indexes simply don't
have true hint bits.
--
Peter Geoghegan
Hello, Peter.
Thanks a lot for your comments.
Here are some of my thoughts related to the “masked bits” solution and
your comments:
During recovery, we will probably always have to consider the
possibility that LP_DEAD bits that get set on the primary may be
received by a replica through some implementation detail (e.g. LP_DEAD
bits are set in FPIs we replay, maybe even some other thing that
neither of us have thought of).
It is fine for the standby to receive a page from any source: `btpo_flags`
should have some kind of “LP_DEAD safe for standby” bit set to allow new bits
to be set and old ones to be read.
We can't really mask LP_DEAD bits from
the primary in recovery anyway, because of stuff like page-level
checksums. I suspect that it just isn't useful to fight against that.
As far as I can see - there is no problem here. Checksums already differ
for both heap and index pages on the standby and the primary. Checksums are
calculated just before the page is written to disk (not after applying an FPI).
So, masking the page while *applying* the FPI is semantically the same as
setting a bit in it one nanosecond later.
And `btree_mask` (and the other mask functions) is already used by consistency
checks to exclude LP_DEAD.
Plus you'll introduce new overhead for this process during replay,
which creates significant overhead -- often most leaf pages have some
LP_DEAD bits set during recovery.
I hope it is not significant, because FPIs are not too frequent, and the positive
effect should easily outweigh the additional CPU cycles spent in `btree_mask` (and the
page is already in the CPU cache at that moment).
I don't like that idea. Apart from what I said already, you're
assuming that setting LP_DEAD bits in indexes on the primary won't
eventually have some value on the standby after it is promoted and can
accept writes -- they really do have meaning and value on standbys.
I think it is fine to lose part of the LP_DEAD bits on promotion (in practice
they can be received only via FPI). They will be set again on the first scan
anyway. Also, bits set by the standby may be used by a newly promoted primary if
we honor the OldestXmin of the previous primary while setting them (we just need to
add OldestXmin to xl_running_xacts and include it in the dead-horizon on the
standby).
Why shouldn't this break page-level checksums (or wal_log_hints) in
some way? What about pg_rewind, some eventual implementation of
incremental backups, etc? I suspect that it will be necessary to
invent some entirely new concept that is like a hint bit, but works on
standbys (without breaking anything else).
As I said before - applying the mask on the *standby* will not break any
checksums, because the page is still dirty after that (and it is even
possible to call `PageSetChecksumInplace` for additional paranoia).
Actual checksums on the standby and the primary already have different values
(probably in most of the pages, because of LP_DEAD and “classic” hint bits).
If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path
from the XLogNeedsFlush() function. Maybe you can do some kind of
rudimentary "masking" for the in recovery case at the point that we
*write out* a buffer (*not* at the point hint bits are initially set)
-- maybe this could happen near to or inside FlushBuffer(), and maybe
only when checksums are enabled? I'm unsure.
Not sure I was able to understand your idea here, sorry.
The difference may be subtle, but it's important here -- it justifies
inventing a whole new type of LP_DEAD-style status bit that gets set
only on standbys. Even today, heap tuples can have hint bits
"independently" set on standbys, subject to certain limitations needed
to avoid breaking things like data page checksums
Yes, and I see three major ways to implement it in the current
infrastructure:
1) Use LP_REDIRECT (or another free value) instead of LP_DEAD on standby
(a rough sketch is below)
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts
(like here -
/messages/by-id/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7=9Y92+US_WHGOoQevg@mail.gmail.com
)
3) Mask index FPIs during replay on hot standby and mark the page as
“primary LP_DEAD free” in btpo_flags
Of course, each variant requires some special additional things to keep
everything safe.
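To illustrate option 1: index pages never use LP_REDIRECT, so that lp_flags
value is free there and could mark an item as dead according to the
standby's own horizon. A minimal sketch (the macro names are mine):

    /* Hypothetical helpers; index pages never use LP_REDIRECT, so the value
     * is free and can mean "dead according to the standby's horizon" */
    #define ItemIdMarkDeadOnStandby(itemId) \
        ((itemId)->lp_flags = LP_REDIRECT)

    #define ItemIdIsDeadOnStandby(itemId) \
        ((itemId)->lp_flags == LP_REDIRECT)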
As far as I can see, the limitations in SetHintBits are related to
XLogNeedsFlush (which checks minRecoveryPoint in the standby case).
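For reference, that check looks roughly like this (simplified from
SetHintBits() in heapam_visibility.c):

    static inline void
    SetHintBits(HeapTupleHeader tuple, Buffer buffer,
                uint16 infomask, TransactionId xid)
    {
        if (TransactionIdIsValid(xid))
        {
            /* NB: xid must be known committed here! */
            XLogRecPtr  commitLSN = TransactionIdGetCommitLSN(xid);

            if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
                BufferGetLSNAtomic(buffer) < commitLSN)
            {
                /* not flushed and no LSN interlock, so don't set hint */
                return;
            }
        }

        tuple->t_infomask |= infomask;
        MarkBufferDirtyHint(buffer, true);
    }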
Conclusion: The whole minRecoveryPoint question that you're trying to
answer to improve things for your patch is just the wrong question.
Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
would be useful to set "true hint bits" on standbys earlier, and maybe
thinking about minRecoveryPoint would help with that problem, but that
doesn't help your index-related patch -- because indexes simply don't
have true hint bits.
Attention to minRecoveryPoint is required because of the possible situation
described here -
/messages/by-id/CANtu0ojwFcSQpyCxrGxJuLVTnOBSSrzKuF3cB_yCk0U-X-wpGw@mail.gmail.com
The source of the potential problem is the fact that setting LP_DEAD does
not change the page LSN, which could cause consistency issues during crash
recovery.
Thanks,
Michail.
On Mon, Feb 1, 2021 at 1:19 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
It is fine for the standby to receive a page from any source: `btpo_flags` should carry some kind of “LP_DEAD safe for standby” bit that allows new bits to be set and existing ones to be read.
We can't really mask LP_DEAD bits from
the primary in recovery anyway, because of stuff like page-level
checksums. I suspect that it just isn't useful to fight against that.

As far as I can see, there is no problem here. Checksums already differ for heap and index pages between standby and primary.
AFAICT that's not true, at least not in any practical sense. See the
comment in the middle of MarkBufferDirtyHint() that begins with "If we
must not write WAL, due to a relfilenode-specific...", and see the
"Checksums" section at the end of src/backend/storage/page/README. The
last paragraph in the README is particularly relevant:
New WAL records cannot be written during recovery, so hint bits set during
recovery must not dirty the page if the buffer is not already dirty, when
checksums are enabled. Systems in Hot-Standby mode may benefit from hint bits
being set, but with checksums enabled, a page cannot be dirtied after setting a
hint bit (due to the torn page risk). So, it must wait for full-page images
containing the hint bit updates to arrive from the primary.
IIUC the intention is that MarkBufferDirtyHint() is a no-op during hot
standby when we successfully set a hint bit, though only in the
XLogHintBitIsNeeded() case. So we don't really dirty the page within
SetHintBits() in this specific scenario. That is, the buffer header
won't actually get marked BM_DIRTY or BM_JUST_DIRTIED within
MarkBufferDirtyHint() when in Hot Standby + XLogHintBitIsNeeded().
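The relevant branch looks roughly like this (heavily abridged from
MarkBufferDirtyHint() in bufmgr.c):

    if (XLogHintBitIsNeeded() &&
        (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
    {
        /*
         * If we must not write WAL, due to a relfilenode-specific
         * condition or being in recovery, don't dirty the page.  We can
         * set the hint, just not dirty the page as a result so the hint
         * is lost when we evict the page or shutdown.
         */
        if (RecoveryInProgress())
            return;     /* hint stays in memory, buffer header untouched */

        /* ...otherwise possibly emit an FPI, then mark the buffer dirty... */
    }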
What else could work at all? The only "alternative" is to write an
FPI, just like on the primary -- but writing new WAL records is not
possible during Hot Standby!
A comment within MarkBufferDirtyHint() spells it out directly -- we
can have hint bits set in hot standby independently of the primary,
but it works in a way that makes sure that the hint bits never make it
out to disk:
"We can set the hint, just not dirty the page as a result so the hint
is lost when we evict the page or shutdown"
You may be right in some narrow sense -- checksums can differ on a
standby. But that's like saying that it's sometimes okay to have a
torn page on disk. Yes, it's okay, but only because we expect the
problem during crash recovery, and can reliably repair it.
Checksums are calculated right before the page is written to disk (not after applying an FPI). So, masking the page while *applying* the FPI is semantically the same as setting a bit in it one nanosecond afterwards.
And `btree_mask` (and the other mask functions) are already used by consistency checks to exclude LP_DEAD.
I don't see how that is relevant. btree_mask() is only used by
wal_consistency_checking, which is mostly just for Postgres hackers.
--
Peter Geoghegan
Hello, Peter.
AFAICT that's not true, at least not in any practical sense. See the
comment in the middle of MarkBufferDirtyHint() that begins with "If we
must not write WAL, due to a relfilenode-specific...", and see the
"Checksums" section at the end of src/backend/storage/page/README. The
last paragraph in the README is particularly relevant:
I have attached a TAP test to demonstrate how easily checksums on standby
and primary start to differ. The test shows two different scenarios, for
both heap and index pages (with the bit being set on both standby and
primary).
Yes, MarkBufferDirtyHint does not mark the page as dirty, so hint bits on
the standby can easily be lost. But it leaves the page dirty if it already
is dirty (or the page could be marked dirty by WAL replay later). So, hint
bits can easily be flushed and taken into account during checksum
calculation on both standby and primary.
"We can set the hint, just not dirty the page as a result so the hint
is lost when we evict the page or shutdown"
Yes, it is not allowed to mark a page dirty just because of hint bits on
standby, because we could otherwise end up with this:
CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT
But this scenario is totally fine:
CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK
And, as a result, this is fine too:
CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY
So, my point here is that it is fine to mask LP_DEAD bits during replay,
because they already differ between standby and primary. And it is fine to
set and flush hint bits (and LP_DEADs) on standby because they can already
be easily flushed (we just need to consider minRecoveryPoint and, probably,
OldestXmin from the primary in the LP_DEAD case to make promotion work
smoothly).
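As a purely hypothetical sketch of the guard I mean (the names are mine;
"deadLSN" stands for whatever LSN proves the tuple is dead for everyone the
standby must care about):

    /*
     * Hypothetical guard: only set the hint if the knowledge that made the
     * tuple dead is already covered by minRecoveryPoint.  During recovery
     * XLogNeedsFlush() compares its argument against minRecoveryPoint,
     * which is exactly what we need here.
     */
    if (RecoveryInProgress() && XLogNeedsFlush(deadLSN))
        return;                 /* too early, skip the hint for now */

    ItemIdMarkDead(PageGetItemId(page, offnum));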
And `btree_mask` (and the other mask functions) are already used by
consistency checks to exclude LP_DEAD.
I don't see how that is relevant. btree_mask() is only used by
wal_consistency_checking, which is mostly just for Postgres hackers.
I was thinking about the possibility of reusing these functions for masking
during replay.
Thanks,
Michail.
Hello, Peter.
If you are interested, a possible patch (based on masking FPIs during
replay) was sent with some additional explanation and graphics to [1].
At the moment I am unable to find any "incorrectness" in it.
Thanks again for your comments.
Michail.
[1]: /messages/by-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH=cMXZSf+nzvg@mail.gmail.com