GiST VACUUM

Started by Andrey Borodin · 82 messages · pgsql-hackers
#1 Andrey Borodin
amborodin@acm.org

Hi, hackers!

Here's a new version of the GiST VACUUM patch set, aimed at CF 2018-09.
The original thread can be found at [0].

==Purpose==
Currently GiST never reuses pages, even if they are completely empty. This can lead to bloat if many index tuples are deleted from a part of the key space that does not receive newly inserted tuples.
The first patch fixes that issue: empty pages are collected and reused for new page splits.

The second patch improves the scanning algorithm to read GiST blocks in physical order. This can dramatically increase scan speed, especially on HDDs.
This scan uses a relatively big chunk of memory to build a map of the whole GiST graph. If there is not enough maintenance memory, the patch falls back to the old GiST VACUUM (from the first patch).

==How logical scan works==
GiST VACUUM scans the graph with a DFS to find removed tuples on leaf pages. It remembers internal pages that reference completely empty leaf pages.
Then, in an additional step, these pages are rescanned to delete the references and mark the leaf pages as free.
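The two phases above can be sketched with a toy in-memory model. Everything here is hypothetical (the real patch works on buffered GiST pages, not a plain array), but it shows the shape of the algorithm, including the keep-one-downlink limitation described later in this mail:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: either a leaf with a tuple count, or an internal page
 * holding child block numbers.  Names are invented for this sketch. */
#define MAX_CHILDREN 8

typedef struct ToyPage
{
    bool is_leaf;
    int  ntuples;                 /* leaf pages only */
    int  children[MAX_CHILDREN];  /* internal pages only */
    int  nchildren;
} ToyPage;

/* Phase 1: DFS that records internal pages referencing at least one
 * completely empty leaf page. */
static void
collect_parents_of_empty(ToyPage *pages, int blkno,
                         int *rescan, int *nrescan)
{
    ToyPage *page = &pages[blkno];
    bool     has_empty_child = false;

    if (page->is_leaf)
        return;

    for (int i = 0; i < page->nchildren; i++)
    {
        ToyPage *child = &pages[page->children[i]];

        if (child->is_leaf && child->ntuples == 0)
            has_empty_child = true;
        else
            collect_parents_of_empty(pages, page->children[i],
                                     rescan, nrescan);
    }

    if (has_empty_child)
        rescan[(*nrescan)++] = blkno;
}

/* Phase 2: rescan a recorded internal page, dropping downlinks to
 * empty leaves while keeping at least one downlink so the tree stays
 * balanced.  Returns the number of dropped downlinks. */
static int
delete_empty_children(ToyPage *pages, int blkno)
{
    ToyPage *page = &pages[blkno];
    int      kept = 0, deleted = 0;

    for (int i = 0; i < page->nchildren; i++)
    {
        ToyPage *child = &pages[page->children[i]];
        bool     empty = child->is_leaf && child->ntuples == 0;

        /* must not drop the last remaining downlink */
        if (empty && !(kept == 0 && i == page->nchildren - 1))
            deleted++;
        else
            page->children[kept++] = page->children[i];
    }
    page->nchildren = kept;
    return deleted;
}
```

In the real patch the freed leaf pages would additionally be marked deleted and reported to the FSM; the sketch only shows how the rescan list is built and consumed.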

==How physical scan works==
The scan builds an array of GistPSItem (physical scan item). A GistPSItem contains a list of offsets pointing to potentially empty leaf pages, plus the information necessary to collect that list in one sequential file read.
The array of GistPSItems has one item for each GiST block.
When we encounter a leaf page that is empty, we mark it empty and jump to the parent (if known) to update its list.
When we encounter an internal page, we check the GistPSItem of every child block, to see whether it is an empty leaf and to record the parent pointer there.
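The single-pass bookkeeping can be sketched like this. The struct and both helpers are hypothetical simplifications of GistPSItem; the point is that a leaf and its parent can arrive in either order during one sequential read:

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8
#define INVALID_BLK (-1)

/* One entry per block of the index, filled during a sequential pass. */
typedef struct ToyPSItem
{
    int  parent;                  /* block holding our downlink, if seen */
    bool is_empty_leaf;           /* leaf visited and found empty */
    int  empty_children[NBLOCKS]; /* list built on internal pages */
    int  nempty;
} ToyPSItem;

/* Called when the sequential read hits an empty leaf page. */
static void
note_empty_leaf(ToyPSItem *items, int blkno)
{
    items[blkno].is_empty_leaf = true;

    /* jump to the parent, if we have already seen it */
    if (items[blkno].parent != INVALID_BLK)
    {
        ToyPSItem *p = &items[items[blkno].parent];

        p->empty_children[p->nempty++] = blkno;
    }
}

/* Called when the sequential read hits an internal page. */
static void
note_internal_page(ToyPSItem *items, int blkno,
                   const int *children, int nchildren)
{
    for (int i = 0; i < nchildren; i++)
    {
        items[children[i]].parent = blkno;

        /* the child may have been scanned before its parent */
        if (items[children[i]].is_empty_leaf)
            items[blkno].empty_children[items[blkno].nempty++] =
                children[i];
    }
}
```

After the pass, each internal page's item carries the complete list of its empty leaf children, regardless of the physical order in which the pages were read.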

==Limitations==
At least one reference on each internal page is left undeleted to preserve the balance of the tree.
Pages that have the FOLLOW-RIGHT flag set are also not deleted, even if they are empty.

Thank you for your attention, any thoughts are welcome.

Best regards, Andrey Borodin.

[0]: /messages/by-id/1147341441925550@web17j.yandex.ru

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v3.patch
0002-Physical-GiST-scan-during-VACUUM-v3.patch
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#1)
Re: GiST VACUUM

On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch [1].

+ uint16_t flags;

I think that needs to be uint16?

[1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.16

--
Thomas Munro
http://www.enterprisedb.com

#3 Andrey Borodin
amborodin@acm.org
In reply to: Thomas Munro (#2)
Re: GiST VACUUM

Hi, Thomas!

On 17 May 2018, at 9:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch[1].

+ uint16_t flags;

I think that needs to be uint16?

Thanks! Fixed.

Best regards, Andrey Borodin.

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v4.patch
0002-Physical-GiST-scan-during-VACUUM-v4.patch
#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#3)
Re: GiST VACUUM

I'm now looking at the first patch in this series, to allow completely
empty GiST pages to be recycled. I've got some questions:

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
iid = PageGetItemId(stack->page, downlinkoffnum);
idxtuple = (IndexTuple) PageGetItem(stack->page, iid);

This seems misplaced. This code deals with internal pages, and as far as
I can see, this patch never marks internal pages as deleted, only leaf
pages. However, we should have something like this in the leaf-page
branch, to deal with the case that an insertion lands on a page that was
concurrently deleted. Did you have any tests, where an insertion runs
concurrently with vacuum, that would exercise this?

The code in gistbulkdelete() seems pretty expensive. In the first phase,
it records the parent of every empty leaf page it encounters. In the
second phase, it scans every leaf page of that parent, not only those
leaves that were seen as empty.

I'm a bit wary of using pd_prune_xid for the checks to determine if a
deleted page can be recycled yet. In heap pages, pd_prune_xid is just a
hint, but here it's used for a critical check. This seems to be the same
mechanism we use in B-trees, but in B-trees, we store the XID in
BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See
comments in _bt_unlink_halfdead_page() for explanation. This patch is
missing any comments to explain how this works in GiST.
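The B-tree-style check being asked for here can be sketched as follows. This is a toy model with invented names, and transaction-ID wraparound handling is deliberately omitted; it only shows why the stamp must come from the next-XID counter rather than the current transaction:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Stamp a just-deleted page with the next XID, analogous to what
 * ReadNewTransactionId() provides: no scan that starts at or after
 * this XID can have seen the now-removed downlink. */
static ToyXid
stamp_deleted_page(ToyXid next_xid)
{
    return next_xid;
}

/* The page may be recycled only once every transaction that could
 * still hold a pointer to it has finished, i.e. once the oldest
 * running XID has advanced past the stamp. */
static bool
page_recyclable(ToyXid page_xid, ToyXid oldest_running_xid)
{
    return page_xid < oldest_running_xid;
}
```

Whether the stamp lives in pd_prune_xid or in the special space is orthogonal to this check; the check itself is what makes the field load-bearing rather than a hint.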

If you crash in the middle of gistbulkdelete(), after it has removed the
downlink from the parent, but before it has marked the leaf page as
deleted, the leaf page is "leaked". I think that's acceptable, but a
comment at least would be good.

- Heikki

#5 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#4)
Re: GiST VACUUM

Hi, Heikki!

Thanks for looking into the patch!

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some questions:

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
iid = PageGetItemId(stack->page, downlinkoffnum);
idxtuple = (IndexTuple) PageGetItem(stack->page, iid);

This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted.

That's a bug. Will fix this.

Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?

Yes, I've tried to test this, but, obviously, not enough. I'll think more about how to deal with it.

The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.

Yes, in the first patch there is the simplest gistbulkdelete(); the second patch will remember the line pointers of downlinks to empty leaves.

I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.

Will look into this. I remember it was OK half a year ago, but now it is clear to me that I should have documented that part back when I understood it...

If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.

I was considering doing reverse-split (page merge) concurrency like in Lanin and Shasha's paper, but it is just too complex for little gain. Will add comments on possible orphan pages.

Many thanks! I hope to post updated patch series this week.

Best regards, Andrey Borodin.

#6 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#4)
Re: GiST VACUUM

Hi!

PFA v5 of the patch series.

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?

That bug could manifest itself only in case of a crash between removing downlinks and marking pages deleted. I do not know how to test this reliably.
Internal pages are locked before leaves, and the locks are coupled. No concurrent backend can see downlinks to pages being deleted, unless a crash happens.

I've moved the code covering this situation into the leaf code path and added a comment.

The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.

It is fixed in the second patch of the series.

I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.

I've replaced the usage of GetCurrentTransactionId() with ReadNewTransactionId() and added an explanation of what is going on. Also, I've added comments about that pd_prune_xid usage. There is no other use for this field in GiST. And there is no other room to place this xid on a page without pg_upgrade.

If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.

Added an explanatory comment between WAL-logging the downlink removal and marking pages deleted.

Thank you for reviewing the patch!

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v5.patch
0001-Delete-pages-during-GiST-VACUUM-v5.patch
#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#6)
Re: GiST VACUUM

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?

That bug could manifest itself only in case of a crash between
removing downlinks and marking pages deleted.

Hmm. The downlink is removed first, so I don't think you can see that
situation after a crash. After a crash, you might have some empty,
orphaned, pages that have already been unlinked from the parent, but a
search/insert should never encounter them.

Actually, now that I think about it more, I'm not happy with leaving
orphaned pages like that behind. Let's WAL-log the removal of the
downlink, and marking the leaf pages as deleted, in one WAL record, to
avoid that.

But the situation in gistdoinsert(), where you encounter a deleted leaf
page, could happen during normal operation, if vacuum runs concurrently
with an insert. Insertion locks only one page at a time, as it descends
the tree, so after it has released the lock on the parent, but before it
has locked the child, vacuum might have deleted the page. In the latest
patch, you're checking for that just before swapping the shared lock for
an exclusive one, but I think that's wrong; you need to check for that
after swapping the lock, because otherwise vacuum might delete the page
while you're not holding the lock.
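The ordering concern can be sketched with a toy model (hypothetical names, not the patch's code): the deleted-page check is only meaningful while the exclusive lock is held, so it must run after the lock upgrade.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyLeaf
{
    bool exclusively_locked;
    bool deleted;              /* set by a concurrent vacuum */
} ToyLeaf;

/* Upgrade to an exclusive lock, then check for concurrent deletion.
 * Returns true if the insertion may proceed on this leaf. */
static bool
lock_and_check(ToyLeaf *leaf)
{
    /* in the real code: LockBuffer(buffer, GIST_UNLOCK);
     *                   LockBuffer(buffer, GIST_EXCLUSIVE); */
    leaf->exclusively_locked = true;

    /* only now is the deleted flag stable: vacuum cannot delete the
     * page while we hold the exclusive lock */
    if (leaf->deleted)
        return false;          /* caller restarts from the parent */

    return true;
}
```

Checking before the upgrade leaves a window in which vacuum can still delete the page, which is exactly the bug described above.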

I do not know how to test this reliably. Internal pages are locked
before leaves and locks are coupled. No concurrent backend can see
downlinks to pages being deleted, unless a crash happens.

Are you sure? At a quick glance, I don't think the locks are coupled.

We do need some way of testing this..

- Heikki

#8 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#7)
Re: GiST VACUUM

On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?

That bug could manifest itself only in case of a crash between
removing downlinks and marking pages deleted.

Hmm. The downlink is removed first, so I don't think you can see that situation after a crash. After a crash, you might have some empty, orphaned, pages that have already been unlinked from the parent, but a search/insert should never encounter them.

Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But that will complicate WAL replay seriously, and I do not know a proper way to test it (BTW, there is GiST amcheck in progress, but I decided to set it aside for a while).

But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

I do not know how to test this reliably. Internal pages are locked
before leaves and locks are coupled. No concurrent backend can see
downlinks to pages being deleted, unless a crash happens.

Are you sure? At a quick glance, I don't think the locks are coupled.

Sorry for overquoting

+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
Here's the first lock
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
now locks are coupled in a top-down descent

We do need some way of testing this..

Can we test replication of concurrent VACUUM and inserts in the existing test suite? I just do not know.
I can do these tests manually, if that is enough.

Best regards, Andrey Borodin.

#9 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#8)
Re: GiST VACUUM

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v6.patch
0001-Delete-pages-during-GiST-VACUUM-v6.patch
#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#9)
Re: GiST VACUUM

On 13/07/18 16:41, Andrey Borodin wrote:

On 12 July 2018, at 21:07, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Actually, now that I think about it more, I'm not happy with leaving orphaned
pages like that behind. Let's WAL-log the removal of the downlink, and
marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not
know a proper way to test that (BTW there is GiST amcheck in progress, but I
decided to leave it for a while).

Done. Now the WAL record for a deleted page also removes the downlink from
the internal page. I had to use PageIndexTupleDelete() instead of
PageIndexMultiDelete(), but I do not think it will have any impact on performance.

Yeah, I think that's fine, this isn't that performance critical.

But the situation in gistdoinsert(), where you encounter a deleted leaf page,
could happen during normal operation, if vacuum runs concurrently with an
insert. Insertion locks only one page at a time, as it descends the tree, so
after it has released the lock on the parent, but before it has locked the
child, vacuum might have deleted the page. In the latest patch, you're
checking for that just before swapping the shared lock for an exclusive one,
but I think that's wrong; you need to check for that after swapping the lock,
because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.

Ok, the comment now says:

+			/*
+			 * Leaf pages can be left deleted but still referenced in case of
+			 * crash during VACUUM's gistbulkdelete()
+			 */

But that's not accurate, right? You should never see deleted pages after
a crash, because the parent is updated in the same WAL record as the
child page, right?

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

- Heikki

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#9)
Re: GiST VACUUM

Looking at the second patch, to scan the GiST index in physical order,
that seems totally unsafe, if there are any concurrent page splits. In
the logical scan, pushStackIfSplited() deals with that, by comparing the
page's NSN with the parent's LSN. But I don't see anything like that in
the physical scan code.

I think we can do something similar in the physical scan: remember the
current LSN position at the beginning of the vacuum, and compare with
that. The B-tree code uses the "cycle ID" for similar purposes.

Do we still need the separate gistvacuumcleanup() pass, if we scan the
index in physical order in the bulkdelete pass already?

- Heikki

#12 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#10)
Re: GiST VACUUM

On 13 July 2018, at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.

Ok, the comment now says:

+			/*
+			 * Leaf pages can be left deleted but still referenced in case of
+			 * crash during VACUUM's gistbulkdelete()
+			 */

But that's not accurate, right? You should never see deleted pages after a crash, because the parent is updated in the same WAL record as the child page, right?

Fixed the comment.

I'm still a bit scared about using pd_prune_xid to store the XID that prevents recycling the page too early. Can we use some field in GISTPageOpaqueData for that, similar to how the B-tree stores it in BTPageOpaqueData?

There is no room in the opaque data but, technically, the whole page is just a tombstone until it is reused, so we can pick an arbitrary place. PFA v7, where the xid after delete is stored in the opaque data, but we can use other places, like the line pointer array or opaque-1.

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Looking at the second patch, to scan the GiST index in physical order, that seems totally unsafe, if there are any concurrent page splits. In the logical scan, pushStackIfSplited() deals with that, by comparing the page's NSN with the parent's LSN. But I don't see anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink simultaneously. The purpose of NSN is to visit this page exactly once, by following only one of the two links during a scan. This is achieved naturally if we read everything from the beginning to the end. (That is how I understand it; I can be wrong.)

I think we can do something similar in the physical scan: remember the current LSN position at the beginning of the vacuum, and compare with that. The B-tree code uses the "cycle ID" for similar purposes.

Do we still need the separate gistvacuumcleanup() pass, if we scan the index in physical order in the bulkdelete pass already?

We do not need to gather stats there, but we are doing RecordFreeIndexPage() and IndexFreeSpaceMapVacuum(). Is it correct to move these to the first scan?

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v7.patch
0001-Delete-pages-during-GiST-VACUUM-v7.patch
#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#12)
Re: GiST VACUUM

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of NSN is to visit this page exactly once,
by following only one of the two links during a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)

The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on
page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved
to page 5, the vacuum would miss them.

The way this is handled in B-tree is that when a page is split, the page
is stamped with current "vacuum cycle id". When the vacuum scan
encounters a page with the current cycle id, whose right-link points to
a lower-numbered page, it immediately follows the right link, and
re-scans it. I.e. in the above example, if it was a B-tree, in step 3
when vacuum scans page 15, it would see that it was concurrently split.
It would immediately vacuum page 5 again, before continuing the scan in
physical order.

We'll need to do something similar in GiST.
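The revisit rule described above can be sketched with a toy model (all names hypothetical): while vacuuming in physical order, a page split during this vacuum whose right link points to a lower-numbered, already-scanned page forces an immediate detour before the sequential scan continues.

```c
#include <assert.h>
#include <stdbool.h>

#define NOBLK (-1)

typedef struct ToyScanPage
{
    int  rightlink;         /* NOBLK if none */
    bool split_this_cycle;  /* stamped when the page is split */
} ToyScanPage;

/* Vacuum one page; visits[] counts how often each page is cleaned. */
static void
vacuum_page(ToyScanPage *pages, int blkno, int *visits)
{
    ToyScanPage *p = &pages[blkno];

    visits[blkno]++;        /* the real code removes dead tuples here */

    /* a concurrent split may have moved tuples to a page we have
     * already passed: detour there immediately */
    if (p->split_this_cycle &&
        p->rightlink != NOBLK && p->rightlink < blkno)
        vacuum_page(pages, p->rightlink, visits);
}

static void
physical_scan(ToyScanPage *pages, int npages, int *visits)
{
    for (int blkno = 0; blkno < npages; blkno++)
        vacuum_page(pages, blkno, visits);
}
```

In the scenario above, the page that received the right half of the split gets cleaned a second time when the scan reaches the split page, so no dead tuples are missed.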

- Heikki

#14 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#13)
Re: GiST VACUUM

On 14 July 2018, at 0:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of NSN is to visit this page exactly once,
by following only one of the two links during a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)

The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved to page 5, the vacuum would miss them.

The way this is handled in B-tree is that when a page is split, the page is stamped with current "vacuum cycle id". When the vacuum scan encounters a page with the current cycle id, whose right-link points to a lower-numbered page, it immediately follows the right link, and re-scans it. I.e. in the above example, if it was a B-tree, in step 3 when vacuum scans page 15, it would see that it was concurrently split. It would immediately vacuum page 5 again, before continuing the scan in physical order.

We'll need to do something similar in GiST.

OK, I will do this.

This is a tradeoff between a complex concurrency feature and the possibility of a few dead tuples left behind after VACUUM. I want to understand: is there anything dangerous about these dead tuples?
There is one more serious race condition: the result of the first scan is just a hint about where to look for downlinks to empty pages. If an internal page splits between the scan and the cleanup, the offsets of the downlinks will change; the cleanup will lock the pages, see non-empty pages, and will not delete them (though there are no dead tuples, just empty leaves that were not deleted).

Best regards, Andrey Borodin.

#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#14)
Re: GiST VACUUM

On 14/07/18 10:26, Andrey Borodin wrote:

This is a tradeoff between a complex concurrency feature and the
possibility of a few dead tuples left behind after VACUUM. I want to
understand: is there anything dangerous about these dead tuples?

Yeah, that's bad. When a new heap tuple is inserted in the same
location, the old index tuple will point to the new, unrelated, tuple,
and you will get incorrect query results.
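This failure mode can be shown with a toy model (hypothetical names, heap TIDs reduced to plain slot numbers): a dead index entry that vacuum missed keeps its slot number, and once the slot is reused for an unrelated row the lookup silently returns that row.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS 4

/* Toy index entry: a key plus a heap "TID" reduced to a slot number. */
typedef struct ToyIndexEntry
{
    const char *key;
    int         heap_slot;
} ToyIndexEntry;

/* Index lookup: find the entry for key, return the heap row it
 * points at.  Trusts the stored slot number unconditionally, just as
 * a real index scan trusts a TID. */
static const char *
lookup(const ToyIndexEntry *idx, int n,
       const char *heap[], const char *key)
{
    for (int i = 0; i < n; i++)
        if (strcmp(idx[i].key, key) == 0)
            return heap[idx[i].heap_slot];
    return NULL;
}
```

If the "apple" row in slot 2 is deleted, vacuum misses the index entry, and the slot is reused for "banana", then a lookup for "apple" now returns the "banana" row: an incorrect query result, exactly as described above.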

There is one more serious race condition: the result of the first
scan is just a hint about where to look for downlinks to empty pages.
If an internal page splits between the scan and the cleanup, the
offsets of the downlinks will change; the cleanup will lock the
pages, see non-empty pages, and will not delete them (though there
are no dead tuples, just empty leaves that were not deleted).

That's fine. Leaving behind a few empty pages is harmless; the next
vacuum will pick them up.

- Heikki

#16 Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#15)
Re: GiST VACUUM

On 14 July 2018, at 14:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 14/07/18 10:26, Andrey Borodin wrote:

This is a tradeoff between a complex concurrency feature and the
possibility of a few dead tuples left behind after VACUUM. I want to
understand: is there anything dangerous about these dead tuples?

Yeah, that's bad. When a new heap tuple is inserted in the same location, the old index tuple will point to the new, unrelated, tuple, and you will get incorrect query results.

PFA v8: at the beginning of the physical scan I grab GetInsertRecPtr(), and if a leaf page has a rightlink pointing to a lower block in the file and an NSN higher than the LSN taken at the start of the scan, we go back and scan that page one more time. This is recursive.

I'm still not very comfortable with storing the deletion xid in the opaque data: we would reuse the rightlink field under very specific conditions instead of using the totally unused pd_prune_xid. We have a remark in bufpage.h:
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
Or maybe we should use some other place on those empty 8Kb pages.

A deleted page does not have a rightlink now, but in B-trees the rightlink on deleted pages is actively used.
We cannot reuse NSN either: a rightlink is useless without an NSN anyway. We cannot reuse the flags: the page is marked deleted in the flags, after all. The uint16 gist_page_id is just not enough room.

Best regards, Andrey Borodin.

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v8.patch
0002-Physical-GiST-scan-during-VACUUM-v8.patch
#17 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: GiST VACUUM

On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

What's your reason for being scared? It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece
of seemingly heap-specific data that is kept in the generic page
header rather than in the heap's special space. Other AMs like btree
or zheap may have different needs; one size does not fit all. That
said, since getting rid of pd_prune_xid seems impractical, maybe it
makes more sense to reuse it than to insist on leaving it idle and
consuming space elsewhere in the page.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18 Andrey Borodin
amborodin@acm.org
In reply to: Robert Haas (#17)
Re: GiST VACUUM

On 16 July 2018, at 18:58, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

What's your reason for being scared? It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

Thanks, that's exactly what I've been thinking about regarding where to store this xid.

Here's v9 of the patch. It uses pd_prune_xid, but access is abstracted behind GistPageGetDeleteXid() / GistPageSetDeleteXid(), so that we can easily change where the xid is stored.

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v9.patch (application/octet-stream, +326 −41)
0001-Delete-pages-during-GiST-VACUUM-v9.patch (application/octet-stream, +272 −21)
#19 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#18)
Re: GiST VACUUM

Hi!

On 16 July 2018, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I was checking WAL replay of the new scheme for logging page deletes and found a bug there (an incorrect value of the deleted downlink in the WAL record). Here's fixed patch v10.

I've also added WAL identification support for the new record, and made some improvements to comments and naming in the data structures.

Thanks!

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v10.patch (application/octet-stream, +326 −41)
0001-Delete-pages-during-GiST-VACUUM-v10.patch (application/octet-stream, +274 −19)
#20 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#19)
Re: GiST VACUUM

On 17/07/18 21:41, Andrey Borodin wrote:

I was checking WAL replay of new scheme to log page deletes and found
a bug there (incorrect value of deleted downlink in WAL record).
Here's fixed patch v10.

Also I've added support to WAL identification for new record, done
some improvements to comments and naming in data structures.

Thanks!

+		/*
+		 * If this page was split after the start of the VACUUM we have to
+		 * revisit the rightlink, if it points to a block we already scanned.
+		 * This revisit is recursive; it should not be deep, but we check
+		 * for the possibility of stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+		{
+			gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
+		}

In the corresponding B-tree code, we don't do actual recursion, but a
hand-optimized "tail recursion", to avoid stack overflow if there are
a lot of splits. I think we need to do something like that here, too. I
don't think it's safe to assume that we have enough stack space for the
recursion. You have a check_stack_depth() in the function, so you'll get
an ERROR, but it sucks if VACUUM errors out because of that.

It's not cool to use up to 'maintenance_work_mem' of memory for holding
the in-memory graph. VACUUM already uses up to that much memory to hold
the list of dead TIDs, so we would use up to 2x maintenance_work_mem in
total.

The only reason we still need the logical scan is to support page
deletion, when there is not enough memory available. Can we get rid of
that altogether? I think I'd prefer some other scheme to deal with that
situation. For example, we could make note, in memory, of the empty
pages during the physical scan, and perform a second physical scan of
the index to find their downlinks. Two full-index scans is not nice, but
it's actually not that bad if it's done in physical order. And you could
have some heuristics, e.g. only do it if more than 10% of the pages were
deletable.

Sorry to ask you to rewrite this again, but I think it would be better
to split this into two patches as follows:

1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first patch,
which just switches to doing physical-order scan, first.

- Heikki
