Bug in ginRedoRecompress that causes opaque data on page to be overrun

Started by R, Siva · over 7 years ago · 13 messages
#1 R, Siva
sivasubr@amazon.com
1 attachment(s)

Hi,

We recently encountered an issue where the opaque data flags on a GIN data leaf page were corrupted while replaying a GIN insert WAL record. Upon further examination of the redo code, we found a bug in the ginRedoRecompress code, which extracts the WAL information and updates the page.

Specifically, when a new segment is inserted in the middle of a page, a memmove operation is performed [1] at the current point in the page to make room for the new segment. If this segment insertion is followed by delete-segment actions that are yet to be processed, and the total data size is very close to GinDataPageMaxDataSize, then we may move the data portion beyond the boundary, corrupting the opaque data.
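
To illustrate the failure mode, here is a standalone C sketch with made-up sizes and layout (not actual GIN structures): the in-place memmove for an insert action writes past the data boundary, even though delete actions queued later in the same record would bring the total back under the limit.

#include <assert.h>
#include <string.h>

#define DATA_CAPACITY 32            /* stand-in for GinDataPageMaxDataSize */

int main(void)
{
    char page[DATA_CAPACITY + 8];   /* last 8 bytes play the opaque area */
    int  used = 30;                 /* data portion is nearly full */
    int  insert_at = 10;            /* an insert action lands mid-page */
    int  newsegsize = 6;            /* size of the new segment */

    memset(page, 'a', sizeof(page));

    /*
     * In-place replay makes room for the new segment before the pending
     * delete actions are processed.  This memmove writes bytes [16, 36),
     * and offsets 32..35 already belong to the "opaque" area.
     */
    memmove(page + insert_at + newsegsize, page + insert_at, used - insert_at);
    used += newsegsize;

    assert(used <= DATA_CAPACITY);  /* fires: 36 > 32 */
    return 0;
}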

One way of solving this problem is to perform the replay work in a scratch space and sanity-check the total size of the data portion before copying it back to the actual page. While this involves an additional memory allocation and extra memcpy operations, it is safer and similar to the 'do' code path, where we make a copy of all segments past the first modified segment before placing them back on the page [2].
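
A minimal sketch of that idea, again in C with made-up sizes and a hand-rolled action stream rather than real GIN structures (the attached patch is the actual implementation): actions are applied into the scratch buffer left to right, nothing is ever shifted on the page, and the size check happens before the page is touched.

#include <assert.h>
#include <string.h>

#define DATA_CAPACITY 32            /* stand-in for GinDataPageMaxDataSize */

int main(void)
{
    char page[DATA_CAPACITY];       /* data portion of the leaf page */
    char scratch[DATA_CAPACITY];    /* replay target */
    int  used = 30;                 /* bytes currently on the page */
    int  in = 0;                    /* read position on the page */
    int  out = 0;                   /* write position in scratch */

    memset(page, 'a', sizeof(page));

    /* An unmodified 10-byte segment is copied to scratch verbatim. */
    memcpy(scratch + out, page + in, 10);
    out += 10;
    in += 10;

    /* Insert: a 6-byte segment from the WAL record goes into scratch. */
    memset(scratch + out, 'N', 6);
    out += 6;

    /* Delete: an 8-byte old segment is simply skipped on the read side. */
    in += 8;

    /* Copy the remaining unmodified tail. */
    memcpy(scratch + out, page + in, used - in);
    out += used - in;

    /* Sanity-check the total size before the page is modified at all. */
    assert(out <= DATA_CAPACITY);   /* holds: 28 <= 32 */
    memcpy(page, scratch, out);
    return 0;
}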

I have attached a patch for that approach. Please let us know if you have any comments or feedback.
Thanks!

Best
Siva

References:
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/gin/ginxlog.c;h=7515f8bc167c2eafceced5d6ad5d74f7ec09e0a5;hb=refs/heads/REL9_6_STABLE#l278
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/gin/gindatapage.c;h=cd3b9dfb784b084dd27a37146a4909fa1109ee81;hb=refs/heads/REL9_6_STABLE#l1726

Attachments:

rewrite-ginRedoRecompress-scratch-space-processing-replay_v1.patch (application/octet-stream)
From 9f971e7802d041573024eede7e0072372ef1d6e6 Mon Sep 17 00:00:00 2001
From: Siva R <sivasubr@amazon.com>
Date: Tue, 4 Sep 2018 19:19:16 +0000
Subject: [PATCH] Rewrite ginRedoRecompress to use a scratch space for
 processing replay actions instead of modifying contents on the actual page to
 avoid accidental overwrites of opaque data.

---
 src/backend/access/gin/ginxlog.c | 180 +++++++++++++++++++++------------------
 1 file changed, 98 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7515f8b..7ea798f 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -138,17 +138,21 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 {
 	int			actionno;
 	int			segno;
-	GinPostingList *oldseg;
+	GinPostingList *segiter;
+	Pointer		segmentcopybegin;
 	Pointer		segmentend;
 	char	   *walbuf;
 	int			totalsize;
+	Pointer		scratch;
+	Pointer		scratchbegin;
+	bool		modified;
 
 	/*
 	 * If the page is in pre-9.4 format, convert to new format first.
 	 */
 	if (!GinPageIsCompressed(page))
 	{
-		ItemPointer uncompressed = (ItemPointer) GinDataPageGetData(page);
+		ItemPointer uncompressed = (ItemPointer)GinDataPageGetData(page);
 		int			nuncompressed = GinPageGetOpaque(page)->maxoff;
 		int			npacked;
 
@@ -180,63 +184,77 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 		GinPageGetOpaque(page)->maxoff = InvalidOffsetNumber;
 	}
 
-	oldseg = GinDataLeafPageGetPostingList(page);
-	segmentend = (Pointer) oldseg + GinDataLeafPageGetPostingListSize(page);
-	segno = 0;
+	scratch = palloc(GinDataPageMaxDataSize);
+
+	segiter = GinDataLeafPageGetPostingList(page);
+	segmentend = (Pointer)segiter + GinDataLeafPageGetPostingListSize(page);
+	segmentcopybegin = segmentend;
 
-	walbuf = ((char *) data) + sizeof(ginxlogRecompressDataLeaf);
+	segno = 0;
+	modified = false;
+	scratchbegin = scratch;
+	walbuf = ((char *)data) + sizeof(ginxlogRecompressDataLeaf);
 	for (actionno = 0; actionno < data->nactions; actionno++)
 	{
-		uint8		a_segno = *((uint8 *) (walbuf++));
-		uint8		a_action = *((uint8 *) (walbuf++));
+		uint8		a_segno = *((uint8 *)(walbuf++));
+		uint8		a_action = *((uint8 *)(walbuf++));
 		GinPostingList *newseg = NULL;
 		int			newsegsize = 0;
 		ItemPointerData *items = NULL;
 		uint16		nitems = 0;
-		ItemPointerData *olditems;
-		int			nolditems;
-		ItemPointerData *newitems;
-		int			nnewitems;
-		int			segsize;
-		Pointer		segptr;
-		int			szleft;
-
-		/* Extract all the information we need from the WAL record */
-		if (a_action == GIN_SEGMENT_INSERT ||
-			a_action == GIN_SEGMENT_REPLACE)
-		{
-			newseg = (GinPostingList *) walbuf;
-			newsegsize = SizeOfGinPostingList(newseg);
-			walbuf += SHORTALIGN(newsegsize);
-		}
-
-		if (a_action == GIN_SEGMENT_ADDITEMS)
-		{
-			memcpy(&nitems, walbuf, sizeof(uint16));
-			walbuf += sizeof(uint16);
-			items = (ItemPointerData *) walbuf;
-			walbuf += nitems * sizeof(ItemPointerData);
-		}
 
 		/* Skip to the segment that this action concerns */
 		Assert(segno <= a_segno);
 		while (segno < a_segno)
 		{
-			oldseg = GinNextPostingListSegment(oldseg);
+			if (modified)
+			{
+				int			cursegsize;
+
+				/* Copy out the segment to scratch space */
+				cursegsize = SizeOfGinPostingList(segiter);
+				Assert(scratch + cursegsize - scratchbegin <= GinDataPageMaxDataSize);
+				memcpy(scratch, segiter, cursegsize);
+				scratch += cursegsize;
+			}
+
+			/* Move to next segment in the page */
+			segiter = GinNextPostingListSegment(segiter);
 			segno++;
 		}
 
 		/*
-		 * ADDITEMS action is handled like REPLACE, but the new segment to
-		 * replace the old one is reconstructed using the old segment from
-		 * disk and the new items from the WAL record.
+		 * Encountered first segment with an action. Every subsequent segment
+		 * needs to be in scratch space.
+		 */
+		if (!modified)
+		{
+			modified = true;
+			segmentcopybegin = (Pointer)segiter;
+		}
+
+		/*
+		 * Positioned after the last existing segment. Only INSERTs expected
+		 * here.
 		 */
+		if ((Pointer)segiter == segmentend)
+			Assert(a_action == GIN_SEGMENT_INSERT);
+
+		/* Process the action on the segment */
 		if (a_action == GIN_SEGMENT_ADDITEMS)
 		{
+			ItemPointerData *olditems;
+			int			nolditems;
+			ItemPointerData *newitems;
+			int			nnewitems;
 			int			npacked;
 
-			olditems = ginPostingListDecode(oldseg, &nolditems);
+			memcpy(&nitems, walbuf, sizeof(uint16));
+			walbuf += sizeof(uint16);
+			items = (ItemPointerData *)walbuf;
+			walbuf += nitems * sizeof(ItemPointerData);
 
+			olditems = ginPostingListDecode(segiter, &nolditems);
 			newitems = ginMergeItemPointers(items, nitems,
 											olditems, nolditems,
 											&nnewitems);
@@ -247,61 +265,59 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 			Assert(npacked == nnewitems);
 
 			newsegsize = SizeOfGinPostingList(newseg);
-			a_action = GIN_SEGMENT_REPLACE;
+			memcpy(scratch, newseg, newsegsize);
+			scratch += newsegsize;
 		}
+		else if (a_action == GIN_SEGMENT_INSERT ||
+				 a_action == GIN_SEGMENT_REPLACE)
+		{
+			newseg = (GinPostingList *)walbuf;
+			newsegsize = SizeOfGinPostingList(newseg);
+			walbuf += SHORTALIGN(SizeOfGinPostingList(newseg));
 
-		segptr = (Pointer) oldseg;
-		if (segptr != segmentend)
-			segsize = SizeOfGinPostingList(oldseg);
+			/* Copy the new/replaced segment into scratch space */
+			memcpy(scratch, newseg, newsegsize);
+			scratch += newsegsize;
+		}
+		else if (a_action == GIN_SEGMENT_DELETE)
+		{
+			/* No op */
+		}
 		else
 		{
-			/*
-			 * Positioned after the last existing segment. Only INSERTs
-			 * expected here.
-			 */
-			Assert(a_action == GIN_SEGMENT_INSERT);
-			segsize = 0;
+			elog(ERROR, "unexpected GIN leaf action: %u", a_action);
 		}
-		szleft = segmentend - segptr;
 
-		switch (a_action)
+		/*
+		 * Advance to next segment on original page if the current segment
+		 * action is not insert.
+		 */
+		if (a_action != GIN_SEGMENT_INSERT)
 		{
-			case GIN_SEGMENT_DELETE:
-				memmove(segptr, segptr + segsize, szleft - segsize);
-				segmentend -= segsize;
-
-				segno++;
-				break;
-
-			case GIN_SEGMENT_INSERT:
-				/* make room for the new segment */
-				memmove(segptr + newsegsize, segptr, szleft);
-				/* copy the new segment in place */
-				memcpy(segptr, newseg, newsegsize);
-				segmentend += newsegsize;
-				segptr += newsegsize;
-				break;
-
-			case GIN_SEGMENT_REPLACE:
-				/* shift the segments that follow */
-				memmove(segptr + newsegsize,
-						segptr + segsize,
-						szleft - segsize);
-				/* copy the replacement segment in place */
-				memcpy(segptr, newseg, newsegsize);
-				segmentend -= segsize;
-				segmentend += newsegsize;
-				segptr += newsegsize;
-				segno++;
-				break;
-
-			default:
-				elog(ERROR, "unexpected GIN leaf action: %u", a_action);
+			segiter = GinNextPostingListSegment(segiter);
+			segno++;
 		}
-		oldseg = (GinPostingList *) segptr;
 	}
 
-	totalsize = segmentend - (Pointer) GinDataLeafPageGetPostingList(page);
+	/* Assert that there was some modification performed */
+	Assert(modified);
+
+	/* Copy out remainder of the page into scratch */
+	if ((Pointer)segiter < segmentend)
+		memcpy(scratch, segiter, (segmentend - ((Pointer)segiter)));
+
+	/*
+	 * Check that the new total size is within the max data page size allowed.
+	 * New total size is sum of scratch space size consumed and portion of
+	 * page that was unmodified.
+	 */
+	totalsize = (scratch - scratchbegin) + (segmentcopybegin - (Pointer)GinDataLeafPageGetPostingList(page));
+	Assert(totalsize <= GinDataPageMaxDataSize);
+
+	/* Copy the items back onto the page */
+	memcpy(segmentcopybegin, scratchbegin, (scratch - scratchbegin));
+
+	/* Set size on the page */
 	GinDataPageSetDataSize(page, totalsize);
 }
 
-- 
2.7.3.AMZN

#2 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: R, Siva (#1)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

Hi, Siva!

On Tue, Sep 4, 2018 at 11:01 PM R, Siva <sivasubr@amazon.com> wrote:

We recently encountered an issue where the opaque data flags on a GIN data leaf page were corrupted while replaying a GIN insert WAL record. Upon further examination of the redo code, we found a bug in the ginRedoRecompress code, which extracts the WAL information and updates the page.

Specifically, when a new segment is inserted in the middle of a page, a memmove operation is performed [1] at the current point in the page to make room for the new segment. If this segment insertion is followed by delete-segment actions that are yet to be processed, and the total data size is very close to GinDataPageMaxDataSize, then we may move the data portion beyond the boundary, corrupting the opaque data.

One way of solving this problem is to perform the replay work in a scratch space and sanity-check the total size of the data portion before copying it back to the actual page. While this involves an additional memory allocation and extra memcpy operations, it is safer and similar to the 'do' code path, where we make a copy of all segments past the first modified segment before placing them back on the page [2].

I have attached a patch for that approach. Please let us know if you have any comments or feedback.

Do you have a test scenario for reproduction of this issue? We need
it to ensure that the fix is correct.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#3 Peter Geoghegan
pg@bowt.ie
In reply to: R, Siva (#1)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Tue, Sep 4, 2018 at 12:59 PM, R, Siva <sivasubr@amazon.com> wrote:

We recently encountered an issue where the opaque data flags on a GIN
data leaf page were corrupted while replaying a GIN insert WAL record.
Upon further examination of the redo code, we found a bug in the
ginRedoRecompress code, which extracts the WAL information and updates the page.

I wonder how you managed to detect it in the first place. Were you
using something like wal_consistency_checking=all with a custom
stress test?
stress-test?

--
Peter Geoghegan

#4 R, Siva
sivasubr@amazon.com
In reply to: Peter Geoghegan (#3)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

Hi Alexander!
On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Do you have a test scenario for reproduction of this issue? We need
it to ensure that fix is correct.

Unfortunately, I do not have a way of reproducing this issue.
So far I have tried a workload consisting of inserts (of the
same indexed attribute value), batch deletes of rows, and
vacuum, interleaved with engine crashes/restarts.

Hi Peter!
On Tue, Sep 4, 2018 at 09:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:

I wonder how you managed to detect it in the first place. Were you
using something like wal_consistency_checking=all, with a custom
stress-test?

We observed this corruption during stress testing and were
able to isolate the corrupted page and WAL record changes
leading up to the corruption using some internal diagnostic
tools.

Best
Siva

#5 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: R, Siva (#4)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Wed, Sep 5, 2018 at 1:45 AM R, Siva <sivasubr@amazon.com> wrote:

On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Do you have a test scenario for reproduction of this issue? We need
it to ensure that the fix is correct.

Unfortunately, I do not have a way of reproducing this issue.
So far I have tried a workload consisting of inserts (of the
same indexed attribute value), batch deletes of rows, and
vacuum, interleaved with engine crashes/restarts.

Issue reproduction and testing are essential for a bug fix. Remember
the last time you reported a GIN bug [1]: after reproducing the issue,
it turned out we had more things to fix. It's quite clear to me that
if the segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE,
then it might lead to wrong behavior in ginRedoRecompress(). But it's
not yet clear what code path could lead to such a segment
list... I'll explore the code more and will probably come up with some idea.

Links
[1]: /messages/by-id/1531867212836.63354@amazon.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#6 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#5)
1 attachment(s)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Wed, Sep 5, 2018 at 12:26 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Issue reproduction and testing are essential for a bug fix. Remember
the last time you reported a GIN bug [1]: after reproducing the issue,
it turned out we had more things to fix. It's quite clear to me that
if the segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE,
then it might lead to wrong behavior in ginRedoRecompress(). But it's
not yet clear what code path could lead to such a segment
list... I'll explore the code more and will probably come up with some idea.

Aha, I've managed to reproduce this.
1. Apply ginRedoRecompress-asserts.patch, which adds assertions to
ginRedoRecompress() detecting writes past the opaque area.
2. Set up streaming replication.
3. Execute the following on the master.
create or replace function test () returns void as $$
declare
    i int;
begin
    FOR i IN 1..1000 LOOP
        insert into test values ('{1}');
    end loop;
end
$$ language plpgsql;
create table test (a int[]);
insert into test (select '{}'::int[] from generate_series(1,10000));
insert into test (select '{1}'::int[] from generate_series(1,100000));
create index test_idx on test using gin(a) with (fastupdate = off);
delete from test where a = '{}'::int[];
vacuum test;
select test();

So, since we managed to reproduce this, I'm going to test and review your fix.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

ginRedoRecompress-asserts.patch (application/x-patch)
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7a1e94a1d56..301c527cab9 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -276,6 +276,7 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 				break;
 
 			case GIN_SEGMENT_INSERT:
+				Assert(segptr + newsegsize + szleft < PageGetSpecialPointer(page));
 				/* make room for the new segment */
 				memmove(segptr + newsegsize, segptr, szleft);
 				/* copy the new segment in place */
@@ -285,6 +286,8 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 				break;
 
 			case GIN_SEGMENT_REPLACE:
+				Assert(segptr + newsegsize + (szleft - segsize) < PageGetSpecialPointer(page));
+				Assert(segptr + szleft < PageGetSpecialPointer(page));
 				/* shift the segments that follow */
 				memmove(segptr + newsegsize,
 						segptr + segsize,
#7 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#6)
1 attachment(s)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Wed, Sep 5, 2018 at 2:39 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Aha, I've managed to reproduce this.
1. Apply ginRedoRecompress-asserts.patch, which adds assertions to
ginRedoRecompress() detecting writes past the opaque area.

That was wrong, sorry. It appears that I put strict inequalities into
the asserts where there should be non-strict inequalities. A correct
version of the patch is attached. And the scenario I posted doesn't
really reproduce the bug...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

ginRedoRecompress-asserts-v2.patch (application/octet-stream)
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7a1e94a1d56..52d84a2c207 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -276,6 +276,7 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 				break;
 
 			case GIN_SEGMENT_INSERT:
+				Assert(segptr + newsegsize + szleft <= PageGetSpecialPointer(page));
 				/* make room for the new segment */
 				memmove(segptr + newsegsize, segptr, szleft);
 				/* copy the new segment in place */
@@ -285,6 +286,8 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 				break;
 
 			case GIN_SEGMENT_REPLACE:
+				Assert(segptr + newsegsize + (szleft - segsize) <= PageGetSpecialPointer(page));
+				Assert(segptr + szleft <= PageGetSpecialPointer(page));
 				/* shift the segments that follow */
 				memmove(segptr + newsegsize,
 						segptr + segsize,
#8 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: R, Siva (#1)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Wed, Sep 5, 2018 at 4:59 AM, R, Siva <sivasubr@amazon.com> wrote:

Hi,

We recently encountered an issue where the opaque data flags on a GIN data
leaf page were corrupted while replaying a GIN insert WAL record. Upon
further examination of the redo code, we found a bug in the ginRedoRecompress
code, which extracts the WAL information and updates the page.

Specifically, when a new segment is inserted in the middle of a page, a
memmove operation is performed [1] at the current point in the page to make
room for the new segment. If this segment insertion is followed by
delete-segment actions that are yet to be processed, and the total data size
is very close to GinDataPageMaxDataSize, then we may move the data portion
beyond the boundary, corrupting the opaque data.

One way of solving this problem is to perform the replay work in a scratch
space and sanity-check the total size of the data portion before copying it
back to the actual page. While this involves an additional memory allocation
and extra memcpy operations, it is safer and similar to the 'do' code path,
where we make a copy of all segments past the first modified segment before
placing them back on the page [2].

Hmm, could you share the sequence of WAL actions that was applied to
the broken page? I suspect the segment list contains
GIN_SEGMENT_REPLACE before GIN_SEGMENT_INSERT.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#7)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Wed, Sep 5, 2018 at 5:05 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

That was wrong, sorry. It appears that I put strict inequalities into
the asserts where there should be non-strict inequalities. A correct
version of the patch is attached. And the scenario I posted doesn't
really reproduce the bug...

Finally, I managed to reproduce the bug. The scenario is as follows.
The underlying idea is that when an insertion of multiple tuples at the
beginning of the page succeeds only thanks to the collapse of some short
segments together, that insertion wouldn't fit on the page if given alone.

create table test (i integer primary key, a int[]) with (fillfactor = 50);
insert into test (select id, array[id%2]::int[] from
generate_series(1,15376) id);
create index test_idx on test using gin(a) with (fastupdate = off);
update test set a = '{1}' where i % 4 = 0 or i % 16 = 2 or i % 64 in
(6, 46, 36) or i % 256 = 54;
vacuum test;
insert into test (select id + 16376, '{0}' from generate_series(1,5180) id);
update test set a = '{1}' where i <= 15376 and i % 256 in (182, 198);
vacuum test;
alter index test_idx set (fastupdate = on);
delete from test where i <= 134 and a = '{1}';
vacuum test;
insert into test (select id+30000, '{0}' from generate_series(1,110) id);
vacuum test;

With ginRedoRecompress-asserts-v2.patch, the following assertion fires.
TRAP: FailedAssertion("!(segptr + newsegsize + (szleft - segsize) <= (
((void) ((_Bool) (! (!(PageValidateSpecialPointer(page))) ||
(ExceptionalCondition("!(PageValidateSpecialPointer(page))",
("FailedAssertion"), "ginxlog.c", 289), 0)))), (char *) ((char *)
(page) + ((PageHeader) (page))->pd_special) ))", File: "ginxlog.c",
Line: 289)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10 R, Siva
sivasubr@amazon.com
In reply to: Alexander Korotkov (#9)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Tue, Sep 5, 2018 at 08:55 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Finally, I managed to reproduce the bug. The scenario is as follows.
The underlying idea is that when an insertion of multiple tuples at the
beginning of the page succeeds only thanks to the collapse of some short
segments together, that insertion wouldn't fit on the page if given alone.

Sorry for the late reply.
Thank you so much for working on this and reproducing the issue!
As you mentioned, the WAL record where we detected this problem
has future segments deleted due to compaction and written out
as an insert segment.

alter index test_idx set (fastupdate = on);

Just curious: why does this help with the repro? This seems related
only to use of the GIN pending list vs. the posting tree.

I will try to reproduce the issue with the above workload, test the
fix against it as well, and report back.

On Wed, Sep 5, 2018 at 5:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hmm, could you share the sequence of WAL actions that was applied to
the broken page? I suspect the segment list contains
GIN_SEGMENT_REPLACE before GIN_SEGMENT_INSERT.

These are the segment operations in the WAL record, in sequence:
- 1 replace action on segment N
- 5 insert actions after segment N; the 5th insert action essentially
replaces the last 3 remaining segments with a new one
- 3 delete actions on the remaining segments
In other words, the inserts transiently grow the data before the trailing
deletes shrink it back, which is exactly the pattern that overruns the
opaque data during in-place replay.

Best
Siva

#11 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: R, Siva (#10)
1 attachment(s)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Thu, Sep 6, 2018 at 12:53 AM R, Siva <sivasubr@amazon.com> wrote:

alter index test_idx set (fastupdate = on);

Just curious: why does this help with the repro? This seems related
only to use of the GIN pending list vs. the posting tree.

With (fastupdate = on), GIN performs bulk updates of posting lists,
inserting multiple tuples at once if possible. With (fastupdate =
off), GIN always inserts tuples one by one. It might still be possible
to reproduce the issue with (fastupdate = off), but it seems even
harder.

BTW, I've tried the patch you posted. On my test case it fails
with the following assertion.
TRAP: FailedAssertion("!(a_action == 2)", File: "ginxlog.c", Line: 243)

I thought more about fixing this issue, and I decided we can fix it in
a less invasive way. Once modification is started, we can copy the tail
of the page into a separately allocated chunk of memory and then use it
as the source of the original segments. See the attached patch.
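
Here is a standalone C sketch of that tail-copy idea with made-up sizes (the attached patch is the real fix): once the first modifying action is reached, the unprocessed tail is snapshotted once, old segments are read from the snapshot, and new data is written left to right on the page, with asserts guarding the boundary.

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DATA_CAPACITY 32            /* stand-in for GinDataPageMaxDataSize */

int main(void)
{
    char  page[DATA_CAPACITY];
    int   used = 30;                /* bytes currently on the page */
    char *write_ptr = page;         /* left-to-right write position */
    char *tail_copy;                /* snapshot of the unprocessed tail */
    char *read_ptr;
    int   tail_len;

    memset(page, 'a', sizeof(page));

    /* A 10-byte unmodified segment before the first action stays put. */
    write_ptr += 10;

    /*
     * First modifying action reached: snapshot the rest of the page once;
     * from now on, old segments are read from the snapshot.
     */
    tail_len = used - 10;
    tail_copy = malloc(tail_len);
    memcpy(tail_copy, page + 10, tail_len);
    read_ptr = tail_copy;

    /* Insert: write the 6-byte new segment; the old bytes it overwrites
     * on the page are safe in tail_copy. */
    assert(write_ptr + 6 <= page + DATA_CAPACITY);
    memset(write_ptr, 'N', 6);
    write_ptr += 6;

    /* Delete: skip an 8-byte old segment in the snapshot. */
    read_ptr += 8;

    /* Copy back the remaining unmodified segments from the snapshot. */
    memcpy(write_ptr, read_ptr, tail_len - (read_ptr - tail_copy));
    write_ptr += tail_len - (read_ptr - tail_copy);

    assert(write_ptr - page <= DATA_CAPACITY);  /* holds: 28 <= 32 */
    free(tail_copy);
    return 0;
}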

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

ginRedoRecompress-tail-copy-v1.patch (application/octet-stream)
commit a1debdcdb11376544e7e46348ec13d585a123b7a
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Thu Sep 6 00:17:19 2018 +0300

    Fix past pd_upper write in ginRedoRecompress()
    
    ginRedoRecompress() replays actions over the compressed segments of a
    posting list in place.  However, this might lead to a write past pd_upper,
    because the intermediate state while playing the changes can take more
    space than both the original state and the final state.  This commit fixes
    that by abandoning in-place modification.  Instead, the page tail is copied
    once modification is started, and then it's used as the source of the
    original segments.  Backpatch to 9.4, where posting list compression was
    introduced.
    
    Reported-by: Sivasubramanian Ramasubramanian
    Discussion: https://postgr.es/m/1536091151804.6588%40amazon.com
    Author: Alexander Korotkov, based on a patch and ideas from Sivasubramanian Ramasubramanian
    Backpatch-through: 9.4

diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7a1e94a1d56..06a025afcb8 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -135,6 +135,14 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
 	}
 }
 
+/*
+ * Redo recompression of a posting list.  Doing all the changes in place is
+ * not always possible, because it might require more space than we have on
+ * the page.  Instead, once modification is required, we copy the unprocessed
+ * tail of the page into a separately allocated chunk of memory, so we can
+ * still read the original versions of the segments.  Thanks to that, we
+ * don't have to bother with moving page data in place.
+ */
 static void
 ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 {
@@ -144,6 +152,9 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 	Pointer		segmentend;
 	char	   *walbuf;
 	int			totalsize;
+	Pointer		tailCopy = NULL;
+	Pointer		writePtr;
+	Pointer		segptr;
 
 	/*
 	 * If the page is in pre-9.4 format, convert to new format first.
@@ -183,6 +194,7 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 	}
 
 	oldseg = GinDataLeafPageGetPostingList(page);
+	writePtr = (Pointer) oldseg;
 	segmentend = (Pointer) oldseg + GinDataLeafPageGetPostingListSize(page);
 	segno = 0;
 
@@ -200,8 +212,6 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 		ItemPointerData *newitems;
 		int			nnewitems;
 		int			segsize;
-		Pointer		segptr;
-		int			szleft;
 
 		/* Extract all the information we need from the WAL record */
 		if (a_action == GIN_SEGMENT_INSERT ||
@@ -224,6 +234,16 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 		Assert(segno <= a_segno);
 		while (segno < a_segno)
 		{
+			/*
+			 * Once modification is started and the page tail is copied, we
+			 * have to copy the unmodified segments too.
+			 */
+			if (tailCopy)
+			{
+				Assert(writePtr + SizeOfGinPostingList(oldseg) < PageGetSpecialPointer(page));
+				memcpy(writePtr, (Pointer) oldseg, SizeOfGinPostingList(oldseg));
+			}
+			writePtr += SizeOfGinPostingList(oldseg);
 			oldseg = GinNextPostingListSegment(oldseg);
 			segno++;
 		}
@@ -264,36 +284,42 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 			Assert(a_action == GIN_SEGMENT_INSERT);
 			segsize = 0;
 		}
-		szleft = segmentend - segptr;
+
+		/*
+		 * We're about to start modification of the page.  So, copy tail of the
+		 * page if it's not done already.
+		 */
+		if (!tailCopy && segptr != segmentend)
+		{
+			int tailSize = segmentend - segptr;
+
+			tailCopy = (Pointer) palloc(tailSize);
+			memcpy(tailCopy, segptr, tailSize);
+			segptr = tailCopy;
+			oldseg = (GinPostingList *) segptr;
+			segmentend = segptr + tailSize;
+		}
 
 		switch (a_action)
 		{
 			case GIN_SEGMENT_DELETE:
-				memmove(segptr, segptr + segsize, szleft - segsize);
-				segmentend -= segsize;
-
+				segptr += segsize;
 				segno++;
 				break;
 
 			case GIN_SEGMENT_INSERT:
-				/* make room for the new segment */
-				memmove(segptr + newsegsize, segptr, szleft);
 				/* copy the new segment in place */
-				memcpy(segptr, newseg, newsegsize);
-				segmentend += newsegsize;
-				segptr += newsegsize;
+				Assert(writePtr + newsegsize <= PageGetSpecialPointer(page));
+				memcpy(writePtr, newseg, newsegsize);
+				writePtr += newsegsize;
 				break;
 
 			case GIN_SEGMENT_REPLACE:
-				/* shift the segments that follow */
-				memmove(segptr + newsegsize,
-						segptr + segsize,
-						szleft - segsize);
-				/* copy the replacement segment in place */
-				memcpy(segptr, newseg, newsegsize);
-				segmentend -= segsize;
-				segmentend += newsegsize;
-				segptr += newsegsize;
+				/* copy the new version of segment in place */
+				Assert(writePtr + newsegsize <= PageGetSpecialPointer(page));
+				memcpy(writePtr, newseg, newsegsize);
+				writePtr += newsegsize;
+				segptr += segsize;
 				segno++;
 				break;
 
@@ -303,7 +329,18 @@ ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 		oldseg = (GinPostingList *) segptr;
 	}
 
-	totalsize = segmentend - (Pointer) GinDataLeafPageGetPostingList(page);
+	/* Copy the rest of unmodified segments if any. */
+	segptr = (Pointer) oldseg;
+	if (segptr != segmentend && tailCopy)
+	{
+		int restSize = segmentend - segptr;
+
+		Assert(writePtr + restSize <= PageGetSpecialPointer(page));
+		memcpy(writePtr, segptr, restSize);
+		writePtr += restSize;
+	}
+
+	totalsize = writePtr - (Pointer) GinDataLeafPageGetPostingList(page);
 	GinDataPageSetDataSize(page, totalsize);
 }
 
#12 R, Siva
sivasubr@amazon.com
In reply to: Alexander Korotkov (#11)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Tue, Sep 6, 2018 at 09:53 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

With (fastupdate = on), GIN performs bulk updates of posting lists,
inserting multiple tuples at once if possible. With (fastupdate =
off), GIN always inserts tuples one by one. It might still be possible
to reproduce the issue with (fastupdate = off), but it seems even
harder.

Ah, I see. That is good to know; I will keep it in mind for future testing. Thanks!

BTW, I've tried the patch you posted. On my test case it fails
with the following assertion.
TRAP: FailedAssertion("!(a_action == 2)", File: "ginxlog.c", Line: 243)

I thought more about fixing this issue, and I decided we can fix it in
a less invasive way. Once modification is started, we can copy the tail
of the page into a separately allocated chunk of memory and then use it
as the source of the original segments. See the attached patch.

I'm also running into this assert with the workload. I think my patch
is not correctly handling the case where the action is ADDITEMS on the
last segment of the page. I'm still investigating to find the source
of the bug.

Meanwhile, I reviewed your patch and it looks good to me. I agree that
copying the entire tail out to the scratch space in one shot, versus
copying out every segment, reduces the number of memcpy calls and
simplifies the solution overall. Let us go ahead with this patch.

Best
Siva

#13 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: R, Siva (#12)
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun

On Thu, Sep 6, 2018 at 9:02 PM R, Siva <sivasubr@amazon.com> wrote:

I'm also running into this assert with the workload. I think my patch
is not correctly handling the case where the action is ADDITEMS on the
last segment of the page. I'm still investigating to find the source
of the bug.

Meanwhile, I reviewed your patch and it looks good to me. I agree that
copying the entire tail out to the scratch space in one shot, versus
copying out every segment, reduces the number of memcpy calls and
simplifies the solution overall. Let us go ahead with this patch.

Thank you for the review! Pushed with minor beautification.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company