Failure in contrib test _int on loach

Started by Thomas Munroalmost 7 years ago16 messages
#1Thomas Munro
thomas.munro@gmail.com

Hi,

This is a strange failure:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00

test _int ... FAILED 649 ms

================= pgsql.build/contrib/intarray/regression.diffs
===================
diff -U3 /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
/usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
--- /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
2019-03-21 12:16:30.514677000 +0100
+++ /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
2019-04-05 07:23:10.005914000 +0200
@@ -453,13 +453,13 @@
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count
 -------
-   403
+   402
 (1 row)
 SELECT count(*) from test__int WHERE a @@ '23|50';
  count
 -------
-   403
+   402
 (1 row)

Those two queries are run immediately after:

CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

--
Thomas Munro
https://enterprisedb.com

#2Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Thomas Munro (#1)
Re: Failure in contrib test _int on loach

On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Hi,

This is a strange failure:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&amp;dt=2019-04-05%2005%3A15%3A00

test _int ... FAILED 649 ms

================= pgsql.build/contrib/intarray/regression.diffs
===================
diff -U3 /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
/usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
--- /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
2019-03-21 12:16:30.514677000 +0100
+++ /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
2019-04-05 07:23:10.005914000 +0200
@@ -453,13 +453,13 @@
SELECT count(*) from test__int WHERE a && '{23,50}';
count
-------
-   403
+   402
(1 row)
SELECT count(*) from test__int WHERE a @@ '23|50';
count
-------
-   403
+   402
(1 row)

Those two queries are run immediately after:

CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

There are a couple of other recent instances of this failure, on
francolin and whelk.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: Failure in contrib test _int on loach

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is a strange failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&amp;dt=2019-04-05%2005%3A15%3A00
[ wrong answers from queries using a GIST index ]

There are a couple of other recent instances of this failure, on
francolin and whelk.

Yeah. Given three failures in a couple of days, we can reasonably
guess that the problem was introduced within a day or two prior to
the first one. Looking at what's touched GIST in that time frame,
suspicion has to fall heavily on 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.

If I had to bet, I'd bet that there's something wrong with the
machinations described in the commit message:

For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.

I'll go add this as an open issue.

regards, tom lane

#4Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Tom Lane (#3)
1 attachment(s)
Re: Failure in contrib test _int on loach

05.04.2019 18:01, Tom Lane writes:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is a strange failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&amp;dt=2019-04-05%2005%3A15%3A00
[ wrong answers from queries using a GIST index ]

There are a couple of other recent instances of this failure, on
francolin and whelk.

Yeah. Given three failures in a couple of days, we can reasonably
guess that the problem was introduced within a day or two prior to
the first one. Looking at what's touched GIST in that time frame,
suspicion has to fall heavily on 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.

If I had to bet, I'd bet that there's something wrong with the
machinations described in the commit message:

For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.

I'll go add this as an open issue.

regards, tom lane

Hi,
I've already noticed the same failure in our company buildfarm and
started the research.

You are right, it's the " Generate less WAL during GiST, GIN and SP-GiST
index build. " patch to blame.
Because of using the GistBuildLSN some pages are not linked correctly,
so index scan cannot find some entries, while seqscan finds them.

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

gist_optimized_wal_intarray_test.patchtext/x-patch; name=gist_optimized_wal_intarray_test.patchDownload
commit b233d40c1bd1b270ec0ea1d52887794320a8f595
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Fri Apr 5 19:25:48 2019 +0300

    Test to reproduce gist index problem caused by patch Generate less WAL during GiST, GIN and SP-GiST index build

diff --git a/contrib/intarray/Makefile b/contrib/intarray/Makefile
index 2505294..3535bb1 100644
--- a/contrib/intarray/Makefile
+++ b/contrib/intarray/Makefile
@@ -9,7 +9,7 @@ DATA = intarray--1.2.sql intarray--1.1--1.2.sql intarray--1.0--1.1.sql \
 	intarray--unpackaged--1.0.sql
 PGFILEDESC = "intarray - functions and operators for arrays of integers"
 
-REGRESS = _int
+REGRESS = _int _int2
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/intarray/expected/_int2.out b/contrib/intarray/expected/_int2.out
new file mode 100644
index 0000000..e88fd34
--- /dev/null
+++ b/contrib/intarray/expected/_int2.out
@@ -0,0 +1,48 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+ a | t 
+---+---
+(0 rows)
+
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
diff --git a/contrib/intarray/sql/_int2.sql b/contrib/intarray/sql/_int2.sql
new file mode 100644
index 0000000..d4adca7
--- /dev/null
+++ b/contrib/intarray/sql/_int2.sql
@@ -0,0 +1,20 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 94b6ad6..2183b3a 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -509,6 +509,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 0)
 			{
@@ -531,6 +532,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 1)
 				break;
#5Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Anastasia Lubennikova (#4)
2 attachment(s)
Re: Failure in contrib test _int on loach

05.04.2019 19:41, Anastasia Lubennikova writes:

05.04.2019 18:01, Tom Lane writes:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

This is a strange failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&amp;dt=2019-04-05%2005%3A15%3A00

[ wrong answers from queries using a GIST index ]

There are a couple of other recent instances of this failure, on
francolin and whelk.

Yeah.  Given three failures in a couple of days, we can reasonably
guess that the problem was introduced within a day or two prior to
the first one.  Looking at what's touched GIST in that time frame,
suspicion has to fall heavily on
9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.

If I had to bet, I'd bet that there's something wrong with the
machinations described in the commit message:
          For GiST, the LSN-NSN interlock makes this a little tricky.
All pages must
     be marked with a valid (i.e. non-zero) LSN, so that the
parent-child
     LSN-NSN interlock works correctly. We now use magic value 1 for
that during
     index build. Change the fake LSN counter to begin from 1000, so
that 1 is
     safely smaller than any real or fake LSN. 2 would've been enough
for our
     purposes, but let's reserve a bigger range, in case we need more
special
     values in the future.

I'll go add this as an open issue.

            regards, tom lane

Hi,
I've already noticed the same failure in our company buildfarm and
started the research.

You are right, it's the " Generate less WAL during GiST, GIN and
SP-GiST index build. " patch to blame.
Because of using the GistBuildLSN some pages are not linked correctly,
so index scan cannot find some entries, while seqscan finds them.

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

The problem was caused by incorrect detection of the page to insert new
tuple after split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must to go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

The patches with the test and fix are attached.

Many thanks to Teodor Sigaev, who helped to find the bug.

--

Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

gist_optimized_wal_intarray_fix_v1.patchtext/x-patch; name=gist_optimized_wal_intarray_fix_v1.patchDownload
commit 53938b423ec19a88e470b7b9552de13de5e2634a
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Tue Apr 9 18:58:00 2019 +0300

    fix for gist create index WAL optimization. Track splits correctly

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c..ff98d13 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -639,6 +639,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
 	firststack.lsn = 0;
+	firststack.retry_from_parent = false;
 	firststack.parent = NULL;
 	firststack.downlinkoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
@@ -693,8 +694,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
+		/*
+		 * During gistbuild we cannot rely on lsn-nsn comparison,
+		 * since all pages have the same GistBuildLSN value in these fields.
+		 * So use special flag stack->retry_from_parent to detect the fact that
+		 * we need to go back to parent and rechoose the child.
+		 */
 		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+			((stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			 stack->retry_from_parent == true))
 		{
 			/*
 			 * Concurrent split detected. There's no guarantee that the
@@ -786,6 +794,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 						UnlockReleaseBuffer(stack->buffer);
 						xlocked = false;
 						state.stack = stack = stack->parent;
+						stack->retry_from_parent = true;
 					}
 					continue;
 				}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e2e3f..361a7a8 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -215,6 +215,12 @@ typedef struct GISTInsertStack
 	 */
 	GistNSN		lsn;
 
+	/*
+	 * flag to use during gistbuild to recognize page split
+	 * (instead of lsn-nsn comparison)
+	 */
+	bool retry_from_parent;
+
 	/* offset of the downlink in the parent page, that points to this page */
 	OffsetNumber downlinkoffnum;
 
gist_optimized_wal_intarray_test_v1.patchtext/x-patch; name=gist_optimized_wal_intarray_test_v1.patchDownload
commit 53f4607b50e0516be05ef27d9369c89bdcb9d82b
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Tue Apr 9 18:56:32 2019 +0300

    test

diff --git a/contrib/intarray/Makefile b/contrib/intarray/Makefile
index 2505294..3535bb1 100644
--- a/contrib/intarray/Makefile
+++ b/contrib/intarray/Makefile
@@ -9,7 +9,7 @@ DATA = intarray--1.2.sql intarray--1.1--1.2.sql intarray--1.0--1.1.sql \
 	intarray--unpackaged--1.0.sql
 PGFILEDESC = "intarray - functions and operators for arrays of integers"
 
-REGRESS = _int
+REGRESS = _int _int2
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/intarray/expected/_int2.out b/contrib/intarray/expected/_int2.out
new file mode 100644
index 0000000..8346334
--- /dev/null
+++ b/contrib/intarray/expected/_int2.out
@@ -0,0 +1,47 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+ a | t 
+---+---
+(0 rows)
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
diff --git a/contrib/intarray/sql/_int2.sql b/contrib/intarray/sql/_int2.sql
new file mode 100644
index 0000000..d4adca7
--- /dev/null
+++ b/contrib/intarray/sql/_int2.sql
@@ -0,0 +1,20 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 94b6ad6..2183b3a 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -509,6 +509,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 0)
 			{
@@ -531,6 +532,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 1)
 				break;
#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Anastasia Lubennikova (#5)
Re: Failure in contrib test _int on loach

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

05.04.2019 19:41, Anastasia Lubennikova writes:

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

The problem was caused by incorrect detection of the page to insert new
tuple after split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must to go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?

- Heikki

#7Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Heikki Linnakangas (#6)
1 attachment(s)
Re: Failure in contrib test _int on loach

On 10/04/2019 20:25, Heikki Linnakangas wrote:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

05.04.2019 19:41, Anastasia Lubennikova writes:

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

The problem was caused by incorrect detection of the page to insert new
tuple after split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must to go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?

Based on Anastasia's idea i prepare alternative solution to fix the bug
(see attachment).
It utilizes the idea of linear increment of LSN/NSN. WAL write process
is used for change NSN value to 1 for each block of index relation.
I hope this can be a fairly clear and safe solution.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachments:

0001-Alt-fix-for-gist_optimized_wal_intarray_test-bug.patchtext/x-patch; name=0001-Alt-fix-for-gist_optimized_wal_intarray_test-bug.patchDownload
From 59e1519a0a48b879777820ff68116c68ed31e684 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>
Date: Thu, 11 Apr 2019 10:52:39 +0500
Subject: [PATCH] Alt fix for gist_optimized_wal_intarray_test bug

---
 src/backend/access/gin/gininsert.c      |  2 +-
 src/backend/access/gist/gist.c          |  4 ++--
 src/backend/access/gist/gistbuild.c     | 17 +++++++++++++++--
 src/backend/access/spgist/spginsert.c   |  2 +-
 src/backend/access/transam/xloginsert.c |  4 +++-
 src/include/access/gist.h               |  6 ------
 src/include/access/gist_private.h       |  2 ++
 src/include/access/xloginsert.h         |  6 +++++-
 8 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab14617..a63f33b429 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -413,7 +413,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	{
 		log_newpage_range(index, MAIN_FORKNUM,
 						  0, RelationGetNumberOfBlocks(index),
-						  true);
+						  true, NULL);
 	}
 
 	/*
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c840..56f6ce04db 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -492,7 +492,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * we don't need to be able to detect concurrent splits yet.)
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+			recptr = gistBuildLSN++;
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -559,7 +559,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+			recptr = gistBuildLSN++;
 		else
 		{
 			if (RelationNeedsWAL(rel))
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 8e81eda517..31118b54cf 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -76,6 +76,13 @@ typedef struct
 	GistBufferingMode bufferingMode;
 } GISTBuildState;
 
+/*
+ * A bogus LSN / NSN value used during index build. Must be smaller than any
+ * real or fake unlogged LSN, so that after an index build finishes, all the
+ * splits are considered completed.
+ */
+XLogRecPtr gistBuildLSN = 0;
+
 /* prototypes for private functions */
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -107,6 +114,12 @@ static void gistMemorizeParent(GISTBuildState *buildstate, BlockNumber child,
 static void gistMemorizeAllDownlinks(GISTBuildState *buildstate, Buffer parent);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+static void
+gistbuild_log_mask(char *page)
+{
+	GistPageSetNSN((Page) page, (uint64) 1);
+}
+
 /*
  * Main entry point to GiST index build. Initially calls insert over and over,
  * but switches to more efficient buffering build algorithm after a certain
@@ -180,7 +193,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	GISTInitBuffer(buffer, F_LEAF);
 
 	MarkBufferDirty(buffer);
-	PageSetLSN(page, GistBuildLSN);
+	PageSetLSN(page, gistBuildLSN++);
 
 	UnlockReleaseBuffer(buffer);
 
@@ -222,7 +235,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	{
 		log_newpage_range(index, MAIN_FORKNUM,
 						  0, RelationGetNumberOfBlocks(index),
-						  true);
+						  true, gistbuild_log_mask);
 	}
 
 	/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index b40bd440cf..50f302d045 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -139,7 +139,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	{
 		log_newpage_range(index, MAIN_FORKNUM,
 						  0, RelationGetNumberOfBlocks(index),
-						  true);
+						  true, NULL);
 	}
 
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1c76dcfa0d..ab8af41d99 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1041,7 +1041,7 @@ log_newpage_buffer(Buffer buffer, bool page_std)
 void
 log_newpage_range(Relation rel, ForkNumber forkNum,
 				  BlockNumber startblk, BlockNumber endblk,
-				  bool page_std)
+				  bool page_std, log_mask_callback mask)
 {
 	BlockNumber blkno;
 
@@ -1089,6 +1089,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
 		for (i = 0; i < nbufs; i++)
 		{
 			XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+			if (mask)
+				mask(BufferGetPage(bufpack[i]));
 			MarkBufferDirty(bufpack[i]);
 		}
 
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 6902f4115b..d09f992e1f 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -49,12 +49,6 @@
 
 typedef XLogRecPtr GistNSN;
 
-/*
- * A bogus LSN / NSN value used during index build. Must be smaller than any
- * real or fake unlogged LSN, so that after an index build finishes, all the
- * splits are considered completed.
- */
-#define GistBuildLSN	((XLogRecPtr) 1)
 
 /*
  * For on-disk compatibility with pre-9.3 servers, NSN is stored as two
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e2e3fb31..64469640e4 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -381,6 +381,8 @@ typedef struct GiSTOptions
 	int			bufferingModeOffset;	/* use buffering build? */
 } GiSTOptions;
 
+extern XLogRecPtr gistBuildLSN;
+
 /* gist.c */
 extern void gistbuildempty(Relation index);
 extern bool gistinsert(Relation r, Datum *values, bool *isnull,
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 30c4ff7bea..0d56ee2266 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -38,6 +38,8 @@
 #define REGBUF_KEEP_DATA	0x10	/* include data even if a full-page image
 									 * is taken */
 
+typedef void (*log_mask_callback)(char *page);
+
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
 extern void XLogSetRecordFlags(uint8 flags);
@@ -55,8 +57,10 @@ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
 extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
 			BlockNumber blk, char *page, bool page_std);
 extern XLogRecPtr log_newpage_buffer(Buffer buffer, bool page_std);
+
 extern void log_newpage_range(Relation rel, ForkNumber forkNum,
-				  BlockNumber startblk, BlockNumber endblk, bool page_std);
+				  BlockNumber startblk, BlockNumber endblk, bool page_std,
+				  log_mask_callback mask);
 extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
 
 extern void InitXLogInsert(void);
-- 
2.17.1

#8Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Lepikhov (#7)
Re: Failure in contrib test _int on loach

On 11/04/2019 09:10, Andrey Lepikhov wrote:

On 10/04/2019 20:25, Heikki Linnakangas wrote:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?

Based on Anastasia's idea i prepare alternative solution to fix the bug
(see attachment).
It utilizes the idea of linear increment of LSN/NSN. WAL write process
is used for change NSN value to 1 for each block of index relation.
I hope this can be a fairly clear and safe solution.

That's basically the same idea as always using the "fake LSN" during
index build, like the original version of this patch did. It's got the
problem that I mentioned at
/messages/by-id/090fb3cb-1ca4-e173-ecf7-47d41ebac620@iki.fi:

* Using "fake" unlogged LSNs for GiST index build seemed fishy. I could
not convince myself that it was safe in all corner cases. In a recently
initdb'd cluster, it's theoretically possible that the fake LSN counter
overtakes the real LSN value, and that could lead to strange behavior.
For example, how would the buffer manager behave, if there was a dirty
page in the buffer cache with an LSN value that's greater than the
current WAL flush pointer? I think you'd get "ERROR: xlog flush request
%X/%X is not satisfied --- flushed only to %X/%X".

Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE,
so with defaults WAL segment size, the index build would have to do
about 16 million page splits. The index would have to be at least 150 GB
for that. But it seems possible, and with non-default segment and page
size settings more so.

Perhaps we could start at 1, but instead of using a global counter,
whenever a page is split, we take the parent's LSN value and increment
it by one. So different branches of the tree could use the same values,
which would reduce the consumption of the counter values.

Yet another idea would be to start the counter at 1, but check that it
doesn't overtake the WAL insert pointer. If it's about to overtake it,
just generate some dummy WAL.

But it seems best to deal with this in gistdoinsert(). I think
Anastasia's approach of adding a flag to GISTInsertStack can be made to
work, if we set the flag somewhere in gistinserttuples() or
gistplacetopage(), whenever a page is split. That way, if it needs to
split multiple levels, the flag is set on all of the corresponding
GISTInsertStack entries.

Yet another trivial fix would be just always start the tree descend from
the root in gistdoinsert(), if a page is split. Not as efficient, but
probably negligible in practice.

- Heikki

#9Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Heikki Linnakangas (#8)
Re: Failure in contrib test _int on loach

On 11/04/2019 13:14, Heikki Linnakangas wrote:

On 11/04/2019 09:10, Andrey Lepikhov wrote:

On 10/04/2019 20:25, Heikki Linnakangas wrote:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?

Based on Anastasia's idea i prepare alternative solution to fix the bug
(see attachment).
It utilizes the idea of linear increment of LSN/NSN. WAL write process
is used for change NSN value to 1 for each block of index relation.
I hope this can be a fairly clear and safe solution.

That's basically the same idea as always using the "fake LSN" during
index build, like the original version of this patch did. It's got the
problem that I mentioned at
/messages/by-id/090fb3cb-1ca4-e173-ecf7-47d41ebac620@iki.fi:

* Using "fake" unlogged LSNs for GiST index build seemed fishy. I
could not convince myself that it was safe in all corner cases. In a
recently initdb'd cluster, it's theoretically possible that the fake
LSN counter overtakes the real LSN value, and that could lead to
strange behavior. For example, how would the buffer manager behave, if
there was a dirty page in the buffer cache with an LSN value that's
greater than the current WAL flush pointer? I think you'd get "ERROR:
xlog flush request %X/%X is not satisfied --- flushed only to %X/%X".

Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE,
so with defaults WAL segment size, the index build would have to do
about 16 million page splits. The index would have to be at least 150 GB
for that. But it seems possible, and with non-default segment and page
size settings more so.

As i see in bufmgr.c, XLogFlush() can't called during index build. In
the log_newpage_range() call we can use mask to set value of NSN (and
LSN) to 1.

Perhaps we could start at 1, but instead of using a global counter,
whenever a page is split, we take the parent's LSN value and increment
it by one. So different branches of the tree could use the same values,
which would reduce the consumption of the counter values.

Yet another idea would be to start the counter at 1, but check that it
doesn't overtake the WAL insert pointer. If it's about to overtake it,
just generate some dummy WAL.

But it seems best to deal with this in gistdoinsert(). I think
Anastasia's approach of adding a flag to GISTInsertStack can be made to
work, if we set the flag somewhere in gistinserttuples() or
gistplacetopage(), whenever a page is split. That way, if it needs to
split multiple levels, the flag is set on all of the corresponding
GISTInsertStack entries.

Yet another trivial fix would be just always start the tree descend from
the root in gistdoinsert(), if a page is split. Not as efficient, but
probably negligible in practice.

Agree

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

#10Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Heikki Linnakangas (#6)
Re: Failure in contrib test _int on loach

10.04.2019 18:25, Heikki Linnakangas writes:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

05.04.2019 19:41, Anastasia Lubennikova writes:

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

The problem was caused by incorrect detection of the page to insert new
tuple after split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must to go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following
comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that
we'd need to climb further up?

From what I understand,
the only reason for grandparent's split during gistbuild is the
insertion of the newtup returned by gistgetadjusted().

After we stepped up the stack, we will do gistchoose() to choose new
correct child,
adjust the downlink key and insert it into grandparent. If this
insertion caused split, we will recursively follow the same codepath
and set stack->retry_from_parent again.

So it is possible, but it doesn't require any extra algorithm changes.
I didn't manage to generate dataset to reproduce grandparent split.
Though, I do agree that it's worth checking out. Do you have any ideas?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Anastasia Lubennikova (#10)
Re: Failure in contrib test _int on loach

Anastasia Lubennikova <a.lubennikova@postgrespro.ru> writes:

So it is possible, but it doesn't require any extra algorithm changes.
I didn't manage to generate dataset to reproduce grandparent split.
Though, I do agree that it's worth checking out. Do you have any ideas?

Ping? This thread has gone cold, but the bug is still there, and
IMV it's a beta blocker.

regards, tom lane

#12Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Tom Lane (#11)
2 attachment(s)
Re: Failure in contrib test _int on loach

27.04.2019 22:05, Tom Lane wrote:

Anastasia Lubennikova <a.lubennikova@postgrespro.ru> writes:

So it is possible, but it doesn't require any extra algorithm changes.
I didn't manage to generate dataset to reproduce grandparent split.
Though, I do agree that it's worth checking out. Do you have any ideas?

Ping? This thread has gone cold, but the bug is still there, and
IMV it's a beta blocker.

Hi,

Thank you for the reminder.
In a nutshell, this fix is ready for committer.

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

gist_optimized_wal_intarray_fix_v1.patchtext/x-patch; name=gist_optimized_wal_intarray_fix_v1.patchDownload
commit 53938b423ec19a88e470b7b9552de13de5e2634a
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Tue Apr 9 18:58:00 2019 +0300

    fix for gist create index WAL optimization. Track splits correctly

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c..ff98d13 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -639,6 +639,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
 	firststack.lsn = 0;
+	firststack.retry_from_parent = false;
 	firststack.parent = NULL;
 	firststack.downlinkoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
@@ -693,8 +694,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
+		/*
+		 * During gistbuild we cannot rely on lsn-nsn comparison,
+		 * since all pages have the same GistBuildLSN value in these fields.
+		 * So use special flag stack->retry_from_parent to detect the fact that
+		 * we need to go back to parent and rechoose the child.
+		 */
 		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+			((stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			 stack->retry_from_parent == true))
 		{
 			/*
 			 * Concurrent split detected. There's no guarantee that the
@@ -786,6 +794,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 						UnlockReleaseBuffer(stack->buffer);
 						xlocked = false;
 						state.stack = stack = stack->parent;
+						stack->retry_from_parent = true;
 					}
 					continue;
 				}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e2e3f..361a7a8 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -215,6 +215,12 @@ typedef struct GISTInsertStack
 	 */
 	GistNSN		lsn;
 
+	/*
+	 * flag to use during gistbuild to recognize page split
+	 * (instead of lsn-nsn comparison)
+	 */
+	bool retry_from_parent;
+
 	/* offset of the downlink in the parent page, that points to this page */
 	OffsetNumber downlinkoffnum;
 
gist_optimized_wal_intarray_test_v1.patchtext/x-patch; name=gist_optimized_wal_intarray_test_v1.patchDownload
commit 53f4607b50e0516be05ef27d9369c89bdcb9d82b
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Tue Apr 9 18:56:32 2019 +0300

    test

diff --git a/contrib/intarray/Makefile b/contrib/intarray/Makefile
index 2505294..3535bb1 100644
--- a/contrib/intarray/Makefile
+++ b/contrib/intarray/Makefile
@@ -9,7 +9,7 @@ DATA = intarray--1.2.sql intarray--1.1--1.2.sql intarray--1.0--1.1.sql \
 	intarray--unpackaged--1.0.sql
 PGFILEDESC = "intarray - functions and operators for arrays of integers"
 
-REGRESS = _int
+REGRESS = _int _int2
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/intarray/expected/_int2.out b/contrib/intarray/expected/_int2.out
new file mode 100644
index 0000000..8346334
--- /dev/null
+++ b/contrib/intarray/expected/_int2.out
@@ -0,0 +1,47 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+ count 
+-------
+   403
+(1 row)
+
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+ count 
+-------
+   403
+(1 row)
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+ a | t 
+---+---
+(0 rows)
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+             a             
+---------------------------
+ {193,213,230,266,285,299}
+(1 row)
+
diff --git a/contrib/intarray/sql/_int2.sql b/contrib/intarray/sql/_int2.sql
new file mode 100644
index 0000000..d4adca7
--- /dev/null
+++ b/contrib/intarray/sql/_int2.sql
@@ -0,0 +1,20 @@
+CREATE TABLE test__int2( a int[] );
+\copy test__int2 from 'data/test__int.data'
+ANALYZE test__int2;
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+CREATE INDEX text_idx2 on test__int2 using gist ( a gist__int_ops );
+
+SELECT count(*) from test__int2 WHERE a && '{23,50}';
+SELECT count(*) from test__int2 WHERE a @@ '23|50';
+
+SELECT a, t FROM (SELECT a::text, (SELECT COUNT (*) FROM test__int2 t2 WHERE t2.a = t1.a) t from test__int2 t1 WHERE a IS NOT NULL) s WHERE t = 0;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
+
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+
+select * from test__int2 where a = '{193,213,230,266,285,299}';
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 94b6ad6..2183b3a 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -509,6 +509,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 0)
 			{
@@ -531,6 +532,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 			{
 				/* we didn't make the random choice yet for this old best */
 				keep_current_best = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+				keep_current_best = 0; /* FIXME only for debugging */
 			}
 			if (keep_current_best == 1)
 				break;
#13Teodor Sigaev
teodor@sigaev.ru
In reply to: Anastasia Lubennikova (#12)
Re: Failure in contrib test _int on loach

Hi!

So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

Patch looks right, but I think that comment should be improved in follow piece:

         if (stack->blkno != GIST_ROOT_BLKNO &&
-           stack->parent->lsn < GistPageGetNSN(stack->page))
+           ((stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+            stack->retry_from_parent == true))
         {
             /*
              * Concurrent split detected. There's no guarantee that the
		....
Not only concurrent split could be deteced here and it was missed long ago. But 
this patch seems a good chance to change this comment.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#14Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Anastasia Lubennikova (#12)
Re: Failure in contrib test _int on loach

(resending, previous attempt didn't make it to pgsql-hackers)

On 29/04/2019 16:16, Anastasia Lubennikova wrote:

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

I still believe there is a problem with grandparent splits with this.
I'll try to construct a test case later this week, unless you manage to
create one before that.

- Heikki

#15Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#14)
2 attachment(s)
Re: Failure in contrib test _int on loach

On 02/05/2019 10:37, Heikki Linnakangas wrote:

(resending, previous attempt didn't make it to pgsql-hackers)

On 29/04/2019 16:16, Anastasia Lubennikova wrote:

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

I still believe there is a problem with grandparent splits with this.
I'll try to construct a test case later this week, unless you manage to
create one before that.

Here you go. If you apply the two patches from
/messages/by-id/5d48ce28-34cf-9b03-5d42-dbd5457926bf@postgrespro.ru,
and run the attached script, it will print out something like this:

postgres=# \i grandparent.sql
DROP TABLE
CREATE TABLE
INSERT 0 150000
CREATE INDEX
psql:grandparent.sql:27: NOTICE: working on 10000
psql:grandparent.sql:27: NOTICE: working on 20000
psql:grandparent.sql:27: NOTICE: working on 30000
psql:grandparent.sql:27: NOTICE: working on 40000
psql:grandparent.sql:27: NOTICE: working on 50000
psql:grandparent.sql:27: NOTICE: working on 60000
psql:grandparent.sql:27: NOTICE: working on 70000
psql:grandparent.sql:27: NOTICE: working on 80000
psql:grandparent.sql:27: NOTICE: working on 90000
psql:grandparent.sql:27: NOTICE: working on 100000
psql:grandparent.sql:27: NOTICE: working on 110000
psql:grandparent.sql:27: NOTICE: failed for 114034
psql:grandparent.sql:27: NOTICE: working on 120000
DO

That "failed for 114034" should not happen.

Fortunately, that's not too hard to fix. We just need to arrange things
so that the "retry_from_parent" flag also gets set for the grandparent,
when the grandparent is split. Like in the attached patch.

- Heikki

Attachments:

0001-Detect-internal-GiST-page-splits-correctly-during-in.patchtext/x-patch; name=0001-Detect-internal-GiST-page-splits-correctly-during-in.patchDownload
From 7e0b7fa120b9c4c3e0f0f71dc7a5f35ba806ea7f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 8 May 2019 01:29:52 +0300
Subject: [PATCH 1/1] Detect internal GiST page splits correctly during index
 build.

As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.

Thomas Munro noticed that the contrib/intarray regression test started
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.

To fix, add a flag to the stack that keeps track of the state while
descending tree, to indicate that a page was split, and we need to retry
the descend from the parent.

Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split also causes the "grandparent"
to be split.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com
---
 src/backend/access/gist/gist.c    | 33 +++++++++++++++++++++++++++++++
 src/include/access/gist_private.h |  7 +++++++
 2 files changed, 40 insertions(+)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 769aca42e3..d70a138f54 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -639,6 +639,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
 	firststack.lsn = 0;
+	firststack.retry_from_parent = false;
 	firststack.parent = NULL;
 	firststack.downlinkoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
@@ -651,6 +652,21 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 */
 	for (;;)
 	{
+		/*
+		 * If we split an internal page while descending the tree, we have to
+		 * retry at the parent. (Normally, the LSN-NSN interlock below would
+		 * also catch this and cause us to retry. But LSNs are not updated
+		 * during index build.)
+		 */
+		while (stack->retry_from_parent)
+		{
+			if (xlocked)
+				LockBuffer(stack->buffer, GIST_UNLOCK);
+			xlocked = false;
+			ReleaseBuffer(stack->buffer);
+			state.stack = stack = stack->parent;
+		}
+
 		if (XLogRecPtrIsInvalid(stack->lsn))
 			stack->buffer = ReadBuffer(state.r, stack->blkno);
 
@@ -1376,6 +1392,23 @@ gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 					 unlockbuf	/* Unlock stack->buffer if caller wants that */
 		);
 	Assert(left->buf == stack->buffer);
+
+	/*
+	 * If we split the page because we had to adjust the downlink on an
+	 * internal page, while descending the tree for inserting a new tuple,
+	 * then this might no longer be the correct page for the new tuple. The
+	 * downlink to this page might not cover the new tuple anymore, it might
+	 * need to go to the newly-created right sibling instead. Tell the caller
+	 * to walk back up the stack, to re-check at the parent which page to
+	 * insert to.
+	 *
+	 * Normally, the LSN-NSN interlock during the tree descend would also
+	 * detect that a concurrent split happened (by ourselves), and cause us to
+	 * retry at the parent. But that mechanism doesn't work during index
+	 * build, because we don't do WAL-logging, and don't update LSNs, during
+	 * index build.
+	 */
+	stack->retry_from_parent = true;
 }
 
 /*
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index ccf050cd62..614d9996ec 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -215,6 +215,13 @@ typedef struct GISTInsertStack
 	 */
 	GistNSN		lsn;
 
+	/*
+	 * If set, we split the page while descending the tree to find an
+	 * insertion target. It means that we need to retry from the parent,
+	 * because the downlink of this page might no longer cover the new key.
+	 */
+	bool		retry_from_parent;
+
 	/* offset of the downlink in the parent page, that points to this page */
 	OffsetNumber downlinkoffnum;
 
-- 
2.20.1

grandparent.sqlapplication/sql; name=grandparent.sqlDownload
#16Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#15)
Re: Failure in contrib test _int on loach

On 08/05/2019 01:31, Heikki Linnakangas wrote:

On 02/05/2019 10:37, Heikki Linnakangas wrote:

On 29/04/2019 16:16, Anastasia Lubennikova wrote:

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

I still believe there is a problem with grandparent splits with this.
I'll try to construct a test case later this week, unless you manage to
create one before that.

Here you go. If you apply the two patches from
/messages/by-id/5d48ce28-34cf-9b03-5d42-dbd5457926bf@postgrespro.ru,
and run the attached script, it will print out something like this:

postgres=# \i grandparent.sql
DROP TABLE
CREATE TABLE
INSERT 0 150000
CREATE INDEX
psql:grandparent.sql:27: NOTICE: working on 10000
psql:grandparent.sql:27: NOTICE: working on 20000
psql:grandparent.sql:27: NOTICE: working on 30000
psql:grandparent.sql:27: NOTICE: working on 40000
psql:grandparent.sql:27: NOTICE: working on 50000
psql:grandparent.sql:27: NOTICE: working on 60000
psql:grandparent.sql:27: NOTICE: working on 70000
psql:grandparent.sql:27: NOTICE: working on 80000
psql:grandparent.sql:27: NOTICE: working on 90000
psql:grandparent.sql:27: NOTICE: working on 100000
psql:grandparent.sql:27: NOTICE: working on 110000
psql:grandparent.sql:27: NOTICE: failed for 114034
psql:grandparent.sql:27: NOTICE: working on 120000
DO

That "failed for 114034" should not happen.

Fortunately, that's not too hard to fix. We just need to arrange things
so that the "retry_from_parent" flag also gets set for the grandparent,
when the grandparent is split. Like in the attached patch.

I hear no objections, so pushed that. But if you have a chance to review
this later, just to double-check, I'd still appreciate that.

- Heikki