contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS

Started by Tom Lane · over 4 years ago · 5 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

husky just reported $SUBJECT:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17

and I find I can reproduce that locally:

diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
--- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out	2021-01-20 11:12:24.854346717 -0500
+++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out	2021-06-06 22:12:07.948890104 -0400
@@ -215,7 +215,8 @@
      0 | f           | f
      1 | f           | f
      2 | t           | t
-(3 rows)
+     3 | t           | t
+(4 rows)

select * from pg_check_frozen('copyfreeze');
t_ctid
@@ -235,7 +236,8 @@
     0 | t           | t
     1 | f           | f
     2 | t           | t
-(3 rows)
+     3 | t           | t
+(4 rows)

select * from pg_check_frozen('copyfreeze');
t_ctid

The test cases that are failing date back to January (7db0cd2145f),
so I think this is some side-effect of a recent commit, but I have
no idea which one.

regards, tom lane

#2 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#1)
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS

On Mon, Jun 7, 2021 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

husky just reported $SUBJECT:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17

and I find I can reproduce that locally:

diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
--- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out       2021-01-20 11:12:24.854346717 -0500
+++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out        2021-06-06 22:12:07.948890104 -0400
@@ -215,7 +215,8 @@
0 | f           | f
1 | f           | f
2 | t           | t
-(3 rows)
+     3 | t           | t
+(4 rows)

select * from pg_check_frozen('copyfreeze');
t_ctid
@@ -235,7 +236,8 @@
0 | t | t
1 | f | f
2 | t | t
-(3 rows)
+ 3 | t | t
+(4 rows)

select * from pg_check_frozen('copyfreeze');
t_ctid

The test cases that are failing date back to January (7db0cd2145f),
so I think this is some side-effect of a recent commit, but I have
no idea which one.

It seems like the recent revert (8e03eb92e9a) is relevant.

After committing 7db0cd2145f we had the same regression test failure
in January[1]. Then we fixed that issue by 39b66a91b. But since we
recently reverted most of 39b66a91b, the same issue happened again.

Regards,

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-01-19+20%3A27%3A46

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#3 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#2)
1 attachment(s)
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS

On Mon, Jun 7, 2021 at 4:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

[...]

It seems like the recent revert (8e03eb92e9a) is relevant.

After committing 7db0cd2145f we had the same regression test failure
in January[1]. Then we fixed that issue by 39b66a91b. But since we
recently reverted most of 39b66a91b, the same issue happened again.

So the cause of this failure seems to be the same as before. The failing test is:

begin;
truncate copyfreeze;
copy copyfreeze from stdin freeze;
1 '1'
2 '2'
3 '3'
4 '4'
5 '5'
\.
copy copyfreeze from stdin;
6 '6'
\.
copy copyfreeze from stdin freeze;
7 '7'
8 '8'
9 '9'
10 '10'
11 '11'
12 '12'
\.
commit;

If the target block cache is invalidated before the third COPY, we
will start to insert the frozen tuple into a new page, resulting in
adding two blocks in total during the third COPY. I think we still
need the following part of the reverted code so that we don't leave
the page partially empty after relcache invalidation:

--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
         * target.
         */
        targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
-   }
-   /*
-    * If the FSM knows nothing of the rel, try the last page before we give
-    * up and extend.  This avoids one-tuple-per-page syndrome during
-    * bootstrapping or in a recently-started system.
-    */
-   if (targetBlock == InvalidBlockNumber)
-   {
-       BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
+       /*
+        * If the FSM knows nothing of the rel, try the last page before we
+        * give up and extend.  This avoids one-tuple-per-page syndrome during
+        * bootstrapping or in a recently-started system.
+        */
+       if (targetBlock == InvalidBlockNumber)
+       {
+           BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-       if (nblocks > 0)
-           targetBlock = nblocks - 1;
+           if (nblocks > 0)
+               targetBlock = nblocks - 1;
+       }
    }

Attached the patch that brings back the above change.
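
For readers without the tree at hand, here is a condensed sketch of how
the target-block selection near the top of RelationGetBufferForTuple()
reads once the attached patch is applied (the hunk quoted in the message
body shows the change in the direction of the revert; the attachment
applies the reverse). The lines around the hunk are paraphrased rather
than copied from hio.c, so treat this as an illustration, not the exact
source:

    /* Sketch only -- see hio.c for the real code. */
    if (bistate && bistate->current_buf != InvalidBuffer)
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    else
        targetBlock = RelationGetTargetBlock(relation); /* may be invalidated */

    if (targetBlock == InvalidBlockNumber && use_fsm)
    {
        /* No cached target page, so ask the FSM for an initial target. */
        targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
    }

    /*
     * Still no target -- e.g. the cached block was lost to a relcache
     * invalidation and the FSM was skipped or knows nothing of the rel.
     * Try the last existing page before giving up and extending.
     */
    if (targetBlock == InvalidBlockNumber)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

        if (nblocks > 0)
            targetBlock = nblocks - 1;
    }

The point is that the last-page fallback again sits outside the use_fsm
branch, so it also applies when the FSM is not consulted (presumably the
case for a COPY into a table truncated in the same transaction). After a
relcache invalidation the insert then goes to the existing, partially
filled last page instead of to a newly extended one.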

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

use_last_block.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c47c7522ca..ffc89685bf 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		 * target.
 		 */
 		targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
+	}
 
-		/*
-		 * If the FSM knows nothing of the rel, try the last page before we
-		 * give up and extend.  This avoids one-tuple-per-page syndrome during
-		 * bootstrapping or in a recently-started system.
-		 */
-		if (targetBlock == InvalidBlockNumber)
-		{
-			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
+	/*
+	 * If the FSM knows nothing of the rel, try the last page before we
+	 * give up and extend.  This avoids one-tuple-per-page syndrome during
+	 * bootstrapping or in a recently-started system.
+	 */
+	if (targetBlock == InvalidBlockNumber)
+	{
+		BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
 
-			if (nblocks > 0)
-				targetBlock = nblocks - 1;
-		}
+		if (nblocks > 0)
+			targetBlock = nblocks - 1;
 	}
 
 loop:
#4 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Masahiko Sawada (#3)
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS

On 6/7/21 2:11 PM, Masahiko Sawada wrote:

[...]

If the target block cache is invalidated before the third COPY, we
will start to insert the frozen tuple into a new page, resulting in
adding two blocks in total during the third COPY. I think we still
need the following part of the reverted code so that we don't leave
the page partially empty after relcache invalidation:

[...]

Attached the patch that brings back the above change.

Thanks for the analysis! I think you're right - this bit should have
been kept. Partial reverts are tricky :-(

I'll get this fixed / pushed later today, after a bit more testing. I'd
swear I ran tests with CCA, but it's possible I skipped contrib.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#4)
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS

On Mon, Jun 7, 2021 at 10:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

[...]

Thanks for the analysis! I think you're right - this bit should have
been kept. Partial reverts are tricky :-(

I'll get this fixed / pushed later today, after a bit more testing. I'd
swear I ran tests with CCA, but it's possible I skipped contrib.

I had missed this mail. Thank you for pushing the fix!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/