pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum

Started by Daniel Shelepanovabout 4 years ago3 messagesbugs
Jump to latest
#1Daniel Shelepanov
deniel1495@mail.ru

Hello everyone!

I've been examining a floating bug in pg_visibility. To reproduce it,
you have to:

1) apply a patch to pg_visibility's tests (you can find it attached)

2) run test-icx.sh (attached) from the root of the postgresql's source tree

pg_vis_clean.patch adds one additional test to pg_visibility test suite
and normally it is expected to pass. However, when (I assume so) it
happens to run in parallel with autovacuum, the test is failed.
test-icx.sh runs several parallel installchecks on contrib/pg_visibility
tests and there is a chance that some of them will catch this error.

This is the set-up for a bug reproduction:

*create table vacuum_test as select 42 i;**
**vacuum (disable_page_skipping) vacuum_test;**
*

This is the actual test result:

*diff -U3
/tmp/icx/ic01-1/contrib/pg_visibility/expected/pg_visibility.out
/tmp/icx/ic01-1/contrib/pg_visibility/results/pg_visibility.out**
**--- /tmp/icx/ic01-1/contrib/pg_visibility/expected/pg_visibility.out
2022-02-17 23:46:26.005500630 +0300**
**+++ /tmp/icx/ic01-1/contrib/pg_visibility/results/pg_visibility.out
2022-02-17 23:47:36.066838872 +0300**
**@@ -247,7 +247,7 @@**
** select count(*) > 0 from pg_check_visible('vacuum_test');**
**  ?column?**
** ----------**
**- f**
**+ t**
** (1 row)**
**
** -- cleanup*

The thing is that the underlying file is one page long which should
contain the only all-visible tuple and pg_check_visible should return
nothing. Is this a real bug or do I miss something? Thank you in advance
for the response

P. S.: SELECT version(); result: *PostgreSQL 14.2 on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04)
9.3.0, 64-bit*

Attachments:

test-icx.shapplication/x-shellscript; name=test-icx.shDownload
pg_vis_clean.patchtext/x-patch; charset=UTF-8; name=pg_vis_clean.patchDownload+14-0
#2Andres Freund
andres@anarazel.de
In reply to: Daniel Shelepanov (#1)
Re: pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum

Hi,

On 2022-02-18 00:09:48 +0300, Daniel Shelepanov wrote:

I've been examining a floating bug in pg_visibility. To reproduce it, you
have to:

1) apply a patch to pg_visibility's tests (you can find it attached)

2) run test-icx.sh (attached) from the root of the postgresql's source tree

Could you try to minimize the script? A 300 line reproducer is quite long. And
it looks like it won't even work in non postgres-pro tree.

One thing to do would be to modify pg_visibility to elog(PANIC, "something")
when it encounters corruption. Then you would have a chance of inspecting the
state of the tuple/page in that moment.

Greetings,

Andres Freund

#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum

Hi,

On 2022-02-18 08:56:37 -0800, Andres Freund wrote:

Could you try to minimize the script? A 300 line reproducer is quite long. And
it looks like it won't even work in non postgres-pro tree.

One thing to do would be to modify pg_visibility to elog(PANIC, "something")
when it encounters corruption. Then you would have a chance of inspecting the
state of the tuple/page in that moment.

Oh, I have been able to reliably reproduce this on HEAD. I modified
record_corrupt_item() to PANIC and then:

psql regression:
BEGIN ;SELECT txid_current();
<leave open>

psql postgres
DROP TABLE IF EXISTS vacuum_test_0;
create table vacuum_test_0 as select 42 i;
vacuum (disable_page_skipping) vacuum_test_0;
select * from pg_check_visible('vacuum_test_0');

At which point there immediately is a crash.

This reproduces in earlier versions too, at least back to 10.

I *think* this is a false positive:

- PGPROC->xmin is computed without regard for the database in which the other
sessions are running. Due to the the txid_current() session this includes an
older xid.

- During the VACUUM in vis.sql the only connection to the database pgbench
connects to is VACUUM and thus ignored when determining horizons (due to
PROC_IN_VACUUM). Therefore the horizon is computed to latestCompletedXid +
1.

- But during pg_check_visible(), the current session is *not* marked as
PROC_IN_VACUUM. So the horizon is the xid from the txid_current().

Boom.

Greetings,

Andres Freund

Attachments:

repro.sqlapplication/sqlDownload