Undiagnosed bug in Bloom index
I am getting corrupted Bloom indexes in which a tuple in the table
heap is not in the index.
I see it as early as commit a9284849b48b, with commit e13ac5586c49c
cherry-picked onto it. I don't see it before a9284849b48b because the
test case segfaults before anything interesting can happen. I think
this is an ab initio bug, either in the bloom contrib module or in the
core index AM. I see it as recently as 371b572, which is as new as I
have tested.
The problem is that an update which should update exactly one row
instead updates zero rows.
It takes 5 to 16 hours to reproduce when run with 8 clients on 8 cores.
I suspect it is some kind of race condition, and testing with more
clients on more cores would probably make it happen faster. Injecting
crash/recovery cycles into the system seems to make it happen sooner,
but crash/recovery cycles are not necessary.
If you use the attached do_nocrash.sh script, the failure shows up as
a message like:
child abnormal exit update did not update 1 row: key 6838 updated 0E0
at count.pl line 189.\n at count.pl line 197.
(I've added code so that once this is detected, the script terminates
soon afterward.)
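For reference, the check that produces that message boils down to "did
the update touch exactly one row". A rough SQL equivalent (just a
sketch, not the actual Perl in count.pl, using the foo/bloom/count
names that appear later in this mail) would be:

do $$
declare
    n bigint;
begin
    update foo set count = count + 1 where bloom = md5('6838');
    get diagnostics n = row_count;
    if n <> 1 then
        raise exception 'update did not update 1 row: key 6838 updated %', n;
    end if;
end
$$;

(The '0E0' in the message above is just DBI's way of reporting that
zero rows were affected.)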
If you want to run do_nocrash.sh, change the first few lines to
hardcode the correct path for the binaries and the temp data directory
(which will be ruthlessly deleted). It will run on an unpatched
server, since crash injection is turned off.
If you want to make it fork more clients, change the 8 in 'perl
count.pl 8 0|| on_error;'
I have preserved a large tarball (215MB) of a corrupt data directory.
It was produced by the a928484 build with e13ac5586 cherry-picked,
and is at https://drive.google.com/open?id=0Bzqrh1SO9FcEci1FQTkwZW9ZU1U.
Alternatively, I can dig into it myself if you can tell me what to
look for (pageinspect doesn't offer much for Bloom).
With that tarball, the first query below, which uses the index, returns
nothing, while the second, which forces a seq scan, returns a row:
select * from foo where bloom = md5('6838');
select * from foo where bloom||'' = md5('6838');
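If you want to double-check which plan each one gets (the plan shapes
below are what I would expect, not output captured from the tarball),
explain them:

explain (analyze, buffers) select * from foo where bloom = md5('6838');
-- expected: Bitmap Heap Scan with a Bitmap Index Scan on the bloom index, returning no rows
explain (analyze, buffers) select * from foo where bloom||'' = md5('6838');
-- expected: Seq Scan (the ||'' defeats the index), returning one row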
The machinery posted here is probably much more elaborate than
necessary to detect the problem. You could probably detect it with
pgbench -N, except that pgbench doesn't check the results to make sure
the expected number of rows were actually selected/updated.
Cheers,
Jeff
Attachment: do_nocrash.sh (application/x-sh)
echo $HOSTNAME
TMPDATA=/tmp/data2
INST=/home/jjanes/pgsql/torn_bisect/
JJXID='--JJ_xid=40'
JJTORN='--JJ_torn_page=1000'
JJTORNOFF='--JJ_torn_page=0'
### turn off crashing and xid fast forward.
### use the empty string, not setting them to zero, so it can run on an unpatched server.
JJXID=''
JJTORN=''
JJTORNOFF=''
on_exit() {
echo "Cleaning up"
$INST/bin/pg_ctl -D $TMPDATA stop -m immediate -w
## don't clean up on normal exit, either, as cleaning up by hand is easy
## and it is hard to know that all failure modes go through the other one.
# rm -r $TMPDATA
exit 0
}
on_error() {
echo "Exiting on error"
$INST/bin/pg_ctl -D $TMPDATA stop -m immediate -w
# don't clean up, to preserve for forensic analysis.
# rm -r $TMPDATA/
exit 1
};
trap 'on_exit' USR2;
trap 'on_exit' INT;
$INST/bin/pg_ctl -D $TMPDATA stop -m immediate -w
rm -r $TMPDATA
$INST/bin/initdb -k -D $TMPDATA || exit
cat <<END >> $TMPDATA/postgresql.conf
## Jeff's changes to config for use with recovery stress testing
## sometimes it doesn't want to use bloom, so force it
enable_seqscan=off
## can't do these two when testing 9.3
max_replication_slots=3
wal_level = logical
wal_keep_segments=20 ## preserve some evidence
## Crashes are driven by checkpoints, so we want to do them often
#checkpoint_segments = 1
max_wal_size = 48MB
min_wal_size = 32MB
checkpoint_timeout = 30s
checkpoint_warning = 0
#archive_mode = on
## There is a known race condition that sometimes causes auto restart to fail when archiving is on.
## that is annoying, so turn it off unless we specifically want to test the on condition.
archive_mode = off
archive_command = 'echo archive_command %p %f `date`' # Don't actually archive, just make pgsql think we are
archive_timeout = 30
track_io_timing=on
autovacuum_naptime = 10s
## if updates are not HOT, the table/index can easily bloat faster than default throttled autovac can possibly
## cope. So crank it up.
autovacuum_vacuum_cost_delay = 2ms
log_line_prefix = '%p %i %e %m:'
restart_after_crash = on
## Since we crash the PG software, not the OS, fsync does not matter as the surviving OS is obligated to provide a
## consistent view of the written-but-not-fsynced data even after PG restarts. Turning it off gives more
## testing per unit of time.
fsync=off
shared_preload_libraries = 'pg_stat_statements'
wal_compression=1
track_commit_timestamp=1
END
## The below settings create too much noise to leave them on all the time.
## But once a problem is found, these are some of the first things to try
## for diagnosing it
cat <<END >> $TMPDATA/postgresql.conf
#log_error_verbosity = verbose
#log_checkpoints = on
#log_autovacuum_min_duration=-1
#JJ_vac=1
END
$INST/bin/pg_ctl -D $TMPDATA start -w || exit
$INST/bin/createdb
$INST/bin/psql -c 'select version()'
$INST/bin/pg_config
$INST/bin/psql -c 'create extension pageinspect'
$INST/bin/psql -c 'create extension pgstattuple'
$INST/bin/psql -c 'create extension pg_stat_statements'
$INST/bin/psql -c 'create extension pg_buffercache'
$INST/bin/psql -c "create extension pg_freespacemap"
$INST/bin/psql -c "create extension btree_gist"
$INST/bin/psql -c "create extension btree_gin"
## run the initial load now, before JJ_torn_page is turned on,
## or else we crash before the table is even initialized, due to the WAL generated by the GIN or GiST index build.
perl count.pl 8 0|| on_error;
### Occasionally useful monitoring queries
#while (true) ; do psql -c "explain (analyze,buffers) update foo set count=count+0 where text_array @> '{679}'; set enable_seqscan=off; explain (analyze,buffers) update foo set count=count+0 where text_array @> '{679}';"; sleep 1; done &
#while (true) ; do psql -c "explain (analyze,buffers) update foo set count=count+0 where text_array @> '{679}'; "; sleep 1; done &
# psql -c "explain (analyze,buffers) update foo set count=count+0 where text_array @@ to_tsquery('{'||md5(984::text)||'}');"
#while (true) ; do psql -c "\dit+ "; sleep 5; done &
for g in `seq 1 1000` ; do
$INST/bin/pg_ctl -D $TMPDATA restart -o "--ignore_checksum_failure=0 $JJTORN $JJXID" -w
echo JJ starting loop $g;
for f in `seq 1 100`; do
#$INST/bin/psql -c 'SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;';
## on_error is needed to preserve database for inspection. Otherwise autovac will destroy evidence.
perl count.pl 8 || on_error;
done;
echo JJ ending loop $g;
## give autovac a chance to run to completion
# need to disable crashing, as sometimes the vacuum itself triggers the crash
$INST/bin/pg_ctl -D $TMPDATA restart -o "--ignore_checksum_failure=0 $JJTORNOFF $JJXID" -w || (sleep 5; \
$INST/bin/pg_ctl -D $TMPDATA restart -o "--ignore_checksum_failure=0 $JJTORNOFF $JJXID" -w || on_error;)
## trying to get autovac to work in the face of consistent crashing
## is just too hard, so do manual vacs unless autovac is specifically
## what you are testing.
#$INST/bin/vacuumdb -a -F || on_error;
## or sleep a few times in the hope autovac can get it done, if you want to test that.
#$INST/bin/psql -c 'select pg_sleep(120)' || (sleep 5; $INST/bin/psql -c 'select pg_sleep(120)') || (sleep 5; $INST/bin/psql -c 'select pg_sleep(120)') ## give autovac a chance to do its thing
echo JJ ending sleep after loop $g;
done;
on_exit
Jeff Janes <jeff.janes@gmail.com> writes:
I am getting corrupted Bloom indexes in which a tuple in the table
heap is not in the index.
Hmm. I can trivially reproduce a problem, but I'm not entirely sure
whether it matches yours. Same basic test case as the bloom regression
test:
regression=# CREATE TABLE tst (
i int4,
t text
);
CREATE TABLE
regression=# CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);
CREATE INDEX
regression=# INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
INSERT 0 2000
regression=# vacuum verbose tst;
...
INFO: index "bloomidx" now contains 2000 row versions in 5 pages
...
regression=# delete from tst;
DELETE 2000
regression=# vacuum verbose tst;
...
INFO: index "bloomidx" now contains 0 row versions in 5 pages
DETAIL: 2000 index row versions were removed.
...
regression=# INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
INSERT 0 2000
regression=# vacuum verbose tst;
...
INFO: index "bloomidx" now contains 1490 row versions in 5 pages
...
Ooops. (Note: this is done with some fixes already in place to make
blvacuum.c return correct tuple counts for VACUUM VERBOSE; right now
it tends to double-count during a VACUUM.)
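A quick way to see the heap/index disagreement directly (illustrative
only, not part of the session above) is to compare a count forced
through the bloom index with one forced through a seq scan; on a
healthy index they agree:

set enable_seqscan = off;
select count(*) from tst where i = 7;   -- via the bloom bitmap scan
set enable_seqscan = on;
set enable_bitmapscan = off;
select count(*) from tst where i = 7;   -- via a seq scan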
The problem seems to be that (1) blvacuum marks all the index pages as
BLOOM_DELETED, but doesn't bother to clear the notFullPage array on the
metapage; (2) blinsert uses a page from notFullPage, failing to notice
that it's inserting data into a BLOOM_DELETED page; (3) once we ask the
FSM for a page, it returns a BLOOM_DELETED page that we've already put
tuples into, which we happily blow away by reinit'ing the page.
A race-condition variant of this could be that after an autovacuum
has marked a page BLOOM_DELETED, but before it's reached the point of
updating the metapage, blinsert could stick data into the deleted page.
That would make it possible to reach the problem without requiring the
extreme edge case that blvacuum finds no partly-full pages to put into the
metapage. If this does explain your problem, it's probably that variant.
Will push a fix in a bit.
regards, tom lane
On Sat, Aug 13, 2016 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
I am getting corrupted Bloom indexes in which a tuple in the table
heap is not in the index.
....
Will push a fix in a bit.
After 36 hours of successful running on two different machines (one
with crash injection turned on, one without), I am pretty confident
that your fix is working.
Thanks,
Jeff