Postgres crashes in memmove() after upgrade to PG 13.
Hi,
In one of the environments, PostgreSQL 12 was upgraded to PostgreSQL 13.1
using pg_upgrade with hard links. The OS was Ubuntu 16.04.7 LTS (Xenial
Xerus). pg_repack was used to rebuild all the tables across the database
right after the upgrade to PG 13.
A new server with Ubuntu 20.04.1 LTS was later provisioned. Streaming
replication was set up from the old server running Ubuntu 16 to the new
server on Ubuntu 20, both on the same PostgreSQL version, 13.1.
Replication was running fine, but after failover to the new server, an
UPDATE of a few random rows (not on the same page) caused a segmentation
fault and crashed Postgres.
Selecting the records using the index or directly from the table works
absolutely fine. But when the same records are updated, we get the
following error.
2021-03-12 17:20:01.979 CET p#7 s#604b8fa9.7 t#0 LOG: terminating any
other active server processes
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 WARNING: terminating
connection because of crash of another server process
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 DETAIL: The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 HINT: In a moment you
should be able to reconnect to the database and repeat your command.
The gdb backtrace, with debug symbols installed, looks like the following.
(gdb) bt
#0 __memmove_avx_unaligned_erms () at
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:533
#1 0x000055b72761c370 in memmove (__len=<optimized out>,
__src=0x55b72930e9c7, __dest=<optimized out>)
at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:40
#2 _bt_swap_posting (newitem=newitem@entry=0x55b7292010c0,
oposting=oposting@entry=0x7f3b46f94778,
postingoff=postingoff@entry=2) at
./build/../src/backend/access/nbtree/nbtdedup.c:796
#3 0x000055b72761d40b in _bt_insertonpg (rel=0x7f3acd8a49c0,
itup_key=0x55b7292bc6a8, buf=507, cbuf=0, stack=0x55b7292d5f98,
itup=0x55b7292010c0, itemsz=32, newitemoff=48, postingoff=2,
split_only_page=false)
at ./build/../src/backend/access/nbtree/nbtinsert.c:1167
#4 0x000055b72761eae9 in _bt_doinsert (rel=rel@entry=0x7f3acd8a49c0,
itup=itup@entry=0x55b7292bc848,
checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, heapRel=heapRel@entry
=0x7f3acd894f70)
at ./build/../src/backend/access/nbtree/nbtinsert.c:1009
#5 0x000055b727621e2e in btinsert (rel=0x7f3acd8a49c0, values=<optimized
out>, isnull=<optimized out>, ht_ctid=0x55b7292d4578,
heapRel=0x7f3acd894f70, checkUnique=UNIQUE_CHECK_NO,
indexInfo=0x55b7292bc238)
at ./build/../src/backend/access/nbtree/nbtree.c:210
#6 0x000055b727757487 in ExecInsertIndexTuples
(slot=slot@entry=0x55b7292d4548,
estate=estate@entry=0x55b7291ff1f8,
noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0,
arbiterIndexes=arbiterIndexes@entry=0x0)
at ./build/../src/backend/executor/execIndexing.c:393
#7 0x000055b7277807a8 in ExecUpdate (mtstate=0x55b7292bb2c8,
tupleid=0x7fff45ea318a, oldtuple=0x0, slot=0x55b7292d4548,
planSlot=0x55b7292c04e8, epqstate=0x55b7292bb3c0,
estate=0x55b7291ff1f8, canSetTag=true)
at ./build/../src/backend/executor/nodeModifyTable.c:1479
#8 0x000055b727781655 in ExecModifyTable (pstate=0x55b7292bb2c8) at
./build/../src/backend/executor/nodeModifyTable.c:2253
#9 0x000055b727758424 in ExecProcNode (node=0x55b7292bb2c8) at
./build/../src/include/executor/executor.h:248
#10 ExecutePlan (execute_once=<optimized out>, dest=0x55b7292c1728,
direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_UPDATE,
use_parallel_mode=<optimized out>, planstate=0x55b7292bb2c8,
estate=0x55b7291ff1f8) at
./build/../src/backend/executor/execMain.c:1632
#11 standard_ExecutorRun (queryDesc=0x55b7292ba578, direction=<optimized
out>, count=0, execute_once=<optimized out>)
at ./build/../src/backend/executor/execMain.c:350
#12 0x000055b7278bebf7 in ProcessQuery (plan=<optimized out>,
sourceText=0x55b72919efa8 "\031)\267U", params=0x0, queryEnv=0x0,
dest=0x55b7292c1728, qc=0x7fff45ea34c0) at
./build/../src/backend/tcop/pquery.c:160
#13 0x000055b7278bedf9 in PortalRunMulti (portal=portal@entry=0x55b729254128,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55b7292c1728,
altdest=altdest@entry=0x55b7292c1728,
qc=qc@entry=0x7fff45ea34c0) at ./build/../src/backend/tcop/pquery.c:1265
#14 0x000055b7278bf847 in PortalRun (portal=portal@entry=0x55b729254128,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55b7292c1728,
--Type <RET> for more, q to quit, c to continue without paging--
Is this expected when replication happens between PostgreSQL databases
hosted on different OS versions like Ubuntu 16 and Ubuntu 20? Or do we
think this is some sort of corruption?
--
Regards,
Avi.
On Mon, Mar 15, 2021 at 1:29 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Is this expected when replication is happening between PostgreSQL databases hosted on different OS versions like Ubuntu 16 and Ubuntu 20 ? Or, do we think this is some sort of corruption ?
Is this index on a text datatype, and using a collation other than "C"?
https://wiki.postgresql.org/wiki/Locale_data_changes
Not that I expect it to crash if that's the cause, I thought it'd just
get confused.
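For what it's worth, a rough catalog query along these lines should list
the btree indexes whose first key column uses a collation other than "C"
or "POSIX" -- just a sketch, so adjust the filters as needed:
-- Sketch: btree indexes whose first key column carries a collation other
-- than "C"/"POSIX"; these are the ones glibc collation changes can affect.
SELECT n.nspname, c.relname AS index_name, coll.collname AS collation
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_collation coll ON coll.oid = i.indcollation[0]
WHERE c.relam = (SELECT oid FROM pg_am WHERE amname = 'btree')
  AND coll.collname NOT IN ('C', 'POSIX');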
Hi Thomas,
On Sun, Mar 14, 2021 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Mar 15, 2021 at 1:29 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Is this expected when replication is happening between PostgreSQL
databases hosted on different OS versions like Ubuntu 16 and Ubuntu 20 ?
Or, do we think this is some sort of corruption ?
Is this index on a text datatype, and using a collation other than "C"?
It's en_US.UTF-8.
https://wiki.postgresql.org/wiki/Locale_data_changes
Not that I expect it to crash if that's the cause, I thought it'd just
get confused.
On the Ubuntu 16 server:
$ ldd --version
ldd (Ubuntu GLIBC 2.23-0ubuntu11.2) 2.23
On the new Ubuntu 20 server:
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
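For reference, a rough way to confirm the database-level collation from
the catalogs (a sketch, nothing more):
-- Sketch: database-level collation and ctype currently in use.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();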
--
Regards,
Avi.
Hi Thomas,
On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar <avinash.vallarapu@gmail.com>
wrote:
Hi Thomas,
On Sun, Mar 14, 2021 at 9:40 PM Thomas Munro <thomas.munro@gmail.com>
wrote:
On Mon, Mar 15, 2021 at 1:29 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Is this expected when replication is happening between PostgreSQL
databases hosted on different OS versions like Ubuntu 16 and Ubuntu 20 ?
Or, do we think this is some sort of corruption ?
Is this index on a text datatype, and using a collation other than "C"?
It's en_US.UTF-8.
Also, the datatype is bigint.
https://wiki.postgresql.org/wiki/Locale_data_changes
Not that I expect it to crash if that's the cause, I thought it'd just
get confused.
On the Ubuntu 16 server:
$ ldd --version
ldd (Ubuntu GLIBC 2.23-0ubuntu11.2) 2.23
On the new Ubuntu 20 server:
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
--
Regards,
Avi.
--
Regards,
Avinash Vallarapu
+1-902-221-5976
[Dropping pgsql-general@ from the CC, because cross-posting triggers
moderation; sorry I didn't notice that on my first reply]
On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar <avinash.vallarapu@gmail.com> wrote:
Also the datatype is bigint
Ok. Collation changes are the most common cause of index problems
when upgrading OSes, but here we can rule that out if your index is on
bigint. So it seems like this is some other kind of corruption in
your database, or a bug in the deduplication code.
Hi,
On Sun, Mar 14, 2021 at 10:17 PM Thomas Munro <thomas.munro@gmail.com>
wrote:
[Dropping pgsql-general@ from the CC, because cross-posting triggers
moderation; sorry I didn't notice that on my first reply]
On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Also the datatype is bigint
Ok. Collation changes are the most common cause of index problems
when upgrading OSes, but here we can rule that out if your index is on
bigint. So it seems like this is some other kind of corruption in
your database, or a bug in the deduplication code.
I suspect the same.
When I ran pg_filedump to look for the entry for this ID in the index, it
was strange that the entry did not exist in the index, yet a SELECT using
an index-only scan was still working okay. I chose the start and end pages
carefully, so there should not be any mistake there.
The following may help explain what I mean.
I have renamed the table and index names before adding them here.
=# select pg_size_pretty(pg_relation_size('idx_id_mtime')) as size,
relpages from pg_class where relname = 'idx_id_mtime';
size | relpages
-------+----------
71 MB | 8439
=# select pg_relation_filepath('idx_id_mtime');
pg_relation_filepath
----------------------
base/16404/346644309
=# \d+ idx_id_mtime
Index "public.idx_id_mtime"
    Column    |           Type           | Key? |  Definition  | Storage | Stats target
--------------+--------------------------+------+--------------+---------+--------------
 sometable_id | bigint                   | yes  | sometable_id | plain   |
 mtime        | timestamp with time zone | yes  | mtime        | plain   |
btree, for table "public.sometable"
$ pg_filedump -R 1 8439 -D bigint,timestamp
/flash/berta13/base/16404/346644309 > 12345.txt
$ cat 12345.txt | grep -w 70334
--> No Output.
We don't see an entry for ID 70334 in the pg_filedump output.
But the SELECT statement is still using the same index:
=# EXPLAIN select * from sometable where sometable_id = 70334;
QUERY PLAN
--------------------------------------------------------------------------------
Index Scan using idx_id_mtime on sometable (cost=0.43..2.45 rows=1
width=869)
Index Cond: (sometable_id = 70334)
(2 rows)
=# EXPLAIN ANALYZE select * from sometable where sometable_id = 70334;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id_mtime on sometable (cost=0.43..2.45 rows=1
width=869) (actual time=0.166..0.168 rows=1 loops=1)
Index Cond: (sometable_id = 70334)
Planning Time: 0.154 ms
Execution Time: 0.195 ms
(4 rows)
=# update sometable set sometable_id = 70334 where sometable_id = 70334;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
Now, let us look at the next ID. Here, the entry is visible in the
pg_filedump output.
$ cat 12345.txt | grep -w 10819
COPY: 10819 2018-03-21 15:16:41.202277
The update still fails with the same error.
=# update sometable set sometable_id = 10819 where sometable_id = 10819;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
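As a cross-check from inside the database, something along these lines
with the pageinspect contrib module could decode the same index pages --
only a sketch, assuming pageinspect is available; block number 1 below is
just a placeholder for the page of interest:
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Index metadata: root block, number of levels, and so on.
SELECT * FROM bt_metap('idx_id_mtime');
-- Decode the tuples on one leaf page of the index.
SELECT itemoffset, ctid, itemlen, htid, data
FROM bt_page_items('idx_id_mtime', 1);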
On Sun, Mar 14, 2021 at 6:54 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Following may be helpful to understand what I meant.
I have renamed the table and index names before adding it here.
It should be possible to run amcheck on your database, which will
detect corrupt posting list tuples on Postgres 13. It's a contrib
extension, so you must first run "CREATE EXTENSION amcheck;". From
there, you can run a query like the following (you may want to
customize this):
SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;
If this query takes too long to complete, you may find it useful to add
something to limit the indexes checked, such as: AND n.nspname =
'public' -- that change to the SQL will make the query test only
indexes from the public schema.
Do "SET client_min_messages = DEBUG1" to get a kind of rudimentary
progress indicator, if that seems useful to you.
The docs have further information on what this bt_index_parent_check
function does, should you need it:
https://www.postgresql.org/docs/13/amcheck.html
--
Peter Geoghegan
Hi,
On Sun, Mar 14, 2021 at 11:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Mar 14, 2021 at 6:54 PM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Following may be helpful to understand what I meant.
I have renamed the table and index names before adding it here.
It should be possible to run amcheck on your database, which will
detect corrupt posting list tuples on Postgres 13. It's a contrib
extension, so you must first run "CREATE EXTENSION amcheck;". From
there, you can run a query like the following (you may want to
customize this):
SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;
If this query takes too long to complete you may find it useful to add
something to limit the indexes check, such as: AND n.nspname =
'public' -- that change to the SQL will make the query just test
indexes from the public schema.
Do "SET client_min_messages=DEBUG1" to get a kind of rudimentary
progress indicator, if that seems useful to you.
I see that there are 26 indexes for which there are anywhere from 100 to
thousands of entries similar to the following. All are of course btree indexes.
psql:amchecksql.sql:17: DEBUG: leaf block 1043751 of index "idx_id_mtime"
has no first data item
And one error as follows.
psql:amchecksql.sql:17: ERROR: down-link lower bound invariant violated
for index "some_other_index"
The docs have further information on what this bt_index_parent_check
function does, should you need it:
https://www.postgresql.org/docs/13/amcheck.html
--
Peter Geoghegan
--
Regards,
Avi.
On Mon, Mar 15, 2021 at 6:56 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
psql:amchecksql.sql:17: DEBUG: leaf block 1043751 of index "idx_id_mtime" has no first data item
That one is harmless.
And one error as follows.
psql:amchecksql.sql:17: ERROR: down-link lower bound invariant violated for index "some_other_index"
That indicates corruption. Can you tie this back to the crash? Is it
the same index?
--
Peter Geoghegan
Hi,
On Mon, Mar 15, 2021 at 1:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 15, 2021 at 6:56 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
psql:amchecksql.sql:17: DEBUG: leaf block 1043751 of index
"idx_id_mtime" has no first data item
That one is harmless.
And one error as follows.
psql:amchecksql.sql:17: ERROR: down-link lower bound invariant violated
for index "some_other_index"
That indicates corruption. Can you tie this back to the crash? Is it
the same index?
No, that's not the same index. The Index discussed in the previous
messages shows the following output.
DEBUG: verifying consistency of tree structure for index "idx_id_mtime"
with cross-level checks
DEBUG: verifying level 2 (true root level)
DEBUG: verifying level 1
DEBUG: verifying level 0 (leaf level)
DEBUG: verifying that tuples from index "idx_id_mtime" are present in
"player"
DEBUG: finished verifying presence of 1966412 tuples from table "player"
with bitset 29.89% set
LOG: duration: 3341.755 ms statement: SELECT bt_index_parent_check(index
=> c.oid, heapallindexed => true), c.relname, c.relpages FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod
= am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON
c.relnamespace = n.oid WHERE am.amname = 'btree' AND c.relpersistence !=
't' AND c.relkind = 'i' AND i.indisready AND i.indisvalid AND indexrelid =
80774 AND n.nspname = 'public' ORDER BY c.relpages DESC;
 bt_index_parent_check |   relname    | relpages
------------------------+--------------+----------
                        | idx_id_mtime |     8439
(1 row)
--
Peter Geoghegan
--
Regards,
Avi.
On Mon, Mar 15, 2021 at 3:21 PM Avinash Kumar <avinash.vallarapu@gmail.com>
wrote:
[...]
That indicates corruption. Can you tie this back to the crash? Is it
the same index?
No, that's not the same index. The index discussed in the previous
messages shows the following output. [...]
I am afraid it looks to me like a deduplication bug, but I am not sure how
this can be pinpointed. If there is something I could do to determine
that, I would be more than happy to.
--
Regards,
Avi
On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
I am afraid that it looks to me like a deduplication bug but not sure how this can be pin-pointed. If there is something I could do to determine that, I would be more than happy.
That cannot be ruled out, but I don't consider it to be the most
likely explanation. The index in question passes amcheck verification,
which includes verification of the posting list tuple structure, and
even includes making sure the index has an entry for each row from the
table. It's highly unlikely that it is corrupt, and it's hard to see
how you get from a non-corrupt index to the segfault. At the same time
we see that some other index is corrupt -- it fails amcheck due to a
cross-level inconsistency, which is very unlikely to be related to
deduplication in any way. It's hard to believe that the problem is
squarely with _bt_swap_posting().
Did you actually run amcheck on the failed-over server, not the original server?
Note that you can disable deduplication selectively -- perhaps doing
so will make it possible to isolate the issue. Something like this
should do it (you need to reindex here to actually change the on-disk
representation to not have any posting list tuples from
deduplication):
alter index idx_id_mtime set (deduplicate_items = off);
reindex index idx_id_mtime;
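Afterwards, a quick way to confirm that the reloption stuck (just a
sketch) would be:
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'idx_id_mtime';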
--
Peter Geoghegan
Hi,
On Tue, Mar 16, 2021 at 1:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
I am afraid that it looks to me like a deduplication bug but not sure
how this can be pin-pointed. If there is something I could do to determine
that, I would be more than happy.
That cannot be ruled out, but I don't consider it to be the most
likely explanation. The index in question passes amcheck verification,
which includes verification of the posting list tuple structure, and
even includes making sure the index has an entry for each row from the
table. It's highly unlikely that it is corrupt, and it's hard to see
how you get from a non-corrupt index to the segfault. At the same time
we see that some other index is corrupt -- it fails amcheck due to a
cross-level inconsistency, which is very unlikely to be related to
deduplication in any way. It's hard to believe that the problem is
squarely with _bt_swap_posting().
Did you actually run amcheck on the failed-over server, not the original
server?
Yes, it was on the failed-over server where the issue is currently seen. I
took a snapshot of the data directory so that the issue can be analyzed.
Note that you can disable deduplication selectively -- perhaps doing
so will make it possible to isolate the issue. Something like this
should do it (you need to reindex here to actually change the on-disk
representation to not have any posting list tuples from
deduplication):
alter index idx_id_mtime set (deduplicate_items = off);
reindex index idx_id_mtime;
I can do this. But to add here, when we do a pg_repack or rebuild of the
indexes, this resolves itself automatically. I am just not sure whether we
would hit the same issue again.
--
Peter Geoghegan
--
Regards,
Avi.
On Tue, Mar 16, 2021 at 9:50 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
Yes, it was on the failover-over server where the issue is currently seen. Took a snapshot of the data directory so that the issue can be analyzed.
I would be very cautious when using LVM snapshots with a Postgres data
directory, or VM-based snapshotting tools. There are many things that
can go wrong with these tools, which are usually not sensitive to the
very specific requirements of a database system like Postgres (e.g.
inconsistencies between WAL and data files can emerge in many
scenarios).
My general recommendation is to avoid these tools completely --
consistently use a backup solution like pgBackRest instead.
BTW, running pg_repack is something that creates additional risk of
database corruption, at least to some degree. That seems less likely
to have been the problem here (I think that it's probably something
with snapshots). Something to consider.
I can do this. But, to add here, when we do a pg_repack or rebuild of Indexes, automatically this is resolved.
Your bug report was useful to me, because it made me realize that the
posting list split code in _bt_swap_posting() is unnecessarily
trusting of the on-disk data -- especially compared to _bt_split(),
the page split code. While I consider it unlikely that the problem
that you see is truly a bug in Postgres, it is still true that the
crash that you saw should probably have just been an error.
We don't promise that the database cannot crash even with corrupt
data, but we do try to avoid it whenever possible. I may be able to
harden _bt_swap_posting(), to make failures like this a little more
friendly. It's an infrequently hit code path, so we can easily afford
to make the code more careful/less trusting.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
... It's hard to believe that the problem is
squarely with _bt_swap_posting().
IIUC, the problem is seen on a replica server but not the primary?
In that case, my thoughts would run towards a bug in WAL log creation
or replay, causing the index contents to be different/wrong on the
replica.
regards, tom lane
On Tue, Mar 16, 2021 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
... It's hard to believe that the problem is
squarely with _bt_swap_posting().
IIUC, the problem is seen on a replica server but not the primary?
In that case, my thoughts would run towards a bug in WAL log creation
or replay, causing the index contents to be different/wrong on the
replica.
Right, this was observed on the replica server after it got promoted.
The replica is on the same Postgres minor version, 13.1, but the OS is
Ubuntu 16 on the primary and Ubuntu 20 on the replica (which got promoted).
The replica was set up using a backup taken with pg_basebackup.
I can share any detail that would help here.
regards, tom lane
--
Regards,
Avi
On Tue, Mar 16, 2021 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
... It's hard to believe that the problem is
squarely with _bt_swap_posting().
IIUC, the problem is seen on a replica server but not the primary?
In that case, my thoughts would run towards a bug in WAL log creation
or replay, causing the index contents to be different/wrong on the
replica.
My remarks were intended to include problems during recovery
(_bt_swap_posting() is run inside REDO routines). Though I did
consider recovery specifically when thinking through the problem.
My assessment is that the index is highly unlikely to be corrupt
(whether it happened during recovery or at some other time), because
it passes validation by bt_index_parent_check(), with the optional
heapallindexed index-matches-table verification option enabled. This
includes exhaustive verification of posting list tuple invariants.
Anything is possible, but I find it easier to believe that the issue
is somewhere else -- we see the problem in _bt_swap_posting() because
it happens to go further than other code in trusting that the tuple
isn't corrupt (which it shouldn't). Another unrelated index *was*
reported corrupt by amcheck, though the error in question does not
suggest an issue with deduplication.
--
Peter Geoghegan
On Tue, Mar 16, 2021 at 11:20 AM Avinash Kumar
<avinash.vallarapu@gmail.com> wrote:
I can share any detail that would help here.
I would like to know what you see when you run a slightly modified
version of the same amcheck query. The same query as before, but with
the call to bt_index_parent_check() replaced with a call to
bt_index_check(). Can you do that, please?
This is what I mean:
SELECT bt_index_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;
The error that you reported was a cross-level invariant violation,
from one of the tests that bt_index_parent_check() performs but
bt_index_check() does not perform (the former performs checks that are
a superset of the latter). It's possible that we'll get a more
interesting error message from bt_index_check() here, because it might
go on for a bit longer -- it might conceivably reach a corrupt posting
list tuple on the leaf level, and report it as such.
Of course we don't see any corruption in the index that you had the
crash with at all, but it can't hurt to do this as well -- just in
case the issue is transient or something.
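If you want to target just that one index rather than run the whole
query, a direct call along these lines should also work (a sketch):
SELECT bt_index_check(index => 'idx_id_mtime'::regclass,
                      heapallindexed => true);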
Thanks
--
Peter Geoghegan