[CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

Started by Neha Sharma · about 5 years ago · 34 messages · pgsql-hackers
#1 Neha Sharma
neha.sharma@enterprisedb.com

Hello,

While executing the test case below, the server crashed with segfault 11 on
the master branch. I have enabled CLOBBER_CACHE_ALWAYS in
src/include/pg_config_manual.h.

The issue reproduces only on the master branch.

*Test Case:*
CREATE TABLE sm_5_323_table (col1 numeric);
CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);

CLUSTER sm_5_323_table USING sm_5_323_idx;

\! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres

*Test case output:*
edb@edb:~/PGClobber_build/postgresql/inst/bin$ ./psql postgres
psql (14devel)
Type "help" for help.

postgres=# CREATE TABLE sm_5_323_table (col1 numeric);
CREATE TABLE
postgres=# CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CREATE INDEX
postgres=# CLUSTER sm_5_323_table USING sm_5_323_idx;
CLUSTER
postgres=# \! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres
clusterdb: error: clustering of table "sm_5_323_table" in database "postgres" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

*Stack Trace:*
Core was generated by `postgres: edb postgres 127.0.0.1(50978) CLUSTER '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
485 if (reln->md_num_open_segs[forknum] > 0)
(gdb) bt
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
#1 0x000055e5c85eb2f0 in mdnblocks (reln=0x0, forknum=MAIN_FORKNUM) at md.c:768
#2 0x000055e5c85eb61b in mdimmedsync (reln=0x0, forknum=forknum@entry=MAIN_FORKNUM) at md.c:930
#3 0x000055e5c85ec6e5 in smgrimmedsync (reln=<optimized out>, forknum=forknum@entry=MAIN_FORKNUM) at smgr.c:662
#4 0x000055e5c81ae28b in end_heap_rewrite (state=state@entry=0x55e5ca5d1d70) at rewriteheap.c:342
#5 0x000055e5c81a32ea in heapam_relation_copy_for_cluster (OldHeap=0x7f212ce41ba0, NewHeap=0x7f212ce41058, OldIndex=<optimized out>, use_sort=<optimized out>, OldestXmin=<optimized out>,
    xid_cutoff=<optimized out>, multi_cutoff=0x7ffcba6ebe64, num_tuples=0x7ffcba6ebe68, tups_vacuumed=0x7ffcba6ebe70, tups_recently_dead=0x7ffcba6ebe78) at heapam_handler.c:984
#6 0x000055e5c82f218a in table_relation_copy_for_cluster (tups_recently_dead=0x7ffcba6ebe78, tups_vacuumed=0x7ffcba6ebe70, num_tuples=0x7ffcba6ebe68, multi_cutoff=0x7ffcba6ebe64,
    xid_cutoff=0x7ffcba6ebe60, OldestXmin=<optimized out>, use_sort=<optimized out>, OldIndex=0x7f212ce40670, NewTable=0x7f212ce41058, OldTable=0x7f212ce41ba0)
    at ../../../src/include/access/tableam.h:1656
#7 copy_table_data (pCutoffMulti=<synthetic pointer>, pFreezeXid=<synthetic pointer>, pSwapToastByContent=<synthetic pointer>, verbose=<optimized out>, OIDOldIndex=<optimized out>,
    OIDOldHeap=16384, OIDNewHeap=<optimized out>) at cluster.c:908
#8 rebuild_relation (verbose=<optimized out>, indexOid=<optimized out>, OldHeap=<optimized out>) at cluster.c:604
#9 cluster_rel (tableOid=<optimized out>, indexOid=<optimized out>, params=<optimized out>) at cluster.c:427
#10 0x000055e5c82f2b7f in cluster (pstate=pstate@entry=0x55e5ca5315c0, stmt=stmt@entry=0x55e5ca510368, isTopLevel=isTopLevel@entry=true) at cluster.c:195
#11 0x000055e5c85fcbc6 in standard_ProcessUtility (pstmt=0x55e5ca510430, queryString=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:822
#12 0x000055e5c85fd436 in ProcessUtility (pstmt=pstmt@entry=0x55e5ca510430, queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=dest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:525
#13 0x000055e5c85f6148 in PortalRunUtility (portal=portal@entry=0x55e5ca570d70, pstmt=pstmt@entry=0x55e5ca510430, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1159
#14 0x000055e5c85f71a4 in PortalRunMulti (portal=portal@entry=0x55e5ca570d70, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1305
#15 0x000055e5c85f8823 in PortalRun (portal=portal@entry=0x55e5ca570d70, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
    dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at pquery.c:779
#16 0x000055e5c85f389e in exec_simple_query (query_string=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;") at postgres.c:1185
#17 0x000055e5c85f51cf in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffcba6ec670, dbname=<optimized out>, username=<optimized out>) at postgres.c:4415
#18 0x000055e5c8522240 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4470
#19 BackendStartup (port=<optimized out>) at postmaster.c:4192
#20 ServerLoop () at postmaster.c:1737
#21 0x000055e5c85237ec in PostmasterMain (argc=<optimized out>, argv=0x55e5ca508fe0) at postmaster.c:1409
#22 0x000055e5c811a2cf in main (argc=3, argv=0x55e5ca508fe0) at main.c:209

Thanks.
--
Regards,
Neha Sharma

#2 Amul Sul
sulamul@gmail.com
In reply to: Neha Sharma (#1)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called on the next line, which triggers a system cache invalidation; under
the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
restored for the subsequent operations, causing the segmentation fault.

Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure; the attached patch does exactly that.
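
In essence the change is just this (a sketch of the idea; the attached patch
has the exact hunk):

+	/* Ensure rd_smgr is open; a cache flush may have closed it. */
+	RelationOpenSmgr(state->rs_new_rel);
+
 	if (RelationNeedsWAL(state->rs_new_rel))
 		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);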

Regards,
Amul


Attachments:

fix_failure_for_cca.patch (application/x-patch, +3/-0)
#3 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amul Sul (#2)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:

In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called on the next line, which triggers a system cache invalidation; under
the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
restored for the subsequent operations, causing the segmentation fault.

Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure; the attached patch does exactly that.

That makes sense. I see a few commits in the git history adding
RelationOpenSmgr() before an smgr* operation whenever such a problem was
discovered: 4942ee656ac, afa8f1971ae, and bf347c60bdd7, for example.

I do wonder if there are still other smgr* operations in the source code
that are preceded by operations that would invalidate the SMgrRelation they
are called with. For example, the smgrnblocks() in gistBuildCallback() may
happen too long after the corresponding RelationOpenSmgr() on the index
relation.

--
Amit Langote
EDB: http://www.enterprisedb.com

#4 Amul Sul
sulamul@gmail.com
In reply to: Amit Langote (#3)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Mon, Mar 22, 2021 at 3:03 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:

In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called on the next line, which triggers a system cache invalidation; under
the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
restored for the subsequent operations, causing the segmentation fault.

Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure; the attached patch does exactly that.

That makes sense. I see a few commits in the git history adding
RelationOpenSmgr() before an smgr* operation whenever such a problem was
discovered: 4942ee656ac, afa8f1971ae, and bf347c60bdd7, for example.

Thanks for the confirmation.

I do wonder if there are still other smgr* operations in the source code
that are preceded by operations that would invalidate the SMgrRelation they
are called with. For example, the smgrnblocks() in gistBuildCallback() may
happen too long after the corresponding RelationOpenSmgr() on the index
relation.

I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() with a CCA build and didn't see any problem there.

I think the easiest way to find any others is to run the regression suite
with a CCA build; there is no guarantee that regression will hit all smgr*
operations, but it should hit most of them.

Regards,
Amul

#5 Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Amul Sul (#4)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Tue, Mar 23, 2021 at 10:08 AM Amul Sul <sulamul@gmail.com> wrote:

I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() with a CCA build and didn't see any problem there.

I think the easiest way to find any others is to run the regression suite
with a CCA build; there is no guarantee that regression will hit all smgr*
operations, but it should hit most of them.

Sure, will give a regression run with CCA enabled.

Regards,
Neha Sharma

#6 Michael Paquier
michael@paquier.xyz
In reply to: Neha Sharma (#5)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Tue, Mar 23, 2021 at 10:52:09AM +0530, Neha Sharma wrote:

Sure, will give a regression run with CCA enabled.

I can confirm the regression between 13 and HEAD, so I have added an
open item. It would be good to figure out the root issue here, and I
am ready to bet that the problem is deeper than it looks and that more
code paths could be involved.

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.
--
Michael

#7 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I am too tired to poke at that today, so I'll try tomorrow. Tom may
beat me at that though.
--
Michael

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#7)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I am too tired to poke at that today, so I'll try tomorrow. Tom may
beat me at that though.

I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

The interesting question here seems to be "why didn't the existing
CLOBBER_CACHE_ALWAYS buildfarm testing catch this?". It looks to me like
the answer is that it only happens for an empty table (or at least one
where the data pattern is such that we skip the RelationOpenSmgr call
earlier in end_heap_rewrite) and we don't happen to be exercising that
exact scenario in the regression tests.
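
To illustrate, the tail of end_heap_rewrite() looks roughly like this
(abridged sketch; see rewriteheap.c for the real code):

	/* Write the last page, if any */
	if (state->rs_buffer_valid)
	{
		...
		RelationOpenSmgr(state->rs_new_rel);
		smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM, ...);
	}

	if (RelationNeedsWAL(state->rs_new_rel))
		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);

With an empty table, rs_buffer_valid is false, so the RelationOpenSmgr call
is skipped and smgrimmedsync() is reached with rd_smgr still NULL after the
cache flush.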

regards, tom lane

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

On closer inspection, I believe the true culprit is c6b92041d,
which did this:

 	 */
 	if (RelationNeedsWAL(state->rs_new_rel))
-		heap_sync(state->rs_new_rel);
+		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);

logical_end_heap_rewrite(state);

heap_sync was careful about opening rd_smgr, the new code not so much.

I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.

regards, tom lane

#10 Amul Sul
sulamul@gmail.com
In reply to: Tom Lane (#9)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

On closer inspection, I believe the true culprit is c6b92041d,
which did this:

*/
if (RelationNeedsWAL(state->rs_new_rel))
-               heap_sync(state->rs_new_rel);
+               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);

logical_end_heap_rewrite(state);

heap_sync was careful about opening rd_smgr, the new code not so much.

I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.

I too didn't find any other place in this commit where a heap_sync()
replacement lets an smgr* operation be reached without the necessary
precautions. heap_sync() was calling RelationOpenSmgr() through
FlushRelationBuffers() before it reached smgrimmedsync(), so we likewise
need to ensure a RelationOpenSmgr() call before smgrimmedsync(), as proposed
previously.
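
For reference, the removed heap_sync() took that precaution implicitly,
roughly like this (abridged from the pre-c6b92041d code; the TOAST-table
handling is omitted):

void
heap_sync(Relation rel)
{
	/* non-WAL-logged tables never need fsync */
	if (!RelationNeedsWAL(rel))
		return;

	/* main heap */
	FlushRelationBuffers(rel);
	/* FlushRelationBuffers will have opened rd_smgr */
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
	...
}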

Regards,
Amul

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amul Sul (#10)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

Amul Sul <sulamul@gmail.com> writes:

On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

On closer inspection, I believe the true culprit is c6b92041d,
which did this:
-               heap_sync(state->rs_new_rel);
+               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
heap_sync was careful about opening rd_smgr, the new code not so much.

so we likewise need to ensure a
RelationOpenSmgr() call before smgrimmedsync(), as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr? That is, write the above and all
similar code like

smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.

regards, tom lane

#12 Amul Sul
sulamul@gmail.com
In reply to: Tom Lane (#11)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amul Sul <sulamul@gmail.com> writes:

On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

On closer inspection, I believe the true culprit is c6b92041d,
which did this:
-               heap_sync(state->rs_new_rel);
+               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
heap_sync was careful about opening rd_smgr, the new code not so much.

so we likewise need to ensure a
RelationOpenSmgr() call before smgrimmedsync(), as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr? That is, write the above and all
similar code like

smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

+1

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.

I think that will be fine; it's one-time pain. If you want, I will do those changes.

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

Regards,
Amul

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amul Sul (#12)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

Amul Sul <sulamul@gmail.com> writes:

On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue. We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.
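
For illustration, a macro version would have to be written as a conditional
expression, something like this (hypothetical, not proposed):

#define RelationGetSmgr(rel) \
	((rel)->rd_smgr != NULL ? (rel)->rd_smgr : \
	 (smgrsetowner(&(rel)->rd_smgr, \
				   smgropen((rel)->rd_node, (rel)->rd_backend)), \
	  (rel)->rd_smgr))

which evaluates its argument as many as five times, so any call with a
side-effecting argument would silently misbehave.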

regards, tom lane

#14 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#13)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

Sorry for the bug.

At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Amul Sul <sulamul@gmail.com> writes:

On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue. We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.

FWIW, +1 for the function as is.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15 Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#14)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Thu, Mar 25, 2021 at 12:10 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Sorry for the bug.

At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Amul Sul <sulamul@gmail.com> writes:

On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue. We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.

FWIW, +1 for the function as is.

Ok, in the attached patch, I have added the inline function to rel.h, and
for that I ended up including smgr.h in rel.h. I replaced all rel->rd_smgr
accesses with the RelationGetSmgr() function and removed the nearby
RelationOpenSmgr() calls, which I don't think are needed at all.
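
The typical conversion looks like this (illustrative shape, not an actual
hunk from the patch):

-	RelationOpenSmgr(rel);
-	smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+	smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);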

Regards,
Amul

Attachments:

Add-RelationGetSmgr-inline-function.patch (application/x-patch, +118/-141)
#16 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#15)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On 2021-Mar-25, Amul Sul wrote:

Ok, in the attached patch, I have added the inline function to rel.h, and
for that I ended up including smgr.h in rel.h. I replaced all rel->rd_smgr
accesses with the RelationGetSmgr() function and removed the nearby
RelationOpenSmgr() calls, which I don't think are needed at all.

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

--
Álvaro Herrera, Valdivia, Chile

#17 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#16)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.
--
Michael

#18 Amul Sul
sulamul@gmail.com
In reply to: Michael Paquier (#17)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.

Agreed; did the same in the attached version, thanks.

Regards,
Amul

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Attachments:

v2_Add-RelationGetSmgr-inline-function.patch (application/octet-stream, +117/-152)
#19 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amul Sul (#18)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in

On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.

Agree, did the same in the attached version, thanks.

+	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,

At the log_newpage() call, index is guaranteed to have rd_smgr set, so I
would prefer to leave that line alone. I don't mind the other successive
calls, if any; what I don't like is the notation there.

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Isn't this a kind of open item?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20 Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#19)
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in

On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.

Agree, did the same in the attached version, thanks.

+       smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
(char *) metapage, true);
-       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,

At the log_newpage() call, index is guaranteed to have rd_smgr set, so I
would prefer to leave that line alone. I don't mind the other successive
calls, if any; what I don't like is the notation there.

Perhaps, but it isn't that bad. It is good to follow the practice of using
RelationGetSmgr() for every rd_smgr access, IMHO.

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Isn't this a kind of open item?

Sorry, I didn't get you. Do I need to move this to some other bucket?

Regards,
Amul

#21 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amul Sul (#20)
#22 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#19)
#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#22)
#24 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#23)
#25 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amul Sul (#20)
#26 Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#25)
#27 Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Amul Sul (#26)
#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amul Sul (#26)
#29 Amul Sul
sulamul@gmail.com
In reply to: Tom Lane (#28)
#30 Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#29)
#31 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amul Sul (#30)
#32 Amul Sul
sulamul@gmail.com
In reply to: Alvaro Herrera (#31)
#33 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amul Sul (#32)
#34 Amul Sul
sulamul@gmail.com
In reply to: Tom Lane (#33)