[CLOBBER_CACHE] Server crashed with segfault 11 while executing clusterdb
Hello,
While executing the test case below, the server crashed with segfault 11
on the master branch.
I have enabled CLOBBER_CACHE_ALWAYS in src/include/pg_config_manual.h.
The issue reproduces only on the master branch.
*Test Case:*
CREATE TABLE sm_5_323_table (col1 numeric);
CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CLUSTER sm_5_323_table USING sm_5_323_idx;
\! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb
-h localhost -p 5432 -d postgres
*Test case output:*
edb@edb:~/PGClobber_build/postgresql/inst/bin$ ./psql postgres
psql (14devel)
Type "help" for help.
postgres=# CREATE TABLE sm_5_323_table (col1 numeric);
CREATE TABLE
postgres=# CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CREATE INDEX
postgres=# CLUSTER sm_5_323_table USING sm_5_323_idx;
CLUSTER
postgres=# \! /PGClobber_build/postgresql/inst/bin/clusterdb -t
sm_5_323_table -U edb -h localhost -p 5432 -d postgres
clusterdb: error: clustering of table "sm_5_323_table" in database
"postgres" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
*Stack Trace:*
Core was generated by `postgres: edb postgres 127.0.0.1(50978) CLUSTER
'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM,
behavior=1) at md.c:485
485 if (reln->md_num_open_segs[forknum] > 0)
(gdb) bt
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM,
behavior=1) at md.c:485
#1 0x000055e5c85eb2f0 in mdnblocks (reln=0x0, forknum=MAIN_FORKNUM) at
md.c:768
#2 0x000055e5c85eb61b in mdimmedsync (reln=0x0,
forknum=forknum@entry=MAIN_FORKNUM)
at md.c:930
#3 0x000055e5c85ec6e5 in smgrimmedsync (reln=<optimized out>,
forknum=forknum@entry=MAIN_FORKNUM) at smgr.c:662
#4 0x000055e5c81ae28b in end_heap_rewrite (state=state@entry=0x55e5ca5d1d70)
at rewriteheap.c:342
#5 0x000055e5c81a32ea in heapam_relation_copy_for_cluster
(OldHeap=0x7f212ce41ba0, NewHeap=0x7f212ce41058, OldIndex=<optimized out>,
use_sort=<optimized out>, OldestXmin=<optimized out>,
xid_cutoff=<optimized out>, multi_cutoff=0x7ffcba6ebe64,
num_tuples=0x7ffcba6ebe68, tups_vacuumed=0x7ffcba6ebe70,
tups_recently_dead=0x7ffcba6ebe78) at heapam_handler.c:984
#6 0x000055e5c82f218a in table_relation_copy_for_cluster
(tups_recently_dead=0x7ffcba6ebe78, tups_vacuumed=0x7ffcba6ebe70,
num_tuples=0x7ffcba6ebe68, multi_cutoff=0x7ffcba6ebe64,
xid_cutoff=0x7ffcba6ebe60, OldestXmin=<optimized out>,
use_sort=<optimized out>, OldIndex=0x7f212ce40670, NewTable=0x7f212ce41058,
OldTable=0x7f212ce41ba0)
at ../../../src/include/access/tableam.h:1656
#7 copy_table_data (pCutoffMulti=<synthetic pointer>,
pFreezeXid=<synthetic pointer>, pSwapToastByContent=<synthetic pointer>,
verbose=<optimized out>, OIDOldIndex=<optimized out>,
OIDOldHeap=16384, OIDNewHeap=<optimized out>) at cluster.c:908
#8 rebuild_relation (verbose=<optimized out>, indexOid=<optimized out>,
OldHeap=<optimized out>) at cluster.c:604
#9 cluster_rel (tableOid=<optimized out>, indexOid=<optimized out>,
params=<optimized out>) at cluster.c:427
#10 0x000055e5c82f2b7f in cluster (pstate=pstate@entry=0x55e5ca5315c0,
stmt=stmt@entry=0x55e5ca510368, isTopLevel=isTopLevel@entry=true) at
cluster.c:195
#11 0x000055e5c85fcbc6 in standard_ProcessUtility (pstmt=0x55e5ca510430,
queryString=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:822
#12 0x000055e5c85fd436 in ProcessUtility (pstmt=pstmt@entry=0x55e5ca510430,
queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL,
params=<optimized out>,
queryEnv=<optimized out>, dest=dest@entry=0x55e5ca510710,
qc=0x7ffcba6ec340) at utility.c:525
#13 0x000055e5c85f6148 in PortalRunUtility (portal=portal@entry=0x55e5ca570d70,
pstmt=pstmt@entry=0x55e5ca510430, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55e5ca510710,
qc=qc@entry=0x7ffcba6ec340) at pquery.c:1159
#14 0x000055e5c85f71a4 in PortalRunMulti (portal=portal@entry=0x55e5ca570d70,
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710,
qc=qc@entry=0x7ffcba6ec340) at pquery.c:1305
#15 0x000055e5c85f8823 in PortalRun (portal=portal@entry=0x55e5ca570d70,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true,
dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710,
qc=0x7ffcba6ec340) at pquery.c:779
#16 0x000055e5c85f389e in exec_simple_query (query_string=0x55e5ca50f850
"CLUSTER public.sm_5_323_table;") at postgres.c:1185
#17 0x000055e5c85f51cf in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffcba6ec670,
dbname=<optimized out>, username=<optimized out>) at postgres.c:4415
#18 0x000055e5c8522240 in BackendRun (port=<optimized out>, port=<optimized
out>) at postmaster.c:4470
#19 BackendStartup (port=<optimized out>) at postmaster.c:4192
#20 ServerLoop () at postmaster.c:1737
#21 0x000055e5c85237ec in PostmasterMain (argc=<optimized out>,
argv=0x55e5ca508fe0) at postmaster.c:1409
#22 0x000055e5c811a2cf in main (argc=3, argv=0x55e5ca508fe0) at main.c:209
Thanks.
--
Regards,
Neha Sharma
In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called next, which triggers a system cache invalidation; under the CCA
setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not restored
for the subsequent operations, causing the segmentation fault.
Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure. Did the same in the attached patch.
Regards,
Amul
On Mon, Mar 22, 2021 at 11:53 AM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
Attachment: fix_failure_for_cca.patch (+3, -0)
On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly but next line tuplesort_begin_cluster()
get called which cause the system cache invalidation and due to CCA setting,
wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the subsequent
operations and causes segmentation fault.By calling RelationOpenSmgr() before calling smgrimmedsync() in
end_heap_rewrite() would fix the failure. Did the same in the attached patch.
That makes sense. I see a few commits in the git history adding
RelationOpenSmgr() before a smgr* operation, presumably whenever such a
problem was discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7, for
example.
I do wonder if there are still other smgr* operations in the source code
that are preceded by operations that would invalidate the SMgrRelation
they are called with. For example, the smgrnblocks() in
gistBuildCallback() may occur too long after the corresponding
RelationOpenSmgr() on the index relation.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Mon, Mar 22, 2021 at 3:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
Thanks for the confirmation.
I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() with the CCA setting and didn't see any problem there.
I think the easiest way to find others is to run the regression suite with
a CCA build; there is no guarantee that regression will hit all smgr*
operations, but it might hit most of them.
Regards,
Amul
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul <sulamul@gmail.com> wrote:
Sure, will give a regression run with CCA enabled.
Regards,
Neha Sharma
On Tue, Mar 23, 2021 at 10:52:09AM +0530, Neha Sharma wrote:
Sure, will give a regression run with CCA enabled.
I can confirm the regression between 13 and HEAD, so I have added an
open item. It would be good to figure out the root issue here, and I
am ready to bet that the problem is deeper than it looks and that more
code paths could be involved.
It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.
--
Michael
On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.
One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
I am too tired to poke at that today, so I'll try tomorrow. Tom may
beat me at that though.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.
One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
I am too tired to poke at that today, so I'll try tomorrow. Tom may
beat me at that though.
I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.
The interesting question here seems to be "why didn't the existing
CLOBBER_CACHE_ALWAYS buildfarm testing catch this?". It looks to me like
the answer is that it only happens for an empty table (or at least one
where the data pattern is such that we skip the RelationOpenSmgr call
earlier in end_heap_rewrite) and we don't happen to be exercising that
exact scenario in the regression tests.
regards, tom lane
I wrote:
Michael Paquier <michael@paquier.xyz> writes:
One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.
On closer inspection, I believe the true culprit is c6b92041d,
which did this:
     */
    if (RelationNeedsWAL(state->rs_new_rel))
-       heap_sync(state->rs_new_rel);
+       smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
    logical_end_heap_rewrite(state);
heap_sync was careful about opening rd_smgr, the new code not so much.
I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.
regards, tom lane
On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I too didn't find any other place in this commit where a replaced
heap_sync() call lets a smgr* operation be reached without the necessary
precautions. heap_sync() called RelationOpenSmgr() through
FlushRelationBuffers() before it reached smgrimmedsync(), so we also need
to make sure RelationOpenSmgr() is called before smgrimmedsync(), as
proposed previously.
Regards,
Amul
Amul Sul <sulamul@gmail.com> writes:
So we also need to make sure of the
RelationOpenSmgr() call before smgrimmedsync() as proposed previously.
I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr? That is, write the above and all
similar code like
smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
where we provide something like
static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}
and then we could get rid of most or all other RelationOpenSmgr calls.
This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.
regards, tom lane
On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
+1
This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.
I think that will be fine; it's one-time pain. If you want, I will do
those changes.
A quick question: can't it be a macro instead of an inline function, like
the other macros we have in rel.h?
Regards,
Amul
Amul Sul <sulamul@gmail.com> writes:
A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?
The multiple-evaluation hazard seems like an issue. We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.
Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.
regards, tom lane
Sorry for the bug.
At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
FWIW, +1 for the function as is.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 25, 2021 at 12:10 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Ok, in the attached patch, I have added the inline function to rel.h; for
that, I ended up including smgr.h in rel.h. I replaced all rel->rd_smgr
accesses with RelationGetSmgr() and removed the nearby RelationOpenSmgr()
calls, which I don't think are needed at all.
Regards,
Amul
Attachment: Add-RelationGetSmgr-inline-function.patch (+118, -141)
On 2021-Mar-25, Amul Sul wrote:
We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.
--
Álvaro Herrera               Valdivia, Chile
On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.
No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.
--
Michael
On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
Agree, did the same in the attached version, thanks.
Regards,
Amul
P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
Attachment: v2_Add-RelationGetSmgr-inline-function.patch (+117, -152)
At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in
+ smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
(char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+ log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
At the log_newpage call, index is guaranteed to have rd_smgr, so I would
prefer to leave that line alone. I don't mind the other successive calls,
if any; what I don't like is the notation there.
P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
Isn't this a kind of open item?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Perhaps, but it isn't that bad. It is good to follow the practice of
using RelationGetSmgr() for all rd_smgr access, IMHO.
P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
Isn't this a kind of open item?
Sorry, I didn't get you. Do I need to move this to some other bucket?
Regards,
Amul