FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Started by Justin Pryzby · 31 messages · pgsql-hackers
#1 Justin Pryzby
pryzby@telsasoft.com

Core was generated by `postgres: telsasoft ts [local] BIND '.

(gdb) bt
#0 0x00007f0951303387 in raise () from /lib64/libc.so.6
#1 0x00007f0951304a78 in abort () from /lib64/libc.so.6
#2 0x0000000000921005 in ExceptionalCondition (conditionName=conditionName@entry=0xa5db3d "pd_idx == pinfo->nparts", errorType=errorType@entry=0x977389 "FailedAssertion",
fileName=fileName@entry=0xa5da88 "execPartition.c", lineNumber=lineNumber@entry=1689) at assert.c:67
#3 0x0000000000672806 in ExecCreatePartitionPruneState (planstate=planstate@entry=0x908f6d8, partitionpruneinfo=<optimized out>) at execPartition.c:1689
#4 0x000000000068444a in ExecInitAppend (node=node@entry=0x7036b90, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAppend.c:132
#5 0x00000000006731fd in ExecInitNode (node=0x7036b90, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:179
#6 0x000000000069d03a in ExecInitResult (node=node@entry=0x70363d8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeResult.c:210
#7 0x000000000067323c in ExecInitNode (node=0x70363d8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:164
#8 0x000000000069e834 in ExecInitSort (node=node@entry=0x7035ca8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeSort.c:210
#9 0x0000000000672ff0 in ExecInitNode (node=0x7035ca8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:313
#10 0x00000000006812e8 in ExecInitAgg (node=node@entry=0x68311d0, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAgg.c:3292
#11 0x0000000000672fb1 in ExecInitNode (node=0x68311d0, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:328
#12 0x000000000068925a in ExecInitGatherMerge (node=node@entry=0x6830998, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeGatherMerge.c:110
#13 0x0000000000672f33 in ExecInitNode (node=0x6830998, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:348
#14 0x00000000006812e8 in ExecInitAgg (node=node@entry=0x682eda8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAgg.c:3292
#15 0x0000000000672fb1 in ExecInitNode (node=node@entry=0x682eda8, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at execProcnode.c:328
#16 0x000000000066c8e6 in InitPlan (eflags=16, queryDesc=<optimized out>) at execMain.c:1020
#17 standard_ExecutorStart (queryDesc=<optimized out>, eflags=16) at execMain.c:266
#18 0x00007f0944ca83b5 in pgss_ExecutorStart (queryDesc=0x1239b08, eflags=<optimized out>) at pg_stat_statements.c:1007
#19 0x00007f09117e4891 in explain_ExecutorStart (queryDesc=0x1239b08, eflags=<optimized out>) at auto_explain.c:301
#20 0x00000000007f9983 in PortalStart (portal=0xeff810, params=0xfacc98, eflags=0, snapshot=0x0) at pquery.c:505
#21 0x00000000007f7370 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xeb8500, dbname=0xeb84e0 "ts", username=<optimized out>) at postgres.c:1987
#22 0x000000000048916e in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4523
#23 BackendStartup (port=0xeb1000) at postmaster.c:4215
#24 ServerLoop () at postmaster.c:1727
#25 0x000000000076ec85 in PostmasterMain (argc=argc@entry=13, argv=argv@entry=0xe859b0) at postmaster.c:1400
#26 0x000000000048a82d in main (argc=13, argv=0xe859b0) at main.c:210

#3 0x0000000000672806 in ExecCreatePartitionPruneState (planstate=planstate@entry=0x908f6d8, partitionpruneinfo=<optimized out>) at execPartition.c:1689
pd_idx = <optimized out>
pp_idx = <optimized out>
pprune = 0x908f910
partdesc = 0x91937f8
pinfo = 0x7d6ee78
partrel = <optimized out>
partkey = 0xfbba28
lc2__state = {l = 0x7d6ee20, i = 0}
partrelpruneinfos = 0x7d6ee20
lc2 = <optimized out>
npartrelpruneinfos = <optimized out>
prunedata = 0x908f908
j = 0
lc__state = {l = 0x7d6edc8, i = 0}
estate = 0x11563f0
prunestate = 0x908f8b0
n_part_hierarchies = <optimized out>
lc = <optimized out>
i = 0

(gdb) p *pinfo
$2 = {type = T_PartitionedRelPruneInfo, rtindex = 7, present_parts = 0x7d6ef10, nparts = 414, subplan_map = 0x7d6ef68, subpart_map = 0x7d6f780, relid_map = 0x7d6ff98, initial_pruning_steps = 0x7d707b0,
exec_pruning_steps = 0x0, execparamids = 0x0}

(gdb) p pd_idx
$3 = <optimized out>

< 2020-08-02 02:04:17.358 SST >LOG: server process (PID 20954) was terminated by signal 6: Aborted
< 2020-08-02 02:04:17.358 SST >DETAIL: Failed process was running:
INSERT INTO child.cdrs_data_users_per_cell_20200801 (...list of columns elided...)
(
SELECT ..., $3::timestamp, $2,
MODE() WITHIN GROUP (ORDER BY ...) AS ..., STRING_AGG(DISTINCT ..., ',') AS ..., ...

This crashed at 2am, which at first I thought was maybe due to simultaneously
creating today's partition.

Aug 2 02:04:08 telsasoftsky abrt-hook-ccpp: Process 19264 (postgres) of user 26 killed by SIGABRT - dumping core
Aug 2 02:04:17 telsasoftsky abrt-hook-ccpp: Process 20954 (postgres) of user 26 killed by SIGABRT - ignoring (repeated crash)

Running:
postgresql13-server-13-beta2_1PGDG.rhel7.x86_64

Maybe this is a problem tickled by something new in v13. However, this is a
new VM, and at the time of the crash I was running a shell loop around
pg_restore, in reverse-chronological order. I have full logs, and I found that
a table had just been CREATEd which the crashing process would've tried to SELECT FROM:

| 2020-08-02 02:04:01.48-11 | duration: 106.275 ms statement: CREATE TABLE child.cdrs_huawei_sgwrecord_2019_06_14 (

That table *currently* has:
|Number of partitions: 416 (Use \d+ to list them.)
And the oldest table is still child.cdrs_huawei_sgwrecord_2019_06_14 (since the
shell loop probably quickly spun through hundreds of pg_restores, failing to
connect to the database "in recovery"). And today's partition was already
created, at: 2020-08-02 01:30:35. So I think

Based on commit logs, I suspect this may be an "older bug", specifically maybe
with:

|commit 898e5e3290a72d288923260143930fb32036c00c
|Author: Robert Haas <rhaas@postgresql.org>
|Date: Thu Mar 7 11:13:12 2019 -0500
|
| Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

I don't think it matters, but the process surrounding the table being INSERTed
INTO is more than a little special, involving renames, detaches, creation,
re-attaching within a transaction. I think that doesn't matter though, and the
issue is surrounding the table being SELECTed *from*, which is actually behind
a view.

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#1)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Based on commit logs, I suspect this may be an "older bug", specifically maybe
with:

|commit 898e5e3290a72d288923260143930fb32036c00c
|Author: Robert Haas <rhaas@postgresql.org>
|Date: Thu Mar 7 11:13:12 2019 -0500
|
| Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

I don't think it matters, but the process surrounding the table being INSERTed
INTO is more than a little special, involving renames, detaches, creation,
re-attaching within a transaction. I think that doesn't matter though, and the
issue is surrounding the table being SELECTed *from*, which is actually behind
a view.

That's an entirely reasonable guess, but it doesn't seem easy to
understand exactly what happened here based on the provided
information. The assertion failure probably indicates that
pinfo->relid_map[] and partdesc->oids[] differ in some way other than
additional elements having been inserted into the latter. For example,
some elements might have disappeared, or the order might have changed.
This isn't supposed to happen, because DETACH PARTITION requires
heavier locking, and the order changing without anything getting
detached should be impossible. But evidently it did. If we could dump
out the two arrays in question it might shed more light on exactly how
things went wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#2)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:

On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Based on commit logs, I suspect this may be an "older bug", specifically maybe
with:

|commit 898e5e3290a72d288923260143930fb32036c00c
|Author: Robert Haas <rhaas@postgresql.org>
|Date: Thu Mar 7 11:13:12 2019 -0500
|
| Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

I don't think it matters, but the process surrounding the table being INSERTed
INTO is more than a little special, involving renames, detaches, creation,
re-attaching within a transaction. I think that doesn't matter though, and the
issue is surrounding the table being SELECTed *from*, which is actually behind
a view.

That's an entirely reasonable guess, but it doesn't seem easy to
understand exactly what happened here based on the provided
information. The assertion failure probably indicates that
pinfo->relid_map[] and partdesc->oids[] differ in some way other than
additional elements having been inserted into the latter. For example,
some elements might have disappeared, or the order might have changed.
This isn't supposed to happen, because DETACH PARTITION requires
heavier locking, and the order changing without anything getting
detached should be impossible. But evidently it did. If we could dump
out the two arrays in question it might shed more light on exactly how
things went wrong.

(gdb) p *pinfo->relid_map@414
$8 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 22606404, 22600237, 22600131, 22596610, 22595013,
22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 22564524, 22564113, 22558519, 22557080, 22556981, 22556793,
22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 22513674, 22513575, 22513405, 22513288, 22507520, 22504728,
22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 22434352, 22430792, 22426903, 22426804, 22420732, 22420025,
22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 22381934, 22375210, 22370297, 22367878, 22367779, 22367586,
22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 22287466, 22283194, 22278659, 22278560, 22272041, 22269121,
22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 22228744, 22228345, 22228033, 22222528, 22222429, 22222330,
22222144, 22222045, 22218408, 22215986, 22215887, 22209311, 22209212, 22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 22202772, 22201869, 22200438, 22197706, 22195027, 22194932,
22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 22158131, 22158021, 22157824, 22153348, 22153236, 22147308,
22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 21885889, 21862973, 21859854, 21859671, 21858869, 21853440,
21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 21798026, 21791895, 21791124, 21783854, 21783744, 21783540,
21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 21735769, 21727603, 21725956, 21716432, 21678682, 21670968,
21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 21621459, 21621320, 21621148, 21612902, 21612790, 21606170,
21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 21560017, 21553910, 21553800, 21553613, 21553495, 21549102,
21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 21514269, 21514159, 21508389, 21508138, 21508028, 21507830,
21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 21472576, 21468851, 21468741, 21468546, 21467832, 21460086,
21425406, 21420632, 21420506, 21419974, 21417830, 21417365, 21408677, 21401314, 21400808, 21399725, 21399113, 21393312, 21393202, 21387393, 21384625, 21384361, 21384172, 21384054, 21379960, 21374013, 21365760,
21361813, 21361703, 21361504, 21358333, 21358220, 21352848, 21348896, 21348484, 21343591, 21337675, 21337472, 21331017, 21330907, 21325895, 21325785, 21325675, 21325565, 21325370, 21319929, 21316068, 21315958,
21312609, 21284187, 21262186, 21258549, 21258439, 21258279, 21258131, 21254759, 21251782, 21251094, 21250984, 21250874, 21250764, 21244302, 21239067, 21238951, 21238831, 21236783, 21235605, 21230205, 21166173,
21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 0}

(gdb) p *partdesc->oids@415
$12 = {22653702, 22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 22606404, 22600237, 22600131, 22596610,
22595013, 22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 22564524, 22564113, 22558519, 22557080, 22556981,
22556793, 22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 22513674, 22513575, 22513405, 22513288, 22507520,
22504728, 22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 22434352, 22430792, 22426903, 22426804, 22420732,
22420025, 22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 22381934, 22375210, 22370297, 22367878, 22367779,
22367586, 22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 22287466, 22283194, 22278659, 22278560, 22272041,
22269121, 22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 22228744, 22228345, 22228033, 22222528, 22222429,
22222330, 22222144, 22222045, 22218408, 22215986, 22215887, 22209311, 22209212, 22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 22202772, 22201869, 22200438, 22197706, 22195027,
22194932, 22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 22158131, 22158021, 22157824, 22153348, 22153236,
22147308, 22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 21885889, 21862973, 21859854, 21859671, 21858869,
21853440, 21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 21798026, 21791895, 21791124, 21783854, 21783744,
21783540, 21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 21735769, 21727603, 21725956, 21716432, 21678682,
21670968, 21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 21621459, 21621320, 21621148, 21612902, 21612790,
21606170, 21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 21560017, 21553910, 21553800, 21553613, 21553495,
21549102, 21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 21514269, 21514159, 21508389, 21508138, 21508028,
21507830, 21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 21472576, 21468851, 21468741, 21468546, 21467832,
21460086, 21425406, 21420632, 21420506, 21419974, 21417830, 21417365, 21408677, 21401314, 21400808, 21399725, 21399113, 21393312, 21393202, 21387393, 21384625, 21384361, 21384172, 21384054, 21379960, 21374013,
21365760, 21361813, 21361703, 21361504, 21358333, 21358220, 21352848, 21348896, 21348484, 21343591, 21337675, 21337472, 21331017, 21330907, 21325895, 21325785, 21325675, 21325565, 21325370, 21319929, 21316068,
21315958, 21312609, 21284187, 21262186, 21258549, 21258439, 21258279, 21258131, 21254759, 21251782, 21251094, 21250984, 21250874, 21250764, 21244302, 21239067, 21238951, 21238831, 21236783, 21235605, 21230205,
21166173, 21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 22628862}

ts=# SELECT 22628862 ::regclass;
regclass | child.cdrs_huawei_msc_voice_2020_08_02

=> This one was *probably* created around 00:30, but I didn't save logs earlier
than 0200. That table was probably involved in a query around 2020-08-02
02:02:01.

ts=# SELECT 22653702 ::regclass;
regclass | child.cdrs_huawei_msc_voice_2019_06_15

=> This one was created by pg_restore at: 2020-08-02 02:03:24

Maybe it's significant that the crash happened during BIND. This is a prepared
query.

--
Justin

#4 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#3)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Tue, Aug 4, 2020 at 1:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:

On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Based on commit logs, I suspect this may be an "older bug", specifically maybe
with:

|commit 898e5e3290a72d288923260143930fb32036c00c
|Author: Robert Haas <rhaas@postgresql.org>
|Date: Thu Mar 7 11:13:12 2019 -0500
|
| Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

I don't think it matters, but the process surrounding the table being INSERTed
INTO is more than a little special, involving renames, detaches, creation,
re-attaching within a transaction. I think that doesn't matter though, and the
issue is surrounding the table being SELECTed *from*, which is actually behind
a view.

That's an entirely reasonable guess, but it doesn't seem easy to
understand exactly what happened here based on the provided
information. The assertion failure probably indicates that
pinfo->relid_map[] and partdesc->oids[] differ in some way other than
additional elements having been inserted into the latter. For example,
some elements might have disappeared, or the order might have changed.
This isn't supposed to happen, because DETACH PARTITION requires
heavier locking, and the order changing without anything getting
detached should be impossible. But evidently it did. If we could dump
out the two arrays in question it might shed more light on exactly how
things went wrong.

It may be this commit that went into PG 12 that is causing the problem:

commit 428b260f87e8861ba8e58807b69d433db491c4f4
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Mar 30 18:58:55 2019 -0400

Speed up planning when partitions can be pruned at plan time.

which had this:

-               /* Double-check that list of relations has not changed. */
-               Assert(memcmp(partdesc->oids, pinfo->relid_map,
-                      pinfo->nparts * sizeof(Oid)) == 0);
+               /*
+                * Double-check that the list of unpruned relations has not
+                * changed.  (Pruned partitions are not in relid_map[].)
+                */
+#ifdef USE_ASSERT_CHECKING
+               for (int k = 0; k < pinfo->nparts; k++)
+               {
+                   Assert(partdesc->oids[k] == pinfo->relid_map[k] ||
+                          pinfo->subplan_map[k] == -1);
+               }
+#endif

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

I was able to reproduce such a crash as follows:

Start with these tables in session 1.

create table foo (a int, b int) partition by list (a);
create table foo1 partition of foo for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);

Attach gdb with a breakpoint set in PartitionDirectoryLookup() and run this:

explain analyze select * from foo where a <> 1 and a = (select 2);
<After hitting the breakpoint in PartitionDirectoryLookup() called by
the planner, step to the end of it and leave it there>

In another session:

create table foo4 (like foo);
alter table foo attach partition foo4 for values in (4);

That should finish without waiting for any lock and send an
invalidation message to session 1. Go back to gdb attached to session
1 and hit continue, resulting in the plan containing runtime pruning
info being executed. ExecCreatePartitionPruneState() opens foo which
will now have 4 partitions instead of the 3 that the planner would
have seen, of which foo1 is pruned (a <> 1), so the following block is
executed:

if (partdesc->nparts == pinfo->nparts)
    ...
else
{
    int     pd_idx = 0;
    int     pp_idx;

    /*
     * Some new partitions have appeared since plan time, and
     * those are reflected in our PartitionDesc but were not
     * present in the one used to construct subplan_map and
     * subpart_map.  So we must construct new and longer arrays
     * where the partitions that were originally present map to
     * the same place, and any added indexes map to -1, as if the
     * new partitions had been pruned.
     */
    pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
    for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
    {
        if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
        {
            pprune->subplan_map[pp_idx] = -1;
            pprune->subpart_map[pp_idx] = -1;
        }
        else
        {
            pprune->subplan_map[pp_idx] =
                pinfo->subplan_map[pd_idx];
            pprune->subpart_map[pp_idx] =
                pinfo->subpart_map[pd_idx++];
        }
    }
    Assert(pd_idx == pinfo->nparts);
}

where it crashes due to having relid_map[] and partdesc->oids[] that
look like this:

(gdb) p *pinfo->relid_map@pinfo->nparts
$3 = {0, 74106, 74109}

(gdb) p *partdesc->oids@partdesc->nparts
$6 = {74103, 74106, 74109, 74112}

The 0 in relid_map matches nothing in partdesc->oids, so the loop ends
without ever moving forward in the relid_map array, causing the Assert
to fail.
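
To make the failure mode concrete, here is a minimal standalone sketch (plain
C, not PostgreSQL code) that runs the same matching logic over the two arrays
shown above; the OIDs come from the gdb output, everything else is
illustrative:

#include <stdio.h>

typedef unsigned int Oid;

int
main(void)
{
    /* plan-time map; 0 marks the partition the planner pruned (foo1) */
    Oid relid_map[] = {0, 74106, 74109};
    int nparts = 3;

    /* execution-time PartitionDesc; foo4 (74112) was attached concurrently */
    Oid oids[] = {74103, 74106, 74109, 74112};
    int ndescparts = 4;

    int pd_idx = 0;
    int pp_idx;

    for (pp_idx = 0; pp_idx < ndescparts; ++pp_idx)
    {
        if (relid_map[pd_idx] != oids[pp_idx])
        {
            /* would map this partdesc entry to -1, as if it had been pruned */
        }
        else
            pd_idx++;       /* matched: consume one plan-time entry */
    }

    /* relid_map[0] is 0 and never matches, so pd_idx never advances */
    printf("pd_idx = %d, nparts = %d\n", pd_idx, nparts);
    return 0;
}

This prints "pd_idx = 0, nparts = 3", i.e. the equivalent of
Assert(pd_idx == pinfo->nparts) fails.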

The attached patch should fix that.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachments:

PartitionedRelPruneInfo-relid_map-pruned-parts-zero-fix.patch (application/octet-stream, +6 -1)
#5 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#4)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:

It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

(gdb) up
#4 0x000000000068444a in ExecInitAppend (node=node@entry=0x7036b90, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAppend.c:132
132 nodeAppend.c: No such file or directory.
(gdb) p *node->appendplans
$17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, initial_elements = 0x7037400}

(gdb) down
#3 0x0000000000672806 in ExecCreatePartitionPruneState (planstate=planstate@entry=0x908f6d8, partitionpruneinfo=<optimized out>) at execPartition.c:1689
1689 execPartition.c: No such file or directory.

$27 = {ps = {type = T_AppendState, plan = 0x7036b90, state = 0x11563f0, ExecProcNode = 0x6842c0 <ExecAppend>, ExecProcNodeReal = 0x0, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0,
qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleDesc = 0x0, ps_ResultTupleSlot = 0x0, ps_ExprContext = 0x908f7f0, ps_ProjInfo = 0x0, scandesc = 0x0,
scanops = 0x0, outerops = 0x0, innerops = 0x0, resultops = 0x0, scanopsfixed = false, outeropsfixed = false, inneropsfixed = false, resultopsfixed = false, scanopsset = false, outeropsset = false,
inneropsset = false, resultopsset = false}, appendplans = 0x0, as_nplans = 0, as_whichplan = -1, as_first_partial_plan = 0, as_pstate = 0x0, pstate_len = 0, as_prune_state = 0x0, as_valid_subplans = 0x0,
choose_next_subplan = 0x0}

--
Justin

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#3)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Mon, Aug 3, 2020 at 12:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

(gdb) p *pinfo->relid_map@414
(gdb) p *partdesc->oids@415

Whoa, I didn't know about using @ in gdb to print multiple elements. Wild!

Anyway, these two arrays differ in that the latter array has 22653702
inserted at the beginning and 22628862 at the end, and also in that a
0 has been removed. This code can't cope with things getting removed,
so kaboom. I think Amit probably has the right idea about what's going
on here and how to fix it, but I haven't yet had time to study it in
detail.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#1)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 5, 2020 at 9:52 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:

On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:

It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

If you can spot a 0 in the output of the following, then yes.

(gdb) p *pinfo->relid_map@pinfo->nparts

I guess you knew that an earlier message has just that. Thanks.
/messages/by-id/20200803161133.GA21372@telsasoft.com

Yeah, you showed:

(gdb) p *pinfo->relid_map@414

And there is indeed a 0 in there, but I wasn't sure if it was actually
in the array or a stray zero due to forcing gdb to show beyond the
array bound. Does pinfo->nparts match 414?

(Sorry, I forgot to hit reply-all in the last two emails.)

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#8 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#7)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote:

On Wed, Aug 5, 2020 at 9:52 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:

On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:

It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

If you can spot a 0 in the output of the following, then yes.

(gdb) p *pinfo->relid_map@pinfo->nparts

I guess you knew that an earlier message has just that. Thanks.
/messages/by-id/20200803161133.GA21372@telsasoft.com

Yeah, you showed:

(gdb) p *pinfo->relid_map@414

And there is indeed a 0 in there, but I wasn't sure if it was actually
in the array or a stray zero due to forcing gdb to show beyond the
array bound. Does pinfo->nparts match 414?

Yes. I typed 414 manually since the array lengths were suspect.

(gdb) p pinfo->nparts
$1 = 414
(gdb) set print elements 0
(gdb) p *pinfo->relid_map@pinfo->nparts
$3 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 22606404, 22600237, 22600131, 22596610, 22595013,
22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 22564524, 22564113, 22558519, 22557080, 22556981, 22556793,
22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 22513674, 22513575, 22513405, 22513288, 22507520, 22504728,
22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 22434352, 22430792, 22426903, 22426804, 22420732, 22420025,
22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 22381934, 22375210, 22370297, 22367878, 22367779, 22367586,
22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 22287466, 22283194, 22278659, 22278560, 22272041, 22269121,
22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 22228744, 22228345, 22228033, 22222528, 22222429, 22222330,
22222144, 22222045, 22218408, 22215986, 22215887, 22209311, 22209212, 22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 22202772, 22201869, 22200438, 22197706, 22195027, 22194932,
22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 22158131, 22158021, 22157824, 22153348, 22153236, 22147308,
22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 21885889, 21862973, 21859854, 21859671, 21858869, 21853440,
21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 21798026, 21791895, 21791124, 21783854, 21783744, 21783540,
21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 21735769, 21727603, 21725956, 21716432, 21678682, 21670968,
21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 21621459, 21621320, 21621148, 21612902, 21612790, 21606170,
21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 21560017, 21553910, 21553800, 21553613, 21553495, 21549102,
21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 21514269, 21514159, 21508389, 21508138, 21508028, 21507830,
21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 21472576, 21468851, 21468741, 21468546, 21467832, 21460086,
21425406, 21420632, 21420506, 21419974, 21417830, 21417365, 21408677, 21401314, 21400808, 21399725, 21399113, 21393312, 21393202, 21387393, 21384625, 21384361, 21384172, 21384054, 21379960, 21374013, 21365760,
21361813, 21361703, 21361504, 21358333, 21358220, 21352848, 21348896, 21348484, 21343591, 21337675, 21337472, 21331017, 21330907, 21325895, 21325785, 21325675, 21325565, 21325370, 21319929, 21316068, 21315958,
21312609, 21284187, 21262186, 21258549, 21258439, 21258279, 21258131, 21254759, 21251782, 21251094, 21250984, 21250874, 21250764, 21244302, 21239067, 21238951, 21238831, 21236783, 21235605, 21230205, 21166173,
21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 0}

--
Justin

#9 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#8)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 5, 2020 at 10:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote:

On Wed, Aug 5, 2020 at 9:52 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:

On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:

It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

If you can spot a 0 in the output of the following, then yes.

(gdb) p *pinfo->relid_map@pinfo->nparts

I guess you knew that an earlier message has just that. Thanks.
/messages/by-id/20200803161133.GA21372@telsasoft.com

Yeah, you showed:

(gdb) p *pinfo->relid_map@414

And there is indeed a 0 in there, but I wasn't sure if it was actually
in the array or a stray zero due to forcing gdb to show beyond the
array bound. Does pinfo->nparts match 414?

Yes. I typed 414 manually since the array lengths were suspect.

(gdb) p pinfo->nparts
$1 = 414
(gdb) set print elements 0
(gdb) p *pinfo->relid_map@pinfo->nparts
$3 = {....
21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 0}

Thanks. There is a 0 in there, which can only be there if the planner was
able to prune that last partition. So, the planner saw a table with
414 partitions, was able to prune the last one, and constructed an
Append plan with 413 subplans for the unpruned partitions, as you showed
upthread:

(gdb) p *node->appendplans
$17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, initial_elements = 0x7037400}

This suggests that the crash I was able to produce is similar to what you saw.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#4)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Amit Langote <amitlangote09@gmail.com> writes:

The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.
I was able to reproduce such a crash as follows:

Yeah, I can repeat the case per these directions. I concur that the
issue is that ExecCreatePartitionPruneState is failing to cope with
zeroes in the relid_map.

The attached patch should fix that.

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly. In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune. So I think we should do it more like the attached.
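
A minimal sketch of that more careful matching (using the variable names from
the excerpt upthread; this is an illustration of the idea, not necessarily the
attached patch). The key point is to skip over InvalidOid (zero) relid_map
entries without requiring them to match any particular partdesc entry:

    pd_idx = 0;
    for (pp_idx = 0; pp_idx < partdesc->nparts; pp_idx++)
    {
        /* Skip InvalidOid relid_map entries; those were pruned at plan time */
        while (pd_idx < pinfo->nparts &&
               !OidIsValid(pinfo->relid_map[pd_idx]))
            pd_idx++;

        if (pd_idx < pinfo->nparts &&
            pinfo->relid_map[pd_idx] == partdesc->oids[pp_idx])
        {
            /* surviving plan-time partition: carry its mappings over */
            pprune->subplan_map[pp_idx] = pinfo->subplan_map[pd_idx];
            pprune->subpart_map[pp_idx] = pinfo->subpart_map[pd_idx];
            pd_idx++;
        }
        else
        {
            /* not in the plan (newly added, or pruned): map to -1 */
            pprune->subplan_map[pp_idx] = -1;
            pprune->subpart_map[pp_idx] = -1;
        }
    }
    Assert(pd_idx == pinfo->nparts);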

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

regards, tom lane

Attachments:

fix-sloppy-partition-mapping-2.patch (text/x-diff; charset=us-ascii, +21 -1)
#11 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 5, 2020 at 1:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly. In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune. So I think we should do it more like the attached.

Ooh, nice catch.

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

I was thinking about that, too. +1 for taking that step.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 5, 2020 at 1:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

I was thinking about that, too. +1 for taking that step.

Will do.

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here". If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that. I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely. So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

regards, tom lane

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 5, 2020 at 2:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here". If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that. I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely. So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

That's an interesting idea. I don't know exactly how it would work,
but I agree that it would allow useful testing that we can't do today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 5, 2020 at 2:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here". If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that. I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely. So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

That's an interesting idea. I don't know exactly how it would work,
but I agree that it would allow useful testing that we can't do today.

After thinking about it for a little bit, I'm envisioning a test module
that can be loaded into a session, and then it gets into the planner_hook,
and what it does is after each planner execution, take and release an
advisory lock with some selectable ID. Then we can construct
isolationtester specs that do something like

session 1                               session 2

LOAD test-module;
SET custom_guc_for_lock_id = n;
prepare test tables;
                                        SELECT pg_advisory_lock(n);
SELECT victim-query-here;
... after planning, query blocks on lock
                                        perform DDL changes;
                                        SELECT pg_advisory_unlock(n);
... query executes with now-stale plan

regards, tom lane

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Wed, Aug 5, 2020 at 4:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After thinking about it for a little bit, I'm envisioning a test module
that can be loaded into a session, and then it gets into the planner_hook,
and what it does is after each planner execution, take and release an
advisory lock with some selectable ID. Then we can construct
isolationtester specs that do something like

session 1                               session 2

LOAD test-module;
SET custom_guc_for_lock_id = n;
prepare test tables;
                                        SELECT pg_advisory_lock(n);
SELECT victim-query-here;
... after planning, query blocks on lock
                                        perform DDL changes;
                                        SELECT pg_advisory_unlock(n);
... query executes with now-stale plan

Very sneaky!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#10)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Thu, Aug 6, 2020 at 2:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

The attached patch should fix that.

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly. In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune.

Indeed, you're right.

So I think we should do it more like the attached.

Thanks for pushing that.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#17 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#12)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Thu, Aug 6, 2020 at 2:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 5, 2020 at 1:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

I was thinking about that, too. +1 for taking that step.

Will do.

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".

One of the methods I see is that we can just add some GUC variable for
action injection; basically, it adds some code based on the GUC like this:

    if (shall_delay_planning)
    {
        sleep(10);
    }

AFAIK, MongoDB makes heavy use of this technique in its test framework.
First it defines the fail point [1], then does code injection if the fail
point is set [2], and finally, during the test, it can set a fail point
like a GUC, but with more attributes [3]. If that is useful in PG as well
and it is not an urgent task, I would like to help in this direction.

[1]: https://github.com/mongodb/mongo/search?q=MONGO_FAIL_POINT_DEFINE&unscoped_q=MONGO_FAIL_POINT_DEFINE
[2]: https://github.com/mongodb/mongo/blob/d4e7ea57599b44353b5393afedee8ae5670837b3/src/mongo/db/repl/repl_set_config.cpp#L475
[3]: https://github.com/mongodb/mongo/blob/e07c2d29aded5a30ff08b5ce6a436b6ef6f44014/src/mongo/shell/replsettest.js#L1427

If we're going to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that. I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely. So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

regards, tom lane

--
Best Regards
Andy Fan

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Fan (#17)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Andy Fan <zhihui.fan1213@gmail.com> writes:

On Thu, Aug 6, 2020 at 2:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".

One of the methods I see is that we can just add some GUC variable for
action injection; basically, it adds some code based on the GUC like this:

See my straw-man proposal downthread. I'm not very excited about putting
things like this into the standard build, because it's really hard to be
sure that there are no security-hazard-ish downsides of putting in ways to
get at testing behaviors from standard SQL. And then there's the question
of whether you're adding noticeable overhead to production builds. So a
loadable module that can use some existing hook to provide the needed
behavior seems like a better plan to me, whenever we can do it that way.

In general, though, it seems like we've seldom regretted investments in
test tooling.

regards, tom lane

#19 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#18)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Thu, Aug 6, 2020 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andy Fan <zhihui.fan1213@gmail.com> writes:

On Thu, Aug 6, 2020 at 2:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".

One of the methods I see is that we can just add some GUC variable for
action injection; basically, it adds some code based on the GUC like this:

See my straw-man proposal downthread. I'm not very excited about putting
things like this into the standard build, because it's really hard to be
sure that there are no security-hazard-ish downsides of putting in ways to
get at testing behaviors from standard SQL. And then there's the question
of whether you're adding noticeable overhead to production builds. So a
loadable module that can use some existing hook to provide the needed
behavior seems like a better plan to me, whenever we can do it that way.

In general, though, it seems like we've seldom regretted investments in
test tooling.

regards, tom lane

Thanks for your explanation; I checked it again and it looks like a very clean
method. The attached is a draft patch based on my understanding. Hope
I didn't misunderstand you.

--
Best Regards
Andy Fan

Attachments:

v1-0001-test_module-used-for-concurrency-case-simulation.patch (application/octet-stream, +111 -19)
#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Fan (#19)
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

Andy Fan <zhihui.fan1213@gmail.com> writes:

On Thu, Aug 6, 2020 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

See my straw-man proposal downthread.

Thanks for your explanation; I checked it again and it looks like a very clean
method. The attached is a draft patch based on my understanding. Hope
I didn't misunderstand you.

Ah, I was going to play around with that today, but you beat me to it ;-)

A few thoughts after a quick look at the patch:

* I had envisioned that there's a custom GUC controlling the lock ID
used; this would allow blocking different sessions at different points,
if we ever need that. Also, I'd make the GUC start out as zero which
means "do nothing", so that merely loading the module has no immediate
effect.

* Don't really see the point of the before-planning lock.

* Rather than exposing internal declarations from lockfuncs.c, you
could just write calls to pg_advisory_lock_int8 etc. using
DirectFunctionCall1.

* We need some better name than "test_module". I had vaguely thought
about "delay_execution", but am surely open to better ideas.

regards, tom lane

#21 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Fan (#21)
#23 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#22)
#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Fan (#23)
#25 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#22)
#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#25)
#27 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#25)
#28 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#25)
#29 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#28)
#30 Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#29)
#31 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#30)