Server crashed with dense_rank on partition table.

Started by Rajkumar Raghuwanshi, almost 8 years ago, 8 messages, hackers
#1Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com

Hi,

I am getting a server crash with the query below.

CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000',
'0001', '0002', '0003');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0004',
'0005', '0006', '0007');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0008',
'0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000') FROM
generate_series(0, 36) i;
ANALYZE pagg_tab;
SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab GROUP BY b
ORDER BY 1;

postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
GROUP BY b ORDER BY 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The log file has this:

2018-06-12 21:29:54.930 IST [69580] STATEMENT: drop table pagg_tab;
TRAP: BadArgument("!(((context) != ((void *)0) && (((((const
Node*)((context)))->type) == T_AllocSetContext) || ((((const
Node*)((context)))->type) == T_SlabContext) || ((((const
Node*)((context)))->type) == T_GenerationContext))))", File: "mcxt.c",
Line: 775)
2018-06-12 21:29:55.552 IST [69571] LOG: server process (PID 69580) was
terminated by signal 6: Aborted
2018-06-12 21:29:55.552 IST [69571] DETAIL: Failed process was running:
SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab GROUP BY b
ORDER BY 1;
2018-06-12 21:29:55.552 IST [69571] LOG: terminating any other active
server processes

And here is the core file content:

Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: edb postgres [local]
SELECT '.
Program terminated with signal 6, Aborted.
#0 0x0000003dd2632495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000003dd2632495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003dd2633c75 in abort () at abort.c:92
#2 0x0000000000a32622 in ExceptionalCondition (
conditionName=0xc99320 "!(((context) != ((void *)0) && (((((const
Node*)((context)))->type) == T_AllocSetContext) || ((((const
Node*)((context)))->type) == T_SlabContext) || ((((const
Node*)((context)))->type) == T_Generatio"..., errorType=0xc99312
"BadArgument", fileName=0xc993f5 "mcxt.c", lineNumber=775) at assert.c:54
#3 0x0000000000a6be3a in MemoryContextAlloc (context=0x134a708, size=8) at
mcxt.c:775
#4 0x0000000000a6d0a6 in MemoryContextStrdup (context=0x134a708,
string=0xc60413 "integer") at mcxt.c:1153
#5 0x0000000000a6d0e9 in pstrdup (in=0xc60413 "integer") at mcxt.c:1163
#6 0x0000000000927328 in format_type_extended (type_oid=23, typemod=-1,
flags=0) at format_type.c:224
#7 0x00000000009275c4 in format_type_be (type_oid=23) at format_type.c:330
#8 0x00000000006d41ed in CheckVarSlotCompatibility (slot=0x134a6a8,
attnum=1, vartype=23) at execExprInterp.c:1883
#9 0x00000000006d4062 in CheckExprStillValid (state=0x1388370,
econtext=0x134a6a8) at execExprInterp.c:1823
#10 0x00000000006d3f5e in ExecInterpExprStillValid (state=0x1388370,
econtext=0x134a6a8, isNull=0x7ffe988f3907) at execExprInterp.c:1780
#11 0x000000000098a116 in ExecEvalExprSwitchContext (state=0x1388370,
econtext=0x134a6a8, isNull=0x7ffe988f3907) at
../../../../src/include/executor/executor.h:303
#12 0x000000000098a191 in ExecQual (state=0x1388370, econtext=0x134a6a8) at
../../../../src/include/executor/executor.h:372
#13 0x000000000098a1e3 in ExecQualAndReset (state=0x1388370,
econtext=0x134a6a8) at ../../../../src/include/executor/executor.h:389
#14 0x000000000098cb96 in hypothetical_dense_rank_final
(fcinfo=0x7ffe988f3a40) at orderedsetaggs.c:1389
#15 0x00000000006f3a5d in finalize_aggregate (aggstate=0x1368f98,
peragg=0x1382a38, pergroupstate=0x1382be8, resultVal=0x13829f8,
resultIsNull=0x1382a18) at nodeAgg.c:965
#16 0x00000000006f3ff0 in finalize_aggregates (aggstate=0x1368f98,
peraggs=0x1382a38, pergroup=0x1382be8) at nodeAgg.c:1172
#17 0x00000000006f516a in agg_retrieve_direct (aggstate=0x1368f98) at
nodeAgg.c:1887
#18 0x00000000006f4a6d in ExecAgg (pstate=0x1368f98) at nodeAgg.c:1551
#19 0x0000000000718972 in ExecProcNode (node=0x1368f98) at
../../../src/include/executor/executor.h:237
#20 0x0000000000718abe in ExecSort (pstate=0x1368e80) at nodeSort.c:107
#21 0x00000000006e6b26 in ExecProcNodeFirst (node=0x1368e80) at
execProcnode.c:445
#22 0x00000000006dbd61 in ExecProcNode (node=0x1368e80) at
../../../src/include/executor/executor.h:237
#23 0x00000000006de71b in ExecutePlan (estate=0x1368c68,
planstate=0x1368e80, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x137a928, execute_once=true) at
execMain.c:1726
#24 0x00000000006dc34b in standard_ExecutorRun (queryDesc=0x1354318,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:363
#25 0x00000000006dc167 in ExecutorRun (queryDesc=0x1354318,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:306
#26 0x00000000008cadd2 in PortalRunSelect (portal=0x12f0c28, forward=true,
count=0, dest=0x137a928) at pquery.c:932
#27 0x00000000008caa60 in PortalRun (portal=0x12f0c28,
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x137a928,
altdest=0x137a928,
completionTag=0x7ffe988f43a0 "") at pquery.c:773
#28 0x00000000008c4a37 in exec_simple_query (query_string=0x128b798 "SELECT
dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab GROUP BY b ORDER BY
1;") at postgres.c:1122
#29 0x00000000008c8d07 in PostgresMain (argc=1, argv=0x12b52a0,
dbname=0x12b5100 "postgres", username=0x1288298 "edb") at postgres.c:4153
#30 0x00000000008264f7 in BackendRun (port=0x12ad060) at postmaster.c:4361
#31 0x0000000000825c65 in BackendStartup (port=0x12ad060) at
postmaster.c:4033
#32 0x0000000000822047 in ServerLoop () at postmaster.c:1706
#33 0x0000000000821979 in PostmasterMain (argc=3, argv=0x12861f0) at
postmaster.c:1379
#34 0x0000000000748bc4 in main (argc=3, argv=0x12861f0) at main.c:228

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

#2Michael Paquier
michael@paquier.xyz
In reply to: Rajkumar Raghuwanshi (#1)
Re: Server crashed with dense_rank on partition table.

On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:

postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
GROUP BY b ORDER BY 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Indeed, thanks for the test case. This used to work in v10 but this is
failing with v11 so I am adding an open item. The plans of the pre-10
query and the query on HEAD are rather similar, and the memory context
at execution time looks messed up.
--
Michael

#3David Rowley
dgrowleyml@gmail.com
In reply to: Michael Paquier (#2)
Re: Server crashed with dense_rank on partition table.

On 13 June 2018 at 17:55, Michael Paquier <michael@paquier.xyz> wrote:

Indeed, thanks for the test case. This used to work in v10 but this is
failing with v11 so I am adding an open item. The plans of the pre-10
query and the query on HEAD are rather similar, and the memory context
at execution time looks messed up.

Looks like some memory is being stomped on somewhere.

4b9094eb6 (Adapt to LLVM 7+ Orc API changes.) appears to be the first
bad commit.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Michael Paquier (#2)
Re: Server crashed with dense_rank on partition table.

Hi.

On 2018/06/13 14:55, Michael Paquier wrote:

Indeed, thanks for the test case. This used to work in v10 but this is
failing with v11 so I am adding an open item. The plans of the pre-10
query and the query on HEAD are rather similar, and the memory context
at execution time looks messed up.

Fwiw, I see that the crash can also occur even when using a
non-partitioned table in the query, as shown in the following example
which reuses Rajkumar's test data and query:

create table foo (a int, b int, c text);
postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM0000') from
generate_series(0, 36) i;

select dense_rank(b) within group (order by a) from foo group by b order by 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The following query in the regression test suite can also be made to crash
by adding a group by clause:

select dense_rank(3) within group (order by x) from (values
(1),(1),(2),(2),(3),(3),(4)) v(x) group by (x);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Looking at the core dump of this, it seems the following commit may be
relevant:

commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
Author: Andres Freund <andres@anarazel.de>
Date: Thu Feb 15 21:55:31 2018 -0800

Do execGrouping.c via expression eval machinery, take two.

Thanks,
Amit

#5Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#4)
Re: Server crashed with dense_rank on partition table.

On 2018/06/13 16:35, Amit Langote wrote:

Looking at the core dump of this, it seems the following commit may be
relevant:

commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
Author: Andres Freund <andres@anarazel.de>
Date: Thu Feb 15 21:55:31 2018 -0800

Do execGrouping.c via expression eval machinery, take two.

I studied this a bit and found a bug that's causing the crash.

The above mentioned commit has this hunk:

@@ -1309,6 +1311,9 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
         PG_RETURN_INT64(rank);

     osastate = (OSAPerGroupState *) PG_GETARG_POINTER(0);
+    econtext = osastate->qstate->econtext;
+    if (!econtext)
+        osastate->qstate->econtext = econtext = CreateStandaloneExprContext();

In CreateStandaloneExprContext(), we have this:

    econtext->ecxt_per_query_memory = CurrentMemoryContext;

    /*
     * Create working memory for expression evaluation in this context.
     */
    econtext->ecxt_per_tuple_memory =
        AllocSetContextCreate(CurrentMemoryContext,
                              "ExprContext",
                              ALLOCSET_DEFAULT_SIZES);

I noticed when debugging the crashing query that CurrentMemoryContext is
actually the per-tuple memory context of some expression context in the
calling code, which gets reset before we get here again for the next group.
The econtext cached in qstate, and the per-tuple context created under it,
therefore point at memory that has already been reset. So, it's wrong of
hypothetical_dense_rank_final to call CreateStandaloneExprContext without
first switching to an actual per-query context.

Attached patch seems to fix the crash.
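
For readers without the attachment, here is a rough, self-contained model of
the problem described above. It is not the attached patch and does not use
PostgreSQL's headers: ToyContext, toy_switch_to, toy_create_child, toy_reset
and cached_context_survives are all hypothetical names made up for this
sketch. It only shows why a lazily created, cached child context has to be
created under a long-lived parent rather than under whatever context happens
to be current.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a memory context: tracks only validity and children. */
typedef struct ToyContext
{
    int                valid;        /* 1 while usable, 0 after a parent reset */
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
} ToyContext;

/* Plays the role of CurrentMemoryContext in this model. */
static ToyContext *current_context;

/* Make ctx current and return the previously current context. */
static ToyContext *
toy_switch_to(ToyContext *ctx)
{
    ToyContext *old = current_context;

    current_context = ctx;
    return old;
}

/* Create a child of whichever context is current at the time of the call. */
static ToyContext *
toy_create_child(void)
{
    ToyContext *child = calloc(1, sizeof(ToyContext));

    if (child == NULL)
        abort();
    child->valid = 1;
    child->next_sibling = current_context->first_child;
    current_context->first_child = child;
    return child;
}

/*
 * Reset a context. The toy marks descendants invalid instead of freeing
 * them, so the caller can inspect the stale pointer safely.
 */
static void
toy_reset(ToyContext *ctx)
{
    for (ToyContext *c = ctx->first_child; c != NULL; c = c->next_sibling)
    {
        toy_reset(c);
        c->valid = 0;
    }
    ctx->first_child = NULL;
}

/*
 * Simulate a final function called once per group for two groups. It
 * creates a cached child context on the first call and looks at it again
 * on the second. Returns 1 if the cached context is still valid then.
 */
int
cached_context_survives(int switch_to_per_query)
{
    ToyContext  per_query = {1, NULL, NULL};
    ToyContext  per_tuple = {1, NULL, NULL};
    ToyContext *cached = NULL;
    int         ok = 1;

    for (int group = 0; group < 2; group++)
    {
        /* The caller invokes us with its per-tuple context current. */
        current_context = &per_tuple;

        if (cached == NULL)
        {
            /* With the flag set, create the child under the long-lived parent. */
            ToyContext *old = toy_switch_to(switch_to_per_query ? &per_query
                                                                : &per_tuple);

            cached = toy_create_child();
            toy_switch_to(old);
        }
        else
            ok = cached->valid;

        /* The caller resets its per-tuple memory between groups. */
        toy_reset(&per_tuple);
    }

    free(cached);
    return ok;
}
```

In this model, cached_context_survives(0) mirrors the crashing case, where
the child is created under the per-tuple context and is stale by the second
group, while cached_context_survives(1) mirrors switching to a per-query
context before creating it.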

Thanks,
Amit

Attachments:

v1-0001-Set-correct-memory-context-in-hypothetical_dense_.patch (text/plain; charset=UTF-8; +20 -1)

#6Andres Freund
andres@anarazel.de
In reply to: Amit Langote (#4)
Re: Server crashed with dense_rank on partition table.

On 2018-06-13 16:35:58 +0900, Amit Langote wrote:

Looking at the core dump of this, it seems the following commit may be
relevant:

commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
Author: Andres Freund <andres@anarazel.de>
Date: Thu Feb 15 21:55:31 2018 -0800

Do execGrouping.c via expression eval machinery, take two.

Andres, with RMT hat on: Andres, this needs looking at ASAP.
Andres, without RMT hat on: Oh, I had first missed it, and then was
distracted reviewing pluggable storage.
Andres, with RMT hat on: that's not really an excuse
Andres, without RMT hat on: sorry, will start looking now.

Greetings,

Andres Freund

#7Andres Freund
andres@anarazel.de
In reply to: Amit Langote (#5)
Re: Server crashed with dense_rank on partition table.

On 2018-07-02 17:14:14 +0900, Amit Langote wrote:

Attached patch seems to fix the crash.

Thanks, that looks correct. Pushed!

- Andres

#8Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Andres Freund (#7)
Re: Server crashed with dense_rank on partition table.

On 2018/07/05 9:40, Andres Freund wrote:

Thanks, that looks correct. Pushed!

Thank you.

Regards,
Amit