[sqlsmith] crashes in RestoreSnapshot on hot standby

Started by Andreas Seltenreichover 9 years ago8 messages
#1Andreas Seltenreich
seltenreich@gmx.de

Running sqlsmith on a streaming slave (master as of f8c5855) is
inconspicuous as long as the master is idle. As soon as I start it on
the master as well, the standby frequently crashes in RestoreSnapshot.
It doesn't seem to be specific to the queries, as they don't trigger a
crash when re-run.

Backtraces always look like the ones below.

regards,
Andreas

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.9826@.core ON doombat ****
Core was generated by `postgres: smith regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020
#2 0x00000000004a934a in heap_beginscan_parallel (relation=0x2060a90, parallel_scan=parallel_scan@entry=0x7f2701d5a0f8) at heapam.c:1657
#3 0x00000000005fbedf in ExecSeqScanInitializeDSM (node=0x1f5b470, pcxt=0x221af88) at nodeSeqscan.c:327
#4 0x00000000005dd0ad in ExecParallelInitializeDSM (planstate=planstate@entry=0x1f5b470, d=d@entry=0x7ffd4ba200d0) at execParallel.c:245
#5 0x00000000005dd425 in ExecInitParallelPlan (planstate=0x1f5b470, estate=estate@entry=0x1f5ab28, nworkers=2) at execParallel.c:477
#6 0x00000000005ef4a4 in ExecGather (node=node@entry=0x1f5b048) at nodeGather.c:159
#7 0x00000000005dda48 in ExecProcNode (node=node@entry=0x1f5b048) at execProcnode.c:515
#8 0x00000000005f4b30 in ExecLimit (node=node@entry=0x1f5acd0) at nodeLimit.c:91
#9 0x00000000005dd9d8 in ExecProcNode (node=node@entry=0x1f5acd0) at execProcnode.c:531
#10 0x00000000005fef7c in ExecSetParamPlan (node=<optimized out>, econtext=0x1f5c138) at nodeSubplan.c:999
#11 0x00000000005e28b5 in ExecEvalParamExec (exprstate=<optimized out>, econtext=<optimized out>, isNull=0x22045b0 "", isDone=<optimized out>) at execQual.c:1135
#12 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x2204200, econtext=0x1f5c138, isNull=0x2203d98 "", isDone=<optimized out>) at execQual.c:2015
#13 0x00000000005de29a in ExecEvalCoalesce (coalesceExpr=<optimized out>, econtext=0x1f5c138, isNull=0x2203d98 "", isDone=<optimized out>) at execQual.c:3446
#14 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x22039e8, econtext=0x1f5c138, isNull=0x7ffd4ba203df "", isDone=<optimized out>) at execQual.c:2015
#15 0x00000000005e4939 in ExecQual (qual=<optimized out>, econtext=econtext@entry=0x1f5c138, resultForNull=resultForNull@entry=0 '\000') at execQual.c:5269
#16 0x00000000005faef1 in ExecResult (node=node@entry=0x1f5c020) at nodeResult.c:82
#17 0x00000000005ddbf8 in ExecProcNode (node=node@entry=0x1f5c020) at execProcnode.c:392
#18 0x00000000005d9c1f in ExecutePlan (dest=0x1ebb7d0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1f5c020, estate=0x1f5ab28) at execMain.c:1567
#19 standard_ExecutorRun (queryDesc=0x1f5a718, direction=<optimized out>, count=0) at execMain.c:338
#20 0x00000000006f7238 in PortalRunSelect (portal=portal@entry=0x1d13be8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x1ebb7d0) at pquery.c:946
#21 0x00000000006f875e in PortalRun (portal=0x1d13be8, count=9223372036854775807, isTopLevel=<optimized out>, dest=0x1ebb7d0, altdest=0x1ebb7d0, completionTag=0x7ffd4ba20840 "") at pquery.c:787
#22 0x00000000006f6003 in exec_simple_query (query_string=<optimized out>) at postgres.c:1094
#23 PostgresMain (argc=30489576, argv=0x1ecfb08, dbname=0x1cf5a00 "regression", username=0x1ecfc20 "\b\373\354\001") at postgres.c:4074
#24 0x000000000046ca67 in BackendRun (port=0x1d17b50) at postmaster.c:4262
#25 BackendStartup (port=0x1d17b50) at postmaster.c:3936
#26 ServerLoop () at postmaster.c:1693
#27 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cf45e0) at postmaster.c:1301
#28 0x000000000046d9cd in main (argc=3, argv=0x1cf45e0) at main.c:228
(gdb) p debug_query_string
$1 = 0x1d68a78 "select \n sample_0.j as c0\nfrom \n public.testjsonb as sample_0 tablesample system (8) \nwhere cast(coalesce(pg_catalog.char_length(\n cast((select comment from public.room limit 1 offset 20)\n as text)),\n pg_catalog.pg_trigger_depth()) as integer) <> 3"

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.8104@.core ON marbit ****
Core was generated by `postgres: bgworker: parallel worker for PID 2610 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1 0x0000000000822032 in RestoreSnapshot (start_address=0x7f7b1ee4fa58 <error: Cannot access memory at address 0x7f7b1ee4fa58>) at snapmgr.c:2020
#2 0x00000000004e6142 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1010
#3 0x0000000000683b62 in StartBackgroundWorker () at bgworker.c:726
#4 0x000000000068ec32 in do_start_bgworker (rw=0x22a8900) at postmaster.c:5535
#5 maybe_start_bgworker () at postmaster.c:5709
#6 0x000000000068f686 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:4971
#7 <signal handler called>
#8 0x00007f7b1e3f3ac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#9 0x000000000046c381 in ServerLoop () at postmaster.c:1657
#10 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x22825e0) at postmaster.c:1301
#11 0x000000000046d9cd in main (argc=3, argv=0x22825e0) at main.c:228

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Michael Paquier
michael.paquier@gmail.com
In reply to: Andreas Seltenreich (#1)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

#1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020

memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
serialized_snapshot->subxcnt * sizeof(TransactionId));
So this is choking here? Is one of those pointers NULL?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

#1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020

memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
serialized_snapshot->subxcnt * sizeof(TransactionId));
So this is choking here? Is one of those pointers NULL?

Theory 1:
If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value. Then if serialized_snapshot->subxcnt

0, we set snapshot->subxip = snapshot->xip +

serialized_snapshot->xcnt (so that's NULL too). Then in line the line
you show we call memcpy(snapshot->subxip, ...). The fix might be
something like the attached.

Theory 2:
The DSM segment was deleted underneath us. We can see that it was not
mapped by the time GDB dumped that (start_address is not accessible).

Theory 3:
Somehow the xcnt or xsubcnt was wrong or the serialized snapshot was
truncated, and we read past the end of the DSM, who knows...

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-subxip.patchapplication/octet-stream; name=fix-subxip.patchDownload
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index dfe2c70..f50fb5c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2016,7 +2016,10 @@ RestoreSnapshot(char *start_address)
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot->subxcnt > 0)
 	{
-		snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+		if (snapshot->xip == NULL)
+			snapshot->subxip = (TransactionId *) (snapshot + 1);
+		else
+			snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
 		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
 			   serialized_snapshot->subxcnt * sizeof(TransactionId));
 	}
#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#3)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de>

wrote:

#1 0x0000000000822032 in RestoreSnapshot

(start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memory at address 0x7f2701d5a110>) at snapmgr.c:2020

memcpy(snapshot->subxip, serialized_xids +

serialized_snapshot->xcnt,

serialized_snapshot->subxcnt * sizeof(TransactionId));
So this is choking here? Is one of those pointers NULL?

Theory 1:
If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value. Then if serialized_snapshot->subxcnt

0, we set snapshot->subxip = snapshot->xip +

serialized_snapshot->xcnt (so that's NULL too). Then in line the line
you show we call memcpy(snapshot->subxip, ...). The fix might be
something like the attached.

I was just typing the mail, when I see this mail. I also reached to the
conclusion that this is the reason of crash. You can see how CopySnapshot
calculates the subxipoff, may be writing code that way will be more
consistent. In case of recovery, I think serialized_snapshot->xcnt will
always be zero as we fill everything in subxip array (refer below code in
GetSnapshotData).

GetSnapshotData()
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.
..
..
}

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#5Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Kapila (#4)
1 attachment(s)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value. Then if serialized_snapshot->subxcnt

0, we set snapshot->subxip = snapshot->xip +

serialized_snapshot->xcnt (so that's NULL too). Then in line the line
you show we call memcpy(snapshot->subxip, ...). The fix might be
something like the attached.

I was just typing the mail, when I see this mail. I also reached to the
conclusion that this is the reason of crash. You can see how CopySnapshot
calculates the subxipoff, may be writing code that way will be more
consistent.

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-subxip-v2.patchapplication/octet-stream; name=fix-subxip-v2.patchDownload
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index dfe2c70..e1caf01 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2016,7 +2016,8 @@ RestoreSnapshot(char *start_address)
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot->subxcnt > 0)
 	{
-		snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+		snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+			serialized_snapshot->xcnt;
 		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
 			   serialized_snapshot->subxcnt * sizeof(TransactionId));
 	}
#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#5)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value. Then if serialized_snapshot->subxcnt

0, we set snapshot->subxip = snapshot->xip +

serialized_snapshot->xcnt (so that's NULL too). Then in line the line
you show we call memcpy(snapshot->subxip, ...). The fix might be
something like the attached.

I was just typing the mail, when I see this mail. I also reached to the
conclusion that this is the reason of crash. You can see how CopySnapshot
calculates the subxipoff, may be writing code that way will be more
consistent.

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

This way it looks better to me. Thanks for the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andreas Seltenreich
seltenreich@gmx.de
In reply to: Amit Kapila (#6)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

Amit Kapila writes:

On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

This way it looks better to me. Thanks for the patch.

I no longer see these crashes when testing with the patch applied.

thanks,
Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Andreas Seltenreich (#7)
Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

On Fri, Jul 1, 2016 at 1:41 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

Amit Kapila writes:

On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

This way it looks better to me. Thanks for the patch.

I no longer see these crashes when testing with the patch applied.

Committed and back-patched to 9.5 where this code was added. Thanks
to all for the report, patch, review, and testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers