PG 12 beta 1 segfault during analyze

Started by Steve Singerover 6 years ago9 messages
#1Steve Singer
steve@ssinger.info

I encountered the following segfault when running against a PG 12 beta1

during a analyze against a table.

#0 0x000056008ad0c826 in update_attstats (vacattrstats=0x0,
natts=2139062143, inh=false,
relid=<error reading variable: Cannot access memory at address
0x40>) at analyze.c:572
#1 do_analyze_rel (onerel=onerel@entry=0x7f0bc59a7a38,
params=params@entry=0x7ffe06aeabb0, va_cols=va_cols@entry=0x0,
acquirefunc=<optimized out>, relpages=8, inh=inh@entry=false,
in_outer_xact=false, elevel=13) at analyze.c:572
#2 0x000056008ad0d2e0 in analyze_rel (relid=<optimized out>,
relation=<optimized out>,
params=params@entry=0x7ffe06aeabb0, va_cols=0x0,
in_outer_xact=<optimized out>, bstrategy=<optimized out>)
at analyze.c:260
#3 0x000056008ad81300 in vacuum (relations=0x56008c4d1110,
params=params@entry=0x7ffe06aeabb0,
bstrategy=<optimized out>, bstrategy@entry=0x0,
isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#4 0x000056008ad8197f in ExecVacuum
(pstate=pstate@entry=0x56008c5c2688, vacstmt=vacstmt@entry=0x56008c3e0428,
isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#5 0x000056008af0133b in standard_ProcessUtility (pstmt=0x56008c982e50,
queryString=0x56008c3df368 "select
\"_disorder_replica\".finishTableAfterCopy(3); analyze
\"disorder\".\"do_inventory\"; ", context=<optimized out>, params=0x0,
queryEnv=0x0, dest=0x56008c9831d8, completionTag=0x7ffe06aeaef0 "")
at utility.c:670
#6 0x000056008aefe112 in PortalRunUtility (portal=0x56008c4515f8,
pstmt=0x56008c982e50, isTopLevel=<optimized out>,
setHoldSnapshot=<optimized out>, dest=<optimized out>,
completionTag=0x7ffe06aeaef0 "") at pquery.c:1175
#7 0x000056008aefec91 in PortalRunMulti
(portal=portal@entry=0x56008c4515f8, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x56008c9831d8, altdest=altdest@entry=0x56008c9831d8,
completionTag=completionTag@entry=0x7ffe06aeaef0 "") at pquery.c:1328
#8 0x000056008aeff9e9 in PortalRun (portal=portal@entry=0x56008c4515f8,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x56008c9831d8,
altdest=altdest@entry=0x56008c9831d8, completionTag=0x7ffe06aeaef0
"") at pquery.c:796
#9 0x000056008aefb6bb in exec_simple_query (
query_string=0x56008c3df368 "select
\"_disorder_replica\".finishTableAfterCopy(3); analyze
\"disorder\".\"do_inventory\"; ") at postgres.c:1215

With master from today(aa087ec64f703a52f3c48c) I still get segfaults
under do_analyze_rel

compute_index_stats (onerel=0x7f84bf1436a8, col_context=0x55a5d3d56640,
numrows=<optimized out>, rows=0x55a5d4039520,
nindexes=<optimized out>, indexdata=0x3ff0000000000000,
totalrows=500) at analyze.c:711
#1 do_analyze_rel (onerel=onerel@entry=0x7f84bf1436a8,
params=0x7ffdde2b5c40, params@entry=0x3ff0000000000000,
va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>,
relpages=11, inh=inh@entry=false, in_outer_xact=true,
elevel=13) at analyze.c:552

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Steve Singer (#1)
Re: PG 12 beta 1 segfault during analyze

Steve Singer <steve@ssinger.info> writes:

I encountered the following segfault when running against a PG 12 beta1
during a analyze against a table.

Nobody else has reported this, so you're going to have to work on
producing a self-contained test case, or else debugging it yourself.

regards, tom lane

#3Steve Singer
steve@ssinger.info
In reply to: Tom Lane (#2)
1 attachment(s)
Re: PG 12 beta 1 segfault during analyze

On 6/15/19 10:18 PM, Tom Lane wrote:

Steve Singer <steve@ssinger.info> writes:

I encountered the following segfault when running against a PG 12 beta1
during a analyze against a table.

Nobody else has reported this, so you're going to have to work on
producing a self-contained test case, or else debugging it yourself.

regards, tom lane

The attached patch fixes the issue.

Steve

Attachments:

heapam_handler.c.difftext/x-patch; name=heapam_handler.c.diffDownload
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b7d2ddbbdcf..fc19f40a2e3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1113,11 +1113,11 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 				 * concurrent transaction never commits.
 				 */
 				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
-					deadrows += 1;
+					*deadrows += 1;
 				else
 				{
 					sample_it = true;
-					liverows += 1;
+					*liverows += 1;
 				}
 				break;
 
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Steve Singer (#3)
Re: PG 12 beta 1 segfault during analyze

Steve Singer <steve@ssinger.info> writes:

The attached patch fixes the issue.

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it. I wonder what test case
you were using, on what platform?

regards, tom lane

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: PG 12 beta 1 segfault during analyze

Hi,

On 2019-06-18 00:32:17 -0400, Tom Lane wrote:

Steve Singer <steve@ssinger.info> writes:

The attached patch fixes the issue.

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it. I wonder what test case
you were using, on what platform?

I suspect that's because the bug is "only" in the
HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
crashing goes in the !TransactionIdIsCurrentTransactionId() case,
because as the tuple is sampled, we just return. And then there still
needs to be an actually dead row afterwards, to actually trigger
dereferencing the modified deadrows. And then acquire_sample_rows()'s
deadrows actually needs to point to something that causes crashes when
modified.

I can definitely get it to do a "wild" pointer write:

Breakpoint 2, heapam_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850,
deadrows=0x7fff56159f50, slot=0x55f8fcb92b40) at /home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:1061
1061 *deadrows += 1;
(gdb) p deadrows
$9 = (double *) 0x7fff56159f50
(gdb) up
#1 0x000055f8fad922c5 in table_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850,
deadrows=0x7fff56159848, slot=0x55f8fcb92b40) at /home/andres/src/postgresql/src/include/access/tableam.h:1467
1467 return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
(gdb) p deadrows
$10 = (double *) 0x7fff56159848

making a question of a crash just a question of the exact stack layout
and the number of deleted tuples.

Will fix tomorrow morning.

Greetings,

Andres Freund

#6Steve Singer
steve@ssinger.info
In reply to: Tom Lane (#4)
Re: PG 12 beta 1 segfault during analyze

On 6/18/19 12:32 AM, Tom Lane wrote:

Steve Singer <steve@ssinger.info> writes:

The attached patch fixes the issue.

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it. I wonder what test case
you were using, on what platform?

regards, tom lane

I was running the slony regression tests.  The crash happened when it
tries to replicate a particular table that already has data in it on the
replica.  It doesn't happen with every table and  I haven't been able to
replicate the crash in as self contained test by manually doing similar
steps to just that table with psql.

This is on x64.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: PG 12 beta 1 segfault during analyze

Andres Freund <andres@anarazel.de> writes:

On 2019-06-18 00:32:17 -0400, Tom Lane wrote:

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it. I wonder what test case
you were using, on what platform?

I suspect that's because the bug is "only" in the
HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
crashing goes in the !TransactionIdIsCurrentTransactionId() case,
because as the tuple is sampled, we just return. And then there still
needs to be an actually dead row afterwards, to actually trigger
dereferencing the modified deadrows. And then acquire_sample_rows()'s
deadrows actually needs to point to something that causes crashes when
modified.

Right, I'd come to the same conclusions last night, but failed to make
a crasher example. Not sure why though, because my first try today
blew up real good:

---
\set N 10

create table bug as select generate_series(1,:N) f1;
delete from bug where f1 = :N;

begin;
delete from bug;
analyze verbose bug;
rollback;

drop table bug;
---

On my machine, N smaller than 10 doesn't do it, but of course that
will be very platform-specific.

Will fix tomorrow morning.

OK. To save you the trouble of "git blame", it looks like
737a292b5de296615a715ddce2b2d83d1ee245c5 introduced this.

regards, tom lane

#8Andres Freund
andres@anarazel.de
In reply to: Steve Singer (#3)
Re: PG 12 beta 1 segfault during analyze

Hi,

On 2019-06-17 21:46:02 -0400, Steve Singer wrote:

On 6/15/19 10:18 PM, Tom Lane wrote:

Steve Singer <steve@ssinger.info> writes:

I encountered the following segfault when running against a PG 12 beta1
during a analyze against a table.

Nobody else has reported this, so you're going to have to work on
producing a self-contained test case, or else debugging it yourself.

regards, tom lane

The attached patch fixes the issue.

Thanks for the bug report, diagnosis and patch. Pushed.

I included a small testcase for ANALYZE running in a transaction that
also modified a few rows, after going back and forth on it for a
while. Seems unlikely that we'll reintroduce this specific bug, but it
seems good to have test coverage a least some of the
HEAPTUPLE_DELETE_IN_PROGRESS path. We currently have none...

I think the testcase would catch the issue at hand on most machines, by
mixing live/dead/deleted-by-current-transaction rows.

Greetings,

Andres Freund

#9Noname
ilmari@ilmari.org
In reply to: Andres Freund (#8)
Re: PG 12 beta 1 segfault during analyze

Andres Freund <andres@anarazel.de> writes:

Hi,

On 2019-06-17 21:46:02 -0400, Steve Singer wrote:

On 6/15/19 10:18 PM, Tom Lane wrote:

Steve Singer <steve@ssinger.info> writes:

I encountered the following segfault when running against a PG 12 beta1
during a analyze against a table.

Nobody else has reported this, so you're going to have to work on
producing a self-contained test case, or else debugging it yourself.

regards, tom lane

The attached patch fixes the issue.

Thanks for the bug report, diagnosis and patch. Pushed.

I was going to suggest trying to prevent similar bugs by declaring these
and other output parameters as `double *const foo` in tableam.h, but
doing that without adding the corresponding `const` in heapam_handler.c
doesn't even raise a warning.

Still, declaring them as *const in both places might serve as an
example/reminder for people writing their own table AMs.

- ilmari
--
"I use RMS as a guide in the same way that a boat captain would use
a lighthouse. It's good to know where it is, but you generally
don't want to find yourself in the same spot." - Tollef Fog Heen