stress test for parallel workers
Does anyone have a stress test for parallel workers ?
On a customer's new VM, I got this several times while (trying to) migrate their DB:
< 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
There's nothing in dmesg nor in postgres logs.
At first I thought it's maybe because of a network disconnection, then I
thought it's because we ran out of space (wal), then they had a faulty blade.
After that, I'd tried and failed to reproduce it a number of times, but it's
just recurred during what was intended to be their final restore. I've set
max_parallel_workers_per_gather=0, but I'm planning to try to diagnose an issue
in another instance. Ideally a minimal test, since I'm apparently going to
have to run under gdb to see how it's dying, or even what process is failing.
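A minimal loop along these lines could be a starting point (a sketch only: the table and index names are invented, it assumes psql can reach the cluster, and min_parallel_table_scan_size is lowered so even a small table gets assigned workers):

```python
# Sketch of a minimal stress test: rebuild an index in a loop with
# parallel maintenance workers forced on.  Table/index names are
# invented; assumes "psql" is on PATH and can reach the cluster.
import subprocess

def stress_sql(iterations):
    # lower the scan-size threshold so even a small table gets workers
    stmts = [
        "SET max_parallel_maintenance_workers = 4;",
        "SET min_parallel_table_scan_size = 0;",
        "DROP TABLE IF EXISTS stress_t;",
        "CREATE TABLE stress_t AS"
        " SELECT g AS i FROM generate_series(1, 1000000) g;",
    ]
    for _ in range(iterations):
        stmts.append("DROP INDEX IF EXISTS stress_idx;")
        stmts.append("CREATE INDEX stress_idx ON stress_t (i);")
    return "\n".join(stmts)

def run(iterations=100):
    # ON_ERROR_STOP makes psql exit nonzero at the first failure
    subprocess.run(["psql", "-v", "ON_ERROR_STOP=1"],
                   input=stress_sql(iterations), text=True, check=True)
```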
DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
CentOS release 6.9 (Final)
Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
Does anyone have a stress test for parallel workers ?
On a customer's new VM, I got this several times while (trying to) migrate their DB:
< 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
We've been seeing this irregularly in the buildfarm, too. I've been
suspicious that it's from an OOM kill on the postmaster in the
buildfarm cases, but ...
There's nothing in dmesg nor in postgres logs.
... you'd think an OOM kill would show up in the kernel log.
(Not necessarily in dmesg, though. Did you try syslog?)
Ideally a minimal test, since I'm apparently going to
have to run under gdb to see how it's dying, or even what process is failing.
Like it told you, it's the postmaster that's going away.
That's Not Supposed To Happen, of course, but unfortunately Linux'
OOM kill heuristic preferentially targets the postmaster when
its children are consuming too many resources.
If that is the problem, there's some info on working around it at
https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
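A sketch of that workaround, with the file path and environment variable names taken from that docs page (needs root, Linux only; the pg_ctl command line is just an example):

```python
# Sketch of the documented Linux OOM workaround: start the postmaster
# with oom_score_adj = -1000 so the kernel won't target it, and set the
# environment variables the postmaster honors so forked children reset
# themselves to a normal score.  Needs root; command line is an example.
import os

OOM_FILE = "/proc/self/oom_score_adj"

def oom_env(base=None):
    env = dict(os.environ if base is None else base)
    env["PG_OOM_ADJUST_FILE"] = OOM_FILE   # children write here...
    env["PG_OOM_ADJUST_VALUE"] = "0"       # ...restoring a normal score
    return env

def start_protected(cmd):
    with open(OOM_FILE, "w") as f:         # root only
        f.write("-1000")
    os.execvpe(cmd[0], cmd, oom_env())

# e.g. start_protected(["pg_ctl", "start", "-D", "/var/lib/pgsql/data"])
```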
regards, tom lane
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
... you'd think an OOM kill would show up in the kernel log.
(Not necessarily in dmesg, though. Did you try syslog?)
Nothing in /var/log/messages (nor dmesg ring).
I enabled abrtd while trying to reproduce it last week. Since you asked I
looked again in messages, and found it'd logged 10 hours ago about this:
(gdb) bt
#0 0x000000395be32495 in raise () from /lib64/libc.so.6
#1 0x000000395be33c75 in abort () from /lib64/libc.so.6
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7 0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8 0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9 0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = <value optimized out>
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = <value optimized out>
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
shutdown = false
checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0,
oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 2141308}
recptr = <value optimized out>
_logSegNo = <value optimized out>
Insert = <value optimized out>
freespace = <value optimized out>
PriorRedoPtr = <value optimized out>
curInsert = <value optimized out>
last_important_lsn = <value optimized out>
vxids = 0x280afb8
nvxids = 0
__func__ = "CreateCheckPoint"
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
ckpt_performed = false
do_restartpoint = <value optimized out>
flags = 128
do_checkpoint = <value optimized out>
now = 1563781930
elapsed_secs = <value optimized out>
cur_timeout = <value optimized out>
rc = <value optimized out>
local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0,
7864320}}}}
checkpointer_context = 0x27e15d0
__func__ = "CheckpointerMain"
Supposedly it's trying to do this:
| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));
And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective). Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.
Justin
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
< 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
... I've set
max_parallel_workers_per_gather=0, ...
Just by the way, parallelism in CREATE INDEX is controlled by
max_parallel_maintenance_workers, not max_parallel_workers_per_gather.
--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = <value optimized out>
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = <value optimized out>
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
Supposedly it's trying to do this:
| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective). Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.
If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno. If the errno is EIO, it's a bit
strange if that's not showing up in some form in kernel logs or dmesg
or something; if it's ENOSPC I guess it'd be normal that it doesn't
show up anywhere and there is nothing in the PostgreSQL logs if
they're on the same full filesystem, but then you would probably
already have mentioned that your filesystem was out of space. Could
it have been fleetingly full due to some other thing happening on the
system that rapidly expands and contracts?
I'm confused by the evidence, though. If this PANIC is the origin of
the problem, how do we get to postmaster-death based exit in a
parallel leader*, rather than quickdie() (the kind of exit that
happens when the postmaster sends SIGQUIT to every process and they
say "terminating connection because of crash of another server
process", because some backend crashed or panicked). Perhaps it would
be clearer what's going on if you could put the PostgreSQL log onto a
different filesystem, so we get a better chance of collecting
evidence? But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened this way more than once. I'm wondering how it could be that
you got some kind of IO failure and weren't able to log the PANIC
message AND your postmaster was killed, and you were able to log a
message about that. Perhaps we're looking at evidence from two
unrelated failures.
*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.
I agree that there's little evidence implicating parallelism directly.
The reason I'm suspicious about a possible OOM kill is that parallel
queries would appear to the OOM killer to be eating more resources
than the same workload non-parallel, so that we might be at more
hazard of getting OOM'd just because of that.
A different theory is that there's some hard-to-hit bug in the
postmaster's processing of parallel workers that doesn't apply to
regular backends. I've looked for one in a desultory way but not
really focused on it.
In any case, the evidence from the buildfarm is pretty clear that
there is *some* connection. We've seen a lot of recent failures
involving "postmaster exited during a parallel transaction", while
the number of postmaster failures not involving that is epsilon.
regards, tom lane
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
edata = <value optimized out>
If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno. ...
Rats. We already saw that it's optimised out so unless we can find
that somewhere else in a variable that's present in the core, we
probably can't find out what the operating system said. So my other
idea for getting this information next time is to try putting the
PostgreSQL logs somewhere that's more likely to be still working when
that thing fails.
--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.
As you saw... unless you know a trick, it's "optimized out".
Could it have been fleetingly full due to some other thing happening on the
system that rapidly expands and contracts?
It's not impossible, especially while loading data, and data_dir is only 64GB;
it may have happened that way sometimes; but it's hard to believe I that's been
the case 5-10 times now. If I don't forget to drop the old database previously
loaded, when loading old/historic data, it should have ~40GB free on data_dir,
and no clients connected other than pg_restore.
$ df -h /var/lib/pgsql
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data-postgres
64G 26G 38G 41% /var/lib/pgsql
| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective). Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.
I realized this probably is *not* an issue with zfs, since it's failing to log
(for one reason or another) to /var/lib/pgsql (ext4).
Perhaps it would be clearer what's going on if you could put the PostgreSQL
log onto a different filesystem, so we get a better chance of collecting
evidence?
I didn't mention it but last weekend I'd left a loop around the restore process
running overnight, and had convinced myself the issue didn't recur since their
faulty blade was taken out of service... My plan was to leave the server
running in the foreground with logging_collector=no, which I hope is enough,
unless logging is itself somehow implicated. I'm trying to stress test that
way now.
But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened this way more than once. I'm wondering how it could be that
you got some kind of IO failure and weren't able to log the PANIC
message AND your postmaster was killed, and you were able to log a
message about that. Perhaps we're looking at evidence from two
unrelated failures.
The messages from the parallel leader (building indices) were visible to the
client, not via the server log. I was loading their data and the errors were
visible when pg_restore failed.
On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
Just by the way, parallelism in CREATE INDEX is controlled by
max_parallel_maintenance_workers, not max_parallel_workers_per_gather.
Thank you.
Justin
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.As you saw..unless someone you know a trick, it's "optimized out".
How about something like this:
print errorData[errordata_stack_depth]
If you can't find errordata_stack_depth, maybe look at the whole array
and try to find the interesting bit?
--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.

As you saw... unless you know a trick, it's "optimized out".
How about something like this:
print errorData[errordata_stack_depth]
Clever.
(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x27b3790 "< %m %u >", lineno = 41745456,
funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0, context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe8800000001 <Address 0xe8800000001 out of bounds>,
detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of bounds>, message_id = 0x0, schema_name = 0x0,
table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030 "origin.c", lineno = 591,
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...
I ought to have remembered that it *was* in fact out of space this AM when this
core was dumped (due to having not touched it since scheduling transition to
this VM last week).
I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
But that gives me an idea: is it possible there's an issue with files being
held opened by worker processes ? Including by parallel workers? Probably
WALs, even after they're rotated ? If there were worker processes holding
opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
obvious after they die, since the space would then be freed.
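That effect is easy to demonstrate: on Linux, an unlinked file's blocks stay allocated until the last descriptor closes, so df can report the filesystem full while ls shows nothing holding the space (a sketch, Linux-only because it peeks at /proc):

```python
# Demonstrate that an unlinked-but-still-open file pins disk space
# (Linux-only: uses /proc).  df counts the blocks until close().
import os
import tempfile

def deleted_but_open():
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * (1 << 20))      # 1 MiB that df still counts
    os.unlink(path)                     # the name is gone...
    link = os.readlink("/proc/self/fd/%d" % fd)
    os.close(fd)                        # ...space freed only here
    return link                         # ends with " (deleted)"

print(deleted_but_open())
```

The same check on a live system would be `lsof +L1` or scanning /proc/*/fd for "(deleted)" targets.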
Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
The fact that you're not finding log output matching what was reported
to the client seems to me to be a mighty strong indication that there
*was* an ENOSPC problem. Can you reconfigure to put the postmaster
log on a different volume?
But that gives me an idea: is it possible there's an issue with files being
held opened by worker processes ? Including by parallel workers? Probably
WALs, even after they're rotated ? If there were worker processes holding
opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
obvious after they die, since the space would then be freed.
Parallel workers aren't ever allowed to write, in the current
implementation, so it's not real obvious why they'd have any
WAL log files open at all.
regards, tom lane
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
I ought to have remembered that it *was* in fact out of space this AM when this
core was dumped (due to having not touched it since scheduling transition to
this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different. Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log? You mentioned that you couldn't see any signs of the OOM
killer. Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system? You might try typing
this into Python:
x = [42]
for i in range(1000):
x = x + x
On my non-Linux system, it ran for a while and then was killed, and
dmesg showed:
pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space
Admittedly it is quite hard to distinguish between a web browser
and a program designed to eat memory as fast as possible... Anyway on
Linux you should see stuff about killed processes and/or OOM in one of
dmesg, syslog, messages.
But that gives me an idea: is it possible there's an issue with files being
held opened by worker processes ? Including by parallel workers? Probably
WALs, even after they're rotated ? If there were worker processes holding
opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
obvious after they die, since the space would then be freed.
Parallel workers don't do anything with WAL files, but they can create
temporary files. If you're building humongous indexes with parallel
workers, you'll get some of those, but I don't think it'd be more than
you'd get without parallelism. If you were using up all of your disk
space with temporary files, wouldn't this be reproducible? I think
you said you were testing this repeatedly, so if that were the problem
I'd expect to see some non-panicky out-of-space errors when the temp
files blow out your disk space, and only rarely a panic if a
checkpoint happens to run exactly at a moment where the create index
hasn't yet written the byte that breaks the camel's back, but the
checkpoint pushes it over edge in one of these places where it panics
on failure.
--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.

I agree that there's little evidence implicating parallelism directly.
The reason I'm suspicious about a possible OOM kill is that parallel
queries would appear to the OOM killer to be eating more resources
than the same workload non-parallel, so that we might be at more
hazard of getting OOM'd just because of that.

A different theory is that there's some hard-to-hit bug in the
postmaster's processing of parallel workers that doesn't apply to
regular backends. I've looked for one in a desultory way but not
really focused on it.

In any case, the evidence from the buildfarm is pretty clear that
there is *some* connection. We've seen a lot of recent failures
involving "postmaster exited during a parallel transaction", while
the number of postmaster failures not involving that is epsilon.
I don't have access to the build farm history in searchable format
(I'll go and ask for that). Do you have an example to hand? Is this
failure always happening on Linux?
--
Thomas Munro
https://enterprisedb.com
On 2019-Jul-23, Justin Pryzby wrote:
I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
I'm not sure that this proves much, since I expect temporary files to be
deleted on failure; by the time you run 'df' the condition might have
already been cleared. You'd need to be capturing diskspace telemetry
with sufficient granularity ...
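Something as simple as this, left running during the restore, would give that granularity (a sketch; the path and sampling interval are arbitrary choices):

```python
# Sketch of fine-grained disk-space telemetry: sample free space on the
# data directory's filesystem every half second, so a transient ENOSPC
# leaves a trace even after the space is released.
import os
import time

def sample_free(path, seconds, interval=0.5, log=print):
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        st = os.statvfs(path)
        free_mb = st.f_bavail * st.f_frsize // (1 << 20)
        line = "%.3f free_mb=%d" % (time.time(), free_mb)
        samples.append(line)
        log(line)
        time.sleep(interval)
    return samples

# e.g. sample_free("/var/lib/pgsql", seconds=3600)
```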
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
I ought to have remembered that it *was* in fact out of space this AM when this
core was dumped (due to having not touched it since scheduling transition to
this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
I meant it wasn't a trivial error on my part of failing to drop the previously
loaded DB instance. It occured to me to check inodes, which can also cause
ENOSPC. This is mkfs -T largefile, so running out of inodes is not an
impossibility. But seems an unlikely culprit, unless something made tens of
thousands of (small) files.
[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres
65536 5605 59931 9% /var/lib/pgsql
Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different. Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log? You mentioned that you couldn't see any signs of the OOM
killer. Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system?
$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
Maximum resident set size (kbytes): 4276
$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB
I wouldn't burn too much more time on it until I can reproduce it. The
failures were all during pg_restore, so checkpointer would've been very busy.
It seems possible for it to notice ENOSPC before the workers do, since they
would be fsyncing WAL, where the checkpointer is fsyncing data.
Admittedly it is quite hard to distinguish between a web browser
and a program designed to eat memory as fast as possible...
Browsers making lots of progress here but still clearly 2nd place.
Justin
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In any case, the evidence from the buildfarm is pretty clear that
there is *some* connection. We've seen a lot of recent failures
involving "postmaster exited during a parallel transaction", while
the number of postmaster failures not involving that is epsilon.
I don't have access to the build farm history in searchable format
(I'll go and ask for that).
Yeah, it's definitely handy to be able to do SQL searches in the
history. I forget whether Dunstan or Frost is the person to ask
for access, but there's no reason you shouldn't have it.
Do you have an example to hand? Is this
failure always happening on Linux?
I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those. To wit, I find that
string in these recent failures (checked all runs in the past 3 months):
sysname | branch | snapshot
-----------+---------------+---------------------
lorikeet | HEAD | 2019-06-16 20:28:25
lorikeet | HEAD | 2019-07-07 14:58:38
lorikeet | HEAD | 2019-07-02 10:38:08
lorikeet | HEAD | 2019-06-14 14:58:24
lorikeet | HEAD | 2019-07-04 20:28:44
lorikeet | HEAD | 2019-04-30 11:00:49
lorikeet | HEAD | 2019-06-19 20:29:27
lorikeet | HEAD | 2019-05-21 08:28:26
lorikeet | REL_11_STABLE | 2019-07-11 08:29:08
lorikeet | REL_11_STABLE | 2019-07-09 08:28:41
lorikeet | REL_12_STABLE | 2019-07-16 08:28:37
lorikeet | REL_12_STABLE | 2019-07-02 21:46:47
lorikeet | REL9_6_STABLE | 2019-07-02 20:28:14
vulpes | HEAD | 2019-06-14 09:18:18
vulpes | HEAD | 2019-06-27 09:17:19
vulpes | HEAD | 2019-07-21 09:01:45
vulpes | HEAD | 2019-06-12 09:11:02
vulpes | HEAD | 2019-07-05 08:43:29
vulpes | HEAD | 2019-07-15 08:43:28
vulpes | HEAD | 2019-07-19 09:28:12
wobbegong | HEAD | 2019-06-09 20:43:22
wobbegong | HEAD | 2019-07-02 21:17:41
wobbegong | HEAD | 2019-06-04 21:06:07
wobbegong | HEAD | 2019-07-14 20:43:54
wobbegong | HEAD | 2019-06-19 21:05:04
wobbegong | HEAD | 2019-07-08 20:55:18
wobbegong | HEAD | 2019-06-28 21:18:46
wobbegong | HEAD | 2019-06-02 20:43:20
wobbegong | HEAD | 2019-07-04 21:01:37
wobbegong | HEAD | 2019-06-14 21:20:59
wobbegong | HEAD | 2019-06-23 21:36:51
wobbegong | HEAD | 2019-07-18 21:31:36
(32 rows)
We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.
So I think I've got to take back the assertion that we've got
some lurking generic problem. This pattern looks way more
like a platform-specific issue. Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.
regards, tom lane
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do you have an example to hand? Is this
failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those. To wit, I find that
string in these recent failures (checked all runs in the past 3 months):

sysname | branch | snapshot
-----------+---------------+---------------------
lorikeet | HEAD | 2019-06-16 20:28:25
lorikeet | HEAD | 2019-07-07 14:58:38
lorikeet | HEAD | 2019-07-02 10:38:08
lorikeet | HEAD | 2019-06-14 14:58:24
lorikeet | HEAD | 2019-07-04 20:28:44
lorikeet | HEAD | 2019-04-30 11:00:49
lorikeet | HEAD | 2019-06-19 20:29:27
lorikeet | HEAD | 2019-05-21 08:28:26
lorikeet | REL_11_STABLE | 2019-07-11 08:29:08
lorikeet | REL_11_STABLE | 2019-07-09 08:28:41
lorikeet | REL_12_STABLE | 2019-07-16 08:28:37
lorikeet | REL_12_STABLE | 2019-07-02 21:46:47
lorikeet | REL9_6_STABLE | 2019-07-02 20:28:14
vulpes | HEAD | 2019-06-14 09:18:18
vulpes | HEAD | 2019-06-27 09:17:19
vulpes | HEAD | 2019-07-21 09:01:45
vulpes | HEAD | 2019-06-12 09:11:02
vulpes | HEAD | 2019-07-05 08:43:29
vulpes | HEAD | 2019-07-15 08:43:28
vulpes | HEAD | 2019-07-19 09:28:12
wobbegong | HEAD | 2019-06-09 20:43:22
wobbegong | HEAD | 2019-07-02 21:17:41
wobbegong | HEAD | 2019-06-04 21:06:07
wobbegong | HEAD | 2019-07-14 20:43:54
wobbegong | HEAD | 2019-06-19 21:05:04
wobbegong | HEAD | 2019-07-08 20:55:18
wobbegong | HEAD | 2019-06-28 21:18:46
wobbegong | HEAD | 2019-06-02 20:43:20
wobbegong | HEAD | 2019-07-04 21:01:37
wobbegong | HEAD | 2019-06-14 21:20:59
wobbegong | HEAD | 2019-06-23 21:36:51
wobbegong | HEAD | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got
some lurking generic problem. This pattern looks way more
like a platform-specific issue. Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.
chipmunk also:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16
I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.
On some systems (depending on sysctl kernel.dmesg_restrict on Linux,
security.bsd.unprivileged_read_msgbuf on FreeBSD etc) you can run
dmesg as a non-root user, and there the OOM killer's footprints or
signaled exit statuses for processes under init might normally be found,
but that seems a bit invasive for the host system (I guess you'd
filter it carefully). Unfortunately it isn't enabled on many common
systems anyway.
Maybe there is a systemd-specific way to get the info we need without
being root?
Another idea: start the postmaster under a subreaper (Linux 3.4+
prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+
procctl(PROC_REAP_ACQUIRE)) that exists just to report on its
children's exit status, so the build farm could see "pid XXX was
killed by signal 9" message if it is nuked by the OOM killer. Perhaps
there is a common subreaper wrapper out there that would wait, print
messages like that, rinse and repeat until it has no children and then
exit, or perhaps pg_ctl or even a perl script could do something like
that if requested. Another thought, not explored, is the brand new
Linux pidfd stuff that can be used to wait and get exit status for a
non-child process (or the older BSD equivalent), but the paint isn't
even dry on that stuff anyway.
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.
Yeah, I've been wondering whether pg_ctl could fork off a subprocess
that would fork the postmaster, wait for the postmaster to exit, and then
report the exit status. Where to report it *to* seems like the hard part,
but maybe an answer that worked for the buildfarm would be enough for now.
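A rough sketch of that fork-wait-report idea (the command line is just an example, and it reports to stderr, which might be enough for the buildfarm case):

```python
# Sketch of a wrapper that runs the postmaster as a child, waits for
# it, and reports how it exited, so an OOM SIGKILL is no longer silent.
# The command line and stderr reporting are examples only.
import signal
import subprocess
import sys

def run_and_report(cmd):
    proc = subprocess.Popen(cmd)
    status = proc.wait()
    if status < 0:                       # negative means killed by signal
        msg = "pid %d killed by signal %d (%s)" % (
            proc.pid, -status, signal.Signals(-status).name)
    else:
        msg = "pid %d exited with status %d" % (proc.pid, status)
    print(msg, file=sys.stderr)
    return msg

# e.g. run_and_report(["postgres", "-D", "/var/lib/pgsql/data"])
```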
regards, tom lane
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.

Yeah, I've been wondering whether pg_ctl could fork off a subprocess
that would fork the postmaster, wait for the postmaster to exit, and then
report the exit status. Where to report it *to* seems like the hard part,
but maybe an answer that worked for the buildfarm would be enough for now.
Oh, right, you don't even need subreaper tricks (I was imagining we
had a double fork somewhere we don't).
Another question is whether the build farm should be setting the Linux
oom score adjust thing.
--
Thomas Munro
https://enterprisedb.com