Parallel pg_dump's error reporting doesn't work worth squat
I was in process of testing the proposed patch for bug #13727,
and I found that at least on my Linux box, this is the behavior
in the failure case without the patch:
$ pg_dump "postgres://postgres:phonypassword@localhost/regression" --jobs=9 -Fd -f testdump
$ echo $?
141
$ ls testdump
toc.dat
That is, the pg_dump process has crashed with a SIGPIPE without printing
any message whatsoever, and without coming anywhere near finishing the
dump.
A bit of investigation says that this is because somebody had the bright
idea that worker processes could report fatal errors back to the master
process instead of just printing them to stderr. So when the workers
fail to establish connections (because of the password problem cited in
#13727), they don't tell me about that. Oh no, they send those errors
back up the pipe to the parent, and then die silently. Meanwhile,
the parent is trying to send them commands, and since it doesn't protect
itself against SIGPIPE on the command pipes, it crashes without ever
reporting anything. If you aren't paying close attention, you wouldn't
even realize you didn't get a completed dump.
Depending on timing, this scheme might accidentally fail to fail, but it
seems fragile as can be. I would bet that it's prone to deadlocks, quite
aside from the SIGPIPE problem. Considering how amazingly ugly the
underlying code is (exit_horribly is in parallel.c now? Really?), I want
to rip it out entirely, not try to band-aid it by suppressing SIGPIPE ---
though likely we need to do that too.
Thoughts?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 12/23/2015 10:16 AM, Tom Lane wrote:
Depending on timing, this scheme might accidentally fail to fail, but it
seems fragile as can be. I would bet that it's prone to deadlocks, quite
aside from the SIGPIPE problem. Considering how amazingly ugly the
underlying code is (exit_horribly is in parallel.c now? Really?), I want
to rip it out entirely, not try to band-aid it by suppressing SIGPIPE ---
though likely we need to do that too.
Thoughts?
This is something that should work, period. It should be a showcase for
the code we ship, because it shows how seriously we take data integrity
(backup/restore, etc.).
+1 for turning it into something irrefutably as close to perfect as
humans can produce.
Sincerely,
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
[ getting back to a complaint I made in December ]
I wrote:
... the pg_dump process has crashed with a SIGPIPE without printing
any message whatsoever, and without coming anywhere near finishing the
dump.
A bit of investigation says that this is because somebody had the bright
idea that worker processes could report fatal errors back to the master
process instead of just printing them to stderr. So when the workers
fail to establish connections (because of the password problem cited in
#13727), they don't tell me about that. Oh no, they send those errors
back up the pipe to the parent, and then die silently. Meanwhile,
the parent is trying to send them commands, and since it doesn't protect
itself against SIGPIPE on the command pipes, it crashes without ever
reporting anything. If you aren't paying close attention, you wouldn't
even realize you didn't get a completed dump.
Attached is a proposed patch for this. It reverts exit_horribly() to
what it used to be pre-9.3, ie just print the message on stderr and die.
The master now notices child failure by observing EOF on the status pipe.
While that works automatically on Unix, we have to explicitly close the
child side of the pipe on Windows (could someone check that that works?).
I also fixed the parent side to ignore SIGPIPE earlier, so that it won't
just die if it was in the midst of sending to the child.
BTW, after having spent the afternoon staring at it, I will assert with
confidence that pg_dump/parallel.c is an embarrassment to the project.
It is chock-full of misleading, out-of-date, and/or outright wrong
comments, useless or even wrong Asserts, ill-designed APIs, code that
works quite differently in Unix and Windows cases without a shred of
commentary about the fact, etc etc. I have mostly resisted the temptation
to do cosmetic cleanup in the attached, but this code really needs some
editorial attention.
regards, tom lane
Attachments:
pg_dump-error-reporting-fix-1.patch (text/x-diff; charset=us-ascii) [+130 -180]
I wrote:
... the pg_dump process has crashed with a SIGPIPE without printing
any message whatsoever, and without coming anywhere near finishing the
dump.
Attached is a proposed patch for this. It reverts exit_horribly() to
what it used to be pre-9.3, ie just print the message on stderr and die.
The master now notices child failure by observing EOF on the status pipe.
While that works automatically on Unix, we have to explicitly close the
child side of the pipe on Windows (could someone check that that works?).
I also fixed the parent side to ignore SIGPIPE earlier, so that it won't
just die if it was in the midst of sending to the child.
Now that we're all back from PGCon ... does anyone wish to review this?
Or have an objection to treating it as a bug fix and patching all branches
back to 9.3?
BTW, after having spent the afternoon staring at it, I will assert with
confidence that pg_dump/parallel.c is an embarrassment to the project.
I've done a bit of work on a cosmetic cleanup patch, but don't have
anything to show yet.
regards, tom lane
Hello,
# Note for convenience for others: The commit fixing the original
# issue is 1aa41e3eae3746e05d0e23286ac740a9a6cee7df.
At Mon, 23 May 2016 13:40:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <1336.1464025230@sss.pgh.pa.us>
I wrote:
... the pg_dump process has crashed with a SIGPIPE without printing
any message whatsoever, and without coming anywhere near finishing the
dump.
Attached is a proposed patch for this. It reverts exit_horribly() to
what it used to be pre-9.3, ie just print the message on stderr and die.
The master now notices child failure by observing EOF on the status pipe.
While that works automatically on Unix, we have to explicitly close the
child side of the pipe on Windows (could someone check that that works?).
I also fixed the parent side to ignore SIGPIPE earlier, so that it won't
just die if it was in the midst of sending to the child.
Now that we're all back from PGCon ... does anyone wish to review this?
Or have an objection to treating it as a bug fix and patching all branches
back to 9.3?
FWIW, it seems to me to be a bug that should be fixed in the back
branches.
I tried this only on Linux (I'll try it on Windows afterward) and
got something like this.
pg_dump: [archiver (db)] pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth: no password supplied
connection to database "postgres" failed: fe_sendauth: no password supplied
pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth: no password supplied
<some repeats>
pg_dump: [parallel archiver] a worker process died unexpectedly
pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth: no password supplied
<some repeats>
Although the error messages from multiple children are cluttered,
that is probably inevitable, and better than having them vanish.
It seems hard to distinguish the meanings of the module names
enclosed in '[]', but that is another issue.
In archive_close_connection, the following change implies that
si->AHX could be NULL there, as the existing non-parallel code
already assumes. But it seems to be set once, to a valid value,
for the lifetime of si (= shutdown_info), before control reaches
that point.
- DisconnectDatabase(si->AHX);
+ if (si->AHX)
+ DisconnectDatabase(si->AHX);
Shouldn't archive_close_connection have an assertion (si->AHX !=
NULL) instead of checking "if (si->AHX)" in each path?
BTW, after having spent the afternoon staring at it, I will assert with
confidence that pg_dump/parallel.c is an embarrassment to the project.
I've done a bit of work on a cosmetic cleanup patch, but don't have
anything to show yet.
regards, tom lane
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
Shouldn't archive_close_connection have an assertion (si->AHX !=
NULL) instead of checking "if (si->AHX)" in each path?
I thought about that but didn't see any really good argument for it.
It'd be making that function dependent on the current behavior of
unrelated code, when it doesn't need to be.
regards, tom lane
At Wed, 25 May 2016 00:15:57 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3149.1464149757@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
Shouldn't archive_close_connection have an assertion (si->AHX !=
NULL) instead of checking "if (si->AHX)" in each path?
I thought about that but didn't see any really good argument for it.
It'd be making that function dependent on the current behavior of
unrelated code, when it doesn't need to be.
It's also fine with me.
I tried it on Windows 7/64. First of all, I was surprised to see
that the following command line gets an error, but that is fine,
since the behavior is consistent with what is written in the manual.
| >pg_dump postgres://horiguti:hoge@localhost/postgres --jobs=9 -Fd -f testdump
| pg_dump: too many command-line arguments (first is "--jobs=9")
| Try "pg_dump --help" for more information.
Next, I got the following behavior from the following command, and
then a freeze. It may be stopping at the same point as in the next
paragraph, but I'm not sure. The same thing occurs with this patch on
top of the current master, but not on Linux.
| >pg_dump -d postgres --jobs=9 -Fd -f testdump
| Password: <correct password>
| pg_dump: [archiver] WARNING: requested compression not available in this installation -- archive will be uncompressed
| pg_dump: [compress_io] not built with zlib support
| pg_dump: [parallel archiver] a worker process died unexpectedly
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
| pg_dump: [archiver (db)] query failed: ERROR: invalid snapshot identifier: "00002299-1"
| pg_dump: [archiver (db)] query was: SET TRANSACTION SNAPSHOT '00002299-1'
Third, I'm not sure of the details, but pg_dump shows the
following message and then freezes until Ctrl-C. I think I had
forgotten to set a password for the user at the time. It doesn't
seem to occur with this patch on top of the current master.
| >pg_dump --jobs=9 -Fd -f testdump "postgres://horiguti:hoge@localhost/postgres"
| pg_dump: [archiver] WARNING: requested compression not available in this installation -- archive will be uncompressed
| pg_dump: [compress_io] not built with zlib support
| pg_dump: [parallel archiver] a worker process died unexpectedly
| ^C
| >
The main thread was stopped at WaitForMultipleObjects() around
parallel.c:361 (master + this patch), but I haven't investigated
it any further.
Finally, I got the following expected result, which seems sane.
| >pg_dump --jobs=9 -Fd -f testdump "postgres://horiguti:hoge@localhost/postgres"
| pg_dump: [archiver] WARNING: requested compression not available in this installation -- archive will be uncompressed
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [parallel archiver] a worker process died unexpectedly
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
| pg_dump: [archiver (db)] connection to database "postgres" failed: fe_sendauth:no password supplied
I might be wrong about something, but pg_dump seems to have some
issues in its thread handling.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
I tried it on Windows 7/64. First of all, I was surprised to see
that the following command line gets an error, but that is fine,
since the behavior is consistent with what is written in the manual.
| >pg_dump postgres://horiguti:hoge@localhost/postgres --jobs=9 -Fd -f testdump
| pg_dump: too many command-line arguments (first is "--jobs=9")
| Try "pg_dump --help" for more information.
Where do you see an example like that? We should fix it. The only case
that is certain to work is switches before non-switch arguments, and so
we should not give any documentation examples in the other order. On a
platform using GNU getopt(), getopt() will rearrange the argv[] array to
make such cases work, but non-GNU getopt() doesn't do that (and I don't
really trust the GNU version not to get confused, either).
I might be wrong about something, but pg_dump seems to have some
issues in its thread handling.
Wouldn't surprise me ... there's a lot of code in there depending on
static variables, and we've only tried to thread-proof a few. Still,
I think that's a separate issue from what this patch is addressing.
regards, tom lane
At Wed, 25 May 2016 10:11:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <24577.1464185488@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
I tried it on Windows 7/64. First of all, I was surprised to see
that the following command line gets an error, but that is fine,
since the behavior is consistent with what is written in the manual.
| >pg_dump postgres://horiguti:hoge@localhost/postgres --jobs=9 -Fd -f testdump
| pg_dump: too many command-line arguments (first is "--jobs=9")
| Try "pg_dump --help" for more information.
Where do you see an example like that? We should fix it.
Sorry for the confusing sentence. The command line came from your
first mail starting this thread :p And by "consistent with the
manual" I meant that the command line results in an error (even if
only on Windows) because it differs from what the documentation
shows. No such example appears in the documentation AFAICS.
/messages/by-id/2458.1450894615@sss.pgh.pa.us
https://www.postgresql.org/docs/9.6/static/app-pgdump.html
The only case
that is certain to work is switches before non-switch arguments, and so
we should not give any documentation examples in the other order. On a
platform using GNU getopt(), getopt() will rearrange the argv[] array to
make such cases work, but non-GNU getopt() doesn't do that (and I don't
really trust the GNU version not to get confused, either).
Yeah, I learned that after reading port/getopt_long.c. But the error
message seems to say nothing about it (at least to me), and
those accustomed to GNU getopt's behavior will fail to get
the point of the message. The documentation also doesn't
explicitly mention the restriction; it is only implied by the
synopsis. How about something like the following
message? (The attached patch looks too dependent on the specific
behavior of our getopt_long.c, but doing it in getopt_long.c
itself also seems wrong..)
| >pg_dump hoge -f
| pg_dump: non-option arguments should not precede options.
I might be wrong about something, but pg_dump seems to have some
issues in its thread handling.
Wouldn't surprise me ... there's a lot of code in there depending on
static variables, and we've only tried to thread-proof a few. Still,
I think that's a separate issue from what this patch is addressing.
Sounds reasonable. I'll look into this further.
I see no other functional problem in this patch.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
pg_dump_errmes.patch (text/x-patch; charset=us-ascii) [+7 -3]
Sounds reasonable. I'll look into this further.
I looked into that and found one problem in the patch.
Next, I got the following behavior from the following command, and
then a freeze. It may be stopping at the same point as in the next
paragraph, but I'm not sure. The same thing occurs with this patch on
top of the current master, but not on Linux.
This occurs in the following steps.
1. One of the workers dies for some reason.
(InitCompressorZlib immediately goes into exit_horribly in this case)
2. The main thread detects in ListenToWorkers that the worker is dead.
3. ListenToWorkers calls exit_horribly then exit_nicely
4. exit_nicely calls archive_close_connection as a callback then
the callback calls ShutdownWorkersHard
5. ShutdownWorkersHard should close the write side of the pipe
but the patch skips it for WIN32.
So the attached patch, applied on top of your patch, fixes that:
pg_dump returns to the command prompt even in that case.
By the way, the reason for the "invalid snapshot identifier" errors is
that some worker threads try to use the snapshot after the
first worker's connection has closed. Some of the workers had
succeeded before the connection closed, and remained listening to
their master, inhibiting termination of the entire process.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
pg_dump-error-reporting-fix-1_plus.patch (text/x-patch; charset=us-ascii) [+1 -1]
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
Next, I got the following behavior from the following command, and
then a freeze. It may be stopping at the same point as in the next
paragraph, but I'm not sure. The same thing occurs with this patch on
top of the current master, but not on Linux.
[ need to close command sockets in ShutdownWorkersHard ]
Hah. I had made that same change in the cosmetic-cleanups patch I was
working on ... but I assumed it wasn't a live bug or we'd have heard
about it.
Pushed along with the comment improvements I'd made to that function.
regards, tom lane
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Wed, 25 May 2016 10:11:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <24577.1464185488@sss.pgh.pa.us>
The only case
that is certain to work is switches before non-switch arguments, and so
we should not give any documentation examples in the other order. On a
platform using GNU getopt(), getopt() will rearrange the argv[] array to
make such cases work, but non-GNU getopt() doesn't do that (and I don't
really trust the GNU version not to get confused, either).
Yeah, I learned that after reading port/getopt_long.c. But the error
message seems to say nothing about it (at least to me), and
those accustomed to GNU getopt's behavior will fail to get
the point of the message. The documentation also doesn't
explicitly mention the restriction; it is only implied by the
synopsis. How about something like the following
message? (The attached patch looks too dependent on the specific
behavior of our getopt_long.c, but doing it in getopt_long.c
itself also seems wrong..)
It's not a bad idea to try to improve our response to this situation,
but I think this patch needs more work. I wonder in particular why
you're not basing the variant error message on the first unprocessed
argument starting with '-', rather than looking at the word before it.
Another thought is that the message is fairly unhelpful because it does
not show where in the arguments things went wrong. Maybe something
more like
if (argv[optind][0] == '-')
fprintf(stderr, _("%s: misplaced option \"%s\": options must appear before non-option arguments\n"),
progname, argv[optind]);
else
// existing message
In any case, if we wanted to do something about this scenario, we should
do it consistently across all our programs, not just pg_dump. I count
25 appearances of that "too many command-line arguments" message...
regards, tom lane
At Thu, 26 May 2016 12:15:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <8273.1464279329@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Wed, 25 May 2016 10:11:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <24577.1464185488@sss.pgh.pa.us>
The only case
that is certain to work is switches before non-switch arguments, and so
we should not give any documentation examples in the other order. On a
platform using GNU getopt(), getopt() will rearrange the argv[] array to
make such cases work, but non-GNU getopt() doesn't do that (and I don't
really trust the GNU version not to get confused, either.
Yeah, I learned that after reading port/getopt_long.c. But the error
message seems to say nothing about it (at least to me), and
those accustomed to GNU getopt's behavior will fail to get
the point of the message. The documentation also doesn't
explicitly mention the restriction; it is only implied by the
synopsis. How about something like the following
message? (The attached patch looks too dependent on the specific
behavior of our getopt_long.c, but doing it in getopt_long.c
itself also seems wrong..)
It's not a bad idea to try to improve our response to this situation,
but I think this patch needs more work. I wonder in particular why
you're not basing the variant error message on the first unprocessed
argument starting with '-', rather than looking at the word before it.
Sorry, that was just carelessness on my part. What you wrote
above is correct, and it is also the cause of my obscure
error message.
Another thought is that the message is fairly unhelpful because it does
not show where in the arguments things went wrong. Maybe something
more like
if (argv[optind][0] == '-')
fprintf(stderr, _("%s: misplaced option \"%s\": options must appear before non-option arguments\n"),
progname, argv[optind]);
else
// existing message
In any case, if we wanted to do something about this scenario, we should
do it consistently across all our programs, not just pg_dump. I count
25 appearances of that "too many command-line arguments" message...
Although I suppose no one has ever complained about that before,
and the same thing will occur in other new programs even if
the existing programs are fixed now..
I'll think about it some more..
Thank you for the suggestion.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 26 May 2016 10:53:47 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <5237.1464274427@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
Next, I got the following behavior from the following command, and
then a freeze. It may be stopping at the same point as in the next
paragraph, but I'm not sure. The same thing occurs with this patch on
top of the current master, but not on Linux.
[ need to close command sockets in ShutdownWorkersHard ]
Hah. I had made that same change in the cosmetic-cleanups patch I was
working on ... but I assumed it wasn't a live bug or we'd have heard
about it.
Pushed along with the comment improvements I'd made to that function.
Thank you!
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
By the way, the reason for the "invalid snapshot identifier" errors is
that some worker threads try to use the snapshot after the
first worker's connection has closed.
... BTW, I don't quite see what the issue is there. The snapshot is
exported from the master session, so errors in worker sessions should not
cause such failures in other workers. And I don't see any such failure
when setting up a scenario that will cause a worker to fail on Linux.
The "invalid snapshot identifier" bleats would make sense if you had
gotten a server-side error (and transaction abort) in the master session,
but I don't see any evidence that that happened in that example. Might be
worth seeing if that's reproducible.
regards, tom lane
I wrote:
BTW, after having spent the afternoon staring at it, I will assert with
confidence that pg_dump/parallel.c is an embarrassment to the project.
I've done a bit of work on a cosmetic cleanup patch, but don't have
anything to show yet.
Attached is the threatened cosmetic cleanup of parallel.c. As I went
through it, I found quite a few things not to like, but in this patch
I've mostly resisted the temptation to fix them immediately, and have
just tried to document what's there more accurately.
Aside from a major amount of comment-rewriting and a very small amount of
cosmetic code adjustment (mostly moving code for more clarity), this
patch changes these things:
* Rename SetupWorker() to RunWorker() to reflect what it actually does,
and remove its unused "worker" argument.
* Rename lockTableNoWait() to lockTableForWorker() for clarity, and move
the test for BLOBS into it instead of having an Assert that the caller
checked that.
* Don't bother with getThreadLocalPQExpBuffer() at all in non-Windows
builds; it was identical to getLocalPQExpBuffer() anyway, except for
being misleadingly named.
* Remove some completely-redundant or otherwise ill-considered Asserts.
* Fix incorrect "Assert(msgsize <= bufsize)" --- should be < bufsize.
* Fix missing socket close in one error exit from pgpipe(). This isn't
too exciting at present since we'll just curl up and die if it fails,
but might as well get it right.
I have some other, less-cosmetic, things I want to do here, but first
does anyone care to review this?
regards, tom lane
Attachments:
parallel-dump-comment-cleanup.patch (text/x-diff; charset=us-ascii) [+565 -449]
Regarding this:
*************** select_loop(int maxFd, fd_set *workerset)
*** 1092,1104 ****
--- 1150,1160 ----
  	fd_set saveSet = *workerset;

  #ifdef WIN32
  	for (;;)
  	{
  		/*
  		 * sleep a quarter of a second before checking if we should terminate.
+ 		 * XXX this certainly looks useless, why not just wait indefinitely?
  		 */
  		struct timeval tv = {0, 250000};
There's another select_loop() in vacuumdb.c suggesting that the timeout
is used to check for cancel requests; as I understood while working on
the vacuumdb changes, select() is not interrupted in that case. I
suppose that means it's necessary here also. But on the other hand it's
quite possible that the original developer just copied what was in
pg_dump and that it's not actually needed; if that's the case, perhaps
it's better to rip it out from both places.
/messages/by-id/20150122174601.GB1663@alvh.no-ip.org
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Regarding this:
+ * XXX this certainly looks useless, why not just wait indefinitely?
There's another select_loop() in vacuumdb.c suggesting that the timeout
is used to check for cancel requests; as I understood while working on
the vacuumdb changes, select() is not interrupted in that case. I
suppose that means it's necessary here also. But on the other hand it's
quite possible that the original developer just copied what was in
pg_dump and that it's not actually needed; if that's the case, perhaps
it's better to rip it out from both places.
Ah, interesting. That ties into something else I was wondering about,
which is how we could get useful control-C cancellation on Windows.
It looks like the vacuumdb.c version of this code actually is tied
into an interrupt handler, but whoever copied it for parallel.c just
ripped out the CancelRequested checks, making the looping behavior
pretty useless.
For pg_restore (parallel or not) it would be useful if the program
didn't just fall over at control-C but actually sent cancel requests
to the backend(s). It's not such a problem if we're transferring
data, but if we're waiting for some slow operation like CREATE INDEX,
the current behavior isn't very good. On the Unix side we have some
SIGINT infrastructure there already, but I don't see any for Windows.
So now I'm thinking we should leave that alone, with the expectation
that we'll be putting CancelRequested checks back in at some point.
Hmm, did the patch you're discussing there get committed?
regards, tom lane
On Sat, May 28, 2016 at 5:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Regarding this:
+ * XXX this certainly looks useless, why not just wait indefinitely?
There's another select_loop() in vacuumdb.c suggesting that the timeout
is used to check for cancel requests; as I understood while working on
the vacuumdb changes, select() is not interrupted in that case. I
suppose that means it's necessary here also. But on the other hand it's
quite possible that the original developer just copied what was in
pg_dump and that it's not actually needed; if that's the case, perhaps
it's better to rip it out from both places.
Ah, interesting. That ties into something else I was wondering about,
which is how we could get useful control-C cancellation on Windows.
It looks like the vacuumdb.c version of this code actually is tied
into an interrupt handler, but whoever copied it for parallel.c just
ripped out the CancelRequested checks, making the looping behavior
pretty useless.
It seems to me that the CancelRequested checks were introduced in
vacuumdb.c as part of commit a1792320, while the select_loop in
parallel.c dates from commit 9e257a18, which was committed earlier.
I think control-C handling for Windows is missing in parallel.c; or
if there is some way it is dealt with, it is clearly not the same as
what we do in vacuumdb.c.
For pg_restore (parallel or not) it would be useful if the program
didn't just fall over at control-C but actually sent cancel requests
to the backend(s). It's not such a problem if we're transferring
data, but if we're waiting for some slow operation like CREATE INDEX,
the current behavior isn't very good. On the Unix side we have some
SIGINT infrastructure there already, but I don't see any for Windows.
So now I'm thinking we should leave that alone, with the expectation
that we'll be putting CancelRequested checks back in at some point.
/messages/by-id/20150122174601.GB1663@alvh.no-ip.org
Hmm, did the patch you're discussing there get committed?
Yes, it was committed - a1792320
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sat, May 28, 2016 at 5:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It looks like the vacuumdb.c version of this code actually is tied
into an interrupt handler, but whoever copied it for parallel.c just
ripped out the CancelRequested checks, making the looping behavior
pretty useless.
It seems to me that CancelRequested checks were introduced in vacuumdb.c as
part of commit a1792320 and select_loop for parallel.c version exists from
commit 9e257a18 which got committed earlier.
Huh, interesting. I wonder how parallel.c's select_loop got to be like
that then? The git history offers no clue: it is essentially the same as
HEAD as far back as the initial commit of parallel.c. It certainly looks
like someone intended to introduce a cancel check and never did, or had
one and took it out without simplifying the rest of the logic.
Anyway, AFAICS the time limit on select() wait is completely useless in
the code as it stands; but we'll likely want to add a cancel check there,
so ripping it out wouldn't be a good plan. I'll change the added comment.
regards, tom lane