pg_basebackup behavior on non-existent slot

Started by Jeff Janes over 8 years ago, 6 messages
#1Jeff Janes
jeff.janes@gmail.com

If I tell pg_basebackup to use a non-existent slot, it immediately reports
an error. And then it exits with an error, but only after streaming the
entire database contents.

If you are doing this interactively and are on the ball, of course, you can
hit ctrl-C when you see the error message.

I don't know if this is exactly a bug, but it seems rather unfortunate.

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
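For illustration, the second option might look roughly like the sketch below. It is not pg_basebackup's actual code: the helper names `child_has_exited` and `demo_poll_failed_child` are hypothetical, the "chunk" loop merely stands in for the loop that receives backup data, and it is POSIX-only (fork/waitpid), so it says nothing about Windows.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: check, without blocking, whether a child has exited.
 * Returns true and fills *status once the child has been reaped; returns
 * false while it is still running (or if waitpid() itself fails).
 */
static bool
child_has_exited(pid_t child, int *status)
{
    assert(child > 0 && status != NULL);
    /* WNOHANG makes waitpid() return 0 at once if the child is still alive. */
    return waitpid(child, status, WNOHANG) == child;
}

/*
 * Demo: the child fails immediately, much as the WAL streaming child does
 * when the slot does not exist. The parent polls between simulated chunks
 * of backup data and stops early. Returns the child's exit code, -1 on
 * failure, or -2 if the child was never seen to exit.
 */
static int
demo_poll_failed_child(void)
{
    const struct timespec chunk_time = {0, 10 * 1000 * 1000};   /* 10 ms */
    int     status = 0;
    pid_t   child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(1);               /* simulated "slot does not exist" failure */

    for (int chunk = 0; chunk < 1000; chunk++)
    {
        if (child_has_exited(child, &status))
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        nanosleep(&chunk_time, NULL);   /* stand-in for receiving one chunk */
    }
    return -2;
}
```

Calling `demo_poll_failed_child()` should return the child's exit code after the first poll or two, instead of running all 1000 simulated chunks.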

$ /usr/local/pgsql9_6/bin/pg_basebackup -D data_replica -P --slot foobar -Xs

pg_basebackup: could not send replication command "START_REPLICATION":
ERROR: replication slot "foobar" does not exist
22384213/22384213 kB (100%), 1/1 tablespace
pg_basebackup: child process exited with error 1
pg_basebackup: removing data directory "data_replica"

Cheers,

Jeff

#2Magnus Hagander
magnus@hagander.net
In reply to: Jeff Janes (#1)
Re: pg_basebackup behavior on non-existent slot

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

If I tell pg_basebackup to use a non-existent slot, it immediately reports
an error. And then it exits with an error, but only after streaming the
entire database contents.

If you are doing this interactively and are on the ball, of course, you
can hit ctrl-C when you see the error message.

I don't know if this is exactly a bug, but it seems rather unfortunate.

I think that should qualify as a bug.

In 10 it will automatically create a transient slot in this case, but there
might still be a case where you can provoke this.

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super
quickly, but we should react. And we should then exit the main process with
an error before actually streaming everything.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

#3Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Magnus Hagander (#2)
Re: pg_basebackup behavior on non-existent slot

Magnus Hagander wrote:

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super
quickly, but we should react.

Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.
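A minimal sketch of that flag-based approach, assuming a POSIX system: the handler only sets a `sig_atomic_t` flag, and the main loop tests it with a plain memory read, calling waitpid() only once the flag is set. The names `child_exited`, `install_sigchld_handler` and `demo_flag_failed_child` are hypothetical and not taken from pg_basebackup.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t child_exited = 0;

static void
sigchld_handler(int signo)
{
    (void) signo;
    child_exited = 1;           /* only set a flag; reap in the main loop */
}

/* Hypothetical helper: install the SIGCHLD handler. Returns 0 on success. */
static int
install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    /* SA_NOCLDSTOP: no signal when the child merely stops or continues. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Demo: the child fails immediately; the parent notices via the flag.
 * Returns the child's exit code, -1 on failure, or -2 if the flag was
 * never seen.
 */
static int
demo_flag_failed_child(void)
{
    const struct timespec chunk_time = {0, 10 * 1000 * 1000};   /* 10 ms */
    int     status = 0;
    pid_t   child;

    child_exited = 0;
    if (install_sigchld_handler() != 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(1);               /* simulated early failure of the child */
    assert(child > 0);          /* parent from here on */

    for (int chunk = 0; chunk < 1000; chunk++)
    {
        if (child_exited)       /* plain memory read, no system call */
        {
            if (waitpid(child, &status, WNOHANG) != child)
                return -1;
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        }
        nanosleep(&chunk_time, NULL);   /* stand-in for receiving one chunk */
    }
    return -2;
}
```

In the normal case the flag is never set, so the loop makes no waitpid() calls at all, which is the saving being argued for here.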

And we should then exit the main process with an error before actually
streaming everything.

Right.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#4Magnus Hagander
magnus@hagander.net
In reply to: Alvaro Herrera (#3)
Re: pg_basebackup behavior on non-existent slot

On Wed, Sep 6, 2017 at 11:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Magnus Hagander wrote:

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super
quickly, but we should react.

Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.

Good point.

So the question is what to do for Windows. I'd rather not have to bring
the whole extra thread and socket emulation stuff into pg_basebackup if it
can be avoided. But I guess we could code up something Windows-specific in
just that one place (since pg_basebackup uses threads rather than processes
on Windows, it's easier than in the backend). I think that means we'd have
to rewrite it to use the async libpq APIs, don't you?

The other option would be to just kill the process from the child thread.
Since they're threads we can do that. However, that will leave us in a
position where we can't clean up from the error (as in remove files/dirs);
not sure that's good?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

#5Jeff Janes
jeff.janes@gmail.com
In reply to: Alvaro Herrera (#3)
Re: pg_basebackup behavior on non-existent slot

On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Magnus Hagander wrote:

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super
quickly, but we should react.

Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.

If we don't want polling by waitpid, then my next thought would be to move
the data copy into another process, then have the main process do nothing
but wait for the first child to exit. If the first to exit is the WAL
receiver, then we must have an error and the data receiver can be killed.
I don't know how to translate that to Windows, however.
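As a rough POSIX-only sketch of this idea: the parent forks both workers and then blocks in waitpid(-1, ...) until whichever exits first. All names here (`supervise_children`, `demo_wal_receiver_fails_first`) are hypothetical, the children are trivial stand-ins for the WAL receiver and the data copy, and how the WAL receiver would be told to stop after a successful copy is left out.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical supervisor: wait for the first child to exit. Returns 0 only
 * if the data copy finishes cleanly first and the WAL receiver then also
 * exits cleanly; returns 1 otherwise, after stopping whatever still runs.
 */
static int
supervise_children(pid_t wal_child, pid_t data_child)
{
    int     status = 0;
    pid_t   first;

    assert(wal_child > 0 && data_child > 0);

    first = waitpid(-1, &status, 0);    /* block until any child exits */

    if (first == data_child && WIFEXITED(status) && WEXITSTATUS(status) == 0)
    {
        /* Data copy succeeded; now wait for the WAL receiver to finish. */
        if (waitpid(wal_child, &status, 0) != wal_child)
            return 1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
    }

    /* Error path: stop and reap whichever child has not been reaped yet. */
    if (first != data_child)
    {
        kill(data_child, SIGTERM);
        waitpid(data_child, NULL, 0);
    }
    if (first != wal_child)
    {
        kill(wal_child, SIGTERM);
        waitpid(wal_child, NULL, 0);
    }
    return 1;
}

/* Demo: the WAL receiver stand-in fails at once; the data copy runs on. */
static int
demo_wal_receiver_fails_first(void)
{
    pid_t   wal_child;
    pid_t   data_child;

    wal_child = fork();
    if (wal_child < 0)
        return -1;
    if (wal_child == 0)
        _exit(1);               /* simulated "slot does not exist" failure */

    data_child = fork();
    if (data_child < 0)
    {
        waitpid(wal_child, NULL, 0);
        return -1;
    }
    if (data_child == 0)
    {
        for (;;)
            pause();            /* simulated long-running data copy */
    }

    return supervise_children(wal_child, data_child);
}
```

Here `demo_wal_receiver_fails_first()` returns 1 as soon as the failing child is reaped, and the long-running copy is terminated rather than allowed to finish.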

Cheers,

Jeff

#6Magnus Hagander
magnus@hagander.net
In reply to: Jeff Janes (#5)
Re: pg_basebackup behavior on non-existent slot

On Tue, Sep 12, 2017 at 7:35 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Magnus Hagander wrote:

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Should the parent process of pg_basebackup be made to respond to SIGCHLD?
Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super
quickly, but we should react.

Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.

If we don't want polling by waitpid, then my next thought would be to move
the data copy into another process, then have the main process do nothing
but wait for the first child to exit. If the first to exit is the WAL
receiver, then we must have an error and the data receiver can be killed.
I don't know how to translate that to Windows, however.

Well, we could do something similar -- run the main process and the
streamer in separate threads on Windows and have a main thread wait on
both. The main thread would have to be in charge of cleanup as well, of
course. But I think that's likely going to be more complicated than using
the non-blocking libpq APIs.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>