pg_basebackup error: replication slot "pg_basebackup_2194" already exists

Started by Ludovic Vaugeois-Pepin over 8 years ago, 10 messages

ludovicvp@gmail.com

over 8 years ago

I ran into the issue described below with 10.0 beta. The error I got is:

pg_basebackup: could not create temporary replication slot
"pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194"
already exists

A race condition? Or maybe I am doing something wrong.

Release:
Name : postgresql10-server
Version : 10.0
Release : beta1PGDG.rhel7

Test Type:
Functional testing of a pacemaker resource agent
(https://github.com/ulodciv/pgha)

Test Detail:
During context/environment setup, pg_basebackup is invoked (in
parallel) from multiple virtual machines. The backups are then started
as asynchronously replicated hot standbys.

Platform:
CentOS 7.3

Installation Method:
yum -y install https://download.postgresql.org/pub/repos/yum/testing/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-1.noarch.rpm
yum -y install postgresql10-server postgresql10-contrib

Platform Detail:

Test Procedure:

Run pg_basebackup simultaneously on multiple hosts against
the same instance, e.g.:

pg_basebackup -h test4 -p 5432 -D /var/lib/pgsql/10/data -U repl1 -Xs

Failure?

E deploylib.deployer_error.DeployerError: postgres@test5: got exit status 1 for:
E pg_basebackup -h test4 -p 5432 -D /var/lib/pgsql/10/data -U repl1 -Xs
E stderr: pg_basebackup: could not create temporary replication slot "pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194" already exists
E pg_basebackup: child process exited with error 1
E pg_basebackup: removing data directory "/var/lib/pgsql/10/data"

Test Results:

Comments:
This seems to be new with 10. I recently began testing the
pacemaker resource agent against PG 10. I never had (or noticed) this
failure with 9.6.1 and 9.6.2.

--
Ludovic

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Kenneth Marshall

ktm@rice.edu

over 8 years ago

In reply to: Ludovic Vaugeois-Pepin (#1)

Re: pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Tue, May 30, 2017 at 09:14:41PM +0200, Ludovic Vaugeois-Pepin wrote:

> I ran into the issue described below with 10.0 beta. The error I got is:
>
> pg_basebackup: could not create temporary replication slot
> "pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194"
> already exists
>
> A race condition? Or maybe I am doing something wrong.
> [...]

Hi,

Version 10 will create a temporary slot for you if one is not specified or
the --no-slot option is not used:

https://www.postgresql.org/docs/10/static/app-pgbasebackup.html
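
As a rough illustration of the --no-slot option mentioned above, here is a minimal Python sketch that builds the pg_basebackup command line from the original report with --no-slot appended, so that no temporary slot is requested. The helper name basebackup_argv is hypothetical; the host, port, data directory, and user are the ones from the report. Skipping the slot sidesteps the name clash, but it also gives up the protection a slot provides against WAL being recycled while the backup is still streaming, so treat it as a workaround rather than a fix.

```python
def basebackup_argv(host, port, datadir, user, use_temp_slot=True):
    """Build a pg_basebackup argument list (hypothetical helper).

    With use_temp_slot=False, --no-slot is appended so that pg_basebackup
    does not ask the server for a temporary "pg_basebackup_<pid>" slot.
    """
    argv = ["pg_basebackup", "-h", host, "-p", str(port),
            "-D", datadir, "-U", user, "-Xs"]
    if not use_temp_slot:
        argv.append("--no-slot")
    return argv


# Same invocation as in the report, but without the temporary slot.
cmd = basebackup_argv("test4", 5432, "/var/lib/pgsql/10/data", "repl1",
                      use_temp_slot=False)
print(" ".join(cmd))
```

The resulting list can be passed as-is to subprocess.run() on each standby host.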

Regards,
Ken


Magnus Hagander

magnus@hagander.net

over 8 years ago

In reply to: Ludovic Vaugeois-Pepin (#1)

Re: pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Tue, May 30, 2017 at 9:14 PM, Ludovic Vaugeois-Pepin <ludovicvp@gmail.com> wrote:

> I ran into the issue described below with 10.0 beta. The error I got is:
>
> pg_basebackup: could not create temporary replication slot
> "pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194"
> already exists
>
> A race condition? Or maybe I am doing something wrong.
> [...]

Hah, that's an interesting failure. In the name of the slot, the 2194 comes
from the pid -- but it's the pid of pg_basebackup.

I assume you're not running the two pg_basebackup processes on the same
machine? Is it predictable when this happens (meaning that the pid value is
actually predictable), or do you have to run it a large number of times
before it happens?
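
To make the failure mode concrete, here is a toy Python model of it (not PostgreSQL code). It assumes only what is stated in this thread: the slot name is "pg_basebackup_" followed by the PID of the client process, and the server rejects a slot name that already exists. FakeServer and client_slot_name are hypothetical names made up for the illustration.

```python
def client_slot_name(client_pid):
    # Naming scheme as described in this thread: a fixed prefix plus the
    # PID of the pg_basebackup *client* process.
    return "pg_basebackup_%d" % client_pid


class FakeServer:
    """Toy stand-in for the primary: slot names must be unique on it."""

    def __init__(self):
        self.slots = set()

    def create_slot(self, name):
        if name in self.slots:
            raise ValueError('replication slot "%s" already exists' % name)
        self.slots.add(name)


server = FakeServer()

# First VM: its pg_basebackup process happens to have PID 2194.
server.create_slot(client_slot_name(2194))

# Second VM: a different machine, but its process has the same PID.
try:
    server.create_slot(client_slot_name(2194))
    collided = False
except ValueError as exc:
    collided = True
    print(exc)
```

PIDs are only unique within one machine, so two clients on different hosts can legitimately produce the same name.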

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Ludovic Vaugeois-Pepin

ludovicvp@gmail.com

over 8 years ago

In reply to: Ludovic Vaugeois-Pepin (#1)

Fwd: pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On May 30, 2017 9:32 PM, "Magnus Hagander" <magnus@hagander.net> wrote:

> On Tue, May 30, 2017 at 9:14 PM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>
>> I ran into the issue described below with 10.0 beta. The error I got is:
>>
>> pg_basebackup: could not create temporary replication slot
>> "pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194"
>> already exists
>>
>> A race condition? Or maybe I am doing something wrong.
>> [...]
>
> Hah, that's an interesting failure. In the name of the slot, the 2194
> comes from the pid -- but it's the pid of pg_basebackup.
>
> I assume you're not running the two pg_basebackup processes on the same machine?

Indeed, I run it from two VMs that were created from the same .ova
(packaged VM).

> Is it predictable when this happens (meaning that the pid value is
> actually predictable), or do you have to run it a large number of
> times before it happens?

I ran into this once, however I have been running tests on 10.0 for a
couple of days or so.

My guess is that the two hosts ended up using the same pid when
running the backup.


Import Notes

Reply to msg id not found: CAAJDx8PJDRDOszwfJ8zQpsKhBGFdvfe70a0og+k9nEzBJsKB9g@mail.gmail.com

Ludovic Vaugeois-Pepin

ludovicvp@gmail.com

over 8 years ago

In reply to: Magnus Hagander (#3)

Re: pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Tue, May 30, 2017 at 9:32 PM, Magnus Hagander <magnus@hagander.net> wrote:

> On Tue, May 30, 2017 at 9:14 PM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>
>> I ran into the issue described below with 10.0 beta. The error I got is:
>>
>> pg_basebackup: could not create temporary replication slot
>> "pg_basebackup_2194": ERROR: replication slot "pg_basebackup_2194"
>> already exists
>>
>> A race condition? Or maybe I am doing something wrong.
>> [...]
>
> Hah, that's an interesting failure. In the name of the slot, the 2194 comes
> from the pid -- but it's the pid of pg_basebackup.
>
> I assume you're not running the two pg_basebackup processes on the same
> machine? Is it predictable when this happens (meaning that the pid value is
> actually predictable), or do you have to run it a large number of times
> before it happens?

Indeed, I run it from two VMs that were created from the same .ova
(packaged VM).
I ran into this once, however I have been running tests on 10.0 for a
couple of days or so.

My guess is that the two hosts ended up using the same pid when
running the backup.


Magnus Hagander

magnus@hagander.net

over 8 years ago

In reply to: Ludovic Vaugeois-Pepin (#5)

Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Wed, May 31, 2017 at 12:20 AM, Ludovic Vaugeois-Pepin <ludovicvp@gmail.com> wrote:

> On Tue, May 30, 2017 at 9:32 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
>> Hah, that's an interesting failure. In the name of the slot, the 2194 comes
>> from the pid -- but it's the pid of pg_basebackup.
>>
>> I assume you're not running the two pg_basebackup processes on the same
>> machine? Is it predictable when this happens (meaning that the pid value is
>> actually predictable), or do you have to run it a large number of times
>> before it happens?
>
> Indeed, I run it from two VMs that were created from the same .ova
> (packaged VM).
> I ran into this once, however I have been running tests on 10.0 for a
> couple of days or so.
>
> My guess is that the two hosts ended up using the same pid when
> running the backup.

Moving this one over to -hackers to discuss the fix, as this is clearly an
issue.

Right now, pg_basebackup will use the pid of the *client* process to
generate its ephemeral slot name. Per this report that seems like it can
definitely be a problem.

One of my first thoughts would be to instead use the pid of the *server* to
do that, as this will be guaranteed to be unique. However, the client can't
access the pid of the server as it is now, and it's the client that has to
create the name.

One way to do that would be to include the pid of the walsender backend in
the reply to IDENTIFY_SYSTEM, and then use that. What do people think of
that idea?

Other suggestions?

I will add this to the 10.0 open item lists.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Magnus Hagander (#6)

Re: Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Wed, May 31, 2017 at 9:22 AM, Magnus Hagander <magnus@hagander.net> wrote:

> Moving this one over to -hackers to discuss the fix, as this is clearly an
> issue.
>
> Right now, pg_basebackup will use the pid of the *client* process to
> generate its ephemeral slot name. Per this report that seems like it can
> definitely be a problem.
>
> One of my first thoughts would be to instead use the pid of the *server* to
> do that, as this will be guaranteed to be unique. However, the client can't
> access the pid of the server as it is now, and it's the client that has to
> create the name.

Yes, something like that sounds like a sensible idea. The system
identifier won't help either.

> One way to do that would be to include the pid of the walsender backend in
> the reply to IDENTIFY_SYSTEM, and then use that. What do people think of
> that idea?
>
> Other suggestions?

Here is a funky idea: add a read-only GUC parameter that reports the
PID of the process, and use the SHOW command with the replication
protocol to get the PID on backend-side.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

over 8 years ago

In reply to: Magnus Hagander (#6)

Re: Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On 2017-05-31 18:22:18 +0200, Magnus Hagander wrote:

> However, the client can't access the pid of the server as it is now,
> and it's the client that has to create the name.

I don't think that's actually true? Doesn't the wire protocol always
include the PID, which is then exposed with PQbackendPID()?

- Andres


Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Andres Freund (#8)

Re: Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Wed, May 31, 2017 at 11:18 AM, Andres Freund <andres@anarazel.de> wrote:

> On 2017-05-31 18:22:18 +0200, Magnus Hagander wrote:
>
>> However, the client can't access the pid of the server as it is now,
>> and it's the client that has to create the name.
>
> I don't think that's actually true? Doesn't the wire protocol always
> include the PID, which is then exposed with PQbackendPID()?

Ah, you are right here. I forgot that this is exposed to the client at
backend startup.
--
Michael


Magnus Hagander

magnus@hagander.net

over 8 years ago

In reply to: Andres Freund (#8)

Re: Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists

On Wed, May 31, 2017 at 8:18 PM, Andres Freund <andres@anarazel.de> wrote:

> On 2017-05-31 18:22:18 +0200, Magnus Hagander wrote:
>
>> However, the client can't access the pid of the server as it is now,
>> and it's the client that has to create the name.
>
> I don't think that's actually true? Doesn't the wire protocol always
> include the PID, which is then exposed with PQbackendPID()?

Oh, you're right. Well, that makes the fix a lot simpler. Will fix in a
minute.
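
For readers following along, the idea can be sketched with a toy Python model (not the actual patch). It assumes the slot name keeps the "pg_basebackup_" prefix and simply swaps the client PID for the PID of the server backend handling the connection. In a real libpq client that value would come from PQbackendPID(); here it is simulated by a counter, and FakeServer, connect, and slot_name_from_backend are hypothetical names.

```python
import itertools


class FakeServer:
    """Toy primary: one backend PID per connection, unique slot names."""

    def __init__(self):
        # Arbitrary starting value; the server hands out a distinct PID
        # to every backend that is alive at the same time.
        self._next_pid = itertools.count(5000)
        self.slots = set()

    def connect(self):
        # Stands in for the backend PID reported to the client at startup.
        return next(self._next_pid)

    def create_slot(self, name):
        if name in self.slots:
            raise ValueError('replication slot "%s" already exists' % name)
        self.slots.add(name)


def slot_name_from_backend(backend_pid):
    # Assumed naming: same prefix as before, but keyed on the server side.
    return "pg_basebackup_%d" % backend_pid


server = FakeServer()
names = []
# Two clients on different VMs that both happen to run as PID 2194.
for client_pid in (2194, 2194):
    backend_pid = server.connect()
    name = slot_name_from_backend(backend_pid)
    server.create_slot(name)  # no clash: backend PIDs differ
    names.append(name)

print(names)
```

Because both PIDs now come from the same machine (the server), the client PID no longer influences the name, and the clash from the original report cannot occur between concurrent backups.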

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>