BDR to ignore table exists error

Started by Nikhilalmost 10 years ago8 messagesgeneral

nikhilsmenon@gmail.com

almost 10 years ago

Hello,

I have a BDR setup with two nodes. If I bring one node down i am seeing that
the replication slot is becoming inactive with below error.

<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
streaming transactions committing after 0/111A91
48, reading WAL from 0/110F03F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
logical decoding found consistent point at 0/110F03
F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
Logical decoding will begin using saved snapshot
.
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
unexpected EOF on standby connection
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.437 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.462 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.096 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.101 ms
<3462016-05-25 23:58:20 GMT%LOG: starting background worker process "bdr
(6288505144157102317,1,16384,)->bdr (628851211361
7339435,2,"
<798462016-05-25 23:58:20 GMT%ERROR: relation "af_npx_device_l3_16_149_10"
already exists

Thanks & Regards,
Nikhil

Martín Marqués

martin@2ndquadrant.com

almost 10 years ago

In reply to: Nikhil (#1)

Re: BDR to ignore table exists error

El 27/05/16 a las 06:33, Nikhil escribió:

Hello,

I have a BDR setup with two nodes. If I bring one node down i am seeing that
the replication slot is becoming inactive with below error.

If you take down one of the nodes of a BDR mesh, the replication slots
from each of the upstream nodes it connects to will switch to inactive.
That's how replication slots work.

<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
streaming transactions committing after 0/111A91
48, reading WAL from 0/110F03F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
logical decoding found consistent point at 0/110F03
F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
Logical decoding will begin using saved snapshot
.
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
unexpected EOF on standby connection

Downstream node got disconnected, which is sensible given that you took
that node down.

<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.437 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.462 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.096 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.101 ms
<3462016-05-25 23:58:20 GMT%LOG: starting background worker process "bdr
(6288505144157102317,1,16384,)->bdr (628851211361
7339435,2,"

It seems you brought up postgres on the downstream node again and it
connected to the replication slot.

<798462016-05-25 23:58:20 GMT%ERROR: relation "af_npx_device_l3_16_149_10"
already exists

I'm not sure what happened here. Does that relation exist?

Run \d+ af_npx_device_l3_16_149_10 with psql on both nodes.

Also, did replication resume? Check with the lag query from the BDR
documentation.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Nikhil

nikhilsmenon@gmail.com

almost 10 years ago

In reply to: Martín Marqués (#2)

Re: BDR to ignore table exists error

Once the node which was down is brought back the replication slot is not
turned active. The reason being replication slot is trying to create a
partition table which already exists. Because of this error replication
slot is stuck in inactive mode. Is there any way to ignore this error?
On 28-May-2016 4:56 PM, "Martín Marqués" <martin@2ndquadrant.com> wrote:

Show quoted text

El 27/05/16 a las 06:33, Nikhil escribió:

Hello,

I have a BDR setup with two nodes. If I bring one node down i am seeing

that

the replication slot is becoming inactive with below error.

If you take down one of the nodes of a BDR mesh, the replication slots
from each of the upstream nodes it connects to will switch to inactive.
That's how replication slots work.

<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
streaming transactions committing after 0/111A91
48, reading WAL from 0/110F03F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
logical decoding found consistent point at 0/110F03
F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
Logical decoding will begin using saved snapshot
.
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
unexpected EOF on standby connection

Downstream node got disconnected, which is sensible given that you took
that node down.

<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.437 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.462 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.096 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG: duration:
0.101 ms
<3462016-05-25 23:58:20 GMT%LOG: starting background worker process "bdr
(6288505144157102317,1,16384,)->bdr (628851211361
7339435,2,"

It seems you brought up postgres on the downstream node again and it
connected to the replication slot.

<798462016-05-25 23:58:20 GMT%ERROR: relation

"af_npx_device_l3_16_149_10"

already exists

I'm not sure what happened here. Does that relation exist?

Run \d+ af_npx_device_l3_16_149_10 with psql on both nodes.

Also, did replication resume? Check with the lag query from the BDR
documentation.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Martín Marqués

martin@2ndquadrant.com

almost 10 years ago

In reply to: Nikhil (#3)

Re: BDR to ignore table exists error

El 28/05/16 a las 08:57, Nikhil escribió:

Once the node which was down is brought back the replication slot is not
turned active. The reason being replication slot is trying to create a
partition table which already exists. Because of this error replication
slot is stuck in inactive mode. Is there any way to ignore this error?

Could you provide the DDL that's run for the patitioning?

Could you provide logs from the other node? (the one where the partition
was created)

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Martín Marqués

martin@2ndquadrant.com

almost 10 years ago

In reply to: Nikhil (#3)

Re: BDR to ignore table exists error

El 28/05/16 a las 08:57, Nikhil escribió:

Once the node which was down is brought back the replication slot is not
turned active. The reason being replication slot is trying to create a
partition table which already exists. Because of this error replication
slot is stuck in inactive mode. Is there any way to ignore this error?

BTW, how did you end up in such a state? Did you create the partition
table skipping ddl locking?

At this point the easiest way out is to drop the table on the node where
it's trying to get applied with bdr_replication off or
skip_ddl_replication on, so the table is dropped locally but not
replicated, and the create table from the slot can be consumed.

The other option is to consume the create table statement from the slot
directly.

Be aware of the dangers of changing the default values for such
parameters, (bdr_replication, skip_ddl_replication, skip_ddl_locking)
and when needed they should be used with special care.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Nikhil

nikhilsmenon@gmail.com

almost 10 years ago

In reply to: Martín Marqués (#5)

Re: BDR to ignore table exists error

Please see my replies inline.

On Sat, May 28, 2016 at 8:08 PM, Martín Marqués <martin@2ndquadrant.com>
wrote:

El 28/05/16 a las 08:57, Nikhil escribió:

Once the node which was down is brought back the replication slot is not
turned active. The reason being replication slot is trying to create a
partition table which already exists. Because of this error replication
slot is stuck in inactive mode. Is there any way to ignore this error?

BTW, how did you end up in such a state? Did you create the partition
table skipping ddl locking?
*Nik>> skip_ddl_locking is set to True in my configuration. As this was
preventing single*

*node from doing DDL operation (if one is down majority is not there for
doing DDL on available node)*

At this point the easiest way out is to drop the table on the node where
it's trying to get applied with bdr_replication off or
skip_ddl_replication on, so the table is dropped locally but not
replicated, and the create table from the slot can be consumed.

The other option is to consume the create table statement from the slot
directly.
Nik>> DDL used is

ERROR: relation "af_npx_l3_16_146_10" already exists
<596802016-05-29 08:53:07 GMT%CONTEXT: during DDL replay of ddl statement:
CREATE TABLE public.af_npx_license_l3_16_146_
10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK
(((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS
(public.af_npx_l3) WITH (oids=OFF)
<554132016-05-29 08:53:07 GMT%LOG: worker process: bdr
(6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
680) exited with exit code 1

Show quoted text

Be aware of the dangers of changing the default values for such
parameters, (bdr_replication, skip_ddl_replication, skip_ddl_locking)
and when needed they should be used with special care.
*Nik>>. The DDL replay is started once the node join back to bdr group. I
think its started from an old check point causing partition already exists
error. Is there any way to ignore replay error ? or ignore DDL errors while
replay ?*

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Martín Marqués

martin@2ndquadrant.com

almost 10 years ago

In reply to: Nikhil (#6)

Re: BDR to ignore table exists error

Hi,

El 29/05/16 a las 06:01, Nikhil escribió:

*Nik>> skip_ddl_locking is set to True in my configuration. As this
was preventing single*

*node from doing DDL operation (if one is down majority is not there
for doing DDL on available node)*

Well, you have to be prepared to deal with burn wounds if you play with
fire. ;)

If you decide to have skip_ddl_locking on you have to be sure all DDLs
happen on one node, else you end up with conflicts like this.

I suggest you find out why the table was already created on the
downstream node (as a forensics task so you can avoid bumping into the
same issue).

Nik>> DDL used is

ERROR: relation "af_npx_l3_16_146_10" already exists
<596802016-05-29 08:53:07 GMT%CONTEXT: during DDL replay of ddl
statement: CREATE TABLE public.af_npx_license_l3_16_146_
10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK
(((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS
(public.af_npx_l3) WITH (oids=OFF)
<554132016-05-29 08:53:07 GMT%LOG: worker process: bdr
(6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
680) exited with exit code 1

On the node where the CREATE TABLE is trying to get applied run this:

BEGIN;
SET LOCAL bdr.skip_ddl_replication TO 'on';
SET LOCAL bdr.skip_ddl_locking TO 'on';
DROP TABLE af_npx_l3_16_146_10;
END;

After that, the DDL that's stuck will get applied and the stream of
changes will continue.

By the looks of what you're dealing with, I wouldn't be surprised if the
replication gets stuck again on another DDL conflict.

I suggest rethinking the locking strategy, because this shows that
there's something fishy there.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Nikhil

nikhilsmenon@gmail.com

almost 10 years ago

In reply to: Martín Marqués (#7)

Re: BDR to ignore table exists error

Thanks a lot Martin for your replies.

On Sun, May 29, 2016 at 11:50 PM, Martín Marqués <martin@2ndquadrant.com>
wrote:

Show quoted text

Hi,

El 29/05/16 a las 06:01, Nikhil escribió:

*Nik>> skip_ddl_locking is set to True in my configuration. As this
was preventing single*

*node from doing DDL operation (if one is down majority is not there
for doing DDL on available node)*

Well, you have to be prepared to deal with burn wounds if you play with
fire. ;)

If you decide to have skip_ddl_locking on you have to be sure all DDLs
happen on one node, else you end up with conflicts like this.

I suggest you find out why the table was already created on the
downstream node (as a forensics task so you can avoid bumping into the
same issue).

Nik>> DDL used is

ERROR: relation "af_npx_l3_16_146_10" already exists
<596802016-05-29 08:53:07 GMT%CONTEXT: during DDL replay of ddl
statement: CREATE TABLE public.af_npx_license_l3_16_146_
10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK
(((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS
(public.af_npx_l3) WITH (oids=OFF)
<554132016-05-29 08:53:07 GMT%LOG: worker process: bdr
(6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
680) exited with exit code 1

On the node where the CREATE TABLE is trying to get applied run this:

BEGIN;
SET LOCAL bdr.skip_ddl_replication TO 'on';
SET LOCAL bdr.skip_ddl_locking TO 'on';
DROP TABLE af_npx_l3_16_146_10;
END;

After that, the DDL that's stuck will get applied and the stream of
changes will continue.

By the looks of what you're dealing with, I wouldn't be surprised if the
replication gets stuck again on another DDL conflict.

I suggest rethinking the locking strategy, because this shows that
there's something fishy there.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services