Removing and re-adding bdr nodes
Hi,
I have a group of 5 bdr nodes and before we can move to production with
them we must demonstrate that we can remove a node from the group and add
that node back later. When I remove a node it stays in the bdr.bdr_nodes
view with status 'k'. If I try to add that node back the node itself errors
saying that it is already part of a bdr group. If I totally remove the
entire database from the node, deleting all of the data directory, and run
initdb on the data directory again and try to add the node to the group I
get errors saying that the other nodes expect this one to use its old sysid
and connect to its old replication slot. I don't understand how the other
nodes are identifying this one by its old information since I removed the
entire data directory and started over.
I saw in another thread that support for removing nodes is not complete but
surely there must be some way to do this even manually. How would one go
about removing ALL traces of an existing node from all of the others so it
was like it never existed before? Any help would be greatly appreciated.
BDR is the perfect solution for our infrastructure's backup and
availability needs, but this one missing capability is keeping us from
rolling it out for production use.
Regards,
Mathew
On 1 May 2015 at 12:40, Mathew Moon <mathew.moon@vipaar.com> wrote:
Hi,
I have a group of 5 bdr nodes and before we can move to production with
them we must demonstrate that we can remove a node from the group and add
that node back later. When I remove a node it stays in the bdr.bdr_nodes
view with status 'k'. If I try to add that node back the node itself errors
saying that it is already part of a bdr group.
That's intended, though the error message needs to be improved.
You can't remove a node then add it back later. When you remove a node,
the remaining nodes are still generating change streams, but they aren't
saving them up for the removed node anymore. So if you remove a node, make
some changes, and add the node back then the node will have a "gap" in its
history, putting it out of sync with all its peers. Changes on the re-added
node could replicate old data to new nodes, changes from new nodes might
not apply on the old re-added node, etc. Worse, if any table structures
have changed then the node can't possibly apply changes or send changes
that can be applied by other nodes.
Once a node is removed you must add a new node to replace it, you can't
re-add a removed node.
There's room for improvement here, but the fundamental limitations mean
we're never going to support simply removing and re-adding nodes. We may be
able to provide a way to clean and resync a node later, but it'll be much
the same thing as dropdb; createdb; and rejoin.
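To make that concrete, the replace-rather-than-rejoin path is just a normal
join from a fresh node. A minimal sketch (the node name and DSNs here are
illustrative placeholders, not taken from any real cluster):

```sql
-- On the fresh replacement node, in the target database:
CREATE EXTENSION btree_gist;
CREATE EXTENSION bdr;

-- Join as a brand-new member; never reuse the removed node's identity.
SELECT bdr.bdr_group_join(
    local_node_name   := 'replacement-node',  -- placeholder name
    node_external_dsn := 'host=newhost port=5432 dbname=mydb',
    join_using_dsn    := 'host=peerhost port=5432 dbname=mydb'
);

-- Block until the join and initial copy complete.
SELECT bdr.bdr_node_join_wait_for_ready();
```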
Note that short of removing a node, you can (a) just shut it down for a
while or (b) pause replay on that node using bdr.bdr_apply_pause() and
bdr.bdr_apply_resume(). While a node is down, other nodes will function
mostly normally, but will be unable to purge WAL required for replaying to
the down/paused node so they'll eventually run out of space in pg_xlog.
They will also be unable to perform DDL, because that requires consensus.
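For reference, option (b) plus a way to watch the WAL retention it causes
might look like this (the pause/resume calls are the BDR functions named
above; the lag query uses the stock 9.4 catalog and functions):

```sql
-- On the node whose apply you want to pause:
SELECT bdr.bdr_apply_pause();
-- ... later, to let it catch up again:
SELECT bdr.bdr_apply_resume();

-- On each peer: how much WAL each slot is forcing the server to retain.
-- The slot for a down or paused peer will show ever-growing retained bytes.
SELECT slot_name, active,
       pg_xlog_location_diff(pg_current_xlog_insert_location(),
                             restart_lsn) AS retained_bytes
FROM pg_catalog.pg_replication_slots;
```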
If I totally remove the entire database from the node, deleting all of the
data directory, and run initdb on the data directory again and try to add
the node to the group I get errors saying that the other nodes expect this
one to use its old sysid and connect to its old replication slot.
That doesn't make sense. Odd. Can you please show the step-by-step process
you used to get that effect, with exact commands run,
exact text of error messages, etc?
I don't understand how the other nodes are identifying this one by its old
information since I removed the entire data directory and started over.
Nor do I. When you remove the datadir you remove the only place the sysid
for that node is stored. Are you certain you ran the join query on the
newly created not-yet-joined node?
I saw in another thread that support for removing nodes is not complete
but surely there must be some way to do this even manually.
Node removal via SQL function calls is supported in 0.9.0. The remaining
work centers mainly on making it more robust under load and on handling
unexpected node loss better.
How would one go about removing ALL traces of an existing node from all of
the others so it was like it never existed before?
Once it's confirmed removed, delete the bdr.bdr_nodes entry with status =
'k'. All replication slots (pg_catalog.pg_replication_slots) should already
be gone.
There should never be any reason to do this though. If you need to do it,
then something is already wrong. A database oid shouldn't get reused, so if
you dropdb and createdb you get a new node identity. The same is true if
you re-initdb. Since re-adding a removed node won't work, there's no reason
to ever remove the record of the node's existence and removal.
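In other words, the only manual cleanup that should ever be needed is the
row delete (plus, defensively, a check for a leftover slot; the slot name
below is a placeholder):

```sql
-- On each remaining node, once the part is confirmed:
DELETE FROM bdr.bdr_nodes WHERE node_status = 'k';

-- Should already be gone; drop it manually only if it lingers.
-- Replace the placeholder with the parted node's actual slot name.
SELECT pg_drop_replication_slot('bdr_16391_<old_sysid>_1_16391__');
```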
Any help would be greatly appreciated. BDR is the perfect solution for our
infrastructure's needs for backup and availability
You might want to consider BDR's single-master UDR mode too, or tools like
Londiste. Don't add multi-master unless you really need it; per the manual,
multi-master BDR introduces significant limitations around how and when you
can do DDL, etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Craig,
Thank you for your reply. Below you will see the entire process that I have
used, along with relevant log output. Something else that I have noticed is
that the status of nodes in bdr.bdr_nodes stays 'i' on the first server
that I initialize (let's call it the primary, and the others backups, for
the sake of clarity). The backups show themselves as status 'r' in
bdr.bdr_nodes, but other nodes show their status as still initializing even
though their logs show that the node in question has 'caught up with
primary'. Anyway, here is what I have:
*Here are all three nodes joined to the group and functioning, as shown
from the "primary" server, i.e. the server the group was created from and
the one the other nodes used to join the group:*
ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145084718544474516,1,16391)
(1 row)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | i           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)
ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2362 | 0/21DA7B50
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2362 | 0/21DA7B50
(2 rows)
*Now here is the same info from the other two nodes in the group. Notice
how the "primary" shows status of 'i' on these while they recognize
themselves as 'r':*
ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145085940943605149,1,16391)
(1 row)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test2
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | r           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)
*And the third node:*
ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145083860389238141,1,16391)
(1 row)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test3
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | i           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)
ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2071 | 0/21C30AC8
 bdr_16391_6145084718544474516_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2071 | 0/21C30AC8
(2 rows)
ballyhoo=#
*Now I remove a node by stopping postgres on that node and executing
"select bdr.bdr_part_by_node_names('{node_name}')" on the "primary" node:*
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test2
(1 row)
ballyhoo=# \q
[root@pg94-test2 pgsql]# service postgres stop
Stopping postgres service: [ OK ]
[root@pg94-test2 pgsql]#
*Now here is the primary:*
ballyhoo=# select bdr.bdr_part_by_node_names('{pg94-test2}');
bdr_part_by_node_names
------------------------
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145085940943605149 |             1 |      16391 | k           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)
ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2362 | 0/21DA7C28
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2363 | 0/21DA8238
(2 rows)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test
(1 row)
*And the removal is confirmed on the third node:*
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145085940943605149 |             1 |      16391 | k           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test3
(1 row)
ballyhoo=#
*This is done to simulate a case where one of the nodes is down for so long
that the WAL accumulating on the other nodes is causing them to run out of
disk space, and/or we need the other nodes to be able to reach consensus
for DDL, a quorum for global sequence allocation, etc., so we remove the
node altogether. Now we want to simulate adding that node back to the
group. Since this cannot actually be done, we remove all of the postgres
data and initialize the node as if it were a totally new server, and expect
the other nodes to view it as such.*
*So here we remove all traces of the old node from the group on the
"primary" first:*
ballyhoo=# delete from bdr.bdr_nodes where node_status='k';
DELETE 1
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(2 rows)
ballyhoo=# select * from pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2362 | 0/21DA7C28
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2364 | 0/21DA8868
(2 rows)
ballyhoo=# select
pg_drop_replication_slot('bdr_16391_6145085940943605149_1_16391__');
pg_drop_replication_slot
--------------------------
(1 row)
ballyhoo=# select * from pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2364 | 0/21DA8868
(1 row)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test
(1 row)
*Now we check up on the other still active node:*
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test3
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  |                           node_local_dsn                           |                        node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(2 rows)
ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2071 | 0/21C31E30
 bdr_16391_6145084718544474516_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2073 | 0/21C333A8
(2 rows)
ballyhoo=# select
pg_drop_replication_slot('bdr_16391_6145085940943605149_1_16391__');
pg_drop_replication_slot
--------------------------
(1 row)
*As you can see, it did require a little bit of cleanup. The replication
slot did have to be removed manually, but I expected this. Now we totally
remove the data directory on the dropped node and initialize it again from
the "primary". I use the following script for this:*
#!/bin/bash
SOURCE=$1
DATADIR=/var/lib/pgsql/9.4-bdr/data
DATABASE=$2
NAME=$(hostname -s)

if [ -z "$2" ] || [ -z "$1" ]; then
    echo "You must provide the source server and database name."
    exit 1
fi

# Stop postgres
service postgres stop
# Back up configs
su - postgres -c "cp ${DATADIR}/{postgresql.conf,pg_hba.conf} ~/"
# Clear data dir
rm -rf ${DATADIR}/*
# Init a new database dir
su - postgres -c "initdb -D $DATADIR"
# Get new configs
su - postgres -c "scp -P22299 ${SOURCE}:${DATADIR}/{postgresql.conf,pg_hba.conf,ca.crt,vipaar.wc.cert,vipaar.wc.key} ${DATADIR}/"
# Start postgres
service postgres start
# Get dump of roles from source
su - postgres -c "ssh -p22299 $SOURCE 'pg_dumpall --globals-only' | psql -U postgres"
# Create extensions and join the group for each database we want to replicate
psql -U postgres -h 127.0.0.1 <<EOF
create database $DATABASE;
\c $DATABASE;
create extension btree_gist;
create extension bdr;
select bdr.bdr_group_join(
    local_node_name := '$NAME',
    node_external_dsn := 'host=$(hostname) port=5432 user=postgres dbname=$DATABASE',
    join_using_dsn := 'host=$SOURCE port=5432 user=postgres dbname=$DATABASE'
);
select bdr.bdr_node_join_wait_for_ready();
EOF
*Everything goes fine as far as importing the database. Once the import is
finished and the newly initialized node tries to connect for replication
the errors start. This is from the log on the server that I was recreating:*
d= p=20358 a=DEBUG: 00000: syncing bdr_nodes and bdr_connections
d= p=20358 a=LOCATION: bdr_init_replica, bdr_init_replica.c:964
d= p=20358 a=DEBUG: 00000: dump and apply finished, preparing for catchup
replay
d= p=20358 a=LOCATION: bdr_init_replica, bdr_init_replica.c:970
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: IDENTIFY_SYSTEM
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d= p=20358 a=DEBUG: 00000: Creating new slot
bdr_16391_6145097170174308100_1_16391__
d= p=20358 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:583
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: CREATE_REPLICATION_SLOT
"bdr_16391_6145097170174308100_1_16391__" LOGICAL bdr
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: searching for logical decoding starting point, starting at
0/21BE1798
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG: 00000:
logical decoding found consistent point at 0/21BE1798
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DETAIL:
There are no running transactions.
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1271
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG: 00000:
exported logical decoding snapshot: "00000802-1" with 0 transaction IDs
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d= p=20358 a=DEBUG: 00000: created replication identifier 2
d= p=20358 a=LOCATION: bdr_create_slot, bdr.c:389
d= p=20358 a=ERROR: 55000: System identification mismatch between
connection and slot
d= p=20358 a=DETAIL: Connection for bdr (6145085940943605149,1,16391,)
resulted in slot on node bdr (6145097170174308100,1,16391,) instead of
expected node
d= p=20358 a=LOCATION: bdr_init_make_other_slots, bdr_init_replica.c:579
d= p=20314 a=LOG: 00000: worker process: bdr db: ballyhoo (PID 20358)
exited with exit code 1
d= p=20314 a=LOCATION: LogChildExit, postmaster.c:3325
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG: 08006:
could not receive data from client: Connection reset by peer
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
pq_recvbuf, pqcomm.c:871
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG: 08006:
unexpected EOF on client connection with an open transaction
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SocketBackend, postgres.c:344
d= p=20314 a=LOG: 00000: starting background worker process "bdr db:
ballyhoo"
d= p=20314 a=LOCATION: do_start_bgworker, postmaster.c:5403
d= p=20372 a=NOTICE: 00000: version "1.0" of extension "btree_gist" is
already installed
d= p=20372 a=LOCATION: ExecAlterExtensionStmt, extension.c:2700
d= p=20372 a=NOTICE: 00000: version "0.9.0.3" of extension "bdr" is
already installed
d= p=20372 a=LOCATION: ExecAlterExtensionStmt, extension.c:2700
d= p=20372 a=DEBUG: 00000: per-db worker for node bdr
(6145097170174308100,1,16391,) starting
d= p=20372 a=LOCATION: bdr_perdb_worker_main, bdr_perdb.c:653
d= p=20372 a=DEBUG: 00000: init_replica init from remote host=
pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
d= p=20372 a=LOCATION: bdr_init_replica, bdr_init_replica.c:808
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: IDENTIFY_SYSTEM
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d= p=20372 a=DEBUG: 00000: found valid replication identifier 2
d= p=20372 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:568
d= p=20372 a=ERROR: 55000: System identification mismatch between
connection and slot
d= p=20372 a=DETAIL: Connection for bdr (6145085940943605149,1,16391,)
resulted in slot on node bdr (6145097170174308100,1,16391,) instead of
expected node
d= p=20372 a=LOCATION: bdr_init_make_other_slots, bdr_init_replica.c:579
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOG: 08006:
could not receive data from client: Connection reset by peer
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
pq_recvbuf, pqcomm.c:871
d= p=20314 a=LOG: 00000: worker process: bdr db: ballyhoo (PID 20372)
exited with exit code 1
d= p=20314 a=LOCATION: LogChildExit, postmaster.c:3325
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
08003: unexpected EOF on client connection
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SocketBackend, postgres.c:355
^C
[root@pg94-test2 data]#
I am at a loss. What do you make of this, Craig?
On Sun, May 3, 2015 at 8:39 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
I just realized this also: the node that was removed and recreated shows
only one replication slot once it is recreated, and that slot has its own
id in it. That appears to be the problem. I have no idea why the slots for
the other two nodes are created correctly when the node is created the
first time, but on the second creation it does this:
======================================================================================================
psql (9.4.1)
Type "help" for help.
ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145097170174308100_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2050 | 0/21BE1798
(1 row)
ballyhoo=#
======================================================================================================
On Mon, May 4, 2015 at 2:26 PM, Mathew Moon <mathew.moon@vipaar.com> wrote:
Show quoted text
Hi Craig,
Thank you for your reply. Below you will see the entire process that I
have used along with relevant log output. Something else that I have
noticed is that the status of nodes in bdr.bdr_nodes stays 'i' on the first
server that I initialize (lets call it primary and the others backups for
sake of clarity). On the backups themselves bdr.bdr_nodes shows themselves
as status 'r', but other nodes show their status as still initializing even
though their logs show that the node in question has 'caught up with
primary'. Anyway here is what I have:*Here are all three nodes joined to the group and functioning as shown
from the "primary" server, which is the server that the group is created
from and the server that the other nodes are joined to the group using:*ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145084718544474516,1,16391)
(1 row)
ballyhoo=# select bdr.bdr_get_local_node_name();
bdr_get_local_node_name
-------------------------
pg94-test
(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | i           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)

ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2362 | 0/21DA7B50
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2362 | 0/21DA7B50
(2 rows)

*Now here is the same info from the other two nodes in the group. Notice
how the "primary" shows their status as still 'i' while they see themselves
as 'r':*

ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145085940943605149,1,16391)
(1 row)

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test2
(1 row)

ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | r           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)

*And the third node:*
ballyhoo=# select bdr.bdr_get_local_nodeid();
bdr_get_local_nodeid
-------------------------------
(6145083860389238141,1,16391)
(1 row)

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test3
(1 row)

ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145085940943605149 |             1 |      16391 | i           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)

ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2071 | 0/21C30AC8
 bdr_16391_6145084718544474516_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2071 | 0/21C30AC8
(2 rows)

ballyhoo=#
*Now I remove a node by stopping postgres on that node and executing
"select bdr.bdr_part_by_node_names('{node_name}')" on the "primary" node:*

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test2
(1 row)

ballyhoo=# \q
[root@pg94-test2 pgsql]# service postgres stop
Stopping postgres service:                                 [  OK  ]
[root@pg94-test2 pgsql]#

*Now here is the primary:*

ballyhoo=# select bdr.bdr_part_by_node_names('{pg94-test2}');
 bdr_part_by_node_names
------------------------

(1 row)
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145085940943605149 |             1 |      16391 | k           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)

ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2362 | 0/21DA7C28
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2363 | 0/21DA8238
(2 rows)

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test
(1 row)

*And the removal is confirmed on the third node:*
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
 6145085940943605149 |             1 |      16391 | k           | pg94-test2 | host=pg94-test2.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(3 rows)

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test3
(1 row)

ballyhoo=#
*This simulates a case where one of the nodes is down for so long that the
WAL accumulating on the other nodes threatens to exhaust their disk space,
and/or we need the remaining nodes to be able to reach a consensus for DDL,
a quorum for global sequence allocation, etc., so we remove the node
altogether. Now we want to simulate adding that node back to the group.
Since this cannot actually be done, we remove all of the postgres data
entirely and initialize the node as if it were a totally new server,
expecting the other nodes to view it as such.*

*So here we first remove all traces of the old node from the group on the
"primary":*

ballyhoo=# delete from bdr.bdr_nodes where node_status='k';
DELETE 1
ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | i           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(2 rows)

ballyhoo=# select * from pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2362 | 0/21DA7C28
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2364 | 0/21DA8868
(2 rows)

ballyhoo=# select pg_drop_replication_slot('bdr_16391_6145085940943605149_1_16391__');
 pg_drop_replication_slot
--------------------------

(1 row)

ballyhoo=# select * from pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145083860389238141_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2364 | 0/21DA8868
(1 row)

ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test
(1 row)

*Now we check up on the other still-active node:*
ballyhoo=# select bdr.bdr_get_local_node_name();
 bdr_get_local_node_name
-------------------------
 pg94-test3
(1 row)

ballyhoo=# select * from bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name  | node_local_dsn                                                     | node_init_from_dsn
---------------------+---------------+------------+-------------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------
 6145084718544474516 |             1 |      16391 | r           | pg94-test  | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo  |
 6145083860389238141 |             1 |      16391 | r           | pg94-test3 | host=pg94-test3.vipaar.com port=5432 user=postgres dbname=ballyhoo | host=pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
(2 rows)

ballyhoo=# select * from pg_catalog.pg_replication_slots;
                slot_name                | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 bdr_16391_6145085940943605149_1_16391__ | bdr    | logical   |  16391 | ballyhoo | f      |      |         2071 | 0/21C31E30
 bdr_16391_6145084718544474516_1_16391__ | bdr    | logical   |  16391 | ballyhoo | t      |      |         2073 | 0/21C333A8
(2 rows)

ballyhoo=# select pg_drop_replication_slot('bdr_16391_6145085940943605149_1_16391__');
 pg_drop_replication_slot
--------------------------

(1 row)
*As you can see, it did require a little bit of cleanup. The replication
slot had to be removed manually, but I expected this. Now we totally remove
the data directory on the dropped node and initialize it again from the
"primary". I use the following script for this:*

#!/bin/bash
SOURCE=$1
DATADIR=/var/lib/pgsql/9.4-bdr/data
DATABASE=$2
NAME=$(hostname -s)

if [ -z "$2" ] || [ -z "$1" ]; then
    echo "You must provide the source server, and database name."
    exit
fi

# stop postgres
service postgres stop

# back up configs
su - postgres -c "cp ${DATADIR}/{postgresql.conf,pg_hba.conf} ~/"

# clear data dir
rm -rf ${DATADIR}/*

# init a new database dir
su - postgres -c "initdb -D $DATADIR"

# get new configs
su - postgres -c "scp -P22299 ${SOURCE}:${DATADIR}/{postgresql.conf,pg_hba.conf,ca.crt,vipaar.wc.cert,vipaar.wc.key} ${DATADIR}/"

# start postgres
service postgres start

# get dump of roles from source
su - postgres -c "ssh -p22299 $SOURCE 'pg_dumpall --globals-only' | psql -U postgres"

# create extensions and group for each database we want to replicate
psql -U postgres -h 127.0.0.1 <<EOF
create database $DATABASE;
\c $DATABASE;
create extension btree_gist;
create extension bdr;
select bdr.bdr_group_join(
    local_node_name := '$NAME',
    node_external_dsn := 'host=$(hostname) port=5432 user=postgres dbname=$DATABASE',
    join_using_dsn := 'host=$SOURCE port=5432 user=postgres dbname=$DATABASE'
);
SELECT bdr.bdr_node_join_wait_for_ready();
EOF

*Everything goes fine as far as importing the database. Once the import is
finished and the newly initialized node tries to connect for replication,
the errors start. This is from the log on the server I was recreating:*

d= p=20358 a=DEBUG: 00000: syncing bdr_nodes and bdr_connections
d= p=20358 a=LOCATION: bdr_init_replica, bdr_init_replica.c:964
d= p=20358 a=DEBUG: 00000: dump and apply finished, preparing for catchup
replay
d= p=20358 a=LOCATION: bdr_init_replica, bdr_init_replica.c:970
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: IDENTIFY_SYSTEM
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d= p=20358 a=DEBUG: 00000: Creating new slot
bdr_16391_6145097170174308100_1_16391__
d= p=20358 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:583
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: CREATE_REPLICATION_SLOT
"bdr_16391_6145097170174308100_1_16391__" LOGICAL bdr
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: searching for logical decoding starting point, starting at
0/21BE1798
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG:
00000: logical decoding found consistent point at 0/21BE1798
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot DETAIL:
There are no running transactions.
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1271
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG:
00000: exported logical decoding snapshot: "00000802-1" with 0 transaction
IDs
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d= p=20358 a=DEBUG: 00000: created replication identifier 2
d= p=20358 a=LOCATION: bdr_create_slot, bdr.c:389
d= p=20358 a=ERROR: 55000: System identification mismatch between
connection and slot
d= p=20358 a=DETAIL: Connection for bdr (6145085940943605149,1,16391,)
resulted in slot on node bdr (6145097170174308100,1,16391,) instead of
expected node
d= p=20358 a=LOCATION: bdr_init_make_other_slots, bdr_init_replica.c:579
d= p=20314 a=LOG: 00000: worker process: bdr db: ballyhoo (PID 20358)
exited with exit code 1
d= p=20314 a=LOCATION: LogChildExit, postmaster.c:3325
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG:
08006: could not receive data from client: Connection reset by peer
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
pq_recvbuf, pqcomm.c:871
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOG:
08006: unexpected EOF on client connection with an open transaction
d=ballyhoo p=20371 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SocketBackend, postgres.c:344
d= p=20314 a=LOG: 00000: starting background worker process "bdr db:
ballyhoo"
d= p=20314 a=LOCATION: do_start_bgworker, postmaster.c:5403
d= p=20372 a=NOTICE: 00000: version "1.0" of extension "btree_gist" is
already installed
d= p=20372 a=LOCATION: ExecAlterExtensionStmt, extension.c:2700
d= p=20372 a=NOTICE: 00000: version "0.9.0.3" of extension "bdr" is
already installed
d= p=20372 a=LOCATION: ExecAlterExtensionStmt, extension.c:2700
d= p=20372 a=DEBUG: 00000: per-db worker for node bdr
(6145097170174308100,1,16391,) starting
d= p=20372 a=LOCATION: bdr_perdb_worker_main, bdr_perdb.c:653
d= p=20372 a=DEBUG: 00000: init_replica init from remote host=
pg94-test.vipaar.com port=5432 user=postgres dbname=ballyhoo
d= p=20372 a=LOCATION: bdr_init_replica, bdr_init_replica.c:808
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
00000: received replication command: IDENTIFY_SYSTEM
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
exec_replication_command, walsender.c:1291
d= p=20372 a=DEBUG: 00000: found valid replication identifier 2
d= p=20372 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:568
d= p=20372 a=ERROR: 55000: System identification mismatch between
connection and slot
d= p=20372 a=DETAIL: Connection for bdr (6145085940943605149,1,16391,)
resulted in slot on node bdr (6145097170174308100,1,16391,) instead of
expected node
d= p=20372 a=LOCATION: bdr_init_make_other_slots, bdr_init_replica.c:579
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOG:
08006: could not receive data from client: Connection reset by peer
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
pq_recvbuf, pqcomm.c:871
d= p=20314 a=LOG: 00000: worker process: bdr db: ballyhoo (PID 20372)
exited with exit code 1
d= p=20314 a=LOCATION: LogChildExit, postmaster.c:3325
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot DEBUG:
08003: unexpected EOF on client connection
d=ballyhoo p=20373 a=bdr (6145097170174308100,1,16391,):mkslot LOCATION:
SocketBackend, postgres.c:355
^C
[root@pg94-test2 data]#

I am at a loss. What do you make of this, Craig?
On Sun, May 3, 2015 at 8:39 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 1 May 2015 at 12:40, Mathew Moon <mathew.moon@vipaar.com> wrote:

Hi,
I have a group of 5 bdr nodes and before we can move to production with
them we must demonstrate that we can remove a node from the group and add
that node back later. When I remove a node it stays in the bdr.bdr_nodes
view with status 'k'. If I try to add that node back the node itself errors
saying that it is already part of a bdr group.

That's intended, though the error message needs to be improved.

You can't remove a node and then add it back later. When you remove a node,
the remaining nodes keep generating change streams, but they no longer save
them up for the removed node. So if you remove a node, make some changes,
and add the node back, the node will have a "gap" in its history, putting
it out of sync with all its peers. Changes on the re-added node could
replicate old data to new nodes, changes from new nodes might not apply on
the old re-added node, etc. Worse, if any table structures have changed,
the node can't possibly apply changes or send changes that can be applied
by other nodes.

Once a node is removed you must add a new node to replace it; you can't
re-add a removed node.

There's room for improvement here, but the fundamental limitations mean
we're never going to support simply removing and re-adding nodes. We may be
able to provide a way to clean and resync a node later, but it'll be much
the same thing as dropdb; createdb; and rejoin.

Note that short of removing a node, you can (a) just shut it down for a
while, or (b) pause replay on that node using bdr.bdr_apply_pause() and
bdr.bdr_apply_resume(). While a node is down, the other nodes will function
mostly normally, but they will be unable to purge WAL required for
replaying to the down/paused node, so they'll eventually run out of space
in pg_xlog. They will also be unable to perform DDL, because that requires
consensus.

If I totally remove the entire database from the node, deleting all of
the data directory, and run initdb on the data directory again and try to
add the node to the group I get errors saying that the other nodes expect
this one to use its old sysid and connect to its old replication slot.

That doesn't make sense. Odd. Can you please show the step-by-step process
you used to get that effect, with the exact commands run, the exact text of
the error messages, etc.?

I don't understand how the other nodes are identifying this one by its
old information since I removed the entire data directory and started over.

Nor do I. When you remove the datadir, you remove the only place the sysid
for that node is stored. Are you certain you ran the join query on the
newly created, not-yet-joined node?

I saw in another thread that support for removing nodes is not complete
but surely there must be some way to do this even manually.

Node removal by SQL function calls is supported in 0.9.0. The remaining
work centers mainly on making it more robust under load and handling
unexpected node loss better.

How would one go about removing ALL traces of an existing node from all
of the others so it was like it never existed before?

Once it's confirmed removed, delete the bdr.bdr_nodes entry with status =
'k'. All replication slots (pg_catalog.pg_replication_slots) should
already be gone.

There should never be any reason to do this, though. If you need to do it,
then something is already wrong. A database oid shouldn't get reused, so if
you dropdb and createdb you get a new node identity. The same is true if
you re-initdb. Since re-adding a removed node won't work, there's no reason
to ever remove the record of the node's existence and removal.

Any help would be greatly appreciated. BDR is the perfect solution for
our infrastructure's needs for backup and availability.

You might want to consider BDR's single-master UDR mode too, or tools like
Londiste. Don't add multi-master unless you really need it. Significant
limitations are introduced around how and when you can do DDL, etc., when
doing multi-master BDR, per the manual.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
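To pull together the procedure Craig describes, here is a rough SQL sketch
of a clean node removal, using the node and slot names from this thread's
example cluster (a sketch only; verify against pg_replication_slots on your
own cluster before dropping anything):

```sql
-- On any surviving node, part the node that is being removed:
SELECT bdr.bdr_part_by_node_names('{pg94-test2}');

-- Then, once the part is confirmed (node_status = 'k'), on EACH surviving
-- node, remove the record of the killed node:
DELETE FROM bdr.bdr_nodes WHERE node_status = 'k';

-- And drop its replication slot if one was left behind. The slot name here
-- is the one from this thread's example; look it up first with:
--   SELECT slot_name, active FROM pg_catalog.pg_replication_slots;
SELECT pg_drop_replication_slot('bdr_16391_6145085940943605149_1_16391__');
```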
Hi,
I wanted to "bump" this (is that OK on this list?). As I mentioned, I now
see that the error in the log probably indicates that the node I am
"reinitializing" is creating a slot that points to its own old sysid.
Checking manually, I see that this slot does in fact exist. I still have no
idea why it creates this incorrect slot and tries to connect to it. If
anyone has any insight into this, it would be greatly appreciated.

Also, does anyone have experience using repmgr for streaming replication
(log-shipping standby)? If so, what do you think of it? Is it a good tool,
or is there something better for the same use case that repmgr fulfills?
Regards,
Mathew
Sent from my iPhone
On 17 May 2015 at 04:14, Mathew Moon <mathew.moon@vipaar.com> wrote:
Hi,
Wanted to "bump" this (Is that ok in this list?). Like I mentioned, I see
now that the error I am seeing in the log possibly indicates that the node I
am "reinitializing" is creating a slot that points to its own old sysid.
Checking manually I see that this slot does in fact exist. I still have no
idea why it is creating this incorrect slot and trying to connect to it. If
anyone has any insight into this it would be greatly appreciated.
I fixed a bug just after the 0.9.2 release that is the most likely
explanation for this: BDR wasn't correctly ignoring bdr.bdr_connections
entries for 'k'illed nodes when joining a new node.
Also, does anyone have experience using repmgr for streaming replication
(Log shipping standby)? If so what do you think about it?
2ndQuadrant maintains repmgr. I think it's a decent tool for the job.
Is it a good tool
or is there something better for the same use case that repmgr fulfills.
It's moderately easy to just do it by hand if you don't want to use
repmgr. I haven't looked for alternative tools for this use case.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
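For the "do it by hand" route Craig mentions, a minimal sketch for building
a streaming standby on PostgreSQL 9.4 looks something like the following.
The host name, data directory, service name, and replication role are
placeholders, not from this thread; it also assumes the primary already has
a replication-enabled role and a matching pg_hba.conf "replication" entry:

```shell
# Sketch: clone a streaming standby by hand with pg_basebackup.
PRIMARY=primary.example.com          # placeholder primary host
DATADIR=/var/lib/pgsql/9.4/data      # placeholder data directory

service postgres stop
rm -rf "${DATADIR:?}"/*

# -X stream fetches the WAL needed during the copy; -R writes a
# recovery.conf with standby_mode and primary_conninfo filled in.
su - postgres -c "pg_basebackup -h $PRIMARY -U replication -D $DATADIR -X stream -R -P"

service postgres start
```

repmgr essentially automates these steps plus registration and failover
management, so hand-rolling is mostly worthwhile when you want full control
over the clone and promotion process.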
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general