[BDR] Node Join Question
To the wonderful BDR folks,
I have a 5 node pg+bdr cluster that I've spun up. I have verified that each
node can connect via psql to each other node. I am having an issue creating
the bdr group and having the nodes join each other.
The steps I am taking are as follows (I used pg compiled using getting
started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for my
purposes)
5. create extension bdr
6. on node 0 I am doing:
SELECT bdr.bdr_group_create(
local_node_name := 'pgbdr0',
node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
SELECT bdr.bdr_group_join(
local_node_name := 'pgbdr1',
node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
dbname=pgbdr',
join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
);
Sometimes this works and all nodes are joined, but it isn't consistent and
I'm not sure at the moment how to debug why.
From the most recent attempt on node 0:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status | node_name
| node_local_dsn | node_init_from_dsn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
6147857338297172162 | 1 | 16908 | r | rdpg0
| host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)
From the most recent attempt on node 1:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status | node_name
| node_local_dsn |
node_init_from_d
sn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
6147857345042854082 | 1 | 16908 | i | rdpg1
| host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=po
stgres dbname=pgbdr
(1 row)
I have pasted some relevant logs from node 0 further below.
Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually
just hang.
How can I debug why the nodes are not connecting to each other find out
where they getting stuck trying to join?
~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
[171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857345042854082_1_16908__"
LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There are
no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857344392790209_1_16908__"
LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340000055490_1_16908__"
LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340923527359_1_16908__"
LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG: 00000: started 0 elections
d= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG: 00000: started 0 votes
d= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG: 00000: tallied 0 elections
d= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG: 00000: checked 0 sequences for filling
d= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2
transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:
Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
Also,
Is there a way to remove these things from the init target node easier?
d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr
(6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots
corresponding to this node from the init target node then drop and recreate
this database and try again
On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@gmail.com>
wrote:
To the wonderful BDR folks,
I have a 5 node pg+bdr cluster that I've spun up. I have verified that
each node can connect via psql to each other node. I am having an issue
creating the bdr group and having the nodes join each other.The steps I am taking are as follows (I used pg compiled using getting
started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for
my purposes)
5. create extension bdr
6. on node 0 I am doing:
SELECT bdr.bdr_group_create(
local_node_name := 'pgbdr0',
node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
SELECT bdr.bdr_group_join(
local_node_name := 'pgbdr1',
node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
dbname=pgbdr',
join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);Sometimes this works and all nodes are joined, but it isn't consistent and
I'm not sure at the moment how to debug why.From the most recent attempt on node 0:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status |
node_name | node_local_dsn |
node_init_from_dsn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
6147857338297172162 | 1 | 16908 | r | rdpg0
| host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)From the most recent attempt on node 1:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status |
node_name | node_local_dsn |
node_init_from_d
sn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
6147857345042854082 | 1 | 16908 | i | rdpg1
| host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=po
stgres dbname=pgbdr
(1 row)I have pasted some relevant logs from node 0 further below.
Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will
usually just hang.How can I debug why the nodes are not connecting to each other find out
where they getting stuck trying to join?~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
[171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857345042854082_1_16908__"
LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There are
no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857344392790209_1_16908__"
LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340000055490_1_16908__"
LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340923527359_1_16908__"
LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG: 00000: started 0 elections
d= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG: 00000: started 0 votes
d= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG: 00000: tallied 0 elections
d= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG: 00000: checked 0 sequences for filling
d= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2
transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:
Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
--
~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
Also, what is the proper way to locate and remove these replication slots
as mentioned in this log?
On Mon, May 11, 2015 at 11:36 PM, Wayne E. Seguin <wayneeseguin@gmail.com>
wrote:
Also,
Is there a way to remove these things from the init target node easier?
d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr
(6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots
corresponding to this node from the init target node then drop and recreate
this database and try againOn Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@gmail.com>
wrote:To the wonderful BDR folks,
I have a 5 node pg+bdr cluster that I've spun up. I have verified that
each node can connect via psql to each other node. I am having an issue
creating the bdr group and having the nodes join each other.The steps I am taking are as follows (I used pg compiled using getting
started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for
my purposes)
5. create extension bdr
6. on node 0 I am doing:
SELECT bdr.bdr_group_create(
local_node_name := 'pgbdr0',
node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
SELECT bdr.bdr_group_join(
local_node_name := 'pgbdr1',
node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
dbname=pgbdr',
join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);Sometimes this works and all nodes are joined, but it isn't consistent
and I'm not sure at the moment how to debug why.From the most recent attempt on node 0:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status |
node_name | node_local_dsn |
node_init_from_dsn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
6147857338297172162 | 1 | 16908 | r | rdpg0
| host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)From the most recent attempt on node 1:
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status |
node_name | node_local_dsn |
node_init_from_d
sn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
6147857345042854082 | 1 | 16908 | i | rdpg1
| host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=po
stgres dbname=pgbdr
(1 row)I have pasted some relevant logs from node 0 further below.
Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will
usually just hang.How can I debug why the nodes are not connecting to each other find out
where they getting stuck trying to join?~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
[171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857345042854082_1_16908__"
LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There
are no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857344392790209_1_16908__"
LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340000055490_1_16908__"
LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1
transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:
Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
received replication command: CREATE_REPLICATION_SLOT
"bdr_16908_6147857340923527359_1_16908__"
LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG: 00000: started 0 elections
d= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG: 00000: started 0 votes
d= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG: 00000: tallied 0 elections
d= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG: 00000: checked 0 sequences for filling
d= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2
transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:
Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006:
unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
SocketBackend, postgres.c:344--
~WayneWayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
--
~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
On 12 May 2015 at 14:33, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
SELECT bdr.bdr_group_join(
local_node_name := 'pgbdr1',
node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
dbname=pgbdr',
join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);
At a guess you're probably not waiting between joins to ensure that each
new node has finished joining before starting another node join.
BDR really needs to be enhanced to either support parallel join of multiple
nodes or identify and reject it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12 May 2015 at 14:36, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Also,
Is there a way to remove these things from the init target node easier?
d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr
(6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots
corresponding to this node from the init target node then drop and recreate
this database and try again
Now that we have SQL-level join it'd probably make sense to provide a
cleanup function for failed node joins. At this point there's no such
function.
Take note of the node identity given in the error as it corresponds to the
replication identifier name and slot name.
You need to, on the join target node:
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name =
bdr.bdr_format_slot_name('6147869128174526660',1,16908)
where the sysid, timeline ID and database OID are those given in the error.
You must run this from the target node's database, as it'll only consider
slots for the current database.
Then
SELECT pg_replication_identifier_drop(...)
the replication identifier used, after looking up the replication
identifier from pg_catalog.pg_replication_identifier. There isn't an
equivalent of bdr.bdr_format_slot_name for replication identifiers; I'll
look at adding one. Look it up visually or write a simple function to
format the string in the mean time.
Then delete the bdr.bdr_nodes entry for the failed-to-join node and any
bdr.bdr_connections entries for it.
You *must* drop and re-create the database on the failed-to-join node,
making a new blank db (preferably from template0).
Craig, thank you so much for the quick response!
Adding these cleanup functions sounds wonderful, thank you for looking into
that.
One question, why template0 vs template1 ? (My guess is because you want it
to be devoid of pretty much everything?)
On Tue, May 12, 2015 at 1:31 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 12 May 2015 at 14:36, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Also,
Is there a way to remove these things from the init target node easier?
d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr
(6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots
corresponding to this node from the init target node then drop and recreate
this database and try againNow that we have SQL-level join it'd probably make sense to provide a
cleanup function for failed node joins. At this point there's no such
function.Take note of the node identity given in the error as it corresponds to the
replication identifier name and slot name.You need to, on the join target node:
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name =
bdr.bdr_format_slot_name('6147869128174526660',1,16908)where the sysid, timeline ID and database OID are those given in the
error. You must run this from the target node's database, as it'll only
consider slots for the current database.Then
SELECT pg_replication_identifier_drop(...)
the replication identifier used, after looking up the replication
identifier from pg_catalog.pg_replication_identifier. There isn't an
equivalent of bdr.bdr_format_slot_name for replication identifiers; I'll
look at adding one. Look it up visually or write a simple function to
format the string in the mean time.Then delete the bdr.bdr_nodes entry for the failed-to-join node and any
bdr.bdr_connections entries for it.You *must* drop and re-create the database on the failed-to-join node,
making a new blank db (preferably from template0).
--
~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
Craig,
I was starting the first node then letting all of the other nodes join as
quick as they could which clearly won't work. It also explains why it
worked when I did it manually, I can only do it sequentially myself ;) I
had suspected a race condition and it seems I was in the right area :)
Thank you for this, I will alter what I am doing to start sequentially.
On Tue, May 12, 2015 at 1:19 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 12 May 2015 at 14:33, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
SELECT bdr.bdr_group_join(
local_node_name := 'pgbdr1',
node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
dbname=pgbdr',
join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres
dbname=pgbdr'
);At a guess you're probably not waiting between joins to ensure that each
new node has finished joining before starting another node join.BDR really needs to be enhanced to either support parallel join of
multiple nodes or identify and reject it.--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
~Wayne
Wayne E. Seguin
wayneeseguin@gmail.com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/
Craig,
It's alive!!!
One more question on this thread, where can I find the meanings of
node_status in the documentation?
pgbdr=# SELECT * FROM bdr.bdr_nodes;
node_sysid | node_timeline | node_dboid | node_status | node_name
| node_local_dsn |
node_init_from_
dsn
---------------------+---------------+------------+-------------+-----------+-------------------------------------------------------+---------------------------------
---------------------
6147988955199598785 | 1 | 16386 | r | rdpg0
| host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
6147988955568627909 | 1 | 16386 | i | rdpg1
| host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=p
ostgres dbname=pgbdr
6147988955597271233 | 1 | 16386 | r | rdpg3
| host=10.244.2.14 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=p
ostgres dbname=pgbdr
6147988963356274882 | 1 | 16386 | r | rdpg4
| host=10.244.2.18 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=p
ostgres dbname=pgbdr
6147988963438956739 | 1 | 16386 | r | rdpg2
| host=10.244.2.10 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
port=5432 user=p
ostgres dbname=pgbdr
(5 rows)
On 12 May 2015 at 22:21, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Craig,
It's alive!!!
One more question on this thread, where can I find the meanings of
node_status in the documentation?
node_status is really internal, but it's covered briefly in the docs:
*awesome*, the question was for my own curiosity so thanks for this!
On Tue, May 12, 2015 at 18:27 Craig Ringer <craig@2ndquadrant.com> wrote:
Show quoted text
On 12 May 2015 at 22:21, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Craig,
It's alive!!!
One more question on this thread, where can I find the meanings of
node_status in the documentation?node_status is really internal, but it's covered briefly in the docs:
On 13 May 2015 at 09:29, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
*awesome*, the question was for my own curiosity so thanks for this!
No worries.
I know it's trite, but for internal-ish detail like that the best reference
remains the source code. I'd like to think the sources are fairly sane.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services