pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

Started by Jānis Pūrisalmost 7 years ago7 messagesgeneral

janis@puris.lv

almost 7 years ago

Hello,

I'm working with a 9.4-bdr cluster and want to move away from BDR tech all
together. So my idea was to follow instructions on
http://bdr-project.org/docs/stable/ to first strip the node from BDR making
it into "regular" node and then moving the data from this node to official
9.4 instance.

Environment:

BDR_Node_1
BDR_Node_2

Regular_Node_1

Steps done on BDR_Node_2
1. Part node from BDR cluster
http://bdr-project.org/docs/stable/node-management-removing.html
2. Remove BDR from node
http://bdr-project.org/docs/stable/node-management-disabling.html
3. Create SR slot select
pg_create_physical_replication_slot('new_cluster_node');

After this. Steps done on Regular_Node_1, which is 9.4.22
1. Init the new node with pg_basebackup with source being BDR_Node_1
2. Update recovery.conf and postgres.conf
3. Start it with pg_ctl

All seems to be working well, but for this:

select * from information_schema.sequences;

produces "[XX000] ERROR: cache lookup failed for index 899807" for all of
the DBs that were configured with BDR with various OID.

I can not reproduce this error, when running query on any of the DBs on
9.4-bdr node.

Does anyone have any idea what may be causing this ?
Due to this I'm also unable to run pg_dump on new_cluster_node - it also
results in the same error.

That said. For more context - If it worked, next steps would have been to
1. Deactivate virtual IP on BDR_Node_2
2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"
3. Promote Regular_Node_1
4. Activate virtual IP on BDR_Node_2

Build the new 9.4 cluster from there on.

Any ideas / comments will be much appreciated!

Thank you in advance.
Best regards, Janis Puris

Ron

ronljohnsonjr@gmail.com

almost 7 years ago

In reply to: Jānis Pūris (#1)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

On 5/25/19 11:49 AM, Jānis Pūris wrote:

Hello,

I'm working with a 9.4-bdr cluster and want to move away from BDR tech all
together. So my idea was to follow instructions on
http://bdr-project.org/docs/stable/ to first strip the node from BDR
making it into "regular" node and then moving the data from this node to
official 9.4 instance.

Environment:

BDR_Node_1
BDR_Node_2

Regular_Node_1

Steps done on BDR_Node_2
1. Part node from BDR cluster
http://bdr-project.org/docs/stable/node-management-removing.html
2. Remove BDR from node
http://bdr-project.org/docs/stable/node-management-disabling.html
3. Create SR slot select
pg_create_physical_replication_slot('new_cluster_node');

After this. Steps done on Regular_Node_1, which is 9.4.22
1. Init the new node with pg_basebackup with source being BDR_Node_1
2. Update recovery.conf and postgres.conf
3. Start it with pg_ctl

All seems to be working well, but for this:

select * from information_schema.sequences;

produces "[XX000] ERROR: cache lookup failed for index 899807" for all of
the DBs that were configured with BDR with various OID.

I can not reproduce this error, when running query on any of the DBs on
9.4-bdr node.

Does anyone have any idea what may be causing this ?
Due to this I'm also unable to run pg_dump on new_cluster_node - it also
results in the same error.

That said. For more context - If it worked, next steps would have been to
1. Deactivate virtual IP on BDR_Node_2
2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"
3. Promote Regular_Node_1
4. Activate virtual IP on BDR_Node_2

Build the new 9.4 cluster from there on.

Any ideas / comments will be much appreciated!

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

--
Angular momentum makes the world go 'round.

Adrian Klaver

adrian.klaver@aklaver.com

almost 7 years ago

In reply to: Jānis Pūris (#1)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

On 5/25/19 9:49 AM, Jānis Pūris wrote:

Hello,

I'm working with a 9.4-bdr cluster and want to move away from BDR tech
all together. So my idea was to follow instructions on
http://bdr-project.org/docs/stable/ to first strip the node from BDR
making it into "regular" node and then moving the data from this node to
official 9.4 instance.

Environment:

BDR_Node_1
BDR_Node_2

Regular_Node_1

Steps done on BDR_Node_2
1. Part node from BDR cluster
http://bdr-project.org/docs/stable/node-management-removing.html
2. Remove BDR from node
http://bdr-project.org/docs/stable/node-management-disabling.html
3. Create SR slot select
pg_create_physical_replication_slot('new_cluster_node');

After this. Steps done on Regular_Node_1, which is 9.4.22
1. Init the new node with pg_basebackup with source being BDR_Node_1

I am not clear about above:

1) You removed BDR_Node_2 from cluster

2) You took pg_basebackup from BDR_Node_1 to create a new regular cluster.

3) 1) & 2) seem to be at odds with each other.

2. Update recovery.conf and postgres.conf
3. Start it with pg_ctl

All seems to be working well, but for this:

select * from information_schema.sequences;

produces "[XX000] ERROR: cache lookup failed for index 899807" for all
of the DBs that were configured with BDR with various OID.

I can not reproduce this error, when running query on any of the DBs on
9.4-bdr node.

Does anyone have any idea what may be causing this ?
Due to this I'm also unable to run pg_dump on new_cluster_node - it also
results in the same error.

That said. For more context - If it worked, next steps would have been to
1. Deactivate virtual IP on BDR_Node_2
2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"
3. Promote Regular_Node_1
4. Activate virtual IP on BDR_Node_2

Build the new 9.4 cluster from there on.

Any ideas / comments will be much appreciated!

Thank you in advance.
Best regards, Janis Puris

--
Adrian Klaver
adrian.klaver@aklaver.com

Jānis Pūris

janis@puris.lv

almost 7 years ago

In reply to: Adrian Klaver (#3)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

Hi, Adrian

Apologies, it is a typo.

pg_basebackup was taken from BDR_Node_2, not BDR_Node_1

Thank you in advance.
Best regards, Janis Puris

On 25 May 2019 at 19:27:42, Adrian Klaver (adrian.klaver@aklaver.com) wrote:

I am not clear about above:

1) You removed BDR_Node_2 from cluster

2) You took pg_basebackup from BDR_Node_1 to create a new regular cluster.

3) 1) & 2) seem to be at odds with each other.

Jānis Pūris

janis@puris.lv

almost 7 years ago

In reply to: Ron (#2)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

Hi Ron,

I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as
stated before. Typo)

I've been successful in transferring the data with pg_dump on BDR_Node_2
and then restoring it on Regular_Node_1. Then running "select * from
information_schema.sequences;" all is OK.

The problem with this approach is that I'm required to have minimal
downtime in this transition and we have a lot of data to transfer, which
would be lengthy process.

Thank you in advance.
Best regards, Janis Puris

On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com) wrote:

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

Adrian Klaver

adrian.klaver@aklaver.com

almost 7 years ago

In reply to: Jānis Pūris (#5)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

On 5/25/19 11:46 AM, Jānis Pūris wrote:

Hi Ron,

I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as
stated before. Typo)

I've been successful in transferring the data with pg_dump on BDR_Node_2
and then restoring it on Regular_Node_1. Then running "select * from
information_schema.sequences;" all is OK.

So the issue is that a binary backup/restore via pg_basebackup fails but
a logical backup/restore via pg_dump/pg_restore works, correct?

You might want to take a look here:

https://github.com/2ndQuadrant/bdr/issues/140

It might make sense to you, it does not to me. Looks to me something is
being done on the binary level that makes this difficult. I guessing you
are going to have to talk to the BDR folks.

The problem with this approach is that I'm required to have minimal
downtime in this transition and we have a lot of data to transfer, which
would be lengthy process.

Thank you in advance.
Best regards, Janis Puris

On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com
<mailto:ronljohnsonjr@gmail.com>) wrote:

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

--
Adrian Klaver
adrian.klaver@aklaver.com

Jānis Pūris

janis@puris.lv

almost 7 years ago

In reply to: Adrian Klaver (#6)

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

Managed to find following in the announcement of BDR on 9.6:
"BDR has always been an extension, but *on 9.4 it required a heavily
patched PostgreSQL, one that isn’t fully on-disk-format compatible with
stock community PostgreSQL 9.4.* The goal all along has been to allow it to
run as an extension on an unmodified PostgreSQL … and now we’re there."

Source: https://www.2ndquadrant.com/en/blog/bdr-is-coming-to-postgresql-9-6/
I suppose this answers why logical dump / restore works, but pg_basebackup
fails - the stock 9.4 and BDR 9.4 are not compatible on disk level.

On 25 May 2019 at 23:26:26, Adrian Klaver (adrian.klaver@aklaver.com) wrote:

On 5/25/19 11:46 AM, Jānis Pūris wrote:

Hi Ron,

I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as
stated before. Typo)

I've been successful in transferring the data with pg_dump on BDR_Node_2
and then restoring it on Regular_Node_1. Then running "select * from
information_schema.sequences;" all is OK.

So the issue is that a binary backup/restore via pg_basebackup fails but
a logical backup/restore via pg_dump/pg_restore works, correct?

You might want to take a look here:

https://github.com/2ndQuadrant/bdr/issues/140

It might make sense to you, it does not to me. Looks to me something is
being done on the binary level that makes this difficult. I guessing you
are going to have to talk to the BDR folks.

The problem with this approach is that I'm required to have minimal
downtime in this transition and we have a lot of data to transfer, which
would be lengthy process.

Thank you in advance.
Best regards, Janis Puris

On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com
<mailto:ronljohnsonjr@gmail.com>) wrote:

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

--
Adrian Klaver
adrian.klaver@aklaver.com