logical replication - possible remaining problem

Started by Erik Rijkers over 8 years ago, 4 messages
#1Erik Rijkers
er@xs4all.nl

I am not sure whether what I found here amounts to a bug; I might be
doing something dumb.

During the last few months I did tests by running pgbench over logical
replication. Earlier emails have details.

The basic form of that now works well (and the fix has been committed)
but as I looked over my testing program I noticed one change I made to
it, already many weeks ago:

In the cleanup during startup (pre-flight check you might say) and also
before the end, instead of

echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)

I changed that (as I say, many weeks ago) to:

echo "delete from pg_subscription;
delete from pg_subscription_rel;
delete from pg_replication_origin; " | psql -qXp $port2 -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in main test
script pgbench_detail2.sh

This change was an effort to ensure a 'clean' start (and end) state
that would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus
done with (2).

Now, looking at the script again I am thinking that it would be
reasonable to expect that after issuing
delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.
(Is this reasonable? this is really the main question of this email).

So I removed the latter two delete statements again, and ran the tests
again with the form in (1)

I have established that (after a number of successful cycles) the test
stops succeeding, and the replica log then repeats:

2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT: Increase max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical replication worker for subscription 29235 (PID 2421) exited with exit code 1

when I manually 'clean up' by doing:
delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of
pg_replication_origin-cleanup when pg_subscription is deleted.

Does that make sense? All this is probably vague and I am only posting
in the hope that Petr (or someone else) perhaps immediately understands
what goes wrong, even with this limited amount of info.

In the meantime I will try to dig up more detailed info...

thanks,

Erik Rijkers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Erik Rijkers (#1)
Re: logical replication - possible remaining problem

Erik Rijkers wrote:

Now, looking at the script again I am thinking that it would be reasonable
to expect that after issuing
delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is
this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption. You should be
using the DDL that removes subscriptions instead.
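
A minimal sketch of how that advice might look in the test script's cleanup step. The helper name drop_subscriptions_sql is hypothetical (it is not part of pgbench_detail2.sh), and "sub1" is simply the subscription name that appears in the replica log quoted in this thread. The function only prints SQL, so it can be inspected before being piped to psql.

```shell
# Hypothetical helper: emit one DROP SUBSCRIPTION statement per name given.
# Names are wrapped in double quotes as SQL identifiers; a name that itself
# contains a double quote is not handled by this sketch.
drop_subscriptions_sql() {
  for sub in "$@"; do
    printf 'DROP SUBSCRIPTION IF EXISTS "%s";\n' "$sub"
  done
}

# Possible use in place of form (1), assuming $port2 is the replica's port:
#   drop_subscriptions_sql sub1 | psql -qXp "$port2"
```

Unlike a DELETE on the catalog, DROP SUBSCRIPTION lets the server remove the related state itself. By default it also tries to drop the replication slot on the publisher, so the publisher presumably needs to be reachable when the cleanup runs.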

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#3Erik Rijkers
er@xs4all.nl
In reply to: Alvaro Herrera (#2)
Re: logical replication - possible remaining problem

On 2017-06-07 23:18, Alvaro Herrera wrote:

Erik Rijkers wrote:

Now, looking at the script again I am thinking that it would be reasonable
to expect that after issuing
delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is
this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption. You should be
using the DDL that removes subscriptions instead.

You're right, that makes sense.
Thanks.


#4Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#1)
Re: logical replication - possible remaining problem

Hi,

On 07/06/17 22:49, Erik Rijkers wrote:

I am not sure whether what I found here amounts to a bug; I might be
doing something dumb.

During the last few months I did tests by running pgbench over logical
replication. Earlier emails have details.

The basic form of that now works well (and the fix has been committed)
but as I looked over my testing program I noticed one change I made to
it, already many weeks ago:

In the cleanup during startup (pre-flight check you might say) and also
before the end, instead of

echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)

I changed that (as I say, many weeks ago) to:

echo "delete from pg_subscription;
delete from pg_subscription_rel;
delete from pg_replication_origin; " | psql -qXp $port2 -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in main test
script pgbench_detail2.sh

This change was an effort to ensure a 'clean' start (and end) state
that would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus
done with (2).

Now, looking at the script again I am thinking that it would be
reasonable to expect that after issuing
delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.
(Is this reasonable? this is really the main question of this email).

Hmm, they are not cleaned automatically. Deleting from system catalogs
manually like this never propagates to related tables; we don't use FKs
there.

So I removed the latter two delete statements again, and ran the tests
again with the form in (1)

I have established that (after a number of successful cycles) the test
stops succeeding, and the replica log then repeats:

2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT: Increase max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical replication worker for subscription 29235 (PID 2421) exited with exit code 1

when I manually 'clean up' by doing:
delete from pg_replication_origin;

Yeah, because you consumed all the origins (I am still not a huge fan of
how that limit works, but that's a separate discussion).

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of
pg_replication_origin-cleanup when pg_subscription is deleted.

There is no omission; the origin is not supposed to be deleted automatically
unless you use DROP SUBSCRIPTION.
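
For a replica that has already accumulated leftover origins from manual catalog deletes, a rough sketch of how they might be inspected and removed through the built-in pg_replication_origin_drop() function rather than another DELETE. The helper names list_origins_sql and drop_origin_sql are hypothetical, and the origin name in the usage comment is a placeholder, not a value taken from this thread.

```shell
# Hypothetical helper: SQL that lists every replication origin on the
# subscriber, so leftovers become visible.
list_origins_sql() {
  printf '%s\n' 'SELECT roident, roname FROM pg_replication_origin ORDER BY roident;'
}

# Hypothetical helper: SQL that drops one origin by name using the
# pg_replication_origin_drop() function. The name is inserted as a plain
# SQL string literal; names containing single quotes are not handled here.
drop_origin_sql() {
  printf "SELECT pg_replication_origin_drop('%s');\n" "$1"
}

# Possible use, assuming $port2 is the replica's port:
#   list_origins_sql | psql -qXp "$port2"
#   drop_origin_sql some_origin_name | psql -qXp "$port2"
```

This is only a recovery aid for an already inconsistent test instance; for routine cleanup, DROP SUBSCRIPTION remains the intended path.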

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
