Tables getting stuck at 's' state during logical replication
Hello,
I am trying out logical replication for upgrading postgres instance from
version 11 to 15.x. In the process, I noticed that some tables get stuck in
the 's' state during logical replication and they do not move to the 'r'
state. I tried to drop the subscription and create a new subscriber, but
the same thing repeated. Also, each time different tables get stuck, it is
not like the same tables get stuck every time. This also happens to tables
that do not have frequent writes to them. Then I tried to drop the tables
which got stuck from the original publication and created a new publication
and new subscribers just with the tables which got stuck in the previous
subscription. Now, in this new subscription, some tables go to 'r' state
(even though they did not in the older subscription) and some tables remain
in 's' state. If I continue doing this process for 5-6 times I am able to
cover all of the tables. Since these tables get stuck at 's' state, the
replication origins are not deleted automatically when I delete the table
from the publication and I had to manually delete the replication origin.
The logs contain "table synchronization worker for <table_name> has
finished even for table in 's' state. Can somebody please help with this
issue as I do not want to do manual intervention of creating a new
publication and subscription for tables that get stuck at 's'.
Configuration
max_replication_slots : 75 in publisher and subscriber
max_worker_processes: 60 (in subscriber)
max_logical_replication_workers: 55 (in subscriber)
Thanks,
Padmavathi
On Fri, May 5, 2023 at 3:04 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
Hello,
I am trying out logical replication for upgrading postgres instance from version 11 to 15.x. In the process, I noticed that some tables get stuck in the 's' state during logical replication and they do not move to the 'r' state. I tried to drop the subscription and create a new subscriber, but the same thing repeated. Also, each time different tables get stuck, it is not like the same tables get stuck every time.
This is strange. BTW, we don't save slots after the upgrade, so the
subscriptions in the upgraded node won't be valid. We have some
discussion on this topic in threads [1]/messages/by-id/20230217075433.u5mjly4d5cr4hcfe@jrouhaud[2]/messages/by-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com. So, I think after the
upgrade one needs to anyway re-create the subscriptions. Can you share
your exact steps for the upgrade and what is the state before the
upgrade? Is it possible to share some minimal test case to show the
exact problem you are facing?
[1]: /messages/by-id/20230217075433.u5mjly4d5cr4hcfe@jrouhaud
[2]: /messages/by-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
--
With Regards,
Amit Kapila.
Some background on the setup on which I am trying to carry out the upgrade:
We have a pod in a kubernetes cluster which contains the postgres 11 image.
We are following the logical replication process for upgrade
Steps followed for logical replication:
1. Created a new pod in the same kubernetes cluster with the latest
postgres 15 image
2. Created a publication (say publication 1) in the old pod including all
tables in a database
3. Created a subscription (say subscription 1) in the new pod for the above
mentioned publication
4. When monitoring the subscription via pg_subscription_rel in the
subscriber, I noticed that out of 45 tables 20 were in the 'r' state and 25
were in 's' state and they remained in the same state for almost 2 days,
there was no improvement in the state. But the logs showed that the tables
which had 's' state also had "synchronization workers for <table_name>
finished".
5. Then I removed the tables which got stuck in the 's' state from
publication 1 and created a new publication (publication 2) with only these
tables which got stuck and created a new subscription (subscription 2) for
this publication in the subscriber.
6. Now on monitoring subscription 2 via pg_subscription_rel I noticed that
out of 25, now 12 were in 'r' state and 13 again got stuck in 's' state.
Repeated this process of dropping tables which got stuck from publication
and created a new publisher and subscriber and finally I was able to bring
all tables to sync in this way. But still the tables were present in
replication origin.
7. On executing pg_replication_origins command, I saw that every
subscription had one origin and every table which got stuck in each
publication had one origin with roname pg_<subid>_<relid>. Eventhough they
were stuck, these replication origins were not removed.
On Fri, May 5, 2023 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Show quoted text
On Fri, May 5, 2023 at 3:04 PM Padmavathi G <padma9.9.1999@gmail.com>
wrote:Hello,
I am trying out logical replication for upgrading postgres instance fromversion 11 to 15.x. In the process, I noticed that some tables get stuck in
the 's' state during logical replication and they do not move to the 'r'
state. I tried to drop the subscription and create a new subscriber, but
the same thing repeated. Also, each time different tables get stuck, it is
not like the same tables get stuck every time.This is strange. BTW, we don't save slots after the upgrade, so the
subscriptions in the upgraded node won't be valid. We have some
discussion on this topic in threads [1][2]. So, I think after the
upgrade one needs to anyway re-create the subscriptions. Can you share
your exact steps for the upgrade and what is the state before the
upgrade? Is it possible to share some minimal test case to show the
exact problem you are facing?[1] -
/messages/by-id/20230217075433.u5mjly4d5cr4hcfe@jrouhaud
[2] -
/messages/by-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com--
With Regards,
Amit Kapila.
On Fri, May 5, 2023 at 7:27 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
Some background on the setup on which I am trying to carry out the upgrade:
We have a pod in a kubernetes cluster which contains the postgres 11 image. We are following the logical replication process for upgrade
Steps followed for logical replication:
1. Created a new pod in the same kubernetes cluster with the latest postgres 15 image
2. Created a publication (say publication 1) in the old pod including all tables in a database
3. Created a subscription (say subscription 1) in the new pod for the above mentioned publication
4. When monitoring the subscription via pg_subscription_rel in the subscriber, I noticed that out of 45 tables 20 were in the 'r' state and 25 were in 's' state and they remained in the same state for almost 2 days, there was no improvement in the state. But the logs showed that the tables which had 's' state also had "synchronization workers for <table_name> finished".
I think the problem happened in this step where some of the tables
remained in 's' state. It is not clear why this could happen because
apply worker should eventually update these relations state to 'r'. If
this is reproducible, we can try two things to further investigate the
issue: (a) Disable and Enable the subscription once and see if that
helps; (b) try increasing the LOG level to DEBUG2 and see if we get
any useful LOG message.
--
With Regards,
Amit Kapila.
On Tue, May 9, 2023 at 4:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 5, 2023 at 7:27 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
Some background on the setup on which I am trying to carry out the upgrade:
We have a pod in a kubernetes cluster which contains the postgres 11 image. We are following the logical replication process for upgrade
Steps followed for logical replication:
1. Created a new pod in the same kubernetes cluster with the latest postgres 15 image
2. Created a publication (say publication 1) in the old pod including all tables in a database
3. Created a subscription (say subscription 1) in the new pod for the above mentioned publication
4. When monitoring the subscription via pg_subscription_rel in the subscriber, I noticed that out of 45 tables 20 were in the 'r' state and 25 were in 's' state and they remained in the same state for almost 2 days, there was no improvement in the state. But the logs showed that the tables which had 's' state also had "synchronization workers for <table_name> finished".I think the problem happened in this step where some of the tables
remained in 's' state. It is not clear why this could happen because
apply worker should eventually update these relations state to 'r'. If
this is reproducible, we can try two things to further investigate the
issue: (a) Disable and Enable the subscription once and see if that
helps; (b) try increasing the LOG level to DEBUG2 and see if we get
any useful LOG message.
It looks like the respective apply workers for the tables whose state
was 's' are not able to get to UpdateSubscriptionRelState with
SUBREL_STATE_READY. In addition to what Amit suggested above, is it
possible for you to reproduce this problem on upstream (reproducing on
HEAD cool otherwise PG 15 enough) code? If yes, you can add custom
debug messages to see what's happening.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com