001_rep_changes.pl succeeds after a long time

Started by Álvaro Herrera10 months ago2 messages

alvherre@alvh.no-ip.org

10 months ago

Hello,

I noticed that the test file 001_repo_changes.pl finished successfully
after having taken 180s to run. This seems pretty suspicious --
normally that step takes around one second.

The problem is seen in this step:

[19:44:49.572](0.262s) ok 24 - update works with dropped subscriber column
### Stopping node "publisher" using mode fast
# Running: pg_ctl -D /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check/t_001_rep_changes_publisher_data/pgdata -m fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "publisher"
### Starting node "publisher"
# Running: pg_ctl -w -D /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check/t_001_rep_changes_publisher_data/pgdata -l /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check/log/001_rep_changes_publisher.log -o --cluster-name=publisher start
waiting for server to start.... done
server started
# Postmaster PID for node "publisher" is 3502228
Waiting for replication conn tap_sub's replay_lsn to pass 0/17DEBD8 on publisher
done
[19:47:50.689](181.116s) ok 25 - check replicated inserts after subscription publication change

Note that test 25 succeeded but took 3 minutes. If I look at the
publisher logs, I see this:

2025-03-11 19:44:49.885 CET [3501892] 001_rep_changes.pl LOG: statement: DELETE FROM tab_rep
2025-03-11 19:44:50.052 CET [3494694] LOG: received fast shutdown request
2025-03-11 19:44:50.052 CET [3494694] LOG: aborting any active transactions
2025-03-11 19:44:50.053 CET [3494694] LOG: background worker "logical replication launcher" (PID 3494772) exited with exit code 1
2025-03-11 19:44:50.053 CET [3494755] LOG: shutting down
2025-03-11 19:44:50.480 CET [3501776] tap_sub LOG: released logical replication slot "tap_sub"
2025-03-11 19:44:50.484 CET [3494755] LOG: checkpoint starting: shutdown immediate
2025-03-11 19:44:50.485 CET [3494755] LOG: checkpoint complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=579 kB, estimate=579 kB; lsn=0/17DEB60, redo lsn=0/17DEB60
2025-03-11 19:44:50.488 CET [3494694] LOG: database system is shut down
2025-03-11 19:44:50.884 CET [3502228] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2025-03-11 19:44:50.884 CET [3502228] LOG: listening on Unix socket "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529"
2025-03-11 19:44:50.886 CET [3502349] LOG: database system was shut down at 2025-03-11 19:44:50 CET
2025-03-11 19:44:50.889 CET [3502228] LOG: database system is ready to accept connections
2025-03-11 19:44:50.930 CET [3502395] 001_rep_changes.pl LOG: statement: SELECT pg_is_in_recovery()
2025-03-11 19:44:50.945 CET [3502428] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn()
2025-03-11 19:44:50.960 CET [3502440] 001_rep_changes.pl LOG: statement: SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('tap_sub', 'walreceiver')

The last query is repeated a number of times for 3 minutes, and finally we get this

2025-03-11 19:47:50.569 CET [3540762] 001_rep_changes.pl LOG: statement: SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('tap_sub', 'walreceiver')
2025-03-11 19:47:50.596 CET [3540764] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2025-03-11 19:47:50.597 CET [3540764] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2025-03-11 19:47:50.597 CET [3540764] tap_sub STATEMENT: IDENTIFY_SYSTEM

Meanwhile, the subscriber has this:

2025-03-11 19:44:49.777 CET [3501647] LOG: logical replication worker for subscription "tap_sub" will restart because of a parameter change
2025-03-11 19:44:49.785 CET [3501757] LOG: logical replication apply worker for subscription "tap_sub" has started
2025-03-11 19:44:50.481 CET [3501757] LOG: data stream from publisher has ended
2025-03-11 19:44:50.481 CET [3501757] ERROR: could not send end-of-streaming message to primary: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2025-03-11 19:44:50.484 CET [3495251] LOG: background worker "logical replication apply worker" (PID 3501757) exited with exit code 1
2025-03-11 19:44:50.490 CET [3502088] LOG: logical replication apply worker for subscription "tap_sub" has started
2025-03-11 19:44:50.490 CET [3502088] ERROR: apply worker for subscription "tap_sub" could not connect to the publisher: connection to server on socket "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
2025-03-11 19:44:50.493 CET [3495251] LOG: background worker "logical replication apply worker" (PID 3502088) exited with exit code 1
2025-03-11 19:47:50.593 CET [3540763] LOG: logical replication apply worker for subscription "tap_sub" has started

Is the problem that the subscriber never tries to establish the
connection again?

I have tried to reproduce it, but I haven't seen it again.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"El sudor es la mejor cura para un pensamiento enfermo" (Bardia)

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

10 months ago

In reply to: Álvaro Herrera (#1)

RE: 001_rep_changes.pl succeeds after a long time

On Thursday, March 13, 2025 12:29 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hi,

I noticed that the test file 001_repo_changes.pl finished successfully after
having taken 180s to run. This seems pretty suspicious -- normally that step
takes around one second.

The problem is seen in this step:

[19:44:49.572](0.262s) ok 24 - update works with dropped subscriber column
### Stopping node "publisher" using mode fast # Running: pg_ctl -D
/home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
/t_001_rep_changes_publisher_data/pgdata -m fast stop waiting for server to
shut down.... done server stopped # No postmaster PID for node "publisher"
### Starting node "publisher"
# Running: pg_ctl -w -D
/home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
/t_001_rep_changes_publisher_data/pgdata -l
/home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
/log/001_rep_changes_publisher.log -o --cluster-name=publisher start
waiting for server to start.... done server started # Postmaster PID for node
"publisher" is 3502228 Waiting for replication conn tap_sub's replay_lsn to
pass 0/17DEBD8 on publisher done
[19:47:50.689](181.116s) ok 25 - check replicated inserts after subscription
publication change

Note that test 25 succeeded but took 3 minutes. If I look at the publisher logs,
I see this:

2025-03-11 19:44:49.885 CET [3501892] 001_rep_changes.pl LOG: statement:
DELETE FROM tab_rep
2025-03-11 19:44:50.052 CET [3494694] LOG: received fast shutdown request
2025-03-11 19:44:50.052 CET [3494694] LOG: aborting any active
transactions
2025-03-11 19:44:50.053 CET [3494694] LOG: background worker "logical
replication launcher" (PID 3494772) exited with exit code 1
2025-03-11 19:44:50.053 CET [3494755] LOG: shutting down
2025-03-11 19:44:50.480 CET [3501776] tap_sub LOG: released logical
replication slot "tap_sub"
2025-03-11 19:44:50.484 CET [3494755] LOG: checkpoint starting: shutdown
immediate
2025-03-11 19:44:50.485 CET [3494755] LOG: checkpoint complete: wrote 9
buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0
recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=579 kB, estimate=579 kB;
lsn=0/17DEB60, redo lsn=0/17DEB60
2025-03-11 19:44:50.488 CET [3494694] LOG: database system is shut down
2025-03-11 19:44:50.884 CET [3502228] LOG: starting PostgreSQL 18devel on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2025-03-11 19:44:50.884 CET [3502228] LOG: listening on Unix socket
"/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529"
2025-03-11 19:44:50.886 CET [3502349] LOG: database system was shut
down at 2025-03-11 19:44:50 CET
2025-03-11 19:44:50.889 CET [3502228] LOG: database system is ready to
accept connections
2025-03-11 19:44:50.930 CET [3502395] 001_rep_changes.pl LOG: statement:
SELECT pg_is_in_recovery()
2025-03-11 19:44:50.945 CET [3502428] 001_rep_changes.pl LOG: statement:
SELECT pg_current_wal_lsn()
2025-03-11 19:44:50.960 CET [3502440] 001_rep_changes.pl LOG: statement:
SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('tap_sub', 'walreceiver')

The last query is repeated a number of times for 3 minutes, and finally we get
this

2025-03-11 19:47:50.569 CET [3540762] 001_rep_changes.pl LOG: statement:
SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('tap_sub', 'walreceiver')
2025-03-11 19:47:50.596 CET [3540764] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2025-03-11 19:47:50.597 CET [3540764] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2025-03-11 19:47:50.597 CET [3540764] tap_sub STATEMENT:
IDENTIFY_SYSTEM

Meanwhile, the subscriber has this:

2025-03-11 19:44:49.777 CET [3501647] LOG: logical replication worker for
subscription "tap_sub" will restart because of a parameter change
2025-03-11 19:44:49.785 CET [3501757] LOG: logical replication apply worker
for subscription "tap_sub" has started
2025-03-11 19:44:50.481 CET [3501757] LOG: data stream from publisher has
ended
2025-03-11 19:44:50.481 CET [3501757] ERROR: could not send
end-of-streaming message to primary: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2025-03-11 19:44:50.484 CET [3495251] LOG: background worker "logical
replication apply worker" (PID 3501757) exited with exit code 1
2025-03-11 19:44:50.490 CET [3502088] LOG: logical replication apply worker
for subscription "tap_sub" has started
2025-03-11 19:44:50.490 CET [3502088] ERROR: apply worker for
subscription "tap_sub" could not connect to the publisher: connection to
server on socket "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529"
failed: No such file or directory
Is the server running locally and accepting connections on that socket?
2025-03-11 19:44:50.493 CET [3495251] LOG: background worker "logical
replication apply worker" (PID 3502088) exited with exit code 1
2025-03-11 19:47:50.593 CET [3540763] LOG: logical replication apply worker
for subscription "tap_sub" has started

Is the problem that the subscriber never tries to establish the connection
again?

I have tried to reproduce it, but I haven't seen it again.

I suspect that the launcher may have missed the latch ping from the exiting
apply worker due to the bug reported in [1]/messages/by-id/CALDaNm3fvCg00GPWtjagDNP1cSBAW-gMKTUZWFTg+8x6tUa-=w@mail.gmail.com.

[1]: /messages/by-id/CALDaNm3fvCg00GPWtjagDNP1cSBAW-gMKTUZWFTg+8x6tUa-=w@mail.gmail.com

Best Regards,
Hou zj