Logical replication failed with SSL SYSCALL error
Hi Team,
Postgres Version:- 13.8
Issue:- Logical replication failing with SSL SYSCALL error
Priority:- High
We are migrating our database using logical replication, and all of a sudden the errors below started appearing in the source and target logs, and we cannot work out the cause.
*Logs from Source:-*
LOG: could not send data to client: Connection reset by peer
STATEMENT: COPY public.test TO STDOUT
FATAL: connection to client lost
STATEMENT: COPY public.test TO STDOUT
*Logs from Target:-*
2023-04-15 19:07:02 UTC::@:[1250]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
2023-04-15 19:07:02 UTC::@:[1250]:CONTEXT: COPY test, line 365326932
2023-04-15 19:07:03 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 1250) exited with exit code 1
2023-04-15 19:07:03 UTC::@:[7155]:LOG: logical replication table synchronization worker for subscription " sub_tables_2_180", table "test" has started
2023-04-15 19:12:05 UTC:10.144.19.34(33276):postgres@webadmit_staging:[7112]:WARNING: there is no transaction in progress
2023-04-15 19:14:08 UTC:10.144.19.34(33324):postgres@webadmit_staging:[6052]:LOG: could not receive data from client: Connection reset by peer
2023-04-15 19:17:23 UTC::@:[2112]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
2023-04-15 19:17:23 UTC::@:[1089]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
2023-04-15 19:17:23 UTC::@:[2556]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 2556) exited with exit code 1
2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 2112) exited with exit code 1
2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 1089) exited with exit code 1
2023-04-15 19:17:23 UTC::@:[7287]:LOG: logical replication apply worker for subscription "sub_tables_2_180" has started
2023-04-15 19:17:23 UTC::@:[7288]:LOG: logical replication apply worker for subscription "sub_tables_3_192" has started
2023-04-15 19:17:23 UTC::@:[7289]:LOG: logical replication apply worker for subscription "sub_tables_1_180" has started
Just after this error, all the other replication slots are disabled for some time and then come back online, along with the COPY command under a new PID in pg_stat_activity.
I have a few queries regarding this:-
1. The exact reason for the disconnection (some articles point to memory, others to the network)
2. Will it lead to data inconsistency?
3. Does the new COPY command (with the new PID) migrate all the data of the test table again?
Please help; we are stuck here.
--
Thanks and Regards,
Shaurya Jain
email:- 12345shaurya@gmail.com
*Mobile:- +91-8802809405*
LinkedIn:- https://www.linkedin.com/in/shaurya-jain-74353023
Hi Team,
Could you please help me with this? It's urgent for the production environment.
On Wed, Apr 19, 2023 at 3:44 PM shaurya jain <12345shaurya@gmail.com> wrote:
Hi Team,
Could you please help? It's urgent for the production env.
On Sun, Apr 16, 2023 at 2:40 AM shaurya jain <12345shaurya@gmail.com> wrote:
On Wed, 19 Apr 2023 at 17:26, shaurya jain <12345shaurya@gmail.com> wrote:
I have a few queries regarding this:-
The exact reason for the disconnection (some articles point to memory, others to the network)
This might be caused by a network failure. Did you notice any network instability? Could you check the TCP settings?
You could check the following configuration parameters: tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count.
With these set, a TCP keepalive probe is sent after the connection has been idle for tcp_keepalives_idle seconds; if the probe is not acknowledged, it is retransmitted every tcp_keepalives_interval seconds, and the connection is considered dead after tcp_keepalives_count unacknowledged probes.
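At the operating-system level, these three settings correspond to the standard TCP keepalive socket options. The following is a minimal Python sketch, assuming Linux (where `socket.TCP_KEEPIDLE`, `socket.TCP_KEEPINTVL` and `socket.TCP_KEEPCNT` exist); it only configures an unconnected local socket and reads the values back, it does not talk to PostgreSQL. The helper name `enable_keepalives` and the example values are illustrative.

```python
import socket


def enable_keepalives(sock: socket.socket, idle: int, interval: int, count: int) -> None:
    """Enable TCP keepalives on `sock` using the Linux option names."""
    # Turn keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of inactivity before the first probe (compare tcp_keepalives_idle).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between unacknowledged probes (compare tcp_keepalives_interval).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unacknowledged probes before the peer is declared dead (compare tcp_keepalives_count).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    # Illustrative values only; the socket is never connected.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalives(s, idle=60, interval=10, count=6)
        for name in ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT"):
            print(name, s.getsockopt(socket.IPPROTO_TCP, getattr(socket, name)))
```

The server applies equivalent socket options itself when these parameters are set, so nothing like this needs to be done by hand; the sketch only makes the mechanism concrete.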
Will it lead to data inconsistency?
It will not lead to inconsistency. In case of failure the failed
transaction will be rolled back.
Does the new COPY command (with the new PID) migrate all the data of the test table again?
Yes, in case of a failure the whole table data will be migrated again.
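To check whether a table has gone back to its initial copy after such a failure, the subscriber's `pg_subscription_rel` catalog records a per-table sync state. A minimal Python sketch follows, assuming PostgreSQL 13's documented `srsubstate` codes (`i`, `d`, `s`, `r`); the query is plain SQL that can be run from psql or any client, and the names `SYNC_STATE_QUERY` and `describe_sync_state` are illustrative, not part of any library. While a table is being copied again, its row would be expected to show `d`.

```python
# Plain SQL to run on the subscriber; lists every subscribed table's sync state.
SYNC_STATE_QUERY = """
SELECT s.subname, c.relname, r.srsubstate
FROM pg_subscription_rel AS r
JOIN pg_subscription AS s ON s.oid = r.srsubid
JOIN pg_class AS c ON c.oid = r.srrelid
ORDER BY s.subname, c.relname;
"""

# srsubstate codes as documented for PostgreSQL 13 (later versions add more).
SRSUBSTATE_PG13 = {
    "i": "initialize",
    "d": "data is being copied",
    "s": "synchronized",
    "r": "ready (normal replication)",
}


def describe_sync_state(code: str) -> str:
    """Translate a srsubstate code into a readable description."""
    return SRSUBSTATE_PG13.get(code, f"unknown state {code!r}")


if __name__ == "__main__":
    print(SYNC_STATE_QUERY.strip())
    for code in ("d", "r", "x"):
        print(code, "->", describe_sync_state(code))
```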
Regards,
Vignesh
Hi Vignesh,
That was really prompt, and it solved our problem. Thank you, buddy.
Please see my inline comments below:
On Thu, Apr 20, 2023 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
The exact reason for the disconnection (some articles point to memory, others to the network)
Yes, you were correct: the issue was related to the network. The VPN tunnel got restarted because of a misconfiguration on the tunnel side. After fixing that, it has stayed resolved so far. These parameters were set to the following values:
1. keepalives_idle 60
2. keepalives_interval 100
3. keepalives_count 60
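As a rough back-of-the-envelope check, a silent peer is declared dead about idle + interval × count seconds after the connection goes quiet. With the values listed above that is 60 + 100 × 60 = 6060 seconds, roughly 101 minutes, which is far longer than the idle setting alone suggests. The sketch below does that arithmetic and also builds a libpq connection string carrying the same settings through libpq's `keepalives`, `keepalives_idle`, `keepalives_interval` and `keepalives_count` parameters, of the kind that could be passed to `CREATE SUBSCRIPTION ... CONNECTION '...'` or `ALTER SUBSCRIPTION ... CONNECTION '...'` on the subscriber. The host, database name and helper names are hypothetical placeholders, and the formula is an approximation of Linux keepalive behaviour.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time (s) for TCP keepalives to declare a silent peer dead.

    The first probe goes out after `idle` seconds of silence and probes repeat
    every `interval` seconds; once `count` consecutive probes go unanswered the
    connection is dropped, about idle + interval * count seconds in total.
    """
    return idle + interval * count


def keepalive_conninfo(base: str, idle: int, interval: int, count: int) -> str:
    """Append libpq keepalive parameters to a key=value connection string."""
    params = {
        "keepalives": 1,  # enable TCP keepalives on the client socket
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }
    extra = " ".join(f"{key}={value}" for key, value in params.items())
    return f"{base} {extra}".strip()


if __name__ == "__main__":
    # Values reported in this thread.
    print(keepalive_detection_seconds(60, 100, 60))
    # Hypothetical publisher host and database name.
    print(keepalive_conninfo("host=publisher.example dbname=appdb", 60, 100, 60))
```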
Will it lead to data inconsistency?
Agreed; the migration was up to the mark after fixing the network.
Does the new COPY command (with the new PID) migrate all the data of the test table again?
Yes, I follow you on that. Is there any way to rsync instead of doing a simple copy?