Streaming replication - 11.5

Started by Nicola Contu · about 6 years ago · 16 messages · general
#1 Nicola Contu
nicola.contu@gmail.com

Hello,
I have two servers connected to the same switch running postgres 11.5

I am trying to replicate one of those servers again after planned work on the
master, because the replica has been lost. It has always worked, but now I get
this:

pg_basebackup: could not receive data from WAL stream: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I don't really understand what the issue is.
I had this issue last week as well in another DC, and I had to reboot the
slave to make it work (not sure why that helped).

Do you know what can cause this?

Thank you,
Nicola

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#1)
Re: Streaming replication - 11.5

On 3/10/20 2:26 AM, Nicola Contu wrote:

Hello,
I have two servers connected to the same switch running postgres 11.5

I am trying to replicate one of those servers again after planned work on the
master, because the replica has been lost. It has always worked, but now I get
this:

pg_basebackup: could not receive data from WAL stream: server closed the
connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I don't really understand what the issue is.

I would start with the logs from the Postgres server you are taking the
backup from.
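
For what it's worth, a minimal way to locate those logs from psql (a sketch;
pg_current_logfile() assumes the logging collector is enabled):

    -- Find where the sending server writes its log files.
    -- pg_current_logfile() returns NULL unless logging_collector = on.
    SELECT pg_current_logfile();
    SHOW log_directory;   -- relative paths are under data_directory
    SHOW data_directory;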

I had this issue last week as well in another DC, and I had to reboot the
slave to make it work (not sure why that helped).

Do you know what can cause this?

Thank you,
Nicola

--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#1)
Re: Streaming replication - 11.5

On 3/10/20 8:17 AM, Nicola Contu wrote:
Please post to the list also.
CCing the list.

What came immediately before the temporary file error?

  2020-03-10 15:10:17 GMT [[local]] [28171]: [1-1]
db=postgres,user=postgres LOG:  temporary file: path
"base/pgsql_tmp/pgsql_tmp28171.0", size 382474936
2020-03-10 15:10:17 GMT [[local]] [28171]: [4-1]
db=postgres,user=postgres LOG:  could not send data to client: Broken pipe
2020-03-10 15:10:17 GMT [[local]] [28171]: [5-1]
db=postgres,user=postgres FATAL:  connection to client lost
2020-03-10 15:10:26 GMT [] [12598]: [3544-1] db=,user= LOG:
 restartpoint complete: wrote 37315 buffers (0.4%); 0 WAL file(s)
added, 0 removed, 16 recycled; write=269.943 s, sync=0.039 s,
total=269.999 s; sync files=1010, longest=0.001 s, average=0.000 s;
distance=175940 kB, estimate=416149 kB
2020-03-10 15:10:26 GMT [] [12598]: [3545-1] db=,user= LOG:  recovery
restart point at 6424/1D7DEDE8

It is cascading replication.
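
As an aside, that "temporary file" LOG line is emitted under log_temp_files;
a spill of this size (382 MB) usually means a sort or hash exceeded work_mem.
A quick sketch for checking both knobs:

    -- log_temp_files = 0 logs every temp file; work_mem bounds per-node
    -- sort/hash memory before PostgreSQL spills to base/pgsql_tmp.
    SHOW log_temp_files;
    SHOW work_mem;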


--
Adrian Klaver
adrian.klaver@aklaver.com

#4 Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#3)
Re: Streaming replication - 11.5

These are the lines before:

2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1]
db=cmdv3,user=zabbix_check ERROR: recovery is in progress
2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1]
db=cmdv3,user=zabbix_check HINT: WAL control functions cannot be executed
during recovery.
2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1]
db=cmdv3,user=zabbix_check STATEMENT: select
greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from
pg_stat_replication where client_addr ='10.150.20.22'
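
For reference, pg_current_wal_lsn() is a WAL control function and is rejected
while the server is in recovery (that is what the HINT above says). A
standby-safe sketch of the same lag check, assuming the same downstream
address, measures from the last received LSN instead:

    -- Hypothetical standby-safe variant of the Zabbix check: on a cascading
    -- standby, compare what this server has received against what the
    -- downstream standby has replayed.
    SELECT greatest(0, pg_wal_lsn_diff(pg_last_wal_receive_lsn(), replay_lsn))
    FROM pg_stat_replication
    WHERE client_addr = '10.150.20.22';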

That query is made by Zabbix, so I stopped the Zabbix agent and tested again.
It is still failing, now because of this:

pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_51199"
pg_basebackup: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
^C4699810/504983062 kB (70%), 0/1 tablespace (...ql11/data/base/16401/231363544.2)

Here is the full log starting right before the last try:

2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG: restartpoint
complete: wrote 19565 buffers (0.2%); 0 WAL file(s) added, 0 removed, 7
recycled; write=270.014 s, sync=0.009 s, total=270.036 s; sync files=804,
longest=0.001 s, average=0.000 s; distance=131239 kB, estimate=725998 kB
2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG: recovery
restart point at 643A/D8C05F70
2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:22:44.050084+00.
2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG: restartpoint
starting: time
2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG: restartpoint
complete: wrote 17069 buffers (0.2%); 0 WAL file(s) added, 0 removed, 17
recycled; write=269.879 s, sync=0.006 s, total=269.902 s; sync files=811,
longest=0.001 s, average=0.000 s; distance=120469 kB, estimate=665445 kB
2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG: recovery
restart point at 643A/E01AB438
2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:27:43.945485+00.
2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG: restartpoint
starting: force wait
2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1]
db=cmdv3,user=nis LOG: duration: 1402.004 ms statement: SELECT id, name,
parent_id, parent, short_name, sales_rep_id FROM mmx_clients;
2020-03-11 09:29:34 GMT [10.222.8.2(47834)] [50961]: [2-1]
db=cmdv3,user=nis LOG: duration: 9493.259 ms statement: SELECT slid,
gnid, sof_id, client_id, product FROM mmx_slids;
2020-03-11 09:32:14 GMT [] [12598]: [4516-1] db=,user= LOG: restartpoint
complete: wrote 71260 buffers (0.8%); 0 WAL file(s) added, 0 removed, 13
recycled; write=269.953 s, sync=0.012 s, total=269.979 s; sync files=760,
longest=0.002 s, average=0.000 s; distance=123412 kB, estimate=611242 kB
2020-03-11 09:32:14 GMT [] [12598]: [4517-1] db=,user= LOG: recovery
restart point at 643A/E7A30498
2020-03-11 09:32:14 GMT [] [12598]: [4518-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:32:13.916101+00.
2020-03-11 09:32:44 GMT [] [12598]: [4519-1] db=,user= LOG: restartpoint
starting: time
2020-03-11 09:37:14 GMT [] [12598]: [4520-1] db=,user= LOG: restartpoint
complete: wrote 27130 buffers (0.3%); 0 WAL file(s) added, 0 removed, 12
recycled; write=270.026 s, sync=0.007 s, total=270.052 s; sync files=814,
longest=0.001 s, average=0.000 s; distance=280595 kB, estimate=578177 kB
2020-03-11 09:37:14 GMT [] [12598]: [4521-1] db=,user= LOG: recovery
restart point at 643A/F8C351C8
2020-03-11 09:37:14 GMT [] [12598]: [4522-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:37:14.067443+00.
2020-03-11 09:37:44 GMT [] [12598]: [4523-1] db=,user= LOG: restartpoint
starting: time
2020-03-11 09:42:14 GMT [] [12598]: [4524-1] db=,user= LOG: restartpoint
complete: wrote 26040 buffers (0.3%); 0 WAL file(s) added, 0 removed, 9
recycled; write=269.850 s, sync=0.019 s, total=269.886 s; sync files=834,
longest=0.002 s, average=0.000 s; distance=236392 kB, estimate=543999 kB
2020-03-11 09:42:14 GMT [] [12598]: [4525-1] db=,user= LOG: recovery
restart point at 643B/730F3F8
2020-03-11 09:42:14 GMT [] [12598]: [4526-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:42:13.900088+00.
2020-03-11 09:42:44 GMT [] [12598]: [4527-1] db=,user= LOG: restartpoint
starting: time
2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
db=[unknown],user=replicator LOG: terminating walsender process due to
replication timeout
2020-03-11 09:47:14 GMT [] [12598]: [4528-1] db=,user= LOG: restartpoint
complete: wrote 20966 buffers (0.2%); 0 WAL file(s) added, 0 removed, 9
recycled; write=270.048 s, sync=0.014 s, total=270.085 s; sync files=852,
longest=0.001 s, average=0.000 s; distance=183749 kB, estimate=507974 kB
2020-03-11 09:47:14 GMT [] [12598]: [4529-1] db=,user= LOG: recovery
restart point at 643B/12680A80
2020-03-11 09:47:14 GMT [] [12598]: [4530-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:47:14.069731+00.
2020-03-11 09:47:44 GMT [] [12598]: [4531-1] db=,user= LOG: restartpoint
starting: time


#5 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#4)
Re: Streaming replication - 11.5

On 3/11/20 2:54 AM, Nicola Contu wrote:

pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_51199"
pg_basebackup: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
^C4699810/504983062 kB (70%), 0/1 tablespace (...ql11/data/base/16401/231363544.2)

So you started over with a pg_basebackup?

Also from below:

2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
db=[unknown],user=replicator LOG: terminating walsender process due to
replication timeout

Where are the master and standby in relation to each other, network-wise?

Intervening firewalls, network latency issues?
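
Also worth checking: that "terminating walsender process due to replication
timeout" message is governed by wal_sender_timeout (60s by default). A sketch
for inspecting it and, if the link between the DCs is slow, temporarily
raising it on the sending server; the 5min value is only illustrative, not a
recommendation from this thread:

    -- wal_sender_timeout is reloadable on 11, so no restart is needed.
    SHOW wal_sender_timeout;
    ALTER SYSTEM SET wal_sender_timeout = '5min';
    SELECT pg_reload_conf();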


--
Adrian Klaver
adrian.klaver@aklaver.com

#6 Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#5)
Re: Streaming replication - 11.5

I am actually cascading.
The master is in NYH; the first slave is in Dallas, and the one having
problems is also in Dallas, on the same switch as the one replicating from
the master.

It has always worked; I am not sure what is wrong now. We just encrypted the
disks on all servers.
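
A sketch for checking each hop of that cascade (run on the NYH master and on
the intermediate Dallas standby) to see the downstream sender state and the
upstream receive status; column names are as of version 11:

    -- Downstream view: one row per connected standby / basebackup walsender.
    SELECT client_addr, state, sent_lsn, replay_lsn,
           pg_wal_lsn_diff(sent_lsn, replay_lsn) AS unreplayed_bytes
    FROM pg_stat_replication;

    -- Upstream view (on a standby): status of this server's WAL receiver.
    SELECT status, received_lsn, latest_end_lsn
    FROM pg_stat_wal_receiver;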


#7 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#6)
Re: Streaming replication - 11.5

On 3/11/20 11:59 AM, Nicola Contu wrote:

I am actually cascading.
The master is in NYH; the first slave is in Dallas, and the one having
problems is also in Dallas, on the same switch as the one replicating from
the master.

It has always worked; I am not sure what is wrong now. We just encrypted the
disks on all servers.

Do you have before-and-after numbers for CPU load and I/O throughput?

Do system logs show anything relevant during the replication drop-out?


--
Adrian Klaver
adrian.klaver@aklaver.com

#8 Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#7)
Re: Streaming replication - 11.5

CPU load on the server to be built? No.
System logs don't show anything relevant, unfortunately.


#9 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#8)
Re: Streaming replication - 11.5

On 3/11/20 2:12 PM, Nicola Contu wrote:

CPU load on the server to be built? No.

CPU load and I/O load on the servers in the replication chain.

Basically, it seems you just recently added extra overhead to the process by
encrypting/decrypting, and from what I gather from the earlier posts, that is
when your replication started failing. I am trying to establish whether the
former is causing the latter.
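
One way to see that from the database side (a sketch; the *_lag columns exist
on 10 and later) is to sample the sender while the copy runs and watch whether
lag grows steadily, which points at a throughput ceiling, or in bursts, which
points at stalls:

    -- Run repeatedly on each sending server during the basebackup.
    SELECT application_name, client_addr, state,
           write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;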

Not sure what you are referring to in regard to the server to be built?

System logs don't show anything relevant, unfortunately.


--
Adrian Klaver
adrian.klaver@aklaver.com

#10Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#9)
Re: Streaming replication - 11.5

Both the server we replicate from and the replica look OK.
%Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0 hi,  0.0 si,  0.0 st

CPU is really low on both.

I am running pg_basebackup again every time.
Any other suggestions?
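
If it helps, I can watch disk utilization on both ends during the next
pg_basebackup run, for example with iostat from sysstat (a sketch; the
interval is arbitrary):

    # run on both the sending server and the replica while the backup runs
    iostat -x 5

and post the %util and await columns (names vary slightly across sysstat
versions) for the encrypted drives.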

On Wed, Mar 11, 2020 at 11:13 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text


#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#10)
Re: Streaming replication - 11.5

On 3/12/20 4:31 AM, Nicola Contu wrote:

Both the server we replicate from and the replica look OK.
%Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0 hi,  0.0 si,  0.0 st

CPU is really low on both.

I am running pg_basebackup again every time.
Any other suggestions?

I have to believe there is a connection between switching to encrypted
disks and your issues. Not sure what, but to help: how is the encryption
being done, and what program is being used?
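
For example, if this is Linux with dm-crypt/LUKS (an assumption on my
part; the mapping name is a placeholder), something along these lines
would show what the data directory actually sits on:

    # list block devices and which of them are crypt mappings
    lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT

    # cipher and key size for a given mapping
    cryptsetup status <mapping-name>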

--
Adrian Klaver
adrian.klaver@aklaver.com

#12Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#11)
Re: Streaming replication - 11.5

The encryption is at the OS level: the drives where the database stores
its data are encrypted with a passphrase.

On Thu, Mar 12, 2020, 3:51 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text


#13Nicola Contu
nicola.contu@gmail.com
In reply to: Nicola Contu (#12)
Re: Streaming replication - 11.5

So in the logs I now see this:

2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1]
db=[unknown],user=replicator LOG: terminating walsender process due to
replication timeout

So I tried increasing wal_sender_timeout to 300s, but it did not help.
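
For reference, one way to apply that without a restart (a sketch; it can
equally be set in postgresql.conf), run on the server that hosts the
walsender:

    ALTER SYSTEM SET wal_sender_timeout = '300s';
    SELECT pg_reload_conf();

wal_receiver_timeout on the receiving side is the matching setting.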

On Thu, Mar 12, 2020 at 3:56 PM Nicola Contu <nicola.contu@gmail.com> wrote:

Show quoted text


#14Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#13)
Re: Streaming replication - 11.5

On 3/13/20 4:11 AM, Nicola Contu wrote:

So in the logs I now see this:

2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1]
db=[unknown],user=replicator LOG:  terminating walsender process due to
replication timeout

Yeah, that's been showing up in the log snippets you have been posting.

To figure this out you will need to:

1) Make a list of what changed since the last time replication worked
consistently.

2) Monitor the changed components, start logging or increase logging.

3) Monitor the chain of replication as a whole, to catch changes that you
do not know about (see the lag query sketched below). Since you seem to be
operating across data centers, that would include verifying the network.
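
For 3), a periodic lag check against each sending node could look like
this (a sketch; on a node that is itself a standby, use
pg_last_wal_receive_lsn() instead of pg_current_wal_lsn(), since WAL
control functions cannot be executed during recovery, as your Zabbix
check already demonstrated):

    SELECT application_name, client_addr, state,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;

A lag that keeps growing right up to the timeout would point at the
network or the disks rather than at Postgres itself.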


--
Adrian Klaver
adrian.klaver@aklaver.com

#15Nicola Contu
nicola.contu@gmail.com
In reply to: Adrian Klaver (#14)
Re: Streaming replication - 11.5

I was able to get pg_basebackup working using --max-rate=128M.

I still don't understand why. I guess it is related to the encryption and
the slowness of the disks.
Do you have any idea?
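
For reference, the invocation that finally completed looked roughly like
this (host and target directory are placeholders):

    pg_basebackup -h <upstream-host> -U replicator -D <data-directory> \
        -X stream -P -R --max-rate=128M

Without --max-rate it kept failing with the WAL stream errors above.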

On Fri, Mar 13, 2020 at 4:15 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text


#16Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicola Contu (#15)
Re: Streaming replication - 11.5

On 3/16/20 2:57 AM, Nicola Contu wrote:

I was able to get pg_basebackup working using --max-rate=128M.

I still don't understand why. I guess it is related to the encryption and
the slowness of the disks.
Do you have any idea?

I think your explanation fits. Encryption/decryption have overhead.
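
If you want to quantify it, a crude check is to compare sequential write
throughput on an encrypted volume against an unencrypted one (a sketch;
the path is a placeholder, and oflag=direct bypasses the page cache so
you measure the device rather than RAM):

    dd if=/dev/zero of=<path-on-encrypted-volume>/ddtest bs=1M count=4096 oflag=direct

If the encrypted disks cannot sustain an unthrottled pg_basebackup,
capping it with --max-rate as you did is a reasonable workaround.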


--
Adrian Klaver
adrian.klaver@aklaver.com