streaming replication timeout error

Started by 高健over 12 years ago6 messagesgeneral

luckyjackgao@gmail.com

over 12 years ago

Hello:

My customer encountered some connection timeout, while using one
primary-one standby streaming replication.

The original log is japanese, because there are no error-code like oracle's
ora-xxx,
I tried to translate the japanese information into English, But that might
be not correct English for PG.

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not
receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found:
Senario I:

If the wal file wanted is removed manually：

Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL
segment 000000010000000000000011 has already been removed

Senario II:

If replication_timeout=60, wal_receiver_status_interval=0,
log will be like this:

On primary side：
LOG: terminating walsender process due to replication timeout

On standby side：
FATAL: could not receive data from WAL stream:
LOG: streaming replication successfully connected to primary

But they are not the same to my customer's case.
And I also noticed that my customer's recovery.conf:

In it it has primary_conninfo as this:
host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10
keepalives_idle=5 keepalives_interval=1

That is to say:
keepalives_idle=5
keepalives_interval=1
keepalives_count=3

With that setting, I think that For every 5 seconds,
In order to see whether the connection is alive, 3 keepalives package will
be sent,
And if there is no repsonse in 1 seconds, then he connection will be
invalid.

If my unserstanding is right,
when master is busy at network,
the three keepalives_ parameter will also cause the connection time out
error.

But I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?

2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG: Recovery
replay at point 1F1/45EF8E20
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL: Last
transaction log time 2013-09-22 09:40:42.673062+09
2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG:
restartpoint starting: time
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG:
restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s,
total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG: Recovery
replay at point 1F1/45EF8E20
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL: Last
transaction log time 2013-09-22 09:45:44.77683+09
2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG:
restartpoint starting: time
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG:
restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s)
added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s;
sync files=98, longest=0.002 s, average=0.000 s
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG: Recovery
replay at point 1F1/46CE8EA0
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL: Last
transaction log time 2013-09-22 09:50:36.564966+09
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not
receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory
2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG: Log file
497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or
directory
2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG: Streaming
replication connected to primary server successfully
2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG:
restartpoint starting: time
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG:
restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log
file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s,
total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG: Recovery
replay at point 1F1/48E9D238
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL: Last
transaction log time 2013-09-22 09:56:29.015514+09

Thanks in advance

jian gao

Adrian Klaver

adrian.klaver@aklaver.com

over 12 years ago

In reply to: 高健 (#1)

Re: streaming replication timeout error

On 10/08/2013 07:58 PM, 高健 wrote:

Hello:

My customer encountered some connection timeout, while using one
primary-one standby streaming replication.

The original log is japanese, because there are no error-code like
oracle's ora-xxx,
I tried to translate the japanese information into English, But that
might be not correct English for PG.

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could
not receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found:
Senario I:

If the wal file wanted is removed manually：

Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL
segment 000000010000000000000011 has already been removed

But I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?

Would seem to me the interesting part is:

scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No
such file or directory

Are using scp to move WAL files to an archive directory?

If so, it seems scp is having issues, either network interruption or the
file is disappearing under it.

Thanks in advance

jian gao

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

高健

luckyjackgao@gmail.com

over 12 years ago

In reply to: Adrian Klaver (#2)

Re: streaming replication timeout error

Hello:

Thanks for replying.

The recovery.conf file on standby(DB2) is like that:

standby_mode = 'on'
primary_conninfo = 'host=DB1 port=5432 application_name=testpg
user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1'
recovery_target_timeline = 'latest'
restore_command = 'scp -o "ConnectTimeout 5" -i
/opt/PostgresPlus/9.2AS/.ssh/id_edb
DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p'

I am not familiar with the scp command, I think that here scp is used to
copy archive wal log files from primary to standby...

Maybe the ConnectionTimeout is too small, And sometimes when network is not
very well,
the restore_command will fail and return FATAL error?

In fact I am a little confused about restore_command, we are using
streaming replication, but why restore_command is still needed to copy
archive wal log, isn't it the old warm standby (file shipping)?

Best Regards
jian gao

2013/10/9 Adrian Klaver <adrian.klaver@gmail.com>

Show quoted text

On 10/08/2013 07:58 PM, 高健 wrote:

Hello:

My customer encountered some connection timeout, while using one
primary-one standby streaming replication.

The original log is japanese, because there are no error-code like
oracle's ora-xxx,
I tried to translate the japanese information into English, But that
might be not correct English for PG.

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0]**[XX000]FATAL: Could
not receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/**arch/000000AC000001F10000004A: No
such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found:
Senario I:

If the wal file wanted is removed manually：

Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL
segment 000000010000000000000011 has already been removed

But I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?

Would seem to me the interesting part is:

scp: /opt/PostgresPlus/9.2AS/data/**arch/000000AC000001F10000004A: No
such file or directory

Are using scp to move WAL files to an archive directory?

If so, it seems scp is having issues, either network interruption or the
file is disappearing under it.

Thanks in advance

jian gao

--
Adrian Klaver
adrian.klaver@gmail.com

Adrian Klaver

adrian.klaver@aklaver.com

over 12 years ago

In reply to: 高健 (#3)

Re: streaming replication timeout error

On 10/09/2013 05:51 PM, 高健 wrote:

Hello:

Thanks for replying.

The recovery.conf file on standby(DB2) is like that:

standby_mode = 'on'
primary_conninfo = 'host=DB1 port=5432 application_name=testpg
user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1'
recovery_target_timeline = 'latest'
restore_command = 'scp -o "ConnectTimeout 5" -i
/opt/PostgresPlus/9.2AS/.ssh/id_edb
DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p'

I am not familiar with the scp command, I think that here scp is used
to copy archive wal log files from primary to standby...

Maybe the ConnectionTimeout is too small, And sometimes when network is
not very well,
the restore_command will fail and return FATAL error?

In fact I am a little confused about restore_command, we are using
streaming replication, but why restore_command is still needed to copy
archive wal log, isn't it the old warm standby (file shipping)?

Best explanation is in the docs:

http://www.postgresql.org/docs/9.3/static/warm-standby.html
"
At startup, the standby begins by restoring all WAL available in the
archive location, calling restore_command. Once it reaches the end of
WAL available there and restore_command fails, it tries to restore any
WAL available in the pg_xlog directory. If that fails, and streaming
replication has been configured, the standby tries to connect to the
primary server and start streaming WAL from the last valid record found
in archive or pg_xlog. If that fails or streaming replication is not
configured, or if the connection is later disconnected, the standby goes
back to step 1 and tries to restore the file from the archive again.
This loop of retries from the archive, pg_xlog, and via streaming
replication goes on until the server is stopped or failover is triggered
by a trigger file.
"

Basically by having a restore_command and primary_conninfo you are
telling the standby to do both, following the sequence described above.

FYI ConnectTimeout is a SSH option passed to scp.

man ssh_config will get you more information.

Would seem both your streaming and archiving are using the same network,
is that correct?

If so you have a single point of failure, the network.

Best Regards
jian gao

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Stuart Bishop

stuart@stuartbishop.net

over 12 years ago

In reply to: 高健 (#1)

Re: streaming replication timeout error

On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote:

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not
receive data from WAL stream: could not receive data from server: connection
timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

It is a network error. The TCP/IP socket died somehow, and the FATAL
error logged. PostgreSQL then fell back to using the restore_command
defined in your recovery.conf file, but the WAL file had not yet been
shipped and the output logged. I imagine streaming replication happily
reconnected soon after. This is all quite normal.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

高健

luckyjackgao@gmail.com

over 12 years ago

In reply to: Stuart Bishop (#5)

Re: streaming replication timeout error

Thank you all a lot!

I have got it.

Best regards

2013/10/10 Stuart Bishop <stuart@stuartbishop.net>

Show quoted text

On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote:

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could

not

receive data from WAL stream: could not receive data from server:

connection

timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

It is a network error. The TCP/IP socket died somehow, and the FATAL
error logged. PostgreSQL then fell back to using the restore_command
defined in your recovery.conf file, but the WAL file had not yet been
shipped and the output logged. I imagine streaming replication happily
reconnected soon after. This is all quite normal.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/