how to decrease the promotion time when performing a multiple failovers.....

Started by Shay Cohaviover 10 years ago4 messagesgeneral
Jump to latest
#1Shay Cohavi
cohavisi@gmail.com

*Hi,*
*I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
replication & continuos archiving).*
*I have created 2 failover & failback script in order to perform a
switchover between the DB servers:*
*1. failover - create a trigger file in order to promote the new primary.*
*2. failback - perform a base backup as mentions in :*
* a. start backup on the primary.*
* b. stop the failed node .*
didn't delete the DB directory on the failed node
* c. performing rsync between the nodes.*
* d.stopping the backup on the primary.*
* e.performing rsync on the pg_xlog.*
* f. creating a recovery.conf*

* standby_mode = 'on'*
* primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'*
* restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'*
* trigger_file = '/home/postgres/databases/fabrix/trigger'*
* archive_cleanup_command = 'ssh 10.50.1.153
/home/postgres/pg_utils/archive_cleanup.sh %r'*

* g. starting the failed node as secondary.*

*the switchover method:*
*1. stop the primary node.*
*2. promote the secondary node (failover.sh).*
*3. perform failback on the failed node.*
*4. start the failed node.*

*this method works great! *

*but if I perform multiple switchovers (>20), each time the new primary
gets promoted (trigger file) - it takes longer because it searches the
timelines on the archive. *
*is there any way to prevent the multiple 'scp' archive commands which
makes the promotion longer! *

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG: trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009300000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009200000002000000DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0000009100000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009000000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/00000095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG: restored log file "00000094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started*

*this could take for a least 1 min.....or more.*

*is there any way to skip the timeline searching in order to decrease the
promotion?*

*Thanks,*
*ShayC*

#2Shay Cohavi
cohavisi@gmail.com
In reply to: Shay Cohavi (#1)
Fwd: how to decrease the promotion time when performing a multiple failovers.....

*Hi,*
*I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
replication & continuos archiving).*
*I have created 2 failover & failback script in order to perform a
switchover between the DB servers:*
*1. failover - create a trigger file in order to promote the new primary.*
*2. failback - perform a base backup as mentions in :*
* a. start backup on the primary.*
* b. stop the failed node .*
didn't delete the DB directory on the failed node
* c. performing rsync between the nodes.*
* d.stopping the backup on the primary.*
* e.performing rsync on the pg_xlog.*
* f. creating a recovery.conf*

* standby_mode = 'on'*
* primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'*
* restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'*
* trigger_file = '/home/postgres/databases/fabrix/trigger'*
* archive_cleanup_command = 'ssh 10.50.1.153
/home/postgres/pg_utils/archive_cleanup.sh %r'*

* g. starting the failed node as secondary.*

*the switchover method:*
*1. stop the primary node.*
*2. promote the secondary node (failover.sh).*
*3. perform failback on the failed node.*
*4. start the failed node.*

*this method works great! *

*but if I perform multiple switchovers (>20), each time the new primary
gets promoted (trigger file) - it takes longer because it searches the
timelines on the archive. *
*is there any way to prevent the multiple 'scp' archive commands which
makes the promotion longer! *

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG: trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009300000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009200000002000000DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0000009100000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009000000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/00000095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG: restored log file "00000094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started*

*this could take for a least 1 min.....or more.*

*is there any way to skip the timeline searching in order to decrease the
promotion?*

*Thanks,*
*ShayC*

#3Vladimir Borodin
root@simply.name
In reply to: Shay Cohavi (#2)
Re: how to decrease the promotion time when performing a multiple failovers.....

1 янв. 2016 г., в 9:29, Shay Cohavi <cohavisi@gmail.com> написал(а):

Hi,
I have postgresql 9.3 setup with 2 nodes (active/standby with streaming replication & continuos archiving).
I have created 2 failover & failback script in order to perform a switchover between the DB servers:
1. failover - create a trigger file in order to promote the new primary.
2. failback - perform a base backup as mentions in :
a. start backup on the primary.
b. stop the failed node .
didn't delete the DB directory on the failed node
c. performing rsync between the nodes.
d.stopping the backup on the primary.
e.performing rsync on the pg_xlog.
f. creating a recovery.conf

standby_mode = 'on'
primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'
restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'
trigger_file = '/home/postgres/databases/fabrix/trigger'
archive_cleanup_command = 'ssh 10.50.1.153 /home/postgres/pg_utils/archive_cleanup.sh %r'

g. starting the failed node as secondary.

the switchover method:
1. stop the primary node.
2. promote the secondary node (failover.sh).
3. perform failback on the failed node.
4. start the failed node.

this method works great!

but if I perform multiple switchovers (>20), each time the new primary gets promoted (trigger file) - it takes longer because it searches the timelines on the archive.
is there any way to prevent the multiple 'scp' archive commands which makes the promotion longer!

for example:

[2015-12-12 20:35:10.769 IST] LOG: trigger file found: /home/postgres/databases/fabrix/trigger
[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due to administrator command
scp: /home/postgres/archive/0000009400000002000000DC: No such file or directory
[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168
[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100
scp: /home/postgres/archive/0000009400000002000000DC: No such file or directory
scp: /home/postgres/archive/0000009300000002000000DC: No such file or directory
scp: /home/postgres/archive/0000009200000002000000DC: No such file or directory
.
.
.

scp: /home/postgres/archive/0000009100000002000000DC: No such file or directory
scp: /home/postgres/archive/0000009000000002000000DC: No such file or directory
scp: /home/postgres/archive/00000095.history: No such file or directory
[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149
[2015-12-12 20:35:11.931 IST] LOG: restored log file "00000094.history" from archive
[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete
[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept connections
[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started

this could take for a least 1 min.....or more.

is there any way to skip the timeline searching in order to decrease the promotion?

You should add recovery_target_timeline = 'latest' to your recovery.conf [0]http://www.postgresql.org/docs/9.3/static/warm-standby.html <http://www.postgresql.org/docs/9.3/static/warm-standby.html&gt; <...> If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline to latest, to make the standby server follow the timeline change that occurs at failover to another standby. <...>.

[0]: http://www.postgresql.org/docs/9.3/static/warm-standby.html <http://www.postgresql.org/docs/9.3/static/warm-standby.html&gt; <...> If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline to latest, to make the standby server follow the timeline change that occurs at failover to another standby. <...>
<...>
If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline to latest, to make the standby server follow the timeline change that occurs at failover to another standby.
<...>

Thanks,
ShayC

--
May the force be with you…
https://simply.name

#4Shay Cohavi
cohavisi@gmail.com
In reply to: Vladimir Borodin (#3)
Re: how to decrease the promotion time when performing a multiple failovers.....

Thanks for the replay.....
Can you please explain the long failover according to the log:

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG: trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009300000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009200000002000000DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0000009100000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009000000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/00000095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG: restored log file "00000094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started*

*And how the recovery_target_timeline = 'latest' should minimize the
promotion time??*

Thanks,
ShayC

On Fri, Jan 1, 2016 at 12:55 PM, Vladimir Borodin <root@simply.name> wrote:

Show quoted text

1 янв. 2016 г., в 9:29, Shay Cohavi <cohavisi@gmail.com> написал(а):

*Hi,*
*I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
replication & continuos archiving).*
*I have created 2 failover & failback script in order to perform a
switchover between the DB servers:*
*1. failover - create a trigger file in order to promote the new primary.*
*2. failback - perform a base backup as mentions in :*
* a. start backup on the primary.*
* b. stop the failed node .*
didn't delete the DB directory on the failed node
* c. performing rsync between the nodes.*
* d.stopping the backup on the primary.*
* e.performing rsync on the pg_xlog.*
* f. creating a recovery.conf*

* standby_mode = 'on'*
* primary_conninfo = 'host=10.50.1.153 port=5432 user=usr
password=pass'*
* restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'*
* trigger_file = '/home/postgres/databases/fabrix/trigger'*
* archive_cleanup_command = 'ssh 10.50.1.153
/home/postgres/pg_utils/archive_cleanup.sh %r'*

* g. starting the failed node as secondary.*

*the switchover method:*
*1. stop the primary node.*
*2. promote the secondary node (failover.sh).*
*3. perform failback on the failed node.*
*4. start the failed node.*

*this method works great! *

*but if I perform multiple switchovers (>20), each time the new primary
gets promoted (trigger file) - it takes longer because it searches the
timelines on the archive. *
*is there any way to prevent the multiple 'scp' archive commands which
makes the promotion longer! *

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG: trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009300000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009200000002000000DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0000009100000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009000000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/00000095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG: restored log file "00000094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started*

*this could take for a least 1 min.....or more.*

*is there any way to skip the timeline searching in order to decrease the
promotion?*

You should add recovery_target_timeline = 'latest' to your recovery.conf
[0].

[0] http://www.postgresql.org/docs/9.3/static/warm-standby.html
<...>
If you plan to have multiple standby servers for high availability
purposes, set recovery_target_timeline to latest, to make the standby
server follow the timeline change that occurs at failover to another
standby.
<...>

*Thanks,*
*ShayC*

--
May the force be with you…
https://simply.name