How to start slave after pg_basebackup. Why min_wal_size and wal_keep_segments are duplicated
Hi!
pg_basebackup takes 8 hours.
After it is finished, replication slave does not start:
LOG: consistent recovery state reached at 2DE/985A5BE0
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 2DE/99000000 on timeline 1
replikaator@[unknown] LOG: received replication command: SHOW data_directory_mode
replikaator@[unknown] LOG: received replication command: IDENTIFY_SYSTEM
replikaator@[unknown] LOG: received replication command: START_REPLICATION 2CF/E9000000 TIMELIN
replikaator@[unknown] ERROR: requested WAL segment 00000001000002CF000000E9 has already been re
replikaator@[unknown] LOG: received replication command: SHOW data_directory_mode
replikaator@[unknown] LOG: received replication command: IDENTIFY_SYSTEM
replikaator@[unknown] LOG: received replication command: START_REPLICATION 2CF/E9000000 TIMELIN
replikaator@[unknown] ERROR: requested WAL segment 00000001000002CF000000E9 has already been removed
...
i tried it again and same error occured.
How to force replication to start?
I increased wal parameters in master to
wal_compression=on
max_wal_size = 5GB
min_wal_size = 4GB # was 80MB
wal_keep_segments= 360 # was 180
Will this allow replication to start after pg_basebackup ?
According to doc min_wal_size and wal_keep_segments both keep the minimum number of wal segments for replication.
Why those parameters are duplicated?
Andrus.
On 5/31/20 12:47 PM, Andrus wrote:
Hi!
pg_basebackup takes 8 hours.
After it is finished,ļæ½ replication slave does not start:LOG:ļæ½ consistent recovery state reached at 2DE/985A5BE0
LOG:ļæ½ database system is ready to accept read only connections
LOG:ļæ½ started streaming WAL from primary at 2DE/99000000 on timeline 1
replikaator@[unknown] LOG:ļæ½ received replication command: SHOW
data_directory_mode
replikaator@[unknown] LOG:ļæ½ received replication command: IDENTIFY_SYSTEM
replikaator@[unknown] LOG:ļæ½ received replication command:
START_REPLICATION 2CF/E9000000 TIMELIN
replikaator@[unknown] ERROR:ļæ½ requested WAL segment
00000001000002CF000000E9 has already been re
replikaator@[unknown] LOG:ļæ½ received replication command: SHOW
data_directory_mode
replikaator@[unknown] LOG:ļæ½ received replication command: IDENTIFY_SYSTEM
replikaator@[unknown] LOG:ļæ½ received replication command:
START_REPLICATION 2CF/E9000000 TIMELIN
replikaator@[unknown] ERROR:ļæ½ requested WAL segment
00000001000002CF000000E9 has already been removed
There's your problem ^
...
i tried it again and same error occured.
How to force replication to start?
If the WAL is gone you can't.
More below.
I increased wal parameters in master to
wal_compression=on
max_wal_size = 5GB
min_wal_size = 4GB # was 80MB
wal_keep_segments= 360 # was 180Will this allow replication to start after pg_basebackup ?
According to doc min_wal_size and wal_keep_segments both keep the
minimum number of wal segments for replication.
No it doesn't:
https://www.postgresql.org/docs/12/runtime-config-replication.html
"wal_keep_segments (integer)
Specifies the minimum number of past log file segments kept in the
pg_wal directory, in case a standby server needs to fetch them for
streaming replication. Each segment is normally 16 megabytes. If a
standby server connected to the sending server falls behind by more than
wal_keep_segments segments, the sending server might remove a WAL
segment still needed by the standby, in which case the replication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
connection will be terminated. Downstream connections will also
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
eventually fail as a result. (However, the standby server can recover by
fetching the segment from archive, if WAL archiving is in use.)
...
https://www.postgresql.org/docs/12/runtime-config-wal.html
"min_wal_size (integer)
As long as WAL disk usage stays below this setting, old WAL files
are always recycled for future use at a checkpoint, rather than removed.
This can be used to ensure that enough WAL space is reserved to handle
spikes in WAL usage, for example when running large batch jobs. If this
value is specified without units, it is taken as megabytes. The default
is 80 MB. This parameter can only be set in the postgresql.conf file or
on the server command line.
"
I'm guessing are looking for:
https://www.postgresql.org/docs/12/runtime-config-replication.html
"
26.2.6. Replication Slots
Replication slots provide an automated way to ensure that the master
does not remove WAL segments until they have been received by all
standbys, and that the master does not remove rows which could cause a
recovery conflict even when the standby is disconnected.
...
"
This is spelled out here:
https://www.postgresql.org/docs/12/warm-standby.html#STREAMING-REPLICATION
"If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the standby
has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_segments to a value large enough to ensure that WAL segments
are not recycled too early, or by configuring a replication slot for the
standby. If you set up a WAL archive that's accessible from the standby,
these solutions are not required, since the standby can always use the
archive to catch up provided it retains enough segments."
Why those parameters are duplicated?
Andrus.
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Andrus,
On 31. May, 2020, at 21:47, Andrus <kobruleht2@hot.ee> wrote:
replikaator@[unknown] ERROR: requested WAL segment 00000001000002CF000000E9 has already been removed
the message says it all. You need to copy the WAL file 00000001000002CF000000E9 and newer to the replica's pg_wal directory because it has been removed already on the master site. Obviously, you can only do that if the files have been archived. Otherwise, you'd have to fully reinitiate the replica.
Replication will start again as soon as the requested WALs are copied over to the replica.
Use a replication slot to avoid this situation. If you use a replication slot, the master will only remove WAL files which are not needed by any one replica.
Hope this helps.
Cheers,
Paul
Hi!
On 31. May, 2020, at 21:47, Andrus <kobruleht2@hot.ee> wrote:
replikaator@[unknown] ERROR: requested WAL segment 00000001000002CF000000E9 has already been removed
the message says it all. You need to copy the WAL file 00000001000002CF000000E9 and newer to the replica's pg_wal directory because
it has been removed >already on the master site.
Obviously, you can only do that if the files have been archived.
wal files are not archieved.
Otherwise, you'd have to fully reinitiate the replica.
I have tried to re-initiate replica serveral times in low-use time but this error occurs again.
Replication will start again as soon as the requested WALs are copied over to the replica.
Use a replication slot to avoid this situation. If you use a replication slot, the master will only remove WAL files which are not
needed by any one replica.
If named replication slot is used commands like
vacuumdb --all --full
will cause main server crash due to disk space limit. pg_wal directory will occupy free disk space. After that main server stops.
I tried using wal_keep_segments =180
Will setting wal_keep_segments to higher value allw replication start after pg_basebackup ?
Andrus.
Hi!
I'm guessing are looking for:
https://www.postgresql.org/docs/12/runtime-config-replication.html
26.2.6. Replication Slots
Replication slots provide an automated way to ensure that the master
does not remove WAL segments until they have been received by all
standbys, and that the master does not remove rows which could cause a
recovery conflict even when the standby is disconnected.
Using replication slot can cause pg_wal directoy to occupy all free disk space and after that server stop respondig.
This is spelled out here:
https://www.postgresql.org/docs/12/warm-standby.html#STREAMING-REPLICATION
"If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the standby
has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_segments to a value large enough to ensure that WAL segments
are not recycled too early, or by configuring a replication slot for the
standby.
Will wal_keep_segments keep segments also if named replication slot is lot used ?
Andrus.
On 5/31/20 2:03 PM, Andrus wrote:
Hi!
I'm guessing are looking for:
https://www.postgresql.org/docs/12/runtime-config-replication.html
26.2.6. Replication Slots
Replication slots provide an automated way to ensure that the master
does not remove WAL segments until they have been received by all
standbys, and that the master does not remove rows which could cause a
recovery conflict even when the standby is disconnected.Using replication slot can cause pg_wal directoy to occupy all free disk
space and after that server stop respondig.This is spelled out here:
https://www.postgresql.org/docs/12/warm-standby.html#STREAMING-REPLICATION"If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the
standby has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_segments to a value large enough to ensure that WAL segments
are not recycled too early, or by configuring a replication slot for
the standby.Will wal_keep_segmentsļæ½ keep segments also if named replication slot is
lot used ?
Well if you are using a replication slot there is no point in using
wal_keep_segments. Slots where created in, part at least, so you did not
have to guess at a wal_keep_segments number.
To really answer this we will need to see the exact commands you are
using and the sequence they are done in.
Andrus.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 5/31/20 2:03 PM, Andrus wrote:
Hi!
I'm guessing are looking for:
https://www.postgresql.org/docs/12/runtime-config-replication.html
26.2.6. Replication Slots
Replication slots provide an automated way to ensure that the master
does not remove WAL segments until they have been received by all
standbys, and that the master does not remove rows which could cause a
recovery conflict even when the standby is disconnected.Using replication slot can cause pg_wal directoy to occupy all free disk
space and after that server stop respondig.This is spelled out here:
https://www.postgresql.org/docs/12/warm-standby.html#STREAMING-REPLICATION"If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the
standby has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_segments to a value large enough to ensure that WAL segments
are not recycled too early, or by configuring a replication slot for
the standby.Will wal_keep_segmentsļæ½ keep segments also if named replication slot is
lot used ?
In addition to my most recent questions:
What are you trying to achieve?
In other words why do a pg_basebackup if you have a standby receiving WALs?
Andrus.
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Andrus,
On 31. May, 2020, at 22:56, Andrus <kobruleht2@hot.ee> wrote:
wal files are not archieved.
IMHO a bad decision. They should be. Now you're in the situation where you see why.
I have tried to re-initiate replica serveral times in low-use time but this error occurs again.
remove the whole replica's PGDATA/* and do a pg_basebackup again. But before that, make sure wal_keep_segments in big enough on the master and, just as much important, do a vacuumdb -a (takes much space during the process) and use archiving!
If named replication slot is used commands like
vacuumdb --all --full
will cause main server crash due to disk space limit. pg_wal directory will occupy free disk space. After that main server stops.
if you have disk constraints you will run into trouble sooner or later anyway. Make sure, you have enough disk space. There's no way around that anyway.
I tried using wal_keep_segments =180
Will setting wal_keep_segments to higher value allw replication start after pg_basebackup ?
it depends. If you start the replica immediately and don't wait for hours or days, you should be good to go. But that depends on different factors, for example, how many WAL files are written during the pg_basebackup and pg_ctl start of the replica. If more than 180 WALs have gone by on the master because it is really busy, then you're probably lost again. Point being, you'll have to launch the replica before WALs are expired!
Again: Make sure you have enough disk space, use archiving and use a replication slot.
Cheers,
Paul
Hi!
In addition to my most recent questions:
What are you trying to achieve?
I want to create hot standby async server using
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainold
pg_basebackup --write-recovery-conf -D /var/lib/postgresql/12/main
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql start
In other words why do a pg_basebackup if you have a standby receiving WALs?
I dont receive WALs.
Andrus.
Hi!
Will wal_keep_segments keep segments also if named replication slot is
lot used ?Well if you are using a replication slot there is no point in using
wal_keep_segments. Slots where created in, part at least, so you did not
have to guess at a wal_keep_segments number.
I dont use slot.
To really answer this we will need to see the exact commands you are
using and the sequence they are done in.
Replication server is created using
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainold
pg_basebackup --write-recovery-conf -D /var/lib/postgresql/12/main
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql start
Andrus.
On 5/31/20 2:43 PM, Andrus wrote:
Hi!
In addition to my most recent questions:
What are you trying to achieve?I want to createļæ½ hot standby async server using
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainold
pg_basebackup --write-recovery-conf -D /var/lib/postgresql/12/main
I don't see where the base backup is being taken from just where it is
going.
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql startIn other words why do a pg_basebackup if you have a standby receiving
WALs?I dont receive WALs.
If you are doing binary replication then you are receiving WALs. It just
a matter of whether you are streaming them or shipping them over complete.
Andrus.
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi!
I want to create hot standby async server using
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainold
pg_basebackup --write-recovery-conf -D /var/lib/postgresql/12/mainI don't see where the base backup is being taken from just where it is going.
It is taken from VPS server over 20 Mbit public internet connection.
Both servers are running Debian Linux.
I dont receive WALs.
If you are doing binary replication then you are receiving WALs. It just a matter of whether you are streaming them or shipping
them over complete.
Using wal_keep_segments=360 also causes same wal file not found error after pg_basebackup.
master server has 6GB wal files. wal log during pg_basebackup is much slower than 360.
Maybe pg_basebackup skips wal segments . Maybe using wal_compression=on causes the issue.
How to fix this ?
How to create base backup so that cluster is tranferred over internet faster? Maybe it can transferred in compressed form over
internet.
Andrus.
Hi!
I have tried to re-initiate replica serveral times in low-use time but this error occurs again.
remove the whole replica's PGDATA/* and do a pg_basebackup again. But before that, make sure wal_keep_segments in big enough on the
master and,
I renamed whole cluster before pg_basebackup
just as much important, do a vacuumdb -a (takes much space during the process) and use archiving!
I run vacuumdb --full --all before pg_basebackup
If named replication slot is used commands like
vacuumdb --all --full
will cause main server crash due to disk space limit. pg_wal directory will occupy free disk space. After that main server stops.if you have disk constraints you will run into trouble sooner or later anyway. Make sure, you have enough disk space. There's no
way around that anyway.
This space is sufficient for base backup and replication.
I tried using wal_keep_segments =180
Will setting wal_keep_segments to higher value allw replication start after pg_basebackup ?it depends. If you start the replica immediately and don't wait for hours or days, you should be good to go. But that depends on
different factors, for example, how >many WAL files are written during the pg_basebackup and pg_ctl start of the replica. If more
than 180 WALs have gone by on the master because it is really busy, >then you're probably lost again. Point being, you'll have to
launch the replica before WALs are expired!
Again: Make sure you have enough disk space, use archiving and use a replication slot.
I tried with wal_keep_segments=360 but problem persisists.
Server generates lot of less than 300 wal files.
Shell script starts server after pg_basebackup completes automatically:
PGHOST=example.com
PGPASSWORD=mypass
PGUSER=replikaator
export PGHOST PGPASSWORD PGUSER
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainennebaasbakuppi
pg_basebackup --verbose --progress --write-recovery-conf -D /var/lib/postgresql/12/main
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql start
How to create replication server ?
Andrus.
Hi Andrus,
On 01. Jun, 2020, at 10:17, Andrus <kobruleht2@hot.ee> wrote:
Shell script starts server after pg_basebackup completes automatically:PGHOST=example.com
PGPASSWORD=mypass
PGUSER=replikaator
export PGHOST PGPASSWORD PGUSER
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainennebaasbakuppi
pg_basebackup --verbose --progress --write-recovery-conf -D /var/lib/postgresql/12/main
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql startHow to create replication server ?
I always do it this way and it work for me:
$ pg_basebackup -h ${PGHOST} -p ${PGPORT} -U replicator -W -R -D ${PGDATA} -P -v -Fp -Xs
After that, I edit ${PGDATA}/postgresql.conf and (w/ PostgreSQL 11 and older ${PGDATA}/recovery.conf) to make it do what I want and then I just launch it:
$ pg_ctl start
From that moment onward, it replicates and applies to the replica. Checks in pg_stat_replication on the master and pg_stat_wal_receiver on the replica confirm that. They also show the WAL switches.
To provoke a WAL switch I always do:
postgres=# checkpoint; select pg_switch_wal();
CHECKPOINT
pg_switch_wal
---------------
C/50000128
(1 row)
I just don't understand what you're trying to achieve here. My guess is, you want to stop and backup the old database cluster, then create a new one in its old directory, right? In this case, you probably need to change your script to something like this:
PGHOST=remote.example.com
PGPASSWORD=mypass
PGUSER=replikaator
PGDATA=/var/lib/postgresql/12/main
export PGHOST PGPASSWORD PGUSER PGDATA
/etc/init.d/postgresql stop
mv ${PGDATA} /var/lib/postgresql/12/mainennebaasbakuppi
pg_basebackup -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -W -R -D ${PGDATA} -P -v -Fp -Xs
/etc/init.d/postgresql start
Note that my invocation of pg_basebackup asks for the replicator password. This is intended. You'd probably want to change that.
Also, no need to play around with ownership and permissions. Do it as "postgres", not as "root".
Cheers,
Paul
On Mon, Jun 1, 2020 at 10:17 AM Andrus <kobruleht2@hot.ee> wrote:
Hi!
I have tried to re-initiate replica serveral times in low-use time but
this error occurs again.
remove the whole replica's PGDATA/* and do a pg_basebackup again. But
before that, make sure wal_keep_segments in big enough on the
master and,
I renamed whole cluster before pg_basebackup
just as much important, do a vacuumdb -a (takes much space during the
process) and use archiving!
I run vacuumdb --full --all before pg_basebackup
If named replication slot is used commands like
vacuumdb --all --full
will cause main server crash due to disk space limit. pg_wal directorywill occupy free disk space. After that main server stops.
if you have disk constraints you will run into trouble sooner or later
anyway. Make sure, you have enough disk space. There's no
way around that anyway.
This space is sufficient for base backup and replication.
I tried using wal_keep_segments =180
Will setting wal_keep_segments to higher value allw replication startafter pg_basebackup ?
it depends. If you start the replica immediately and don't wait for hours
or days, you should be good to go. But that depends on
different factors, for example, how >many WAL files are written during
the pg_basebackup and pg_ctl start of the replica. If more
than 180 WALs have gone by on the master because it is really busy, >then
you're probably lost again. Point being, you'll have to
launch the replica before WALs are expired!
Again: Make sure you have enough disk space, use archiving and use areplication slot.
I tried with wal_keep_segments=360 but problem persisists.
Server generates lot of less than 300 wal files.
Have you verified that wal_keep_segments actually end up at 360, by
connecting to the database and issuing SHOW wal_keep_segments? I've seen
far too many examples of people who accidentally had a second line that
overrode the one they thought they changed, and thus still ran with a lower
number.
Shell script starts server after pg_basebackup completes automatically:
PGHOST=example.com
PGPASSWORD=mypass
PGUSER=replikaator
export PGHOST PGPASSWORD PGUSER
/etc/init.d/postgresql stop
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/mainennebaasbakuppi
pg_basebackup --verbose --progress --write-recovery-conf -D
/var/lib/postgresql/12/main
chmod --recursive --verbose 0700 /var/lib/postgresql/12/main
chown -Rv postgres:postgres /var/lib/postgresql/12/main
/etc/init.d/postgresql start
Do you get any useful output from the -v part of pg_basebackup? It should
for example tell you the exact start and stop point in the wal during the
basebackup, that can be correlated to the msising file.
Normally the window between end of pg_basebackup and start of the actual
service is not big enough to cause a problem (since v12 will do a streaming
receive of the logs *during* the backup -- it could be a big problem before
that was possible, or if one forgot to enable it before it was the
default), and it certainly sounds weird that it should be in your case,
unless the chmod and chown commands take a *long* time. But if it is, there
is nothing preventing you from creating a slot just during setup and then
get rid of it. That is:
1. create slot
2. pg_basebackup with slot
3. start replication with slot
4. restart replication without slot once it's caught up
5. drop slot
However, if you want reliable replication, you really should have a slot.
Or at least, you should have either a slot *or* log archiving that's
read-accessible from the replica.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
Hi!
How to create replication server ?
I always do it this way and it work for me:
$ pg_basebackup -h ${PGHOST} -p ${PGPORT} -U replicator -W -R -D ${PGDATA} -P -v -Fp -Xs
After that, I edit ${PGDATA}/postgresql.conf and (w/ PostgreSQL 11 and older ${PGDATA}/recovery.conf) to make it do what I want and
then I just launch it:
$ pg_ctl start
My script does the same thing as your comands.
From that moment onward, it replicates and applies to the replica. Checks in pg_stat_replication on the master and
pg_stat_wal_receiver on the replica >confirm that. They also show the WAL switches.
To provoke a WAL switch I always do:
postgres=# checkpoint; select pg_switch_wal();
CHECKPOINT
pg_switch_wal
I just don't understand what you're trying to achieve here.
I want to create replication server.
My guess is, you want to stop and backup the old database cluster,
Old cluster is empty, from initdb. Backup is not needed
then create a new one in its old directory, right?
pg_basebackup creates new main directory.
In this case, you probably need to change your script to something like this:
PGHOST=remote.example.com
PGPASSWORD=mypass
PGUSER=replikaator
PGDATA=/var/lib/postgresql/12/main
export PGHOST PGPASSWORD PGUSER PGDATA
/etc/init.d/postgresql stop
mv ${PGDATA} /var/lib/postgresql/12/mainennebaasbakuppi
pg_basebackup -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -W -R -D ${PGDATA} -P -v -Fp -Xs
/etc/init.d/postgresql start
pg_basebackup uses environment varuables if not specified in command line. So my script does the same thing.
Note that my invocation of pg_basebackup asks for the replicator password. This is intended. You'd probably want to change that.
Also, no need to play around with ownership and permissions. Do it as "postgres", not as "root".
I tried
sudo --user=postgres pg_basebackup ....
but got error
could not change directory to "/root": Permission denied
Andrus.
Hi Andrus,
On 01. Jun, 2020, at 12:26, Andrus <kobruleht2@hot.ee> wrote:
My script does the same thing as your comands.
no, it does not. And as long as you refuse to understand that, I can't help you.
I want to create replication server.
no, you want to create a streaming replica.
Old cluster is empty, from initdb. Backup is not needed
then why the mv stuff?
pg_basebackup uses environment varuables if not specified in command line. So my script does the same thing.
have you looked at the other options?
I tried
sudo --user=postgres pg_basebackup ....
but got error
could not change directory to "/root": Permission denied
try
sudo su - postgres pg_basebackup ....
Note the "-"! It's essential to get the environment of postgres, which is what you want.
Cheers,
Paul