WAL Archive Cleanup?
Hello,
We're evaluating PostgreSQL for use with Artifactory in our environment.
PostgreSQL seems like the obvious choice because it provides hot-standby
replication. I've followed several guides I've found by googling "postgres
replication how to" (i.e.: this one from DigitalOcean
<https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps>,
though I did it on Ubuntu 18 and CentOS 7), was able to make some test
inserts, and everything seemed to be working well.
Fast forward two weeks, this cluster has been running but not seeing any
traffic. And my master server has filled its archive directory. I found
an older thread
</messages/by-id/417C5AF7C228B94490192951394BEFE7B4805F@AIPHLEXDAG01B.airgas.com>
that seemed to indicate that the wal_keep_segments is what was causing psql
to keep so many WAL files... However, mine is set to 8, but there were
thousands of log files...
I was able to delete everything in the archive/ directory, start the
database back up, and there was no data loss... which is fine in the lab,
but if we're going to roll this out to production, I'm concerned that we'll
be continually running into this issue every couple weeks...
There seems to be a pg_archivecleanup
<https://www.postgresql.org/docs/9.2/pgarchivecleanup.html> command, which
I could run as a cron job, but something tells me that isn't the
recommended way as no literature I've found recommends this...
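(For what it's worth, the kind of cron-driven cleanup I have in mind would be
something like the sketch below -- purely hypothetical, and assuming the
.backup history files land in the archive directory alongside the WAL files:

  # find the newest base-backup marker and drop everything older than it
  latest=$(cd /hab/svc/postgresql/data/archive && ls *.backup 2>/dev/null | sort | tail -n 1)
  [ -n "$latest" ] && pg_archivecleanup /hab/svc/postgresql/data/archive "$latest"

...but again, I haven't seen anything recommending that approach.)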
This SO thread
<https://stackoverflow.com/questions/21113210/archive-cleanup-command-when-is-called>
seems to indicate that archive_cleanup_command can be run every "restart
point", but googling "psql restart point" brings me to this page
<https://www.postgresql.org/docs/9.1/continuous-archiving.html> which a ^F
indicates does not mention a "restart point"... So this *feels* like the
configuration setting I want, but I can't find the documentation that
confirms it for me...
So at this point I'm kinda stumped. I could definitely add more space, but
if those WAL files are never cleaned up, even adding a 1TB store for the
archive is just delaying the inevitable.
Thanks!
-QBR
These are my configs:
postgresql.local.conf -
effective_cache_size=1500MB
shared_buffers=500MB
maintenance_work_mem=125MB
work_mem=5MB
temp_buffers=5MB
postgresql.conf -
listen_addresses = '0.0.0.0'
port = 5432
max_connections = 100
external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
authentication_timeout = 1min
max_files_per_process = 1000
max_locks_per_transaction = 64
logging_collector = on
log_directory = '/hab/svc/postgresql/var/pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
log_min_messages = ERROR
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
data_directory = '/hab/svc/postgresql/data/pgdata'
hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
wal_level = hot_standby
wal_log_hints = 'on'
hot_standby = 'on'
hot_standby_feedback = true
max_wal_senders = 5
max_replication_slots = 5
checkpoint_completion_target = 0.9
max_wal_size = 1GB
min_wal_size = 128MB
wal_keep_segments = 8
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
track_activity_query_size = 2048
track_io_timing=on
dynamic_shared_memory_type = 'none'
archive_mode = 'on'
archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
archive_timeout = '10min'
max_standby_archive_delay = '30s'
synchronous_commit = local
include '/hab/svc/postgresql/config/postgresql.local.conf'
On 3/21/19 11:51 AM, Foo Bar wrote:
Hello,
We're evaluating PostgreSQL for use with Artifactory in our
environment. PostgreSQL seems like the obvious choice because it
provides hot-standby replication. I've followed several guides I've
found by googling "postgres replication how to" (i.e.: this one from
DigitalOcean
<https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps>,
though I did it on Ubuntu 18 and CentOS 7), was able to make some test
inserts, and everything seemed to be working well.
Postgres version?
Fast forward two weeks, this cluster has been running but not seeing any
traffic. And my master server has filled its archive directory.
It's the standby that has not seen any traffic?
I found an older thread
</messages/by-id/417C5AF7C228B94490192951394BEFE7B4805F@AIPHLEXDAG01B.airgas.com>
that seemed to indicate that the wal_keep_segments is what was causing
psql to keep so many WAL files... However, mine is set to 8, but there
were thousands of log files...
FYI, psql is the Postgres client program, Postgres(ql) is the server.
You also have max_replication_slots = 5. See:
https://www.postgresql.org/docs/11/warm-standby.html#STREAMING-REPLICATION-SLOTS
"Replication slots provide an automated way to ensure that the master
does not remove WAL segments until they have been received by all
standbys, and that the master does not remove rows which could cause a
recovery conflict even when the standby is disconnected.
In lieu of using replication slots, it is possible to prevent the
removal of old WAL segments using wal_keep_segments, or by storing the
segments in an archive using archive_command. However, these methods
often result in retaining more WAL segments than required, whereas
replication slots retain only the number of segments known to be needed.
An advantage of these methods is that they bound the space requirement
for pg_wal; there is currently no way to do this using replication slots."
Looks like the slots take precedence.
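To make that concrete, a rough sketch of how a slot is wired up on 9.6 (the
slot name 'standby1' here is just an example):

  -- on the master:
  SELECT pg_create_physical_replication_slot('standby1');
  -- on the standby, in recovery.conf next to primary_conninfo:
  primary_slot_name = 'standby1'
  -- the master then keeps WAL only until the slot has consumed it; monitor with:
  SELECT slot_name, active, restart_lsn FROM pg_replication_slots;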
I was able to delete everything in the archive/ directory, start the
database back up, and there was no data loss... which is fine in the
lab, but if we're going to roll this out to production, I'm concerned
that we'll be continually running into this issue every couple weeks...
There seems to be a pg_archivecleanup
<https://www.postgresql.org/docs/9.2/pgarchivecleanup.html> command,
which I could run as a cron job, but something tells me that isn't the
recommended way as no literature I've found recommends this...
Slots will clean up after themselves once the segments are no longer
needed. The catch is that the segments need to be consumed by the
standby. What needs to be determined here is why the standby never
consumed the WAL's from the master? Do you still have the logs from the
standby and do they show anything relevant?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Mar 22, 2019 at 6:28 AM Foo Bar <qubitrenegade@gmail.com> wrote:
Fast forward two weeks, this cluster has been running but not seeing any
traffic. And my master server has filled its archive directory.
Are you sure it is the archive directory (/hab/svc/postgresql/data/archive)
which is filling up, and not the live directory (pg_wal or pg_xlog)? This
is often a point of confusion.
I found an older thread
</messages/by-id/417C5AF7C228B94490192951394BEFE7B4805F@AIPHLEXDAG01B.airgas.com>
that seemed to indicate that the wal_keep_segments is what was causing
psql to keep so many WAL files... However, mine is set to 8, but there
were thousands of log files...
wal_keep_segments applies to the live directory (pg_xlog or pg_wal), not
the archive directory.
This SO thread
<https://stackoverflow.com/questions/21113210/archive-cleanup-command-when-is-called>
seems to indicate that archive_cleanup_command can be run every "restart
point", but googling "psql restart point" brings me to this page
<https://www.postgresql.org/docs/9.1/continuous-archiving.html> which a
^F indicates does not mention a "restart point"... So this *feels* like
the configuration setting I want, but I can't find the documentation that
confirms it for me...
"psql" is the name of a specific command line tool used to connect to a
PostgreSQL database server, it is not the name of the database itself. The
database is usually abbreviated "pgsql". And "restartpoint" is usually
spelled as one word in technical discussions of it. Or at least, searching
for it that way avoids finding things which mention each word separately in
different senses.
A restartpoint is just a checkpoint which is run on the replica (as opposed
to a checkpoint proper, which runs on the master). And
archive_cleanup_command is executed on the replica, not on the master, so
to use it the replica has to have write access to the archive directory.
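For completeness, it typically just wraps pg_archivecleanup in recovery.conf
on the replica, roughly (using the archive path from your config):

  archive_cleanup_command = 'pg_archivecleanup /hab/svc/postgresql/data/archive %r'

%r expands to the name of the oldest WAL file still required by the last
restartpoint, and everything older than that gets removed from the archive.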
archive_cleanup_command is pretty much obsolete. The modern way to do this
is with streaming replication, using either replication slots or
wal_keep_segments. If the only reason you want an archive is for
replication, then use streaming replication and do away with the archive
completely. There are reasons other than replication that one might want
to keep a WAL archive, but those reasons don't seem to apply to you. And
if they did you almost certainly wouldn't want to run
archive_cleanup_command on it.
Cheers,
Jeff
On Fri, Mar 22, 2019 at 12:26:33PM -0400, Jeff Janes wrote:
archive_cleanup_command is pretty much obsolete. The modern way to do this
is with streaming replication, using either replication slots or
wal_keep_segments. If the only reason you want an archive is for
replication, then use streaming replication and do away with the archive
completely. There are reasons other than replication that one might want
to keep a WAL archive, but those reasons don't seem to apply to you. And
if they did you almost certainly wouldn't want to run
archive_cleanup_command on it.
Personally, I still find archives also very valuable when a standby
creation takes a long time because of a large initial base backup and
that the partition dedicated to pg_wal is not large enough to support
the retention associated with a slot, and it is easier to have larger
retention policies in the archives.
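In that kind of setup the standby's recovery.conf points at both the archive
and the primary, roughly like this (host and slot name are placeholders):

  standby_mode = 'on'
  primary_conninfo = 'host=primary.example.com port=5432 user=replication'
  primary_slot_name = 'standby1'
  restore_command = 'cp /path/to/archive/%f %p'

The standby replays what it can from the archive and switches to streaming
once it has caught up.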
--
Michael
Hello All,
Wow! Lots of awesome replies. I went away for the weekend thinking my email
had been rejected and came back to a full inbox. Thanks for all the help!
Postgres version?
9.6.11
Doh. Fairly important detail there. :)
FYI, psql is the Postgres client program, Postgres(ql) is the server.
"psql" is the name of a specific command line tool used to connect to a
PostgreSQL database server, it is not the name of the database itself. The
database is usually abbreviated "pgsql".
Duly noted, thanks for the correction.
It's the standby that has not seen any traffic?
There's really no traffic. I built three nodes, connected them, created a
test table and inserted some values, then left the cluster be for a couple
weeks.
And "restartpoint" is usually spelled as one work in technical
discussions of it. Or at least, searching for it that way avoids finding
things which mention each word separately in different senses.
Ah ha. I had seen it that way but thought it was a typo. Thanks for the
clarification!
Are you sure it is the archive
directory (/hab/svc/postgresql/data/archive) which is filling up, and not
the live directory (pg_wal or pg_xlog)? This is often a point of confusion.
Right before I sent the mail last week I deleted everything in
/hab/svc/postgresql/data/archive, this morning I'm seeing:
# du -h --max=1 /hab/svc/postgresql/data/
198M /hab/svc/postgresql/data/pgdata
8.9G /hab/svc/postgresql/data/archive
9.1G /hab/svc/postgresql/data/
# du -hs /hab/svc/postgresql/data/pgdata/pg_xlog/
177M /hab/svc/postgresql/data/pgdata/pg_xlog/
# ls -lah /hab/svc/postgresql/data/archive/ | wc -l
571
There is no pg_wal directory though (should there be?)
# find /hab/svc/postgresql/ -name '*pg*wal*'
#
If the only reason you want an archive is for replication, then use
streaming replication and do away with the archive completely
To be honest, I thought it was required for streaming replication based on
the guides linked above.
There are reasons other than replication that one might want to keep a
WAL archive, but those reasons don't seem to apply to you
Like backup maybe? A whole other topic: we recently had a power outage
and I lost a pgsql node... having an archive would allow me to "replay" any
transactions?
What needs to be determined here is why the standby never consumed the
WAL's from the master?
Ok, so it's the standby that's the problem.
Do you still have the logs from the standby and do they show anything
relevant?
Sure, what am I looking for? I see a bunch of entries like:
2019-03-08 17:06:11 GMT [1813]: [815-1] user=,db=,client= (0:00000)LOG:
restartpoint complete: wrote 22 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 1 recycled; write=2.211 s, sync=0.062 s, total=2.281 s;
sync files=18, longest=0.062 s, average=0.003 s; distance=16383 kB,
estimate=16383 kB
2019-03-08 17:06:11 GMT [1813]: [816-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 0/8D000028
On the 15th, around when I think I filled the disk, I see a bunch of:
cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/000000010000000400000049': No such file or directory
2019-03-15 23:59:49 GMT [16691]: [1-1] user=,db=,client= (0:XX000)FATAL:
could not connect to the primary server: could not connect to server:
Connection refused
Is the server running on host "172.16.10.23" and accepting
TCP/IP connections on port 5432?
Which makes sense since the pgsql service was down.
This appears to be when I recovered the master on Thursday:
cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/00000001000000040000004D': No such file or directory
2019-03-21 17:37:31 GMT [31338]: [1-1] user=,db=,client= (0:XX000)FATAL:
could not connect to the primary server: could not connect to server:
Connection refused
Is the server running on host "172.16.10.23" and accepting
TCP/IP connections on port 5432?
cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/00000001000000040000004D': No such file or directory
2019-03-21 17:37:36 GMT [31343]: [1-1] user=,db=,client= (0:00000)LOG:
started streaming WAL from primary at 4/4D000000 on timeline 1
2019-03-21 17:37:47 GMT [30711]: [3151-1] user=,db=,client= (0:00000)LOG:
restartpoint starting: time
2019-03-21 17:37:47 GMT [30711]: [3152-1] user=,db=,client= (0:00000)LOG:
restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 1 recycled; write=0.003 s, sync=0.000 s, total=0.007 s;
sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB,
estimate=16384 kB
Then there's a bunch of the same entries where about the only thing (other
than the timestamp) that seems to change is the index in:
2019-03-23 23:33:40 GMT [30711]: [5094-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/90000140
2019-03-23 23:38:40 GMT [30711]: [5097-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/91000028
2019-03-23 23:43:40 GMT [30711]: [5100-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/91000108
I see effectively the same logs on the other hot-standby pgsql node.
This is my config on my standby node:
listen_addresses = '0.0.0.0'
port = 5432
max_connections = 100
external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
authentication_timeout = 1min
max_files_per_process = 1000
max_locks_per_transaction = 64
logging_collector = on
log_directory = '/hab/svc/postgresql/var/pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
log_min_messages = ERROR
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
data_directory = '/hab/svc/postgresql/data/pgdata'
hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
wal_level = hot_standby
wal_log_hints = 'on'
hot_standby = 'on'
hot_standby_feedback = true
max_wal_senders = 5
max_replication_slots = 5
checkpoint_completion_target = 0.9
max_wal_size = 1GB
min_wal_size = 128MB
wal_keep_segments = 8
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
track_activity_query_size = 2048
track_io_timing=on
dynamic_shared_memory_type = 'none'
archive_mode = 'on'
archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
archive_timeout = '10min'
max_standby_archive_delay = '30s'
synchronous_commit = local
include '/hab/svc/postgresql/config/postgresql.local.conf'
Should I disable archive mode? Even though I'm not currently using it, it
seems like there's a use case for having it? And if I can figure out
what's causing the backlog of archive files it should still be manageable?
Thanks again for all the replies, while it hasn't solved the problem yet,
this was incredibly helpful! Also, please don't hate me for munging all
your replies into one reply... I thought it might be easier to follow than
having three different branches...?
Best Regards,
- QBR
Hello All,
Ok, so maybe something helpful? On my master node I am seeing a bunch of:
2019-03-28 23:54:44 GMT [2611]: [1-1]
user=admin,db=admin,client=17210.10.37 172.10.10.37(57552) (0:3D000)FATAL:
database "admin" does not exist
2019-03-28 23:54:46 GMT [2613]: [1-1]
user=admin,db=admin,client=172.10.10.18 172.10.10.18(56970)
(0:3D000)FATAL: database "admin" does not exist
Which is accurate, as there is no admin database... I usually connect with
psql -h localhost -U admin postgres
Should there be? Will this fix my issue with pgsql filling up the disk?
Thanks,
- QBR
Hello All,
Ok, so creating the admin database has not enabled my cluster to clean up
the `archive/` directory... It did eliminate the errors, though I'm
dubious as to whether that was the correct solution...
Everything I'm able to google talks about setting up archive replication,
but not how to keep that directory in check... I did find one link
<https://support.qasymphony.com/hc/en-us/articles/360006834212-Continuous-WAL-Archiving-for-Windows>
that talks about how to clean up obsolete archive files... but I
thought I didn't want to use `pg_archivecleanup`?
Also, I don't really understand how `pg_basebackup` is supposed to work, as
you can only ever run it once per directory?
E.g.: if i wanted to run `pg_basebackup` weekly, I'd have to create a new
directory every time?
# pg_basebackup --verbose --host 127.0.0.1 --username replication --xlog
--progress --pgdata /hab/svc/postgresql/data/archive/
pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
not empty
# pg_archivecleanup -d /hab/svc/postgresql/data/archive/
000000010000000000000003.000000D0.backup
pg_archivecleanup: keep WAL file
"/hab/svc/postgresql/data/archive//000000010000000000000003" and later
# ls /hab/svc/postgresql/data/archive
000000010000000800000081 000000010000000800000083
000000010000000800000085 000000010000000800000087
000000010000000800000089 00000001000000080000008B
00000001000000080000008D 00000001000000080000008F
000000010000000800000091 000000010000000800000093
000000010000000800000082 000000010000000800000084
000000010000000800000086 000000010000000800000088
00000001000000080000008A 00000001000000080000008C
00000001000000080000008E 000000010000000800000090 000000010000000800000092
# rm -rf /hab/svc/postgresql/data/archive/*
# pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
transaction log start point: 8/96000028 on timeline 1
45946/45946 kB (100%), 1/1 tablespace
transaction log end point: 8/960000F8
pg_basebackup: base backup completed
# pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
not empty
# pg_archivecleanup -d /hab/svc/postgresql/data/archive/
000000010000000800000096.00000028.backup
pg_archivecleanup: keep WAL file
"/hab/svc/postgresql/data/archive//000000010000000800000096" and later
pg_archivecleanup: removing file
"/hab/svc/postgresql/data/archive//000000010000000800000095"
# pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
not empty
I feel like I'm missing something really fundamental here. Does everyone
just have infinite storage for their `archive/` directory (that doesn't
_seem_ plausible)?
Should this maybe be stored on an off "box" location?
I think I've figured out that `wal_keep_segments` is for streaming
replication, i.e.: so I can have a hot-standby, and `archive_command`
is for point-in-time backups, e.g.: "oops we modified 10k rows and need to
roll back the DB to this morning/an hour ago/20 mins ago"... (Another
use case I'd really like to be able to support is giving developers a
replica of the production database, though this might not be
the ideal way to go about that?)
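(If I understand the point-in-time part right, the "roll back to this
morning" case would mean restoring a base backup and dropping something like
this into recovery.conf -- target time made up, archive path from my config:

  restore_command = 'cp /hab/svc/postgresql/data/archive/%f %p'
  recovery_target_time = '2019-04-01 08:00:00 GMT'

...is that roughly the idea?)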
Thanks!
- QBR
Hello All,
Ok, I think maybe I've been approaching this all wrong. According to THIS
SO <https://dba.stackexchange.com/a/186136> post: "PostgreSQL doesn't
manage the archive directory for you, that is up to you. It doesn't even
know where (or what) the archive location is, that is why it asks you for
an archive_command, rather than just a directory."
Which I think is maybe a key component I was overlooking here. If pgsql
doesn't know anything about the `archive/` directory, then it should be
treated and managed "out of band" from the server?
So I'm thinking something like this:
Mount NFS share to `/pgbackup/`
Master server has a once-per-week `pg_basebackup` command that creates
`/pgbackup/base_backup/<date>` on the NFS share.
Master server has `archive_command` that creates archive in
`/pgbackup/archive`
On another server not part of the above cluster, run `pg_archivecleanup
/pgbackup/archive $(find /pgbackup/base_backup/<date> -name '*.backup')`
(i.e. "find the .backup file from `base_backup/<date>` and remove any
archive files?)
Then I effectively have a place to do point-in-time restores from?
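Concretely, I'm picturing something like this each week (paths and flags are
just a sketch of the plan above, based on what worked in my earlier test):

  # weekly base backup into a dated directory on the NFS share
  pg_basebackup --verbose --host 172.16.10.23 --username replication --xlog --progress --pgdata /pgbackup/base_backup/$(date +%Y-%m-%d)
  # then trim the archive back to the newest base-backup marker
  latest=$(ls /pgbackup/archive/*.backup | sort | tail -n 1)
  pg_archivecleanup /pgbackup/archive "$(basename "$latest")"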
Thanks,
- QBR
That seems to be a misconfigured client application that explicitly tries
to connect to a non-existent database 'admin' (via db=admin).
Instead of adding that database, it seems more logical to fix the client
configuration.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.