does wal archiving block the current client connection?
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error. I'm
wondering if this is expected behavior with the archive command and if I
should build in some more smarts to my archive script. Maybe I should fork
and waitpid such that I can use a manual timeout shorter than whatever the
CIFS timeout is so that I can return an error in a reasonable amount of time?
Has anyone else seen this problem? Restarting the NAS device fixes the
problem, but it would be much preferable if postgres could soldier along
without the NAS for a little while before we resuscitate it. We don't have an
NFS or rsync server available in this environment currently, though I suppose
setting up an rsync server for Windows on the NAS would be a possibility.
Any suggestions much appreciated.
Currently the script is fairly simple and just does a 'cp' and then a 'gzip',
although we do use cp -f to copy over a possibly failed previous 'cp'.
Script is below:
. /usr/local/lib/includes.sh
FULLPATH="$1"
FILENAME="$2"
#
# Make sure we have pgbackup dir mounted
#
checkpgbackupmount
/bin/cp -f "$FULLPATH" "$PITRDESTDIR/$FILENAME"
if [ $? -ne 0 ]; then
    die "Could not cp $FULLPATH to $PITRDESTDIR/$FILENAME"
fi
/usr/bin/gzip -f "$PITRDESTDIR/$FILENAME"
#
# Make sure it worked, otherwise roll back
#
if [ $? -ne 0 ]; then
    /bin/rm -f "$PITRDESTDIR/$FILENAME"
    die "Could not /usr/bin/gzip $PITRDESTDIR/$FILENAME"
fi
exit 0
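For illustration, here's a minimal sketch of the fork-and-timeout idea (the
wrapper interface, the 60-second limit, and the polling loop are all
assumptions on my part, not something we run today):
----
#!/bin/sh
# Hypothetical wrapper: run the real archive command in the background and
# enforce our own timeout, so a wedged CIFS mount fails fast instead of
# hanging until the CIFS layer gives up.
TIMEOUT=60                      # seconds; keep well under the CIFS timeout

"$@" &                          # launch the real command, e.g. cp -f src dst
PID=$!

i=0
while [ $i -lt $TIMEOUT ]; do
    if ! kill -0 $PID 2>/dev/null; then
        wait $PID               # child finished; pick up its exit status
        exit $?
    fi
    sleep 1
    i=`expr $i + 1`
done

kill $PID 2>/dev/null           # still stuck after $TIMEOUT seconds
exit 1                          # non-zero, so postgres keeps the WAL segment
----
The archive_command would then be something along the lines of
wrapper.sh /bin/cp -f %p /mnt/pgbackup/%f, and the non-zero exit on timeout
means postgres just retries the segment later.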
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
What might be more bullet-proof would be to make the archive command
copy the file to an intermediate local directory, then have a
daemon/cron job that wakes up once a minute or so, checks for new files,
then copies them to the network mount. You may want to use something like
lsof to make sure the archive command has finished and closed the file
before moving it to the network drive.
This is what I do, and I've never had a failure of the archive command.
I've had a few network errors on the network drive (I use NFS), which I
fixed at my leisure, with no problems for the PostgreSQL server.
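A minimal sketch of that scheme, with hypothetical paths and a
copy-then-rename trick standing in for the lsof check:
----
#!/bin/sh
# archive_command side: land the WAL segment on local disk first, so a dead
# network mount never blocks the archiver for long.  Paths are illustrative.
SPOOL=/var/spool/pgarchive
cp "$1" "$SPOOL/$2.tmp" && mv "$SPOOL/$2.tmp" "$SPOOL/$2"
----
----
#!/bin/sh
# cron side, once a minute: push finished spool files to the network mount.
# The copy-then-rename above means anything without a .tmp suffix is
# complete and closed, so checking open files with lsof becomes unnecessary.
SPOOL=/var/spool/pgarchive
DEST=/mnt/pgbackup
for f in "$SPOOL"/*; do
    [ -f "$f" ] || continue                # glob matched nothing
    case "$f" in *.tmp) continue ;; esac   # still being written
    cp "$f" "$DEST/" && rm "$f"
done
----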
Jeff Frost <jeff@frostconsultingllc.com> writes:
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error. I'm
wondering if this is expected behavior with the archive command
No, I can't see what the connection should be there. It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.
What PG version is this exactly? Are you sure that the only connection
of the NAS device to PG operations is through the archive script, ie,
you don't have any part of the normal PG data directory mounted on it?
regards, tom lane
On Mon, 15 May 2006, Tom Lane wrote:
No, I can't see what the connection should be there. It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.
That's what I thought, but that doesn't seem to be what I'm observing. Of
course the NAS device only gets wedged about once every month or two, so it's
difficult to reproduce. I'd probably need to build another, similar system,
hit it with pgbench or something like it, then pull the plug on the NAS
device to see if it acts similarly.
What PG version is this exactly? Are you sure that the only connection
of the NAS device to PG operations is through the archive script, ie,
you don't have any part of the normal PG data directory mounted on it?
It's postgresql-8.1.3 compiled from source.
PGDATA is in /usr/local/pgsql/data which is a local RAID10 array, and /pg_xlog
is mounted on a local RAID1.
Things that touch the /mnt/pgbackup directory are the pitr base backup scripts
which run twice monthly and the archive_command script.
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
Jeff Frost <jeff@frostconsultingllc.com> writes:
On Mon, 15 May 2006, Tom Lane wrote:
No, I can't see what the connection should be there. It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.
That's what I thought, but that doesn't seem to be what I'm observing. Of
course the NAS device only gets wedged about once every month or two, so it's
difficult to reproduce.
If it's really a PG bug, it should be trivial to reproduce: put a long
sleep in the archive-command script.
regards, tom lane
On Mon, 15 May 2006, Tom Lane wrote:
That's what I thought, but that doesn't seem to be what I'm observing. Of
course the NAS device only gets wedged about once every month or two, so it's
difficult to reproduce.
If it's really a PG bug, it should be trivial to reproduce: put a long
sleep in the archive-command script.
Thanks Tom. I'll try and reproduce on one of the other PG servers I have
available when I have a chance and get back to you if it looks like a possible
bug.
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems
What kind of problems?
, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error.
This sounds like the archiver keeps waking up and trying the command,
which fails, yet each request causes a resource leak on the NAS, so the
archiver's retries eventually fail altogether. Or am I misunderstanding
your issue?
I'm
wondering if this is expected behavior with the archive command and if I
should build in some more smarts to my archive script. Maybe I should fork
and waitpid such that I can use a manual timeout shorter than whatever the
CIFS timeout is so that I can return an error in a reasonable amount of time?
The archiver is designed around the thought that *attempting* to archive
is a task that it can do indefinitely without a problem; it's up to you
to spot that the link is down.
We can put something in to make the retry period elongate, but you'd
need to put a reasonable case for how that would increase robustness.
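For spotting it, something like a cron'd check on the archive backlog might
do; a sketch, where the PGDATA path matches Jeff's layout and the threshold
and mail address are placeholders:
----
#!/bin/sh
# Count .ready status files piling up in pg_xlog/archive_status; they
# accumulate whenever archive_command keeps failing or hanging.
PGDATA=/usr/local/pgsql/data
THRESHOLD=20
READY=`ls "$PGDATA"/pg_xlog/archive_status/ 2>/dev/null | grep -c '\.ready$'`
if [ "$READY" -gt "$THRESHOLD" ]; then
    echo "$READY WAL segments waiting to be archived" \
        | mail -s "archiver may be stuck on `hostname`" dba@example.com
fi
----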
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, 15 May 2006, Simon Riggs wrote:
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems
What kind of problems?
It becomes unwritable, for whatever reason CIFS shares become unwritable. It's
a Windows 2003 NAS device and a reboot solves the problem, but it leaves no
event logs on the Windows side of things, so it's difficult to determine the
root cause.
, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error.
This sounds like the archiver keeps waking up and trying the command,
which fails, yet each request causes a resource leak on the NAS, so the
archiver's retries eventually fail altogether. Or am I misunderstanding
your issue?
That's possible. Does the archiver use a DB connection whenever it tries to
run archive_command? If so, then that's almost certainly the problem. I
suspect a faster timeout on the CIFS mount would fix the issue as well, but I
didn't see any such options in the mount.cifs manpage.
The archiver is designed around the thought that *attempting* to archive
is a task that it can do indefinitely without a problem; it's up to you
to spot that the link is down.
We can put something in to make the retry period elongate, but you'd
need to put a reasonable case for how that would increase robustness.
That all sounds perfectly reasonable. If the archiver is using up a
connection for each archive_command issued, then I suspect that's our problem,
as there were also lots of debug logs showing that the db was trying to
archive several WAL files at nearly the same time, likely pushing us over our
100 connection limit. If the archiver does not use up a connection, then I
suppose I don't know what's actually going on unless postgres blocks the
commit of the transaction which triggered the archive_command until the
archive command finishes (or fails).
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
On Mon, 2006-05-15 at 16:29, Jeff Frost wrote:
On Mon, 15 May 2006, Simon Riggs wrote:
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems
What kind of problems?
It becomes unwritable, for whatever reason CIFS shares become unwritable. It's
a Windows 2003 NAS device and a reboot solves the problem, but it leaves no
event logs on the Windows side of things, so it's difficult to determine the
root cause.
I don't think you need logs to figure out that the problem is that it's
a Windows-based NAS device. :)
Seriously, we had BSD-based NAS devices at one place I worked, and the
company making them switched to Win2k3. After we had one or two do
exactly what you're talking about, we just started building our own
based on RedHat or BSD. Same hardware (we just reformatted the drives on
one of the ones that came with w2k3) and we never had another moment's
problem with them.
In other news... Does the archiver actually connect to the database?
On Mon, 2006-05-15 at 14:29 -0700, Jeff Frost wrote:
On Mon, 15 May 2006, Simon Riggs wrote:
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
I've run into a problem with a PITR setup at a client. The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems
What kind of problems?
It becomes unwritable, for whatever reason CIFS shares become unwritable. It's
a Windows 2003 NAS device and a reboot solves the problem, but it leaves no
event logs on the Windows side of things, so it's difficult to determine the
root cause.
You should be able to re-create this problem without the database being
involved. Just set up a driver program over the top of the archive
script, so it runs in a tighter loop than the archiver would drive it. If
you still get the Windows NAS error... well, I'll leave that to you.
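For instance, something along these lines, where archive_wal.sh stands in
for whatever the real script is called and the dummy file and loop count are
arbitrary:
----
#!/bin/sh
# Call the archive script in a tight loop with a WAL-segment-sized dummy
# file, no database involved, and watch whether the NAS wedges on its own.
dd if=/dev/zero of=/tmp/fakewal bs=1M count=16 2>/dev/null
i=0
while [ $i -lt 1000 ]; do
    /usr/local/bin/archive_wal.sh /tmp/fakewal fake_wal_$i \
        || echo "iteration $i failed"
    i=`expr $i + 1`
done
----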
, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error.
Tell us more about what the blockage looks like. We may yet thank
Windows for finding a bug, but I'm not sure yet.
This sounds like the archiver keeps waking up and trying the command,
which fails, yet each request causes a resource leak on the NAS, so the
archiver's retries eventually fail altogether. Or am I misunderstanding
your issue?
That's possible. Does the archiver use a DB connection whenever it tries to
run archive_command?
Not at all.
If so, then that's almost certainly the problem. I
suspect a faster timeout on the CIFS mount would fix the issue as well, but I
didn't see any such options in the mount.cifs manpage.
The archiver is designed around the thought that *attempting* to archive
is a task that it can do indefinitely without a problem; it's up to you
to spot that the link is down.
We can put something in to make the retry period elongate, but you'd
need to put a reasonable case for how that would increase robustness.
That all sounds perfectly reasonable. If the archiver is using up a
connection for each archive_command issued, then I suspect that's our problem,
as there were also lots of debug logs showing that the db was trying to
archive several WAL files at nearly the same time, likely pushing us over our
100 connection limit.
Oh, you mean database clients cannot connect. I thought you meant you
were getting a CIFS client connection error from the archiver. That's
weird.
If the archiver does not use up a connection, then I
suppose I don't know what's actually going on unless postgres blocks the
commit of the transaction which triggered the archive_command until the
archive command finishes (or fails).
I think you need to show the database log covering the period in error.
Are you running out of disk space in the database directory? Can you
check again that pg_xlog and pg_xlog/archive_status are definitely not on
the NAS?
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, 2006-05-15 at 16:58 -0700, Jeff Frost wrote:
The log is below. Note that the problem began around 2a.m. around the time
the complaint about checkpoint segments happens. After a bit of research it
appears that the checkpoint complaint happens when our db maintenance job
kicks off at 2:00a.m. The crisis ended around 3:45a.m. with the restart of the
NAS device and normal operations resumed just before the end of this log.
There were also lots of CIFS errors of the following type in dmesg and in the
log:
CIFS VFS: Send error in read = -9
CIFS VFS: Send error in read = -9
I don't see much evidence for a connection between archiver and these
issues. The problems start after autovacuum of "vb_web" at 02:08. That
seems much more likely to have something to do with client connections
than the archiver - which is really nothing too special. However, the
archiver doesn't process any more files after 02:06. When is the
next "archived transaction log file" message?
Whatever happened between 02:08 and 02:14 seems important. Do you have
any other logs to cross-correlate for different events? What do the
clients think happened? What did they do during that time?
May 15 01:57:31 db3 postgres[12723]: [1-1] LOG: autovacuum: processing database "vb_web"
May 15 01:58:03 db3 crond(pam_unix)[12755]: session closed for user postgres
May 15 01:58:31 db3 postgres[12824]: [1-1] LOG: autovacuum: processing database "template1"
May 15 01:59:01 db3 crond(pam_unix)[12858]: session closed for user postgres
May 15 01:59:31 db3 postgres[12928]: [1-1] LOG: autovacuum: processing database "postgres"
May 15 02:00:10 db3 crond(pam_unix)[12964]: session closed for user postgres
May 15 02:00:11 db3 postgres[22008]: [1090-1] LOG: checkpoints are occurring too frequently (10 seconds apart)
May 15 02:00:11 db3 postgres[22008]: [1090-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:00:14 db3 postgres[22009]: [4886-1] LOG: archived transaction log file "00000001000000160000007F"
May 15 02:00:23 db3 postgres[22009]: [4887-1] LOG: archived transaction log file "000000010000001600000080"
May 15 02:00:23 db3 postgres[22008]: [1091-1] LOG: checkpoints are occurring too frequently (9 seconds apart)
May 15 02:00:23 db3 postgres[22008]: [1091-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:00:31 db3 postgres[22009]: [4888-1] LOG: archived transaction log file "000000010000001600000081"
May 15 02:00:36 db3 postgres[13747]: [1-1] LOG: autovacuum: processing database "vb_web"
May 15 02:00:36 db3 postgres[22008]: [1092-1] LOG: checkpoints are occurring too frequently (13 seconds apart)
May 15 02:00:36 db3 postgres[22008]: [1092-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:00:38 db3 postgres[22009]: [4889-1] LOG: archived transaction log file "000000010000001600000082"
May 15 02:00:45 db3 postgres[22009]: [4890-1] LOG: archived transaction log file "000000010000001600000083"
May 15 02:00:53 db3 postgres[22009]: [4891-1] LOG: archived transaction log file "000000010000001600000084"
May 15 02:00:54 db3 postgres[22008]: [1093-1] LOG: checkpoints are occurring too frequently (18 seconds apart)
May 15 02:00:54 db3 postgres[22008]: [1093-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:01:00 db3 postgres[22009]: [4892-1] LOG: archived transaction log file "000000010000001600000085"
May 15 02:01:05 db3 crond(pam_unix)[13815]: session closed for user postgres
May 15 02:01:07 db3 postgres[22009]: [4893-1] LOG: archived transaction log file "000000010000001600000086"
May 15 02:01:13 db3 postgres[22008]: [1094-1] LOG: checkpoints are occurring too frequently (19 seconds apart)
May 15 02:01:13 db3 postgres[22008]: [1094-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:01:14 db3 postgres[22009]: [4894-1] LOG: archived transaction log file "000000010000001600000087"
May 15 02:01:23 db3 postgres[22009]: [4895-1] LOG: archived transaction log file "000000010000001600000088"
May 15 02:01:25 db3 postgres[22008]: [1095-1] LOG: checkpoints are occurring too frequently (12 seconds apart)
May 15 02:01:25 db3 postgres[22008]: [1095-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:01:31 db3 postgres[22009]: [4896-1] LOG: archived transaction log file "000000010000001600000089"
May 15 02:01:35 db3 postgres[22008]: [1096-1] LOG: checkpoints are occurring too frequently (10 seconds apart)
May 15 02:01:35 db3 postgres[22008]: [1096-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:01:39 db3 postgres[22009]: [4897-1] LOG: archived transaction log file "00000001000000160000008A"
May 15 02:01:45 db3 postgres[22008]: [1097-1] LOG: checkpoints are occurring too frequently (10 seconds apart)
May 15 02:01:45 db3 postgres[22008]: [1097-2] HINT: Consider increasing the configuration parameter "checkpoint_segments".
May 15 02:01:47 db3 postgres[22009]: [4898-1] LOG: archived transaction log file "00000001000000160000008B"
May 15 02:01:47 db3 postgres[13964]: [1-1] LOG: autovacuum: processing database "template1"
May 15 02:01:54 db3 postgres[22009]: [4899-1] LOG: archived transaction log file "00000001000000160000008C"
May 15 02:02:00 db3 postgres[22009]: [4900-1] LOG: archived transaction log file "00000001000000160000008D"
May 15 02:02:07 db3 crond(pam_unix)[14000]: session closed for user postgres
May 15 02:02:08 db3 postgres[22009]: [4901-1] LOG: archived transaction log file "00000001000000160000008E"
May 15 02:02:15 db3 postgres[22009]: [4902-1] LOG: archived transaction log file "00000001000000160000008F"
May 15 02:02:21 db3 postgres[22009]: [4903-1] LOG: archived transaction log file "000000010000001600000090"
May 15 02:02:27 db3 postgres[22009]: [4904-1] LOG: archived transaction log file "000000010000001600000091"
May 15 02:02:34 db3 postgres[22009]: [4905-1] LOG: archived transaction log file "000000010000001600000092"
May 15 02:02:42 db3 postgres[22009]: [4906-1] LOG: archived transaction log file "000000010000001600000093"
May 15 02:02:51 db3 postgres[14150]: [1-1] LOG: autovacuum: processing database "postgres"
May 15 02:02:52 db3 postgres[22009]: [4907-1] LOG: archived transaction log file "000000010000001600000094"
May 15 02:03:00 db3 postgres[22009]: [4908-1] LOG: archived transaction log file "000000010000001600000095"
May 15 02:03:05 db3 crond(pam_unix)[14182]: session closed for user postgres
May 15 02:03:08 db3 postgres[22009]: [4909-1] LOG: archived transaction log file "000000010000001600000096"
May 15 02:03:15 db3 postgres[22009]: [4910-1] LOG: archived transaction log file "000000010000001600000097"
May 15 02:03:21 db3 postgres[22009]: [4911-1] LOG: archived transaction log file "000000010000001600000098"
May 15 02:03:51 db3 postgres[14292]: [1-1] LOG: autovacuum: processing database "vb_web"
May 15 02:04:08 db3 crond(pam_unix)[14305]: session closed for user postgres
May 15 02:04:51 db3 postgres[22009]: [4912-1] LOG: archived transaction log file "000000010000001600000099"
May 15 02:05:05 db3 crond(pam_unix)[14415]: session closed for user postgres
May 15 02:06:05 db3 crond(pam_unix)[14517]: session closed for user postgres
May 15 02:06:24 db3 postgres[14582]: [1-1] LOG: autovacuum: processing database "template1"
May 15 02:06:43 db3 postgres[22009]: [4913-1] LOG: archived transaction log file "00000001000000160000009A"
May 15 02:07:07 db3 crond(pam_unix)[14632]: session closed for user postgres
May 15 02:07:33 db3 postgres[14702]: [1-1] LOG: autovacuum: processing database "postgres"
May 15 02:08:48 db3 postgres[14724]: [1-1] LOG: autovacuum: processing database "vb_web"
May 15 02:13:03 db3 postgres[14781]: [1-1] LOG: unexpected EOF on client connection
May 15 02:14:42 db3 postgres[14796]: [1-1] LOG: unexpected EOF on client connection
May 15 02:15:53 db3 postgres[14755]: [1-1] LOG: unexpected EOF on client connection
May 15 02:16:11 db3 postgres[14856]: [1-1] LOG: unexpected EOF on client connection
May 15 02:19:52 db3 postgres[14869]: [1-1] LOG: unexpected EOF on client connection
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Tue, 16 May 2006, Simon Riggs wrote:
I don't see much evidence for a connection between archiver and these
issues. The problems start after autovacuum of "vb_web" at 02:08. That
seems much more likely to have something to do with client connections
than the archiver - which is really nothing too special. However, the
archiver doesn't process any more files after 02:06. When is the
next "archived transaction log file" message?
Whatever happened between 02:08 and 02:14 seems important. Do you have
any other logs to cross-correlate for different events? What do the
clients think happened? What did they do during that time?
I have the logs and after reviewing /var/log/messages for that time period,
there is no other activity besides postgres. The next archived transaction
log file messages are:
May 14 02:24:54 db3 postgres[22009]: [4876-1] LOG: archived transaction log file "000000010000001600000078"
May 14 02:34:05 db3 postgres[22009]: [4877-1] LOG: archived transaction log file "000000010000001600000079"
May 14 02:35:49 db3 postgres[22009]: [4878-1] LOG: archived transaction log file "00000001000000160000007A"
May 14 02:41:58 db3 postgres[22009]: [4879-1] LOG: archived transaction log file "00000001000000160000007B"
May 14 02:57:35 db3 postgres[22009]: [4880-1] LOG: archived transaction log file "00000001000000160000007C"
The clients are all on two Windows web servers; does anybody know where to
look for errors with that?
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
Jeff Frost <jeff@frostconsultingllc.com> writes:
On Tue, 16 May 2006, Simon Riggs wrote:
Whatever happened between 02:08 and 02:14 seems important.
I have the logs and after reviewing /var/log/messages for that time period,
there is no other activity besides postgres.
I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens. The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple. (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)
This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.
regards, tom lane
On Wed, 17 May 2006, Tom Lane wrote:
I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens. The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple. (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)
This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.
This sounds interesting and I'll go have a look for this thread in the
archives.
It looks like it's running about 3,000 transactions per minute right now
(around 10pm). I got that by doing the following...tell me if this is
reasonable:
db3:~ $ echo select xact_commit from pg_stat_database where datname =
\'vb_web\'\; | psql vb_web ; sleep 60 ; echo select xact_commit from
pg_stat_database where datname = \'vb_web\'\; | psql vb_web
xact_commit
-------------
806325
(1 row)
xact_commit
-------------
809414
(1 row)
So, doing a little subtraction, I got 3,089 transactions for the 60-second
period I ran it. I would guess it's quite a bit more during the day. That
means it would hit 32k transactions in around 10-11 minutes...does that fit
with our timing from above? (Tom, I'm not sure which timing you're
using.)
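For the record, the same measurement tidied up into a script (the sleep
interval is arbitrary; -A and -t just strip psql's output decoration):
----
#!/bin/sh
# Sample xact_commit twice, 60 seconds apart, and print the difference.
Q="SELECT xact_commit FROM pg_stat_database WHERE datname = 'vb_web';"
A=`psql -At -c "$Q" vb_web`
sleep 60
B=`psql -At -c "$Q" vb_web`
echo "commits in the last minute: `expr $B - $A`"
----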
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
On Wed, 2006-05-17 at 00:36 -0400, Tom Lane wrote:
Jeff Frost <jeff@frostconsultingllc.com> writes:
On Tue, 16 May 2006, Simon Riggs wrote:
Whatever happened between 02:08 and 02:14 seems important.
I have the logs and after reviewing /var/log/messages for that time period,
there is no other activity besides postgres.I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens. The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple. (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)
This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.
You'll have to explain a little more. I checked the archives...
The archiver looks for archive_status files that end with .ready, and that
has got nothing at all to do with transactions, LWLocks, etc. If there's
a file ready, it will archive it; if there's not, it won't. There is
very deliberately a very low amount of synchronization there: the archiver
holds no locks, LWLocks, or spinlocks at any time.
The "lurking feeling" scenario above might or might nor be an issue
here, but I can't see how the archiver could be involved at all. I see
no evidence for the archiver to be the source of a problem here and that
the only reason we're checking that is as a result of Jeff's original
conjecture that there was a connection. There *was* a problem, yes, but
I think we're looking in the wrong place for the murder weapon.
pg_clog page extension does look like it can offer problems, generally.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:
You'll have to explain a little more. I checked the archives...
I was thinking of
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
full explanation here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
The "lurking feeling" scenario above might or might nor be an issue
here, but I can't see how the archiver could be involved at all.
Well, I don't see it either; at this point we're waiting on Jeff to
provide some harder evidence ...
regards, tom lane
On Wed, 2006-05-17 at 10:01 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
You'll have to explain a little more. I checked the archives...
I was thinking of
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
full explanation here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
Sorry, I see my note reads as though I couldn't find it. I meant "I've
read the archives". Thanks though.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Wed, 17 May 2006, Tom Lane wrote:
The "lurking feeling" scenario above might or might nor be an issue
here, but I can't see how the archiver could be involved at all.Well, I don't see it either; at this point we're waiting on Jeff to
provide some harder evidence ...
Was the 3,000 transactions per minute helpful? What other evidence should I
be looking for? Really the only evidence I have at this point is that when
the NAS gets rebooted and comes back up, the postmaster resumes normal
operations and the client count goes back down. That could just be
coincidental, but we've seen it happen 3 times now. This might have nothing
to do with archiving; it just seems like the only likely candidate, since
there are only two things which happen on this volume: PITR and rsyncing of
pg_dumps (we dump locally first for speed reasons). Maybe this is just a
Linux kernel thing? Or maybe something else is blocking because it tries to
stat all the filesystems and /mnt/pgbackup blocks? I can't imagine postgres
doing that, though.
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
Jeff Frost <jeff@frostconsultingllc.com> writes:
Was the 3,000 transactions per minute helpful? What other evidence should I
be looking for?
Did you try generating a test case using a long sleep() as a replacement
for the archive_command script? If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case. I'm up to my armpits
in other stuff and don't have time to try it myself...
regards, tom lane
On Wed, 17 May 2006, Tom Lane wrote:
Did you try generating a test case using a long sleep() as a replacement
for the archive_command script? If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case. I'm up to my armpits
in other stuff and don't have time to try it myself...
Interesting that you should say that, as I had just started setting up a test.
I seem to get a lot of these:
May 17 21:34:04 discord postgres[20573]: [4-1] LOG: archived transaction log
file "000000010000000000000001"
May 17 21:34:04 discord postgres[20573]: [5-1] WARNING: could not rename file
"pg_xlog/archive_status/000000010000000000000001.ready" to
May 17 21:34:04 discord postgres[20573]: [5-2]
"pg_xlog/archive_status/000000010000000000000001.done": No such file or
directory
I'm guessing that the archiver marks the WAL file with a <wal filename>.ready
status file and, on successful completion, renames it to <wal filename>.done?
My archive_command is the utmost in simplicity...looks like this:
----
#!/bin/sh
FULLPATH="$1"
FILENAME="$2"
sleep 300
cat "$FULLPATH" > /dev/null
----
So far I haven't been able to reproduce it, but I'm just getting started.
Currently I'm using pgbench to generate WAL rollover; do you guys have any
other handy testing tools for this sort of job, or is this the best tool?
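For reference, this is roughly the kind of pgbench run I mean (scale factor
and client/transaction counts are arbitrary):
----
createdb bench
pgbench -i -s 10 bench           # initialize; scale 10 = ~1,000,000 accounts rows
pgbench -c 10 -t 10000 bench     # 10 clients, 10,000 transactions each
----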
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954