"Garbled" postgres logs
While going through some log files, we noticed that some of the log entries
are "garbled". For example:
2007-03-27 01:19:44.139 UTC [1761474] oxrsa aepp xx.xx.xx.xx LOG:
duratio2007-03-n: 3751.27 01:19801 ms :44.139 statemenUTC [421940]
oxrt: EXECUsor
g aTE <unnaepp 10.4med> [P0.136.10REPARE: 8 LOG: select
durationname fro: 3866.1m epp_do88 ms smain_dz_tatementnames wh:
EXECUTere nameE <unnam = $1]
2007-03-27 01:19:49.213 UTC [528480] oxrsa aepp xx.xx.xx.xx LOG:
duration: 4510.129 ms statement: EXECUTE <unnam2007-03-ed> [PR27
01:19EPARE: :49
.213 select nUTC [294ame from930] oxr b_domsa aain_dz_nb
10.4ames whe0.136.10re name 7 LOG: = $1]
Here are the logging sections in the postgresql.conf file:
log_destination = 'stderr'
redirect_stderr = on
log_directory = '/opt/rg/logs/pgsql81'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = off
log_rotation_age = 1440
log_rotation_size = 10240
log_line_prefix = '%m [%p] %d %u %h ' # Special values:
We're seeing this in postgresql 8.1.8 on AIX 5.3. Also, even though we're
using the same binaries for a bunch of clusters, we've only seen this issue
on one of them.
Any ideas of what could cause this?
--
Tim Goodaire 416-673-4126 tgoodair@ca.afilias.info
Database Team Lead, Afilias Canada Corp.
Tim Goodaire <tgoodair@ca.afilias.info> writes:
> While going through some log files, we noticed that some of the log entries
> are "garbled". For example:
> 2007-03-27 01:19:44.139 UTC [1761474] oxrsa aepp xx.xx.xx.xx LOG:
> duratio2007-03-n: 3751.27 01:19801 ms :44.139 statemenUTC [421940]
> oxrt: EXECUsor
> g aTE <unnaepp 10.4med> [P0.136.10REPARE: 8 LOG: select
> durationname fro: 3866.1m epp_do88 ms smain_dz_tatementnames wh:
> EXECUTere nameE <unnam = $1]
> 2007-03-27 01:19:49.213 UTC [528480] oxrsa aepp xx.xx.xx.xx LOG:
> duration: 4510.129 ms statement: EXECUTE <unnam2007-03-ed> [PR27
> 01:19EPARE: :49
> .213 select nUTC [294ame from930] oxr b_domsa aain_dz_nb
> 10.4ames whe0.136.10re name 7 LOG: = $1]
> We're seeing this in postgresql 8.1.8 on AIX 5.3. Also, even though we're
> using the same binaries for a bunch of clusters, we've only seen this issue
> on one of them.
It looks like you've got messages from different backends being
interleaved, which is surely not impossible but normally it only happens
line-by-line at worst. This must have something to do with the
buffering behavior on stderr. Perhaps the difference is caused
by a difference in where the postmaster's stderr originally pointed
--- was this postmaster started in a different fashion than the others?
Note to hackers: would it make sense to use write() instead of
fprintf() in send_message_to_server_log to avoid any possibility
of stdio deciding to fragment the message? Possibly there'd be
some marginal efficiency gain too.
regards, tom lane
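[Editor's note: for illustration only — this is not PostgreSQL source, and `write_log_message` is a made-up name. The change Tom is floating amounts to emitting each whole formatted message with a single write() call, so stdio never gets a chance to split it:]

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch of the idea: push the entire formatted message to
 * the file descriptor with one write() call.  fprintf() may break a long
 * message into several smaller writes, but a single write() of fewer
 * than PIPE_BUF bytes to a pipe is atomic under POSIX, so concurrent
 * writers cannot interleave mid-message.
 */
static ssize_t
write_log_message(int fd, const char *msg)
{
	size_t	len = strlen(msg);
	ssize_t	rc;

	/* Retry if interrupted by a signal before any bytes were written. */
	do
		rc = write(fd, msg, len);
	while (rc < 0 && errno == EINTR);

	return rc;
}
```

[A caller would pass fileno(stderr), or the write end of the pipe leading to the syslogger, as fd.]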
Tom Lane wrote:
> Note to hackers: would it make sense to use write() instead of
> fprintf() in send_message_to_server_log to avoid any possibility
> of stdio deciding to fragment the message? Possibly there'd be
> some marginal efficiency gain too.
What about in write_syslogger_file_binary()? Since redirect_stderr is
set true in the case reported, won't that be what does the writing?
cheers
andrew
On Tue, 2007-04-03 at 15:02 -0400, Tom Lane wrote:
> Tim Goodaire <tgoodair@ca.afilias.info> writes:
>> While going through some log files, we noticed that some of the log entries
>> are "garbled". For example:
>> 2007-03-27 01:19:44.139 UTC [1761474] oxrsa aepp xx.xx.xx.xx LOG:
>> duratio2007-03-n: 3751.27 01:19801 ms :44.139 statemenUTC [421940]
>> oxrt: EXECUsor
>> g aTE <unnaepp 10.4med> [P0.136.10REPARE: 8 LOG: select
>> durationname fro: 3866.1m epp_do88 ms smain_dz_tatementnames wh:
>> EXECUTere nameE <unnam = $1]
>> 2007-03-27 01:19:49.213 UTC [528480] oxrsa aepp xx.xx.xx.xx LOG:
>> duration: 4510.129 ms statement: EXECUTE <unnam2007-03-ed> [PR27
>> 01:19EPARE: :49
>> .213 select nUTC [294ame from930] oxr b_domsa aain_dz_nb
>> 10.4ames whe0.136.10re name 7 LOG: = $1]
>> We're seeing this in postgresql 8.1.8 on AIX 5.3. Also, even though we're
>> using the same binaries for a bunch of clusters, we've only seen this issue
>> on one of them.
> It looks like you've got messages from different backends being
> interleaved, which is surely not impossible but normally it only happens
> line-by-line at worst. This must have something to do with the
> buffering behavior on stderr. Perhaps the difference is caused
> by a difference in where the postmaster's stderr originally pointed
> --- was this postmaster started in a different fashion than the others?
No. Standard init script was used to start all clusters.
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Note to hackers: would it make sense to use write() instead of
>> fprintf() in send_message_to_server_log to avoid any possibility
>> of stdio deciding to fragment the message? Possibly there'd be
>> some marginal efficiency gain too.
> What about in write_syslogger_file_binary()? Since redirect_stderr is
> set true in the case reported, won't that be what does the writing?
No, syslogger is single-threaded so it can't be at fault. The
interleaving must be happening when the data is inserted into the pipe
that leads to syslogger. We've got multiple backends concurrently
writing that pipe, remember.
BTW, although I'm blaming stdio here, it's conceivable that it is
issuing messages in a single write() and the kernel is failing to keep
the writes atomic, as I think is required by spec if the write is for
less than PIPE_BUF bytes. So Tim might want to compare exact kernel
versions as well as exact libc versions between the misbehaving machine
and the others.
regards, tom lane
On Tue, Apr 03, 2007 at 04:51:33PM -0400, Tom Lane wrote:
> No, syslogger is single-threaded so it can't be at fault. The
> interleaving must be happening when the data is inserted into the pipe
> that leads to syslogger. We've got multiple backends concurrently
> writing that pipe, remember.
>
> BTW, although I'm blaming stdio here, it's conceivable that it is
> issuing messages in a single write() and the kernel is failing to keep
> the writes atomic, as I think is required by spec if the write is for
> less than PIPE_BUF bytes. So Tim might want to compare exact kernel
> versions as well as exact libc versions between the misbehaving machine
> and the others.
I've compared the libc and kernel versions between a misbehaving machine and a
machine that is logging properly and they're the same:
bos.rte.libc 5.3.0.52 C F libc Library
bos.mp64:bos.mp64:5.3.0.52: : :A:F:Base Operating System 64-bit
Multiprocessor Runtime
--
Tim Goodaire 416-673-4126 tgoodair@ca.afilias.info
Database Team Lead, Afilias Canada Corp.
tgoodair@ca.afilias.info (Tim Goodaire) writes:
> I've compared the libc and kernel versions between a misbehaving machine and a
> machine that is logging properly and they're the same:
[ scratches head... ] So what's different?
Anyway, if you are interested in trying to cut libc out of the equation,
the patch I had in mind goes like this (against 8.1)
regards, tom lane