Logging checkpoints and other slowdown causes
This patch puts entries in the log file around some backend events that
regularly cause client pauses, most of which are visible only when running
with log_min_messages = debug2. The main focus is regularly occurring
checkpoints, triggered either by running out of segments or by hitting a timeout.
This is an example of what gets written to the logs at checkpoint time:
DEBUG2: checkpoint required (timeout passed)
DEBUG2: checkpoint starting
DEBUG2: checkpoint flushing buffer pool
DEBUG2: checkpoint database fsync starting
DEBUG2: checkpoint database fsync complete
DEBUG1: checkpoint buffers dirty=16.7 MB (13.0%) write=174.8 ms
sync=3716.5 ms
DEBUG2: checkpoint complete; 0 transaction log file(s) added, 0 removed,
0 recycled
The idea here is that if you want to really watch what your system is
doing, either to track down a problem or for benchmarking, you log at
DEBUG2 and timestamp the logs with something like log_line_prefix =
'[%m]:'. Then line up your timestamped transactions with the logs to
figure out what happened during bad periods. This is very easy to do with
the timestamped pgbench latency logs for example, so you can put them in
time sequence order and see something like "oh, those long transactions
all came in between the fsync start and end". The summary line with the
write/sync timing information has a broader use and suggests when you need
to go deeper, which is why I put that one at DEBUG1.
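For reference, the only configuration this workflow needs is the two settings
mentioned above; in postgresql.conf that looks like:

log_min_messages = debug2
log_line_prefix = '[%m]:'        # timestamp every log line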
I also adjust some other log messages to make it easier to run the system
at DEBUG2 without going completely crazy, like moving individual WAL
segment recycles to DEBUG3 and standardizing the format of the fork/exit
messages so they're easier to filter out.
My main concern about this code is how it adds several gettimeofday calls
and complicated string building inside FlushBufferPool, all of which is
wasted if this data isn't actually going to get logged. I'd like to have
something that tells me whether I should even bother, but I didn't want to
poke into the elog variables for fear of adding a dependency on their
current structure here. Ideally, it would be nice to call something
like WillLog(DEBUG2) and get a boolean back saying whether something at
that log level will be output anywhere; I don't know enough about the
logging code to add such a thing to it myself.
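Just to illustrate, a minimal sketch of that idea follows. This is not
existing PostgreSQL API; it assumes the log_min_messages and
client_min_messages GUC variables can simply be compared against the message
level, and it glosses over elog.c's special ordering of LOG and COMMERROR:

#include "postgres.h"
#include "utils/guc.h"

/*
 * Hypothetical helper: would a message at this level reach either the
 * server log or the client?  Callers could then skip the gettimeofday()
 * calls and string building entirely, e.g.
 *     if (WillLog(DEBUG2))
 *         gettimeofday(&start_t, NULL);
 */
static bool
WillLog(int elevel)
{
    return (elevel >= log_min_messages || elevel >= client_min_messages);
}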
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Attachment: checkpointlog.txt (text/plain)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.265
diff -c -r1.265 xlog.c
*** src/backend/access/transam/xlog.c 3 Mar 2007 20:02:26 -0000 1.265
--- src/backend/access/transam/xlog.c 1 Apr 2007 04:55:19 -0000
***************
*** 1591,1600 ****
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments-1))
{
! #ifdef WAL_DEBUG
! if (XLOG_DEBUG)
! elog(LOG, "time for a checkpoint, signaling bgwriter");
! #endif
RequestCheckpoint(false, true);
}
}
--- 1591,1598 ----
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments-1))
{
! ereport(DEBUG2,(errmsg(
! "checkpoint required (wrote checkpoint_segments)")));
RequestCheckpoint(false, true);
}
}
***************
*** 1858,1863 ****
--- 1856,1863 ----
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tmppath)));
+ ereport(DEBUG2, (errmsg("WAL creating and filling new file on disk")));
+
/*
* Zero-fill the file. We have to do this the hard way to ensure that all
* the file space has really been allocated --- on platforms that allow
***************
*** 2565,2571 ****
true, &max_advance,
true))
{
! ereport(DEBUG2,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
(*nsegsrecycled)++;
--- 2565,2571 ----
true, &max_advance,
true))
{
! ereport(DEBUG3,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
(*nsegsrecycled)++;
***************
*** 5424,5429 ****
--- 5424,5430 ----
LWLockRelease(CheckpointStartLock);
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
+ ereport(DEBUG2,(errmsg("checkpoint request ignored (no changes)")));
return;
}
}
***************
*** 6087,6092 ****
--- 6088,6094 ----
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*/
+ ereport(DEBUG2, (errmsg("checkpoint required (starting backup)")));
RequestCheckpoint(true, false);
/*
Index: src/backend/postmaster/autovacuum.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/autovacuum.c,v
retrieving revision 1.40
diff -c -r1.40 autovacuum.c
*** src/backend/postmaster/autovacuum.c 28 Mar 2007 22:17:12 -0000 1.40
--- src/backend/postmaster/autovacuum.c 1 Apr 2007 04:55:20 -0000
***************
*** 811,816 ****
--- 811,818 ----
/* And do an appropriate amount of work */
recentXid = ReadNewTransactionId();
do_autovacuum();
+ ereport(DEBUG2,
+ (errmsg("autovacuum: processing database \"%s\" complete", dbname)));
}
/*
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.37
diff -c -r1.37 bgwriter.c
*** src/backend/postmaster/bgwriter.c 30 Mar 2007 18:34:55 -0000 1.37
--- src/backend/postmaster/bgwriter.c 1 Apr 2007 04:55:21 -0000
***************
*** 392,397 ****
--- 392,398 ----
if (elapsed_secs >= CheckPointTimeout)
{
do_checkpoint = true;
+ ereport(DEBUG2, (errmsg("checkpoint required (timeout passed)")));
if (!force_checkpoint)
BgWriterStats.m_timed_checkpoints++;
}
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.527
diff -c -r1.527 postmaster.c
*** src/backend/postmaster/postmaster.c 22 Mar 2007 19:53:30 -0000 1.527
--- src/backend/postmaster/postmaster.c 1 Apr 2007 04:55:23 -0000
***************
*** 2630,2636 ****
/* in parent, successful fork */
ereport(DEBUG2,
! (errmsg_internal("forked new backend, pid=%d socket=%d",
(int) pid, port->sock)));
/*
--- 2630,2636 ----
/* in parent, successful fork */
ereport(DEBUG2,
! (errmsg_internal("forked new backend (PID %d) socket=%d",
(int) pid, port->sock)));
/*
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 30 Mar 2007 18:34:55 -0000 1.216
--- src/backend/storage/buffer/bufmgr.c 1 Apr 2007 04:55:25 -0000
***************
*** 946,957 ****
*
* This is called at checkpoint time to write out all dirty shared buffers.
*/
! void
BufferSync(void)
{
int buf_id;
int num_to_scan;
int absorb_counter;
/*
* Find out where to start the circular scan.
--- 946,958 ----
*
* This is called at checkpoint time to write out all dirty shared buffers.
*/
! int
BufferSync(void)
{
int buf_id;
int num_to_scan;
int absorb_counter;
+ int dirty_buffers;
/*
* Find out where to start the circular scan.
***************
*** 966,976 ****
*/
num_to_scan = NBuffers;
absorb_counter = WRITES_PER_ABSORB;
while (num_to_scan-- > 0)
{
if (SyncOneBuffer(buf_id, false))
{
! BgWriterStats.m_buf_written_checkpoints++;
/*
* If in bgwriter, absorb pending fsync requests after each
--- 967,978 ----
*/
num_to_scan = NBuffers;
absorb_counter = WRITES_PER_ABSORB;
+ dirty_buffers = 0;
while (num_to_scan-- > 0)
{
if (SyncOneBuffer(buf_id, false))
{
! dirty_buffers++;
/*
* If in bgwriter, absorb pending fsync requests after each
***************
*** 987,992 ****
--- 989,997 ----
if (++buf_id >= NBuffers)
buf_id = 0;
}
+
+ BgWriterStats.m_buf_written_checkpoints+=dirty_buffers;
+ return dirty_buffers;
}
/*
***************
*** 1288,1295 ****
void
FlushBufferPool(void)
{
! BufferSync();
smgrsync();
}
--- 1293,1320 ----
void
FlushBufferPool(void)
{
! int dirty;
! struct timeval start_t, sync_t, end_t;
! long u_sec_for_buffers, u_sec_for_sync;
!
! gettimeofday(&start_t, NULL);
! ereport(DEBUG2, (errmsg("checkpoint flushing buffer pool")));
! dirty=BufferSync();
! gettimeofday(&sync_t, NULL);
! ereport(DEBUG2, (errmsg("checkpoint database fsync starting")));
smgrsync();
+ gettimeofday(&end_t, NULL);
+ ereport(DEBUG2, (errmsg("checkpoint database fsync complete")));
+
+ u_sec_for_buffers = 1000000 * (long) (sync_t.tv_sec - start_t.tv_sec) +
+ (long) (sync_t.tv_usec - start_t.tv_usec);
+ u_sec_for_sync = 1000000 * (long) (end_t.tv_sec - sync_t.tv_sec) +
+ (long) (end_t.tv_usec - sync_t.tv_usec);
+
+ ereport(DEBUG1,(errmsg(
+ "checkpoint buffers dirty=%.1f MB (%.1f%%) write=%.1f ms sync=%.1f ms",
+ (float) (BLCKSZ * dirty) / (1024*1024), (float) 100 * dirty / NBuffers,
+ (float) u_sec_for_buffers / 1000, (float) u_sec_for_sync / 1000)));
}
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h 5 Jan 2007 22:19:57 -0000 1.102
--- src/include/storage/bufmgr.h 1 Apr 2007 04:55:26 -0000
***************
*** 150,156 ****
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern void BufferSync(void);
extern void BgBufferSync(void);
extern void AtProcExit_LocalBuffers(void);
--- 150,156 ----
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern int BufferSync(void);
extern void BgBufferSync(void);
extern void AtProcExit_LocalBuffers(void);
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Greg Smith wrote:
This patch puts entries in the log file around some backend events
that regularly cause client pauses, most of which are visible only
when running with log_min_messages = debug2.
Something that is aimed at a user should not be enabled at a "debug"
level. Debug levels are for debugging, not for very high verbosity.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Tue, 3 Apr 2007, Peter Eisentraut wrote:
Something that is aimed at a user should not be enabled at a "debug"
level. Debug levels are for debugging, not for very high verbosity.
I asked for feedback about what level to log at when I initially sent the first
version of this in and didn't hear anything back on that part, so I pushed
these in line with other log messages I saw. The messages for when
checkpoints start and stop were both logged at DEBUG2, so I put progress
reports on the other significant phases of the process there as well.
I don't expect these messages will be helpful for a normal user--that's
what the new data in pg_stat_bgwriter is for. The main purpose of this
patch is debugging checkpoint-related performance issues at a level I'd
expect only a developer to work at; they're also helpful for someone
writing benchmark code.
There are several patches in process floating around that aim to adjust
either the background writer or the checkpoint process to reduce the
impact of checkpoints. This logging allows grading their success at that.
As my tests with this patch in place suggest this problem is far from
solved with any of the current suggestions, I'd like to get other
developers looking at that problem the same way I have been; that's why
I'd like to see some standardization on how checkpoints are instrumented.
The fact that really advanced users might also use this for
troubleshooting I consider a bonus rather than the main focus here.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
On Tue, 3 Apr 2007, Peter Eisentraut wrote:
Something that is aimed at a user should not be enabled at a "debug"
level. Debug levels are for debugging, not for very high verbosity.
I asked for feedback about what level to log at when I initially sent the
first version of this in and didn't hear anything back on that part, so
I pushed these in line with other log messages I saw. The messages for
when checkpoints start and stop were both logged at DEBUG2, so I put
progress reports on the other significant phases of the process there as
well.
I agree that debug levels are not suitable for this.
I'm thinking of INFO, NOTICE or LOG. The user manual says about LOG:
LOG
Reports information of interest to administrators, e.g., checkpoint
activity.
But looking at the code, all the checkpoint related messages are at
DEBUG-levels, nothing gets printed at LOG-level. Printing the messages
at LOG-level would bring the code in line with the documentation, but I
don't think we want to fill the log with checkpoint chatter unless the
DBA explicitly asks for that.
How about INFO? It seems like the best level for information on normal
activity of the system. The documentation however has this to say about it:
INFO
Provides information implicitly requested by the user, e.g., during
VACUUM VERBOSE.
We should adjust the documentation, but INFO seems like the best level
to me. Or we could add a GUC variable similar to log_connections or
log_statement to control if the messages are printed or not, and use LOG.
I don't expect these messages will be helpful for a normal user--that's
what the new data in pg_stat_bgwriter is for. The main purpose of
this patch is debugging checkpoint-related performance issues at a level
I'd expect only a developer to work at; they're also helpful for someone
writing benchmark code.
I disagree. They would be helpful to an administrator chasing down
checkpoint-related problems: e.g., checkpoints taking too long, occurring
too often (though we already have checkpoint_warning for that), or
to identify if transient performance problems that users complain about
coincide with checkpoints. And at least I like to have messages like
that in the log just to comfort me that everything is going ok.
There are several patches in process floating around that aim to adjust
either the background writer or the checkpoint process to reduce the
impact of checkpoints. This logging allows grading their success at
that. As my tests with this patch in place suggest this problem is far
from solved with any of the current suggestions, I'd like to get other
developers looking at that problem the same way I have been; that's why
I'd like to see some standardization on how checkpoints are
instrumented. The fact that really advanced users might also use this
for troubleshooting I consider a bonus rather than the main focus here.
Agreed.
Looking at the patch, I think we want one line in the log when a
checkpoint starts, one when buffer flushing starts (or maybe not, if we
assume that checkpointing clog, subtrans and multixact don't take long),
one when the sync phase starts, and one when the checkpoint is done. We don't
need to print the times elapsed in each phase on a separate line, that's
just derived information from the other lines, unless we use different
log-levels for detail lines (like you did in your patch).
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 5/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I agree that debug levels are not suitable for this.
Squid (the (reverse) proxy) has a nice feature for debugging purposes
(admin and developers). It's possible to have different logging levels
for different sections of the code. For example, if you want to debug
a particular part, you can have ALL,1 33,2 in your config file which
allows you to have a more verbose debug on the section 33 of the code
(33 is ACL).
It's really useful because you can even debug a particular code path
or a misconfiguration in production without generating a huge amount
of logs.
--
Guillaume
Guillaume Smet wrote:
On 5/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I agree that debug levels are not suitable for this.
Squid (the (reverse) proxy) has a nice feature for debugging purposes
(admin and developers). It's possible to have different logging level
for different sections of the code. For example, if you want to debug
a particular part, you can have ALL,1 33,2 in your config file which
allows you to have a more verbose debug on the section 33 of the code
(33 is ACL).
It's really useful because you can even debug a particular code path
or a misconfiguration in production without generating a huge amount
of logs.
Yes, that'd be really nice, though using numeric codes is a bit archaic.
In the Java world, a logging framework called log4j has become the de facto
standard for implementing logging. It's basically the same idea: you have
different log levels, and you can enable different levels of
logging for different modules. I'm sure there are similar libraries for C;
that's something to think about in future releases.
8.3 will add some new log_* GUC variables, log_autovacuum and
log_temp_files, so it seems we're headed in that direction already. I'm
now thinking that's the approach we should take for checkpoints as well:
add a new GUC variable, log_checkpoints, and print the messages at
LOG-level if that's enabled.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 11 May 2007, Heikki Linnakangas wrote:
Printing the messages at LOG-level would bring the code in line with the
documentation, but I don't think we want to fill the log with checkpoint
chatter unless the DBA explicitly asks for that. How about INFO?
In putting that together, I pushed everything toward the lowest DEBUG
level that it was possible for it to run at without the messages I wanted
to see being completely lost. I didn't want this to get rejected just on
the basis of being too chatty for regular users. Putting a single line in
for each checkpoint at INFO, with all the individual progress points being
at DEBUG1 or 2, would be close to optimal as far as I'm concerned. More
on this below.
Or we could add a GUC variable similar to log_connections or log_statement to
control if the messages are printed or not, and use LOG.
If the levels for the individual messages are adjusted usefully, the
ability to control whether they show up or not falls out of the existing
log level adjustments; I haven't ever felt a strong need for a GUC when
using this code. If, as you say, there's already a move in this
direction, then fine--another GUC would be better.
We don't need to print the times elapsed in each phase on a separate
line, that's just derived information from the other lines, unless we
use different log-levels for detail lines
Let me turn this around for a second and ask you this: what do you think
people are going to do with this information? I can tell you what I do.
I parse the logs and look for that single line with the summary
information. I then take the three pieces of information it provides (MB
written, time to write, time to fsync) and save them into a table. From
there I generate stats, scatter plots, all sorts of useful stuff.
If you know when the checkpoint ended, and you know how long each of the
pieces took, you can reconstruct the other times easily. The way you
describe this is true--the summary is redundant given the
detail--but if you put yourself in the shoes of a log file parser, the
other way around is easier to work with. Piecing together log entries is
a pain, splitting them is easy.
If I had to only keep one line out of this, it would be the one with the
summary. It would be nice to have it logged at INFO.
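As a sketch of that parsing step (my own illustration, not part of the
patch), something like this pulls the three numbers out of a log line,
assuming the summary format shown earlier:

#include <stdio.h>
#include <string.h>

/*
 * Extract MB written, write time and sync time from the checkpoint
 * summary line.  Returns 1 on success, 0 if this line isn't a summary.
 */
static int
parse_checkpoint_summary(const char *line, float *mb, float *write_ms,
                         float *sync_ms)
{
    const char *p = strstr(line, "checkpoint buffers dirty=");
    float       pct;

    if (p == NULL)
        return 0;
    return sscanf(p,
                  "checkpoint buffers dirty=%f MB (%f%%) write=%f ms sync=%f ms",
                  mb, &pct, write_ms, sync_ms) == 4;
}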
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
On Fri, 11 May 2007, Heikki Linnakangas wrote:
Or we could add a GUC variable similar to log_connections or
log_statement to control if the messages are printed or not, and use LOG.
If the levels for the individual messages are adjusted usefully the
ability to control whether they show up or not falls out of the existing
log level adjustments; I haven't ever felt a strong need for GUC when
using this code. If, as you say, there's already a move in this
direction, then fine--another GUC would be better.
The need for a GUC becomes apparent when you start adding logging to other
modules as well, like we did for autovacuum a while back. If there
wasn't a log_autovacuum parameter, you'd have to choose between logging
both checkpoints and autovacuum actions, or neither.
We don't need to print the times elapsed in each phase on a separate
line, that's just derived information from the other lines, unless we
use different log-levels for detail lines
Let me turn this around for a second and ask you this: what do you
think people are going to do with this information? I can tell you what
I do. I parse the logs and look for that single line with the summary
information. I then take the three pieces of information it provides
(MB written, time to write, time to fsync) and save them into a table.
From there I generate stats, scatter plots, all sorts of useful stuff.
If you know when the checkpoint ended, and you know how long each of the
pieces took, you can reconstruct the other times easily. The way you
describe this it is true--that the summary is redundant given the
detail--but if you put yourself in the shoes of a log file parser the
other way around is easier to work with. Piecing together log entries
is a pain, splitting them is easy.
If I had to only keep one line out of this, it would be the one with the
summary. It would be nice to have it logged at INFO.
Yeah, if we have the summary line we don't need the other lines and vice
versa. I have sympathy for parsing log files, I've done that a lot in
the past and I can see what you mean. Having the individual lines is
nice when you're monitoring a running system; you don't get the summary
line until the checkpoint is finished. I suppose we can have both the
individual lines and the summary, the extra lines shouldn't hurt anyone,
and you won't get them unless you turn on the new log_checkpoints
parameter anyway.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Moving to -hackers.
On Fri, May 11, 2007 at 04:37:44PM +0100, Heikki Linnakangas wrote:
If you know when the checkpoint ended, and you know how long each of the
pieces took, you can reconstruct the other times easily. The way you
describe this it is true--that the summary is redundant given the
detail--but if you put yourself in the shoes of a log file parser the
other way around is easier to work with. Piecing together log entries
is a pain, splitting them is easy.
If I had to only keep one line out of this, it would be the one with the
summary. It would be nice to have it logged at INFO.
Yeah, if we have the summary line we don't need the other lines and vice
versa. I have sympathy for parsing log files, I've done that a lot in
the past and I can see what you mean. Having the individual lines is
nice when you're monitoring a running system; you don't get the summary
line until the checkpoint is finished. I suppose we can have both the
individual lines and the summary, the extra lines shouldn't hurt anyone,
and you won't get them unless you turn on the new log_checkpoints
parameter anyway.
Not to beat a dead horse, but do we really want to force folks to be
parsing logs for performance monitoring? Especially if that log parsing
is just going to result in data being inserted into a table anyway?
I know there's concern about performance of the stats system and maybe
that needs to be addressed, but pushing users to log parsing is a lot of
extra effort, non-standard, likely to be overlooked, and doesn't play
well with other tools. It also conflicts with the existing
statistics framework.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Not to beat a dead horse, but do we really want to force folks to be
parsing logs for performance monitoring? Especially if that log parsing
is just going to result in data being inserted into a table anyway?
I know there's concern about performance of the stats system and maybe
that needs to be addressed, but pushing users to log parsing is a lot of
extra effort, non-standard, likely to be overlooked, and doesn't play
well with other tools. It also conflicts with all the existing
statistics framework.
One thing that doesn't seem to be getting looked at is the cost of
logging. Logging is very expensive. I don't know if it is more expensive
than the stats system, but you can cut your tps in half by having any
level of verbose logging on.
Yes, that can be offset by pushing the logging to another spindle and
being careful about what you are logging, but still.
Either way we are taking the hit; it is just a matter of where. IMO it
would be better to have the information in the database, where it makes
sense, than to push it out to a log that:
A. Will likely be forgotten
B. Is only accessible if you have shell access to the machine (not as
common as all of us would like to think)
Sincerely,
Joshua D. Drake
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
On Sat, 12 May 2007, Jim C. Nasby wrote:
Not to beat a dead horse, but do we really want to force folks to be
parsing logs for performance monitoring?
All of the really interesting DBA-level information about checkpoints is
now sitting in pg_stat_bgwriter. There really is no reason I'd expect
this information to be part of a normal performance monitoring setup; for me
it has fit into troubleshooting and R&D.
When I'm talking about parsing logs in this context, it's for answering
questions like "how does fsync time correlate with the amount of data written
at checkpoint time?" Since I've found it educational to sort through
that, I just didn't want the patch modified so that was harder to do than
it had to be.
If I thought it worth the trouble to expose that level of information via
the stats interface, I'd have submitted that instead of this log-based
patch. This circles back to the previous discussion of whether this
particular information is strictly developer-level.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 12 May 2007, Joshua D. Drake wrote:
One thing that doesn't seem to be getting looked at is the cost of
logging.
If any of this executed at something like the query level, sure, that
would be really important. The majority of the logging I suggested here is
of things that happen at checkpoint time. The presumption is that the
overhead of that is considerably greater than writing a log line or even
five.
The least intensive event I would like to be loggable is when a new WAL
segment is created and cleared, which is again a pretty small bit of log
compared to the 16MB write. I wouldn't mind seeing that exposed under
pg_stats instead; I just had more interesting things to statify first.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 2007-12-05 at 14:26 -0700, Joshua D. Drake wrote:
Either way, we are taking the hit, it is just a matter of where. IMO it
would be better to have the information in the database where it makes
sense, than pushing out to a log
If performance monitoring information is provided as a database object,
what would the right interface be? IMHO the problem with cleanly
presenting monitoring information within a normal database system is
that this sort of data is fundamentally dynamic and continuous: to
determine how the performance of the system changes over time, you need
to repeatedly rescan the table/view/SRF and recompute your analysis
essentially from scratch. Trying to get even simple information like
"queries per second" from pg_stat_activity is an example of how this can
be painful.
<plug>
BTW, if the system included the concept of a continuous data *stream* as
a kind of database object, this problem would be much more tractable :)
In fact, there is some code in a version of TelegraphCQ that exposes
various information about the runtime state of the system as a set of
system-defined data streams -- like any other stream, users could then
use those streams in arbitrary queries.
</plug>
-Neil
Jim C. Nasby wrote:
Moving to -hackers.
Not to beat a dead horse, but do we really want to force folks to be
parsing logs for performance monitoring? Especially if that log parsing
is just going to result in data being inserted into a table anyway?
I know there's concern about performance of the stats system and maybe
that needs to be addressed, but pushing users to log parsing is a lot of
extra effort, non-standard, likely to be overlooked, and doesn't play
well with other tools. It also conflicts with all the existing
statistics framework.
There are two counters for checkpoints in pgstats: the number of timed
(triggered by checkpoint_timeout) and requested (triggered by
checkpoint_segments) checkpoints.
Maybe we should improve the stats system so that we can collect events
with timestamps and durations, but in my experience log files actually
are the most reliable and universal way to collect real-time performance
information. Any serious tool has a generic log parser. The other
alternative is SNMP. I welcome the efforts on pgsnmpd..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Maybe we should improve the stats system so that we can collect events
with timestamps and durations, but in my experience log files actually
are the most reliable and universal way to collect real-time performance
information. Any serious tool has a generic log parser. The other
alternative is SNMP. I welcome the efforts on pgsnmpd..
pgsnmpd can't provide any information that's not in the backend. Unless
we'd turn it into a log parser, which is not really something I think is
a good idea.
Log files are great for one kind of thing; "live statistics" through
SNMP or the statistics collector are for another kind. It only goes wrong
when you put them in the wrong place. Things you poll regularly make a
lot more sense in some kind of live view than in a log file.
//Magnus
On Sun, May 13, 2007 at 07:54:20AM +0100, Heikki Linnakangas wrote:
Maybe we should improve the stats system so that we can collect events
with timestamps and durations, but in my experience log files actually
are the most reliable and universal way to collect real-time performance
information. Any serious tool has a generic log parser. The other
alternative is SNMP. I welcome the efforts on pgsnmpd..
Having timing information in the stats system would be useful, but I'm
not sure how it could actually be done. But at least if the information
is in the stats system it's easy to programmatically collect and process.
SNMP is just one example of that (fwiw I agree with Magnus that it
probably doesn't make sense to turn pgsnmpd into a log parser...)
--
Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Heikki Linnakangas wrote:
Maybe we should improve the stats system so that we can collect events
with timestamps and durations, but in my experience log files actually
are the most reliable and universal way to collect real-time performance
information. Any serious tool has a generic log parser. The other
alternative is SNMP. I welcome the efforts on pgsnmpd..
What about adding some DTrace probes?
Zdenek
Here's a work-in-progress updated version of this patch. It's not
complete, but I've settled on:
- a log_checkpoints GUC variable to control whether anything is printed
or not. This is in line with log_connections, log_autovacuum etc.
- when checkpoints are logged, LOG-level is used
- the following log output:
LOG: checkpoint starting (CHECKPOINT command)
LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms
sync=32.0 ms
where "checkpoint starting" is obviously printed when checkpoint starts,
and "checkpoint complete" when the checkpoint is fully complete.
- all information on the progress of a checkpoint is collected into a global
struct. A global struct sounds a bit ugly, but it avoids the ugliness of
littering all functions with multiple return values to pass back
information to the caller that's only used to print it out. This makes
it easy to add more data points. MoveOfflineLogs should probably be
changed to use that as well for the sake of consistency, instead of
returning nsegsremoved and nsegsrecycled through pointers.
- RequestCheckpoint and CreateCheckpoint were changed to take a
CheckpointReason enum as an argument, instead of boolean flags. This allows
printing a different message depending on the cause of the checkpoint, and
is also cleaner IMHO.
I'm posting now to allow others, and Greg Smith in particular, to review
the changes so far, and take over from here.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachment: cplog-1.patch (text/x-diff)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c 30 Apr 2007 21:01:52 -0000 1.268
--- src/backend/access/transam/xlog.c 20 May 2007 15:21:49 -0000
***************
*** 71,76 ****
--- 71,78 ----
bool XLOG_DEBUG = false;
#endif
+ bool log_checkpoints = false;
+
/*
* XLOGfileslop is used in the code as the allowed "fuzz" in the number of
* preallocated XLOG segments --- we try to have at least XLOGfiles advance
***************
*** 120,125 ****
--- 122,130 ----
static TimestampTz recoveryStopTime;
static bool recoveryStopAfter;
+ CheckpointProgressData checkpointProgress;
+
+
/*
* During normal operation, the only timeline we care about is ThisTimeLineID.
* During recovery, however, things are more complicated. To simplify life
***************
*** 1595,1601 ****
if (XLOG_DEBUG)
elog(LOG, "time for a checkpoint, signaling bgwriter");
#endif
! RequestCheckpoint(false, true);
}
}
}
--- 1600,1606 ----
if (XLOG_DEBUG)
elog(LOG, "time for a checkpoint, signaling bgwriter");
#endif
! RequestCheckpoint(false, CHKP_SEGMENTS);
}
}
}
***************
*** 5067,5073 ****
* the rule that TLI only changes in shutdown checkpoints, which
* allows some extra error checking in xlog_redo.
*/
! CreateCheckPoint(true, true);
/*
* Close down recovery environment
--- 5072,5078 ----
* the rule that TLI only changes in shutdown checkpoints, which
* allows some extra error checking in xlog_redo.
*/
! CreateCheckPoint(CHKP_SHUTDOWN); /* XXX startup rather than shutdown... */
/*
* Close down recovery environment
***************
*** 5340,5346 ****
ereport(LOG,
(errmsg("shutting down")));
! CreateCheckPoint(true, true);
ShutdownCLOG();
ShutdownSUBTRANS();
ShutdownMultiXact();
--- 5345,5351 ----
ereport(LOG,
(errmsg("shutting down")));
! CreateCheckPoint(CHKP_SHUTDOWN);
ShutdownCLOG();
ShutdownSUBTRANS();
ShutdownMultiXact();
***************
*** 5356,5362 ****
* activity has occurred since the last one.
*/
void
! CreateCheckPoint(bool shutdown, bool force)
{
CheckPoint checkPoint;
XLogRecPtr recptr;
--- 5361,5367 ----
* activity has occurred since the last one.
*/
void
! CreateCheckPoint(CheckpointReason reason)
{
CheckPoint checkPoint;
XLogRecPtr recptr;
***************
*** 5370,5375 ****
--- 5375,5435 ----
int nsegsrecycled = 0;
TransactionId *inCommitXids;
int nInCommit;
+ struct timeval start_t;
+ int elapsed_secs;
+
+ /* Log beginning of checkpoint */
+
+ gettimeofday(&start_t, NULL);
+
+ if (log_checkpoints)
+ {
+ switch(reason)
+ {
+ case CHKP_SHUTDOWN:
+ /* ereport(LOG, (errmsg("checkpoint starting (shutdown)"))); */
+ break;
+ case CHKP_COMMAND:
+ ereport(LOG, (errmsg("checkpoint starting (CHECKPOINT command)")));
+ break;
+ case CHKP_BACKUP:
+ ereport(LOG, (errmsg("checkpoint starting (starting backup)")));
+ break;
+ case CHKP_TIMEOUT:
+ ereport(LOG, (errmsg("checkpoint starting (checkpoint_timeout reached)")));
+ break;
+ case CHKP_SEGMENTS:
+ ereport(LOG, (errmsg("checkpoint starting (checkpoint_segments reached)")));
+ break;
+ case CHKP_CREATEDB:
+ ereport(LOG, (errmsg("checkpoint starting (creating database)")));
+ break;
+ case CHKP_DROPDB:
+ ereport(LOG, (errmsg("checkpoint starting (dropping database)")));
+ break;
+ default:
+ elog(WARNING, "unexpected checkpoint reason %d", reason);
+ break;
+ }
+ }
+
+ /*
+ * We will warn if (a) too soon since last checkpoint (whatever
+ * caused it) and (b) somebody has set the ckpt_time_warn flag
+ * since the last checkpoint start. Note in particular that this
+ * implementation will not generate warnings caused by
+ * CheckPointTimeout < CheckPointWarning.
+ */
+ elapsed_secs = start_t.tv_sec - checkpointProgress.cs_start_t.tv_sec;
+ if (reason == CHKP_SEGMENTS && elapsed_secs < CheckPointWarning)
+ ereport(LOG,
+ (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
+ elapsed_secs),
+ errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+
+ checkpointProgress.cs_reason = reason;
+ checkpointProgress.cs_start_t = start_t;
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
***************
*** 5384,5390 ****
*/
START_CRIT_SECTION();
! if (shutdown)
{
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = time(NULL);
--- 5444,5450 ----
*/
START_CRIT_SECTION();
! if (reason == CHKP_SHUTDOWN)
{
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = time(NULL);
***************
*** 5416,5422 ****
* the end of the last checkpoint record, and its redo pointer must point
* to itself.
*/
! if (!shutdown && !force)
{
XLogRecPtr curInsert;
--- 5476,5482 ----
* the end of the last checkpoint record, and its redo pointer must point
* to itself.
*/
! if (reason == CHKP_TIMEOUT) /* XXX is the condition same as on head? */
{
XLogRecPtr curInsert;
***************
*** 5479,5488 ****
*/
LWLockRelease(WALInsertLock);
- if (!shutdown)
- ereport(DEBUG2,
- (errmsg("checkpoint starting")));
-
/*
* Before flushing data, we must wait for any transactions that are
* currently in their commit critical sections. If an xact inserted its
--- 5539,5544 ----
***************
*** 5530,5540 ****
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
! if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
! MultiXactGetCheckptMulti(shutdown,
&checkPoint.nextMulti,
&checkPoint.nextMultiOffset);
--- 5586,5596 ----
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
! if (reason != CHKP_SHUTDOWN)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
! MultiXactGetCheckptMulti(reason == CHKP_SHUTDOWN,
&checkPoint.nextMulti,
&checkPoint.nextMultiOffset);
***************
*** 5561,5567 ****
rdata.next = NULL;
recptr = XLogInsert(RM_XLOG_ID,
! shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
XLOG_CHECKPOINT_ONLINE,
&rdata);
--- 5617,5623 ----
rdata.next = NULL;
recptr = XLogInsert(RM_XLOG_ID,
! (reason == CHKP_SHUTDOWN) ? XLOG_CHECKPOINT_SHUTDOWN :
XLOG_CHECKPOINT_ONLINE,
&rdata);
***************
*** 5571,5577 ****
* We now have ProcLastRecPtr = start of actual checkpoint record, recptr
* = end of actual checkpoint record.
*/
! if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
ereport(PANIC,
(errmsg("concurrent transaction log activity while database system is shutting down")));
--- 5627,5633 ----
* We now have ProcLastRecPtr = start of actual checkpoint record, recptr
* = end of actual checkpoint record.
*/
! if (reason == CHKP_SHUTDOWN && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
ereport(PANIC,
(errmsg("concurrent transaction log activity while database system is shutting down")));
***************
*** 5585,5591 ****
* Update the control file.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
! if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->prevCheckPoint = ControlFile->checkPoint;
ControlFile->checkPoint = ProcLastRecPtr;
--- 5641,5647 ----
* Update the control file.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
! if (reason == CHKP_SHUTDOWN)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->prevCheckPoint = ControlFile->checkPoint;
ControlFile->checkPoint = ProcLastRecPtr;
***************
*** 5626,5634 ****
* Make more log segments if needed. (Do this after deleting offline log
* segments, to avoid having peak disk space usage higher than necessary.)
*/
! if (!shutdown)
nsegsadded = PreallocXlogFiles(recptr);
/*
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
--- 5682,5694 ----
* Make more log segments if needed. (Do this after deleting offline log
* segments, to avoid having peak disk space usage higher than necessary.)
*/
! if (reason != CHKP_SHUTDOWN)
nsegsadded = PreallocXlogFiles(recptr);
+ ereport(DEBUG2,
+ (errmsg("checkpoint: %d transaction log file(s) added, %d removed, %d recycled",
+ nsegsadded, nsegsremoved, nsegsrecycled)));
+
/*
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
***************
*** 5639,5650 ****
if (!InRecovery)
TruncateSUBTRANS(GetOldestXmin(true, false));
- if (!shutdown)
- ereport(DEBUG2,
- (errmsg("checkpoint complete; %d transaction log file(s) added, %d removed, %d recycled",
- nsegsadded, nsegsremoved, nsegsrecycled)));
-
LWLockRelease(CheckpointLock);
}
/*
--- 5699,5725 ----
if (!InRecovery)
TruncateSUBTRANS(GetOldestXmin(true, false));
LWLockRelease(CheckpointLock);
+
+ /* Checkpoint complete. All we have left is updating and printing stats */
+
+ gettimeofday(&checkpointProgress.cs_end_t, NULL);
+ if (log_checkpoints)
+ {
+ long write_usecs;
+ long sync_usecs;
+
+ write_usecs = 1000000 * (long) (checkpointProgress.cs_sync_t.tv_sec - checkpointProgress.cs_start_t.tv_sec) +
+ (long) (checkpointProgress.cs_sync_t.tv_usec - checkpointProgress.cs_start_t.tv_usec);
+ sync_usecs = 1000000 * (long) (checkpointProgress.cs_end_t.tv_sec - checkpointProgress.cs_sync_t.tv_sec) +
+ (long) (checkpointProgress.cs_end_t.tv_usec - checkpointProgress.cs_sync_t.tv_usec);
+
+ ereport(LOG,(errmsg(
+ "checkpoint complete; buffers written=%.1f MB (%.1f%%) write=%.1f ms sync=%.1f ms",
+ (float) (BLCKSZ * checkpointProgress.cs_buffers_written) / (1024*1024),
+ (float) 100 * checkpointProgress.cs_buffers_written / NBuffers,
+ (float) write_usecs / 1000, (float) sync_usecs / 1000)));
+ }
}
/*
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c 7 Mar 2007 13:35:02 -0000 1.233
--- src/backend/bootstrap/bootstrap.c 20 May 2007 10:03:27 -0000
***************
*** 489,495 ****
/* Perform a checkpoint to ensure everything's down to disk */
SetProcessingMode(NormalProcessing);
! CreateCheckPoint(true, true);
/* Clean up and exit */
cleanup();
--- 489,495 ----
/* Perform a checkpoint to ensure everything's down to disk */
SetProcessingMode(NormalProcessing);
! CreateCheckPoint(CHKP_SHUTDOWN);
/* Clean up and exit */
cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.194
diff -c -r1.194 dbcommands.c
*** src/backend/commands/dbcommands.c 12 Apr 2007 15:04:35 -0000 1.194
--- src/backend/commands/dbcommands.c 20 May 2007 10:04:27 -0000
***************
*** 503,509 ****
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
! RequestCheckpoint(true, false);
/*
* Close pg_database, but keep lock till commit (this is important to
--- 503,509 ----
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
! RequestCheckpoint(true, CHKP_CREATEDB);
/*
* Close pg_database, but keep lock till commit (this is important to
***************
*** 656,662 ****
* open files, which would cause rmdir() to fail.
*/
#ifdef WIN32
! RequestCheckpoint(true, false);
#endif
/*
--- 656,662 ----
* open files, which would cause rmdir() to fail.
*/
#ifdef WIN32
! RequestCheckpoint(true, CHKP_DROPDB);
#endif
/*
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.37
diff -c -r1.37 bgwriter.c
*** src/backend/postmaster/bgwriter.c 30 Mar 2007 18:34:55 -0000 1.37
--- src/backend/postmaster/bgwriter.c 20 May 2007 15:31:30 -0000
***************
*** 116,122 ****
sig_atomic_t ckpt_done; /* advances when checkpoint done */
sig_atomic_t ckpt_failed; /* advances when checkpoint fails */
! sig_atomic_t ckpt_time_warn; /* warn if too soon since last ckpt? */
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
--- 116,122 ----
sig_atomic_t ckpt_done; /* advances when checkpoint done */
sig_atomic_t ckpt_failed; /* advances when checkpoint fails */
! sig_atomic_t ckpt_reason; /* XXX */
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
***************
*** 340,349 ****
for (;;)
{
bool do_checkpoint = false;
! bool force_checkpoint = false;
time_t now;
int elapsed_secs;
long udelay;
/*
* Emergency bailout if postmaster has died. This is to avoid the
--- 340,350 ----
for (;;)
{
bool do_checkpoint = false;
! CheckpointReason reason = 0; /* init to keep compiler happy */
time_t now;
int elapsed_secs;
long udelay;
+
/*
* Emergency bailout if postmaster has died. This is to avoid the
***************
*** 366,372 ****
{
checkpoint_requested = false;
do_checkpoint = true;
! force_checkpoint = true;
BgWriterStats.m_requested_checkpoints++;
}
if (shutdown_requested)
--- 367,373 ----
{
checkpoint_requested = false;
do_checkpoint = true;
! reason = BgWriterShmem->ckpt_reason;
BgWriterStats.m_requested_checkpoints++;
}
if (shutdown_requested)
***************
*** 389,399 ****
*/
now = time(NULL);
elapsed_secs = now - last_checkpoint_time;
! if (elapsed_secs >= CheckPointTimeout)
{
! do_checkpoint = true;
! if (!force_checkpoint)
BgWriterStats.m_timed_checkpoints++;
}
/*
--- 390,404 ----
*/
now = time(NULL);
elapsed_secs = now - last_checkpoint_time;
! if (!do_checkpoint)
{
! elapsed_secs = now - last_checkpoint_time;
! if (elapsed_secs >= CheckPointTimeout)
! {
! do_checkpoint = true;
! reason = CHKP_TIMEOUT;
BgWriterStats.m_timed_checkpoints++;
+ }
}
/*
***************
*** 404,421 ****
{
/*
* We will warn if (a) too soon since last checkpoint (whatever
! * caused it) and (b) somebody has set the ckpt_time_warn flag
! * since the last checkpoint start. Note in particular that this
* implementation will not generate warnings caused by
* CheckPointTimeout < CheckPointWarning.
*/
! if (BgWriterShmem->ckpt_time_warn &&
! elapsed_secs < CheckPointWarning)
ereport(LOG,
(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs),
errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
- BgWriterShmem->ckpt_time_warn = false;
/*
* Indicate checkpoint start to any waiting backends.
--- 409,424 ----
{
/*
* We will warn if (a) too soon since last checkpoint (whatever
! * caused it) and (b) checkpoint was initiated by filling WAL
! * segments. Note in particular that this
* implementation will not generate warnings caused by
* CheckPointTimeout < CheckPointWarning.
*/
! if (reason == CHKP_SEGMENTS && elapsed_secs < CheckPointWarning)
ereport(LOG,
(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs),
errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
/*
* Indicate checkpoint start to any waiting backends.
***************
*** 423,429 ****
ckpt_active = true;
BgWriterShmem->ckpt_started++;
! CreateCheckPoint(false, force_checkpoint);
/*
* After any checkpoint, close all smgr files. This is so we
--- 426,432 ----
ckpt_active = true;
BgWriterShmem->ckpt_started++;
! CreateCheckPoint(reason);
/*
* After any checkpoint, close all smgr files. This is so we
***************
*** 637,648 ****
* before returning; otherwise, just signal the request and return
* immediately.
*
! * If warnontime is true, and it's "too soon" since the last checkpoint,
* the bgwriter will log a warning. This should be true only for checkpoints
* caused due to xlog filling, else the warning will be misleading.
*/
void
! RequestCheckpoint(bool waitforit, bool warnontime)
{
/* use volatile pointer to prevent code rearrangement */
volatile BgWriterShmemStruct *bgs = BgWriterShmem;
--- 640,651 ----
* before returning; otherwise, just signal the request and return
* immediately.
*
! * XXX If warnontime is true, and it's "too soon" since the last checkpoint,
* the bgwriter will log a warning. This should be true only for checkpoints
* caused due to xlog filling, else the warning will be misleading.
*/
void
! RequestCheckpoint(bool waitforit, CheckpointReason reason)
{
/* use volatile pointer to prevent code rearrangement */
volatile BgWriterShmemStruct *bgs = BgWriterShmem;
***************
*** 654,660 ****
*/
if (!IsPostmasterEnvironment)
{
! CreateCheckPoint(false, true);
/*
* After any checkpoint, close all smgr files. This is so we won't
--- 657,663 ----
*/
if (!IsPostmasterEnvironment)
{
! CreateCheckPoint(reason);
/*
* After any checkpoint, close all smgr files. This is so we won't
***************
*** 666,673 ****
}
/* Set warning request flag if appropriate */
! if (warnontime)
! bgs->ckpt_time_warn = true;
/*
* Send signal to request checkpoint. When waitforit is false, we
--- 669,675 ----
}
/* Set warning request flag if appropriate */
! bgs->ckpt_reason = reason;
/*
* Send signal to request checkpoint. When waitforit is false, we
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 2 May 2007 23:34:48 -0000 1.218
--- src/backend/storage/buffer/bufmgr.c 20 May 2007 16:09:02 -0000
***************
*** 43,48 ****
--- 43,49 ----
#include "postgres.h"
#include <sys/file.h>
+ #include <sys/time.h>
#include <unistd.h>
#include "miscadmin.h"
***************
*** 1006,1011 ****
--- 1007,1013 ----
if (SyncOneBuffer(buf_id, false))
{
BgWriterStats.m_buf_written_checkpoints++;
+ checkpointProgress.cs_buffers_written++;
/*
* If in bgwriter, absorb pending fsync requests after each
***************
*** 1323,1330 ****
--- 1325,1339 ----
void
FlushBufferPool(void)
{
+ checkpointProgress.cs_buffers_written = 0;
+
BufferSync();
+
+ gettimeofday(&checkpointProgress.cs_write_t, NULL);
+
smgrsync();
+
+ gettimeofday(&checkpointProgress.cs_sync_t, NULL);
}
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.279
diff -c -r1.279 utility.c
*** src/backend/tcop/utility.c 27 Apr 2007 22:05:49 -0000 1.279
--- src/backend/tcop/utility.c 20 May 2007 10:22:16 -0000
***************
*** 1089,1095 ****
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser to do CHECKPOINT")));
! RequestCheckpoint(true, false);
break;
case T_ReindexStmt:
--- 1089,1095 ----
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser to do CHECKPOINT")));
! RequestCheckpoint(true, CHKP_COMMAND);
break;
case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.391
diff -c -r1.391 guc.c
*** src/backend/utils/misc/guc.c 8 May 2007 16:33:51 -0000 1.391
--- src/backend/utils/misc/guc.c 20 May 2007 10:11:45 -0000
***************
*** 104,109 ****
--- 104,110 ----
extern int CommitSiblings;
extern char *default_tablespace;
extern bool fullPageWrites;
+ extern bool log_checkpoints;
#ifdef TRACE_SORT
extern bool trace_sort;
***************
*** 596,601 ****
--- 597,610 ----
false, NULL, NULL
},
{
+ {"log_checkpoints", PGC_BACKEND, LOGGING_WHAT,
+ gettext_noop("Logs beginning and end of checkpoints, and some stats XXX."),
+ NULL
+ },
+ &log_checkpoints,
+ false, NULL, NULL
+ },
+ {
{"debug_assertions", PGC_USERSET, DEVELOPER_OPTIONS,
gettext_noop("Turns on various assertion checks."),
gettext_noop("This is a debugging aid."),
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.215
diff -c -r1.215 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample 18 Apr 2007 16:44:18 -0000 1.215
--- src/backend/utils/misc/postgresql.conf.sample 20 May 2007 10:14:36 -0000
***************
*** 318,323 ****
--- 318,324 ----
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = off
+ #log_checkpoints = off # XXX: Should this be under the checkpoints section?
#log_connections = off
#log_disconnections = off
#log_duration = off
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76
--- src/include/access/xlog.h 20 May 2007 15:10:43 -0000
***************
*** 128,133 ****
--- 128,164 ----
struct XLogRecData *next; /* next struct in chain, or NULL */
} XLogRecData;
+ /* specifying CHKP_SHUTDOWN modifies behavior. The other reasons
+ * are distinct for informational purposes */
+ typedef enum CheckpointReason
+ {
+ CHKP_SHUTDOWN = 1,
+ CHKP_COMMAND,
+ CHKP_BACKUP,
+ CHKP_TIMEOUT,
+ CHKP_SEGMENTS,
+ CHKP_CREATEDB,
+ CHKP_DROPDB
+ } CheckpointReason;
+
+ /* CheckpointProgress is used to collect information about the progress
+ * of checkpoint. When there's no checkpoint in progress, the fields store
+ * the details of the last finished checkpoint.
+ *
+ * The fields are in the order they're filled in during a checkpoint.
+ */
+ typedef struct CheckpointProgressData
+ {
+ CheckpointReason cs_reason;
+ struct timeval cs_start_t;
+ int cs_buffers_written;
+ struct timeval cs_write_t;
+ struct timeval cs_sync_t;
+ struct timeval cs_end_t;
+ } CheckpointProgressData;
+
+ extern CheckpointProgressData checkpointProgress;
+
extern TimeLineID ThisTimeLineID; /* current TLI */
extern bool InRecovery;
extern XLogRecPtr MyLastRecPtr;
***************
*** 162,168 ****
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr GetRedoRecPtr(void);
extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
--- 193,199 ----
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(CheckpointReason reason);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr GetRedoRecPtr(void);
extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h 5 Jan 2007 22:19:57 -0000 1.9
--- src/include/postmaster/bgwriter.h 20 May 2007 10:05:03 -0000
***************
*** 12,17 ****
--- 12,18 ----
#ifndef _BGWRITER_H
#define _BGWRITER_H
+ #include "access/xlog.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
***************
*** 23,29 ****
extern void BackgroundWriterMain(void);
! extern void RequestCheckpoint(bool waitforit, bool warnontime);
extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
extern void AbsorbFsyncRequests(void);
--- 24,30 ----
extern void BackgroundWriterMain(void);
! extern void RequestCheckpoint(bool waitforit, CheckpointReason reason);
extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
extern void AbsorbFsyncRequests(void);
On Mon, 2007-05-28 at 21:25 +0100, Heikki Linnakangas wrote:
- RequestCheckpoint and CreateCheckpoint were changed to take a
CheckpointReason enum as an argument, instead of boolean flags. This
allows printing a different message depending on the cause of the
checkpoint, and is also cleaner IMHO.
Looks good.
I'd be inclined to make the reason code for creating/dropping a
database the same and just say "Cluster DDL" or something, since for
analysis purposes we just want to know it was DDL.
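To illustrate what the reason codes buy us, here is a minimal sketch (the
helper name and the message wording are invented, not part of the posted
patch) of how a single CheckpointReason argument could drive the log text,
with the create/drop cases folded together as suggested:

/* Hypothetical helper, assuming the CheckpointReason enum added to
 * xlog.h by the patch; maps each reason to text for the log message,
 * with the two DDL cases collapsed into one. */
static const char *
checkpoint_reason_string(CheckpointReason reason)
{
    switch (reason)
    {
        case CHKP_SHUTDOWN:
            return "shutdown";
        case CHKP_COMMAND:
            return "CHECKPOINT command";
        case CHKP_BACKUP:
            return "online backup";
        case CHKP_TIMEOUT:
            return "checkpoint_timeout passed";
        case CHKP_SEGMENTS:
            return "wrote checkpoint_segments";
        case CHKP_CREATEDB:
        case CHKP_DROPDB:
            return "database DDL";
        default:
            return "unknown";
    }
}

/* e.g. ereport(DEBUG2, (errmsg("checkpoint required (%s)",
 *                              checkpoint_reason_string(reason)))); */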
I'd rather see the timings in seconds rather than ms. Counting
significant digits will be a pain...
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
I'll take another stab at refining this can of worms I opened. The one
thing I noticed on a quick review is that it's almost possible to skip all
the calls to gettimeofday if log_checkpoints is off now. I'd like to make
that a specific goal, because that will make me feel better that adding
this code has almost no performance impact relative to now unless you turn
the feature on.
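For concreteness, something along these lines would do it for the two calls
in FlushBufferPool; this is only an untested sketch against the bufmgr.c
hunk in the attached diff, and it assumes the log_checkpoints flag is
visible from there. The remaining readings (cs_start_t, cs_end_t) would
need the same guard.

void
FlushBufferPool(void)
{
    checkpointProgress.cs_buffers_written = 0;

    BufferSync();

    /* only pay for the timing calls when they will actually be logged */
    if (log_checkpoints)
        gettimeofday(&checkpointProgress.cs_write_t, NULL);

    smgrsync();

    if (log_checkpoints)
        gettimeofday(&checkpointProgress.cs_sync_t, NULL);
}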
I agree with Simon that tracking create/drop separately is unnecessary.
As for why all the timing info is in ms: given the scale of the numbers
typically encountered, I found it easier to work with. I originally wanted
resolution down to 0.1ms if the underlying OS supports it, which means 4
figures to the right of the decimal point if the unit were switched to
seconds. Quite often the times reported are less than 100ms, so you'll
normally be dealing with the fractional part of a second. If we take
Heikki's example:
LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms
sync=32.0 ms
And switch it to seconds:
LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s
sync=0.0320 s
I don't find that as easy to work with. The only way a timing in seconds
would look OK is if the resolution of the whole thing is reduced to ms,
which then makes 3 decimal points--easy to read as ms instead. Having
stared at a fair amount of this data now, that's probably fine; I'll
collect up some more data on it from a fast server this week to confirm
whether it's worthless precision or worth capturing.
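The arithmetic involved is trivial either way; here is a hypothetical
helper (not from the patch) that turns two struct timeval readings into
the millisecond figure the current output format uses:

#include <sys/time.h>

/* Elapsed time between two gettimeofday() readings, in milliseconds
 * with 0.1 ms resolution; the same value expressed in seconds needs
 * four digits after the decimal point. */
static double
elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_usec - start->tv_usec) / 1000.0;
}

A caller could then print something like "write=%.1f ms" using
elapsed_ms(&checkpointProgress.cs_start_t, &checkpointProgress.cs_write_t).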
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
I'll take another stab at refining this can of worms I opened. The one
thing I noticed on a quick review is that it's almost possible to skip
all the calls to gettimeofday if log_checkpoints is off now. I'd like
to make that a specific goal, because that will make me feel better that
adding this code has almost no performance impact relative to now unless
you turn the feature on.
Saving a couple of gettimeofday calls on an event that happens as
infrequently as checkpoints is not going to make any difference,
especially compared to all the other work that's done at checkpoint
time.
I agree with Simon that tracking create/drop separately is unnecessary.
As for why all the timing info is in ms: given the scale of the numbers
typically encountered, I found it easier to work with. I originally
wanted resolution down to 0.1ms if the underlying OS supports it, which
means 4 figures to the right of the decimal point if the unit were
switched to seconds. Quite often the times reported are less than
100ms, so you'll normally be dealing with the fractional part of a second.
If we take Heikki's example:
LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms
sync=32.0 ms
And switch it to seconds:
LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s
sync=0.0320 s
I don't find that as easy to work with. The only way a timing in
seconds would look OK is if the resolution of the whole thing is reduced
to ms, which then makes 3 decimal points--easy to read as ms instead.
Having stared at a fair amount of this data now, that's probably fine;
I'll collect up some more data on it from a fast server this week to
confirm whether it's worthless precision or worth capturing.
The checkpoint will take at least a couple of seconds on any interesting
system, so 0.1 s resolution should be enough IMHO.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 29 May 2007, Heikki Linnakangas wrote:
The checkpoint will take at least a couple of seconds on any interesting
system, so 0.1 s resolution should be enough IMHO.
You may be underestimating the resources some interesting systems are
willing to put into speeding up checkpoints. I'm sometimes dumping into a
SAN whose cache is bigger than the shared_buffer cache in the server, and
0.1s isn't really enough resolution in that situation. A second is a
really long checkpoint there. Since even that's limited by Fibre
Channel speeds, I know it's possible to do better than what I'm seeing
with something like a PCIe host adapter with on-board cache in the GB
range (which isn't that expensive nowadays).
Also, even if the checkpoint total takes seconds, much of that is in the
sync phase; the write time can still be in the small number of ms range,
and I wouldn't want to see that truncated too much.
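To make the truncation concrete, here is a throwaway illustration using
the write/sync timings from Heikki's earlier example, printed at 0.1 s
resolution:

#include <stdio.h>

int
main(void)
{
    /* 96.8 ms write and 32.0 ms sync, reported with one decimal of seconds */
    printf("write=%.1f s sync=%.1f s\n", 0.0968, 0.0320);
    /* prints "write=0.1 s sync=0.0 s"; the sync figure disappears entirely */
    return 0;
}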
Anyway, I have a bunch of data on this subject being collected at this
moment, and I'll rescale the results based on what I see after analyzing
that this week.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD