Partitioned checkpointing
Hi All,
Recently I found a paper titled "Segmented Fuzzy Checkpointing for
Main Memory Databases," published in 1996 at the ACM Symposium on Applied
Computing, which inspired me to implement a similar mechanism in PostgreSQL.
Since the early evaluation results obtained on a 16-core server were beyond
my expectations, I have decided to submit a patch for discussion by
community members interested in this mechanism.
The attached patch is a PoC (or maybe prototype) implementation of
partitioned checkpointing on 9.5alpha2. The term 'partitioned' is used here
instead of 'segmented' because I feel 'segmented' is easily confused with
'xlog segment,' etc. In contrast, 'partitioned' is not, as it implies almost
the same concept as 'buffer partition,' so I think it is suitable.
The background and my motivation: the performance dip due to checkpoints is
a major concern, and thus it is valuable to mitigate this issue. In fact,
many countermeasures have been attempted against it. As far as I know, those
countermeasures so far focus mainly on mitigating the adverse impact of the
disk writes that implement the buffer sync; the recent, highly regarded
'checkpointer continuous flushing' work is a typical example. On the other
hand, I don't feel that another source of the performance dip has been
seriously addressed: what I call here the full-page-write rush. That is,
the average size of transaction log (XLOG) records jumps up sharply
immediately after the beginning of each checkpoint, saturating the WAL
write path, including the disk(s) for $PGDATA/pg_xlog and the WAL buffers.
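To illustrate the cause, here is a minimal sketch (my own paraphrase, not
the actual PostgreSQL code; it mirrors the test visible in
XLogRecordAssemble() and XLogCheckBufferNeedsBackup()):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

/*
 * A page whose LSN is at or before the checkpoint's redo pointer has not
 * been backed up since the checkpoint began, so its first modification
 * must carry a full-page image.  Immediately after RedoRecPtr advances,
 * this holds for almost every page, hence the 'full-page-write rush.'
 */
static bool
needs_full_page_write(XLogRecPtr page_lsn, XLogRecPtr redo_rec_ptr,
                      bool do_page_writes)
{
    return do_page_writes && page_lsn <= redo_rec_ptr;
}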
In the following, I will briefly describe the early evaluation results and
the mechanism of partitioned checkpointing.
1. Performance evaluation
1.1 Experimental setup
The configuration of the server machine was as follows.
CPU: Intel E5-2650 v2 (8 cores/chip) @ 2.60GHz x 2
Memory: 64GB
OS: Linux 2.6.32-504.12.2.el6.x86_64 (CentOS)
Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
raid1 of 2 SSDs for $PGDATA (other than pg_xlog)
PostgreSQL settings
shared_buffers = 28GB
wal_buffers = 64MB
checkpoint_timeout = 10min
max_wal_size = 128GB
min_wal_size = 8GB
checkpoint_completion_target = 0.9
Benchmark
pgbench -M prepared -N -P 1 -T 3600
The scaling factor was 1000. Both the number of clients (-c option) and
threads (-j option) were 120 in the synchronous-commit case and 96 in the
asynchronous-commit (synchronous_commit = off) case. These values were
chosen because they yielded the maximum throughput in each case.
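For instance, the synchronous-commit runs would look like the following
(reconstructed from the parameters above; the database name 'pgbench' and
the omitted connection options are my assumption, not taken from the
original command line):
pgbench -M prepared -N -P 1 -T 3600 -c 120 -j 120 pgbench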
The server was connected via 1 Gb Ethernet to a client machine on which the
pgbench client program ran. Since the client machine was not saturated
during the measurements and thus hardly affected the results, its details
are not described here.
1.2 Early results
The measurement results shown here are the latency average, latency stddev,
and throughput (tps) as reported by the pgbench program.
1.2.1 synchronous_commit = on
(a) 9.5alpha2 (original)
latency average: 2.852 ms
latency stddev: 6.010 ms
tps = 42025.789717 (including connections establishing)
tps = 42026.137247 (excluding connections establishing)
(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.815 ms
latency stddev: 2.317 ms
tps = 42575.301137 (including connections establishing)
tps = 42575.677907 (excluding connections establishing)
1.2.2 synchronous_commit = off
(a) 9.5alpha2 (original)
latency average: 2.136 ms
latency stddev: 5.422 ms
tps = 44870.897907 (including connections establishing)
tps = 44871.215792 (excluding connections establishing)
(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.085 ms
latency stddev: 1.529 ms
tps = 45974.617724 (including connections establishing)
tps = 45974.973604 (excluding connections establishing)
1.3 Summary
Partitioned checkpointing produced a great improvement (reduction) in
latency stddev and a slight improvement in latency average and tps; there
was no performance degradation. Partitioned checkpointing therefore has a
stabilizing effect on operation. In fact, the throughput variation, obtained
with the -P 1 option, shows that the dips were mitigated in both magnitude
and frequency.
# Since I'm not sure whether it is OK to send an email to this mailing list
with attachments other than a patch, I refrain for now from attaching the
raw results (200K bytes of text per case) and the result graphs in .jpg or
.epsf format illustrating the throughput variations. If it is OK, I'd be
pleased to show the results in those formats.
2. Mechanism
As one might imagine, 'partitioned checkpointing' conducts the buffer sync
not for all buffers at once but only for the buffers belonging to one
partition per invocation of the checkpointer. In the following description,
the number of partitions is denoted by N (N is fixed at 16 in the attached
patch).
2.1 Principles of operation
To preserve the semantics of traditional checkpointing, the checkpointer
invocation interval is changed to checkpoint_timeout / N. The checkpointer
carries out the buffer sync for buffer partition 0 at the first invocation,
for buffer partition 1 at the second invocation, and so on. When the turn of
buffer partition N-1 comes, i.e. the last round of a series of buffer syncs,
the checkpointer carries out the buffer sync for that partition together
with the other usual checkpoint operations, coded in CheckPointGuts() in
xlog.c.
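The control flow is roughly as follows (my own simplification, not the patch
code itself; CreateCheckPoint() is the existing function in xlog.c,
CheckPointBufferPartition() is added by the patch, and restartpoints,
trigger events, and the on-schedule throttling are omitted):

#include <unistd.h>

#define CheckPointPartitions 16   /* N; fixed in the attached patch */

extern void CreateCheckPoint(int flags);
extern void CheckPointBufferPartition(int flags, int partition);

static void
checkpointer_cycle(int checkpoint_timeout, int flags)
{
    int partition;

    for (partition = 0; partition < CheckPointPartitions; partition++)
    {
        /* wake up N times per checkpoint_timeout instead of once */
        sleep(checkpoint_timeout / CheckPointPartitions);

        if (partition == CheckPointPartitions - 1)
            CreateCheckPoint(flags);    /* last round: buffer sync plus the
                                         * usual checkpoint operations */
        else
            CheckPointBufferPartition(flags, partition); /* one partition */
    }
}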
The principle is, roughly speaking, that 1) checkpointing for buffer
partition 0 corresponds to the beginning of a traditional checkpoint, where
the XLOG location (LSN) is obtained and set to RedoRecPtr, and 2)
checkpointing for buffer partition N-1 corresponds to the end of a
traditional checkpoint, where the WAL files that are no longer needed (up to
the log segment preceding the one specified by the RedoRecPtr value) are
deleted or recycled.
The role of RedoRecPtr as the threshold for deciding whether an FPW is
necessary is taken over by a new N-element array of XLogRecPtr, since the
threshold differs among partitions. The n-th element of the array is updated
when the buffer sync for partition n is carried out.
2.2 Drawbacks
Partitioned checkpointing works effectively when the checkpointer is invoked
by reaching checkpoint_timeout: the performance dip is mitigated and the WAL
size is unchanged (on average).
On the other hand, when the checkpointer is invoked by a trigger event other
than the timeout, the traditional checkpoint procedure, which syncs all
buffers at once, takes place, resulting in a performance dip. Also, the WAL
size for that checkpoint period (until the next invocation of the
checkpointer) will theoretically increase to 1.5 times that of the usual
case because of the increase in FPWs.
My opinion is that this is not serious, because it is preferable for the
checkpointer to be invoked by the timeout anyway, and thus typical systems
are expected to be tuned to operate under exactly the conditions that favor
partitioned checkpointing.
3. Conclusion
The partitioned checkpointing mechanism is expected to be effective in
mitigating the performance dip due to checkpoints. In particular, it is
noteworthy that the effect was observed on a server machine that uses SSDs
for $PGDATA, for which seek optimizations are not believed to be effective.
Therefore, this mechanism is worth further investigation, aiming at
implementation in a future PostgreSQL release.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Attachments:
partitioned-checkpointing.patch (application/octet-stream)
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/backend/access/transam/xlog.c postgresql-9.5alpha2-partitionedCKPT/src/backend/access/transam/xlog.c
--- postgresql-9.5alpha2/src/backend/access/transam/xlog.c 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/backend/access/transam/xlog.c 2015-09-10 17:00:29.779619746 +0900
@@ -336,7 +336,9 @@
* see GetRedoRecPtr. A freshly spawned backend obtains the value during
* InitXLOGAccess.
*/
+static XLogRecPtr BndryRecPtr[CheckPointPartitions];
static XLogRecPtr RedoRecPtr;
+static XLogRecPtr PriorCkptPtr;
/*
* doPageWrites is this backend's local copy of (forcePageWrites ||
@@ -492,6 +494,7 @@
* To read these fields, you must hold an insertion lock. To modify them,
* you must hold ALL the locks.
*/
+ XLogRecPtr BndryRecPtr[CheckPointPartitions];
XLogRecPtr RedoRecPtr; /* current redo point for insertions */
bool forcePageWrites; /* forcing full-page writes for PITR? */
bool fullPageWrites;
@@ -524,7 +527,8 @@
/* Protected by info_lck: */
XLogwrtRqst LogwrtRqst;
- XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */
+ XLogRecPtr BndryRecPtr[CheckPointPartitions]; /* a recent copy of Insert->BndryRecPtr */
+ XLogRecPtr RedoRecPtr;
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
@@ -891,7 +895,7 @@
* WAL rule "write the log before the data".)
*/
XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata, XLogRecPtr *fpw_lsn)
{
XLogCtlInsert *Insert = &XLogCtl->Insert;
pg_crc32c rdata_crc;
@@ -901,6 +905,7 @@
rechdr->xl_info == XLOG_SWITCH);
XLogRecPtr StartPos;
XLogRecPtr EndPos;
+ int i;
/* we assume that all of the record header is in the first chunk */
Assert(rdata->len >= SizeOfXLogRecord);
@@ -959,6 +964,12 @@
* we could recompute the record without full pages, but we choose not to
* bother.)
*/
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ if (BndryRecPtr[i] != Insert->BndryRecPtr[i])
+ {
+ Assert(BndryRecPtr[i] < Insert->BndryRecPtr[i]);
+ BndryRecPtr[i] = Insert->BndryRecPtr[i];
+ }
if (RedoRecPtr != Insert->RedoRecPtr)
{
Assert(RedoRecPtr < Insert->RedoRecPtr);
@@ -966,16 +977,17 @@
}
doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
- if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
- {
- /*
- * Oops, some buffer now needs to be backed up that the caller didn't
- * back up. Start over.
- */
- WALInsertLockRelease();
- END_CRIT_SECTION();
- return InvalidXLogRecPtr;
- }
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ if (fpw_lsn[i] != InvalidXLogRecPtr && fpw_lsn[i] <= BndryRecPtr[i] && doPageWrites)
+ {
+ /*
+ * Oops, some buffer now needs to be backed up that the caller didn't
+ * back up. Start over.
+ */
+ WALInsertLockRelease();
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
/*
* Reserve space for the record in the WAL. This also sets the xl_prev
@@ -5913,6 +5925,7 @@
XLogPageReadPrivate private;
bool fast_promoted = false;
struct stat st;
+ int i;
/*
* Read control file and check XLOG status looks valid.
@@ -6388,6 +6401,8 @@
lastFullPageWrites = checkPoint.fullPageWrites;
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ BndryRecPtr[i] = XLogCtl->BndryRecPtr[i] = XLogCtl->Insert.BndryRecPtr[i] = checkPoint.redo;
RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
doPageWrites = lastFullPageWrites;
@@ -7814,6 +7829,8 @@
GetRedoRecPtr(void)
{
XLogRecPtr ptr;
+ XLogRecPtr bndryRecPtr[CheckPointPartitions];
+ int i;
/*
* The possibly not up-to-date copy in XlogCtl is enough. Even if we
@@ -7821,9 +7838,14 @@
* update it just after we've released the lock.
*/
SpinLockAcquire(&XLogCtl->info_lck);
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ bndryRecPtr[i] = XLogCtl->BndryRecPtr[i];
ptr = XLogCtl->RedoRecPtr;
SpinLockRelease(&XLogCtl->info_lck);
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ if (BndryRecPtr[i] < bndryRecPtr[i])
+ BndryRecPtr[i] = bndryRecPtr[i];
if (RedoRecPtr < ptr)
RedoRecPtr = ptr;
@@ -7839,10 +7861,11 @@
* up-to-date values, while holding the WAL insert lock.
*/
void
-GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
+GetFullPageWriteInfo(XLogRecPtr **BndryRecPtr_p, bool *doPageWrites_p, bool update)
{
- *RedoRecPtr_p = RedoRecPtr;
+ *BndryRecPtr_p = BndryRecPtr;
*doPageWrites_p = doPageWrites;
+ if (update) GetRedoRecPtr();
}
/*
@@ -7985,6 +8008,12 @@
(flags & CHECKPOINT_FLUSH_ALL) ? " flush-all" : "");
}
+static void
+LogBufferPartitionCheckpointStart(int partition)
+{
+ elog(LOG, "checkpoint for buffer partition %d starting: time", partition);
+}
+
/*
* Log end of a checkpoint.
*/
@@ -8068,6 +8097,85 @@
}
/*
+ * Log end of a buffer partition checkpoint.
+ */
+static void
+LogBufferPartitionCheckpointEnd(int partition, double distance)
+{
+ long write_secs,
+ sync_secs,
+ total_secs,
+ longest_secs,
+ average_secs;
+ int write_usecs,
+ sync_usecs,
+ total_usecs,
+ longest_usecs,
+ average_usecs;
+ uint64 average_sync_time;
+
+ CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
+
+ TimestampDifference(CheckpointStats.ckpt_write_t,
+ CheckpointStats.ckpt_sync_t,
+ &write_secs, &write_usecs);
+
+ TimestampDifference(CheckpointStats.ckpt_sync_t,
+ CheckpointStats.ckpt_sync_end_t,
+ &sync_secs, &sync_usecs);
+
+ /* Accumulate checkpoint timing summary data, in milliseconds. */
+ BgWriterStats.m_checkpoint_write_time +=
+ write_secs * 1000 + write_usecs / 1000;
+ BgWriterStats.m_checkpoint_sync_time +=
+ sync_secs * 1000 + sync_usecs / 1000;
+
+ /*
+ * All of the published timing statistics are accounted for. Only
+ * continue if a log message is to be written.
+ */
+ if (!log_checkpoints)
+ return;
+
+ TimestampDifference(CheckpointStats.ckpt_start_t,
+ CheckpointStats.ckpt_end_t,
+ &total_secs, &total_usecs);
+
+ /*
+ * Timing values returned from CheckpointStats are in microseconds.
+ * Convert to the second plus microsecond form that TimestampDifference
+ * returns for homogeneous printing.
+ */
+ longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000000);
+ longest_usecs = CheckpointStats.ckpt_longest_sync -
+ (uint64) longest_secs *1000000;
+
+ average_sync_time = 0;
+ if (CheckpointStats.ckpt_sync_rels > 0)
+ average_sync_time = CheckpointStats.ckpt_agg_sync_time /
+ CheckpointStats.ckpt_sync_rels;
+ average_secs = (long) (average_sync_time / 1000000);
+ average_usecs = average_sync_time - (uint64) average_secs *1000000;
+
+ elog(LOG, "checkpoint for buffer partition %d complete: wrote %d buffers (%.1f%%); "
+ "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "distance up to this partition=%d kB (* %d / %d = %d kB)",
+ partition,
+ CheckpointStats.ckpt_bufs_written,
+ (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+ write_secs, write_usecs / 1000,
+ sync_secs, sync_usecs / 1000,
+ total_secs, total_usecs / 1000,
+ CheckpointStats.ckpt_sync_rels,
+ longest_secs, longest_usecs / 1000,
+ average_secs, average_usecs / 1000,
+ (int) (distance / 1024.0),
+ CheckPointPartitions, partition + 1,
+ (int) ((distance * CheckPointPartitions) / (partition + 1) / 1024.0));
+}
+
+/*
* Update the estimate of distance between checkpoints.
*
* The estimate is used to calculate the number of WAL segments to keep
@@ -8148,6 +8256,7 @@
XLogRecPtr prevPtr;
VirtualTransactionId *vxids;
int nvxids;
+ int i;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -8295,7 +8404,6 @@
else
curInsert += SizeOfXLogShortPHD;
}
- checkPoint.redo = curInsert;
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -8308,7 +8416,15 @@
* XLogInserts that happen while we are dumping buffers must assume that
* their buffer changes are not included in the checkpoint.
*/
- RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+ if (!isCkptPartial(flags))
+ {
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ BndryRecPtr[i] = XLogCtl->Insert.BndryRecPtr[i] = curInsert;
+ } else {
+ BndryRecPtr[CheckPointPartitions-1] = XLogCtl->Insert.BndryRecPtr[CheckPointPartitions-1] = curInsert;
+ }
+ /* Minimum RedoRecPtr is stored in BndryRecPtr[0] now. */
+ checkPoint.redo = RedoRecPtr = XLogCtl->Insert.RedoRecPtr = BndryRecPtr[0];
/*
* Now we can release the WAL insertion locks, allowing other xacts to
@@ -8317,9 +8433,19 @@
WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
- SpinLockAcquire(&XLogCtl->info_lck);
- XLogCtl->RedoRecPtr = checkPoint.redo;
- SpinLockRelease(&XLogCtl->info_lck);
+ if (!isCkptPartial(flags))
+ {
+ SpinLockAcquire(&XLogCtl->info_lck);
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ XLogCtl->BndryRecPtr[i] = curInsert;
+ XLogCtl->RedoRecPtr = XLogCtl->BndryRecPtr[0];
+ SpinLockRelease(&XLogCtl->info_lck);
+ } else {
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->BndryRecPtr[CheckPointPartitions-1] = curInsert;
+ XLogCtl->RedoRecPtr = XLogCtl->BndryRecPtr[0];
+ SpinLockRelease(&XLogCtl->info_lck);
+ }
/*
* If enabled, log checkpoint start. We postpone this until now so as not
@@ -8410,7 +8536,7 @@
}
pfree(vxids);
- CheckPointGuts(checkPoint.redo, flags);
+ CheckPointGuts(RedoRecPtr, flags);
/*
* Take a snapshot of running transactions and write this to WAL. This
@@ -8523,7 +8649,7 @@
XLogSegNo _logSegNo;
/* Update the average distance between checkpoints. */
- UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+ UpdateCheckPointDistanceEstimate(curInsert - PriorCkptPtr);
XLByteToSeg(PriorRedoPtr, _logSegNo);
KeepLogSeg(recptr, &_logSegNo);
@@ -8532,6 +8658,11 @@
}
/*
+ * Remember this checkpoint's (pseudo) redo pointer, used later to determine the distance between checkpoints.
+ */
+ PriorCkptPtr = curInsert;
+
+ /*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
@@ -8565,6 +8696,107 @@
LWLockRelease(CheckpointLock);
}
+void
+CheckPointBufferPartition(int flags, int partition)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ XLogRecPtr curInsert;
+
+ /*
+ * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
+ * (This is just pro forma, since in the present system structure there is
+ * only one process that is allowed to issue checkpoints at any given
+ * time.)
+ */
+ LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+ /*
+ * Prepare to accumulate statistics.
+ *
+ * Note: because it is possible for log_checkpoints to change while a
+ * checkpoint proceeds, we always accumulate stats, even if
+ * log_checkpoints is currently off.
+ */
+ MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+ CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+ /*
+ * Use a critical section to force system panic if we have trouble.
+ */
+ START_CRIT_SECTION();
+
+ /*
+ * We must block concurrent insertions while examining insert state to
+ * determine the checkpoint REDO pointer.
+ */
+ WALInsertLockAcquireExclusive();
+ curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+
+ /*
+ * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+ * must be done while holding all the insertion locks.
+ *
+ * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+ * pointing past where it really needs to point. This is okay; the only
+ * consequence is that XLogInsert might back up whole buffers that it
+ * didn't really need to. We can't postpone advancing RedoRecPtr because
+ * XLogInserts that happen while we are dumping buffers must assume that
+ * their buffer changes are not included in the checkpoint.
+ */
+ BndryRecPtr[partition] = xlogctl->Insert.BndryRecPtr[partition] = curInsert;
+
+ /*
+ * Now we can release the WAL insertion locks, allowing other xacts to
+ * proceed while we are flushing disk buffers.
+ */
+ WALInsertLockRelease();
+
+ /* Update the info_lck-protected copy of RedoRecPtr as well */
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->BndryRecPtr[partition] = curInsert;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ /*
+ * If enabled, log checkpoint start. We postpone this until now so as not
+ * to log anything if we decided to skip the checkpoint.
+ */
+ if (log_checkpoints)
+ LogBufferPartitionCheckpointStart(partition);
+
+ TRACE_POSTGRESQL_CHECKPOINT_START(flags);
+
+ /*
+ * Having constructed the checkpoint record, ensure all shmem disk buffers
+ * and commit-log buffers are flushed to disk.
+ *
+ * This I/O could fail for various reasons. If so, we will fail to
+ * complete the checkpoint, but there is no reason to force a system
+ * panic. Accordingly, exit critical section while doing it.
+ */
+ END_CRIT_SECTION();
+
+ CheckPointBuffers(flags, partition, CheckPointPartitions);
+
+ /* Real work is done, but log and update stats before releasing lock. */
+ LogBufferPartitionCheckpointEnd(partition, (double) (curInsert - PriorCkptPtr));
+
+ TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
+ NBuffers,
+ CheckpointStats.ckpt_segs_added,
+ CheckpointStats.ckpt_segs_removed,
+ CheckpointStats.ckpt_segs_recycled);
+
+ LWLockRelease(CheckpointLock);
+}
+
+void
+XLogSetInitialRedoRecPtr()
+{
+ PriorCkptPtr = ControlFile->checkPointCopy.redo;
+}
+
/*
* Mark the end of recovery in WAL though without running a full checkpoint.
* We can expect that a restartpoint is likely to be in progress as we
@@ -8635,7 +8867,10 @@
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
- CheckPointBuffers(flags); /* performs all required fsyncs */
+ if (!isCkptPartial(flags))
+ CheckPointBuffers(flags, 0, 1); /* performs all required fsyncs */
+ else
+ CheckPointBuffers(flags, CheckPointPartitions-1, CheckPointPartitions);
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
@@ -8699,6 +8934,7 @@
CheckPoint lastCheckPoint;
XLogRecPtr PriorRedoPtr;
TimestampTz xtime;
+ int i;
/*
* Acquire CheckpointLock to ensure only one restartpoint or checkpoint
@@ -8769,11 +9005,15 @@
* happening.
*/
WALInsertLockAcquireExclusive();
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ XLogCtl->Insert.BndryRecPtr[i] = lastCheckPoint.redo;
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
WALInsertLockRelease();
/* Also update the info_lck-protected copy */
SpinLockAcquire(&XLogCtl->info_lck);
+ for (i = 0 ; i < CheckPointPartitions ; i++)
+ XLogCtl->BndryRecPtr[i] = lastCheckPoint.redo;
XLogCtl->RedoRecPtr = lastCheckPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/backend/access/transam/xloginsert.c postgresql-9.5alpha2-partitionedCKPT/src/backend/access/transam/xloginsert.c
--- postgresql-9.5alpha2/src/backend/access/transam/xloginsert.c 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/backend/access/transam/xloginsert.c 2015-09-10 17:03:25.143116794 +0900
@@ -107,7 +107,7 @@
static MemoryContext xloginsert_cxt;
static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
- XLogRecPtr RedoRecPtr, bool doPageWrites,
+ XLogRecPtr *BndryRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
@@ -435,9 +435,9 @@
do
{
- XLogRecPtr RedoRecPtr;
+ XLogRecPtr *BndryRecPtr;
bool doPageWrites;
- XLogRecPtr fpw_lsn;
+ XLogRecPtr fpw_lsn[CheckPointPartitions];
XLogRecData *rdt;
/*
@@ -445,10 +445,10 @@
* we don't yet have an insertion lock, these could change under us,
* but XLogInsertRecData will recheck them once it has a lock.
*/
- GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
+ GetFullPageWriteInfo(&BndryRecPtr, &doPageWrites, false);
- rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
- &fpw_lsn);
+ rdt = XLogRecordAssemble(rmid, info, BndryRecPtr, doPageWrites,
+ fpw_lsn);
EndPos = XLogInsertRecord(rdt, fpw_lsn);
} while (EndPos == InvalidXLogRecPtr);
@@ -468,11 +468,11 @@
* If there are any registered buffers, and a full-page image was not taken
* of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
* signals that the assembled record is only good for insertion on the
- * assumption that the RedoRecPtr and doPageWrites values were up-to-date.
+ * assumption that the BndryRecPtr and doPageWrites values were up-to-date.
*/
static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
- XLogRecPtr RedoRecPtr, bool doPageWrites,
+ XLogRecPtr *BndryRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn)
{
XLogRecData *rdt;
@@ -483,6 +483,7 @@
XLogRecData *rdt_datas_last;
XLogRecord *rechdr;
char *scratch = hdr_scratch;
+ int index;
/*
* Note: this function can be called multiple times for the same record.
@@ -502,7 +503,8 @@
* references. This includes the data for full-page images. Also append
* the headers for the block references in the scratch buffer.
*/
- *fpw_lsn = InvalidXLogRecPtr;
+ for (index = 0 ; index < CheckPointPartitions ; index++)
+ fpw_lsn[index] = InvalidXLogRecPtr;
for (block_id = 0; block_id < max_registered_block_id; block_id++)
{
registered_buffer *regbuf = &registered_buffers[block_id];
@@ -533,11 +535,13 @@
*/
XLogRecPtr page_lsn = PageGetLSN(regbuf->page);
- needs_backup = (page_lsn <= RedoRecPtr);
+ /* calculate the buf_id of the regbuf->page */
+ index = ckptIndex(((char *) regbuf->page - BufferBlocks) / BLCKSZ);
+ needs_backup = (page_lsn <= BndryRecPtr[index]);
if (!needs_backup)
{
- if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
- *fpw_lsn = page_lsn;
+ if (fpw_lsn[index] == InvalidXLogRecPtr || page_lsn < fpw_lsn[index])
+ fpw_lsn[index] = page_lsn;
}
}
@@ -819,15 +823,15 @@
bool
XLogCheckBufferNeedsBackup(Buffer buffer)
{
- XLogRecPtr RedoRecPtr;
+ XLogRecPtr *BndryRecPtr;
bool doPageWrites;
Page page;
- GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
+ GetFullPageWriteInfo(&BndryRecPtr, &doPageWrites, false);
page = BufferGetPage(buffer);
- if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ if (doPageWrites && PageGetLSN(page) <= BndryRecPtr[ckptIndex(buffer - 1)])
return true; /* buffer requires backup */
return false; /* buffer does not need to be backed up */
@@ -859,17 +863,18 @@
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
- XLogRecPtr RedoRecPtr;
+ XLogRecPtr *BndryRecPtr;
+ bool doPageWrites;
/*
- * Ensure no checkpoint can change our view of RedoRecPtr.
+ * Ensure no checkpoint can change our view of BndryRecPtr.
*/
Assert(MyPgXact->delayChkpt);
/*
- * Update RedoRecPtr so that we can make the right decision
+ * Update BndryRecPtr so that we can make the right decision
*/
- RedoRecPtr = GetRedoRecPtr();
+ GetFullPageWriteInfo(&BndryRecPtr, &doPageWrites, true);
/*
* We assume page LSN is first data on *every* page that can be passed to
@@ -879,7 +884,7 @@
*/
lsn = BufferGetLSNAtomic(buffer);
- if (lsn <= RedoRecPtr)
+ if (lsn <= BndryRecPtr[ckptIndex(buffer - 1)])
{
int flags;
char copied_buffer[BLCKSZ];
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/backend/postmaster/checkpointer.c postgresql-9.5alpha2-partitionedCKPT/src/backend/postmaster/checkpointer.c
--- postgresql-9.5alpha2/src/backend/postmaster/checkpointer.c 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/backend/postmaster/checkpointer.c 2015-09-08 15:56:09.331920502 +0900
@@ -159,16 +159,19 @@
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
+static pg_time_t scheduled_start_time;
static XLogRecPtr ckpt_start_recptr;
static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
static pg_time_t last_xlog_switch_time;
+static int cur_partition;
+
/* Prototypes for private functions */
static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
+static bool IsCheckpointOnSchedule(double progress, bool partitioned);
static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -347,6 +350,9 @@
*/
ProcGlobal->checkpointerLatch = &MyProc->procLatch;
+ /* Set initial RedoRecPtr as PriorCkptPtr */
+ XLogSetInitialRedoRecPtr();
+
/*
* Loop forever
*/
@@ -412,7 +418,7 @@
*/
now = (pg_time_t) time(NULL);
elapsed_secs = now - last_checkpoint_time;
- if (elapsed_secs >= CheckPointTimeout)
+ if (elapsed_secs >= (CheckPointTimeout * (cur_partition+1)) / CheckPointPartitions)
{
if (!do_checkpoint)
BgWriterStats.m_timed_checkpoints++;
@@ -425,7 +431,7 @@
*/
if (do_checkpoint)
{
- bool ckpt_performed = false;
+ enum {ckpt_none, ckpt_full, ckpt_buf} ckpt_performed = ckpt_none;
bool do_restartpoint;
/* use volatile pointer to prevent code rearrangement */
@@ -484,17 +490,31 @@
ckpt_start_recptr = GetInsertRecPtr();
ckpt_start_time = now;
ckpt_cached_elapsed = 0;
+ scheduled_start_time = last_checkpoint_time + (CheckPointTimeout * (cur_partition+1)) / CheckPointPartitions - ckpt_start_time;
/*
* Do the checkpoint.
*/
if (!do_restartpoint)
{
- CreateCheckPoint(flags);
- ckpt_performed = true;
+ if (!isCkptPartial(flags) || cur_partition == (CheckPointPartitions-1))
+ {
+ CreateCheckPoint(flags);
+ cur_partition = 0;
+ ckpt_performed = ckpt_full;
+ }
+ else
+ {
+ CheckPointBufferPartition(flags, cur_partition);
+ cur_partition++;
+ ckpt_performed = ckpt_buf;
+ }
}
else
- ckpt_performed = CreateRestartPoint(flags);
+ {
+ ckpt_performed = CreateRestartPoint(flags) ? ckpt_full : ckpt_none;
+ cur_partition = 0;
+ }
/*
* After any checkpoint, close all smgr files. This is so we
@@ -509,7 +529,7 @@
cps->ckpt_done = cps->ckpt_started;
SpinLockRelease(&cps->ckpt_lck);
- if (ckpt_performed)
+ if (ckpt_performed == ckpt_full)
{
/*
* Note we record the checkpoint start time not end time as
@@ -518,7 +538,7 @@
*/
last_checkpoint_time = now;
}
- else
+ else if (ckpt_performed == ckpt_none)
{
/*
* We were not able to perform the restartpoint (checkpoints
@@ -550,9 +570,9 @@
*/
now = (pg_time_t) time(NULL);
elapsed_secs = now - last_checkpoint_time;
- if (elapsed_secs >= CheckPointTimeout)
+ if (elapsed_secs >= (CheckPointTimeout * (cur_partition+1)) / CheckPointPartitions)
continue; /* no sleep for us ... */
- cur_timeout = CheckPointTimeout - elapsed_secs;
+ cur_timeout = (CheckPointTimeout * (cur_partition+1)) / CheckPointPartitions - elapsed_secs;
if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
{
elapsed_secs = now - last_xlog_switch_time;
@@ -680,7 +700,7 @@
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!shutdown_requested &&
!ImmediateCheckpointRequested() &&
- IsCheckpointOnSchedule(progress))
+ IsCheckpointOnSchedule(progress, isCkptPartial(flags)))
{
if (got_SIGHUP)
{
@@ -729,7 +749,7 @@
* than the elapsed time/segments.
*/
static bool
-IsCheckpointOnSchedule(double progress)
+IsCheckpointOnSchedule(double progress, bool partitioned)
{
XLogRecPtr recptr;
struct timeval now;
@@ -786,8 +806,14 @@
* Check progress against time elapsed and checkpoint_timeout.
*/
gettimeofday(&now, NULL);
- elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
- now.tv_usec / 1000000.0) / CheckPointTimeout;
+ if (!partitioned)
+ {
+ elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
+ now.tv_usec / 1000000.0) / CheckPointTimeout;
+ } else {
+ elapsed_time = ((double) ((pg_time_t) now.tv_sec - scheduled_start_time) +
+ now.tv_usec / 1000000.0) * CheckPointPartitions / CheckPointTimeout;
+ }
if (progress < elapsed_time)
{
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/backend/storage/buffer/bufmgr.c postgresql-9.5alpha2-partitionedCKPT/src/backend/storage/buffer/bufmgr.c
--- postgresql-9.5alpha2/src/backend/storage/buffer/bufmgr.c 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/backend/storage/buffer/bufmgr.c 2015-09-07 16:29:42.313033387 +0900
@@ -395,7 +395,7 @@
static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, int partition, int stride);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1572,7 +1572,7 @@
* currently have no effect here.
*/
static void
-BufferSync(int flags)
+BufferSync(int flags, int partition, int stride)
{
int buf_id;
int num_to_scan;
@@ -1609,7 +1609,7 @@
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
- for (buf_id = 0; buf_id < NBuffers; buf_id++)
+ for (buf_id = partition ; buf_id < NBuffers; buf_id += stride)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -2230,11 +2230,11 @@
* need to be flushed.
*/
void
-CheckPointBuffers(int flags)
+CheckPointBuffers(int flags, int partition, int stride)
{
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
- BufferSync(flags);
+ BufferSync(flags, partition, stride);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
smgrsync();
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/include/access/xlog.h postgresql-9.5alpha2-partitionedCKPT/src/include/access/xlog.h
--- postgresql-9.5alpha2/src/include/access/xlog.h 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/include/access/xlog.h 2015-09-08 15:55:22.762720976 +0900
@@ -209,7 +209,7 @@
struct XLogRecData;
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr *fpw_lsn);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -252,11 +252,13 @@
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
extern void CreateCheckPoint(int flags);
+extern void CheckPointBufferPartition(int flags, int partition);
+extern void XLogSetInitialRedoRecPtr(void);
extern bool CreateRestartPoint(int flags);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr XLogRestorePoint(const char *rpName);
extern void UpdateFullPageWrites(void);
-extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
+extern void GetFullPageWriteInfo(XLogRecPtr **RedoRecPtr_p, bool *doPageWrites_p, bool update);
extern XLogRecPtr GetRedoRecPtr(void);
extern XLogRecPtr GetInsertRecPtr(void);
extern XLogRecPtr GetFlushRecPtr(void);
diff -r -u -P --exclude-from=/home/postgres/cmd/diff.excludes postgresql-9.5alpha2/src/include/storage/bufmgr.h postgresql-9.5alpha2-partitionedCKPT/src/include/storage/bufmgr.h
--- postgresql-9.5alpha2/src/include/storage/bufmgr.h 2015-08-04 05:34:55.000000000 +0900
+++ postgresql-9.5alpha2-partitionedCKPT/src/include/storage/bufmgr.h 2015-09-07 16:29:42.314033413 +0900
@@ -142,6 +142,13 @@
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
+ * definitions for partitioned checkpoint
+ */
+#define CheckPointPartitions 16
+#define ckptIndex(x) ((x)&(CheckPointPartitions-1))
+#define isCkptPartial(f) ((f)==CHECKPOINT_CAUSE_TIME)
+
+/*
* prototypes for functions in bufmgr.c
*/
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
@@ -165,7 +172,7 @@
extern void InitBufferPoolBackend(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
-extern void CheckPointBuffers(int flags);
+extern void CheckPointBuffers(int flags, int partition, int stride);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
Hello Takashi-san,
I suggest that you have a look at the following patch submitted in June:
https://commitfest.postgresql.org/6/260/
And these two threads:
/messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto/
/messages/by-id/alpine.DEB.2.10.1506011320000.28433@sto/
The second thread includes many performance figures under different
conditions. I'm not sure how it would interact with what you are proposing,
but it also focuses on improving postgres availability, especially on HDD
systems, by changing how the checkpointer writes buffers, with sorting and
flushing.
Also, you may consider running some tests with the latest version of this
patch on your hardware, to complement Amit's tests?
--
Fabien.
> I don't feel that another source of the performance dip has been
> seriously addressed: what I call here the full-page-write rush. That is,
> the average size of transaction log (XLOG) records jumps up sharply
> immediately after the beginning of each checkpoint, saturating the WAL
> write path, including the disk(s) for $PGDATA/pg_xlog and the WAL
> buffers.
On this point, you may have a look at this item:
https://commitfest.postgresql.org/5/283/
--
Fabien.
> Feel free to do that. 200kB is well below this list's limits. (I'm not
> sure how easy it is to open .epsf files in today's systems, but .jpg or
> other raster image formats are pretty common.)
Thanks for your comment.
Please find two result graphs (for sync and async commit cases) in .jpg format.
I think it is obvious that the performance dips were mitigated in both magnitude and frequency by introducing partitioned checkpointing.
(In passing) one correction to my previous email, where I wrote:
> Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
> raid1 of 2 SSDs for $PGDATA (other than pg_xlog)
'raid1' is wrong and 'raid0' is correct.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: Alvaro Herrera [mailto:pgsql-hackers-owner@postgresql.org]
Sent: Friday, September 11, 2015 12:03 AM
To: Horikawa Takashi(堀川 隆)
Subject: Re: [HACKERS] Partitioned checkpointing

Takashi Horikawa wrote:
> # Since I'm not sure whether it is OK to send an email to this mailing
> list with attachments other than a patch, I refrain for now from
> attaching the raw results (200K bytes of text per case) and the result
> graphs in .jpg or .epsf format illustrating the throughput variations.
> If it is OK, I'd be pleased to show the results in those formats.

Feel free to do that. 200kB is well below this list's limits. (I'm not
sure how easy it is to open .epsf files in today's systems, but .jpg or
other raster image formats are pretty common.)

--
Álvaro Herrera
Attachments:
sync_120.JPG (image/jpeg)