Fix checkpoint skip logic on idle systems by tracking LSN progress
Hi all,
A couple of months back is has been reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
/messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has showed up that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually has the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.
One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
enter in this category, making the checkpoint skip logic more robust.
Attached is a patch implementing a solution for it, by adding in
WALInsertLock a new field that gets updated for each record to track
the LSN progress. This allows to reliably skip the generation of
standby snapshots in the bgwriter or checkpoints on an idle system.
Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set of status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- INCLUDE_ORIGIN to decide if the origin should be logged or not
- NO_PROGRESS to decide at insertion if a record should update the LSN
progress or not.
Andres mentioned me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.
I am adding that to the commit fest of September.
Regards,
--
Michael
On Thu, May 19, 2016 at 6:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am adding that to the commit fest of September.
And a lot of activity has happened here since. Attached are refreshed
patches based on da6c4f6. v11 still applies correctly but it's always
better to avoid hunks when applying them.
--
Michael
Hi,
I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(
The attached patches are rebased to the master and additional one
mentioned below.
At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>
A couple of months back is has been reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
/messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has showed up that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually has the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
enter in this category, making the checkpoint skip logic more robust.
Attached is a patch implementing a solution for it, by adding in
If I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoint and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoint as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the lastest state without
it. If it is not wrong, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.
WALInsertLock a new field that gets updated for each record to track
the LSN progress. This allows to reliably skip the generation of
standby snapshots in the bgwriter or checkpoints on an idle system.
WALInsertLock doesn't seem to me to be a good place for
progressAt even considering the discussion about adding few
instructions (containing a branch) in the
hot-section. BackgroundWriterMain blocks all running
XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
replica, though). If this is correct, the Amit's suggestion below
will have more significance, and I rather agree with it. XLogCtl
is more suitable place for progressAt for the case.
/messages/by-id/CAA4eK1LB9HDq+F8Lw8bGRQx6Sw42XaikX1UQ2DZk+AuEGbfjWA@mail.gmail.com
Amit> Taking and releasing locks again and again (which is done in patch)
Amit> matters much more than adding few instructions under it and I think
Amit> if you would have written the code in-a-way as in patch in some of
Amit> the critical path, it would have been regressed badly, but because
Amit> checkpoint doesn't happen often, reproducing regression is difficult.
/messages/by-id/CAB7nPqSO6HVJ0T6LUT84PCy+_ihitdt64Ze2D+SJrHZy72Y0wg@mail.gmail.com
Also, I think it is worth to once take the performance data for
write tests (using pgbench 30 minute run or some other way) with
minimum checkpoint_timeout (i.e 30s) to see if the additional locking
has any impact on performance. I think taking locks at intervals
of 15s or 30s should not matter much, but it is better to be safe.I don't think performance will be impacted, but there are no reasons
to not do any measurements either. I'll try to get some graphs
tomorrow with runs on my laptop, mainly looking for any effects of
this patch on TPS when checkpoints show up.
I don't think the impact is measurable on a laptop, where 2 to 4
cores available at most.
Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set of status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- INCLUDE_ORIGIN to decide if the origin should be logged or not
- NO_PROGRESS to decide at insertion if a record should update the LSN
progress or not.
Andres mentioned me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.
This looks reasonable.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 9/27/16 6:16 AM, Kyotaro HORIGUCHI wrote:
I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(The attached patches are rebased to the master and additional one
mentioned below.
I tried the attached patch set and noticed an interesting behavior.
With archive_timeout=5 whenever I made a change I would get a WAL
segment within a few seconds as expected then another one would follow a
few minutes later.
Database init:
16M Sep 27 20:05 000000010000000000000001
16M Sep 27 20:09 000000010000000000000002
Create test table:
16M Sep 27 20:13 000000010000000000000003
16M Sep 27 20:15 000000010000000000000004
Insert row into test table:
16M Sep 27 20:46 000000010000000000000005
16M Sep 27 20:49 000000010000000000000006
The cluster was completely idle with no sessions connected in between
those three commands. Is it possible this is caused by:
+ * switch segment only when any substantial progress have made from
+ * the last segment switching by timeout. Segment switching by other
+ * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
+ * matter since it just causes possible one excessive segment switch.
*/
I would like to give Michael a chance to respond to the updated patches
before delving deeper.
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 27, 2016 at 7:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(
No problem.
At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>
A couple of months back is has been reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
/messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has showed up that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually has the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
enter in this category, making the checkpoint skip logic more robust.
Attached is a patch implementing a solution for it, by adding inIf I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoint and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoint as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the lastest state without
it. If it is not wrong, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.
Possibly. That's a second problem I did not want to tackle now. I was
going to study that more precisely after this set of patches gets
done. There is already enough complication in them, and they solve a
large portion of the problem.
WALInsertLock a new field that gets updated for each record to track
the LSN progress. This allows to reliably skip the generation of
standby snapshots in the bgwriter or checkpoints on an idle system.WALInsertLock doesn't seem to me to be a good place for
progressAt even considering the discussion about adding few
instructions (containing a branch) in the
hot-section. BackgroundWriterMain blocks all running
XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
replica, though). If this is correct, the Amit's suggestion below
will have more significance, and I rather agree with it. XLogCtl
is more suitable place for progressAt for the case.
Based on my past look at the problem and memories, having a variable
in WALInsertLock allows use to not have to touch the hottest spinlock
code path in WAL insertion and PG: ReserveXLogInsertLocation(). I'd
rather still avoid that.
Also, I think it is worth to once take the performance data for
write tests (using pgbench 30 minute run or some other way) with
minimum checkpoint_timeout (i.e 30s) to see if the additional locking
has any impact on performance. I think taking locks at intervals
of 15s or 30s should not matter much, but it is better to be safe.I don't think performance will be impacted, but there are no reasons
to not do any measurements either. I'll try to get some graphs
tomorrow with runs on my laptop, mainly looking for any effects of
this patch on TPS when checkpoints show up.I don't think the impact is measurable on a laptop, where 2 to 4
cores available at most.
Yeah, I couldn't either.. Still I would like to keep the hot spinlock
section as small as possible if we can.
Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set of status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- INCLUDE_ORIGIN to decide if the origin should be logged or not
- NO_PROGRESS to decide at insertion if a record should update the LSN
progress or not.
Andres mentioned me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.This looks reasonable.
Thanks. That would be fine as a first, independent patch, but that
would mean that XLogSetFlags has only only value, which is a bit
pointless so I grouped them. And this makes the exiting interface
cleaner as well for replication origins.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:
I tried the attached patch set and noticed an interesting behavior. With
archive_timeout=5 whenever I made a change I would get a WAL segment within
a few seconds as expected then another one would follow a few minutes later.
That's intentional. We may be able to make XLOG_SWITCH records as not
updating the progress LSN, but I wanted to tackle that as a separate
patch once we got the basics done correctly, which is still what I
think this patch is doing. I should have been more precise upthread:
this patch makes the handling of checkpoint skip logic correct for
only standby snapshots, not segment switches, and puts the infra to
handle other things.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9/28/16 3:35 AM, Michael Paquier wrote:
On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:
I tried the attached patch set and noticed an interesting behavior. With
archive_timeout=5 whenever I made a change I would get a WAL segment within
a few seconds as expected then another one would follow a few minutes later.That's intentional. We may be able to make XLOG_SWITCH records as not
updating the progress LSN, but I wanted to tackle that as a separate
patch once we got the basics done correctly, which is still what I
think this patch is doing. I should have been more precise upthread:
this patch makes the handling of checkpoint skip logic correct for
only standby snapshots, not segment switches, and puts the infra to
handle other things.
OK, I've done functional testing and this patch seems to work as
specified (including the caveat noted above). Some comments:
* [PATCH 1/3] hs-checkpoints-v12-1
+++ b/src/backend/access/transam/xlog.c
+ * Taking a lock is as well necessary to prevent potential torn reads
+ * on some platforms.
How about, "Taking a lock is also necessary..."
+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
That's a lot of exclusive locks and that would seem to have performance
implications. It seems to me this is going to be a hard one to
benchmark because the regression (if any) would only be seen under heavy
load on a very large system.
In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.
+++ b/src/backend/access/transam/xloginsert.c * Should this record
include the replication origin if one is set up?
Outdated comment from XLogIncludeOrigin().
* [PATCH 2/3] hs-checkpoints-v12-2
+++ b/src/backend/postmaster/checkpointer.c
+ /* OK, it's time to switch */
+ elog(LOG, "Request XLog Switch");
LOG level seems a bit much here, perhaps DEBUG1?
* [PATCH 3/3] hs-checkpoints-v12-3
+ * switch segment only when any substantial progress have made from
+ * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
How about, "Switch segment only when substantial progress has been made
after the last segment was switched by a timeout. Segment switching for
other reasons..."
+++ b/src/backend/access/transam/xlog.c
+ elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X,
ckpt %X/%X",
+ elog(LOG, "Checkpoint is skipped");
+ elog(LOG, "snapshot taken by checkpoint %X/%X",
Same for the above, seems like it would just be noise for most users.
+++ b/src/backend/postmaster/bgwriter.c
+ elog(LOG, "snapshot taken by bgwriter %X/%X",
Ditto.
I don't see any unintended consequences in this patch but it doesn't
mean there aren't any. I'm definitely concerned by the exclusive locks
but it may turn out they do not actually represent a bottleneck.
This does seem like the kind of patch that should get committed very
early in the release cycle to allow maximum time for regression testing.
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:
OK, I've done functional testing and this patch seems to work as
specified (including the caveat noted above). Some comments:
Thanks!
* [PATCH 1/3] hs-checkpoints-v12-1
+++ b/src/backend/access/transam/xlog.c + * Taking a lock is as well necessary to prevent potential torn reads + * on some platforms.How about, "Taking a lock is also necessary..."
+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
That's a lot of exclusive locks and that would seem to have performance
implications. It seems to me this is going to be a hard one to
benchmark because the regression (if any) would only be seen under heavy
load on a very large system.In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.
Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.
+++ b/src/backend/access/transam/xloginsert.c * Should this record include the replication origin if one is set up?Outdated comment from XLogIncludeOrigin().
Fixed. I added as well some comments on top of XLogSetFlags to mention
what are the flags that can be used. I didn't think that it was
necessary to add an assertion here. Also, I noticed that the comment
on top of XLogInsertRecord mentioned those flags but was incorrect.
* [PATCH 2/3] hs-checkpoints-v12-2
+++ b/src/backend/postmaster/checkpointer.c + /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");LOG level seems a bit much here, perhaps DEBUG1?
That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.
* [PATCH 3/3] hs-checkpoints-v12-3
+ * switch segment only when any substantial progress have made from + * reasons will cause last_xlog_switch_lsn stay behind but it doesn'tHow about, "Switch segment only when substantial progress has been made
after the last segment was switched by a timeout. Segment switching for
other reasons..."+++ b/src/backend/access/transam/xlog.c + elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X", + elog(LOG, "Checkpoint is skipped"); + elog(LOG, "snapshot taken by checkpoint %X/%X",Same for the above, seems like it would just be noise for most users.
+++ b/src/backend/postmaster/bgwriter.c + elog(LOG, "snapshot taken by bgwriter %X/%X",Ditto.
The original patch was developed to ease debugging, and I chose LOG to
not be polluted with a bunch of DEBUG1 entries :)
Now we can do something, as follows:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8450,6 +8450,8 @@ CreateCheckPoint(int flags)
{
if (progress_lsn == ControlFile->checkPoint)
{
+ if (log_checkpoints)
+ ereport(LOG, "checkpoint skipped");
WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
Letting users know that the checkpoint has been skipped sounds like a
good idea. Perhaps that's better if squashed with the first patch.
I don't see any unintended consequences in this patch but it doesn't
mean there aren't any. I'm definitely concerned by the exclusive locks
but it may turn out they do not actually represent a bottleneck.
That's a hard to see a difference. Perhaps I didn't try hard enough..
Well for now attached are two patches, that could just be squashed into one.
--
Michael
On 9/28/16 10:32 PM, Michael Paquier wrote:
On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:
In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.
I don't have any better ideas than that.
+++ b/src/backend/postmaster/checkpointer.c + /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");LOG level seems a bit much here, perhaps DEBUG1?
That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.
I still see this:
+++ b/src/backend/postmaster/checkpointer.c
/* OK, it's time to switch */
+ elog(LOG, "Request XLog Switch");
Well for now attached are two patches, that could just be squashed into one.
Yes, I think that makes sense.
More importantly, there is a regression. With your new patch the xlogs
are switching on archive_timeout again even with no changes. The v12
worked fine.
The differences are all in 0002-hs-checkpoints-v12-2.patch and as far as
I can see the patch does not work correctly without these changes. Am I
missing something?
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Sorry, I might have torn off this thread somehow..
At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in <30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>
On 9/28/16 10:32 PM, Michael Paquier wrote:
On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
wrote:In general I agree with the other comments that this could end up
being
a problem. On the other hand, since the additional locks are only
taken
at checkpoint or archive_timeout it may not be that big a deal.Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.I don't have any better ideas than that.
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when
NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
rather alleviates the impact. But it actually doesn't seem so
harmful up to 8. (Even though I don't like the locking in
GetProgressRecPtr..)
Currently possiblly harmful calling of GetProgressRecPtr is that
in BackgroundWriterMain. It should be called with ther interval
BgWriterDelay, and anytime pgwriter recieved SIGUSR1. But I don't
see the issuer of SIGUSR1 of bgwriter..
+++ b/src/backend/postmaster/checkpointer.c + /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");LOG level seems a bit much here, perhaps DEBUG1?
That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.I still see this:
+++ b/src/backend/postmaster/checkpointer.c /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");Well for now attached are two patches, that could just be squashed
into one.
Mmmm. Sorry, this was for my quite private instant debug, spilt
outside.. But I don't mind to leave it with DEBUG2 if it seems
useful.
Yes, I think that makes sense.
More importantly, there is a regression. With your new patch the
xlogs are switching on archive_timeout again even with no changes.
The v12 worked fine.
As Michael mentioned in this or another thread, it is another
issue that he wants to solve separately. I personally doubt that
this patch (v11 and v13) can be evaluated alone without it, but
we can review this with the excessive switching problem, perhaps?
The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
as I can see the patch does not work correctly without these changes.
Am I missing something?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Sorry, it wrote wrong thing.
At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
Sorry, I might have torn off this thread somehow..
At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in <30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>
On 9/28/16 10:32 PM, Michael Paquier wrote:
On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
wrote:In general I agree with the other comments that this could end up
being
a problem. On the other hand, since the additional locks are only
taken
at checkpoint or archive_timeout it may not be that big a deal.Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.I don't have any better ideas than that.
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when
But I suspect that GetProgressRecPtr could be harmful.
NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
rather alleviates the impact. But it actually doesn't seem so
harmful up to 8. (Even though I don't like the locking in
GetProgressRecPtr..)Currently possiblly harmful calling of GetProgressRecPtr is that
in BackgroundWriterMain. It should be called with ther interval
BgWriterDelay, and anytime pgwriter recieved SIGUSR1. But I don't
see the issuer of SIGUSR1 of bgwriter..+++ b/src/backend/postmaster/checkpointer.c + /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");LOG level seems a bit much here, perhaps DEBUG1?
That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.I still see this:
+++ b/src/backend/postmaster/checkpointer.c /* OK, it's time to switch */ + elog(LOG, "Request XLog Switch");Well for now attached are two patches, that could just be squashed
into one.Mmmm. Sorry, this was for my quite private instant debug, spilt
outside.. But I don't mind to leave it with DEBUG2 if it seems
useful.Yes, I think that makes sense.
More importantly, there is a regression. With your new patch the
xlogs are switching on archive_timeout again even with no changes.
The v12 worked fine.As Michael mentioned in this or another thread, it is another
issue that he wants to solve separately. I personally doubt that
this patch (v11 and v13) can be evaluated alone without it, but
we can review this with the excessive switching problem, perhaps?The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
as I can see the patch does not work correctly without these changes.
Am I missing something?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially whenBut I suspect that GetProgressRecPtr could be harmful.
Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially whenBut I suspect that GetProgressRecPtr could be harmful.
Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..
Note: I am moving this patch to next CF.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
(Squashing replies)
On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially whenBut I suspect that GetProgressRecPtr could be harmful.
Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..Note: I am moving this patch to next CF.
And I am back on it more seriously... And I am taking back what I said upthread.
I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.
There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.
I have also updated the description of archive_timeout that increasing
checkpoint_timeout would reduce unnecessary checkpoints on a idle
system. With this patch, on an idle system checkpoints are just
skipped as they should.
How does that look?
--
Michael
Attachments:
hs-checkpoints-v14.patchapplication/x-download; name=hs-checkpoints-v14.patchDownload+198-63
Thanks for merging. It still applies on the current master with
some displacements.
At Wed, 5 Oct 2016 15:18:53 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT4U=OSOLXuFuxMonmfdQFmd5F_0DmKoddvjG-HHWQaBA@mail.gmail.com>
(Squashing replies)
On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially whenBut I suspect that GetProgressRecPtr could be harmful.
Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..Note: I am moving this patch to next CF.
And I am back on it more seriously... And I am taking back what I said upthread.
I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?
I considered two alternatives for updating logic of progressAt
more seriously. One is, as Amit suggested, replacing progressAt
within the SpinLock section in
ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
progressAt. The attached two patches rouhgly implement the aboves
respectively. (But I've not tested them. The patches are to show
the two alternatives concretely.)
I found that the former reuiqres to take insertpos_lck also on
reading. I have to admit that this is too bad. (Even I saw no
degradation by pgbench on my poor environment. It marks 118tr/s
by 10 threads and that doesn't seem make any stress on xlog
logics...)
For the latter, it is free from locks and doesn't reduce parallel
degree but I'm not sure it is proper to use it there and I'm not
sure about actual overheads. In the worst case, it makes another
SpinLock section for every call to pg_atmoic_* functions.
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.I have also updated the description of archive_timeout that increasing
checkpoint_timeout would reduce unnecessary checkpoints on a idle
system. With this patch, on an idle system checkpoints are just
skipped as they should.How does that look?
All except the above point looks good for me. Maybe it is better
that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?
Be my guest :)
I considered two alternatives for updating logic of progressAt
more seriously. One is, as Amit suggested, replacing progressAt
within the SpinLock section in
ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
progressAt. The attached two patches rouhgly implement the aboves
respectively. (But I've not tested them. The patches are to show
the two alternatives concretely.)
Okay.
I found that the former requires to take insertpos_lck also on
reading. I have to admit that this is too bad. (Even I saw no
degradation by pgbench on my poor environment. It marks 118tr/s
by 10 threads and that doesn't seem make any stress on xlog
logics...)
Interesting...
For the latter, it is free from locks and doesn't reduce parallel
degree but I'm not sure it is proper to use it there and I'm not
sure about actual overheads. In the worst case, it makes another
SpinLock section for every call to pg_atmoic_* functions.
The WAL insert slots have been introduced in 9.4, and the PG atomics
in 9.5, so perhaps the first implementation of the WAL insert slots
would have used it. Still that looks quite promising. At the same time
we may be able to do something for insertingAt to make the locks held
more consistent, and just remove WALInsertLocks, even if this makes me
wonder about torn reads and about how we may break things if we rely
on something else than LW_EXCLUSIVE compared to now. To keep things
more simple I' would still favor using WALInsertLocks for this patch,
that looks more consistent, and also because there is no noticeable
difference.
All except the above point looks good for me. Maybe it is better
that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.
I have kept that grouped bas XLOG_INCLUDE_ORIGIN is the only
XLogInsert flag present on HEAD. Could the patch be marked as "ready
for committer" then?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello,
At Tue, 8 Nov 2016 14:45:59 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT-VW5gRbUJwQusmgiu2MKpZSCV-XdrHv84w8FZa286KQ@mail.gmail.com>
On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?Be my guest :)
I considered two alternatives for updating logic of progressAt
more seriously. One is, as Amit suggested, replacing progressAt
within the SpinLock section in
ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
progressAt. The attached two patches rouhgly implement the aboves
respectively. (But I've not tested them. The patches are to show
the two alternatives concretely.)Okay.
I found that the former requires to take insertpos_lck also on
reading. I have to admit that this is too bad. (Even I saw no
degradation by pgbench on my poor environment. It marks 118tr/s
by 10 threads and that doesn't seem make any stress on xlog
logics...)Interesting...
For the latter, it is free from locks and doesn't reduce parallel
degree but I'm not sure it is proper to use it there and I'm not
sure about actual overheads. In the worst case, it makes another
SpinLock section for every call to pg_atmoic_* functions.The WAL insert slots have been introduced in 9.4, and the PG atomics
in 9.5, so perhaps the first implementation of the WAL insert slots
would have used it. Still that looks quite promising. At the same time
we may be able to do something for insertingAt to make the locks held
more consistent, and just remove WALInsertLocks, even if this makes me
wonder about torn reads and about how we may break things if we rely
If I understand you correctly, atomics prevents torn reads by its
definition on cache management and bus arbitration level. It
should be faster than LWlocks but as I said in the previous mail,
on some platforms, if any, it will fallbacks to individual
spinlocks. (atomics.c)
on something else than LW_EXCLUSIVE compared to now. To keep things
more simple I' would still favor using WALInsertLocks for this patch,
that looks more consistent, and also because there is no noticeable
difference.
Ok, the patch looks fine. So there's nothing for me but to accept
the current shape since the discussion about performance seems
not to be settled with out performance measurement with machines
with many cores.
All except the above point looks good for me. Maybe it is better
that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.I have kept that grouped bas XLOG_INCLUDE_ORIGIN is the only
XLogInsert flag present on HEAD. Could the patch be marked as "ready
for committer" then?
Ok, that is not so siginificant point. Well, I'd like to wait for
a couple of days for anyone wants to comment, then mark this
'ready for committer'.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/5/16 7:18 AM, Michael Paquier wrote:
Note: I am moving this patch to next CF.
And I am back on it more seriously... And I am taking back what I said upthread.
I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.I have also updated the description of archive_timeout that increasing
checkpoint_timeout would reduce unnecessary checkpoints on a idle
system. With this patch, on an idle system checkpoints are just
skipped as they should.How does that look?
This looks much better now and exhibits exactly the behavior that I
expect.
In theory it would be nice if the checkpoint records did not cause
rotation, but this can be mitigated in the way you have described and
perhaps for safety's sake it's best.
I had a bit of trouble parsing this paragraph:
+ /*
+ * Update the progress LSN positions. At least one WAL insertion lock
+ * is already taken appropriately before doing that, and it is just more
+ * simple to do that here where WAL record data and type is at hand.
+ * The progress is set at the start position of the record tracked that
+ * is being added, making easier checkpoint progress tracking as the
+ * control file already saves the start LSN position of the last
+ * checkpoint run. If an exclusive lock is taken for WAL insertion,
+ * there is actually no need to update all the progression fields, so
So I did a little reworking:
Update the LSN progress positions. At least one WAL insertion lock is
already taken appropriately before doing that, and it is simpler to do
that here when the WAL record data and type are at hand. Progress is set
at the start position of the tracked record that is being added, making
checkpoint progress tracking easier as the control file already saves
the start LSN position of the last checkpoint. If an exclusive lock is
taken for WAL insertion there is no need to update all the progress
fields, only the first one.
If that still says what you think it should, then I believe it is
clearer. Also:
+ * last time a segment has switched because of a timeout. Segment
+ * switching because of other reasons, like manual trigerring of
typo, should be "triggering".
I don't see any further issues with this patch unless there are
performance concerns about the locks taken in GetProgressRecPtr(). The
locks seem reasonable to me but I'd like to see this committed so
there's plenty of time to detect any regression before 10.0.
As such, my vote is to mark this "Ready for Committer." I'm fine with
waiting a few days as Kyotaro suggested, or we can consider my review
"additional comments" and do it now.
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Nov 8, 2016 at 9:32 PM, David Steele <david@pgmasters.net> wrote:
I had a bit of trouble parsing this paragraph:
[...]
So I did a little reworking:
[...]
If that still says what you think it should, then I believe it is clearer.
Thanks! I have included your suggestion.
Also:
+ * last time a segment has switched because of a timeout. Segment + * switching because of other reasons, like manual trigerring oftypo, should be "triggering".
Right.
I don't see any further issues with this patch unless there are performance
concerns about the locks taken in GetProgressRecPtr(). The locks seem
reasonable to me but I'd like to see this committed so there's plenty of
time to detect any regression before 10.0.As such, my vote is to mark this "Ready for Committer." I'm fine with
waiting a few days as Kyotaro suggested, or we can consider my review
"additional comments" and do it now.
Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.
--
Michael
Attachments:
hs-checkpoints-v15.patchtext/plain; charset=US-ASCII; name=hs-checkpoints-v15.patchDownload+197-63
Michael,
* Michael Paquier (michael.paquier@gmail.com) wrote:
Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.
I figured I'd go ahead and start looking into this (and it's pretty easy
for me to discuss it with David, given he works in the same office ;).
A couple initial comments:
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index adab2f8..38c2385 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2826,12 +2826,9 @@ include_dir 'conf.d' parameter is greater than zero, the server will switch to a new segment file whenever this many seconds have elapsed since the last segment file switch, and there has been any database activity, - including a single checkpoint. (Increasing - <varname>checkpoint_timeout</> will reduce unnecessary - checkpoints on an idle system.) - Note that archived files that are closed early - due to a forced switch are still the same length as completely full - files. Therefore, it is unwise to use a very short + including a single checkpoint. Note that archived files that are + closed early due to a forced switch are still the same length as + completely full files. Therefore, it is unwise to use a very short <varname>archive_timeout</> — it will bloat your archive storage. <varname>archive_timeout</> settings of a minute or so are usually reasonable. You should consider using streaming replication,
We should probably include in here that we may skip a checkpoint if no
activity has happened, meaning that this is a safe setting to set for
environments which are idle for long periods (I'm thinking embedded
systems here).
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
+ if (log_checkpoints) + ereport(LOG, (errmsg("checkpoint skipped")));
Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.
Thanks!
Stephen