Can we make PostgreSQL's on-schedule checkpoint spread its buffer writes smoothly by tuning IsCheckpointOnSchedule?
PostgreSQL (<= 9.4) tries to spread checkpoint buffer writes smoothly within
the checkpoint_completion_target fraction (of checkpoint_timeout or
checkpoint_segments), but when we use synchronous_commit = off there is a
problem with the checkpoint_segments target: xlog is written very fast at the
start of the cycle (because of the full-page write logged for the first
change to each page after a checkpoint), so the checkpointer can't sleep and
the buffer writes are not smooth.
Here is a test:
# stap -DMAXSKIPPED=100000 -v 11111 -e '
global s_var, e_var, stat_var;
/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}
/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}
/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}
/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}
probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n",
         $NBuffers, $num_to_write)
}
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}
probe timer.s(1) {
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) > 0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n",
             v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]),
             @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
  printf("----------------------------------end-----------------------------\n")
  delete s_var
  delete e_var
  delete stat_var
}'
Use the test table and data:
create table tbl(id int primary key, info text, crt_time timestamp);
insert into tbl select generate_series(1,50000000), now()::text, now();
Use pgbench to test it:
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000
When an on-schedule checkpoint occurs, the tps:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524
The parameters for the on-schedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
stap's output:
There are 156467 dirty blocks. From the per-second write counts below we can
see that the buffer writes are not smooth across the time target, but follow
the xlog target instead. The ideal smooth rate would be
156467 / (5 * 60 * 0.9) = 579.5 writes per second.
checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216
......
I think we can add a condition to IsCheckpointOnSchedule:

    if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
    {
        recptr = GetInsertRecPtr();
        elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) /
                         XLogSegSize) / CheckPointSegments;

        if (progress < elapsed_xlogs)
        {
            ckpt_cached_elapsed = elapsed_xlogs;
            return false;
        }
    }
# vi src/backend/postmaster/checkpointer.c

#include "access/xact.h"

/*
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
 * in time?
 *
 * Compares the current progress against the time/segments elapsed since last
 * checkpoint, and returns true if the progress we've made this far is greater
 * than the elapsed time/segments.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
    XLogRecPtr  recptr;
    struct timeval now;
    double      elapsed_xlogs,
                elapsed_time;

    Assert(ckpt_active);

    /* Scale progress according to checkpoint_completion_target. */
    progress *= CheckPointCompletionTarget;

    /*
     * Check against the cached value first. Only do the more expensive
     * calculations once we reach the target previously calculated. Since
     * neither time or WAL insert pointer moves backwards, a freshly
     * calculated value can only be greater than or equal to the cached value.
     */
    if (progress < ckpt_cached_elapsed)
        return false;

    /*
     * Check progress against WAL segments written and checkpoint_segments.
     *
     * We compare the current WAL insert location against the location
     * computed before calling CreateCheckPoint. The code in XLogInsert that
     * actually triggers a checkpoint when checkpoint_segments is exceeded
     * compares against RedoRecptr, so this is not completely accurate.
     * However, it's good enough for our purposes, we're only calculating an
     * estimate anyway.
     */
    if (!RecoveryInProgress())
    {
        if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
        {
            recptr = GetInsertRecPtr();
            elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) /
                             XLogSegSize) / CheckPointSegments;

            if (progress < elapsed_xlogs)
            {
                ckpt_cached_elapsed = elapsed_xlogs;
                return false;
            }
        }
    }

    /*
     * Check progress against time elapsed and checkpoint_timeout.
     */
    gettimeofday(&now, NULL);
    elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                    now.tv_usec / 1000000.0) / CheckPointTimeout;

    if (progress < elapsed_time)
    {
        ckpt_cached_elapsed = elapsed_time;
        return false;
    }

    /* It looks like we're on schedule. */
    return true;
}
# gmake && gmake install
$ pg_ctl restart -m fast
Test again:
progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309
This time we can see the checkpointer writing buffers smoothly, spread over the time period:
checkpoint start
----------------------------------end-----------------------------
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545
But there is still a small problem: when PostgreSQL writes enough xlog to
reach checkpoint_segments earlier than checkpoint_timeout, the next
checkpoint will start soon afterwards, so we must tune checkpoint_segments
higher when checkpoints fall in busy periods.
Regards,
Digoal
--
Public welfare is a lifelong commitment. I'm Digoal, Just Do It.
On 05/12/2015 03:27 AM, digoal zhou wrote:
> PostgreSQL (<= 9.4) tries to spread checkpoint buffer writes smoothly within
> the checkpoint_completion_target fraction (of checkpoint_timeout or
> checkpoint_segments), but when we use synchronous_commit = off there is a
> problem with the checkpoint_segments target: xlog is written very fast at the
> start of the cycle (because of the full-page write logged for the first
> change to each page after a checkpoint), so the checkpointer can't sleep and
> the buffer writes are not smooth.
> ...
> I think we can add a condition to IsCheckpointOnSchedule:
>
>     if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
>     {
>         recptr = GetInsertRecPtr();
>         elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) /
>                          XLogSegSize) / CheckPointSegments;
>
>         if (progress < elapsed_xlogs)
>         {
>             ckpt_cached_elapsed = elapsed_xlogs;
>             return false;
>         }
>     }
This has nothing to do with synchronous_commit, except that setting
synchronous_commit=off makes your test case run faster, and hit the
problem harder.
I think the real problem here is that IsCheckpointOnSchedule assumes
that the rate of WAL generated is constant throughout the checkpoint
cycle, but in reality you generate a lot more WAL immediately after the
checkpoint begins, thanks to full_page_writes. For example, in the
beginning of the cycle, you quickly use up, say, 20% of the WAL space in
the first 10 seconds, and the scheduling thinks it's in a lot of hurry
to finish the checkpoint because it extrapolates that the rest of the
WAL will be used up in the next 40 seconds. But in reality, the WAL
consumption levels off, and you have many minutes left until
CheckPointSegments.
Can you try the attached patch? It modifies the above calculation to
take the full-page-write effect into account. I used X^1.5 as the
corrective function, which roughly reflects the typical WAL consumption
pattern. You can adjust the exponent, 1.5, to make the correction more
or less aggressive.
- Heikki
Attachments:
compensate-fpw-effect-on-checkpoint-scheduling-1.patch (application/x-patch, +13 -0)
(please keep the mailing list CC'd, and please don't top-post)
On 05/13/2015 05:00 AM, digoal zhou wrote:
> I tested it, but the exponent is not perfect in every environment.
> Why can't we use time only?
As you mentioned yourself earlier, if you only use time but you reach
checkpoint_segments before checkpoint_timeout, you will not complete the
checkpoint until you'd already need to begin the next checkpoint. You
can't completely ignore checkpoint_segments.
Comparing the numbers you give below with
compensate-fpw-effect-on-checkpoint-scheduling-1.patch, with the ones
from your first post, it looks like the patch already made the situation
much better. You still have a significant burst in the beginning of the
checkpoint cycle, but it's a lot smaller than without the patch. Before
the patch, the "count" topped at 9078, and below it topped at 2964.
There is a strange "lull" after the burst, I'm not sure what's going on
there, but overall it seems like a big improvement.
Did the patch alleviate the bump in latency that pgbench reports?
I put the "count" numbers from your original post and below into a
spreadsheet, and created some fancy charts. See attached. It shows the
same thing but with pretty pictures. Assuming we want the checkpoint to
be spread as evenly as possible across the cycle, the ideal would be a
straight line from 0 to about 150000 in 270 seconds in the cumulative
chart. You didn't give the full data, but you can extrapolate the lines
to get a rough picture of how close the different versions are from that
ideal.
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
I don't want to over-compensate for the full-page-write effect either,
because there are also applications where that effect isn't so big. For
example, an application that performs a lot of updates, but all the
updates are on a small number of pages, so the full-page-write storm
immediately after checkpoint doesn't last long. A worst case for this
patch would be such an application - lots of updates on only a few pages
- with a long checkpoint_timeoout but relatively small
checkpoint_segments, so that checkpoints are always driven by
checkpoint_segments. I'd like to see some benchmarking of that worst
case before committing anything like this.
----------------------------------end-----------------------------
checkpoint start
buffer__sync__start num_buffers: 524288, dirty_buffers: 156931
r1_or_w2 2, pid: 29132, min: 44, max: 151, avg: 52, sum: 49387, count: 932
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 95, avg: 49, sum: 41532, count: 837
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 747, avg: 54, sum: 100419, count: 1849
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 372, avg: 52, sum: 110701, count: 2090
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 115, avg: 57, sum: 147510, count: 2575
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 470, avg: 58, sum: 145217, count: 2476
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 120, avg: 54, sum: 161401, count: 2964
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 208, avg: 59, sum: 170280, count: 2847
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 10089, avg: 62, sum: 136106, count: 2181
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 41, max: 487, avg: 56, sum: 88990, count: 1570
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 39, max: 102, avg: 55, sum: 59807, count: 1083
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 557, avg: 56, sum: 117274, count: 2083
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 537, avg: 58, sum: 169867, count: 2882
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 147, avg: 60, sum: 92835, count: 1538
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 55, sum: 14641, count: 264
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 92, avg: 56, sum: 11834, count: 210
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 91, avg: 56, sum: 9151, count: 162
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 8621, count: 151
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 57, sum: 7962, count: 139
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 93, avg: 58, sum: 7194, count: 123
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 58, sum: 7143, count: 123
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 57, sum: 6801, count: 118
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 58, sum: 6818, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 57, sum: 6982, count: 121
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 6459, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 88, avg: 58, sum: 7022, count: 121
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 47, max: 94, avg: 57, sum: 5952, count: 104
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 49, max: 95, avg: 57, sum: 6871, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 49, max: 85, avg: 58, sum: 6829, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 89, avg: 57, sum: 6851, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 57, sum: 6779, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 93, avg: 55, sum: 6502, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 47, max: 98, avg: 58, sum: 6805, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 57, sum: 6771, count: 118
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 56, sum: 6593, count: 116
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 57, sum: 6809, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 100, avg: 57, sum: 6171, count: 107
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 105, avg: 57, sum: 6801, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 95, avg: 57, sum: 6792, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6693, count: 118
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 57, sum: 6878, count: 120
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6664, count: 117
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 57, sum: 7051, count: 123
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 6957, count: 120
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6842, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 100, avg: 57, sum: 6865, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 49, max: 102, avg: 58, sum: 6915, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6187, count: 107
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 50, max: 86, avg: 58, sum: 6957, count: 119
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 97, avg: 55, sum: 33636, count: 609
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 55, sum: 34180, count: 620
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 29, max: 92, avg: 53, sum: 36569, count: 680
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 91, avg: 54, sum: 37374, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 86, avg: 54, sum: 33347, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 54, sum: 37603, count: 684
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 93, avg: 55, sum: 33777, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 104, avg: 55, sum: 37566, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 92, avg: 54, sum: 37037, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 42, max: 106, avg: 57, sum: 35181, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 99, avg: 54, sum: 36981, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 53, sum: 33202, count: 622
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 89, avg: 54, sum: 36825, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 88, avg: 53, sum: 33917, count: 635
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 55, sum: 36234, count: 658
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 99, avg: 55, sum: 37719, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 54, sum: 33491, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 37365, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 101, avg: 54, sum: 33481, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 37, max: 93, avg: 54, sum: 37102, count: 685
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 87, avg: 54, sum: 36968, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 84, avg: 54, sum: 33565, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 54, sum: 37271, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 48, max: 96, avg: 55, sum: 34272, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 85, avg: 54, sum: 37378, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 107, avg: 53, sum: 36715, count: 680
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 54, sum: 33620, count: 616
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 29, max: 94, avg: 54, sum: 37093, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 39, max: 110, avg: 53, sum: 33013, count: 612
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 97, avg: 54, sum: 37215, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 37, max: 90, avg: 54, sum: 37240, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 41, max: 95, avg: 54, sum: 33555, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 38, max: 89, avg: 54, sum: 37503, count: 683
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 38, max: 95, avg: 55, sum: 33803, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 89, avg: 56, sum: 38403, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 92, avg: 54, sum: 37354, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 55, sum: 33881, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 34, max: 91, avg: 54, sum: 37047, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 85, avg: 53, sum: 33003, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 92, avg: 53, sum: 36854, count: 683
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 92, avg: 54, sum: 36597, count: 673
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33689, count: 620
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 92, avg: 54, sum: 37194, count: 684
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 53, sum: 32813, count: 612
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 100, avg: 54, sum: 37485, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 54, sum: 33294, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 94, avg: 54, sum: 37320, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 28, max: 92, avg: 54, sum: 37067, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 87, avg: 54, sum: 33766, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 110, avg: 53, sum: 36220, count: 680
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 98, avg: 54, sum: 33442, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 97, avg: 55, sum: 37692, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 54, sum: 37073, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 88, avg: 54, sum: 33676, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 103, avg: 53, sum: 36770, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33447, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 91, avg: 55, sum: 37643, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 54, sum: 37377, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 103, avg: 56, sum: 34531, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 121, avg: 54, sum: 37412, count: 683
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 40, max: 89, avg: 54, sum: 33173, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 31, max: 94, avg: 54, sum: 37385, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 28, max: 106, avg: 55, sum: 38132, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 34, max: 96, avg: 55, sum: 33800, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 37, max: 98, avg: 56, sum: 38305, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 28, max: 104, avg: 55, sum: 33744, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 103, avg: 54, sum: 36923, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 36, max: 89, avg: 55, sum: 37797, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 103, avg: 56, sum: 34902, count: 620
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 38025, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 43, max: 102, avg: 56, sum: 34545, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 34, max: 94, avg: 55, sum: 37756, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 54, sum: 33530, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 55, sum: 37992, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 99, avg: 55, sum: 37923, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 39, max: 101, avg: 55, sum: 34027, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 53, sum: 36078, count: 680
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 41, max: 89, avg: 51, sum: 31563, count: 612
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 31, max: 92, avg: 52, sum: 35596, count: 680
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 102, avg: 55, sum: 37816, count: 685
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 45, max: 102, avg: 55, sum: 33828, count: 613
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 54, sum: 37285, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 90, avg: 55, sum: 34037, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 86, avg: 54, sum: 37584, count: 684
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 32, max: 103, avg: 55, sum: 37946, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34556, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 33, max: 99, avg: 56, sum: 38213, count: 681
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34613, count: 614
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 55, sum: 37925, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 35, max: 93, avg: 55, sum: 35504, count: 639
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 46, max: 90, avg: 55, sum: 36459, count: 655
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 30, max: 97, avg: 54, sum: 37369, count: 682
----------------------------------end-----------------------------
r1_or_w2 2, pid: 29132, min: 31, max: 93, avg: 54, sum: 33161, count: 612
----------------------------------end-----------------------------
- Heikki
Attachments:
checkpoint-progress-charts.ods (application/vnd.oasis.opendocument.spreadsheet)
This needs more performance testing.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
This needs more performance testing.
Definitely. I may do that some day. However I'm not sure that this is
currently the main issue in the checkpointer.
--
Fabien.
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
Agreed
I don't want to over-compensate for the full-page-write effect either,
because there are also applications where that effect isn't so big. For
example, an application that performs a lot of updates, but all the updates
are on a small number of pages, so the full-page-write storm immediately
after checkpoint doesn't last long. A worst case for this patch would be
such an application - lots of updates on only a few pages - with a long
checkpoint_timeoout but relatively small checkpoint_segments, so that
checkpoints are always driven by checkpoint_segments. I'd like to see some
benchmarking of that worst case before committing anything like this.
We could do better, but that is not a reason not to commit this, as is.
Commit, please.
This TODO item has been in place for a while and still remains: reduce the
impact of full-page writes.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
Agreed
Do we want to consider whether wal_compression is enabled, as that
can reduce the effect of full_page_writes?
Also, I am planning to run some tests for this patch, but I am not sure
whether tps and/or latency numbers from pgbench are sufficient, or do you
want to see actual read/write counts via some form of
dynamic tracing (stap), as done by the reporter of this issue?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Simon,
We could do better, but that is not a reason not to commit this, as is.
Commit, please.
My 0,02€: Please do not commit without further testing...
I've submitted a patch to improve checkpoint write scheduling, including
X00 hours of performance tests on various cases. This patch significantly
changes the load distribution over the whole checkpoint, and AFAICS it
has been tested only on rather small cases.
I'm not sure that the power 1.5 is the right one for all cases. For a big
checkpoint over 30 minutes, it may or may not have very large and possibly
unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what
I really think is that it needs performance measurements.
In conclusion, and very egoistically, I would prefer if this patch could
wait for the checkpoint scheduling patch to be considered, as it would
basically invalidate the X00 hours of performance tests I ran:-)
--
Fabien.
On 3 July 2015 at 06:38, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Simon,
We could do better, but that is not a reason not to commit this, as is.
Commit, please.
My 0,02€: Please do not commit without further testing...
I've submitted a patch to improve checkpoint write scheduling, including
X00 hours of performance test on various cases. This patch changes
significantly the load distribution over the whole checkpoint, and AFAICS
has been tested on rather small cases.
I'm not sure that the power 1.5 is the right one for all cases. For a big
checkpoint over 30 minutes, it may have, or not, very large and possibly
unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I
really think is that it needs performance measures.
Power 1.5 is almost certainly not right for all cases, but it is simple and
better. And it is easy to remove if something even better arrives.
I don't see the two patches being in conflict.
In conclusion, and very egoistically, I would prefer if this patch could
wait for the checkpoint scheduling patch to be considered, as it would
basically invalidate the X00 hours of performance tests I ran:-)
I recommend making peace with yourself that probably 50% of development
time is wasted. But we try to keep the best half.
Thank you for your time spent contributing.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-03 07:38:15 +0200, Fabien COELHO wrote:
I've submitted a patch to improve checkpoint write scheduling, including X00
hours of performance test on various cases. This patch changes significantly
the load distribution over the whole checkpoint, and AFAICS has been tested
on rather small cases.
I'm not sure that the power 1.5 is the right one for all cases. For a big
checkpoint over 30 minutes, it may have, or not, very large and possibly
unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I
really think is that it needs performance measures.
In conclusion, and very egoistically, I would prefer if this patch could
wait for the checkpoint scheduling patch to be considered, as it would
basically invalidate the X00 hours of performance tests I ran:-)
These two patches target pretty independent mechanics. If your patch were
significantly influenced by this, something would be wrong. It might
decrease the benefit of your patch a mite, but that's not really a
problem.
On Thu, Jul 2, 2015 at 4:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
Agreed
Do we want to consider if wal_compression is enabled as that
can reduce the effect of full_page_writes?
Also I am planning to run some tests for this patch, but not sure
if tps and or latency numbers by pgbench are sufficient or do you
people want to see actual read/write count via some form of
dynamic tracing (stap) as done by the reporter of this issue?
I have run some tests with this patch, and the detailed results of the runs
are attached with this mail. I thought the patch should show a difference
if I keep max_wal_size at a somewhat low or moderate value, so that
checkpoints get triggered due to WAL size, but I am not seeing any major
difference in the spreading of writes. Below is the configuration I used
to run the tests:
synchronous_commit=off
checkpoint_timeout = 5min
max_wal_size=1GB
min_wal_size=80MB
checkpoint_completion_target=0.9
shared_buffers=8GB
Test (using almost the same testing mechanism as reported in the original mail)
------------------------------------------------------------------------------------------------
1.
Use the test table and data:
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();
2.
Use pgbench to test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;
3.
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 300
In another session, I ran the SystemTap script:
stap pgbenchrun.stp -DMAXSKIPPED=100000 -o pgbenchrun.out
(pgbenchrun.stp is attached with this mail.)
At configure time, you need to use --enable-dtrace, and you need a debug
build if you want to capture results using the SystemTap script. SystemTap
also needs to be installed on your system.
I should have run more tests to see in which cases this patch makes a
difference, but I thought it better to first discuss the current results
and see what other kinds of tests you want run for this patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Power 1.5 is almost certainly not right for all cases, but it is simple
and better.
It is better "in some cases", as I've been told about my own patch. If you
have a separate disk for WAL writes, the power formula may just degrade
performance, or maybe not, or not too much, or it really should be a guc.
Well, I just think that it needs more performance testing with various
loads and sizes, really. I'm not against this patch at all.
And easy to remove if something even better arrives.
I don't see the two patches being in conflict.
They are not "in conflict" from a git point of view, and even if they were,
it would be trivial to resolve.
They are in conflict in that the patch changes the checkpoint load
significantly, which would mean that my X00 hours of performance testing
on the checkpoint scheduler should more or less be run again. Ok, it is
somewhat egoistic, but I'm trying to avoid wasting people's time.
Another point is that I'm not sure I understand the decision process: for
some patches in some area extensive performance tests are required, while
for other patches in the same area they are not.
--
Fabien.
Hello Andres,
In conclusion, and very egoistically, I would prefer if this patch could
wait for the checkpoint scheduling patch to be considered, as it would
basically invalidate the X00 hours of performance tests I ran :-)
These two patches target pretty independent mechanics. If your patch were
significantly influenced by this something would be wrong. It might
decrease the benefit of your patch a mite, but that's not really a
problem.
That is not the issue I see. On the principle of performance testing it
really means that I should rerun the tests, even if I expect that the
overall influence would be pretty small in this case. This is my egoistic
argument. Well, probably I would just rerun a few cases to check that the
impact is "a mite", as you said, rather than all cases.
Another point is that I'm not sure that this patch is ripe, in particular
I'm skeptical about the hardcoded 1.5 without further testing. Maybe it is
good, maybe 1.3 or 1.6 is better, maybe it depends and it should just be a
guc with some advice about how to set it. So I really think that it needs
more performance figures than "it has a positive effect on one load".
Well, this is just my opinion, no need to care too much about it:-)
--
Fabien.
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
I've looked at the maths.
I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) at around 40% progress, and ends up at 1.5,
i.e. the finishing load is 1.5 times the average load, just before fsyncing
files. This looks like a recipe for a bad time: I would say this is too
large an overload. I would suggest a much lower value, say around 1.1...
The other issue with this function is that, if someone has WAL on a
different disk, it can only degrade performance by disrupting the write
distribution. As I understand it, this thing only makes sense if the
WAL and the data are on the same disk. This really suggests a GUC.
I have run some tests with this patch and the detailed results of the
runs are attached with this mail.
I really do not understand the aggregated figures in the files attached.
I guess that maybe between "end" markers there is a summary of figures
collected for 28 backends over 300-second runs (?), but I do not know what
the min/max/avg/sum/count figures are about.
I thought the patch should show difference if I keep max_wal_size to
somewhat lower or moderate value so that checkpoint should get triggered
due to wal size, but I am not seeing any major difference in the writes
spreading.
I'm not sure I understand your point. I would say that at full speed
pgbench the disk is always busy writing as much as possible, either
checkpoint writes or wal writes, so the write load as such should not be
that different anyway?
I understood that the point of the patch is to check whether there is a
tps dip or not when the checkpoint begins, but I'm not sure how this can
be inferred from the many aggregated data you sent, and from my recent
tests the tps is very variable anyway on HDD.
--
Fabien.
On Jul 4, 2015, at 11:34 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
I've looked at the maths.
I think that the load is distributed as the derivative of this function, that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it passes 1.0 (the average load) at around 40% progress, and ends up at 1.5, i.e. the finishing load is 1.5 times the average load, just before fsyncing files. This looks like a recipe for a bad time: I would say this is too large an overload. I would suggest a much lower value, say around 1.1...
The other issue with this function is that, if someone has WAL on a different disk, it can only degrade performance by disrupting the write distribution. As I understand it, this thing only makes sense if the WAL and the data are on the same disk. This really suggests a GUC.
I am a bit skeptical about this. We need test scenarios that clearly show the benefit of having and of not having this behavior. It might be that doing this always is fine for everyone.
...Robert
Hello Robert,
I've looked at the maths.
I think that the load is distributed as the derivative of this
function, that is (1.5 * x ** 0.5): it starts at 0 but very quickly
reaches 0.5, it passes 1.0 (the average load) at around 40% progress,
and ends up at 1.5, i.e. the finishing load is 1.5 times the average
load, just before fsyncing files. This looks like a recipe for a bad
time: I would say this is too large an overload. I would suggest a much
lower value, say around 1.1...
The other issue with this function is that, if someone has WAL on
a different disk, it can only degrade performance by disrupting the
write distribution. As I understand it, this thing only makes sense if
the WAL and the data are on the same disk. This really suggests a GUC.
I am a bit skeptical about this. We need test scenarios that clearly
show the benefit of having and of not having this behavior. It might be
that doing this always is fine for everyone.
Do you mean I have to prove that there is an actual problem induced by
this patch?
The logic fails me: I thought the patch submitter would have to show that
his/her patch did not harm performance in various reasonable cases. At
least this is what I'm told in another thread :-)
Currently this patch heavily changes the checkpoint write load
distribution in many cases, with a proof which consists of showing that it
may improve tps *briefly* on *one* example, as far as I understood the
issue and the tests. If this is enough proof to apply the patch, then the
minimum is that it should be possible to deactivate it, hence a guc.
Having a guc would also help to test the feature with different values
than 1.5, which really seems harmful from a math point of view. I'm not
sure at all that a power formula is the right approach.
The potential impact I see would be to significantly aggravate the write
stall issues I'm working on, but the measurements provided in these tests
do not even look at that.
--
Fabien.
On 07/04/2015 07:34 PM, Fabien COELHO wrote:
In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.
I've looked at the maths.
I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) at around 40% progress, and ends up at 1.5,
i.e. the finishing load is 1.5 times the average load, just before
fsyncing files. This looks like a recipe for a bad time: I would say this
is too large an overload. I would suggest a much lower value, say around 1.1...
Hmm. Load is distributed as a derivative of that, but probably not the way
you think. Note that X means the amount of WAL consumed, not time. The
goal is that I/O is constant over time, but the consumption of WAL over
time is non-linear, with a lot more WAL consumed in the beginning of a
checkpoint cycle. The function compensates for that.
The other issue with this function is that, if someone has WAL on a
different disk, it can only degrade performance by disrupting the write
distribution. As I understand it, this thing only makes sense if the
WAL and the data are on the same disk. This really suggests a GUC.
No, the I/O storm caused by full-page-writes is a problem even if WAL is
on a different disk. Even though the burst of WAL I/O then happens on a
different disk, the fact that we consume a lot of WAL in the beginning
of a checkpoint makes the checkpointer think that it needs to hurry up,
in order to meet the deadline. It will flush a lot of pages in a rush,
so you get a burst of I/O on the data disk too. Yes, it's even worse
when WAL and data are on the same disk, but even then, I think the
random I/O caused by the checkpointer hurrying is more significant than
the extra WAL I/O, which is sequential.
To illustrate that, imagine that the checkpoint begins now. The
checkpointer calculates that it has 10 minutes to complete the
checkpoint (checkpoint_timeout), or until 1 GB of WAL has been generated
(derived from max_wal_size), whichever happens first. Immediately after
the Redo-point has been established, in the very beginning of the
checkpoint, the WAL storm begins. Every backend that dirties a page also
writes a full-page image. After just 10 seconds, those backends have
already written 200 MB of WAL. That's 1/5 of the quota, and based on
that, the checkpointer will quickly flush 1/5 of all buffers. In
reality, the WAL consumption is not linear, and will slow down as time
passes and less full-page writes happen. So in reality, the checkpointer
would have a lot more time to complete the checkpoint - it is
unnecessarily aggressive in the beginning of the checkpoint.
The correction factor in the patch compensates for that. With the X^1.5
formula, when 20% of the WAL has already been consumed, the checkpointer
has flushed only ~9% of the buffers, not 20% as without the patch.
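The ~9% figure above can be checked directly (a tiny sketch, not PostgreSQL code):

```python
# With progress target f(x) = x**1.5, the fraction of buffers the
# checkpointer aims to have flushed when a fraction x of the WAL quota
# has been consumed is f(x). At x = 0.2 (20% of WAL consumed):
flushed_fraction = 0.2 ** 1.5
print(f"{flushed_fraction:.3f}")  # about 0.089, i.e. ~9% of buffers
```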
The ideal correction formula f(x), would be such that f(g(X)) = X, where:
X is time, 0 = beginning of checkpoint, 1.0 = targeted end of
checkpoint (checkpoint_segments), and
g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0
= targeted end of checkpoint (derived from max_wal_size).
Unfortunately, we don't know the shape of g(X), as that depends on the
workload. It might be linear, if there is no effect at all from
full_page_writes. Or it could be a step-function, where every write
causes a full page write, until all pages have been touched, and after
that none do (something like an UPDATE without a where-clause might
cause that). In pgbench-like workloads, it's something like sqrt(x). I
picked X^1.5 as a reasonable guess. It's close enough to linear that it
shouldn't hurt too much if g(x) is linear. But it cuts the worst spike
at the very beginning, if g(x) is more like sqrt(x).
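The compromise Heikki describes can be sanity-checked numerically: the ideal correction satisfies f(g(X)) = X, and X^1.5 sits between the ideal corrections for a linear g and a sqrt-like g. An illustrative sketch (not from the patch):

```python
# f(x) = x**1.5 applied to two possible WAL-consumption shapes g(X):
#  - g(X) = X        (no full-page-write effect): f(g(X)) = X**1.5,
#    somewhat behind the ideal linear schedule, but close to it.
#  - g(X) = sqrt(X)  (heavy full-page-write burst): f(g(X)) = X**0.75,
#    much closer to linear than the uncorrected X**0.5.
for X in (0.1, 0.25, 0.5, 0.75, 1.0):
    uncorrected = X ** 0.5          # schedule if g(X)=sqrt(X), no correction
    corrected = (X ** 0.5) ** 1.5   # same, with the X**1.5 correction
    linear_g = X ** 1.5             # correction applied when g is linear
    print(f"X={X:.2f}  sqrt-g uncorrected={uncorrected:.2f} "
          f"corrected={corrected:.2f}  linear-g corrected={linear_g:.2f}")
```

The corrected sqrt-g schedule (X**0.75) is visibly closer to the ideal straight line X than the uncorrected X**0.5, which is the "cuts the worst spike" effect described above.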
This is all assuming that the application load is constant. If it's
not, g(x) can obviously have any shape, and there's no way we can
predict that. But that's a different story, nothing to do with
full_page_writes.
I have run some tests with this patch and the detailed results of the
runs are attached with this mail.
I really do not understand the aggregated figures in the files attached.
Me neither. It looks like Amit measured the time spent in mdread and
mdwrite, but I'm not sure what conclusions one can draw from that.
I thought the patch should show a difference if I keep max_wal_size to a
somewhat low or moderate value so that checkpoints get triggered due to
WAL size, but I am not seeing any major difference in the spreading of
writes.
I'm not sure I understand your point. I would say that at full speed
pgbench the disk is always busy writing as much as possible, either
checkpoint writes or WAL writes, so the write load as such should not be
that different anyway?
I understood that the point of the patch is to check whether there is a
tps dip or not when the checkpoint begins, but I'm not sure how this can
be inferred from the many aggregated data you sent, and from my recent
tests the tps is very variable anyway on HDD.
Right, that's my understanding too. If the disk is not saturated,
perhaps because you used pgbench's rate-limiting option, then measuring
the disk I/O would be useful too: flatter is better.
- Heikki
On 07/05/2015 08:19 AM, Fabien COELHO wrote:
I am a bit skeptical about this. We need test scenarios that clearly
show the benefit of having and of not having this behavior. It might be
that doing this always is fine for everyone.
Do you mean I have to prove that there is an actual problem induced by
this patch?
You don't have to do anything if you don't want to. I said myself that
this needs performance testing of the worst-case scenario, one where we
would expect this to perform worse than without the patch. Then we can
look at how bad that effect is, and decide if that's acceptable.
That said, if you could do that testing, that would be great! I'm not
planning to spend much time on this myself, and it would take me a fair
amount of time to set up the hardware and tools to test this. I was
hoping Digoal would have the time to do that, since he started this
thread, or someone else that has a system ready for this kind of
testing. If no-one steps up to the plate to test this more, however,
we'll have to just forget about this.
Having a guc would also help to test the feature with different values
than 1.5, which really seems harmful from a math point of view. I'm not
sure at all that a power formula is the right approach.
Yeah, a GUC would be helpful in testing this. I'm hoping that we would
come up with a reasonable formula that would work well enough for
everyone that we wouldn't need to have a GUC in the final patch, though.
- Heikki
You don't have to do anything if you don't want to.
Sure:-) What I mean is that I think that this patch is not ripe, and I
understood that some people were suggesting that it could be applied as is
right away. I'm really disagreeing with that.
I said myself that this needs performance testing of the worst-case
scenario, one where we would expect this to perform worse than without
the patch. Then we can look at how bad that effect is, and decide if
that's acceptable.
Ok, I'm fine with that. It's quite different from "looks ok, apply now".
That said, if you could do that testing, that would be great!
Hmmm. I was not really planning to. On the other hand, I have some scripts
and a small setup that I've been using to test checkpointer flushing, and
it would be easy to start some tests.
Having a guc would also help to test the feature with different values
than 1.5, which really seems harmful from a math point of view. I'm not
sure at all that a power formula is the right approach.
Yeah, a GUC would be helpful in testing this. I'm hoping that we would
come up with a reasonable formula that would work well enough for everyone
that we wouldn't need to have a GUC in the final patch, though.
Yep. If it is a guc testing is quite easy and I may run my scripts...
--
Fabien.
Hello Heikki,
I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) at around 40% progress, and ends up at 1.5,
i.e. the finishing load is 1.5 times the average load, just before
fsyncing files. This looks like a recipe for a bad time: I would say this
is too large an overload. I would suggest a much lower value, say around 1.1...
Hmm. Load is distributed as a derivative of that, but probably not the way
you think. Note that X means the amount of WAL consumed, not time.
Interesting point. After a look at IsCheckpointOnSchedule, and if I
understand the code correctly, it is actually *both*, so it really depends
whether the checkpoint was xlog or time triggered, and especially which
one (time/xlog) is proeminent at the beginning of the checkpoint.
If it is time triggered and paced my reasonning is probably right and
things will go bad/worse in the end, but if it is xlog-triggered and paced
your line of argument is probably closer to what happens.
This suggests that the corrective function should be applied with more
care, maybe only for the xlog-based on-schedule test, but not the
time-based check.
The goal is that I/O is constant over time, but the consumption of WAL
over time is non-linear, with a lot more WAL consumed in the beginning
of a checkpoint cycle. The function compensates for that.
*If* the checkpointer pacing comes from WAL size, which may or may not be
the case.
[...]
Unfortunately, we don't know the shape of g(X), as that depends on the
workload. It might be linear, if there is no effect at all from
full_page_writes. Or it could be a step-function, where every write causes a
full page write, until all pages have been touched, and after that none do
(something like an UPDATE without a where-clause might cause that).
If PostgreSQL is running in its cache (i.e., the working set fits within
shared_buffers), the usual assumption would be an unknown exponential
probability decreasing with time while the same pages are hit over and
over.
If PostgreSQL is running on memory or disk (effective database size
greater than shared buffers), pages are statistically not reused by
another update before being written out, so full page writes would be
needed throughout the whole checkpoint; there is no WAL storm (or it is
always a storm, depending on the point of view) and the corrective factor
would only create issues...
So basically I would say that what to do heavily depends on the database
size and the checkpoint trigger (time vs xlog), which really suggests that
a GUC is indispensable, and maybe that the correction is currently not
applied in the right place.
In pgbench-like workloads, it's something like sqrt(x).
Probably for a small database size?
I picked X^1.5 as a reasonable guess. It's close enough to linear that
it shouldn't hurt too much if g(x) is linear.
My understanding is still that there is a 50% overload at the end of the
checkpoint, just before issuing fsync... I think that could hurt in some
cases.
But it cuts the worst spike at the very beginning, if g(x) is more like
sqrt(x).
Hmmm. It's a balance between saving the first 10 seconds of the checkpoint
at the price of risking a panic at the end of the checkpoint.
Now the right approach might be for pg to know what is happening by
collecting statistics while running, and to apply a correction when it is
needed, for the amount needed.
--
Fabien.