Re: automatic analyze: readahead - add "IO read time" log message
Hi Stephen, hackers,
With all those 'readahead' calls it certainly makes one wonder if the
Linux kernel is reading more than just the block we're looking for
because it thinks we're doing a sequential read and will therefore want
the next few blocks when, in reality, we're going to skip past them,
meaning that any readahead the kernel is doing is likely just wasted
I/O.

I've done some quick & dirty tests with blockdev --setra/--setfra 0 after spending time looking at the smgr/md/fd API changes required to find a shortcut, but I'm actually getting somewhat worse timings, at least in "laptop DB" tests. One thing I've noticed is that this would need to apply only to automatic ANALYZE, not to automatic VACUUM, where there apparently is some boost thanks to readahead.
Interesting that you weren't seeing any benefit to disabling readahead.
I've got some free minutes and I have repeated the exercise in a more realistic
and stricter environment than the previous one, to conclude that the current situation is preferable:
The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVMe under dm-0 (this might matter, as /dev/dm-0 is a separate layer and might have different storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL12:
INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows
Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent in heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request size per IO=25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s
No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s
No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads [ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s
One thing that's not clear here, and may be worth measuring in the future, is how
striped LVs are affected by readahead.
Were you able to see where the time in the kernel was going when
readahead was turned off for the ANALYZE?
Yes, my interpretation is that the time is spent waiting directly in the block I/O layer.
54.67% 1.33% postgres postgres [.] FileRead
---FileRead
--53.33%--__pread_nocancel
--50.67%--entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_pread64
|--49.33%--vfs_read
| --48.00%--__vfs_read
| |--45.33%--generic_file_read_iter
| | |--42.67%--ondemand_readahead
| | | __do_page_cache_readahead
| | | |--25.33%--ext4_mpage_readpages
| | | | |--10.67%--submit_bio
| | | | | generic_make_request
| | | | | |--8.00%--blk_mq_make_request
| | | | | | |--4.00%--blk_mq_get_request
| | | | | | | |--1.33%--blk_mq_get_tag
| | | | | | | --1.33%--sched_clock
| | | | | | | xen_sched_clock
| | | | | | | pvclock_clocksource_read
| | | | | | |--1.33%--bio_integrity_prep
| | | | | | --1.33%--blk_account_io_start
| | | | | | part_round_stats
| | | | | | blk_mq_in_flight
| | | | | | blk_mq_queue_tag_busy_iter
| | | | | --2.67%--dm_make_request
| | | | | __split_and_process_bio
| | | | | __split_and_process_non_flush
| | | | | |--1.33%--__map_bio
| | | | | | generic_make_request
| | | | | | generic_make_request_checks
| | | | | | percpu_counter_add_batch
| | | | | --1.33%--bio_alloc_bioset
| | | | | mempool_alloc
| | | | | kmem_cache_alloc
| | | | |--6.67%--ext4_map_blocks
| | | | | |--4.00%--ext4_es_lookup_extent
| | | | | | --2.67%--_raw_read_lock
| | | | | --2.67%--__check_block_validity.constprop.81
| | | | | ext4_data_block_valid
| | | | --6.67%--add_to_page_cache_lru
| | | | |--4.00%--__add_to_page_cache_locked
| | | | | --1.33%--mem_cgroup_try_charge
| | | | | get_mem_cgroup_from_mm
| | | | --2.67%--__lru_cache_add
| | | | pagevec_lru_move_fn
| | | | __lock_text_start
| | | |--12.00%--blk_finish_plug
| | | | blk_flush_plug_list
| | | | blk_mq_flush_plug_list
| | | | |--10.67%--__blk_mq_delay_run_hw_queue
| | | | | __blk_mq_run_hw_queue
| | | | | blk_mq_sched_dispatch_requests
| | | | | --9.33%--blk_mq_dispatch_rq_list
| | | | | nvme_queue_rq
| | | | | --1.33%--blk_mq_start_request
The VACUUM case is going to be complicated by what's in the visibility
map. (..)
After observing the readahead benefit for ANALYZE I've abandoned
testing the much more advanced VACUUM processing; clearly Linux
read-ahead is beneficial even in simple cases.
My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO), with POSIX_FADV_RANDOM issued on the block range beforehand, so maybe the optimization is possible, but not until we'll have AIO ;)
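For illustration, a minimal userspace sketch of that idea (no real AIO, just fadvise batching; the file path and block numbers are made up, and BLCKSZ is hard-coded):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	/* Hypothetical sampled block numbers within one relation segment. */
	off_t		blocks[] = {90110, 90153, 90171, 90200};
	char		buf[BLCKSZ];
	int			fd = open("base/16384/24576", O_RDONLY);	/* made-up relfilenode */

	if (fd < 0)
		return 1;

	/* Declare the access pattern random, suppressing kernel readahead. */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	/* Queue all the hints first, so the reads can be in flight concurrently... */
	for (int i = 0; i < 4; i++)
		(void) posix_fadvise(fd, blocks[i] * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);

	/* ...then consume them; ideally each pread() finds the page already cached. */
	for (int i = 0; i < 4; i++)
		(void) pread(fd, buf, BLCKSZ, blocks[i] * BLCKSZ);

	close(fd);
	return 0;
}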
(..)AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().
Yes, I'm aware, and I'm keeping my fingers crossed that maybe some day...
ANALYZE just seems a natural candidate to use it. The only easy chance
of accelerating stats gathering - at least from my end-user point of view -
is to have more parallel autoanalyze workers running to drive more I/O concurrency
(e.g. by partitioning the data), in both readahead and non-readahead scenarios.
Which is a pity, because such a process sits idle 70-80% of the time. The readahead might read
10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
is available it could throw thousands of requests at the device without readahead and probably
achieve much better efficiency.</speculation>
I hope the previous simple patch goes into master and helps other people understand
the picture more easily.
-J.
Greetings,
* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
Interesting that you weren't seeing any benefit to disabling readahead.
I've got some free minutes and I have repeated the exercise in a more realistic
and stricter environment than the previous one, to conclude that the current situation is preferable:
Thanks for spending time on this!
Analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVMe under dm-0 (this might matter, as /dev/dm-0 is a separate layer and might have different storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL12:
INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent in heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request size per IO=25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads [ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s
[ ... ]
The VACUUM case is going to be complicated by what's in the visibility
map. (..)

After observing the readahead benefit for ANALYZE I've abandoned
testing the much more advanced VACUUM processing; clearly Linux
read-ahead is beneficial even in simple cases.
This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.
Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.
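(For context: in PostgreSQL such a hint is issued through PrefetchBuffer(), which for a block not already in shared_buffers boils down to a posix_fadvise(POSIX_FADV_WILLNEED) on the underlying segment file. Conceptually, per sampled block - a sketch, not the patch itself, with seg_fd standing in for the segment's file descriptor:

	PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, targblock);

	/*
	 * ...which reaches the kernel, via smgrprefetch()/mdprefetch()/
	 * FilePrefetch(), as approximately:
	 */
	posix_fadvise(seg_fd, (off_t) (targblock % ((BlockNumber) RELSEG_SIZE)) * BLCKSZ,
				  BLCKSZ, POSIX_FADV_WILLNEED);

)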
Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none'). If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.
Going between effective_io_concurrency = 0 and effective_io_concurrency = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.
My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO), with POSIX_FADV_RANDOM issued on the block range beforehand, so maybe the optimization is possible, but not until we'll have AIO ;)
(..)AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().

Yes, I'm aware, and I'm keeping my fingers crossed that maybe some day...
I don't think we should throw out the idea of using PrefetchBuffer()
here. "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.
ANALYZE just seems a natural candidate to use it. The only easy chance
of accelerating stats gathering - at least from my end-user point of view -
is to have more parallel autoanalyze workers running to drive more I/O concurrency
(e.g. by partitioning the data), in both readahead and non-readahead scenarios.
Which is a pity, because such a process sits idle 70-80% of the time. The readahead might read
10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
is available it could throw thousands of requests at the device without readahead and probably
achieve much better efficiency.</speculation>
Actual parallel ANALYZE could be interesting, of course, but that's also
quite a bit more complicated to make happen, and even in that case it
would seem like we'd probably want to use posix_fadvise() anyway since
it'd hopefully help.
I hope the previous simple patch goes into master and helps other people understand
the picture more easily.
I agree that we should also be providing users with more information
about what's going on, but maybe we can do that and improve things by
leveraging posix_fadvise().
Thanks,
Stephen
Attachments:
Greetings Stephen,
I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's
a huge ~5.2x improvement, which is amazing for such a simple patch.
The system and data were identical to last time, so the results are directly comparable
to the previous post. The only change is that I've applied your patch on top of
REL12_STABLE, as this is the cluster where I had the original data. I think it deserves
its own commitfest entry, right? Or do we combine those two?
Each test was conducted with
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.
no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s
readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)
I think one needs to be sure to restart, so that the autovacuum
workers get the new effective_io_concurrency; maybe that's just a
documentation thing.
I have just one question: with this patch the strace I/O of analyze
looks like below:
pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
-J.
On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote:
Greetings Stephen,
I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's
a huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly comparable
to the previous post. The only change is that I've applied your patch on top of
REL12_STABLE, as this is the cluster where I had the original data. I think it deserves
its own commitfest entry, right? Or do we combine those two?

Each test was conducted with
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)
Thanks for the testing, those are indeed quite promising numbers for
such a simple patch. I wonder how this would behave on different
storage - I'll see if I can do some testing, but I guess it should be
the same as for other places that do prefetching.
I think one needs to be sure to restart, so that the autovacuum
workers get the new effective_io_concurrency; maybe that's just a
documentation thing.

I have just one question: with this patch the strace I/O of analyze
looks like below:

pread(72, ..., 8192, 738181120) = 8192
[ ... ]

If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.
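Outside PostgreSQL, that scheme could be sketched like this (hypothetical names; in the real patch PrefetchBuffer() would play the posix_fadvise() role): keep a prefetch cursor running prefetch_max blocks ahead of the scan cursor, so each block is hinted exactly once.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* blocks[] is the sorted list of sampled block numbers. */
static void
sample_with_prefetch(int fd, const off_t *blocks, int nblocks, int prefetch_max)
{
	char		buf[BLCKSZ];
	int			prefetch_pos = 0;

	/* Prime the window: hint the first prefetch_max blocks up front. */
	while (prefetch_pos < nblocks && prefetch_pos < prefetch_max)
		(void) posix_fadvise(fd, blocks[prefetch_pos++] * BLCKSZ, BLCKSZ,
							 POSIX_FADV_WILLNEED);

	for (int i = 0; i < nblocks; i++)
	{
		(void) pread(fd, buf, BLCKSZ, blocks[i] * BLCKSZ);

		/* After each read, hint exactly one new block to keep the window full. */
		if (prefetch_pos < nblocks)
			(void) posix_fadvise(fd, blocks[prefetch_pos++] * BLCKSZ, BLCKSZ,
								 POSIX_FADV_WILLNEED);
	}
}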
BTW it seems your e-mail client does something funny, stripping the
"references" headers, which breaks threading and makes following the
discussion very hard (and will confuse the CF app too). If you look at
pgsql-hackers archives, each of your responses starts a new thread.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote:
I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's
a huge ~5.2x improvement, which is amazing for such a simple patch.
Nice!
The system and data were identical to last time, so the results are directly comparable
to the previous post. The only change is that I've applied your patch on top of
REL12_STABLE, as this is the cluster where I had the original data. I think it deserves
its own commitfest entry, right? Or do we combine those two?

Each test was conducted with
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s
I'm speculating here, but it seems likely that by disabling read-ahead,
the posix_fadvise() calls are basically getting ignored and that's why
there's not much difference here...?
readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)
Very nice.
Thanks for the testing, those are indeed quite promising numbers for
such a simple patch. I wonder how this would behave on different
storage - I'll see if I can do some testing, but I guess it should be
the same as for other places that do prefetching.
That would certainly be cool.
I think one needs to be sure to restart, so that the autovacuum
workers get the new effective_io_concurrency; maybe that's just a
documentation thing.

I have just one question: with this patch the strace I/O of analyze
looks like below:

pread(72, ..., 8192, 738181120) = 8192
[ ... ]

If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.
Yeah, this was just something quickly thrown together to see if it'd
help. I'll clean it up by, as you say, doing something similar to what
we do with bitmap heap scan: having a separate BlockSampler that's
set up early on, and then having the pre-fetching keep pace with the actual
scan instead of posix_fadvise()'ing the same pages over and over.
I'll also look at doing something for VACUUM too, as that could also
benefit from posix_fadvise() calls, when we're skipping over all-frozen
pages.
Thanks!
Stephen
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.
Updated patch attached which:
- Starts out by pre-fetching the first effective_io_concurrency number
of blocks we are going to want, hopefully making it so the kernel will
trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
hopefully avoid having the prefetching get in the way of the
regular i/o.
- Added some comments, ran pgindent, added a commit message.
I do think we should also include the patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.
I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.
If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use. Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.
Thanks,
Stephen
Attachments:
Hi,
On 11/4/20 5:02 PM, Stephen Frost wrote:
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.

Updated patch attached which:
- Starts out by pre-fetching the first effective_io_concurrency number
of blocks we are going to want, hopefully making it so the kernel will
trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
hopefully avoid having the prefetching get in the way of the
regular i/o.
- Added some comments, ran pgindent, added a commit message.
Nice, that was quick ;-)
I do think we should also include the patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.
+1
I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.
+1
If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use. Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.
Yeah. I'd expect this to be heavily dependent on hardware.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
On 11/4/20 5:02 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.

Updated patch attached which:
- Starts out by pre-fetching the first effective_io_concurrency number
of blocks we are going to want, hopefully making it so the kernel will
trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
hopefully avoid having the prefetching get in the way of the
regular i/o.
- Added some comments, ran pgindent, added a commit message.

Nice, that was quick ;-)
:)
I do think we should also include the patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.

+1
Attached is an updated patch which updates the documentation and
integrates Jakub's initial work on improving the logging around
auto-analyze (and I made the logging in auto-vacuum more-or-less match
it).
I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.

+1
I spent some time looking into this but it's a bit complicated.. For
some sound reasons, VACUUM will avoid skipping through a table when
there are only a few pages that it could skip (not skipping allows us to
move forward the relfrozenxid). That said, perhaps we could start doing
prefetching once we've decided that we're skipping. We'd need to think
about whether we have to worry about the VM changing between the pre-fetching
and the time when we're actually going to ask for the page.. I don't
*think* that's an issue because only VACUUM would be changing the pages
to be all-frozen or all-visible, and so if we see a page that isn't one
of those then we're going to want to visit that page and that's not
going to change, and we probably don't need to care about a page that
used to be all-frozen and now isn't during this run- but if the prefetch
went ahead and got page 10, and now page 8 is not-all-frozen and the
actual scan is at page 5, then maybe it wants page 8 next and that isn't
what we pre-fetched...
Anyhow, all-in-all, definitely more complicated and probably best
considered and discussed independently.
If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use. Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.

Yeah. I'd expect this to be heavily dependent on hardware.
Sure, I agree with that too.
Thanks,
Stephen
Attachments:
On 11/9/20 7:06 PM, Stephen Frost wrote:
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
On 11/4/20 5:02 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until the pread() - in the case of offset "738754560" it was
asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.

Updated patch attached which:
- Starts out by pre-fetching the first effective_io_concurrency number
of blocks we are going to want, hopefully making it so the kernel will
trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
hopefully avoid having the prefetching get in the way of the
regular i/o.
- Added some comments, ran pgindent, added a commit message.

Nice, that was quick ;-)
:)
I do think we should also include the patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.

+1
Attached is an updated patch which updates the documentation and
integrates Jakub's initial work on improving the logging around
auto-analyze (and I made the logging in auto-vacuum more-or-less match
it).
Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.
I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.

+1
I spent some time looking into this but it's a bit complicated.. For
some sound reasons, VACUUM will avoid skipping through a table when
there are only a few pages that it could skip (not skipping allows us to
move forward the relfrozenxid). That said, perhaps we could start doing
prefetching once we've decided that we're skipping. We'd need to think
about whether we have to worry about the VM changing between the pre-fetching
and the time when we're actually going to ask for the page.. I don't
*think* that's an issue because only VACUUM would be changing the pages
to be all-frozen or all-visible, and so if we see a page that isn't one
of those then we're going to want to visit that page and that's not
going to change, and we probably don't need to care about a page that
used to be all-frozen and now isn't during this run- but if the prefetch
went ahead and got page 10, and now page 8 is not-all-frozen and the
actual scan is at page 5, then maybe it wants page 8 next and that isn't
what we pre-fetched...

Anyhow, all-in-all, definitely more complicated and probably best
considered and discussed independently.
+1
FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.
Agreed, but would still be nice to see test results beyond just what
I've done.
FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.
I've gone ahead and updated the CF entry for this to hopefully make it
clearer for those interested in looking at it. I'll try to come back to
this in the next CF, ideally we'd at least get someone else to take a
look at the code beyond me. :) (Obviously, you looked at it some, but
wasn't really clear if you were alright with it or if you felt it needed
more review.)
Thanks!
Stephen
Greetings,
* Stephen Frost (sfrost@snowman.net) wrote:
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.

Agreed, but would still be nice to see test results beyond just what
I've done.
Any chance you were able to run those tests..?
FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.

I've gone ahead and updated the CF entry for this to hopefully make it
clearer for those interested in looking at it. I'll try to come back to
this in the next CF, ideally we'd at least get someone else to take a
look at the code beyond me. :) (Obviously, you looked at it some, but
wasn't really clear if you were alright with it or if you felt it needed
more review.)
I've updated the patch to leverage Tom's introduction of
TimestampDifferenceMilliseconds, which simplifies things a bit (and I
don't think we need to worry about an analyze taking over 25 days...)
and generally rebased this up to current HEAD.
Would be great to get a review / comments from others as to if there's
any concerns. I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)
Thanks,
Stephen
Attachments:
On 13/01/2021 23:17, Stephen Frost wrote:
Would be great to get a review / comments from others as to if there's
any concerns. I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)
Looks good overall. A few minor comments:
The patch consists of two parts: adding stats to the log for auto-analyze,
and implementing prefetching. They seem like independent features; consider
splitting into two patches.
It's a bit weird that you get more stats in the log for
autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not
really this patch's fault though.
This conflicts with the patch at
https://commitfest.postgresql.org/31/2907/, to refactor the table AM
analyze API. That's OK, it's straightforward to resolve regardless of
which patch is committed first.
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
#ifdef USE_PREFETCH
	BlockNumber prefetch_targblock = InvalidBlockNumber;
#endif
	BlockNumber targblock = BlockSampler_Next(&bs);

#ifdef USE_PREFETCH
	/*
	 * Make sure that every time the main BlockSampler is moved forward
	 * that our prefetch BlockSampler also gets moved forward, so that we
	 * always stay out ahead.
	 */
	if (BlockSampler_HasMore(&prefetch_bs))
		prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

	vacuum_delay_point();

	if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
		continue;

#ifdef USE_PREFETCH
	/*
	 * When pre-fetching, after we get a block, tell the kernel about the
	 * next one we will want, if there's any left.
	 */
	if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

	while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
	{
		...
	}

	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
								 ++blksdone);
}
If effective_io_concurrency == 0, this calls
BlockSampler_Next(&prefetch_bs) anyway, which is a waste of cycles.
If table_scan_analyze_next_block() returns false, we skip the
PrefetchBuffer() call. That seems wrong.
Is there any potential harm from calling PrefetchBuffer() on a page that
table_scan_analyze_next_block() later deems as unsuitable for sampling
and returns false? That's theoretical at the moment, because
heapam_scan_analyze_next_block() always returns true. (The tableam
ANALYZE API refactor patch will make this moot, as it moves this logic
into the tableam's implementation, so the implementation can do whatever
makes sense for the particular AM.)
- Heikki
Greetings,
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 13/01/2021 23:17, Stephen Frost wrote:
Would be great to get a review / comments from others as to if there's
any concerns. I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)

Looks good overall. A few minor comments:
Thanks a lot for the review!
The patch consists of two parts: adding stats to the log for auto-analyze,
and implementing prefetching. They seem like independent features; consider
splitting into two patches.
Yeah, that's a good point. I had anticipated that there would be
overlap but in the end there really wasn't. Done in the attached.
It's a bit weird that you get more stats in the log for
autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not really
this patch's fault though.
Agreed.
This conflicts with the patch at https://commitfest.postgresql.org/31/2907/,
to refactor the table AM analyze API. That's OK, it's straightforward to
resolve regardless of which patch is committed first.
Agreed.
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
#ifdef USE_PREFETCH
	BlockNumber prefetch_targblock = InvalidBlockNumber;
#endif
	BlockNumber targblock = BlockSampler_Next(&bs);

#ifdef USE_PREFETCH
	/*
	 * Make sure that every time the main BlockSampler is moved forward
	 * that our prefetch BlockSampler also gets moved forward, so that we
	 * always stay out ahead.
	 */
	if (BlockSampler_HasMore(&prefetch_bs))
		prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

	vacuum_delay_point();

	if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
		continue;

#ifdef USE_PREFETCH
	/*
	 * When pre-fetching, after we get a block, tell the kernel about the
	 * next one we will want, if there's any left.
	 */
	if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

	while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
	{
		...
	}

	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
								 ++blksdone);
}

If effective_io_concurrency == 0, this calls BlockSampler_Next(&prefetch_bs)
anyway, which is a waste of cycles.
Good point, fixed.
If table_scan_analyze_next_block() returns false, we skip the
PrefetchBuffer() call. That seems wrong.
Agreed, fixed.
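With both fixes applied, the prefetch portion of the loop presumably ends up shaped roughly like this (paraphrased sketch, not the literal patch; block_accepted is an illustrative name):

	BlockNumber targblock = BlockSampler_Next(&bs);

#ifdef USE_PREFETCH
	/* Only advance the prefetch sampler when prefetching is enabled. */
	if (effective_io_concurrency && BlockSampler_HasMore(&prefetch_bs))
		prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

	vacuum_delay_point();

	/* Remember the result rather than skipping the prefetch on 'false'. */
	block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);

#ifdef USE_PREFETCH
	/* Hint the next block even if this one was rejected. */
	if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

	if (!block_accepted)
		continue;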
Is there any potential harm from calling PrefetchBuffer() on a page that
table_scan_analyze_next_block() later deems as unsuitable for sampling and
returns false? That's theoretical at the moment, because
heapam_scan_analyze_next_block() always returns true. (The tableam ANALYZE
API refactor patch will make this moot, as it moves this logic into the
tableam's implementation, so the implementation can do whatever make sense
for the particular AM.)
I can't see any potential harm, and it seems pretty likely that if a
heapam_scan_analyze_next_block() equivalent were to decide that a block
isn't appropriate to analyze it'd have to do so after reading that block
anyway, making the prefetch still useful.
Perhaps there'll be a case in the future where a given AM would know,
based on just the block number, that it isn't useful to analyze, in which
case it'd make sense to adjust the code to skip that block for both
prefetching and actually reading, but I don't think that would be too
hard to do. It doesn't seem sensible to invent that in advance of actually
having that case though - it's certainly not the case for the heap AM today,
at least, as you say.
Unless there's anything else on this, I'll commit these sometime next
week.
Thanks again for the review!
Stephen
Attachments:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using the 'effective_io_concurrency' GUC
directly, this should call get_tablespace_maintenance_io_concurrency().
- Heikki
Greetings,
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using the 'effective_io_concurrency' GUC directly,
this should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
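For reference, that change is presumably just something along these lines (sketch; onerel being the relation under analysis, and prefetch_maximum the variable mentioned later in the thread):

	/* Honor the per-tablespace setting rather than the global GUC. */
	prefetch_maximum =
		get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);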
Updated patch attached.
Thanks!
Stephen
Attachments:
Hi,
On 2/10/21 11:10 PM, Stephen Frost wrote:
Greetings,
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using the 'effective_io_concurrency' GUC directly,
this should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.
2) I haven't tried but I'm pretty sure there'll be a compiler warning
about 'prefetch_maximum' being unused without USE_PREFETCH defined.
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 2/10/21 11:10 PM, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using the 'effective_io_concurrency' GUC directly,
this should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.
This was already done..
2) I haven't tried but I'm pretty sure there'll be a compiler warning
about 'prefetch_maximum' being unused without USE_PREFETCH defined.
Ah, that part is likely true; moved it down into the #ifdef block to
address that, which is also good since it should avoid mistakenly using
it outside of the #ifdefs later on.
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
Perhaps..
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.
but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.
Rebased and updated patch attached.
Thanks!
Stephen
Attachments:
On 3/8/21 8:42 PM, Stephen Frost wrote:
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 2/10/21 11:10 PM, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using the 'effective_io_concurrency' GUC directly,
this should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.

This was already done..
Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.
2) I haven't tried but I'm pretty sure there'll be a compiler warning
about 'prefetch_maximum' being unused without USE_PREFETCH defined.

Ah, that part is likely true; moved it down into the #ifdef block to
address that, which is also good since it should avoid mistakenly using
it outside of the #ifdefs later on.
OK
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
Perhaps..
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.

but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.
Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?
I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 3/8/21 8:42 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 2/10/21 11:10 PM, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using the 'effective_io_concurrency' GUC directly,
this should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.

This was already done..
Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.
They are two patches..
➜ ~ grep Subject analyze_prefetch_v8.patch
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
Subject: [PATCH 2/2] Use pre-fetching for ANALYZE
The first doesn't have any prefetch-related things, the second doesn't
have any track_io_timing things..
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
Perhaps..
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.
but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.
Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?
Telling the kernel "hey, we're about to need this, please go get it" and
then immediately going to sleep just strikes me as a bit off. We should
be trying to minimize the time between prefetch and actual request for
the page. Of course, there will be times when we issue a prefetch,
come around again, hit the cost limit, and sleep before we actually
request the page, and maybe that doesn't ultimately matter much, but it
just seems better to sleep first and then issue the prefetch, to
minimize the amount 'outstanding' when we do end up sleeping. One thing
about prefetching is that the kernel is certainly within its rights to
decide to drop the page before we actually go to read it, if it's under
memory pressure and we're just sleeping.
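To make the ordering concrete, a simplified sketch of the sampling
loop as argued for above: sleep first, read the current block, then
prefetch. This is only an illustration built from the names quoted in
this thread (prefetch_targblock, BlockSampler_Next,
table_scan_analyze_next_block); the actual patch may differ:

	while (BlockSampler_HasMore(&bs))
	{
		BlockNumber targblock = BlockSampler_Next(&bs);
#ifdef USE_PREFETCH
		BlockNumber prefetch_targblock = InvalidBlockNumber;

		/* keep the prefetch sampler running ahead of the main one */
		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

		/* sleep first, if the cost accounting says we must ... */
		vacuum_delay_point();

		if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
			continue;

#ifdef USE_PREFETCH
		/* ... and only then hint the kernel about the upcoming block,
		 * so the hint is not left outstanding across a sleep */
		if (prefetch_targblock != InvalidBlockNumber)
			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

		/* ... sample rows from targblock ... */
	}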
I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.
Sure, done in the attached.
Thanks for the review! Unless there's other comments, I'll plan to push
this over the weekend or early next week.
Stephen
On 3/12/21 1:11 AM, Stephen Frost wrote:
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 3/8/21 8:42 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 2/10/21 11:10 PM, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.
This was already done..
Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.
They are two patches..
➜ ~ grep Subject analyze_prefetch_v8.patch
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
Subject: [PATCH 2/2] Use pre-fetching for ANALYZE
The first doesn't have any prefetch-related things, the second doesn't
have any track_io_timing things..
Oh! I didn't realize a single file can contain multiple separate
patches, so I saw a single file and assumed it's one patch. How do you
produce a single file, and is that better than just generating multiple
files using git format-patch?
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
Perhaps..
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.
but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.
Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?
Telling the kernel "hey, we're about to need this, please go get it" and
then immediately going to sleep just strikes me as a bit off. We should
be trying to minimize the time between prefetch and actual request for
the page. Of course, there will be times when we issue a prefetch,
come around again, hit the cost limit, and sleep before we actually
request the page, and maybe that doesn't ultimately matter much, but it
just seems better to sleep first and then issue the prefetch, to
minimize the amount 'outstanding' when we do end up sleeping. One thing
about prefetching is that the kernel is certainly within its rights to
decide to drop the page before we actually go to read it, if it's under
memory pressure and we're just sleeping.
I don't know, but this argument seems rather hand-wavy to me, TBH.
Firstly, we're prefetching quite far ahead anyway, so there's almost
always going to be a vacuum_delay_point() call between issuing the
prefetch and actually using the page. Secondly, vacuum_delay_point() won't
really sleep for most of the calls (assuming the cost limit is not
unreasonably small).
The way I see it (a) it's unlikely a particular vacuum_delay_point()
call will sleep, while at the same time (b) it's quite probable one of
the calls between prefetching and using the page will sleep.
So I think the exact ordering of those two calls does not really matter,
at least not for this particular reason.
That being said, I don't feel strongly about this - if you prefer it
like this, so be it. I can live with two ifdefs instead of one.
I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.
Sure, done in the attached.
Thanks for the review! Unless there's other comments, I'll plan to push
this over the weekend or early next week.
+1
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 3/12/21 1:11 AM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 3/8/21 8:42 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
On 2/10/21 11:10 PM, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
Updated patch attached.
A couple minor comments:
1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.
This was already done..
Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.
They are two patches..
➜ ~ grep Subject analyze_prefetch_v8.patch
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
Subject: [PATCH 2/2] Use pre-fetching for ANALYZE
The first doesn't have any prefetch-related things, the second doesn't
have any track_io_timing things..
Oh! I didn't realize a single file can contain multiple separate
patches, so I saw a single file and assumed it's one patch. How do you
produce a single file, and is that better than just generating multiple
files using git format-patch?
I use my gfp alias, which sends to stdout:
gfp='git format-patch @{u} --stdout'
So I only have one file to move around and there's no question about the
ordering of applying them and such.. Not sure if everyone would agree
that it's better or worse but I do find it works a bit better for me
than dealing with multiple files.
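For reference, a minimal sketch of that workflow (the alias and file
name are just the examples from this thread):

	# one file containing every commit ahead of the upstream branch
	git format-patch @{u} --stdout > analyze_prefetch_v8.patch

	# the receiving side applies the whole series, in order
	git am analyze_prefetch_v8.patch

Because the file is an mbox-style series, git am applies the patches
in order and preserves each commit message.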
3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
Perhaps..
This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.
but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.
Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?
Telling the kernel "hey, we're about to need this, please go get it" and
then immediately going to sleep just strikes me as a bit off. We should
be trying to minimize the time between prefetch and actual request for
the page. Of course, there will be times when we issue a prefetch,
come around again, hit the cost limit, and sleep before we actually
request the page, and maybe that doesn't ultimately matter much, but it
just seems better to sleep first and then issue the prefetch, to
minimize the amount 'outstanding' when we do end up sleeping. One thing
about prefetching is that the kernel is certainly within its rights to
decide to drop the page before we actually go to read it, if it's under
memory pressure and we're just sleeping.
I don't know, but this argument seems rather hand-wavy to me, TBH.
I agree it is, but then again, the only argument raised against doing it
this way is that the alternative might reduce the number of #ifdefs and,
at least for my part, that doesn't seem like a terribly strong argument
in this case.
Firstly, we're prefetching quite far ahead anyway, so there's almost
always going to be a vacuum_delay_point() call between issuing the
prefetch and actually using the page. Secondly, vacuum_delay_point() won't
really sleep for most of the calls (assuming the cost limit is not
unreasonably small).
The way I see it (a) it's unlikely a particular vacuum_delay_point()
call will sleep, while at the same time (b) it's quite probable one of
the calls between prefetching and using the page will sleep.
Doing the prefetch before the sleep sets us up in a position where
there'll always be one more page that's been prefetched while we're
sleeping, no matter when that sleep actually happens, than if we do the
prefetch after the sleep. Considering the default maintenance_io_concurrency
is only 10, that's 10% of the prefetched pages.
So I think the exact ordering of those two calls does not really matter,
at least not for this particular reason.
I don't see it as a huge deal, but..
That being said, I don't feel strongly about this - if you prefer it
like this, so be it. I can live with two ifdefs instead of one.
Yeah, I do think I prefer it.
I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.
Sure, done in the attached.
Thanks for the review! Unless there's other comments, I'll plan to push
this over the weekend or early next week.
+1
Thanks again for the review!
Stephen
Greetings,
* Stephen Frost (sfrost@snowman.net) wrote:
I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.
Sure, done in the attached.
Thanks for the review! Unless there's other comments, I'll plan to push
this over the weekend or early next week.
+1
Thanks again for the review!
and pushed.
Thanks everyone for the suggestions and reviews, and to Jakub in
particular for starting us down this path towards improving things in
ANALYZE.
Stephen
Hi,
On 11.02.2021 01:10, Stephen Frost wrote:
Greetings,
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
Updated patch attached.
I happened to notice that get_tablespace_io_concurrency() is called
instead of get_tablespace_maintenance_io_concurrency(). It doesn't look
right, no?
Regards,
Egor Rogov.
Greetings,
* Egor Rogov (e.rogov@postgrespro.ru) wrote:
On 11.02.2021 01:10, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
Updated patch attached.
I happened to notice that get_tablespace_io_concurrency() is called instead
of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?
Hah, indeed you're right. Will fix.
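For clarity, the fix amounts to a one-line change of roughly this
shape (a sketch only, with 'onerel' standing in for the relation being
analyzed):

#ifdef USE_PREFETCH
	/* use the maintenance variant, so ANALYZE honors
	 * maintenance_io_concurrency for the table's tablespace */
	prefetch_maximum =
		get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
#endif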
Thanks!
Stephen
This patch adds hits/misses/dirtied, but explain says hit/read/dirtied/written.
Should it say "read" instead of "misses"?
src/backend/access/heap/vacuumlazy.c: _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
src/backend/commands/explain.c- appendStringInfo(es->str, " hit=%lld",
src/backend/commands/explain.c- (long long) usage->shared_blks_hit);
src/backend/commands/explain.c- if (usage->shared_blks_read > 0)
src/backend/commands/explain.c- appendStringInfo(es->str, " read=%lld",
src/backend/commands/explain.c- (long long) usage->shared_blks_read);
src/backend/commands/explain.c- if (usage->shared_blks_dirtied > 0)
src/backend/commands/explain.c: appendStringInfo(es->str, " dirtied=%lld",
src/backend/commands/explain.c- (long long) usage->shared_blks_dirtied);
src/backend/commands/explain.c- if (usage->shared_blks_written > 0)
src/backend/commands/explain.c- appendStringInfo(es->str, " written=%lld",
src/backend/commands/explain.c- (long long) usage->shared_blks_written);
src/backend/commands/explain.c- if (has_local || has_temp)
Greetings,
* Justin Pryzby (pryzby@telsasoft.com) wrote:
This patch adds hits/misses/dirtied, but explain says hit/read/dirtied/written.
Should it say "read" instead of "misses"?
src/backend/access/heap/vacuumlazy.c: _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
Errr, the analyze readahead patch set didn't add the above use of
'misses'; that's been around for an age or more. For giggles, I traced
it back to this 2011 commit: 9d3b50244357ef4c4e3b6e01f91de599077179c8
As for the "misses" usage which was actually added by the patch here:
src/backend/commands/analyze.c: appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
That was specifically done to match the 'vacuumlazy.c' usage of 'misses'
above.
As for which to use, I don't have any particular dog in that fight, but
I do think we should be consistent between how VACUUM reports this
information and how ANALYZE does.
Thanks!
Stephen
Greetings,
* Stephen Frost (sfrost@snowman.net) wrote:
* Egor Rogov (e.rogov@postgrespro.ru) wrote:
On 11.02.2021 01:10, Stephen Frost wrote:
* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
On 05/02/2021 23:22, Stephen Frost wrote:
Unless there's anything else on this, I'll commit these sometime next
week.
One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().
Ah, yeah, of course.
Updated patch attached.
I happened to notice that get_tablespace_io_concurrency() is called instead
of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?
Hah, indeed you're right. Will fix.
Found this under a bit of a pile in my todo list. :)
Fix pushed.
Thanks!
Stephen