Dynamically gathering the values for seq_page_cost/xxx_cost
The optimizer cost model usually needs two inputs: one represents the
data distribution and the other represents the capacity of the
hardware (CPU, I/O, and so on); let's call the latter the system stats.

In Oracle, the system stats can be gathered with
dbms_stats.gather_system_stats [1] on the running hardware. In
PostgreSQL, the values are set based on experience (users can change
them as well, but it is hard to decide which values to use). The
PostgreSQL way is not perfect in theory (in practice it may or may not
be good enough); for example, HDDs and SSDs have very different
characteristics with regard to seq_page_cost/random_page_cost, and CPU
costs may differ across hardware as well.

I ran into a paper [2] which did some research on dynamically gathering
the values for xxx_cost, and it looks interesting. However, it doesn't
provide the code for others to do further research. Before I dive into
this, it would be great to hear some suggestions from experts.

So what do you think about this method? Have we had any discussion
about this before, and what was the result?
[1]: https://docs.oracle.com/database/121/ARPLS/d_stats.htm#ARPLS68580
[2]: https://dsl.cds.iisc.ac.in/publications/thesis/pankhuri.pdf
Thanks
On Tue, Nov 26, 2019 at 08:59:22AM +0800, Andy Fan wrote:
> [...]
IMHO it would be great to have a tool that helps with tuning those
parameters, particularly random_page_cost. I'm not sure how feasible it
is, but if you're willing to do some initial experiments and research,
I think it's worth looking into.

It's going to be challenging, though, because even random_page_cost=4
mismatches the "raw" characteristics of any existing hardware. On old
drives the sequential/random difference is far worse; on SSDs it's
about right. But then again, we know random_page_cost=1.5 or so works
mostly fine on SSDs, and that's much lower than the raw numbers would
suggest.

So it's clearly one thing to measure HW capabilities, and it's another
thing to conclude what the parameters should be ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 28, 2019 at 12:48 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
> [...]
I recently tried something in this direction, and the results look
promising based on my limited tests.

Since the unit of every xxx_cost is seq_page_cost, how to detect
seq_page_cost is important. In the cost model, the I/O cost of a
seqscan is rel->pages * seq_page_cost; it doesn't consider any cache
(file system cache or shared buffer cache), but it does assume the OS
will prefetch the I/O. So to detect seq_page_cost, I enabled
prefetching but avoided the file system cache. I tested this by
1) dropping the file system cache, and 2) opening the test file without
O_DIRECT so that prefetching can work.
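For illustration, a minimal sketch of that sequential measurement,
assuming an 8kB page size and a pre-created 1GB test file (the file
name, sizes and timing loop here are my assumptions, not necessarily
the attached main.c). The key point is opening without O_DIRECT so the
kernel read-ahead plays the role the cost model attributes to OS
prefetching; the file system cache is assumed to have been dropped
beforehand (e.g. "echo 3 > /proc/sys/vm/drop_caches"):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define NPAGES  131072                    /* 1GB test file */

int main(void)
{
    char            buf[PAGE_SZ];
    struct timespec t0, t1;
    int             fd = open("testfile", O_RDONLY);  /* no O_DIRECT */

    if (fd < 0) { perror("open"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < NPAGES; i++)     /* read the file front to back */
        if (read(fd, buf, PAGE_SZ) != PAGE_SZ) { perror("read"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    printf("seq_read_lat = %fus\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_nsec - t0.tv_nsec) / 1e3) / NPAGES);
    return 0;
}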
To detect the random page read cost, I read pages with pread() at a
random offset. Since the random offsets may repeat during the test,
dropping the file system cache only at the beginning isn't enough, so I
open the file with the O_DIRECT option.
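Under the same assumptions, a sketch of the random-read side (again my
reconstruction, not the attached program). O_DIRECT bypasses the file
system cache, so repeated offsets are not served from memory; on Linux
it requires an aligned buffer, hence posix_memalign():

#define _GNU_SOURCE                       /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define NPAGES  131072                    /* 1GB test file */
#define NREADS  4096

int main(void)
{
    void           *buf;
    struct timespec t0, t1;
    int             fd = open("testfile", O_RDONLY | O_DIRECT);

    if (fd < 0 || posix_memalign(&buf, 4096, PAGE_SZ) != 0)
    { perror("setup"); return 1; }
    srand(time(NULL));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NREADS; i++)      /* 8kB reads at random offsets */
    {
        off_t off = (off_t) (rand() % NPAGES) * PAGE_SZ;

        if (pread(fd, buf, PAGE_SZ, off) != PAGE_SZ)
        { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    printf("random_page_lat = %fus\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_nsec - t0.tv_nsec) / 1e3) / NREADS);
    return 0;
}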
I also measured the cost of reading a page from the file system cache;
in my test it is about 10% of a sequential read.

After I get the basic numbers about the hardware's capability, I let
the user provide a cache hit ratio (this is a place where we can
improve further if this turns out to be the right direction).
Here is the test result on my hardware.
fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 5.073692
cache hit ratio: 0.500000 random_page_cost 7.957589
cache hit ratio: 0.100000 random_page_cost 8.551591
cache hit ratio: 0.000000 random_page_cost 8.633049
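For what it's worth, the table above is consistent with blending the
cache-hit latency into both the random and the sequential side and
taking their ratio, with seq_page_cost = 1 as the unit. A sketch of
that calculation, reconstructed from the numbers rather than taken from
main.c (the function name is made up):

/*
 * E.g. for hit ratio 0.9:
 *   (0.9*0.832 + 0.1*73.988) / (0.9*0.832 + 0.1*8.570) = 5.07
 * which matches the reported 5.073692.
 */
static double
suggested_random_page_cost(double hit_ratio, double fs_cache_lat,
                           double seq_read_lat, double random_page_lat)
{
    double rand_lat = hit_ratio * fs_cache_lat +
                      (1.0 - hit_ratio) * random_page_lat;
    double seq_lat  = hit_ratio * fs_cache_lat +
                      (1.0 - hit_ratio) * seq_read_lat;

    return rand_lat / seq_lat;
}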
Then I tested the suggested value with the 10GB TPC-H workload. I
compared the plans under two settings of random_page_cost: 1) 4, the
default value, and 2) 8.6, the cache hit ratio = 0 case. 11 out of the
22 queries generated a different plan. Finally, I dropped the caches
(both the file system cache and shared_buffers) before running each
query and ran the 11 queries under the two settings. The execution
times (in milliseconds) are below.
| | random_page_cost=4 | random_page_cost=8.6 |
|-----+--------------------+----------------------|
| Q1 | 1425.964 | 1121.928 |
| Q2 | 2553.072 | 2567.450 |
| Q5 | 4397.514 | 1475.343 |
| Q6 | 12576.985 | 4622.503 |
| Q7 | 3459.777 | 2987.241 |
| Q8 | 8360.995 | 8415.311 |
| Q9 | 4661.842 | 2930.370 |
| Q11 | 4885.289 | 2348.541 |
| Q13 | 2610.937 | 1497.776 |
| Q20 | 13218.122 | 10985.738 |
| Q21 | 264.639 | 262.350 |
The attached main.c is the program I used to detect the
random_page_cost. result.tar.gz is the test result; you can run a git
log first to see the differences in plans and execution stats.

Any feedback is welcome. Thanks!
--
Best Regards
Andy Fan
On Fri, Sep 18, 2020 at 09:28:10PM +0800, Andy Fan wrote:
> [...]
That seems pretty neat. What kind of hardware have you done these tests
on? It's probably worth testing on various other storage systems to see
how that applies to those.
Have you tried existing I/O testing tools, e.g. fio? If your idea is to
propose some built-in tool (similar to pg_test_fsync) then we probably
should not rely on external tools, but I wonder if we're getting the
same numbers.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas:
Thanks for checking.
On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
> [...]
> That seems pretty neat. What kind of hardware have you done these
> tests on?
The following is my hardware info: I have 12 SSDs behind an MR9271-8i
RAID controller [1], which has a 1GB buffer.
root# lshw -short -C disk
H/W path Device Class Description
==============================================================
/0/100/2/0/2.0.0 /dev/sda disk 2398GB MR9271-8i
/0/100/2/0/2.1.0 /dev/sdb disk 5597GB MR9271-8i <-- my data location
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL
Adapter #0
Memory Size : 1024MB
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
Device Present
================
Virtual Drives : 2
Degraded : 0
Offline : 0
Physical Devices : 14
Disks : 12
Critical Disks : 0
Failed Disks : 0
root# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
(the same two lines repeat for each of the 12 disks)
CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.
> It's probably worth testing on various other storage systems to see
> how that applies to those.
Yes, I can test more on new hardware once I get it; that is still in
progress. However, I can only get a physical machine with SSDs or a
virtual machine with SSDs; other storage types are hard for me to get
right now.
> Have you tried existing I/O testing tools, e.g. fio? If your idea is
> to propose some built-in tool (similar to pg_test_fsync) then we
> probably should not rely on external tools, but I wonder if we're
> getting the same numbers.
Thanks for this hint; I found more interesting stuff during the
comparison. I defined the fio jobs as below.
random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio
[random_page_cost]
direct=1
readwrite=randread
Even with direct I/O, the device cache still plays an important part:
it gets filled while preparing the test data file. I invalidated the
device cache by writing a new dummy file. With that, the average
latency is 148 us.
seq.job
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio
[seq_page_cost]
buffered=1
readwrite=read
For the sequential read, we need buffered I/O so that prefetching
works; however, we also need to bypass the file system cache and the
device cache. fio has no control over such caches, so I did:
1) Run fio to generate the test file.
2) Invalidate the device cache with dd if=/dev/zero of=a_dummy_file
bs=1048576 count=1024.
3) Drop the file system cache.
4) Run fio again.
The final average latency is ~12 us.

This is a 1.5~2x difference from my previous result (seq_read_lat =
8.570290us, random_page_lat = 73.987732us).
Here are the results after some changes to my detection program:

|                                                 | seq_read_lat (us) | random_read_lat (us) |
|-------------------------------------------------+-------------------+----------------------|
| FIO                                             |                12 |                  148 |
| Previous main.c                                 |               8.5 |                   74 |
| invalidate_device_cache before each testing     |                 9 |                  150 |
| prepare the test data file with O_DIRECT option |                15 |                  150 |
In invalidate_device_cache, I just create another 1GB data file and
read it (see the invalidate_device_cache function); this is similar to
the previous fio setup.
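For reference, a sketch of what such an invalidation step could look
like (the real invalidate_device_cache is in the attached program; the
path, flags and sizes here are illustrative): write a 1GB dummy file
and read it back, so the RAID controller's 1GB cache gets filled with
dummy blocks and the test file's blocks are evicted.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define NPAGES  131072                    /* 1GB dummy file */

static void
invalidate_device_cache(const char *dummy_path)
{
    char buf[PAGE_SZ];
    int  fd = open(dummy_path, O_RDWR | O_CREAT | O_SYNC, 0600);

    if (fd < 0) { perror("open"); return; }
    memset(buf, 'x', sizeof(buf));
    for (long i = 0; i < NPAGES; i++)     /* fill the controller cache */
        write(fd, buf, sizeof(buf));
    lseek(fd, 0, SEEK_SET);
    for (long i = 0; i < NPAGES; i++)     /* ... and read it back */
        read(fd, buf, sizeof(buf));
    close(fd);
}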
"Prepare the test data file with the O_DIRECT option" refers to the
fact that I previously prepared the test file with buffered I/O; even
though I invalidated the device cache and the file system cache before
testing, the file prepared with buffered I/O still got better
performance, and I have no idea why. Since I don't want any cache
involved, I switched to the O_DIRECT option; the seq_read_lat then
changed from 9us to 15us.

I still can't explain the ~25% difference from the fio result (12 us
vs 9 us).
In the end, the random_page_cost happens not to change very much.

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us
cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383
This result looks very different from the common advice that "we
should use 1.1 ~ 1.5 for SSDs".

The attached is the modified detection program.
[1]: https://www.cdw.com/product/lsi-megaraid-sas-9271-8i-storage-controller-raid-sas-pcie-3.0-x8/4576538#PO
--
Best Regards
Andy Fan
On Mon, Sep 21, 2020 at 9:11 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> [...]
Very interesting. Thanks for working on this. In an earlier email you
mentioned that the TPC-H plans changed to more efficient ones when you
changed random_page_cost from 4 to 8.6 while seq_page_cost was set to
1. IIUC, setting the ratio of random_page_cost to seq_page_cost to the
ratio between the corresponding latencies improved the plans. How about
trying this with the ratio obtained from the latencies reported by fio?
Do we see any better plans?
Page cost is one thing, but there are CPU costs involved in the costs
of expression evaluation as well. Should those be changed according to
the CPU latency?
--
Best Wishes,
Ashutosh Bapat
Thanks for chiming in, Ashutosh :)
On Mon, Sep 21, 2020 at 9:03 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
> [...]
> Very interesting. Thanks for working on this. In an earlier email you
> mentioned that the TPC-H plans changed to more efficient ones when you
> changed random_page_cost from 4 to 8.6 while seq_page_cost was set to
> 1. IIUC, setting the ratio of random_page_cost to seq_page_cost to the
> ratio between the corresponding latencies improved the plans.
Yes.
> How about trying this with the ratio obtained from the latencies
> reported by fio? Do we see any better plans?
My tool set random_page_cost to 8.6, but based on the fio data it
should be set to 12.3 (148us / 12us) on the same hardware, and I do see
better plans with 12.3 as well. It looks almost too smooth to be true.

The attached result_fio_mytool.tar.gz is my test result. 'git show
HEAD^^' shows the original plans with 8.6, 'git show HEAD^' shows the
plan changes after random_page_cost was changed, and 'git show HEAD'
shows the changes in run-time statistics for these queries. I also
uploaded the test tool [1] for your double-checking. The execution
times (in milliseconds):
|     | random_page_cost=8.6 | random_page_cost=12.3 |
|-----+----------------------+-----------------------|
| Q2  |             2557.064 |              2444.995 |
| Q4  |             3544.606 |              3148.884 |
| Q7  |             2965.820 |              2240.185 |
| Q14 |             4988.747 |              4931.375 |
> Page cost is one thing, but there are CPU costs involved in the costs
> of expression evaluation as well. Should those be changed according to
> the CPU latency?

Yes, we need that as well; at the beginning of this thread I treated
all of them equally. In his first reply, Tomas mentioned
random_page_cost specifically. After ~10 months, I tested TPC-H on this
hardware and found random_page_cost was set quite incorrectly; after
fixing it, the results look much better. So I'd like to work on this
particular parameter first.
[1]: https://github.com/zhihuiFan/tpch-postgres/blob/master/random_page_cost.sh
--
Best Regards
Andy Fan
Attachments:
result_fio_mytool.tar.gz