Estimating seq_page_fetch and random_page_fetch

Started by Umar Farooq Minhas · 9 messages
#1 Umar Farooq Minhas
umarfm13@hotmail.com

Hi,

How can we accurately estimate the "seq_page_fetch" and "random_page_fetch" costs from outside Postgres, for example with a C routine? Using a very simple C program, I have two sets of files. One set I access randomly, i.e. pulling data from random locations within the files; the other set is accessed sequentially. The goal is to approximate the disk I/O cost of a "random page fetch" and a "sequential page fetch" respectively. I am using low-level (unbuffered) C routines, i.e. read/write and lseek (for positioning the file pointer), with an 8 kB read/write buffer to match the Postgres page size, and Linux is the host OS.

Since Linux caches file data aggressively, I am using sets of files instead of a single file, in the hope that whenever a new file from a set is accessed for the first time it will NOT be in the OS cache, so the measurements reflect actually fetching the pages from the physical disk. The host is also restarted before each experiment to force a cold-cache start.
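
For concreteness, here is a minimal sketch of that kind of measurement loop, simplified to a single pre-created file; the file name, page count, and use of gettimeofday() for timing are placeholder choices, and on a warm cache the numbers will mostly reflect memory copies rather than disk fetches.

/*
 * Minimal sketch (illustrative): time sequential vs. random 8 kB reads
 * from one pre-created file. File name, page count, and gettimeofday()
 * timing are placeholder choices.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define PAGE_SIZE 8192                      /* match the Postgres page size */

static double seconds(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

int main(void)
{
    char buf[PAGE_SIZE];
    const char *path = "testfile.dat";      /* placeholder file name */
    long npages = 100000;                   /* placeholder: ~800 MB of pages */
    struct timeval t0, t1;
    long i;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* sequential pass: read the pages in order */
    gettimeofday(&t0, NULL);
    for (i = 0; i < npages; i++)
        if (read(fd, buf, PAGE_SIZE) != PAGE_SIZE) break;
    gettimeofday(&t1, NULL);
    printf("sequential: %.6f s/page\n", seconds(t0, t1) / npages);

    /* random pass: seek to a random page before every read */
    gettimeofday(&t0, NULL);
    for (i = 0; i < npages; i++) {
        off_t page = random() % npages;
        lseek(fd, page * (off_t) PAGE_SIZE, SEEK_SET);
        if (read(fd, buf, PAGE_SIZE) != PAGE_SIZE) break;
    }
    gettimeofday(&t1, NULL);
    printf("random:     %.6f s/page\n", seconds(t0, t1) / npages);

    close(fd);
    return 0;
}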

I am hoping somebody could point me in the right direction.

Thanks

-Umar

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Umar Farooq Minhas (#1)
Re: Estimating seq_page_fetch and random_page_fetch

"Umar Farooq Minhas" <umarfm13@hotmail.com> writes:

How can we accurately estimate the "seq_page_fetch" and "random_page_fetch"
costs from outside Postgres, for example with a C routine?

Use a test case larger than memory. Repeat many times to average out
noise. IIRC, when I did the experiments that led to the current
random_page_cost of 4.0, it took about a week before I had numbers I
trusted.

regards, tom lane

#3 Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#2)
Re: Estimating seq_page_fetch and random_page_fetch

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Umar Farooq Minhas" <umarfm13@hotmail.com> writes:

How can we accurately estimate the "seq_page_fetch" and "random_page_fetch"
costs from outside Postgres, for example with a C routine?

Use a test case larger than memory. Repeat many times to average out
noise. IIRC, when I did the experiments that led to the current
random_page_cost of 4.0, it took about a week before I had numbers I
trusted.

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said, but it might be useful as an
additional tool.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#4 Luke Lonergan
LLonergan@greenplum.com
In reply to: Gregory Stark (#3)
Re: Estimating seq_page_fetch and random_page_fetch

Adding to this:

Ayush recently wrote a C program that emulates PG I/O to do this analysis, and we came out with (predictably) a random-to-sequential cost ratio of 20-50 (for a single user). This is predictable because the random component is fixed at the access time of a single hard drive no matter how many disks are in an array, while sequential throughput scales nearly linearly with the number of drives in the array.

So, you can estimate random access at 8-12 ms per page, and sequential access as 1/(number of disks x 60-130 MB/s) per byte.
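
As a rough illustration of the arithmetic those figures imply, here is a small sketch; the disk count and the mid-range constants are arbitrary examples, and the resulting ratio is an idealized figure (measured ratios, like the 20-50 quoted above, typically come out lower).

/* Back-of-the-envelope arithmetic only; the constants are example values
 * picked from the ranges quoted above, not measurements. */
#include <stdio.h>

int main(void)
{
    const double page_bytes = 8192.0;       /* Postgres page size */
    const double random_ms  = 10.0;         /* 8-12 ms per random access */
    const double disk_mb_s  = 80.0;         /* 60-130 MB/s per disk */
    const int    ndisks     = 1;            /* single drive in this example */

    /* time to stream one page sequentially, in milliseconds */
    double seq_ms = page_bytes / (ndisks * disk_mb_s * 1024.0 * 1024.0) * 1000.0;

    printf("sequential page: %.3f ms\n", seq_ms);
    printf("random page:     %.1f ms\n", random_ms);
    printf("idealized random/sequential ratio: %.0f\n", random_ms / seq_ms);
    return 0;
}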

Ayush, can you forward your C program?

- Luke

Msg is shrt cuz m on ma treo

#5 Jim C. Nasby
jim@nasby.net
In reply to: Gregory Stark (#3)
Re: Estimating seq_page_fetch and random_page_fetch

On Thu, Mar 08, 2007 at 05:35:03PM +0000, Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Umar Farooq Minhas" <umarfm13@hotmail.com> writes:

How can we accrately estimate the "seq_page_fetch" and =
"random_page_fetch" costs from outside the postgres using for example a =
C routine.

Use a test case larger than memory. Repeat many times to average out
noise. IIRC, when I did the experiments that led to the current
random_page_cost of 4.0, it took about a week before I had numbers I
trusted.

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said, but it might be useful as an
additional tool.

Another trick I've used in the past is to just run the machine out of
memory, using the following:

/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for flushing disk buffers.
 * Usage: clearmem <megabytes>
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* request argv[1] megabytes of zeroed memory */
    if (argc < 2 || !calloc(atoi(argv[1]), 1024 * 1024))
        printf("Error allocating memory.\n");
    return 0;
}

I'll monitor top while that's running to ensure that some stuff gets
swapped out to disk. I believe this might still leave some cached data
in other areas of the kernel, but it's probably not enough to worry
about.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#6 Neil Conway
neilc@samurai.com
In reply to: Gregory Stark (#3)
Re: Estimating seq_page_fetch and random_page_fetch

On Thu, 2007-03-08 at 17:35 +0000, Gregory Stark wrote:

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

On recent Linux kernels, /proc/sys/vm/drop_caches can also be useful:

http://linux.inet.hr/proc_sys_vm_drop_caches.html

You could also use posix_fadvise() to achieve a similar effect on a per-file
basis.
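
A minimal sketch of that per-file approach follows; the function name and error handling here are illustrative, not from an actual patch.

/*
 * Illustration only: ask the kernel to drop its cached pages for a single
 * file. Offset 0 with length 0 means "the whole file".
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    int err;

    if (fd < 0) {
        perror("open");
        return -1;
    }
    sync();     /* dirty pages are not discarded, so flush them first */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    close(fd);
    return err ? -1 : 0;
}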

-Neil

#7 Umar Farooq Minhas
umarfm13@hotmail.com
In reply to: Luke Lonergan (#4)
Re: Estimating seq_page_fetch and random_page_fetch

Thanks a lot for your replies. The suggestions have proved very useful.
Ayush, I'm curious to see your C program, thanks.

Here is a related but different issue. I started looking at the Postgres optimizer/planner code a month back, to modify it for experiments that I need to conduct. The EXPLAIN command prints the total cost, i.e. CPU + I/O combined; for my purposes these two costs need to be separated, so that instead of one figure I get the CPU cost and the I/O cost displayed separately when I run EXPLAIN on a particular query. So far I haven't been able to figure out a 'clean' way of doing this. Can anyone tell me how much time I should expect to spend making such a change, and where should I start? costsize.c?

I have another question. Looking at the optimizer code, it pretty much looks insensitive to the memory factor. The only parameters being utilized are "effective_cache_size" (in estimating index cost only) and "work_mem" (for sorts, aggregation, groups, and hash/merge joins). Are these the only memory factors that directly affect the cost estimates of the planner/optimizer?

Again your help is appreciated.

-Umar

#8 Martijn van Oosterhout
kleptog@svana.org
In reply to: Umar Farooq Minhas (#7)
Re: Estimating seq_page_fetch and random_page_fetch

On Thu, Mar 08, 2007 at 07:01:17PM -0500, Umar Farooq Minhas wrote:

instead of one figure I get the CPU cost and the I/O cost displayed
separately when I run EXPLAIN on a particular query. So far I haven't
been able to figure out a 'clean' way of doing this. Can anyone tell me
how much time I should expect to spend making such a change, and where
should I start? costsize.c?

That's going to be a lot of work. You need to duplicate the variable
and every usage of that variable. And I can't imagine why you'd be
interested anyway...
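
To give a sense of what that duplication involves, here is a simplified sketch; the struct and function are invented for illustration and do not match PostgreSQL's actual costsize.c code, which keeps a single combined running cost per path.

/*
 * Illustration only: tracking CPU and I/O cost components separately.
 * The planner keeps one combined running cost per path, so a real change
 * means adding a second accumulator and touching every place in
 * costsize.c (and in EXPLAIN output) that uses the combined value.
 */
typedef struct SplitCost
{
    double cpu_cost;        /* from cpu_tuple_cost, cpu_operator_cost, ... */
    double io_cost;         /* from seq_page_cost, random_page_cost, ... */
} SplitCost;

/* A sequential-scan estimate written in this style (simplified formula). */
static SplitCost
seqscan_split_cost(double pages, double tuples,
                   double seq_page_cost, double cpu_tuple_cost)
{
    SplitCost c;

    c.io_cost  = seq_page_cost * pages;     /* disk component */
    c.cpu_cost = cpu_tuple_cost * tuples;   /* per-tuple CPU component */
    return c;
}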

I have another question. Looking at the optimizer code, it pretty much
looks insensitive to the memory factor. The only parameters being
utilized are "effective_cache_size" (in estimating index cost only) and
"work_mem" (for sorts, aggregation, groups, and hash/merge joins). Are
these the only memory factors that directly affect the cost estimates
of the planner/optimizer?

Sure, what other factors were you considering?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#9 Grzegorz Jaskiewicz
gj@pointblue.com.pl
In reply to: Umar Farooq Minhas (#7)
Re: Estimating seq_page_fetch and random_page_fetch

It would be interesting to see some code here.
Maybe this would be a great opportunity to start some sort of
'auto-tune' (as an option) in this area.

--
GJ

C/C++/SQL freelance to hire.