ice-broker scan thread
I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
classic "read-ahead" method, but with one difference: it does not read the
data into the shared buffer pool directly; instead, it reads the data into
the file system cache, which makes the integration easy, and as far as I
know this approach is unique to PostgreSQL.
What happens in the original sequential scan:

for (;;)
{
    /*
     * A physical read may happen here, depending on the current
     * contents of the file system cache and on whether the kernel
     * is smart enough to recognize a sequential scan.
     */
    physically or logically read a page;
    process the page;
}
What happens in the sequential scan with the ice-broker:

for (;;)
{
    /* the ice-broker has most likely read the page in already */
    logically read a page (with high probability);
    process the page;
}
I wrote a program to simulate the PostgreSQL sequential scan with and
without the ice-broker. The results indicate this technique has the
following characteristics:
(1) The main factor determining the speedup is how much CPU time PostgreSQL
spends on each data page. If PG is fast enough, then no speedup occurs;
otherwise a 10% to 20% speedup can be expected, based on my tests.
(2) It uses more CPU - this is easy to understand, since it does more
work.
(3) The benefit also depends on other factors, like how smart your file
system is ...
Here are some test results on my machine:
---
$#uname -a
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
$#cat /proc/meminfo | grep MemTotal
MemTotal: 1030988 kB
$#cat /proc/cpuinfo | grep CPU
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
$#./seqscan 10 $HOME/pginstall/bin/data/base/10794/18986 50
PostgreSQL sequential scan simulator configuration:
Memory size: 943718400
CPU cost per page: 50
Scan thread read unit size: 4
With scan threads off - duration: 56862.738 ms
With scan threads on - duration: 40611.101 ms
With scan threads off - duration: 46859.207 ms
With scan threads on - duration: 38598.234 ms
With scan threads off - duration: 56919.572 ms
With scan threads on - duration: 47023.606 ms
With scan threads off - duration: 52976.825 ms
With scan threads on - duration: 43056.506 ms
With scan threads off - duration: 54292.979 ms
With scan threads on - duration: 42946.526 ms
With scan threads off - duration: 51893.590 ms
With scan threads on - duration: 42137.684 ms
With scan threads off - duration: 46552.571 ms
With scan threads on - duration: 41892.628 ms
With scan threads off - duration: 45107.800 ms
With scan threads on - duration: 38329.785 ms
With scan threads off - duration: 47527.787 ms
With scan threads on - duration: 38293.581 ms
With scan threads off - duration: 48810.656 ms
With scan threads on - duration: 39018.500 ms
---
Notice that the cpu_cost=50 above might look too big (if you look into the
code) - but in a concurrent situation it is not that large. Also, on my
Windows box (PIII, 800 MHz), a cpu_cost of 5 is enough to show the 10%
benefit.
So in general it does help in some situations, but it is not rocket science,
since we can't predict the performance of the file system. It is fairly
easy to integrate, and we should add a GUC parameter to control it.
We need more tests; any comments and tests are welcome.
Regards,
Qingqing
---
/*
* seqscan.c
* PostgreSQL sequential scan simulator with helper scan thread
*
 * Note
 *   I wrote this simulator to see if there is any benefit for a sequential
 *   scan when a helper thread does read-ahead. The only thing you may want
 *   to change in the source file is MEMSZ; make it big enough to thrash
 *   your file system cache.
 *
 * Use the following command to compile:
 *   $ gcc -O2 -Wall -pthread seqscan.c -o seqscan -lm
 * To use it:
 *   $ ./seqscan <rounds> <datafile> <cpu_cost>
 * where <rounds> is how many times you want to run the test (note that each
 * round includes two disk-burn tests), <datafile> is the path to any file
 * (suggested size > 100 MB), and <cpu_cost> is the cost of processing each
 * page of the file. Try different cpu_cost values.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <memory.h>
#include <errno.h>
#include <math.h>
#ifdef WIN32
#include <io.h>
#include <windows.h>
#define PG_BINARY O_BINARY
#else
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/file.h>
#define PG_BINARY 0
#endif
typedef char bool;
#define true ((bool) 1)
#define false ((bool) 0)
#define BLCKSZ 8192
#define UNITSZ 4
#define MEMSZ (950*1024*1024)
char *data_file;
int cpu_cost;
volatile bool stop_scan;
char thread_buffer[BLCKSZ*UNITSZ];
static void
cleanup_cache(void)
{
    char *p;

    if (NULL == (p = (char *) malloc(MEMSZ)))
    {
        fprintf(stderr, "insufficient memory\n");
        exit(-1);
    }
    memset(p, 'a', MEMSZ);
    free(p);
}
#ifdef WIN32
bool enable_aio = false;

static const unsigned __int64 epoch = 116444736000000000L;

static int
gettimeofday(struct timeval *tp, struct timezone *tzp)
{
    FILETIME file_time;
    SYSTEMTIME system_time;
    ULARGE_INTEGER ularge;

    GetSystemTime(&system_time);
    SystemTimeToFileTime(&system_time, &file_time);
    ularge.LowPart = file_time.dwLowDateTime;
    ularge.HighPart = file_time.dwHighDateTime;
    tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
    tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
    return 0;
}

static void
sleep(int secs)
{
    SleepEx(secs * 1000, true);
}
static int
thread_open()
{
    HANDLE fd;
    SECURITY_ATTRIBUTES sa;

    sa.nLength = sizeof(sa);
    sa.bInheritHandle = TRUE;
    sa.lpSecurityDescriptor = NULL;
    fd = CreateFile(data_file,
                    GENERIC_READ,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    &sa,
                    OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN
                        | (enable_aio ? FILE_FLAG_OVERLAPPED : 0),
                    NULL);
    if (fd == INVALID_HANDLE_VALUE)
    {
        int errCode;

        switch (errCode = GetLastError())
        {
            /* EMFILE, ENFILE should not occur from CreateFile. */
            case ERROR_PATH_NOT_FOUND:
            case ERROR_FILE_NOT_FOUND:
                errno = ENOENT;
                break;
            case ERROR_FILE_EXISTS:
                errno = EEXIST;
                break;
            case ERROR_ACCESS_DENIED:
                errno = EACCES;
                break;
            default:
                fprintf(stderr, "thread_open failed: %d\n", errCode);
                errno = EINVAL;
        }
        return -1;
    }
    return (int) fd;
}
static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ * blkno;
    DWORD nbytes = 0;       /* ReadFile wants an LPDWORD; 0 if still pending */
    OVERLAPPED ol;

    memset(&ol, 0, sizeof(OVERLAPPED));
    ol.Offset = offset;
    ol.OffsetHigh = 0;
    if (ReadFile((HANDLE) fd, buf, BLCKSZ * nblk, &nbytes, &ol))
    {
        /* successfully done without delay */
    }
    else
    {
        int errCode;

        switch (errCode = GetLastError())
        {
            case ERROR_IO_PENDING:      /* overlapped read still in flight */
                break;
            case ERROR_HANDLE_EOF:
                break;
            default:
                /* unknown error occurred */
                fprintf(stderr, "asyncread failed: %d\n", errCode);
                exit(-1);
        }
    }
    return nbytes;
}
static void
thread_close(int fd)
{
    CloseHandle((HANDLE) fd);
}
#else   /* non-windows platforms */
static int
thread_open()
{
    int fd;

    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "thread_open failed: %d\n", errno);
        exit(-1);
    }
    return fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ * blkno;
    long nbytes;

    if (lseek(fd, offset, SEEK_SET) < 0)
    {
        fprintf(stderr, "thread_read lseek failed: %d\n", errno);
        exit(-1);
    }
    nbytes = read(fd, buf, BLCKSZ * nblk);
    if (nbytes <= 0)
    {
        fprintf(stderr, "thread_read failed: %d\n", errno);
        exit(-1);
    }
    return nbytes;
}

static void
thread_close(int fd)
{
    close(fd);
}
#endif
#ifdef WIN32
static DWORD WINAPI
scan_thread(LPVOID args)
#else
static void *
scan_thread(void *args)
#endif
{
    int i, fd;
    int start, end;

    start = 0;
    end = (size_t) args;
    fd = thread_open();
    for (i = start; i < end; i += UNITSZ)
    {
        thread_read(fd, i, UNITSZ, (char *) thread_buffer);
        /* check if I was asked to stop */
        if (stop_scan == true)
            break;
    }
    thread_close(fd);
    return 0;
}
static int
init_scan(bool with_threads, size_t *nblocks)
{
    int fd;

    /* open file for do_scan */
    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "failed to open file %s\n", data_file);
        exit(-1);
    }
    *nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;
    if (*nblocks < 0)
    {
        fprintf(stderr, "failed to get file length %s\n", data_file);
        exit(-1);
    }
    if (with_threads)
    {
#ifndef WIN32
        pthread_t thread;
#endif
        /* create scan thread */
        stop_scan = false;
#ifdef WIN32
        if (NULL == CreateThread(NULL, 0,
                                 scan_thread, (void *) (*nblocks),
                                 0, NULL))
#else
        if (pthread_create(&thread, NULL,
                           scan_thread, (void *) (*nblocks)))
#endif
        {
            fprintf(stderr, "failed to start scan thread\n");
            exit(-1);
        }
    }
    return fd;
}
static void
do_scan(int fd, size_t nblocks)
{
    int i, j, k, nbytes;
    char buffer[BLCKSZ];

    for (i = 0; i < nblocks; i++)
    {
        nbytes = lseek(fd, i * BLCKSZ, SEEK_SET);
        nbytes = read(fd, buffer, BLCKSZ);
        if (nbytes != BLCKSZ)
        {
            fprintf(stderr, "do_scan read failed\n");
            exit(-1);
        }
        /* pretend to do some CPU-intensive analysis */
        for (k = 0; k < cpu_cost; k++)
        {
            for (j = (k * sizeof(int)) % BLCKSZ;
                 j < BLCKSZ / (5 * sizeof(int));
                 j += sizeof(int))
            {
                int x, y;

                x = ((int *) buffer)[j];
                x = (int) pow((double) x, (double) (x + 1));
                y = (int) sin((double) x * x);
                ((int *) buffer)[j] = x * y;
            }
        }
    }
}
static void
close_scan(int fd)
{
    stop_scan = true;
    close(fd);
}
int
main(int argc, char *argv[])
{
    int i, rounds, fd;
    size_t nblocks;

    if (argc != 4)
    {
        fprintf(stderr, "usage: seqscan <rounds> <datafile> <cpu_cost>\n");
        exit(-1);
    }
    rounds = atoi(argv[1]);
    data_file = argv[2];
    cpu_cost = atoi(argv[3]);
    fd = init_scan(false, &nblocks);
    close_scan(fd);
    fprintf(stdout, "PostgreSQL sequential scan simulator configuration:\n"
            "\tMemory size: %u\n"
            "\tCPU cost per page: %d\n"
            "\tScan thread read unit size: %d\n\n",
            MEMSZ, cpu_cost, UNITSZ);
    for (i = 0; i < 2 * rounds; i++)
    {
        struct timeval start_t, stop_t;
        long usecs;
        bool enable = i % 2 ? true : false;

        /* eliminate system cached data */
        cleanup_cache();
        sleep(2);

        /* do the scan task */
        gettimeofday(&start_t, NULL);
        fd = init_scan(enable, &nblocks);
        do_scan(fd, nblocks);
        close_scan(fd);
        gettimeofday(&stop_t, NULL);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000
            + (long) (stop_t.tv_usec - start_t.tv_usec);
        fprintf(stdout, "With scan threads %s - duration: %ld.%03ld ms\n",
                enable ? "on" : "off",
                usecs / 1000, usecs % 1000);
        sleep(2);
    }
    exit(0);
}
Qingqing Zhou wrote:
I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
"read-ahead" method, but the difference is this one does not read the data
into shared buffer pool directly, instead, it reads the data into file
system cache, which makes the integration easy and this is unique to
PostgreSQL.
Interesting, and I wondered about this too. But for my taste the
demonstrated benefit really
isn't large enough to make it worthwhile.
BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where
it attempts
to detect the application's access pattern including if it is reading
sequentially and even
if there is a 'stride' to the accesses when they're not contiguous. I
would imagine that
other filesystems attempt similar tricks. So one might expect a simple
linear prefetch
to not help much in the presence of such a filesystem.
Were you worried about the icebreaker thread getting too far ahead of
the scan ?
If it did it might page out the data you're about to read, I think. Of
course this could
be fixed by having the read-ahead thread periodically check the current
location being
read by the query thread and pausing if it's got too far ahead.
Anyway, the recent performance thread has been interesting to me because
in all my career
I've never seen a database that scanned scads of data from disk to
process a query.
Typically the problems I work on arrange to read the entire database
into memory.
I think I need to get out more... ;)
On Mon, 28 Nov 2005, Qingqing Zhou wrote:
I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
"read-ahead" method, but the difference is this one does not read the data
into shared buffer pool directly, instead, it reads the data into file
system cache, which makes the integration easy and this is unique to
PostgreSQL.
MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'. I've been experimenting with two ideas. The first is to
increase the readahead when we're doing sequential scans (see prototype
patch using posix fadvise attached). I've not got any hardware at the
moment which I can test this patch on but I am waiting on some dbt-3
results which should indicate whether fadvise is a good idea or a bad one.
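The fadvise patch itself is in the attachment and not reproduced in this
thread; purely as an illustration of the system call involved (the helper
name below is invented for the example), declaring a sequential access
pattern on a just-opened relation file might look like:

```c
/*
 * Sketch only: tell the kernel the whole file will be read sequentially,
 * so it may enlarge its readahead window. posix_fadvise() is only a
 * hint; a nonzero return can safely be ignored.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>

static int
advise_sequential(int fd)
{
    /* offset 0, len 0 means "from the start to the end of the file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

In a backend this would presumably be issued once, right after open(),
before the scan loop starts reading blocks.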
The second idea is using posix async IO at key points within the system
to better parallelise CPU and IO work. The areas where I think we could use
async IO are: during sequential scans, use async IO to do pre-fetching of
blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
and, inside the background writer/check point process, asynchronously
write out pages and, potentially, asynchronously build new checkpoint segments.
The motivation for using async IO is twofold: first, the results of this
paper[1] are compelling; second, modern OSs support async IO. I know that
Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that
all their rivals have it as well.
The fundamental premise of the paper mentioned above is that if the
database is busy, IO should be busy. With our current block-at-a-time
processing, this isn't always the case. This is why Qingqing's read-ahead
thread makes sense. My reason for mailing is, however, that the async IO
results are more compelling than the read ahead thread.
I haven't had time to prototype whether we can easily implement async IO
but I am planning to work on it in December. The two main goals will be to
a) integrate and utilise async IO, at least within the executor context,
and b) build a primitive kind of scheduler so that we stop prefetching
when we know that there are a certain number of outstanding IOs for a
given device.
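Goal (b) above is just bookkeeping; a minimal sketch of such a scheduler
(all names and the per-device cap are made up for the example) could be as
simple as:

```c
/*
 * Sketch of a trivial prefetch "scheduler": refuse to issue a new
 * prefetch when a device already has max_outstanding IOs in flight.
 * The prefetch code would consult sched_may_prefetch() before issuing
 * an async read, and bracket each IO with started/done calls.
 */
#define MAX_DEVICES 8

typedef struct
{
    int outstanding[MAX_DEVICES];   /* in-flight IOs per device */
    int max_outstanding;            /* per-device cap */
} IoSched;

/* returns 1 if the caller may issue another prefetch on this device */
static int
sched_may_prefetch(IoSched *s, int device)
{
    return s->outstanding[device] < s->max_outstanding;
}

static void
sched_io_started(IoSched *s, int device)
{
    s->outstanding[device]++;
}

static void
sched_io_done(IoSched *s, int device)
{
    s->outstanding[device]--;
}
```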
Thanks,
Gavin
[1]: http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
[2]: http://lse.sourceforge.net/io/aionotes.txt
[3]: http://developers.sun.com/solaris/articles/event_completion.html - I'm fairly sure they have a posix AIO wrapper around these routines, but I cannot see it documented anywhere :-(
Attachments:
fadvise.diff (text/plain; charset=US-ASCII) [+75/-19]
Qingqing,
I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
"read-ahead" method, but the difference is this one does not read the
data
into shared buffer pool directly, instead, it reads the data into file
system cache, which makes the integration easy and this is unique to
PostgreSQL.
You probably mean "ice-breaker" by the way :)
Chris
Gavin Sherry <swm@linuxworld.com.au> writes:
I haven't had time to prototype whether we can easily implement async IO
Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in. Otherwise the portability issues are just going to make it not
worth the trouble.
regards, tom lane
Gavin Sherry wrote:
MySQL, Oracle and others implement read-ahead threads to simulate async IO
I always believed that Oracle used async file I/O. Not that I've seen their
code, but I'm fairly sure they funded the addition of kernel aio to Linux
a few years back.
But....Oracle comes from a time long ago when threads and decent
filesystems didn't exist, so some of the things they do may not be
appropriate
to add to a product that doesn't have them today.
Now...network async I/O...that'd be really useful in my world...
On Mon, 28 Nov 2005, David Boreham wrote:
Gavin Sherry wrote:
MySQL, Oracle and others implement read-ahead threads to simulate async IO
I always believed that Oracle used async file I/O. Not that I've seen their
code, but I'm fairly sure they funded the addition of kernel aio to Linux
a few years back.
That's right.
But....Oracle comes from a time long ago when threads and decent
filesystems didn't exist, so some of the things they do may not be
appropriate
to add to a product that doesn't have them today.
The paper I linked to seemed to suggest that they weren't using async IO
in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
10g.
Gavin
Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
I haven't had time to prototype whether we can easily implement async IO
Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in. Otherwise the portability issues are just going to make it not
worth the trouble.
Do these ideas require threads in principle? ISTM that there could be
(additional) process(es) waiting to perform pre-fetching or async io,
and we could use the usual IPC machinery to talk between them...
cheers
Mark
The paper I linked to seemed to suggest that they weren't using async IO
in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
10g.
...<reads paper>... ok, interesting. Did they say that Oracle isn't
using aio? I can't see that. They say that Oracle has no more than one
outstanding I/O operation in flight per concurrent query,
and they appear to think that's a bad thing. I'm not seeing
that myself. Perhaps once I sleep on it, it'll become clear what they're
getting at.
One theory for lack of aio in Oracle as tested in that paper would be
that they
were testing on Linux. Since aio is relatively new in Linux I wouldn't
be surprised
if Oracle didn't actually use it until it's known to be widely deployed
in the field
and to have proven reliability. Perhaps we've reached that state around now,
and so Oracle may not yet have released an aio-capable Linux version of
their
RDBMS. Just a theory...someone from those tubular towers lurking here
could tell us for sure I guess...
On Mon, 28 Nov 2005, Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
I haven't had time to prototype whether we can easily implement async IO
Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in. Otherwise the portability issues are just going to make it not
worth the trouble.
The architecture I am looking at would not rely on threads.
I didn't want to jump on list and wave my hands until I had something to
show, but since Qingqing is looking at the issue I thought I'd better raise
it.
Gavin
Gavin Sherry wrote:
The paper I linked to seemed to suggest that they weren't using async IO
in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
10g.
There have been async io type parameters in Oracle's init.ora files since
(at least) 8i (disk_async_io=true IIRC) - on Solaris anyway. Whether
this enabled real or simulated async io is probably a good question - I
recall during testing turning it off and seeing kio()? or similar type
calls become write()/read() in truss output.
regards
Mark
On Mon, 28 Nov 2005, Mark Kirkwood wrote:
Do these ideas require threads in principle? ISTM that there could be
(additional) process(es) waiting to perform pre-fetching or async io,
and we could use the usual IPC machinary to talk between them...
Right. I use threads because it is easy to write a simulation program :-)
Regards,
Qingqing
FYI, I've personally used Oracle 9.2.0.4's async IO on Linux and have seen
several installations which make use of it also.
On 11/28/05, Gavin Sherry <swm@linuxworld.com.au> wrote:
On Mon, 28 Nov 2005, Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
I haven't had time to prototype whether we can easily implement async
IO
Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in. Otherwise the portability issues are just going to make it not
worth the trouble.
The architecture I am looking at would not rely on threads.
I didn't want to jump on list and wave my hands until I had something to
show, but since Qingqing is looking at the issue I thought I better raise
it.
Gavin
On Mon, 28 Nov 2005, Gavin Sherry wrote:
MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'.
Based on my tests on Windows (using the attached program with
enable_aio=true), it seems aio doesn't help as a separate thread - but maybe
that's because my usage is wrong ...
Regards,
Qingqing
On Mon, 28 Nov 2005, Gavin Sherry wrote:
I didn't want to jump on list and wave my hands until I had something to
show, but since Qingqing is looking at the issue I thought I better raise
it.
Don't worry :-) I separated the logic into a standalone program so that
more people can help on this issue.
Regards,
Qingqing
"David Boreham" <david_list@boreham.org> wrote
BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where
it attempts to detect the application's access pattern including if it is
reading sequentially and even if there is a 'stride' to the accesses when
they're not contiguous. I would imagine that other filesystems attempt
similar tricks. So one might expect a simple linear prefectch
to not help much in the presence of such a filesystem.
So we need more tests. I understand how smart current file systems are, and
it seems the benefit depends on the interval at which you send the next
file block read request (decided by the cpu_cost parameter in my program).
I imagine that on a multi-way machine with a strong IO device, the
ice-breaker could do much better ...
Were you worried about the icebreaker thread getting too far ahead of the
scan ? If it did it might page out the data you're about to read, I think.
Of course this could be fixed by having the read ahead thread perodically
check the current location being read by the query thread and pausing if
it's got too far ahead.
Right.
Regards,
Qingqing
Qingqing Zhou wrote:
On Mon, 28 Nov 2005, Gavin Sherry wrote:
MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'.
Due to my tests on Windows (using the attached program and change
enable_aio=true), seems aio doesn't help as a separate thread - but maybe
because my usage is wrong ...
I don't think your NT overlapped I/O code is quite right. At least
I think it will issue reads at a high rate without waiting for any of them
to complete. Beyond some point that has to give the kernel gut-rot.
But anyway, I wouldn't expect the use of aio to make any
significant difference in an already threaded test program.
The point of aio is to allow
I/O concurrency _without_ the use of threads or multiple processes.
You could re-write your program to have a single thread but use aio.
In that case it should show the same read ahead benefit that you see
with the thread.
On Mon, 28 Nov 2005, Qingqing Zhou wrote:
On Mon, 28 Nov 2005, Gavin Sherry wrote:
MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'.
Due to my tests on Windows (using the attached program and change
enable_aio=true), seems aio doesn't help as a separate thread - but maybe
because my usage is wrong ...
Right, I would imagine that it's very close. I intend to use kernel based
async IO so that we can have the prefetch effect of your sample program
without the need for threads.
Thanks,
Gavin
"David Boreham" <david_list@boreham.org> wrote
I don't think your NT overlapped I/O code is quite right. At least
I think it will issue reads at a high rate without waiting for any of them
to complete. Beyond some point that has to give the kernel gut-rot.
[also with reply to Gavin] Looked up "gut-rot" in the dictionary, got it ...
Uh, this behavior is intended - I try to push enough requests to the kernel
in a short time so that it understands that I am doing a sequential scan,
and so will pull the data from disk into the file system cache more
efficiently. Some file systems may have a "free-behind" mechanism, but our
main thread (which really processes the query) should be fast enough to get
there before the data vanishes.
You could re-write your program to have a single thread but use aio.
In that case it should show the same read ahead benefit that you see
with the thread.
I guess this is also Gavin's point - I understand there will be two
different methodologies to handle "read-ahead". If no other thread/process
is involved, then the main thread is responsible for grabbing a free buffer
page from the buffer pool and asking the kernel to put the data there, by
sync IO (what current PostgreSQL does) or async IOs. And that's what I want
to avoid. I'd like to use a dedicated thread/process to "break the ice"
only, i.e., pull data from disk into the file system cache, so that the
main thread will only issue *logical* reads.
Regards,
Qingqing
On Tue, Nov 29, 2005 at 02:53:36PM +1100, Gavin Sherry wrote:
The second idea is using posix async IO at key points within the system
to better parallelise CPU and IO work. There areas I think we could use
async IO are: during sequential scans, use async IO to do pre-fetching of
blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
and, inside the background writer/check point process, asynchronously
write out pages and, potentially, asynchronously build new checkpoint segments.
I actually worked on this and got it to the stage where it wouldn't
crash anymore. It basically added a command to bufmgr.c called
PrefetchBuffer() which would initiate a request but not block. I then
hooked a few strategic places to call this. In particular during an
index scan, it would prefetch the next index block and the next few
data blocks and then return them in order as they came in.
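Martijn's PrefetchBuffer code isn't shown here; purely as an illustration
of "initiate a request but do not block" without threads or AIO, Linux also
offers posix_fadvise with POSIX_FADV_WILLNEED - the helper name below is
invented, and BLCKSZ matches the simulator:

```c
/*
 * Sketch: hint the kernel to start reading block blkno into the page
 * cache now. The call returns immediately; a later read() of the same
 * range will then often be a cache hit instead of a physical read.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>

#define BLCKSZ 8192

static int
prefetch_block(int fd, long blkno)
{
    return posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

Unlike AIO, this gives no completion notification, but it also needs no
per-request bookkeeping or library support.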
Unfortunately I can't really test it at its full potential because it
uses glibc's default POSIX AIO, which is *lame*: no more than one
outstanding request per fd, which for PostgreSQL is crappy. There was
some evidence that in an index scan of a highly uncorrelated index
it did make a small difference, but I never got around to testing it
fully. But bitmap scans already hugely reduce the cost of uncorrelated
indexes.
It doesn't pass regression because index_getmulti doesn't do backward
scans. Everything else works though.
If anyone is interested in the code I can send it to them. The results
on my system just weren't good enough to justify a lot more effort.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.