Patch : seq scan readahead (WIP)

Started by Pierre Frédéric Caillaud (August 2009) · 2 messages · 1 attachment

This is a spinoff of the current work on compression...
I've discovered that Linux doesn't apply readahead to sparse files,
so I added a little readahead to seq scans.

Then I realized this might also be beneficial for standard Postgres.
On my RAID1 it shows some pretty drastic effects.

The PC:

- RAID1 of 2x SATA disks, reads at about 60 MB/s
- RAID5 of 3x SATA disks, reads at about 210 MB/s

Both RAIDs are Linux Software RAID.

Test data:

A 9.3 GB table with fairly large rows, so count(*) doesn't use much CPU.

The problem:

- On the RAID5 there is no problem: count(*) maxes out the array.
- On the RAID1, count(*) also maxes out the disk, but there are 2 disks:
one works while the other sits idle, doing nothing.
Linux Software RAID cannot stripe a single sequential read across both
mirrors, at least on my kernel version. What do your boxes do in such a
situation?

For standard Postgres, iostat says:

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 3,00 0,00 40,00 0 40
sdb 727,00 116600,00 40,00 116600 40

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 124,00 23408,00 0,00 23408 0
sdb 628,00 101640,00 0,00 101640 0

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 744,00 124536,00 0,00 124536 0
sdb 0,00 0,00 0,00 0 0

Basically it reads the disks in turn, but never both at the same time.

The solution:

Somehow coerce Linux Software RAID into striping reads across the two
mirrors to get more throughput.

After a bit of fiddling, this seems to do it:

- for each page read in a seq scan:

Strategy 0 : do nothing (this is the current behaviour)
Strategy 1 : issue a Prefetch call 4096 pages (32 MB) ahead of the current
position
Strategy 2 : if (the current page & 4096) is non-zero, issue a Prefetch
call 4096 pages (32 MB) ahead of the current position
Strategy 3 : issue a Prefetch at 32 MB * ((the current page & 4096) ? 1 : 2)
ahead of the current position
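To make strategy 2 concrete, here is a stand-alone sketch of its page test and target computation (should_prefetch() and prefetch_target() are illustrative names, not from the patch). Since (page & 4096) tests bit 12 of the page number, prefetches are only issued during every other 32 MB window of the scan:

```c
#include <assert.h>
#include <stdbool.h>

#define RA_PAGES 4096			/* 4096 pages x 8 kB = 32 MB readahead distance */

/* Strategy 2: prefetch only while bit 12 of the current page number is
 * set, i.e. during alternating 32 MB windows of the scan. */
static bool
should_prefetch(unsigned page)
{
	return (page & RA_PAGES) != 0;
}

/* Target page, wrapping at the end of the relation as the patch does. */
static unsigned
prefetch_target(unsigned page, unsigned nblocks)
{
	return (page + RA_PAGES) % nblocks;
}
```

The effect is that while the backend reads the current window from one mirror, the kernel can service the prefetches for the window 32 MB ahead from the other mirror.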

Results of a seq scan of 9.3 GB of data on the RAID5:

Strategy 0 : 46.4 s
It maxes out the array anyway, so I didn't try the others.
However, RAID1 is better for databases that are not so read-only...

Results of a seq scan of 9.3 GB of data on the RAID1:

Strategy 0 : 162.8 s
Strategy 1 : 152.9 s
Strategy 2 : 105.2 s
Strategy 3 : 152.3 s

Strategy 2 cuts the seq scan duration by 35%, i.e. disk bandwidth gets a
+54% boost.
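As a sanity check on those percentages, plain arithmetic on the timings above:

```c
#include <assert.h>
#include <math.h>

/* Timings from the RAID1 runs above */
static const double t_strat0 = 162.8;	/* seconds, strategy 0 */
static const double t_strat2 = 105.2;	/* seconds, strategy 2 */

/* duration cut: (162.8 - 105.2) / 162.8 ~ 35.4 % */
static double duration_cut(void)    { return (t_strat0 - t_strat2) / t_strat0; }

/* bandwidth boost: 162.8 / 105.2 - 1 ~ 54.8 % */
static double bandwidth_boost(void) { return t_strat0 / t_strat2 - 1.0; }
```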

For strategy 2, iostat says:

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 625,00 105288,00 0,00 105288 0
sdb 820,00 105968,00 0,00 105968 0

Both mirrors of the RAID1 are used at the same time.

I guess it would need some experimenting with the values, and a
per-tablespace setting, but since lots of people use Linux Software RAID1
on servers, this might be interesting...

You guys want to try it ?

Patch attached.

Attachments:

pg-8.4.0-ra-patch-v0.01.txt (text/plain)
diff -rupN postgresql-8.4.0-orig/src/backend/access/heap/heapam.c postgresql-8.4.0-ra/src/backend/access/heap/heapam.c
--- postgresql-8.4.0-orig/src/backend/access/heap/heapam.c	2009-06-11 16:48:53.000000000 +0200
+++ postgresql-8.4.0-ra/src/backend/access/heap/heapam.c	2009-08-08 10:41:15.000000000 +0200
@@ -135,6 +135,8 @@ initscan(HeapScanDesc scan, ScanKey key,
 	{
 		if (scan->rs_strategy == NULL)
 			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
+		
+		scan->rs_readahead_pages = 4096;	/* TODO: GUC ? or maybe put it in AccessStrategy ? */
 	}
 	else
 	{
@@ -766,6 +768,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 			if (page == 0)
 				page = scan->rs_nblocks;
 			page--;
+			
+			/*
+			 * do some extra readahead (really needed for compressed files)
+			 */
+			if( scan->rs_readahead_pages && !finished )
+				PrefetchBuffer( scan->rs_rd, MAIN_FORKNUM, page - scan->rs_readahead_pages + ((page >= scan->rs_readahead_pages) ? 0 : scan->rs_nblocks));
 		}
 		else
 		{
@@ -788,6 +796,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 			 */
 			if (scan->rs_syncscan)
 				ss_report_location(scan->rs_rd, page);
+			
+			/*
+			 * do some extra readahead (really needed for compressed files)
+			 */
+
+			if( scan->rs_readahead_pages && !finished && (page & 4096))
+				PrefetchBuffer( scan->rs_rd, MAIN_FORKNUM, (page + scan->rs_readahead_pages) % scan->rs_nblocks );
 		}
 
 		/*
@@ -1209,6 +1224,7 @@ heap_beginscan_internal(Relation relatio
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
+	scan->rs_readahead_pages = 0;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
diff -rupN postgresql-8.4.0-orig/src/include/access/relscan.h postgresql-8.4.0-ra/src/include/access/relscan.h
--- postgresql-8.4.0-orig/src/include/access/relscan.h	2009-01-01 18:23:56.000000000 +0100
+++ postgresql-8.4.0-ra/src/include/access/relscan.h	2009-08-08 09:44:38.000000000 +0200
@@ -35,6 +35,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_startblock;	/* block # to start at */
 	BufferAccessStrategy rs_strategy;	/* access strategy for reads */
 	bool		rs_syncscan;	/* report location to syncscan logic? */
+	int			rs_readahead_pages;	/* if non-zero, issue a Prefetch to get a page rs_readahead_pages ahead of current page */
 
 	/* scan current state */
 	bool		rs_inited;		/* false = scan not init'd yet */
#2: Albert Cervera i Areny <albert@nan-tic.com>
In reply to: Pierre Frédéric Caillaud (#1)
Re: Patch : seq scan readahead (WIP)

On Saturday, 8 August 2009, Pierre Frédéric Caillaud wrote:

> I guess it would need some experimenting with the values, and a
> per-tablespace setting, but since lots of people use Linux Software RAID1
> on servers, this might be interesting...
>
> You guys want to try it ?

Your tests involve only one user. What about having two (or more) users
reading different tables? You're using both disks for one user, for "only"
a 35% performance gain...

> Patch attached.

--
Albert Cervera i Areny
http://www.NaN-tic.com
Mòbil: +34 669 40 40 18