WAL Performance Improvements
Hi,

I've attached a patch which should improve WAL performance by reducing
the fsync and write time by 50% (if the OS page size is 4k), whenever a
transaction generates less than 4k of WAL data. Instead of writing 8k
to the WAL file every time, it writes only the portion of the data that
has changed since the last write (for example, if a transaction
generates 150 bytes of WAL data, it writes only those 150 bytes instead
of 8k).

Please apply for 7.3.

Regards
jana
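In outline, the change does the following (a simplified sketch with
illustrative names, not the patch itself; write_upto stands for the
requested write position and wal_page for the current in-memory WAL
page):

/* Write only the bytes added to the current page since the last
 * write, instead of the whole BLCKSZ page every time. */
int   page_off  = current_xrecoff % BLCKSZ;        /* next byte to flush */
char *from      = wal_page + page_off;
int   data_size = write_upto - current_xrecoff;    /* only the new bytes */

if (write(openLogFile, from, data_size) != data_size)
    elog(STOP, "write of log file failed: %m");

The full diff follows.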
Attachment: patch_xlog
--- src/backend/access/transam/xlog.c.orig Mon Feb 25 11:04:39 2002
+++ src/backend/access/transam/xlog.c Mon Feb 25 12:30:58 2002
@@ -991,16 +991,17 @@
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
- char *from;
+ char *from;
bool ispartialpage;
bool use_existent;
-
+ uint32 data_size,first_time,current_xrecoff;
/*
* Update local LogwrtResult (caller probably did this already,
* but...)
*/
LogwrtResult = Write->LogwrtResult;
-
+ first_time=1;
+ current_xrecoff=LogwrtResult.Write.xrecoff;
while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
{
/*
@@ -1080,19 +1081,46 @@
openLogOff = 0;
}
- /* Need to seek in the file? */
- if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
+ /* Need to seek in the file? */
+ if (openLogOff != (current_xrecoff ) % XLogSegSize)
{
- openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+ openLogOff = (current_xrecoff ) % XLogSegSize;
if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
elog(STOP, "lseek of log file %u, segment %u, offset %u failed: %m",
openLogId, openLogSeg, openLogOff);
- }
+ }
/* OK to write the page */
from = XLogCtl->pages + Write->curridx * BLCKSZ;
+ data_size=BLCKSZ ;
+ { /* compute the data size to write in to the file */
+ int offset;
+ int type=0; /* used only for display debug info */
+ offset= current_xrecoff % BLCKSZ ;
+ from = XLogCtl->pages + Write->curridx * BLCKSZ +offset ;
+ if (!ispartialpage && first_time==0)
+ {
+ data_size=BLCKSZ ;
+ type=1;
+ }else
+ {
+ if(ispartialpage)
+ {
+ data_size=WriteRqst.Write.xrecoff-current_xrecoff;
+ type=2;
+ }else
+ {
+ data_size=BLCKSZ-(current_xrecoff-((current_xrecoff/BLCKSZ)*BLCKSZ));
+ type=3;
+ }
+ }
+ current_xrecoff=current_xrecoff+data_size;
+ if (XLOG_DEBUG)
+ elog(DEBUG,"XLogWrite type:%d from:%x(%d) size:%d Fileoffset:%d \n",type,from,from,data_size,openLogOff);
+ first_time=0;
+ }
errno = 0;
- if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
+ if (write(openLogFile, from, data_size) != data_size)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
@@ -1100,7 +1128,7 @@
elog(STOP, "write of log file %u, segment %u, offset %u failed: %m",
openLogId, openLogSeg, openLogOff);
}
- openLogOff += BLCKSZ;
+ openLogOff += data_size;
/*
* If we just wrote the whole last page of a logfile segment,
Janardhana Reddy <jana-reddy@mediaring.com.sg> writes:
I've attached a patch which should improve WAL performance by reducing
the fsync and write time by 50% (if the OS page size is 4k), whenever a
transaction generates less than 4k of WAL data. Instead of writing 8k
to the WAL file every time, it writes only the portion of the data that
has changed since the last write (for example, if a transaction
generates 150 bytes of WAL data, it writes only those 150 bytes instead
of 8k).
As near as I can tell, this breaks WAL by failing to ensure that the
rest of the current page is zeroed. After crash and recovery, you might
read obsolete WAL records (written during the previous cycle of life
of the WAL segment file) and think they are valid.
I'd also be interested to see the measurements backing up the claim of 50%
performance improvement. That'd depend very largely on the filesystem block
size, no?
regards, tom lane
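For context, recovery walks the log record by record and stops at the
first record that fails its sanity checks; a minimal sketch of the
end-of-log test (the real ReadRecord also verifies the CRC and the
back-pointer to the previous record):

XLogRecord *record = (XLogRecord *) (page + page_offset);

if (record->xl_len == 0)
    return NULL;            /* treated as the end of valid WAL */

If the tail of the current page carries records from the segment
file's previous life instead of zeroes, those stale bytes can pass the
checks and be replayed as if they were valid WAL.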
Tom Lane wrote:

As near as I can tell, this breaks WAL by failing to ensure that the
rest of the current page is zeroed. After crash and recovery, you might
read obsolete WAL records (written during the previous cycle of life
of the WAL segment file) and think they are valid. I'd also be
interested to see the measurements backing up the claim of 50%
performance improvement. That'd depend very largely on the filesystem
block size, no?
Correct, this breaks WAL by failing to ensure that the rest of the
current page is zeroed when the WAL file is reused. I am thinking of
fixing this by writing one extra WAL record (a few zeroed bytes) with
each write whose data size is less than BLCKSZ; this should fix the
problem.

I think the performance improvement depends on the OS page size, since
at sync time the OS looks at which pages are dirty and writes each
dirty page in full, even if only a few bytes of the page were modified.
I think for Linux it is 4k. The measurements of the test on Linux are
as follows.

This is the output of "strace -c" on the backend before the patch is
applied:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
87.75 0.462903 2269 204 fdatasync
6.13 0.032322 158 204 send
2.91 0.015330 75 204 recv
2.55 0.013477 63 214 write
0.23 0.001226 6 210 lseek
0.21 0.001089 5 204 time
0.15 0.000765 4 204 gettimeofday
0.07 0.000362 91 4 read
0.01 0.000035 35 1 open
------ ----------- ----------- --------- --------- ----------------
100.00 0.527509 1449 total
This is the output after the patch is applied:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
47.92 0.101630 498 204 fdatasync
47.14 0.099969 490 204 recv
2.30 0.004879 23 215 write
1.57 0.003340 16 204 send
0.51 0.001084 5 204 time
0.38 0.000809 4 204 gettimeofday
0.13 0.000269 67 4 read
0.02 0.000046 7 7 lseek
0.02 0.000041 41 1 open
------ ----------- ----------- --------- --------- ----------------
100.00 0.212067 1247 total
The main improvement is in fdatasync, which drops from 2269 usec to
498 usec per call. I expected the fdatasync time to be reduced by 50%
(since Linux 2.4 uses a 4K page size), but all the tests show a
reduction of about 75%. In all the tests each transaction
generates/writes 150 bytes into the WAL file.

regards
jana
Tom Lane wrote:

As near as I can tell, this breaks WAL by failing to ensure that the
rest of the current page is zeroed. After crash and recovery, you might
read obsolete WAL records (written during the previous cycle of life
of the WAL segment file) and think they are valid. I'd also be
interested to see the measurements backing up the claim of 50%
performance improvement. That'd depend very largely on the filesystem
block size, no?
Hi,

I have made some changes to the patch to overcome the problem of WAL
failing to ensure that the rest of the current page is zeroed: for
every write to the WAL, an empty record (32 bytes with zero value) is
appended, so that at REDO time a record with zero length is seen in
all cases.
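In code terms the intent is roughly this (a sketch using the patch's
variable names):

/* Extend a partial write by sizeof(XLogRecord) zero bytes, so that
 * REDO always finds a record header with xl_len == 0 and stops. */
data_size = WriteRqst.Write.xrecoff - current_xrecoff;     /* real data */
if (offset + data_size + sizeof(XLogRecord) <= BLCKSZ)
    total_size = data_size + sizeof(XLogRecord);    /* data + zero header */
else
    total_size = BLCKSZ - offset;       /* page boundary: no room needed */

The bytes past the insert position in the in-memory page are zero, so
the extra sizeof(XLogRecord) bytes written are all zero.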
Test results with the latest patch:

Environment: Intel PC, IDE hard disk, Linux kernel 2.4.0. A single
connection is connected to the database, continuously pumping insert
statements; each insert generates 160 bytes of WAL log.

Without the patch:
Transactions per second : 332 TPS
Time taken by fdatasync : 2160 usec/call
Time taken by write     : 61 usec/call

With the patch applied:
Transactions per second : 435 TPS
Time taken by fdatasync : 512 usec/call
Time taken by write     : 42 usec/call

I've attached the latest patch and xlog.c.

Regards
jana
Attachment: patch_xlog
--- src/backend/access/transam/xlog.c.orig Mon Feb 25 11:04:39 2002
+++ src/backend/access/transam/xlog.c Tue Feb 26 14:33:58 2002
@@ -991,16 +991,17 @@
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
- char *from;
+ char *from;
bool ispartialpage;
bool use_existent;
-
+ uint32 data_size,total_size,first_time,current_xrecoff;
/*
* Update local LogwrtResult (caller probably did this already,
* but...)
*/
LogwrtResult = Write->LogwrtResult;
-
+ first_time=1;
+ current_xrecoff=LogwrtResult.Write.xrecoff;
while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
{
/*
@@ -1080,19 +1081,56 @@
openLogOff = 0;
}
- /* Need to seek in the file? */
- if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
+ /* Need to seek in the file? */
+ if (openLogOff != (current_xrecoff ) % XLogSegSize)
{
- openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+ openLogOff = (current_xrecoff ) % XLogSegSize;
if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
elog(STOP, "lseek of log file %u, segment %u, offset %u failed: %m",
openLogId, openLogSeg, openLogOff);
- }
+ }
/* OK to write the page */
from = XLogCtl->pages + Write->curridx * BLCKSZ;
+ data_size=BLCKSZ ;
+ total_size=BLCKSZ ;
+ { /* compute the data size to write in to the file */
+ int offset;
+ int type=0; /* used only for display debug info */
+ offset= current_xrecoff % BLCKSZ ;
+ from = XLogCtl->pages + Write->curridx * BLCKSZ +offset ;
+ if (!ispartialpage && first_time==0)
+ {
+ data_size=BLCKSZ ;
+ total_size=BLCKSZ ;
+ type=1;
+ }else
+ {
+ if(ispartialpage)
+ {
+ type=2;
+ data_size=WriteRqst.Write.xrecoff-current_xrecoff;
+ if ( (offset+data_size+sizeof(XLogRecord)) <= BLCKSZ)
+ {
+ total_size=data_size+sizeof(XLogRecord); type=5;
+ }else
+ {
+ total_size=BLCKSZ-offset;
+ }
+ }else
+ {
+ data_size=BLCKSZ-(current_xrecoff-((current_xrecoff/BLCKSZ)*BLCKSZ));
+ total_size=data_size;
+ type=3;
+ }
+ }
+ current_xrecoff=current_xrecoff+data_size;
+ if (XLOG_DEBUG)
+ elog(DEBUG,"XLogWrite type:%d from:%x(%d) data size:%d totalsize:%d Fileoffset:%d \n",type,from,from,data_size,total_size,openLogOff);
+ first_time=0;
+ }
errno = 0;
- if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
+ if (write(openLogFile, from, total_size) != total_size)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
@@ -1100,7 +1138,7 @@
elog(STOP, "write of log file %u, segment %u, offset %u failed: %m",
openLogId, openLogSeg, openLogOff);
}
- openLogOff += BLCKSZ;
+ openLogOff += total_size;
/*
* If we just wrote the whole last page of a logfile segment,
On Tue, 26 Feb 2002, Janardhana Reddy wrote:
Test Results with Latest patch :
environment: Intel PC, IDE (hard disk), Linux Kernel 2.4.0 (OS Version).
Single connection is connected to the database, pumping continuous
insert statements; each insert generates 160 bytes to the WAL log.

8192:
Transaction Per Second : 332 TPS
Time Taken by fdatasync : 2160

4096:
Transaction Per Second : 435 TPS
Time Taken by fdatasync : 512
Unfortunately your timings are meaningless. Assuming you have a
10000rpm drive (that is, 166 revolutions per second), it is physically
impossible to write 332 or 435 times per second to the same location
on the disk.
So I guess your disk is performing write-caching and not really writing
the data back when requested by fsync(). You may try to disable
write caching and see if it makes a difference:
hdparm -W 0 /dev/hda
But note that most (or even all) modern IDE drives will not disable write
caching even when instructed to do so. You should try to repeat the timings
using SCSI drives -- I guess you will not see any improvement here.
Regards
--
Helge Bahmann <bahmann@math.tu-freiberg.de> /| \__
Network admin, systems programmer /_|____\
_/\ | __)
$ ./configure \\ \|__/__|
checking whether build environment is sane... yes \\/___/ |
checking for AIX... no (we already did this) |
Helge Bahmann wrote:

Unfortunately your timings are meaningless. Assuming you have a
10000rpm drive (that is, 166 revolutions per second), it is physically
impossible to write 332 or 435 times per second to the same location
on the disk. So I guess your disk is performing write-caching and not
really writing the data back when requested by fsync(). You may try to
disable write caching and see if it makes a difference:

hdparm -W 0 /dev/hda

But note that most (or even all) modern IDE drives will not disable
write caching even when instructed to do so. You should try to repeat
the timings using SCSI drives -- I guess you will not see any
improvement here.
I have tested again, but it gives the same result.

I have now tested with a small program that just does write and
fdatasync, varying the size of the data in the write call. There is a
big difference in fdatasync time. The test program looks as below:
------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int
main(void)
{
    int  fd;
    int  i = 0;                 /* was uninitialized */
    int  data_size;
    char buf[20000];

    fd = open("testdata", O_CREAT | O_WRONLY, 0644);  /* O_CREAT needs a mode */
    data_size = 8 * 1024;
    while (1)
    {
        i++;
        lseek(fd, 0, SEEK_SET); /* always rewrite the same file location */
        write(fd, buf, data_size);
        fdatasync(fd);
        if (i % 10000 == 0)
        {
            printf("---------------\nindex: %d size: %d :\n", i, data_size);
            system("date");
        }
    }
    return 0;                   /* not reached */
}
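For reference, the figures below were gathered by compiling the
program and attaching strace to it, roughly like this (the source file
name is incidental):

gcc testprog.c              # produces the ./a.out used below
./a.out &
strace -c -p <pid>          # attach and summarize syscall timings

strace -c reports, per syscall, the total time, usec/call and call
count, which is where the fdatasync figures come from.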
=========================================================
Test1 : with test program data_size= 8*1024
the output looks as below:
./a.out
---------------
index: 134520000 size: 8192 :
Tue Feb 26 19:46:12 SGT 2002
---------------
index: 134530000 size: 8192 :
Tue Feb 26 19:46:51 SGT 2002
---------------
index: 134540000 size: 8192 :
Tue Feb 26 19:47:27 SGT 2002
---------------
index: 134550000 size: 8192 :
Tue Feb 26 19:48:04 SGT 2002
---------------
strace output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.36 29.354861 3141 9347 fdatasync
1.45 0.432686 46 9350 write
0.15 0.044835 5 9348 lseek
0.03 0.009375 9375 1 wait4
0.00 0.001114 1114 1 vfork
0.00 0.000140 35 4 rt_sigaction
0.00 0.000007 4 2 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 29.843018 28053 total
======================================================================
Test2 : with test program , data_size=160
the output looks as below:
---------------
index: 134520000 size: 160 :
Tue Feb 26 19:44:41 SGT 2002
---------------
index: 134530000 size: 160 :
Tue Feb 26 19:44:44 SGT 2002
---------------
index: 134540000 size: 160 :
Tue Feb 26 19:44:48 SGT 2002
---------------
index: 134550000 size: 160 :
Tue Feb 26 19:44:52 SGT 2002
---------------
index: 134560000 size: 160 :
Tue Feb 26 19:44:56 SGT 2002
strace output :
strace -c -p 4741
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
95.54 5.672195 396 14328 fdatasync
3.12 0.185227 13 14330 write
1.16 0.069158 5 14329 lseek
0.12 0.006927 6927 1 vfork
0.05 0.003146 3146 1 wait4
0.00 0.000020 5 4 rt_sigaction
0.00 0.000007 4 2 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 5.936680 42995 total
======================================================================================
SUMMARY :
Test1: (data_size= 8192 , with test program)
fdatasync time +write time: 3141+46 = 3187 usec/call
Time taken for 10000 iterations: nearly 40 seconds
Test2 : (data_size = 160, with test program)
fdatasync time+write time: 396 +13 = 409 usec/call
Time taken for 10000 iterations: nearly 4 seconds
When I test with the database by doing 10000 inserts, each generating
160 bytes into the WAL log:

Test3: (without applying the patch, with the existing database)
10000 inserts = 30 seconds
fdatasync time = 2160 usec

Test4: (with the patch applied, with the database)
10000 inserts = 23 seconds
fdatasync time = 512 usec
What I don't understand is: in Test3, with the existing Postgres
database, fdatasync takes 2160 usec, but according to Test1 it takes
3141 usec, even though in both cases the data size written is 8192.
This causes the difference in the results.

The hard disk is not doing any write caching. During fdatasync, the
Linux OS writes only the dirty buffers (size = 512 bytes), so this
causes the big difference in fdatasync time, from 396 usec to 3141
usec.

Regards
jana
On Tue, 26 Feb 2002, Janardhana Reddy wrote:
SUMMARY :
Test1: (data_size= 8192 , with test program)
fdatasync time +write time: 3141+46 = 3187 usec/call
Time taken for 10000 iterations: nearly 40 seconds
Test2 : (data_size = 160, with test program)
fdatasync time+write time: 396 +13 = 409 usec/call
Time taken for 10000 iterations: nearly 4 seconds
This only shows that your harddisk is doing write caching, although it
claims it does not (And on such systems I am tempted to say you can
turn off fsync unconditionally as it will gain you almost nothing).
Please look at the numbers: It is really *impossible* for any harddisk to
write to the same location more than 2000 times per second - simply due to
the fact that the disks are not rotating that fast. The fact that
turning write caching on or off does not make a difference should make
you suspicious.
(In fact looking more closely at the numbers I am tempted to bet that you
operate your IDE disk in PIO mode: 1024bytes/400usec= 8192bytes/3200usec=
2.5MByte/s, and all you are benchmarking is the PIO transfer rate of your
IDE-controller/CPU combination).
This is not to say that your WAL optimization is worthless, but the
benchmark you gave is certainly wrong.
Regards
--
Helge Bahmann <bahmann@math.tu-freiberg.de> /| \__
Network admin, systems programmer /_|____\
_/\ | __)
$ ./configure \\ \|__/__|
checking whether build environment is sane... yes \\/___/ |
checking for AIX... no (we already did this) |
Helge Bahmann <bahmann@math.tu-freiberg.de> writes:
This is not to say that your WAL optimization is worthless, but the
benchmark you gave is certainly wrong.
There is actually good reason to think that the change would be a net
loss in many scenarios. The problem is that if you write a partial
filesystem block, the kernel must first read in the old contents of
the block, then overlay the data you've specified to write onto the
appropriate part of the buffer. That disk read takes time --- and
what's worse, it's physical I/O that will be done while the process
requesting the write is holding the WALWriteLock. (AFAIK the kernel
will not absorb the user data until it's got a buffer to dump it
into; anyone want to dig into kernel sources and confirm that?)
On the other hand, when you write a full block, there's no need to
read the old block contents. The user data will just be copied to
a freshly-allocated kernel disk buffer. This is why I suggested
that the first write of a WAL block should write the entire block.
We can hope that subsequent writes of just part of the block will find
the block still in kernel disk buffers, and so avoid a read operation.
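A sketch of that strategy (page_first_write, page_start, flushed and
dirty_len are hypothetical names, not existing xlog.c variables):

/* First flush of a WAL page: write the full block, so the kernel can
 * allocate a fresh buffer without reading the old block from disk. */
if (page_first_write)
{
    if (write(openLogFile, page_start, BLCKSZ) != BLCKSZ)
        elog(STOP, "write of log file failed: %m");
    page_first_write = false;
}
else
{
    /* Later flushes rewrite only the changed tail; with luck the
     * block is still in the kernel's buffer cache, so no physical
     * read happens while WALWriteLock is held. */
    if (write(openLogFile, page_start + flushed, dirty_len) != dirty_len)
        elog(STOP, "write of log file failed: %m");
}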
AFAICS the only real win that can be gotten with a change like
Janardhana's would be to avoid writing multiple blocks in the case
where the filesystem block size is smaller than the xlog's BLCKSZ.
Tuning this correctly would require knowing the kernel's block size.
Anyone have ideas about a portable way to find that out?
regards, tom lane
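One portable-ish possibility for the block-size question is
st_blksize from fstat(2), which POSIX defines as the preferred I/O
block size for the file - usually, though not guaranteed to be, the
filesystem block size:

#include <sys/stat.h>

/* Ask the kernel for its preferred I/O transfer size for this file.
 * Treat the result as a hint, not as the exact on-disk block size. */
static long
fs_blocksize(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return (long) st.st_blksize;
}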
On Tue, 26 Feb 2002, Tom Lane wrote:
AFAICS the only real win that can be gotten with a change like
Janardhana's would be to avoid writing multiple blocks in the case
where the filesystem block size is smaller than the xlog's BLCKSZ.
Tuning this correctly would require knowing the kernel's block size.
Anyone have ideas about a portable way to find that out?
I have been thinking for quite some time now that it would be a cool
project to turn Postgres into using aio_(read|write) + O_DIRECT instead of
read|write + fsync; in that case the caller gets to control the blocksize
within the limits permitted by the hardware, and Janardhana's optimization
could safely be applied.
Regards
--
Helge Bahmann <bahmann@math.tu-freiberg.de> /| \__
Network admin, systems programmer /_|____\
_/\ | __)
$ ./configure \\ \|__/__|
checking whether build environment is sane... yes \\/___/ |
checking for AIX... no (we already did this) |
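For illustration, the O_DIRECT half of that idea might look roughly
like this on Linux (a sketch only; O_DIRECT requires the buffer
address, file offset and length to be suitably aligned, typically to
512 bytes or the filesystem block size):

#define _GNU_SOURCE             /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char *buf;
    int   fd = open("testdata", O_CREAT | O_WRONLY | O_DIRECT, 0600);

    if (fd < 0 || posix_memalign((void **) &buf, 512, 512) != 0)
        return 1;
    memset(buf, 0, 512);
    /* aligned buffer, offset and length: the write bypasses the OS
     * buffer cache, so the caller controls the transfer size */
    if (write(fd, buf, 512) != 512)
        return 1;
    free(buf);
    close(fd);
    return 0;
}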
On Tue, 2002-02-26 at 17:33, Janardhana Reddy wrote:
Helge Bahmann wrote:

Unfortunately your timings are meaningless. Assuming you have a
10000rpm drive (that is, 166 revolutions per second), it is physically
impossible to write 332 or 435 times per second to the same location
on the disk.
But it is possible to push data to the disk cache and be pretty sure
that it will be written in the next 1/166th of a second. Most modern
disks should be able to cache writes for at least one whole track, and
it makes sense to report the data as "written" once it is in the write
cache and the disk is confident that it will be written even if a
power failure happens the next moment.

--------------
Hannu
Helge Bahmann wrote:

This only shows that your harddisk is doing write caching, although it
claims it does not. Please look at the numbers: It is really
*impossible* for any harddisk to write to the same location more than
2000 times per second - simply due to the fact that the disks are not
rotating that fast. This is not to say that your WAL optimization is
worthless, but the benchmark you gave is certainly wrong.
You are correct: the hard disk on my machine is doing write caching.
I have repeated Test1 and Test2 on another machine with SCSI; the time
taken by Test1 and Test2 is almost the same, around 100 seconds.

What I don't understand is that in Test1 the OS puts 8192 bytes
(512 x 16 = 16 blocks) on the hard disk, whereas in Test2 the OS puts
only 512 bytes (1 block) on the hard disk during the fdatasync. So how
come the hard disk takes the same time to write 1 block as 16 blocks?
Is it because the hard disk seek time is much larger compared to the
hard disk write time?

Regards
jana
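A back-of-the-envelope calculation suggests the answer is yes.
Assuming a disk that sustains about 20 MB/s, transferring 16 blocks of
512 bytes (8 KB) takes only

    8192 bytes / 20 MB/s        = about 0.4 ms

while the average rotational latency alone on a 7200 rpm disk is

    0.5 * (60 s / 7200)         = about 4.2 ms

per synchronous write. Positioning time, not transfer size, dominates
each fdatasync, which is why writing 1 block or 16 blocks costs nearly
the same.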