WAL: O_DIRECT and multipage-writer

Started by ITAGAKI Takahiroalmost 21 years ago22 messages
#1ITAGAKI Takahiro
itagaki.takahiro@lab.ntt.co.jp
1 attachment(s)

Hello, all.

I think that there is room for improvement in WAL.
Here is a patch for it.
- Multiple pages are written in one write() if it is contiguous.
- Add 'open_direct' to wal_sync_method.

WAL writer writes one page in one write(). This is not efficient
when wal_sync_method is 'open_sync', because the writer waits for
IO completions at each write(). Multipage-writer can reduce syscalls
and improve IO throughput.

'open_direct' uses O_DIRECT instead of O_SYNC. O_DIRECT implies synchronous
writing, so it may show the tendency like open_sync. But maybe it can reduce
memcpy() and save OS's disk cache memory.

I benchmarked this patch with pgbench. It works well and
improved 50% of tps on my machine. WAL seems to be bottle-neck
on machines with poor disks.

This patch has not yet tested enough. I would like it to be examined much
and taken into PostgreSQL.

There are still many TODOs:
* Is this logic really correct?
- O_DIRECT_BUFFER_ALIGN should be adjusted to runtime, not compile time.
- Consider to use writev() instead of write().
Buffers are noncontiguous when WAL ring buffer rotates.
- If wan_sync_method is not open_direct, XLOG_EXTRA_BUFFERS can be 0.

Sincerely,
ITAGAKI Takahiro

-- pgbench result --

$ ./pgbench -s 100 -c 50 -t 400

- 8.0.0 default + fsync:
tps = 20.630632 (including connections establishing)
tps = 20.636768 (excluding connections establishing)
- multipage-writer + open_direct:
tps = 33.761917 (including connections establishing)
tps = 33.778320 (excluding connections establishing)

Environment:
OS : Linux kernel 2.6.9
CPU : Pentium 4 3GHz
disk : ATA 5400rpm (Data and WAL are placed on same partition.)
memory : 1GB
config : shared_buffers=10000, wal_buffers=256,
XLOG_SEG_SIZE=256MB, checkpoint_segment=4

---
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp>
NTT Cyber Space Laboratories
Nippon Telegraph and Telephone Corporation.

Attachments:

xlog.diffapplication/octet-stream; name=xlog.diffDownload
46,69d45
< /*-------------------------------------------------------------------------*/
< 
< #ifdef O_DIRECT
< #	define OPEN_DIRECT_FLAG		O_DIRECT
< #endif
< 
< #define XLOG_MULTIPAGE_WRITER_DEBUG
< 
< /*-------------------------------------------------------------------------*/
< 
< /* O_DIRECT : BEGIN */
< 
< /* TODO: Aligment depends on OS and filesystem. */
< #define O_DIRECT_BUFFER_ALIGN	4096
< 
< #ifdef OPEN_DIRECT_FLAG
< #	define XLOG_EXTRA_BUFFERS		O_DIRECT_BUFFER_ALIGN
< #	define XLOG_BUFFERS_ALIGN(LEN)	TYPEALIGN(XLOG_EXTRA_BUFFERS, (LEN))
< #else
< #	define XLOG_EXTRA_BUFFERS		0
< #	define XLOG_BUFFERS_ALIGN(LEN)	MAXALIGN(LEN)
< #endif
< 
< /* O_DIRECT : END */
492,556d467
< /* BEGIN : XLOG_MULTIPAGE_WRITER */
< 
< static struct XLogMultipageData
< {
< 	char	*pages;		/* Head of first page */
< 	int		 size;		/* Total bytes of pages == count(pages) * BLCKSZ */
< 	int		 offset;	/* Offset in xlog segment file  */
< } XLogMultipage;
< 
< static void XLogMultipageFlush(void)
< {
< 	if (!XLogMultipage.pages)
< 	{	/* No needs to write pages. */
< 		return;
< 	}
< 	
< 	/* Need to seek in the file? */
< 	if (openLogOff != XLogMultipage.offset)
< 	{
< 		openLogOff = XLogMultipage.offset;
< 		if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
< 			ereport(PANIC,
< 					(errcode_for_file_access(),
< 					 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
< 							openLogId, openLogSeg, openLogOff)));
< 	}
< 
< 	/* OK to write the page */
< 	errno = 0;
< 	if (write(openLogFile, XLogMultipage.pages, XLogMultipage.size) != XLogMultipage.size)
< 	{
< 		/* if write didn't set errno, assume problem is no disk space */
< 		if (errno == 0)
< 			errno = ENOSPC;
< 		ereport(PANIC,
< 				(errcode_for_file_access(),
< 				 errmsg("could not write to log file %u, segment %u at offset %u: %m",
< 						openLogId, openLogSeg, openLogOff)));
< 	}
< 
< #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
< 	elog(LOG, "XLogMultipageFlush writes %d pages.", XLogMultipage.size / BLCKSZ);
< #endif
< 
< 	openLogOff += XLogMultipage.size;
< 	memset(&XLogMultipage, 0, sizeof(XLogMultipage));
< }
< 
< static void XLogMultipageWrite(char *page, int size, int offset)
< {
< 	if (XLogMultipage.pages + XLogMultipage.size == page
< 		&& XLogMultipage.offset + XLogMultipage.size == offset)
< 	{	/* Pages are continuous. Append new page. */
< 		XLogMultipage.size += size;
< 	}
< 	else
< 	{	/* Pages are not continuous. Flush and clear. */
< 		XLogMultipageFlush();
< 		XLogMultipage.pages = page;
< 		XLogMultipage.size = size;
< 		XLogMultipage.offset = offset;
< 	}
< }
< 
< /* END : XLOG_MULTIPAGE_WRITER */
1230a1142
> 	char	   *from;
1233,1238d1144
< 	int			currentIndex = Write->curridx;
< 
< #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
< 	if (XLogMultipage.pages)
< 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
< #endif
1254c1160
< 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
---
> 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
1257,1258c1163,1164
< 				 XLogCtl->xlblocks[currentIndex].xlogid,
< 				 XLogCtl->xlblocks[currentIndex].xrecoff);
---
> 				 XLogCtl->xlblocks[Write->curridx].xlogid,
> 				 XLogCtl->xlblocks[Write->curridx].xrecoff);
1261c1167
< 		LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
---
> 		LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
1269d1174
< 			XLogMultipageFlush();
1339a1245,1256
> 			openLogOff = 0;
> 		}
> 
> 		/* Need to seek in the file? */
> 		if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> 		{
> 			openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> 			if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> 				ereport(PANIC,
> 						(errcode_for_file_access(),
> 						 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> 								openLogId, openLogSeg, openLogOff)));
1342,1346c1259,1272
< 		/* Add a page to buffer */
< 		XLogMultipageWrite(
< 			XLogCtl->pages + currentIndex * BLCKSZ,
< 			BLCKSZ,
< 			(LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize);
---
> 		/* OK to write the page */
> 		from = XLogCtl->pages + Write->curridx * BLCKSZ;
> 		errno = 0;
> 		if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> 		{
> 			/* if write didn't set errno, assume problem is no disk space */
> 			if (errno == 0)
> 				errno = ENOSPC;
> 			ereport(PANIC,
> 					(errcode_for_file_access(),
> 					 errmsg("could not write to log file %u, segment %u at offset %u: %m",
> 							openLogId, openLogSeg, openLogOff)));
> 		}
> 		openLogOff += BLCKSZ;
1358c1284
< 		if (openLogOff + XLogMultipage.size >= XLogSegSize && !ispartialpage)
---
> 		if (openLogOff >= XLogSegSize && !ispartialpage)
1360d1285
< 			XLogMultipageFlush();
1374c1299
< 		currentIndex = NextBufIdx(currentIndex);
---
> 		Write->curridx = NextBufIdx(Write->curridx);
1376d1300
< 	XLogMultipageFlush();
1412,1413d1335
< 	Write->curridx = currentIndex;
< 
1435,1439d1356
< 
< #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
< 	if (XLogMultipage.pages)
< 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
< #endif
1562,1564d1478
< 
< 	if (XLogMultipage.pages)
< 		elog(PANIC, "xlog multipage-write usage error at XLogFlush");
3469c3383
< 		+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers +
---
> 		+ BLCKSZ * XLOGbuffers +
3487c3401
< 						+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers,
---
> 						+ BLCKSZ * XLOGbuffers,
3515,3516c3429,3430
< 		(char*)XLOG_BUFFERS_ALIGN(((char *) XLogCtl)
< 		+ sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
---
> 		((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> 									  sizeof(XLogRecPtr) * XLOGbuffers);
3574,3577d3487
< #if 1
< 	buffer = (char *) XLOG_BUFFERS_ALIGN(
< 		malloc(BLCKSZ + XLOG_EXTRA_BUFFERS) );
< #else
3580,3582d3489
< #endif
< 	/* XXX: Does buffer memory-leak? */
< 
5276,5282d5182
< #ifdef OPEN_DIRECT_FLAG
< 	else if (pg_strcasecmp(method, "open_direct") == 0)
< 	{
< 		new_sync_method = SYNC_METHOD_OPEN;
< 		new_sync_bit = OPEN_DIRECT_FLAG;
< 	}
< #endif
#2Michael Paesold
mpaesold@gmx.at
In reply to: ITAGAKI Takahiro (#1)
Re: WAL: O_DIRECT and multipage-writer

ITAGAKI Takahiro wrote:

I think that there is room for improvement in WAL.
Here is a patch for it.

I think you should resend your patch as a context diff (diff -c). Otherwise
it's hard to see what your patch does.

Best Regards,
Michael Paesold

#3ITAGAKI Takahiro
itagaki.takahiro@lab.ntt.co.jp
In reply to: Michael Paesold (#2)
1 attachment(s)
Re: WAL: O_DIRECT and multipage-writer

Excuse me.
I resend the patch with diff -c.

On Tue, 25 Jan 2005 10:30:01 +0100
"Michael Paesold" <mpaesold@gmx.at> wrote:

ITAGAKI Takahiro wrote:

I think that there is room for improvement in WAL.
Here is a patch for it.

I think you should resend your patch as a context diff (diff -c). Otherwise
it's hard to see what your patch does.

---
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp>
NTT Cyber Space Laboratories
Nippon Telegraph and Telephone Corporation.

Attachments:

xlog.c.diffapplication/octet-stream; name=xlog.c.diffDownload
*** xlog.org.c	Sat Jan  1 06:59:30 2005
--- xlog.c	Tue Jan 25 10:05:09 2005
***************
*** 43,48 ****
--- 43,72 ----
  #include "utils/guc.h"
  #include "utils/relcache.h"
  
+ /*-------------------------------------------------------------------------*/
+ 
+ #ifdef O_DIRECT
+ #	define OPEN_DIRECT_FLAG		O_DIRECT
+ #endif
+ 
+ #define XLOG_MULTIPAGE_WRITER_DEBUG
+ 
+ /*-------------------------------------------------------------------------*/
+ 
+ /* O_DIRECT : BEGIN */
+ 
+ /* TODO: Aligment depends on OS and filesystem. */
+ #define O_DIRECT_BUFFER_ALIGN	4096
+ 
+ #ifdef OPEN_DIRECT_FLAG
+ #	define XLOG_EXTRA_BUFFERS		O_DIRECT_BUFFER_ALIGN
+ #	define XLOG_BUFFERS_ALIGN(LEN)	TYPEALIGN(XLOG_EXTRA_BUFFERS, (LEN))
+ #else
+ #	define XLOG_EXTRA_BUFFERS		0
+ #	define XLOG_BUFFERS_ALIGN(LEN)	MAXALIGN(LEN)
+ #endif
+ 
+ /* O_DIRECT : END */
  
  /*
   * This chunk of hackery attempts to determine which file sync methods
***************
*** 465,470 ****
--- 489,559 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  
+ /* BEGIN : XLOG_MULTIPAGE_WRITER */
+ 
+ static struct XLogMultipageData
+ {
+ 	char	*pages;		/* Head of first page */
+ 	int		 size;		/* Total bytes of pages == count(pages) * BLCKSZ */
+ 	int		 offset;	/* Offset in xlog segment file  */
+ } XLogMultipage;
+ 
+ static void XLogMultipageFlush(void)
+ {
+ 	if (!XLogMultipage.pages)
+ 	{	/* No needs to write pages. */
+ 		return;
+ 	}
+ 	
+ 	/* Need to seek in the file? */
+ 	if (openLogOff != XLogMultipage.offset)
+ 	{
+ 		openLogOff = XLogMultipage.offset;
+ 		if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+ 			ereport(PANIC,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+ 							openLogId, openLogSeg, openLogOff)));
+ 	}
+ 
+ 	/* OK to write the page */
+ 	errno = 0;
+ 	if (write(openLogFile, XLogMultipage.pages, XLogMultipage.size) != XLogMultipage.size)
+ 	{
+ 		/* if write didn't set errno, assume problem is no disk space */
+ 		if (errno == 0)
+ 			errno = ENOSPC;
+ 		ereport(PANIC,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to log file %u, segment %u at offset %u: %m",
+ 						openLogId, openLogSeg, openLogOff)));
+ 	}
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	elog(LOG, "XLogMultipageFlush writes %d pages.", XLogMultipage.size / BLCKSZ);
+ #endif
+ 
+ 	openLogOff += XLogMultipage.size;
+ 	memset(&XLogMultipage, 0, sizeof(XLogMultipage));
+ }
+ 
+ static void XLogMultipageWrite(char *page, int size, int offset)
+ {
+ 	if (XLogMultipage.pages + XLogMultipage.size == page
+ 		&& XLogMultipage.offset + XLogMultipage.size == offset)
+ 	{	/* Pages are continuous. Append new page. */
+ 		XLogMultipage.size += size;
+ 	}
+ 	else
+ 	{	/* Pages are not continuous. Flush and clear. */
+ 		XLogMultipageFlush();
+ 		XLogMultipage.pages = page;
+ 		XLogMultipage.size = size;
+ 		XLogMultipage.offset = offset;
+ 	}
+ }
+ 
+ /* END : XLOG_MULTIPAGE_WRITER */
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 1139,1147 ****
  XLogWrite(XLogwrtRqst WriteRqst)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
- 	char	   *from;
  	bool		ispartialpage;
  	bool		use_existent;
  
  	/*
  	 * Update local LogwrtResult (caller probably did this already,
--- 1228,1241 ----
  XLogWrite(XLogwrtRqst WriteRqst)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
  	bool		use_existent;
+ 	int			currentIndex = Write->curridx;
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
+ #endif
  
  	/*
  	 * Update local LogwrtResult (caller probably did this already,
***************
*** 1157,1170 ****
  		 * end of the last page that's been initialized by
  		 * AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[Write->curridx].xlogid,
! 				 XLogCtl->xlblocks[Write->curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1251,1264 ----
  		 * end of the last page that's been initialized by
  		 * AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[currentIndex].xlogid,
! 				 XLogCtl->xlblocks[currentIndex].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1172,1177 ****
--- 1266,1272 ----
  			/*
  			 * Switch to new logfile segment.
  			 */
+ 			XLogMultipageFlush();
  			if (openLogFile >= 0)
  			{
  				if (close(openLogFile))
***************
*** 1242,1275 ****
  		{
  			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
  			openLogFile = XLogFileOpen(openLogId, openLogSeg);
- 			openLogOff = 0;
- 		}
- 
- 		/* Need to seek in the file? */
- 		if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
- 		{
- 			openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
- 			if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
- 				ereport(PANIC,
- 						(errcode_for_file_access(),
- 						 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
- 								openLogId, openLogSeg, openLogOff)));
  		}
  
! 		/* OK to write the page */
! 		from = XLogCtl->pages + Write->curridx * BLCKSZ;
! 		errno = 0;
! 		if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
! 		{
! 			/* if write didn't set errno, assume problem is no disk space */
! 			if (errno == 0)
! 				errno = ENOSPC;
! 			ereport(PANIC,
! 					(errcode_for_file_access(),
! 					 errmsg("could not write to log file %u, segment %u at offset %u: %m",
! 							openLogId, openLogSeg, openLogOff)));
! 		}
! 		openLogOff += BLCKSZ;
  
  		/*
  		 * If we just wrote the whole last page of a logfile segment,
--- 1337,1349 ----
  		{
  			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
  			openLogFile = XLogFileOpen(openLogId, openLogSeg);
  		}
  
! 		/* Add a page to buffer */
! 		XLogMultipageWrite(
! 			XLogCtl->pages + currentIndex * BLCKSZ,
! 			BLCKSZ,
! 			(LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize);
  
  		/*
  		 * If we just wrote the whole last page of a logfile segment,
***************
*** 1281,1288 ****
  		 * This is also the right place to notify the Archiver that the
  		 * segment is ready to copy to archival storage.
  		 */
! 		if (openLogOff >= XLogSegSize && !ispartialpage)
  		{
  			issue_xlog_fsync();
  			LogwrtResult.Flush = LogwrtResult.Write;	/* end of current page */
  
--- 1355,1363 ----
  		 * This is also the right place to notify the Archiver that the
  		 * segment is ready to copy to archival storage.
  		 */
! 		if (openLogOff + XLogMultipage.size >= XLogSegSize && !ispartialpage)
  		{
+ 			XLogMultipageFlush();
  			issue_xlog_fsync();
  			LogwrtResult.Flush = LogwrtResult.Write;	/* end of current page */
  
***************
*** 1296,1303 ****
  			LogwrtResult.Write = WriteRqst.Write;
  			break;
  		}
! 		Write->curridx = NextBufIdx(Write->curridx);
  	}
  
  	/*
  	 * If asked to flush, do so
--- 1371,1379 ----
  			LogwrtResult.Write = WriteRqst.Write;
  			break;
  		}
! 		currentIndex = NextBufIdx(currentIndex);
  	}
+ 	XLogMultipageFlush();
  
  	/*
  	 * If asked to flush, do so
***************
*** 1333,1338 ****
--- 1409,1416 ----
  		LogwrtResult.Flush = LogwrtResult.Write;
  	}
  
+ 	Write->curridx = currentIndex;
+ 
  	/*
  	 * Update shared-memory status
  	 *
***************
*** 1354,1359 ****
--- 1432,1442 ----
  	}
  
  	Write->LogwrtResult = LogwrtResult;
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
+ #endif
  }
  
  /*
***************
*** 1476,1481 ****
--- 1559,1567 ----
  			 "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
+ 
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "xlog multipage-write usage error at XLogFlush");
  }
  
  /*
***************
*** 3380,3386 ****
  		XLOGbuffers = MinXLOGbuffers;
  
  	return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
! 		+ BLCKSZ * XLOGbuffers +
  		MAXALIGN(sizeof(ControlFileData));
  }
  
--- 3466,3472 ----
  		XLOGbuffers = MinXLOGbuffers;
  
  	return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
! 		+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers +
  		MAXALIGN(sizeof(ControlFileData));
  }
  
***************
*** 3398,3404 ****
  		ShmemInitStruct("XLOG Ctl",
  						MAXALIGN(sizeof(XLogCtlData) +
  								 sizeof(XLogRecPtr) * XLOGbuffers)
! 						+ BLCKSZ * XLOGbuffers,
  						&foundXLog);
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
--- 3484,3490 ----
  		ShmemInitStruct("XLOG Ctl",
  						MAXALIGN(sizeof(XLogCtlData) +
  								 sizeof(XLogRecPtr) * XLOGbuffers)
! 						+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers,
  						&foundXLog);
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 3426,3433 ****
  	 * buffers have worst-case alignment.
  	 */
  	XLogCtl->pages =
! 		((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
! 									  sizeof(XLogRecPtr) * XLOGbuffers);
  	memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
  
  	/*
--- 3512,3519 ----
  	 * buffers have worst-case alignment.
  	 */
  	XLogCtl->pages =
! 		(char*)XLOG_BUFFERS_ALIGN(((char *) XLogCtl)
! 		+ sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
  	memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
  
  	/*
***************
*** 3485,3492 ****
--- 3571,3585 ----
  	/* First timeline ID is always 1 */
  	ThisTimeLineID = 1;
  
+ #if 1
+ 	buffer = (char *) XLOG_BUFFERS_ALIGN(
+ 		malloc(BLCKSZ + XLOG_EXTRA_BUFFERS) );
+ #else
  	/* Use malloc() to ensure buffer is MAXALIGNED */
  	buffer = (char *) malloc(BLCKSZ);
+ #endif
+ 	/* XXX: Does buffer memory-leak? */
+ 
  	page = (XLogPageHeader) buffer;
  	memset(buffer, 0, BLCKSZ);
  
***************
*** 5180,5185 ****
--- 5273,5285 ----
  		new_sync_bit = OPEN_DATASYNC_FLAG;
  	}
  #endif
+ #ifdef OPEN_DIRECT_FLAG
+ 	else if (pg_strcasecmp(method, "open_direct") == 0)
+ 	{
+ 		new_sync_method = SYNC_METHOD_OPEN;
+ 		new_sync_bit = OPEN_DIRECT_FLAG;
+ 	}
+ #endif
  	else
  		return NULL;
  
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#3)
Re: WAL: O_DIRECT and multipage-writer

ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:

I resend the patch with diff -c.

What does XLOG_EXTRA_BUFFERS accomplish?

Also, I'm worried that you broke something by not updating
Write->curridx immediately in XLogWrite. There certainly isn't going
to be any measurable performance boost from keeping that in a local
variable, so why take any risk?

regards, tom lane

#5ITAGAKI Takahiro
itagaki@tiara.ocn.ne.jp
In reply to: Tom Lane (#4)
Re: WAL: O_DIRECT and multipage-writer

Tom Lane <tgl@sss.pgh.pa.us> writes:

What does XLOG_EXTRA_BUFFERS accomplish?

It is because the buffer passed to direct-io must be aligned to
the same size of filesystem page, typically 4KB. Buffers allocated
with ShmemInitStruct are not necessarily aligned, so we need to allocate
extra buffers and align them by ourself.

Also, I'm worried that you broke something by not updating
Write->curridx immediately in XLogWrite. There certainly isn't going
to be any measurable performance boost from keeping that in a local
variable, so why take any risk?

Keeping Write->curridx in a local variable is not for performance,
but for writing multiple pages at once.
I think it is ok to update Write->curridx at the end of XLogWrite,
because XLogCtl.Write.curridx will be touched by only one process
at a time. Process-shared variables are not modified until XLogWrite
is completed, so that other backends can write same contents later
even if the backend in XLogWrite is crushed.

Sincerely,
ITAGAKI Takahiro

#6Mark Wong
markw@osdl.org
In reply to: ITAGAKI Takahiro (#1)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer

Hi everyone,

I gave this a try with DBT-2, but got a core dump on our ia64 system.
I hope this isn't a random thing, like I ran into previously. Maybe
I'll try again, but postgres dumped core. Binary and core here:
http://developer.osdl.org/markw/pgsql/core/2morefiles.tar.bz2

#0 FunctionCall2 (flinfo=0x0, arg1=0, arg2=0) at fmgr.c:1141
1141 result = FunctionCallInvoke(&fcinfo);
(gdb) bt
#0 FunctionCall2 (flinfo=0x0, arg1=0, arg2=0) at fmgr.c:1141
#1 0x40000000003bdb80 in FunctionCall2 (flinfo=Cannot access memory at address 0x0
) at fmgr.c:1141
#2 0x40000000003bdb80 in FunctionCall2 (flinfo=Cannot access memory at address 0x0
) at fmgr.c:1141

Over and over again, so I'll keep the backtrace short.

Mark

#7ITAGAKI Takahiro
itagaki@tiara.ocn.ne.jp
In reply to: Mark Wong (#6)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer

Thanks for testing, Mark!

I gave this a try with DBT-2, but got a core dump on our ia64 system.
I hope this isn't a random thing, like I ran into previously. Maybe
I'll try again, but postgres dumped core.

Sorry, this seems to be my patch's bug.
Which datatype did you compile with? LP64, ILP64, or LLP64?
If you used LLP64, I think the cause is buffer alignment routine
because of sizeof(long) != sizeof(void*).

I'll fix it soon...

ITAGAKI Takahiro

#8Mark Wong
markw@osdl.org
In reply to: ITAGAKI Takahiro (#7)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer

Hmm... I don't remember specifying a datatype. I suppose whatever the
default one is. :)

I'll be happy to test again, just let me know.

Mark

Show quoted text

On Fri, Jan 28, 2005 at 06:28:32AM +0900, ITAGAKI Takahiro wrote:

Thanks for testing, Mark!

I gave this a try with DBT-2, but got a core dump on our ia64 system.
I hope this isn't a random thing, like I ran into previously. Maybe
I'll try again, but postgres dumped core.

Sorry, this seems to be my patch's bug.
Which datatype did you compile with? LP64, ILP64, or LLP64?
If you used LLP64, I think the cause is buffer alignment routine
because of sizeof(long) != sizeof(void*).

I'll fix it soon...

ITAGAKI Takahiro

#9ITAGAKI Takahiro
itagaki.takahiro@lab.ntt.co.jp
In reply to: Mark Wong (#6)
1 attachment(s)
Re: [PATCHES] WAL: O_DIRECT and multipage-writer (+ memory leak)

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

1. Fix update timing of Write->curridx. (pointed by Tom)
Change to update it soon after write().

2. Fix buffer alignment routine on 64bit cpu. (pointed by Mark)
I checked it on Xeon EM64T and it worked properly, but I don't have IA64...

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

ITAGAKI Takahiro

Attachments:

xlog.c.diffapplication/octet-stream; name=xlog.c.diffDownload
*** xlog.org.c	Sat Jan  1 06:59:30 2005
--- xlog.c	Tue Feb  1 18:28:16 2005
***************
*** 43,48 ****
--- 43,82 ----
  #include "utils/guc.h"
  #include "utils/relcache.h"
  
+ /*-------------------------------------------------------------------------*/
+ 
+ #ifdef O_DIRECT
+ #	define OPEN_DIRECT_FLAG		O_DIRECT
+ #endif
+ 
+ #define XLOG_MULTIPAGE_WRITER_DEBUG
+ 
+ #define ISSUE_BOOTSTRAP_MEMORYLEAK		/* XXX: memory leak? */
+ 
+ /*-------------------------------------------------------------------------*/
+ 
+ /* O_DIRECT : BEGIN */
+ 
+ /* TODO: Aligment depends on OS and filesystem. */
+ #define O_DIRECT_BUFFER_ALIGN	4096
+ 
+ /* assume sizeof(ptrdiff_t) == sizeof(void*) */
+ #define POINTERALIGN(ALIGNVAL,PTR)  \
+ 	(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1)))
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	static char STATIC_ASSERT_POINTERSIZE[((int)(sizeof(ptrdiff_t) == sizeof(void*))) - 1];
+ #endif
+ 
+ #ifdef OPEN_DIRECT_FLAG
+ #	define XLOG_EXTRA_BUFFERS		O_DIRECT_BUFFER_ALIGN
+ #	define XLOG_BUFFERS_ALIGN(PTR)	POINTERALIGN(XLOG_EXTRA_BUFFERS, (PTR))
+ #else
+ #	define XLOG_EXTRA_BUFFERS		0
+ #	define XLOG_BUFFERS_ALIGN(PTR)	POINTERALIGN(MAXIMUM_ALIGNOF, (PTR))
+ #endif
+ 
+ /* O_DIRECT : END */
  
  /*
   * This chunk of hackery attempts to determine which file sync methods
***************
*** 465,470 ****
--- 499,583 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  
+ /* BEGIN : XLOG_MULTIPAGE_WRITER */
+ 
+ static struct XLogMultipageData
+ {
+ 	char	*pages;		/* Head of first page */
+ 	int		 size;		/* Total bytes of pages == count(pages) * BLCKSZ */
+ 	int		 offset;	/* Offset in xlog segment file  */
+ } XLogMultipage;
+ 
+ static void XLogMultipageFlush(int index)
+ {
+ 	if (!XLogMultipage.pages)
+ 	{	/* No needs to write pages. */
+ 		XLogCtl->Write.curridx = index;
+ 		return;
+ 	}
+ 	
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	{
+ 		int i = (XLogCtl->Write.curridx + XLogMultipage.size / BLCKSZ + XLOGbuffers - index) % XLOGbuffers;
+ 		if (i != 0 && i != 1)
+ 			elog(PANIC, "XLogMultipageFlush (%d)", __LINE__);
+ 	}
+ #endif
+ 
+ 	/* Need to seek in the file? */
+ 	if (openLogOff != XLogMultipage.offset)
+ 	{
+ 		openLogOff = XLogMultipage.offset;
+ 		if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+ 			ereport(PANIC,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+ 							openLogId, openLogSeg, openLogOff)));
+ 	}
+ 
+ 	/* OK to write the page */
+ 	errno = 0;
+ 	if (write(openLogFile, XLogMultipage.pages, XLogMultipage.size) != XLogMultipage.size)
+ 	{
+ 		/* if write didn't set errno, assume problem is no disk space */
+ 		if (errno == 0)
+ 			errno = ENOSPC;
+ 		ereport(PANIC,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to log file %u, segment %u at offset %u: %m",
+ 						openLogId, openLogSeg, openLogOff)));
+ 	}
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	elog(LOG, "XLogMultipageFlush writes %d pages.", XLogMultipage.size / BLCKSZ);
+ #endif
+ 
+ 	openLogOff += XLogMultipage.size;
+ 	XLogCtl->Write.curridx = index;
+ 	memset(&XLogMultipage, 0, sizeof(XLogMultipage));
+ }
+ 
+ static void XLogMultipageWrite(int index)
+ {
+ 	char *page = XLogCtl->pages + index * BLCKSZ;
+ 	int size = BLCKSZ;
+ 	int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+ 
+ 	if (XLogMultipage.pages + XLogMultipage.size == page
+ 		&& XLogMultipage.offset + XLogMultipage.size == offset)
+ 	{	/* Pages are continuous. Append new page. */
+ 		XLogMultipage.size += size;
+ 	}
+ 	else
+ 	{	/* Pages are not continuous. Flush and clear. */
+ 		XLogMultipageFlush(index);
+ 		XLogMultipage.pages = page;
+ 		XLogMultipage.size = size;
+ 		XLogMultipage.offset = offset;
+ 	}
+ }
+ 
+ /* END : XLOG_MULTIPAGE_WRITER */
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 1139,1147 ****
  XLogWrite(XLogwrtRqst WriteRqst)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
- 	char	   *from;
  	bool		ispartialpage;
  	bool		use_existent;
  
  	/*
  	 * Update local LogwrtResult (caller probably did this already,
--- 1252,1265 ----
  XLogWrite(XLogwrtRqst WriteRqst)
  {
  	XLogCtlWrite *Write = &XLogCtl->Write;
  	bool		ispartialpage;
  	bool		use_existent;
+ 	int			currentIndex = Write->curridx;
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
+ #endif
  
  	/*
  	 * Update local LogwrtResult (caller probably did this already,
***************
*** 1157,1170 ****
  		 * end of the last page that's been initialized by
  		 * AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[Write->curridx].xlogid,
! 				 XLogCtl->xlblocks[Write->curridx].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1275,1288 ----
  		 * end of the last page that's been initialized by
  		 * AdvanceXLInsertBuffer.
  		 */
! 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
  			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
  				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! 				 XLogCtl->xlblocks[currentIndex].xlogid,
! 				 XLogCtl->xlblocks[currentIndex].xrecoff);
  
  		/* Advance LogwrtResult.Write to end of current buffer page */
! 		LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
  		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
  
  		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1172,1177 ****
--- 1290,1296 ----
  			/*
  			 * Switch to new logfile segment.
  			 */
+ 			XLogMultipageFlush(currentIndex);
  			if (openLogFile >= 0)
  			{
  				if (close(openLogFile))
***************
*** 1242,1275 ****
  		{
  			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
  			openLogFile = XLogFileOpen(openLogId, openLogSeg);
- 			openLogOff = 0;
  		}
  
! 		/* Need to seek in the file? */
! 		if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
! 		{
! 			openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
! 			if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
! 				ereport(PANIC,
! 						(errcode_for_file_access(),
! 						 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
! 								openLogId, openLogSeg, openLogOff)));
! 		}
! 
! 		/* OK to write the page */
! 		from = XLogCtl->pages + Write->curridx * BLCKSZ;
! 		errno = 0;
! 		if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
! 		{
! 			/* if write didn't set errno, assume problem is no disk space */
! 			if (errno == 0)
! 				errno = ENOSPC;
! 			ereport(PANIC,
! 					(errcode_for_file_access(),
! 					 errmsg("could not write to log file %u, segment %u at offset %u: %m",
! 							openLogId, openLogSeg, openLogOff)));
! 		}
! 		openLogOff += BLCKSZ;
  
  		/*
  		 * If we just wrote the whole last page of a logfile segment,
--- 1361,1370 ----
  		{
  			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
  			openLogFile = XLogFileOpen(openLogId, openLogSeg);
  		}
  
! 		/* Add a page to buffer */
! 		XLogMultipageWrite(currentIndex);
  
  		/*
  		 * If we just wrote the whole last page of a logfile segment,
***************
*** 1281,1288 ****
  		 * This is also the right place to notify the Archiver that the
  		 * segment is ready to copy to archival storage.
  		 */
! 		if (openLogOff >= XLogSegSize && !ispartialpage)
  		{
  			issue_xlog_fsync();
  			LogwrtResult.Flush = LogwrtResult.Write;	/* end of current page */
  
--- 1376,1384 ----
  		 * This is also the right place to notify the Archiver that the
  		 * segment is ready to copy to archival storage.
  		 */
! 		if (openLogOff + XLogMultipage.size >= XLogSegSize && !ispartialpage)
  		{
+ 			XLogMultipageFlush(currentIndex);
  			issue_xlog_fsync();
  			LogwrtResult.Flush = LogwrtResult.Write;	/* end of current page */
  
***************
*** 1296,1303 ****
  			LogwrtResult.Write = WriteRqst.Write;
  			break;
  		}
! 		Write->curridx = NextBufIdx(Write->curridx);
  	}
  
  	/*
  	 * If asked to flush, do so
--- 1392,1400 ----
  			LogwrtResult.Write = WriteRqst.Write;
  			break;
  		}
! 		currentIndex = NextBufIdx(currentIndex);
  	}
+ 	XLogMultipageFlush(currentIndex);
  
  	/*
  	 * If asked to flush, do so
***************
*** 1333,1338 ****
--- 1430,1440 ----
  		LogwrtResult.Flush = LogwrtResult.Write;
  	}
  
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	if (Write->curridx != currentIndex)
+ 		elog(PANIC, "Write->curridx != currentIndex (%d) : %d != %d", __LINE__, Write->curridx, currentIndex);
+ #endif
+ 
  	/*
  	 * Update shared-memory status
  	 *
***************
*** 1354,1359 ****
--- 1456,1466 ----
  	}
  
  	Write->LogwrtResult = LogwrtResult;
+ 
+ #ifdef XLOG_MULTIPAGE_WRITER_DEBUG
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "XLogMultipage.pages not null (%d) : size=%d", __LINE__, XLogMultipage.size);
+ #endif
  }
  
  /*
***************
*** 1476,1481 ****
--- 1583,1591 ----
  			 "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
+ 
+ 	if (XLogMultipage.pages)
+ 		elog(PANIC, "xlog multipage-write usage error at XLogFlush");
  }
  
  /*
***************
*** 3380,3386 ****
  		XLOGbuffers = MinXLOGbuffers;
  
  	return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
! 		+ BLCKSZ * XLOGbuffers +
  		MAXALIGN(sizeof(ControlFileData));
  }
  
--- 3490,3496 ----
  		XLOGbuffers = MinXLOGbuffers;
  
  	return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
! 		+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers +
  		MAXALIGN(sizeof(ControlFileData));
  }
  
***************
*** 3398,3404 ****
  		ShmemInitStruct("XLOG Ctl",
  						MAXALIGN(sizeof(XLogCtlData) +
  								 sizeof(XLogRecPtr) * XLOGbuffers)
! 						+ BLCKSZ * XLOGbuffers,
  						&foundXLog);
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
--- 3508,3514 ----
  		ShmemInitStruct("XLOG Ctl",
  						MAXALIGN(sizeof(XLogCtlData) +
  								 sizeof(XLogRecPtr) * XLOGbuffers)
! 						+ XLOG_EXTRA_BUFFERS + BLCKSZ * XLOGbuffers,
  						&foundXLog);
  	ControlFile = (ControlFileData *)
  		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
***************
*** 3426,3433 ****
  	 * buffers have worst-case alignment.
  	 */
  	XLogCtl->pages =
! 		((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
! 									  sizeof(XLogRecPtr) * XLOGbuffers);
  	memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
  
  	/*
--- 3536,3543 ----
  	 * buffers have worst-case alignment.
  	 */
  	XLogCtl->pages =
! 		(char*)XLOG_BUFFERS_ALIGN(((char *) XLogCtl)
! 		+ sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
  	memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
  
  	/*
***************
*** 3465,3470 ****
--- 3575,3584 ----
  	struct timeval tv;
  	crc64		crc;
  
+ #ifdef ISSUE_BOOTSTRAP_MEMORYLEAK
+ 	void* buffer0;
+ #endif
+ 
  	/*
  	 * Select a hopefully-unique system identifier code for this
  	 * installation. We use the result of gettimeofday(), including the
***************
*** 3485,3492 ****
--- 3599,3616 ----
  	/* First timeline ID is always 1 */
  	ThisTimeLineID = 1;
  
+ 	/* XXX: Does buffer leak? */
+ #ifdef ISSUE_BOOTSTRAP_MEMORYLEAK
+ 	buffer0 = malloc(BLCKSZ + XLOG_EXTRA_BUFFERS);
+ 	buffer = (char *) XLOG_BUFFERS_ALIGN(buffer0);
+ #elif 1
+ 	buffer = (char *) XLOG_BUFFERS_ALIGN(
+ 		malloc(BLCKSZ + XLOG_EXTRA_BUFFERS) );
+ #else
  	/* Use malloc() to ensure buffer is MAXALIGNED */
  	buffer = (char *) malloc(BLCKSZ);
+ #endif
+ 
  	page = (XLogPageHeader) buffer;
  	memset(buffer, 0, BLCKSZ);
  
***************
*** 3576,3581 ****
--- 3700,3709 ----
  	/* Bootstrap the commit log, too */
  	BootStrapCLOG();
  	BootStrapSUBTRANS();
+ 
+ #ifdef ISSUE_BOOTSTRAP_MEMORYLEAK
+ 	free(buffer0);
+ #endif
  }
  
  static char *
***************
*** 5180,5185 ****
--- 5308,5320 ----
  		new_sync_bit = OPEN_DATASYNC_FLAG;
  	}
  #endif
+ #ifdef OPEN_DIRECT_FLAG
+ 	else if (pg_strcasecmp(method, "open_direct") == 0)
+ 	{
+ 		new_sync_method = SYNC_METHOD_OPEN;
+ 		new_sync_bit = OPEN_DIRECT_FLAG;
+ 	}
+ #endif
  	else
  		return NULL;
  
#10Bruce Momjian
pgman@candle.pha.pa.us
In reply to: ITAGAKI Takahiro (#1)
Re: WAL: O_DIRECT and multipage-writer

This thread has been saved for the 8.1 release:

http://momjian.postgresql.org/cgi-bin/pgpatches2

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:

Hello, all.

I think that there is room for improvement in WAL.
Here is a patch for it.
- Multiple pages are written in one write() if it is contiguous.
- Add 'open_direct' to wal_sync_method.

WAL writer writes one page in one write(). This is not efficient
when wal_sync_method is 'open_sync', because the writer waits for
IO completions at each write(). Multipage-writer can reduce syscalls
and improve IO throughput.

'open_direct' uses O_DIRECT instead of O_SYNC. O_DIRECT implies synchronous
writing, so it may show the tendency like open_sync. But maybe it can reduce
memcpy() and save OS's disk cache memory.

I benchmarked this patch with pgbench. It works well and
improved 50% of tps on my machine. WAL seems to be bottle-neck
on machines with poor disks.

This patch has not yet tested enough. I would like it to be examined much
and taken into PostgreSQL.

There are still many TODOs:
* Is this logic really correct?
- O_DIRECT_BUFFER_ALIGN should be adjusted to runtime, not compile time.
- Consider to use writev() instead of write().
Buffers are noncontiguous when WAL ring buffer rotates.
- If wan_sync_method is not open_direct, XLOG_EXTRA_BUFFERS can be 0.

Sincerely,
ITAGAKI Takahiro

-- pgbench result --

$ ./pgbench -s 100 -c 50 -t 400

- 8.0.0 default + fsync:
tps = 20.630632 (including connections establishing)
tps = 20.636768 (excluding connections establishing)
- multipage-writer + open_direct:
tps = 33.761917 (including connections establishing)
tps = 33.778320 (excluding connections establishing)

Environment:
OS : Linux kernel 2.6.9
CPU : Pentium 4 3GHz
disk : ATA 5400rpm (Data and WAL are placed on same partition.)
memory : 1GB
config : shared_buffers=10000, wal_buffers=256,
XLOG_SEG_SIZE=256MB, checkpoint_segment=4

---
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp>
NTT Cyber Space Laboratories
Nippon Telegraph and Telephone Corporation.

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11Bruce Momjian
pgman@candle.pha.pa.us
In reply to: ITAGAKI Takahiro (#9)
Re: [PATCHES] WAL: O_DIRECT and multipage-writer (+ memory

This has been saved for the 8.1 release:

http://momjian.postgresql.org/cgi-bin/pgpatches2

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

1. Fix update timing of Write->curridx. (pointed by Tom)
Change to update it soon after write().

2. Fix buffer alignment routine on 64bit cpu. (pointed by Mark)
I checked it on Xeon EM64T and it worked properly, but I don't have IA64...

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

ITAGAKI Takahiro

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#12Mark Wong
markw@osdl.org
In reply to: ITAGAKI Takahiro (#9)
Re: [PATCHES] WAL: O_DIRECT and multipage-writer (+ memory leak)

On Thu, Feb 03, 2005 at 07:25:55PM +0900, ITAGAKI Takahiro wrote:

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

Ok, finally got back into the office and was able to run 1 set of
tests.

So the new baseline result with 8.0.1:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
Throughput: 3639.97

Results with the patch but open_direct not set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/308/
Throughput: 3494.72

Results with the patch and open_direct set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/312/
Throughput: 3489.69

You can verify that the wall_sync_method is set to open_direct under
the "database parameters" link, but I'm wondering if I missed
something. It looks a little odd the the performance dropped.

Mark

#13Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Wong (#12)
Re: [PATCHES] WAL: O_DIRECT and multipage-writer (+

On Tue, 2005-03-01 at 13:53 -0800, Mark Wong wrote:

On Thu, Feb 03, 2005 at 07:25:55PM +0900, ITAGAKI Takahiro wrote:

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

Ok, finally got back into the office and was able to run 1 set of
tests.

So the new baseline result with 8.0.1:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
Throughput: 3639.97

Results with the patch but open_direct not set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/308/
Throughput: 3494.72

Results with the patch and open_direct set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/312/
Throughput: 3489.69

You can verify that the wall_sync_method is set to open_direct under
the "database parameters" link, but I'm wondering if I missed
something. It looks a little odd the the performance dropped.

Is there anything more to say on this?
Is it case-closed, or is there further work underway - I can't see any
further chat on this thread.

These results show it doesn't work better on larger systems. The
original testing showed it worked better on smaller systems - is there
still scope to include this for smaller configs?

If not, thanks for taking the time to write the patch and investigate
whether changes in this area would help. Not every performance patch
improves things, but that doesn't mean we shouldn't try...

Best Regards, Simon Riggs

#14Mark Wong
markw@osdl.org
In reply to: ITAGAKI Takahiro (#1)
Re: WAL: O_DIRECT and multipage-writer

On Tue, Jan 25, 2005 at 06:06:23PM +0900, ITAGAKI Takahiro wrote:

Environment:
OS : Linux kernel 2.6.9
CPU : Pentium 4 3GHz
disk : ATA 5400rpm (Data and WAL are placed on same partition.)
memory : 1GB
config : shared_buffers=10000, wal_buffers=256,
XLOG_SEG_SIZE=256MB, checkpoint_segment=4

Hi Itagaki,

In light of this thread, have you compared the performance on
Linux-2.4?

Direct io on block device has performance regression on 2.6.x kernel
http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/0328.html

Mark

#15ITAGAKI Takahiro
itagaki.takahiro@lab.ntt.co.jp
In reply to: Mark Wong (#14)
Re: WAL: O_DIRECT and multipage-writer

Hi, Mark.

Mark Wong <markw@osdl.org> wrote:

In light of this thread, have you compared the performance on
Linux-2.4?

No, but I'm just testing my patch on Linux-2.4 with a middle-range server.
I will report the results sometime soon.

By the way, I found the debug option (XLOG_MULTIPAGE_WRITER_DEBUG) was enabled
in your prior benchmarks. It writes logs a lot.
I hope performance doesn't fall at least without debug.

So the new baseline result with 8.0.1:
Throughput: 3639.97
Results with the patch but open_direct not set:
Throughput: 3494.72
Results with the patch and open_direct set:
Throughput: 3489.69

---
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp>
NTT Cyber Space Laboratories

#16Mark Wong
markw@osdl.org
In reply to: ITAGAKI Takahiro (#15)
Re: WAL: O_DIRECT and multipage-writer

On Wed, Mar 23, 2005 at 01:55:46PM +0900, ITAGAKI Takahiro wrote:

Hi, Mark.

Mark Wong <markw@osdl.org> wrote:

In light of this thread, have you compared the performance on
Linux-2.4?

No, but I'm just testing my patch on Linux-2.4 with a middle-range server.
I will report the results sometime soon.

By the way, I found the debug option (XLOG_MULTIPAGE_WRITER_DEBUG) was enabled
in your prior benchmarks. It writes logs a lot.
I hope performance doesn't fall at least without debug.

That's 'log_min_messages = debug1' right? Maybe I should give it
another go with that set back to the default.

Thanks,
Mark

#17Bruce Momjian
pgman@candle.pha.pa.us
In reply to: ITAGAKI Takahiro (#9)
1 attachment(s)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory

ITAGAKI Takahiro wrote:

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

1. Fix update timing of Write->curridx. (pointed by Tom)
Change to update it soon after write().

2. Fix buffer alignment routine on 64bit cpu. (pointed by Mark)
I checked it on Xeon EM64T and it worked properly, but I don't have IA64...

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

Does the following patch fix the memory leak you described?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/freetext/plainDownload
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.195
diff -c -c -r1.195 xlog.c
*** src/backend/access/transam/xlog.c	2 Jun 2005 05:55:28 -0000	1.195
--- src/backend/access/transam/xlog.c	5 Jun 2005 03:38:23 -0000
***************
*** 3754,3759 ****
--- 3754,3760 ----
  	BootStrapCLOG();
  	BootStrapSUBTRANS();
  	BootStrapMultiXact();
+ 	free(buffer);
  }
  
  static char *
#18Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#13)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer (+

Yes, I assume that the patch to group the writes isn't something we want
right now, and the one for O_DIRECT is going to need an additional
fsync, and I have asked for testing on that.

I have posted a patch that I think fixes the memory leak reported and am
waiting for feedback on that.

---------------------------------------------------------------------------

Simon Riggs wrote:

On Tue, 2005-03-01 at 13:53 -0800, Mark Wong wrote:

On Thu, Feb 03, 2005 at 07:25:55PM +0900, ITAGAKI Takahiro wrote:

Hello everyone.

I fixed two bugs in the patch that I sent before.
Check and test new one, please.

Ok, finally got back into the office and was able to run 1 set of
tests.

So the new baseline result with 8.0.1:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
Throughput: 3639.97

Results with the patch but open_direct not set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/308/
Throughput: 3494.72

Results with the patch and open_direct set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/312/
Throughput: 3489.69

You can verify that the wall_sync_method is set to open_direct under
the "database parameters" link, but I'm wondering if I missed
something. It looks a little odd the the performance dropped.

Is there anything more to say on this?
Is it case-closed, or is there further work underway - I can't see any
further chat on this thread.

These results show it doesn't work better on larger systems. The
original testing showed it worked better on smaller systems - is there
still scope to include this for smaller configs?

If not, thanks for taking the time to write the patch and investigate
whether changes in this area would help. Not every performance patch
improves things, but that doesn't mean we shouldn't try...

Best Regards, Simon Riggs

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#17)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory

Bruce Momjian <pgman@candle.pha.pa.us> writes:

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

Does the following patch fix the memory leak you described?

You realize this is a waste of code, no? It's not like the bootstrap
subprocess frees every single bit of memory it ever allocates, and even
less like it'd be profitable to try to make it do so ... the memory
will go away anyway when the subprocess exits.

regards, tom lane

#20ITAGAKI Takahiro
itagaki.takahiro@lab.ntt.co.jp
In reply to: Bruce Momjian (#17)
Memory leak in BootStrapXLOG()

Bruce Momjian <pgman@candle.pha.pa.us> wrote:

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

Does the following patch fix the memory leak you described?

Yes, the revised patch has no leak by using stack instead of malloc().
This leak is trivial, but anyway direct io needs an aligned buffer.
IMHO any of the following is ok.

[A] 1st patch
char *buffer;
void* buffer0;
buffer0 = malloc(BLCKSZ + XLOG_EXTRA_BUFFERS);
buffer = (char *) XLOG_BUFFERS_ALIGN(buffer0);
free(buffer0);

[B] 2nd patch
char *buffer;
char buffer0[BLCKSZ + XLOG_EXTRA_BUFFERS + MAXIMUM_ALIGNOF];
buffer = XLOG_BUFFERS_ALIGN(buffer0);

[C] following code is simple if we don't care the memory leak.
char *buffer;
buffer = XLOG_BUFFERS_ALIGN( malloc(BLCKSZ + XLOG_EXTRA_BUFFERS) );

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

#21Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#19)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

BTW, I found memory leak in BootStrapXLOG(). The buffer allocated by malloc()
is not free()ed. ISSUE_BOOTSTRAP_MEMORYLEAK in this patch points out it.
(But this leak is not serious, because this function is called only once.)

Does the following patch fix the memory leak you described?

You realize this is a waste of code, no? It's not like the bootstrap
subprocess frees every single bit of memory it ever allocates, and even
less like it'd be profitable to try to make it do so ... the memory
will go away anyway when the subprocess exits.

I guess, but the person reported a leak so I figured I would fix it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
Re: [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

You realize this is a waste of code, no?

I guess, but the person reported a leak so I figured I would fix it.

I don't really care whether there's a free() there or not --- what
bothers me is that calling it a leak shows a fundamental
misunderstanding of the environment in which that code runs, which
is likely to lead to proposed changes that have less trivial
consequences.

regards, tom lane