COPY speedup

Started by Pierre Frédéric Caillaudover 16 years ago13 messages

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

1 attachment(s)

Backups always take too long...
COPY TO is CPU bound...

A few days of coding later, I think I'm on to something.

First the results.
All tables are cached in RAM (not in shared_buffers though).
Timings are best of 4 tries.

- test_one_int is a table with 1 INT column and 10.000.000 rows (from
generate_series)

SELECT count(*) FROM test_one_int :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|---------|---------|-----------------------------
2.040 | --- | 150.22 | 4903.03 | 4.90 | 8.4.0 / compiled from source

* count(*) gives a reference timing for scanning a table

* reduced per-row overhead
-> COPY BINARY faster than count(*) for a 1-column table
-> COPY BIANRY faster than "SELECT * WHERE x=-1" for a 1-column table
which doesn't contain the value "-1"

* reduced per-tuple overhead
-> COPY BINARY 3.15 times faster

COPY test_one_int TO '/dev/null' :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|-------|---------|---------|-----------------------------
4.879 | 1.48 x | 62.80 | 2049.78 | 2.05 | 8.4.0 / copy to patch 4
7.198 | --- | 42.56 | 1389.25 | 1.39 | 8.4.0 / compiled from source

* reduced per-row and per-tuple overheads
-> COPY 1.48x faster

* Patched Binary mode is 3.4x faster than un-patched text mode

*******************************************************************

- test_many_ints is a table with 26 INT column and 1.000.000 rows

SELECT count(*) FROM test_many_ints :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|---------|---------|-----------------------------
0.275 | --- | 465.88 | 3637.45 | 94.57 | 8.4.0 / copy to patch 4

COPY test_many_ints TO '/dev/null' BINARY :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|-------|--------|---------|-----------------------------
1.706 | 5.19 x | 75.08 | 586.23 | 15.24 | 8.4.0 / copy to patch 4
8.861 | --- | 14.45 | 112.85 | 2.93 | 8.4.0 / compiled from source

COPY test_many_ints TO '/dev/null' :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|--------|---------|-----------------------------
8.941 | 1.36 x | 14.32 | 111.84 | 2.91 | 8.4.0 / copy to patch 4
12.149 | --- | 10.54 | 82.31 | 2.14 | 8.4.0 / compiled from source

* Patched Binary mode is 7.1x faster than un-patched text mode

*******************************************************************

- annonces is a 340MB table with a mix of ints, smallints, bools, date,
timestamp, etc, and a text field averaging 230 bytes

SELECT count(*) FROM annonces :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|---------|---------|-----------------------------
0.349 | --- | 933.45 | 1184.91 | 46.21 | 8.4.0 / copy to patch 4

* Patched Binary mode is 4.7x faster than un-patched text mode

COPY annonces TO '/dev/null' :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|-------|---------|-----------------------------
9.600 | 1.06 x | 33.93 | 43.08 | 1.68 | 8.4.0 / copy to patch 4
10.147 | --- | 32.10 | 40.75 | 1.59 | 8.4.0 / compiled from source

* Here, COPY isn't much faster : most of the time is actually spent
converting the DATE and TIMESTAMP columns to strings.
* In binary mode, such conversions are not needed.

*******************************************************************

- archive is 416MB, the same as annonces, without the text field, and many
more rows

SELECT count(*) FROM archive_data :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|---------|---------|-----------------------------
0.844 | --- | 470.60 | 3135.89 | 87.81 | 8.4.0 / copy to patch 4

* Patched Binary mode is 6.4x faster than un-patched text mode

COPY archive_data TO '/dev/null' :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|-------|---------|-----------------------------
28.471 | 1.21 x | 13.95 | 92.99 | 2.60 | 8.4.0 / copy to patch 4
34.344 | --- | 11.57 | 77.09 | 2.16 | 8.4.0 / compiled from source

* Most of the time is again spent converting the DATE and TIMESTAMP
columns to strings.

*******************************************************************

* Why ?

COPY in text mode should be "fast enough" but will never be really fast
because many types need complicated conversions.
COPY BINARY has drawbacks (not very portable...) so, to justify its
existence, it should compensate with a massive speed increase over text
mode, which is not the case in 8.4.

* How ?

- Created a new "WBuf" auto-flushing buffer type. It looks like a
StringInfo, but :
- it has a flush callback
- you add data to it in little pieces
- when it is full, it sends the buffer contents to the flush callback
- it never makes any palloc calls except on creation

- fmgr.c
- new way of calling SendFuncs and OutFuncs which uses the existing
"context" field
- copy.c passes WBuf through this context
- SendFuncs check if they are called with a context
- if yes, write directly to the buffer
- if no, previous behaviour remains, return a BYTEA

- copy.c
- creates a WBuf
- sets the flush callback to do the right thing (write file, send to
frontend, etc)
- writes data like headers and delimiters to it
- pass it to the SendFuncs
- if a SendFunc returns a BYTEA (because it has not been updated to write
directly to the buffer), use the BYTEA
- if not, do nothing, the data is already sent

- pqcomm.c
- removed memcpy on large blocks

- others
- removed some memcpy
- inlines added at strategic points (total size of executable is
actually smaller with the patch)

* Results

See performance numbers above.
It does generate the same output as old COPY (ie there doesn't seem to be
any bugs)
Should be 100% backward compatible, no syscatalogs change.

* Side effects

Uses less memory for big TEXT or BYTEA fields, since less copies are made.
This could be extended to make serialisation of query results sent to the
frontend faster.
Breaks COPY CSV, I need to fix it (it's simple, I just didn't have time).
I have ideas for COPY FROM too.

* Thoughts

Not for commitfest ;) it's too late.
Patch needs refactoring.
Maybe fuse StringInfo and WBuf together, with bits of pq_send*, maybe not,
have to think about this.
Some types have extremely slow outfuncs (for instance, box).

COPY BINARY should include in its header (in the variable-length field
specified for this), a sample of all types used in the table, that are
non-portable.
For instance, put a TIMESTAMP of a known value. On reading, check this :
if it's wrong, perhaps the dump was generated by a postgres with float
timestamps ?...
This would have another advantage : with the column types stored in the
header, you'd no longer ask yourself "hmmmm..... how many alter tables did
I do since I made this dump that doesn't seem to want to load ?..."
Also, currently you can load a binary dump of INTs in a DATE column and it
will work perfectly OK (except the dates will be rubbish of course).

With this patch COPY BINARY gets fast enough to sometimes saturate a
gigabit ethernet link...

Patch is for 8.4.0, if someone wants to try it.

Attachments:

pg_8.4.0_copyto_patch_4.txttext/plain; name=pg_8.4.0_copyto_patch_4.txtDownload

diff -rupN postgresql-8.4.0-orig/src/backend/commands/copy.c postgresql-8.4.0-copy/src/backend/commands/copy.c
--- postgresql-8.4.0-orig/src/backend/commands/copy.c	2009-06-11 16:48:55.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/commands/copy.c	2009-08-11 11:20:49.000000000 +0200
@@ -42,6 +42,10 @@
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
+#define COPY_BUFFER_SIZE 131072
 
 #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
 #define OCTVALUE(c) ((c) - '0')
@@ -91,6 +95,9 @@ typedef struct CopyStateData
 	FILE	   *copy_file;		/* used if copy_dest == COPY_FILE */
 	StringInfo	fe_msgbuf;		/* used for all dests during COPY TO, only for
 								 * dest == COPY_NEW_FE in COPY FROM */
+	WBuf		wbuf;
+	OutputFunctionContext	*out_context;
+	
 	bool		fe_copy;		/* true for all FE copy dests */
 	bool		fe_eof;			/* true if detected end of copy data */
 	EolType		eol_type;		/* EOL type of input */
@@ -267,15 +274,10 @@ static char *limit_printout_length(const
 static void SendCopyBegin(CopyState cstate);
 static void ReceiveCopyBegin(CopyState cstate);
 static void SendCopyEnd(CopyState cstate);
-static void CopySendData(CopyState cstate, void *databuf, int datasize);
-static void CopySendString(CopyState cstate, const char *str);
-static void CopySendChar(CopyState cstate, char c);
 static void CopySendEndOfRow(CopyState cstate);
 static int CopyGetData(CopyState cstate, void *databuf,
 			int minread, int maxread);
-static void CopySendInt32(CopyState cstate, int32 val);
 static bool CopyGetInt32(CopyState cstate, int32 *val);
-static void CopySendInt16(CopyState cstate, int16 val);
 static bool CopyGetInt16(CopyState cstate, int16 *val);
 
 
@@ -347,6 +349,7 @@ ReceiveCopyBegin(CopyState cstate)
 		pq_endmessage(&buf);
 		cstate->copy_dest = COPY_NEW_FE;
 		cstate->fe_msgbuf = makeStringInfo();
+		cstate->wbuf = NULL;
 	}
 	else if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 2)
 	{
@@ -379,51 +382,92 @@ SendCopyEnd(CopyState cstate)
 	{
 		/* Shouldn't have any unsent data */
 		Assert(cstate->fe_msgbuf->len == 0);
+		Assert( wbuf_is_empty( cstate->wbuf ));
+		
 		/* Send Copy Done message */
 		pq_putemptymessage('c');
 	}
 	else
 	{
-		CopySendData(cstate, "\\.", 2);
 		/* Need to flush out the trailer (this also appends a newline) */
+		wbuf_put( cstate->wbuf, "\\.", 2 );
 		CopySendEndOfRow(cstate);
+		wbuf_flush( cstate->wbuf );
 		pq_endcopyout(false);
 	}
 }
 
-/*----------
- * CopySendData sends output data to the destination (file or frontend)
- * CopySendString does the same for null-terminated strings
- * CopySendChar does the same for single characters
- * CopySendEndOfRow does the appropriate thing at end of each data row
- *	(data is not actually flushed except by CopySendEndOfRow)
- *
- * NB: no data conversion is applied by these functions
- *----------
- */
-static void
-CopySendData(CopyState cstate, void *databuf, int datasize)
-{
-	appendBinaryStringInfo(cstate->fe_msgbuf, (char *) databuf, datasize);
-}
+/******** data sinks for copy output buffer ***********/
 
-static void
-CopySendString(CopyState cstate, const char *str)
+static void wbufCopySink_COPY_FILE( WBuf buf, const char* data, int length )
 {
-	appendBinaryStringInfo(cstate->fe_msgbuf, str, strlen(str));
+	CopyState cstate = (CopyState) buf->sink_arg;
+	
+	/* First, send any data in the WBuf, since we were called because it needs flushing. */
+	(void) fwrite( buf->data, wbuf_get_used_space( buf ), 1, cstate->copy_file );
+	if( ferror(cstate->copy_file) )
+		goto error;
+	
+	/* It is now empty. */
+	wbuf_reset( buf );
+
+	/* send any extra data */
+	if( data && length )
+	{
+		(void) fwrite(data, length, 1, cstate->copy_file );
+		if( ferror(cstate->copy_file) )
+			goto error;
+	}
+		
+	return;
+
+error:
+	if (ferror(cstate->copy_file))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to COPY file: %m")));
+}
+
+static void wbufCopySink_COPY_OLD_FE( WBuf buf, const char* data, int length )
+{
+	/* First, send any data in the WBuf, since we were called because it needs flushing. */
+	if (pq_putbytes(buf->data, wbuf_get_used_space( buf )))
+		goto error;
+	
+	/* It is now empty. */
+	wbuf_reset( buf );
+
+	/* send any extra data */
+	if( data && length )
+		if (pq_putbytes( data, length ))
+			goto error;
+
+	return;
+	
+error:
+	/* no hope of recovering connection sync, so FATAL */
+	ereport(FATAL,
+			(errcode(ERRCODE_CONNECTION_FAILURE),
+			 errmsg("connection lost during COPY to stdout")));
+}
+
+static void wbufCopySink_COPY_COPY_NEW_FE( WBuf buf, const char* data, int length )
+{
+	/* First, send any data in the WBuf, since we were called because it needs flushing. */
+	(void) pq_putmessage('d', buf->data, wbuf_get_used_space( buf ));	
+	
+	/* It is now empty. */
+	wbuf_reset( buf );
+	
+	/* send any extra data */
+	if( data && length )
+		(void) pq_putmessage('d',data, length);	
 }
 
-static void
-CopySendChar(CopyState cstate, char c)
-{
-	appendStringInfoCharMacro(cstate->fe_msgbuf, c);
-}
 
 static void
 CopySendEndOfRow(CopyState cstate)
 {
-	StringInfo	fe_msgbuf = cstate->fe_msgbuf;
-
 	switch (cstate->copy_dest)
 	{
 		case COPY_FILE:
@@ -431,43 +475,23 @@ CopySendEndOfRow(CopyState cstate)
 			{
 				/* Default line termination depends on platform */
 #ifndef WIN32
-				CopySendChar(cstate, '\n');
+				wbuf_put_char( cstate->wbuf, '\n');
 #else
-				CopySendString(cstate, "\r\n");
+				wbuf_put_string( cstate->wbuf, "\r\n");
 #endif
 			}
-
-			(void) fwrite(fe_msgbuf->data, fe_msgbuf->len,
-						  1, cstate->copy_file);
-			if (ferror(cstate->copy_file))
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not write to COPY file: %m")));
 			break;
 		case COPY_OLD_FE:
 			/* The FE/BE protocol uses \n as newline for all platforms */
 			if (!cstate->binary)
-				CopySendChar(cstate, '\n');
-
-			if (pq_putbytes(fe_msgbuf->data, fe_msgbuf->len))
-			{
-				/* no hope of recovering connection sync, so FATAL */
-				ereport(FATAL,
-						(errcode(ERRCODE_CONNECTION_FAILURE),
-						 errmsg("connection lost during COPY to stdout")));
-			}
+				wbuf_put_char( cstate->wbuf, '\n');
 			break;
 		case COPY_NEW_FE:
 			/* The FE/BE protocol uses \n as newline for all platforms */
 			if (!cstate->binary)
-				CopySendChar(cstate, '\n');
-
-			/* Dump the accumulated row as one CopyData message */
-			(void) pq_putmessage('d', fe_msgbuf->data, fe_msgbuf->len);
+				wbuf_put_char( cstate->wbuf, '\n');
 			break;
 	}
-
-	resetStringInfo(fe_msgbuf);
 }
 
 /*
@@ -581,23 +605,6 @@ CopyGetData(CopyState cstate, void *data
 	return bytesread;
 }
 
-
-/*
- * These functions do apply some data conversion
- */
-
-/*
- * CopySendInt32 sends an int32 in network byte order
- */
-static void
-CopySendInt32(CopyState cstate, int32 val)
-{
-	uint32		buf;
-
-	buf = htonl((uint32) val);
-	CopySendData(cstate, &buf, sizeof(buf));
-}
-
 /*
  * CopyGetInt32 reads an int32 that appears in network byte order
  *
@@ -618,18 +625,6 @@ CopyGetInt32(CopyState cstate, int32 *va
 }
 
 /*
- * CopySendInt16 sends an int16 in network byte order
- */
-static void
-CopySendInt16(CopyState cstate, int16 val)
-{
-	uint16		buf;
-
-	buf = htons((uint16) val);
-	CopySendData(cstate, &buf, sizeof(buf));
-}
-
-/*
  * CopyGetInt16 reads an int16 that appears in network byte order
  */
 static bool
@@ -883,6 +878,11 @@ DoCopy(const CopyStmt *stmt, const char 
 	 * more than strictly necessary, but seems best for consistency and
 	 * future-proofing.  Likewise we disallow all digits though only octal
 	 * digits are actually dangerous.
+	 *
+	 * note : do not change this behavior, since the OutputFunctions that go directly
+	 * through the output wbuf bypass any encoding verification on COPY TO. This is fine
+	 * for things like numbers, bool( 't', 'f' ) etc, unless someone uses a character like '1'
+	 * as a delimiter...
 	 */
 	if (!cstate->csv_mode &&
 		strchr("\\.abcdefghijklmnopqrstuvwxyz0123456789",
@@ -1311,8 +1311,23 @@ CopyTo(CopyState cstate)
 	num_phys_attrs = tupDesc->natts;
 	cstate->null_print_client = cstate->null_print;		/* default */
 
-	/* We use fe_msgbuf as a per-row buffer regardless of copy_dest */
-	cstate->fe_msgbuf = makeStringInfo();
+	/* We no longer use fe_msgbuf as a per-row buffer regardless of copy_dest */
+	cstate->fe_msgbuf = NULL; 
+	
+	/* Create a WBuf as a multirow buffer to store data to be forwarded to the COPY destination */
+	cstate->wbuf = wbuf_make( COPY_BUFFER_SIZE );
+	
+	/* Setup forwarding in wbuf : when it needs flushing, it will call the flush function */
+	switch( cstate->copy_dest )
+	{
+		case COPY_OLD_FE:	wbuf_set_sink( cstate->wbuf, wbufCopySink_COPY_OLD_FE, (void *)cstate );			break;
+		case COPY_FILE:		wbuf_set_sink( cstate->wbuf, wbufCopySink_COPY_FILE, (void *)cstate );			break;
+		case COPY_NEW_FE:	wbuf_set_sink( cstate->wbuf, wbufCopySink_COPY_COPY_NEW_FE, (void *)cstate );		break;
+	}
+	
+	/* Setup output context to help output functions */
+	cstate->out_context = makeNode( OutputFunctionContext );
+	cstate->out_context->out_buf = cstate->wbuf;	/* allows output functions to pass data "almost" directly to the client */
 
 	/* Get info about the columns we need to process. */
 	cstate->out_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo));
@@ -1351,21 +1366,23 @@ CopyTo(CopyState cstate)
 		int32		tmp;
 
 		/* Signature */
-		CopySendData(cstate, (char *) BinarySignature, 11);
+		wbuf_put( cstate->wbuf, (char *) BinarySignature, 11 );
+		
 		/* Flags field */
 		tmp = 0;
 		if (cstate->oids)
 			tmp |= (1 << 16);
-		CopySendInt32(cstate, tmp);
+		wbuf_put_int32_net( cstate->wbuf, tmp);
+		
 		/* No header extension */
 		tmp = 0;
-		CopySendInt32(cstate, tmp);
+		wbuf_put_int32_net( cstate->wbuf, tmp);
 	}
 	else
 	{
 		/*
 		 * For non-binary copy, we need to convert null_print to client
-		 * encoding, because it will be sent directly with CopySendString.
+		 * encoding, because it will be sent directly with wbuf_put_string.
 		 */
 		if (cstate->need_transcoding)
 			cstate->null_print_client = pg_server_to_client(cstate->null_print,
@@ -1382,7 +1399,7 @@ CopyTo(CopyState cstate)
 				char	   *colname;
 
 				if (hdr_delim)
-					CopySendChar(cstate, cstate->delim[0]);
+					wbuf_put_char( cstate->wbuf, cstate->delim[0]);
 				hdr_delim = true;
 
 				colname = NameStr(attr[attnum - 1]->attname);
@@ -1429,10 +1446,15 @@ CopyTo(CopyState cstate)
 	if (cstate->binary)
 	{
 		/* Generate trailer for a binary copy */
-		CopySendInt16(cstate, -1);
-		/* Need to flush out the trailer */
+		wbuf_put_int16_net(cstate->wbuf, -1);
+		
+		/* Send trailer */
 		CopySendEndOfRow(cstate);
 	}
+	
+	/* Flush buffer */
+	
+	wbuf_flush( cstate->wbuf );
 
 	MemoryContextDelete(cstate->rowcontext);
 }
@@ -1455,13 +1477,13 @@ CopyOneRowTo(CopyState cstate, Oid tuple
 	if (cstate->binary)
 	{
 		/* Binary per-tuple header */
-		CopySendInt16(cstate, list_length(cstate->attnumlist));
+		wbuf_put_int16_net( cstate->wbuf, list_length(cstate->attnumlist));
 		/* Send OID if wanted --- note attnumlist doesn't include it */
 		if (cstate->oids)
 		{
 			/* Hack --- assume Oid is same size as int32 */
-			CopySendInt32(cstate, sizeof(int32));
-			CopySendInt32(cstate, tupleOid);
+			wbuf_put_int32_net( cstate->wbuf, sizeof(int32));
+			wbuf_put_int32_net( cstate->wbuf, tupleOid);
 		}
 	}
 	else
@@ -1472,7 +1494,7 @@ CopyOneRowTo(CopyState cstate, Oid tuple
 		{
 			string = DatumGetCString(DirectFunctionCall1(oidout,
 												ObjectIdGetDatum(tupleOid)));
-			CopySendString(cstate, string);
+			wbuf_put_string( cstate->wbuf, string);
 			need_delim = true;
 		}
 	}
@@ -1486,39 +1508,45 @@ CopyOneRowTo(CopyState cstate, Oid tuple
 		if (!cstate->binary)
 		{
 			if (need_delim)
-				CopySendChar(cstate, cstate->delim[0]);
+				wbuf_put_char( cstate->wbuf, cstate->delim[0]);
 			need_delim = true;
-		}
-
-		if (isnull)
-		{
-			if (!cstate->binary)
-				CopySendString(cstate, cstate->null_print_client);
+			
+			if (isnull)
+				wbuf_put_string( cstate->wbuf, cstate->null_print_client);
 			else
-				CopySendInt32(cstate, -1);
+			{
+				string = OutputFunctionCallContext(&out_functions[attnum - 1],
+											value,
+											(fmNodePtr)cstate->out_context);
+				
+				if( string )	/* if NULL, SendFunction has written directly to our output buffer, we have nothing to do. */
+				{
+					if (cstate->csv_mode)
+						CopyAttributeOutCSV(cstate, string,
+											cstate->force_quote_flags[attnum - 1],
+											list_length(cstate->attnumlist) == 1);
+					else
+						CopyAttributeOutText(cstate, string);
+				}
+			}
+			
 		}
 		else
 		{
-			if (!cstate->binary)
-			{
-				string = OutputFunctionCall(&out_functions[attnum - 1],
-											value);
-				if (cstate->csv_mode)
-					CopyAttributeOutCSV(cstate, string,
-										cstate->force_quote_flags[attnum - 1],
-										list_length(cstate->attnumlist) == 1);
-				else
-					CopyAttributeOutText(cstate, string);
-			}
+			if (isnull)
+				wbuf_put_int32_net( cstate->wbuf, -1);
 			else
 			{
 				bytea	   *outputbytes;
 
-				outputbytes = SendFunctionCall(&out_functions[attnum - 1],
-											   value);
-				CopySendInt32(cstate, VARSIZE(outputbytes) - VARHDRSZ);
-				CopySendData(cstate, VARDATA(outputbytes),
-							 VARSIZE(outputbytes) - VARHDRSZ);
+				outputbytes = SendFunctionCallContext(&out_functions[attnum - 1],
+											    value,
+												(fmNodePtr)cstate->out_context);
+				if( outputbytes )	/* if NULL, SendFunction has written directly to our output buffer, we have nothing to do. */
+				{
+					wbuf_put_int32_net( cstate->wbuf, VARSIZE(outputbytes) - VARHDRSZ);
+					wbuf_put(cstate->wbuf, VARDATA(outputbytes), VARSIZE(outputbytes) - VARHDRSZ);
+				}
 			}
 		}
 	}
@@ -3107,7 +3135,7 @@ CopyReadBinaryAttribute(CopyState cstate
 #define DUMPSOFAR() \
 	do { \
 		if (ptr > start) \
-			CopySendData(cstate, start, ptr - start); \
+			wbuf_put(wbuf, start, ptr - start); \
 	} while (0)
 
 static void
@@ -3117,7 +3145,8 @@ CopyAttributeOutText(CopyState cstate, c
 	char	   *start;
 	char		c;
 	char		delimc = cstate->delim[0];
-
+	WBuf		wbuf = cstate->wbuf;
+	
 	if (cstate->need_transcoding)
 		ptr = pg_server_to_client(string, strlen(string));
 	else
@@ -3181,14 +3210,14 @@ CopyAttributeOutText(CopyState cstate, c
 				}
 				/* if we get here, we need to convert the control char */
 				DUMPSOFAR();
-				CopySendChar(cstate, '\\');
-				CopySendChar(cstate, c);
+				wbuf_put_char( cstate->wbuf, '\\');
+				wbuf_put_char( cstate->wbuf, c);
 				start = ++ptr;	/* do not include char in next run */
 			}
 			else if (c == '\\' || c == delimc)
 			{
 				DUMPSOFAR();
-				CopySendChar(cstate, '\\');
+				wbuf_put_char( cstate->wbuf, '\\');
 				start = ptr++;	/* we include char in next run */
 			}
 			else if (IS_HIGHBIT_SET(c))
@@ -3241,14 +3270,14 @@ CopyAttributeOutText(CopyState cstate, c
 				}
 				/* if we get here, we need to convert the control char */
 				DUMPSOFAR();
-				CopySendChar(cstate, '\\');
-				CopySendChar(cstate, c);
+				wbuf_put_char( cstate->wbuf, '\\');
+				wbuf_put_char( cstate->wbuf, c);
 				start = ++ptr;	/* do not include char in next run */
 			}
 			else if (c == '\\' || c == delimc)
 			{
 				DUMPSOFAR();
-				CopySendChar(cstate, '\\');
+				wbuf_put_char( cstate->wbuf, '\\');
 				start = ptr++;	/* we include char in next run */
 			}
 			else
@@ -3273,6 +3302,7 @@ CopyAttributeOutCSV(CopyState cstate, ch
 	char		delimc = cstate->delim[0];
 	char		quotec = cstate->quote[0];
 	char		escapec = cstate->escape[0];
+	WBuf		wbuf = cstate->wbuf;
 
 	/* force quoting if it matches null_print (before conversion!) */
 	if (!use_quote && strcmp(string, cstate->null_print) == 0)
@@ -3315,7 +3345,7 @@ CopyAttributeOutCSV(CopyState cstate, ch
 
 	if (use_quote)
 	{
-		CopySendChar(cstate, quotec);
+		wbuf_put_char( cstate->wbuf, quotec);
 
 		/*
 		 * We adopt the same optimization strategy as in CopyAttributeOutText
@@ -3326,7 +3356,7 @@ CopyAttributeOutCSV(CopyState cstate, ch
 			if (c == quotec || c == escapec)
 			{
 				DUMPSOFAR();
-				CopySendChar(cstate, escapec);
+				wbuf_put_char( cstate->wbuf, escapec);
 				start = ptr;	/* we include char in next run */
 			}
 			if (IS_HIGHBIT_SET(c) && cstate->encoding_embeds_ascii)
@@ -3336,12 +3366,12 @@ CopyAttributeOutCSV(CopyState cstate, ch
 		}
 		DUMPSOFAR();
 
-		CopySendChar(cstate, quotec);
+		wbuf_put_char( cstate->wbuf, quotec);
 	}
 	else
 	{
 		/* If it doesn't need quoting, we can just dump it as-is */
-		CopySendString(cstate, ptr);
+		wbuf_put_string( cstate->wbuf, ptr);
 	}
 }
 
diff -rupN postgresql-8.4.0-orig/src/backend/lib/Makefile postgresql-8.4.0-copy/src/backend/lib/Makefile
--- postgresql-8.4.0-orig/src/backend/lib/Makefile	2008-02-19 11:30:07.000000000 +0100
+++ postgresql-8.4.0-copy/src/backend/lib/Makefile	2009-08-10 15:33:50.000000000 +0200
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dllist.o stringinfo.o
+OBJS = dllist.o stringinfo.o wbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff -rupN postgresql-8.4.0-orig/src/backend/lib/wbuf.c postgresql-8.4.0-copy/src/backend/lib/wbuf.c
--- postgresql-8.4.0-orig/src/backend/lib/wbuf.c	1970-01-01 01:00:00.000000000 +0100
+++ postgresql-8.4.0-copy/src/backend/lib/wbuf.c	2009-08-11 02:02:33.000000000 +0200
@@ -0,0 +1,290 @@
+/*-------------------------------------------------------------------------
+ *
+ * stringinfo.c
+ *
+ * StringInfo provides an indefinitely-extensible string data type.
+ * It can be used to buffer either ordinary C strings (null-terminated text)
+ * or arbitrary binary data.  All storage is allocated with palloc().
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  $PostgreSQL: pgsql/src/backend/lib/stringinfo.c,v 1.50 2009/01/01 17:23:42 momjian Exp $
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "lib/rwbuf.h"
+#include "utils/memutils.h"
+
+/*
+ * wbuf_flush
+ *
+ * Calls the sink() if there is data to flush, which empties the WBuf.
+ * 
+ */
+void wbuf_flush( WBuf buf )
+{
+	Assert( buf->sink );
+	if( ! wbuf_is_empty( buf ))				/* flushing an empty buffer should not call the sink function */
+		buf->sink( buf, NULL, 0 );	/* Call sink function */
+}
+
+/*
+ * wbuf_flush_forced
+ *
+ * Calls the sink(), which empties the WBuf. Does not check presence of data.
+ * 
+ */
+void wbuf_flush_forced( WBuf buf )
+{
+	Assert( buf->sink );
+	buf->sink( buf, NULL, 0 );	/* Call sink function */
+}
+
+/*
+ *	Ensures there is "needed" bytes in the buffer.
+ *	This is only called by the inlined wbuf_reserve if the buffer
+ *	needs to be flushed.
+ */
+void wbuf_reserve2_( WBuf buf, int needed )
+{
+	Assert( buf->sink );
+	buf->sink( buf, NULL, needed );	/* Call sink function, it will throw an error if needed */
+	Assert( wbuf_has_space( wbuf, needed ));	/* check if sink() did its job */
+}
+
+void wbuf_put_big( WBuf buf, const void *data, int datalen )
+{
+	Assert( data );
+	
+	/* empty buffer (it's probably full since we were called by wbuf_put) */
+	Assert( buf->sink );
+	if( ! wbuf_is_empty( buf ))				/* flushing an empty buffer should not call the sink function */
+		buf->sink( buf, NULL, datalen );	/* Call sink function */
+	
+	/* pipe rest of data only if it's needed, or worth it */
+	if( wbuf_has_space( buf, datalen ) && datalen < 4096 ) 
+	{
+		memcpy( buf->wp, data, datalen );
+		buf->wp += datalen;
+	}
+	else
+		buf->sink( buf, data, datalen );
+}
+
+void wbuf_put_string_big( WBuf buf, const char *str )
+{
+	int n;
+	
+	Assert( str );
+
+	wbuf_flush( buf );	/* empty buffer (it's probably full since we were called by wbuf_putString) */
+
+	/* try to put the rest of the string in the buffer in case it's short enough */
+	n = wbuf_get_free_space( buf );
+	if( n > 4096 ) 
+		n = 4096;
+	
+	if( strncpypartial( &buf->wp, &str, n ))
+	{
+		buf->wp--;	/* forget the '\0' */
+		return;
+	}
+	
+	/* insuficient space, send the rest (str and buf->wp were updated by strncpypartial() */
+	buf->sink( buf, str, strlen( str ));	/* pipe rest of data */
+}
+
+void wbuf_put_string_0_big( WBuf buf, const char *str )
+{
+	int n;
+	
+	Assert( str );
+
+	wbuf_flush( buf );	/* empty buffer (it's probably full since we were called by wbuf_putString) */
+
+	/* try to put the rest of the string in the buffer in case it's short enough */
+	n = wbuf_get_free_space( buf );
+	if( n > 4096 ) 
+		n = 4096;
+	
+	if( strncpypartial( &buf->wp, &str, n ))
+		return;
+	
+	/* insuficient space, send the rest (str and buf->wp were updated by strncpypartial() */
+	buf->sink( buf, str, strlen( str ) + 1 );	/* pipe rest of data including terminator */
+}
+
+
+/* --------------------------------
+ *		pq_sendint64	- append a binary 8-byte int to a StringInfo buffer
+ *
+ * It is tempting to merge this with pq_sendint, but we'd have to make the
+ * argument int64 for all data widths --- that could be a big performance
+ * hit on machines where int64 isn't efficient.
+ * --------------------------------
+ */
+void wbuf_put_int64_net(WBuf buf, int64 i)
+{
+	uint32		n32;
+
+	/* High order half first, since we're doing MSB-first */
+#ifdef INT64_IS_BUSTED
+	/* don't try a right shift of 32 on a 32-bit word */
+	n32 = (i < 0) ? -1 : 0;
+#else
+	n32 = (uint32) (i >> 32);
+#endif
+	n32 = htonl(n32);
+	wbuf_put(buf, &n32, 4);
+
+	/* Now the low order half */
+	n32 = (uint32) i;
+	n32 = htonl(n32);
+	wbuf_put(buf, &n32, 4);
+}
+
+/* --------------------------------
+ *		pq_sendfloat4	- append a float4 to a StringInfo buffer
+ *
+ * The point of this routine is to localize knowledge of the external binary
+ * representation of float4, which is a component of several datatypes.
+ *
+ * We currently assume that float4 should be byte-swapped in the same way
+ * as int4.  This rule is not perfect but it gives us portability across
+ * most IEEE-float-using architectures.
+ * --------------------------------
+ */
+void wbuf_put_float4_net( WBuf buf, float4 f )
+{
+	union
+	{
+		float4		f;
+		uint32		i;
+	}			swap;
+
+	swap.f = f;
+	swap.i = htonl(swap.i);
+
+	wbuf_put(buf, &swap.i, 4);
+}
+
+/* --------------------------------
+ *		pq_sendfloat8	- append a float8 to a StringInfo buffer
+ *
+ * The point of this routine is to localize knowledge of the external binary
+ * representation of float8, which is a component of several datatypes.
+ *
+ * We currently assume that float8 should be byte-swapped in the same way
+ * as int8.  This rule is not perfect but it gives us portability across
+ * most IEEE-float-using architectures.
+ * --------------------------------
+ */
+void wbuf_put_float8_net( WBuf buf, float8 f )
+{
+#ifdef INT64_IS_BUSTED
+	union
+	{
+		float8		f;
+		uint32		h[2];
+	}			swap;
+
+	swap.f = f;
+	swap.h[0] = htonl(swap.h[0]);
+	swap.h[1] = htonl(swap.h[1]);
+
+#ifdef WORDS_BIGENDIAN
+	/* machine seems to be big-endian, send h[0] first */
+	wbuf_put(buf, (char *) &swap.h[0], 4);
+	wbuf_put(buf, (char *) &swap.h[1], 4);
+#else
+	/* machine seems to be little-endian, send h[1] first */
+	wbuf_put(buf, (char *) &swap.h[1], 4);
+	wbuf_put(buf, (char *) &swap.h[0], 4);
+#endif
+#else							/* INT64 works */
+	union
+	{
+		float8		f;
+		int64		i;
+	}			swap;
+
+	swap.f = f;
+	wbuf_put_int64_net(buf, swap.i);
+#endif
+}
+
+
+
+/*
+ *	strncpypartial
+ *
+ *	Combines strncpy and a kind of strlen in one call
+ *
+ *	source is copied into dest.
+ *
+ *	pdest	is a char** since this function modifies it
+ *	psource is a char** since this function modifies it
+ *	dest_size is the size of buffer dest. At most, dest_size bytes are copied, including the '\0' terminator.
+ *
+ *	If the '\0' terminator could be copied, all is well.
+ *		The return value is true (success).
+ *		Unlike strncpy(), the rest of dest buffer is not zeroed.
+ *		Caller finds *pdest and *psource modified : 
+ * 			they point after the last byte copied ('\0' in this case.)
+ * 
+ *	If there was not enough space to copy everything, the copy stops after writing the last byte in "dest".
+ * 		The return value is false (failure).
+ *		Caller finds *pdest and *psource modified : 
+ * 			they point after the last byte copied (which is NOT '\0' in this case.)
+ * 			Caller can copy the rest of the string in another buffer simply by reusing the same pointers.
+ */
+bool strncpypartial( char ** pdest, const char ** psource, int dest_size )
+{
+	register char c;
+	char 		*dest	= *pdest;
+	const char	*source	= *psource;
+	int n4;
+
+	/* unroll loop for speed */
+	for( n4 = dest_size >> 2; n4; n4-- )
+	{
+		*dest++ = c = *source++;	
+		if( c != '\0')
+		{
+			*dest++ = c = *source++;	
+			if( c != '\0')
+			{
+				*dest++ = c = *source++;	
+				if( c != '\0')
+				{
+					*dest++ = c = *source++;	
+					if( c != '\0')
+						continue;
+				}
+			}
+		}
+		/* \0 terminator was found and copied */
+		goto success;
+	}
+
+	/* process remaining chars */
+	for( dest_size&=3; dest_size; dest_size-- )
+	{
+		*dest++ = c = *source++;	
+		if (c == '\0')
+			goto success;	/* \0 terminator was found and copied */
+	}
+	
+	/* \0 terminator was not copied. */
+	*pdest		= dest;
+	*psource	= source;
+	return false;
+	
+success:
+	*pdest		= dest;
+	*psource	= source;
+	return true;
+}
diff -rupN postgresql-8.4.0-orig/src/backend/libpq/pqcomm.c postgresql-8.4.0-copy/src/backend/libpq/pqcomm.c
--- postgresql-8.4.0-orig/src/backend/libpq/pqcomm.c	2009-01-01 18:23:42.000000000 +0100
+++ postgresql-8.4.0-copy/src/backend/libpq/pqcomm.c	2009-08-11 02:30:36.000000000 +0200
@@ -123,7 +123,7 @@ static bool DoingCopyOut;
 /* Internal functions */
 static void pq_close(int code, Datum arg);
 static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static int	internal_flush(const char *send_instead, size_t len);
 
 #ifdef HAVE_UNIX_SOCKETS
 static int	Lock_AF_UNIX(unsigned short portNumber, char *unixSocketName);
@@ -1035,12 +1035,25 @@ internal_putbytes(const char *s, size_t 
 {
 	size_t		amount;
 
+	/* if there is a large amount of data to send, 
+	 * rather than memcpy(), we could send it directly... */
+			
+	amount = PQ_BUFFER_SIZE - PqSendPointer;
+	if( len > amount && len > (PQ_BUFFER_SIZE/2) )
+	{
+		if (internal_flush( s, len ))	/* flushes internal buffer and sends data at (s,len) */
+			return EOF;
+		return 0;
+	}
+	
 	while (len > 0)
 	{
 		/* If buffer is full, then flush it out */
 		if (PqSendPointer >= PQ_BUFFER_SIZE)
-			if (internal_flush())
+		{
+			if (internal_flush( NULL, 0 ))
 				return EOF;
+		}
 		amount = PQ_BUFFER_SIZE - PqSendPointer;
 		if (amount > len)
 			amount = len;
@@ -1067,19 +1080,32 @@ pq_flush(void)
 	if (PqCommBusy)
 		return 0;
 	PqCommBusy = true;
-	res = internal_flush();
+	res = internal_flush( NULL, 0 );
 	PqCommBusy = false;
 	return res;
 }
 
 static int
-internal_flush(void)
+internal_flush(const char *send_instead, size_t len)
 {
 	static int	last_reported_send_errno = 0;
 
 	char	   *bufptr = PqSendBuffer;
 	char	   *bufend = PqSendBuffer + PqSendPointer;
 
+	/* when sending lots of data, avoid memcpy(), but first flush this buffer */
+	if( send_instead )
+	{
+		int r;
+		
+		if( bufptr != bufend )
+			if(( r = internal_flush( NULL, 0 )))
+				return r;
+			
+		bufptr = (char*)send_instead;
+		bufend = bufptr + len;
+	}
+	
 	while (bufptr < bufend)
 	{
 		int			r;
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/bool.c postgresql-8.4.0-copy/src/backend/utils/adt/bool.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/bool.c	2009-06-11 16:49:03.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/bool.c	2009-08-11 00:22:41.000000000 +0200
@@ -20,6 +20,9 @@
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 /*
  * Try to interpret value as boolean value.  Valid values are: true,
  * false, yes, no, on, off, 1, 0; as well as unique prefixes thereof.
@@ -163,11 +166,23 @@ Datum
 boolout(PG_FUNCTION_ARGS)
 {
 	bool		b = PG_GETARG_BOOL(0);
-	char	   *result = (char *) palloc(2);
 
-	result[0] = (b) ? 't' : 'f';
-	result[1] = '\0';
-	PG_RETURN_CSTRING(result);
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	if( ctx )
+	{
+		Assert( IsA( ctx->out_buf, OutputFunctionContext ));
+		
+		wbuf_put_char( ctx->out_buf, (b) ? 't' : 'f' );
+
+		PG_RETURN_CSTRING( NULL );	/* return a null pointer */
+	}
+	else
+	{
+		char	   *result = (char *) palloc(2);
+		result[0] = (b) ? 't' : 'f';	/*fixme: why palloc a constant string ? */
+		result[1] = '\0';
+		PG_RETURN_CSTRING( result );
+	}
 }
 
 /*
@@ -193,11 +208,27 @@ Datum
 boolsend(PG_FUNCTION_ARGS)
 {
 	bool		arg1 = PG_GETARG_BOOL(0);
-	StringInfoData buf;
 
-	pq_begintypsend(&buf);
-	pq_sendbyte(&buf, arg1 ? 1 : 0);
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, 1 );			/* length */
+		wbuf_put_char( out_buf, arg1 ? 1 : 0 );		/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		pq_sendbyte(&buf, arg1 ? 1 : 0);
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 /*
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/date.c postgresql-8.4.0-copy/src/backend/utils/adt/date.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/date.c	2009-06-11 16:49:03.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/date.c	2009-08-11 02:10:32.000000000 +0200
@@ -29,6 +29,9 @@
 #include "utils/date.h"
 #include "utils/nabstime.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 /*
  * gcc's -ffast-math switch breaks routines that expect exact results from
  * expressions like timeval / SECS_PER_HOUR, where timeval is double.
@@ -214,11 +217,27 @@ Datum
 date_send(PG_FUNCTION_ARGS)
 {
 	DateADT		date = PG_GETARG_DATEADT(0);
-	StringInfoData buf;
-
-	pq_begintypsend(&buf);
-	pq_sendint(&buf, date, sizeof(date));
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		Assert( sizeof( date ) == 4 );
+		
+		wbuf_put_int32_net( out_buf, 4 );		/* length */
+		wbuf_put_int32_net( out_buf, date );				/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		pq_sendint(&buf, date, sizeof(date));
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 /*
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/float.c postgresql-8.4.0-copy/src/backend/utils/adt/float.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/float.c	2009-06-11 16:49:03.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/float.c	2009-08-11 02:00:44.000000000 +0200
@@ -24,6 +24,9 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 
 #ifndef M_PI
 /* from my RH5.2 gcc math.h file - thomas 2000-04-03 */
@@ -359,11 +362,26 @@ Datum
 float4send(PG_FUNCTION_ARGS)
 {
 	float4		num = PG_GETARG_FLOAT4(0);
-	StringInfoData buf;
-
-	pq_begintypsend(&buf);
-	pq_sendfloat4(&buf, num);
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, 4 );		/* length */
+		wbuf_put_float4_net( out_buf, num );				/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		pq_sendfloat4(&buf, num);
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 /*
@@ -548,11 +566,26 @@ Datum
 float8send(PG_FUNCTION_ARGS)
 {
 	float8		num = PG_GETARG_FLOAT8(0);
-	StringInfoData buf;
-
-	pq_begintypsend(&buf);
-	pq_sendfloat8(&buf, num);
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, 8 );		/* length */
+		wbuf_put_float8_net( out_buf, num );				/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		pq_sendfloat8(&buf, num);
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/geo_ops.c postgresql-8.4.0-copy/src/backend/utils/adt/geo_ops.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/geo_ops.c	2009-06-23 18:25:02.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/geo_ops.c	2009-08-11 02:34:54.000000000 +0200
@@ -23,6 +23,9 @@
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 #ifndef M_PI
 #define M_PI 3.14159265358979323846
 #endif
@@ -458,14 +461,32 @@ Datum
 box_send(PG_FUNCTION_ARGS)
 {
 	BOX		   *box = PG_GETARG_BOX_P(0);
-	StringInfoData buf;
-
-	pq_begintypsend(&buf);
-	pq_sendfloat8(&buf, box->high.x);
-	pq_sendfloat8(&buf, box->high.y);
-	pq_sendfloat8(&buf, box->low.x);
-	pq_sendfloat8(&buf, box->low.y);
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, 32 );		/* length */
+		wbuf_put_float8_net( out_buf, box->high.x );		/* data */
+		wbuf_put_float8_net( out_buf, box->high.y );		/* data */
+		wbuf_put_float8_net( out_buf, box->low.x );			/* data */
+		wbuf_put_float8_net( out_buf, box->low.y );			/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		pq_sendfloat8(&buf, box->high.x);
+		pq_sendfloat8(&buf, box->high.y);
+		pq_sendfloat8(&buf, box->low.x);
+		pq_sendfloat8(&buf, box->low.y);
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/int.c postgresql-8.4.0-copy/src/backend/utils/adt/int.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/int.c	2009-01-01 18:23:49.000000000 +0100
+++ postgresql-8.4.0-copy/src/backend/utils/adt/int.c	2009-08-11 00:57:17.000000000 +0200
@@ -37,6 +37,8 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
 
 #define SAMESIGN(a,b)	(((a) < 0) == ((b) < 0))
 
@@ -72,10 +74,24 @@ Datum
 int2out(PG_FUNCTION_ARGS)
 {
 	int16		arg1 = PG_GETARG_INT16(0);
-	char	   *result = (char *) palloc(7);	/* sign, 5 digits, '\0' */
-
-	pg_itoa(arg1, result);
-	PG_RETURN_CSTRING(result);
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	if( ctx )
+	{
+		char * result;
+		Assert( IsA( ctx->out_buf, OutputFunctionContext ));
+		
+		result = wbuf_ensure( ctx->out_buf, 7 );
+
+		pg_itoa(arg1, result);
+		wbuf_increment( ctx->out_buf, strlen( result ));
+		PG_RETURN_CSTRING( NULL );	/* return a null pointer (not a null datum !)*/
+	}
+	else
+	{
+		char	   *result = (char *) palloc(7);	/* sign, 5 digits, '\0' */
+		pg_itoa(arg1, result);
+		PG_RETURN_CSTRING(result);
+	}
 }
 
 /*
@@ -96,11 +112,26 @@ Datum
 int2send(PG_FUNCTION_ARGS)
 {
 	int16		arg1 = PG_GETARG_INT16(0);
-	StringInfoData buf;
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, sizeof( arg1 ) );		/* length */
+		wbuf_put_int16_net( out_buf, arg1 );				/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
 
-	pq_begintypsend(&buf);
-	pq_sendint(&buf, arg1, sizeof(int16));
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+		pq_begintypsend(&buf);
+		pq_sendint(&buf, arg1, sizeof(int16));
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 /*
@@ -282,10 +313,24 @@ Datum
 int4out(PG_FUNCTION_ARGS)
 {
 	int32		arg1 = PG_GETARG_INT32(0);
-	char	   *result = (char *) palloc(12);	/* sign, 10 digits, '\0' */
-
-	pg_ltoa(arg1, result);
-	PG_RETURN_CSTRING(result);
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	if( ctx )
+	{
+		char * result;
+		Assert( IsA( ctx->out_buf, OutputFunctionContext ));
+		
+		result = wbuf_ensure( ctx->out_buf, 12 );	/* sign, 10 digits, '\0' */
+
+		pg_ltoa(arg1, result);
+		wbuf_increment( ctx->out_buf, strlen( result ));
+		PG_RETURN_CSTRING(0);	/* return a null pointer */
+	}
+	else
+	{
+		char	   *result = (char *) palloc(12);	/* sign, 10 digits, '\0' */
+		pg_ltoa(arg1, result);
+		PG_RETURN_CSTRING(result);
+	}
 }
 
 /*
@@ -306,11 +351,26 @@ Datum
 int4send(PG_FUNCTION_ARGS)
 {
 	int32		arg1 = PG_GETARG_INT32(0);
-	StringInfoData buf;
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, sizeof( arg1 ) );		/* length */
+		wbuf_put_int32_net( out_buf, arg1 );				/* data */
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
 
-	pq_begintypsend(&buf);
-	pq_sendint(&buf, arg1, sizeof(int32));
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+		pq_begintypsend(&buf);
+		pq_sendint(&buf, arg1, sizeof(int32));
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}		
 }
 
 
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/timestamp.c postgresql-8.4.0-copy/src/backend/utils/adt/timestamp.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/timestamp.c	2009-06-11 16:49:04.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/timestamp.c	2009-08-11 02:05:46.000000000 +0200
@@ -32,6 +32,9 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 /*
  * gcc's -ffast-math switch breaks routines that expect exact results from
  * expressions like timeval / SECS_PER_HOUR, where timeval is double.
@@ -275,15 +278,34 @@ Datum
 timestamp_send(PG_FUNCTION_ARGS)
 {
 	Timestamp	timestamp = PG_GETARG_TIMESTAMP(0);
-	StringInfoData buf;
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	
+	if( ctx )	/* fast buffer write */
+	{
+		/* get output buffer */
+		WBuf out_buf = ctx->out_buf;
+		Assert( IsA( fcinfo->context, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( out_buf, 8 );		/* length */
+#ifdef HAVE_INT64_TIMESTAMP
+		wbuf_put_int64_net(out_buf, timestamp);
+#else
+		wbuf_put_float8_net(out_buf, timestamp);
+#endif
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else	/* palloc a bytea */
+	{
+		StringInfoData buf;
 
-	pq_begintypsend(&buf);
+		pq_begintypsend(&buf);
 #ifdef HAVE_INT64_TIMESTAMP
-	pq_sendint64(&buf, timestamp);
+		pq_sendint64(&buf, timestamp);
 #else
-	pq_sendfloat8(&buf, timestamp);
+		pq_sendfloat8(&buf, timestamp);
 #endif
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 Datum
diff -rupN postgresql-8.4.0-orig/src/backend/utils/adt/varlena.c postgresql-8.4.0-copy/src/backend/utils/adt/varlena.c
--- postgresql-8.4.0-orig/src/backend/utils/adt/varlena.c	2009-06-11 16:49:04.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/adt/varlena.c	2009-08-11 01:35:13.000000000 +0200
@@ -27,6 +27,9 @@
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
 
+#include "lib/rwbuf.h"
+#include "utils/io_context.h"
+
 
 typedef struct varlena unknown;
 
@@ -391,11 +394,23 @@ Datum
 textsend(PG_FUNCTION_ARGS)
 {
 	text	   *t = PG_GETARG_TEXT_PP(0);
-	StringInfoData buf;
+	OutputFunctionContext *ctx = (OutputFunctionContext*) fcinfo->context;
+	if( ctx )
+	{
+		Assert( IsA( ctx->out_buf, OutputFunctionContext ));
+		
+		wbuf_put_int32_net( ctx->out_buf,  VARSIZE_ANY_EXHDR(t) );		/* length */		
+		wbuf_put( ctx->out_buf, VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));
+		PG_RETURN_BYTEA_P( NULL );
+	}
+	else
+	{
+		StringInfoData buf;
 
-	pq_begintypsend(&buf);
-	pq_sendtext(&buf, VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+		pq_begintypsend(&buf);
+		pq_sendtext(&buf, VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
 
 
diff -rupN postgresql-8.4.0-orig/src/backend/utils/fmgr/fmgr.c postgresql-8.4.0-copy/src/backend/utils/fmgr/fmgr.c
--- postgresql-8.4.0-orig/src/backend/utils/fmgr/fmgr.c	2009-06-11 16:49:05.000000000 +0200
+++ postgresql-8.4.0-copy/src/backend/utils/fmgr/fmgr.c	2009-08-10 23:41:05.000000000 +0200
@@ -1920,6 +1920,42 @@ OutputFunctionCall(FmgrInfo *flinfo, Dat
 }
 
 /*
+ * Call a previously-looked-up datatype output, using a context.
+ * The output function should write its data to the context's wbuf
+ * with perhaps encoding conversion, (fixme ;)
+ * If it does not support this, simply return a cstring as usual.
+ *
+ * Do not call this on NULL datums.
+ *
+ * This is almost just window dressing for FunctionCall1, but it includes
+ * SPI context pushing for the same reasons as InputFunctionCall.
+ */
+char *
+OutputFunctionCallContext( FmgrInfo *flinfo, Datum val, fmNodePtr context )
+{
+	FunctionCallInfoData	fcinfo;
+	char	   *result;
+	bool		pushed;
+
+	pushed = SPI_push_conditional();
+
+	InitFunctionCallInfoData(fcinfo, flinfo, 1, context, NULL);
+
+	fcinfo.arg[0] = val;
+	fcinfo.argnull[0] = false;
+
+	result = DatumGetCString(FunctionCallInvoke(&fcinfo));
+
+	/* Check for null result, since caller is clearly not expecting one */
+	if (fcinfo.isnull)
+		elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);
+
+	SPI_pop_conditional(pushed);
+
+	return result;
+}
+
+/*
  * Call a previously-looked-up datatype binary-input function.
  *
  * "buf" may be NULL to indicate we are reading a NULL.  In this case
@@ -1997,6 +2033,51 @@ SendFunctionCall(FmgrInfo *flinfo, Datum
 }
 
 /*
+ * Call a previously-looked-up datatype binary-output function, using a context.
+ * The output function should write its data to the context's wbuf, and return
+ * a NULL POINTER.
+ *
+ * If it does not support this, simply return a bytea as usual.
+ *
+ * Do not call this on NULL datums.
+ *
+ */
+bytea *
+SendFunctionCallContext(FmgrInfo *flinfo, Datum val, fmNodePtr context )
+{
+	FunctionCallInfoData	fcinfo;
+	Datum r ;
+	bytea	   *result;
+	bool		pushed;
+
+	pushed = SPI_push_conditional();
+	
+	InitFunctionCallInfoData(fcinfo, flinfo, 1, context, NULL);
+
+	fcinfo.arg[0] = val;
+	fcinfo.argnull[0] = false;
+
+	r = FunctionCallInvoke(&fcinfo);
+	
+	/* If the SendFunction has written its data directly in the context's 
+	output buffer, it returns a NULL pointer (this is normal : there is nothing
+	to return !). DatumGetByteaP would try to detoast it and crash. So : */
+	
+	if( r )		result = DatumGetByteaP( r );
+	else		result = NULL;
+
+	/* Check for null result, since caller is clearly not expecting one */
+	if (fcinfo.isnull)
+		elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);
+
+	
+	SPI_pop_conditional(pushed);
+
+	return result;
+}
+
+
+/*
  * As above, for I/O functions identified by OID.  These are only to be used
  * in seldom-executed code paths.  They are not only slow but leak memory.
  */
diff -rupN postgresql-8.4.0-orig/src/include/fmgr.h postgresql-8.4.0-copy/src/include/fmgr.h
--- postgresql-8.4.0-orig/src/include/fmgr.h	2009-01-01 18:23:55.000000000 +0100
+++ postgresql-8.4.0-copy/src/include/fmgr.h	2009-08-10 22:17:21.000000000 +0200
@@ -506,6 +506,9 @@ extern Datum OidReceiveFunctionCall(Oid 
 extern bytea *SendFunctionCall(FmgrInfo *flinfo, Datum val);
 extern bytea *OidSendFunctionCall(Oid functionId, Datum val);
 
+extern bytea *SendFunctionCallContext(FmgrInfo *flinfo, Datum val, fmNodePtr context );
+extern char *OutputFunctionCallContext( FmgrInfo *flinfo, Datum val, fmNodePtr context );
+
 
 /*
  * Routines in fmgr.c
diff -rupN postgresql-8.4.0-orig/src/include/lib/rwbuf.h postgresql-8.4.0-copy/src/include/lib/rwbuf.h
--- postgresql-8.4.0-orig/src/include/lib/rwbuf.h	1970-01-01 01:00:00.000000000 +0100
+++ postgresql-8.4.0-copy/src/include/lib/rwbuf.h	2009-08-11 02:02:09.000000000 +0200
@@ -0,0 +1,523 @@
+/*-------------------------------------------------------------------------
+ *
+ * rwbuf.h
+ *	  Declarations/definitions for "RBuf / WBuf" functions.
+ *
+ * StringInfo provides an indefinitely-extensible string data type.
+ * It can be used to buffer either ordinary C strings (null-terminated text)
+ * or arbitrary binary data.  All storage is allocated with palloc().
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL: pgsql/src/include/lib/rwbuf.h,v 1.36 2009/01/01 17:23:59 momjian Exp $
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RWBUF_H
+#define RWBUF_H
+
+/* Writes to a WBuf of short datums (like int, etc) cannot be split to allow
+ * for a buffer flush, so this should be the size of the biggest object we
+ * expect to write atomically to a WBuf */
+#define	WBUF_MIN_SIZE 64
+
+struct WBufData;
+typedef struct WBufData *WBuf;
+
+/* WBuf sink function.
+ * It is called by the WBuf functions when a WBuf is full, or when we have finished
+ * to fill it and want to flush the contents.
+ * 
+ * 1. It must check if there is data in the WBuf, then process it entirely, or not at all.
+ * Partial writes cannot be handled. If an error occurs, it must ereport() it,
+ * because the return code of sink() is void, so there is no way to pass an error to the
+ * caller.
+ * 
+ * 2. It must look at the data and length parameters.
+ * If data is not null, it must process "length" bytes of data starting at address "data"
+ * If data is null, it must make sure there will be at least "length" free space in the buffer.
+ * 
+ *	** If sink() is something like writing to a file...
+ *
+ * We do it like this, so you can write a large amount of data to a WBuf with a small buffer
+ * without reallocating it and eating up memory. In this case, any data in the buffer is
+ * flushed in step 1., and the rest of the data is simply piped to the data sink in step 2.
+ * After the call to sink(), the buffer should be empty.
+ *
+ *	** If sink() is a reallocator that enlarges the buffer
+ *
+ * sink() can be a function that grows the buffer, and stores the extra
+ * data in it. In this case, it behaves just like StringInfo. sink() does not touch
+ * the data already in the buffer, it merely enlarges it.
+ * After the call to sink(), the buffer will contain exactly as much data as before,
+ * and a lot more free space.
+ * 
+ * If a reallocator is used, it must extend the buffer by a substantial amount
+ * because functions that send small data (like an int) to the buffer don't recheck
+ * if there is space after calling sink(), they assume sink() did remove enough data.
+ * 
+ * buf		the WBuf
+ * data		extra data to write
+ * length	length of extra data to write
+ *
+ * sink_arg is not passed as a parameter, since it is stored in WBufData.
+ *          sink() needs sink_arg to know where to put the data.
+ *
+ */
+typedef void (*WBufDataSink) ( WBuf buf, const char* data, int length );
+ 
+typedef struct WBufData
+{
+	char	   *data;			/* pointer to data buffer */
+	char       *wp;				/* position in buffer of next byte to be written */
+	char       *endp;			/* end of data buffer */
+	WBufDataSink	sink;		/* function to call when buffer is full and we try to write to it */
+	void *			sink_arg;	/* opaque data for sink() where we may, . */
+} WBufData;
+
+
+/*************************************************
+
+				Setup Functions
+
+**************************************************/
+
+/*
+ * wbuf_reset
+ *
+ * Reset the WBuf: the data buffer remains valid, but its
+ * previous content, if any, is cleared.
+ */
+static inline void wbuf_reset(WBuf buf)
+{
+	buf->wp = buf->data;
+}
+
+/*
+ * wbuf_init
+ *
+ * Initialize a WBufData struct (with previously undefined contents)
+ * to describe an empty buffer, of size "size"
+ */
+static inline void wbuf_init( WBuf buf, int size )
+{
+	Assert(size >= WBUF_MIN_SIZE);
+	buf->data = (char *) palloc(size);
+	buf->endp = buf->data + size;
+	wbuf_reset(buf);
+}
+
+/*
+ * wbuf_make
+ *
+ * Create a WBufData, with an empty buffer of size "size", 
+ * & return a pointer to it.
+ */
+static inline WBuf wbuf_make( int size )
+{
+	WBuf buf = (WBuf) palloc(sizeof(WBufData));
+	wbuf_init( buf, size );
+	return buf;
+}
+
+/*
+ * wbuf_set_sink
+ *
+ * Sets the sink function and its parameter.
+ */
+static inline void wbuf_set_sink( WBuf buf, WBufDataSink sink, void* opaque )
+{
+	buf->sink		= sink;
+	buf->sink_arg	= opaque;
+}
+
+/*************************************************
+
+				Simple Getters
+
+**************************************************/
+
+
+/*
+ * wbuf_get_used_space
+ *
+ * Returns amount of used bytes in buffer.
+ * 
+ */
+static inline int wbuf_get_used_space( WBuf buf )
+{
+	return buf->wp - buf->data;
+}
+
+/*
+ * wbuf_get_free_space
+ *
+ * Returns amount of unused bytes in buffer.
+ * 
+ */
+static inline int wbuf_get_free_space( WBuf buf )
+{
+	return buf->endp - buf->wp;
+}
+
+/*
+ * wbuf_get_free_space
+ *
+ * Returns size of buffer.
+ * 
+ */
+static inline int wbuf_get_size( WBuf buf )
+{
+	return buf->endp - buf->data;
+}
+
+/*
+ * wbuf_is_empty
+ *
+ * Returns true if WBuf is empty
+ * 
+ */
+static inline int wbuf_is_empty( WBuf buf )
+{
+	return buf->wp == buf->data;
+}
+
+/*
+ * wbuf_is_full
+ *
+ * Returns true if WBuf is full
+ * 
+ */
+static inline int wbuf_is_full( WBuf buf )
+{
+	return buf->wp == buf->endp;
+}
+
+/*
+ * wbuf_has_space
+ *
+ * Returns true if "needed" bytes will fit in the buffer.
+ * 
+ */
+static inline bool wbuf_has_space( WBuf buf, int needed )
+{
+	return ( buf->wp + needed ) <= buf->endp;
+}
+
+/*************************************************
+
+				Flushing
+
+**************************************************/
+
+/*
+ * wbuf_flush
+ *
+ * Calls the sink() if there is data to flush, which empties the WBuf (or not).
+ * 
+ */
+extern void wbuf_flush( WBuf buf );
+
+/*
+ * wbuf_flush_forced
+ *
+ * Calls the sink(), which empties the WBuf. Does not check presence of data.
+ * Used when we know the buffer is not empty.
+ * 
+ */
+extern void wbuf_flush_forced( WBuf buf );
+
+
+/*************************************************
+
+				Pre-Allocation
+
+**************************************************/
+
+/* private */
+extern void wbuf_reserve2_( WBuf buf, int needed );
+
+/*
+ * wbuf_ensure
+ *
+ * Ensures than "needed" bytes will fit in the buffer.
+ * Returns a pointer to the allocated space,
+ * It is up to the caller to write the data and increment the pointer.
+ * 
+ * This saves lots of checks if all you want is to write a few ints...
+ * 
+ * "needed" MUST be <= bus->maxlen (this is verified in reserveWBuf2).
+ * 
+ * To write big amounts of data, use wbuf_put_big directly.
+ *
+ */
+static inline char* wbuf_ensure( WBuf buf, int needed )
+{
+	/* check if it will fit */
+	if( !wbuf_has_space( buf, needed ))
+		wbuf_reserve2_( buf, needed );	/* no need to inline this code */
+	
+	/* return a pointer to the reserved space */
+	return buf->wp;
+}
+
+/*
+ * After calling p = wbuf_ensure( buf, N ) to be sure there are 
+ * N free bytes starting at pointer p, the caller can insert up to N
+ * bytes of data (perhaps less), then call wbuf_increment( buf, added )
+ * to tell the wbuf how many bytes were actually added.
+ */
+static inline void wbuf_increment( WBuf buf, int added )
+{
+	buf->wp += added;
+	Assert( buf->wp <= buf->endp );
+}
+
+/*
+ * wbuf_reserve
+ *
+ * Exactly like wbuf_ensure(), but it increments the pointer.
+ */
+static inline char* wbuf_reserve( WBuf buf, int needed )
+{
+	char *p = wbuf_ensure( buf, needed );
+	p = buf->wp;
+	buf->wp += needed;
+	return p;
+}
+
+
+
+/*************************************************
+
+			Scalar Type Send Functions
+
+**************************************************/
+
+/*
+ * wbuf_put_char
+ */
+static inline void wbuf_put_char( WBuf buf, char x )
+{
+	/* check if it will fit */
+	if( wbuf_is_full( buf ))
+		wbuf_flush_forced( buf );
+	
+	*(buf->wp++) = x;
+}
+
+/*************************************************
+
+	Native Byte Order Integer Type Send Functions
+
+**************************************************/
+
+static inline void wbuf_put_int16_native( WBuf buf, int16 x )
+{
+	if( !wbuf_has_space( buf, sizeof(x) ))		
+		wbuf_flush_forced( buf );
+	
+	memcpy( buf->wp, &x, sizeof( x ));
+	buf->wp += sizeof( x );
+}
+
+static inline void wbuf_put_int32_native( WBuf buf, int32 x )
+{
+	if( !wbuf_has_space( buf, sizeof(x) ))		
+		wbuf_flush_forced( buf );
+	
+	memcpy( buf->wp, &x, sizeof( x ));
+	buf->wp += sizeof( x );
+}
+
+/*************************************************
+
+	Network-Order Integer Type Send Functions
+
+**************************************************/
+
+static inline void wbuf_put_int16_net( WBuf buf, int16 x )
+{
+	if( !wbuf_has_space( buf, sizeof(x) ))
+		wbuf_flush_forced( buf );
+	
+	x = (int16)htons( x );
+	
+	memcpy( buf->wp, &x, sizeof( x ));
+	buf->wp += sizeof( x );
+}
+
+static inline void wbuf_put_int32_net( WBuf buf, int32 x )
+{
+	/* check if it will fit */
+	if( !wbuf_has_space( buf, sizeof(x) ))
+		wbuf_flush_forced( buf );
+	
+	x = htonl( x );
+
+	memcpy( buf->wp, &x, sizeof( x ));
+	buf->wp += sizeof( x );
+}
+
+extern void wbuf_put_int64_net( WBuf buf, int64 i);
+extern void wbuf_put_float4_net( WBuf buf, float4 f );
+extern void wbuf_put_float8_net( WBuf buf, float8 f );
+
+/*************************************************
+
+			Binary Send Functions
+
+**************************************************/
+
+/*
+ * wbuf_put_big
+ *
+ * Same as wbuf_put (see below) but we know the data will not
+ * fit in the buffer, so we will flush, then pipe the data directly to sink()
+ */
+extern void wbuf_put_big( WBuf buf, const void *data, int datalen );
+
+/*
+ * wbuf_put
+ *
+ * Sends binary data (data,datalen) to the WBuf.
+ * If the data is too large to fit, at most 2 flushes will happen.
+ * First, the WBuf itself (if not empty) will be flushed. 
+ * Then, depending on the size of the data, it may be written directly,
+ * or piped straight to the sink() without going through the buffer.
+ */
+static inline void wbuf_put( WBuf buf, const void *data, int datalen )
+{
+	Assert(data != NULL);
+	
+	/* check if it will fit */
+	if( wbuf_has_space( buf, datalen ))
+	{
+		/* append data */
+		memcpy( buf->wp, data, datalen );	/* inling wbuf_put makes gcc do smart things with memcpy */
+		buf->wp += datalen;
+		return;
+	}
+
+	/* no need to inline this code */
+	wbuf_put_big( buf, data, datalen );	
+}
+
+/*************************************************
+
+			String Send Functions
+			(not sending the NULL)
+
+**************************************************/
+
+/*
+ * wbuf_put_text
+ *
+ * Sends a non-null terminated string to the WBuf.
+ * Does not append a \0 in wbuf.
+ * Behaves the same as wbuf_put.
+ */
+#define wbuf_put_text wbuf_put
+
+/*
+ * wbuf_put_string_big
+ *
+ * Same as wbuf_put_string (see below) but we know the data will not
+ * fit in the buffer, so we will flush, then pipe the data directly to sink()
+ */
+extern void wbuf_put_string_big( WBuf buf, const char *str );
+
+/*	private-ish function (see wbuf.c)
+*/
+extern bool strncpypartial( char ** pdest, const char ** psource, int dest_size );
+
+/*
+ * wbuf_put_string
+ *
+ * Sends a null terminated string to the WBuf, including the '\0' terminator.
+ * Behaves the same as wbuf_put.
+ */
+static inline void wbuf_put_string( WBuf buf, const char *str )
+{
+	Assert(str != NULL);
+	
+	/* try to put it in the buffer  */
+	if( strncpypartial( &buf->wp, &str, wbuf_get_free_space( buf )) )
+	{
+		buf->wp--;	/* forget the '\0' */
+		return;
+	}
+	
+	/* insuficient space, send the rest (str and buf->wp were updated by strncpypartial() */
+	wbuf_put_string_big( buf, str ); /* no need to inline this code */
+}
+
+/*************************************************
+
+			String Send Functions
+			(sending the NULL)
+
+**************************************************/
+
+/*
+ * wbuf_put_text_0
+ *
+ * Sends a non-null terminated string to the WBuf, plus the '\0' terminator.
+ * Behaves the same as wbuf_put.
+ */
+static inline void wbuf_put_text_0( WBuf buf, const char *str, int len )
+{
+	Assert(str != NULL);
+	
+	/* check if it will fit, including the '\0' */
+	if( wbuf_has_space( buf, len+1 ))
+	{
+		/* append data */
+		memcpy( buf->wp, str, len );
+		buf->wp += len;
+		*buf->wp++ = '\0'; /* append terminator */
+		return;
+	}
+
+	wbuf_put_big( buf, str, len );	/* no need to inline this code */
+	wbuf_put_char( buf, '\0' );		/* append terminator */
+}
+
+/*
+ * wbuf_put_string_big
+ *
+ * Same as wbuf_put_string (see below) but we know the data will not
+ * fit in the buffer, so we will flush, then pipe the data directly to sink()
+ */
+extern void wbuf_put_string_0_big( WBuf buf, const char *str );
+
+/*	private-ish function (see wbuf.c)
+*/
+extern bool strncpypartial( char ** pdest, const char ** psource, int dest_size );
+
+/*
+ * wbuf_put_string
+ *
+ * Sends a null terminated string to the WBuf, including the '\0' terminator.
+ * Behaves the same as wbuf_put.
+ */
+static inline void wbuf_put_string_0( WBuf buf, const char *str )
+{
+	Assert(str != NULL);
+	
+	/* try to put it in the buffer  */
+	if( strncpypartial( &buf->wp, &str, wbuf_get_free_space( buf )) )
+		return;
+	
+	/* insuficient space, send the rest (str and buf->wp were updated by strncpypartial() */
+	wbuf_put_string_big( buf, str ); /* no need to inline this code */
+}
+
+/*
+	Fake data sink function which actually reallocates the buffer.
+*/
+/* void wbufReallocatorSink( WBuf buf, const char* data, int length ); */
+
+
+#endif   /* RWBUF_H */
diff -rupN postgresql-8.4.0-orig/src/include/nodes/nodes.h postgresql-8.4.0-copy/src/include/nodes/nodes.h
--- postgresql-8.4.0-orig/src/include/nodes/nodes.h	2009-06-11 16:49:11.000000000 +0200
+++ postgresql-8.4.0-copy/src/include/nodes/nodes.h	2009-08-10 22:52:54.000000000 +0200
@@ -388,7 +388,9 @@ typedef enum NodeTag
 	T_TriggerData = 950,		/* in commands/trigger.h */
 	T_ReturnSetInfo,			/* in nodes/execnodes.h */
 	T_WindowObjectData,			/* private in nodeWindowAgg.c */
-	T_TIDBitmap					/* in nodes/tidbitmap.h */
+	T_TIDBitmap,				/* in nodes/tidbitmap.h */
+	T_OutputFunctionContext		/*	COPY passes this context to the OutFunctions 
+									and SendFunctions, see utils/io_context, copy.c and util/adt/ *.c */
 } NodeTag;
 
 /*
diff -rupN postgresql-8.4.0-orig/src/include/utils/io_context.h postgresql-8.4.0-copy/src/include/utils/io_context.h
--- postgresql-8.4.0-orig/src/include/utils/io_context.h	1970-01-01 01:00:00.000000000 +0100
+++ postgresql-8.4.0-copy/src/include/utils/io_context.h	2009-08-10 22:50:51.000000000 +0200
@@ -0,0 +1,8 @@
+
+typedef struct OutputFunctionContext
+{
+	NodeTag		type;
+	
+	WBuf		out_buf;
+	
+} OutputFunctionContext;
diff -rupN postgresql-8.4.0-orig/src/Makefile.global postgresql-8.4.0-copy/src/Makefile.global
--- postgresql-8.4.0-orig/src/Makefile.global	2009-08-07 14:33:32.000000000 +0200
+++ postgresql-8.4.0-copy/src/Makefile.global	2009-08-10 21:13:19.000000000 +0200
@@ -34,7 +34,7 @@ MAJORVERSION = 8.4
 
 # Support for VPATH builds
 vpath_build = no
-abs_top_srcdir = /home/peufeu/dev/postgres/postgresql-8.4.0
+abs_top_srcdir = /home/peufeu/dev/postgres/postgresql-8.4.0-copy
 
 ifneq ($(vpath_build),yes)
 top_srcdir = $(top_builddir)
@@ -46,7 +46,7 @@ srcdir = $(top_srcdir)/$(subdir)
 endif
 
 # Saved arguments from configure
-configure_args =  '--prefix=/home/peufeu/dev/postgres/run/'
+configure_args =  '--prefix=/home/peufeu/dev/postgres/run/' '--enable-debug' '--enable-depend'
 
 
 ##########################################################################
@@ -162,7 +162,7 @@ with_zlib	= yes
 enable_shared	= yes
 enable_rpath	= yes
 enable_nls	= no
-enable_debug	= no
+enable_debug	= yes
 enable_dtrace	= no
 enable_coverage	= no
 enable_thread_safety	= no
@@ -214,7 +214,7 @@ endif # not PGXS
 CC = gcc
 GCC = yes
 SUN_STUDIO_CC = no
-CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv
+CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g
 
 # Kind-of compilers
 
@@ -539,7 +539,7 @@ install-strip:
 # whether this file needs to be updated. The dependency files are kept
 # in the .deps subdirectory of each directory.
 
-autodepend = 
+autodepend = yes
 
 ifeq ($(autodepend), yes)

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#1)

Re: COPY speedup

Replying to myself...

I've been examining the code path for COPY FROM too, and I think it is
possible to get the same kind of speedups on COPY FROM that the patch in
the previous message did for COPY TO, that is to say perhaps 2-3x faster
in BINARY mode and 10-20% faster in TEXT mode (these figures are
ballparks, only based on very quick checks however).

The idea is to avoid most (actually, all) palloc()'ing and memcpy()'ing
for types that are pass-by-value like INT.

Is there interest in such a patch (for 8.6) ?

Tom Lane

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#2)

Re: COPY speedup

=?utf-8?Q?Pierre_Fr=C3=A9d=C3=A9ric_Caillau?= =?utf-8?Q?d?= <lists@peufeu.com> writes:

I've been examining the code path for COPY FROM too, and I think it is
possible to get the same kind of speedups on COPY FROM that the patch in
the previous message did for COPY TO, that is to say perhaps 2-3x faster
in BINARY mode and 10-20% faster in TEXT mode (these figures are
ballparks, only based on very quick checks however).

The idea is to avoid most (actually, all) palloc()'ing and memcpy()'ing
for types that are pass-by-value like INT.

Is there interest in such a patch (for 8.6) ?

If you do as much damage to the I/O function API as the other patch
did, it will probably be rejected. We don't touch datatype APIs
lightly, because it affects too much code.

regards, tom lane

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Tom Lane (#3)

Re: COPY speedup

If you do as much damage to the I/O function API as the other patch
did, it will probably be rejected.

You mean, as the COPY patch in my previous message, or as another patch ?
(I just search the archives and found one about CopyReadLine, but that's
probably not what you are talking about)

We don't touch datatype APIs
lightly, because it affects too much code.

regards, tom lane

I definitely agree with that.

Merlin Moncure

mmoncure@gmail.com

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#4)

Re: COPY speedup

2009/8/12 Pierre Frédéric Caillaud <lists@peufeu.com>:

If you do as much damage to the I/O function API as the other patch
did, it will probably be rejected.

You mean, as the COPY patch in my previous message, or as another
patch ?
(I just search the archives and found one about CopyReadLine, but
that's probably not what you are talking about)

We don't touch datatype APIs
lightly, because it affects too much code.

regards, tom lane

I definitely agree with that.

Is there any way to do this that is not as invasive?

merlin

Alvaro Herrera

alvherre@commandprompt.com

over 16 years ago

In reply to: Merlin Moncure (#5)

Re: COPY speedup

Merlin Moncure escribiï¿½:

2009/8/12 Pierre Frï¿½dï¿½ric Caillaud <lists@peufeu.com>:

If you do as much damage to the I/O function API as the other patch
did, it will probably be rejected.

ï¿½ ï¿½ ï¿½ ï¿½You mean, as the COPY patch in my previous message, or as another
patch ?
ï¿½ ï¿½ ï¿½ ï¿½(I just search the archives and found one about CopyReadLine, but
that's probably not what you are talking about)

We don't touch datatype APIs
lightly, because it affects too much code.

ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½regards, tom lane

ï¿½ ï¿½ ï¿½ ï¿½I definitely agree with that.

Is there any way to do this that is not as invasive?

Maybe add new methods, fastrecv/fastsend etc. Types that don't
implement them would simply use the slow methods, maintaining
backwards compatibility.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Alvaro Herrera (#6)

Re: COPY speedup

We don't touch datatype APIs
lightly, because it affects too much code.

regards, tom lane

I definitely agree with that.

Actually, let me clarify:

When I modified the datatype API, I was feeling uneasy, like "I shouldn't
really touch this".
But when I see a big red button, I just press it to see what happens.
Ugly hacks are useful to know how fast the thing can go ; then the
interesting part is to reimplement it cleanly, trying to reach the same
performance...

Is there any way to do this that is not as invasive?

Maybe add new methods, fastrecv/fastsend etc. Types that don't
implement them would simply use the slow methods, maintaining
backwards compatibility.

Well, this would certainly work, and it would be even faster.

I considered doing it like this, but it is a lot more work : adding
entries to the system catalogs, creating all the new functions, deciding
what to do with getTypeBinaryOutputInfo (since there would be 2 variants),
etc. Types that don't support the new functions would need some form of
indirection to call the old functions instead, etc. In a word, doable, but
kludgy, and I would need help from a system catalog expert. Also, on
upgrade, information about the new functions must be inserted in the
system catalogs ? (I don't know how this process works). If you want to
help...

The way I see COPY BINARY is that its speed should be really something
massive.
COPY foo FROM ... BINARY should be as fast as CREATE TABLE foo AS SELECT *
FROM bar (which is extremely fast).
COPY foo TO ... BINARY should be as fast as the disk allows.

Why else would anyone use a binary format if it's slower than portable
text ?

So, there are two other ways (besides fastsend/fastrecv) that I can see :

1- The way I implemented

I'm not saying it's the definitive solution : just a simple way to see how
much overhead is introduced by the current API, returning BYTEAs and
palloc()'ing every tuple of every row. I think this approach gave two
interesting answers :

- once COPY's output buffer has been made more efficient, with things like
removing fwrite() for every row etc (see patch), all that remains is the
API overhead, which is very significant for binary mode, since I could get
massive speedups (3-4x !) by bypassing it. The table scan itself is
super-fast.

- however, for text mode, it is not so significant, as the speedups
bypassing the API were roughly 0-20%, since most of the time is spent in
datum to text conversions.

Now, I don't think the hack is so ugly. It does make me feel a bit uneasy,
but :

- The context field in the fcinfo struct is there for a reason, so I used
it.
- I checked every place in the code where SendFunctionCall() appears
(which are quite few actually).
- The context field is never used for SendFuncs or ReceiveFuncs (it is
always set to NULL)

2- Another way

- palloc() could be made faster for short blocks
- a generous sprinkling of inline's
- a few modifications to pq_send*
- a few modifications to StringInfo
- bits of my previous patch in copy.c (like not fwriting every row)

This would be less fast, but you'd still get a substantial speedup.

As a conclusion, I think :

- Alvaro's fastsend/fastrecv provides the cleanest solutin
- Method 2 is the easiest, but slower
- Method 1 is an intermediate, but the use of the context field is a
touchy subject.

Also, I will work on COPY FROM ... BINARY. I should be able to make it
also much faster. This will be useful for big imports.

Regards,
Pierre

Alvaro Herrera

alvherre@commandprompt.com

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#7)

Re: COPY speedup

Pierre Frï¿½dï¿½ric Caillaud escribiï¿½:

But when I see a big red button, I just press it to see what happens.
Ugly hacks are useful to know how fast the thing can go ; then the
interesting part is to reimplement it cleanly, trying to reach the
same performance...

Right -- now that you've shown a 6x speedup increase, it is clear that
it makes sense to attempt a reimplementation. It also means it makes
sense to have an additional pair or two of input/output functions.

Maybe add new methods, fastrecv/fastsend etc. Types that don't
implement them would simply use the slow methods, maintaining
backwards compatibility.

I considered doing it like this, but it is a lot more work : adding
entries to the system catalogs, creating all the new functions,
deciding what to do with getTypeBinaryOutputInfo (since there would
be 2 variants), etc. Types that don't support the new functions
would need some form of indirection to call the old functions
instead, etc. In a word, doable, but kludgy, and I would need help
from a system catalog expert.

Right.

Also, on upgrade, information about the new functions must be inserted
in the system catalogs ? (I don't know how this process works).

No, that's not how pg_migrator works. Catalog upgrades are handled by
pg_dump/pg_restore, so you don't need to worry about it at all.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Alvaro Herrera (#8)

Re: COPY speedup

But when I see a big red button, I just press it to see what happens.
Ugly hacks are useful to know how fast the thing can go ; then the
interesting part is to reimplement it cleanly, trying to reach the
same performance...

Right -- now that you've shown a 6x speedup increase, it is clear that
it makes sense to attempt a reimplementation. It also means it makes
sense to have an additional pair or two of input/output functions.

Okay.

Here are some numbers. The tables are the same as in the previous email,
and it also contains the same results as "copy patch 4", aka "API hack"
for reference.

I benchmarked these :

* p5 = no api changes, COPY TO optimized :
- Optimizations in COPY (fast buffer, much less fwrite() calls, etc)
remain.
- SendFunction API reverted to original state (actually, the API changes
are still there, but deactivated, fcinfo->context = NULL).

=> small performance gain ; of course the lower per-row overhead is more
visible on "test_one_int", because that table has 1 column.
=> the (still huge) distance between p5 and "API hack" is split between
overhead in pq_send*+stringInfo (that we will tackle below) and palloc()
overhead (that was removed by the "API hack" by passing the destination
buffer directly).

* p6 = p5 + optimization of pq_send*
- inlining strategic functions
- probably benefits many other code paths

=> small incremental performance gain

* p7 = p6 + optimization of StringInfo
- inlining strategic functions
- probably benefits many other code paths

=> small incremental performance gain (they start to add up nicely)

* p8 = p7 + optimization of palloc()
- actually this is extremely dumb :
- int4send and int2send simply palloc() 16 bytes instead of 1024......
- the initial size of the allocset is 64K instead of 8K

=> still it has interesting results...

The three patches above are quite simple (especially the inlines) and yet,
speedup is already nice.

* p9 = p8 + monstrously ugly hack
copy looks at the sendfunc, notices it's int{2,4}send , and replaces it
with int{2,4}fastsend which is called directly from C, bypassing the fmgr
(urrrgghhhhhh)
of course it only works for ints.
This gives information about fmgr overhead : fmgr is pretty damn fast.

* p10 no copy
does everything except calling the SendFuncs, it writes dummy data instead.
This gives the time used in everything except the SendFuncs : table scan,
deform_tuple, file writes, etc, which is an interesting thing to know.

RESULTS :

COPY annonces TO '/dev/null' BINARY :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|--------|---------|---------------------------------------------
2.149 | 2.60 x | 151.57 | 192.40 | 7.50 | copy to patch 4
3.055 | 1.83 x | 106.64 | 135.37 | 5.28 | p8 = p7 + optimization of
palloc()
3.202 | 1.74 x | 101.74 | 129.15 | 5.04 | p7 = p6 + optimization of
StringInfo
3.754 | 1.49 x | 86.78 | 110.15 | 4.30 | p6 = p5 + optimization of
pq_send*
4.434 | 1.26 x | 73.47 | 93.26 | 3.64 | p5 no api changes, COPY TO
optimized
5.579 | --- | 58.39 | 74.12 | 2.89 | compiled from source

COPY archive_data TO '/dev/null' BINARY :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|--------|---------|---------------------------------------------
5.372 | 3.75 x | 73.96 | 492.88 | 13.80 | copy to patch 4
8.545 | 2.36 x | 46.49 | 309.83 | 8.68 | p8 = p7 + optimization of
palloc()
10.229 | 1.97 x | 38.84 | 258.82 | 7.25 | p7 = p6 + optimization of
StringInfo
12.869 | 1.57 x | 30.87 | 205.73 | 5.76 | p6 = p5 + optimization of
pq_send*
15.559 | 1.30 x | 25.54 | 170.16 | 4.76 | p5 no api changes, COPY TO
optimized
20.165 | --- | 19.70 | 131.29 | 3.68 | 8.4.0 / compiled from source

COPY test_one_int TO '/dev/null' BINARY :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|---------|---------|---------------------------------------------
1.493 | 4.23 x | 205.25 | 6699.22 | 6.70 | p10 no copy
1.660 | 3.80 x | 184.51 | 6022.33 | 6.02 | p9 monstrously ugly hack
2.003 | 3.15 x | 152.94 | 4991.87 | 4.99 | copy to patch 4
2.803 | 2.25 x | 109.32 | 3568.03 | 3.57 | p8 = p7 + optimization of
palloc()
2.976 | 2.12 x | 102.94 | 3360.05 | 3.36 | p7 = p6 + optimization of
StringInfo
3.165 | 2.00 x | 96.82 | 3160.05 | 3.16 | p6 = p5 + optimization of
pq_send*
3.698 | 1.71 x | 82.86 | 2704.43 | 2.70 | p5 no api changes, COPY TO
optimized
6.318 | --- | 48.49 | 1582.85 | 1.58 | 8.4.0 / compiled from source

COPY test_many_ints TO '/dev/null' BINARY :
Time | Speedup | Table | KRows | MTuples | Name
(s) | | MB/s | /s | /s |
------|---------|--------|--------|---------|---------------------------------------------
1.007 | 8.80 x | 127.23 | 993.34 | 25.83 | p10 no copy
1.114 | 7.95 x | 114.95 | 897.52 | 23.34 | p9 monstrously ugly hack
1.706 | 5.19 x | 75.08 | 586.23 | 15.24 | copy to patch 4
3.396 | 2.61 x | 37.72 | 294.49 | 7.66 | p8 = p7 + optimization of
palloc()
4.588 | 1.93 x | 27.92 | 217.98 | 5.67 | p7 = p6 + optimization of
StringInfo
5.821 | 1.52 x | 22.00 | 171.80 | 4.47 | p6 = p5 + optimization of
pq_send*
6.890 | 1.29 x | 18.59 | 145.14 | 3.77 | p5 no api changes, COPY TO
optimized
8.861 | --- | 14.45 | 112.85 | 2.93 | 8.4.0 / compiled from source

#10

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Alvaro Herrera (#8)

Re: COPY speedup

In the previous mails I made a mistake, writing "MTuples/s" instead of
"MDatums/s", sorry about that. It is the number of rows x columns. The
title was wrong, but the data was right.

I've been doing some tests on COPY FROM ... BINARY.

- inlines in various pg_get* etc
- a faster buffer handling for copy
- that's about it...

In the below tables, you have "p17" (ie test patch 17, the last one) and
straight postgres compared.

COPY annonces_2 FROM 'annonces.bin' BINARY :
Time | Speedup | Table | KRows | MDatums | Name
(s) | | MB/s | /s | /s |
-------|---------|--------|--------|---------|----------------------------------------------------
8.417 | 1.40 x | 38.70 | 49.13 | 1.92 | 8.4.0 / p17
11.821 | --- | 27.56 | 34.98 | 1.36 | 8.4.0 / compiled from source

COPY archive_data_2 FROM 'archive_data.bin' BINARY :
Time | Speedup | Table | KRows | MDatums | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|--------|---------|----------------------------------------------------
15.314 | 1.93 x | 25.94 | 172.88 | 4.84 | 8.4.0 / p17 COPY FROM BINARY
all
29.520 | --- | 13.46 | 89.69 | 2.51 | 8.4.0 / compiled from source

COPY test_one_int_2 FROM 'test_one_int.bin' BINARY :
Time | Speedup | Table | KRows | MDatums | Name
(s) | | MB/s | /s | /s |
-------|---------|--------|---------|---------|----------------------------------------------------
10.003 | 1.39 x | 30.63 | 999.73 | 1.00 | 8.4.0 / p17 COPY FROM
BINARY all
13.879 | --- | 22.08 | 720.53 | 0.72 | 8.4.0 / compiled from
source

COPY test_many_ints_2 FROM 'test_many_ints.bin' BINARY :
Time | Speedup | Table | KRows | MDatums | Name
(s) | | MB/s | /s | /s |
-------|---------|-------|--------|---------|----------------------------------------------------
6.009 | 2.08 x | 21.31 | 166.42 | 4.33 | 8.4.0 / p17 COPY FROM BINARY
all
12.516 | --- | 10.23 | 79.90 | 2.08 | 8.4.0 / compiled from source

I thought it might be interesting to get split timings of the various
steps in COPY FROM, so I simply commented out bits of code and ran tests.

The "delta" columns are differences between two lines, that is the time
taken in the step mentioned on the right.

reading data only = reading all the data and parsing it into chunks, doing
everything until the RecvFunc is called.
RecvFuncs = same, + RecvFunc is called
heap_form_tuple = same + heap_form_tuple is called
triggers = same + triggers are applied
insert = actual tuple insertion
p17 = total time (post insert triggers, constraint check, etc)

Time | Delta | Row delta | Datum delta | Name
(s) | (s) | (us) | (us) |
-------|-------|-----------|-------------|----------------------
1.311 | --- | --- | --- | reading data only
4.516 | 3.205 | 7.750 | 0.199 | RecvFuncs
4.534 | 0.018 | 0.043 | 0.001 | heap_form_tuple
5.323 | 0.789 | 1.908 | 0.049 | triggers
8.182 | 2.858 | 6.912 | 0.177 | insert
8.417 | 0.236 | 0.570 | 0.015 | p17

COPY archive_data_2 FROM 'archive_data.bin' BINARY :
Time | Delta | Row delta | Datum delta | Name
(s) | (s) | (us) | (us) |
-------|--------|-----------|-------------|---------------------
4.729 | --- | --- | --- | reading data only
8.508 | 3.778 | 1.427 | 0.051 | RecvFuncs
8.567 | 0.059 | 0.022 | 0.001 | heap_form_tuple
10.804 | 2.237 | 0.845 | 0.030 | triggers
14.475 | 3.671 | 1.386 | 0.050 | insert
15.314 | 0.839 | 0.317 | 0.011 | p17

COPY test_one_int_2 FROM 'test_one_int.bin' BINARY :
Time | Delta | Row delta | Datum delta | Name
(s) | (s) | (us) | (us) |
-------|-------|-----------|-------------|----------------------
1.247 | --- | --- | --- | reading data only
1.745 | 0.498 | 0.050 | 0.050 | RecvFuncs
1.750 | 0.004 | 0.000 | 0.000 | heap_form_tuple
3.114 | 1.364 | 0.136 | 0.136 | triggers
9.984 | 6.870 | 0.687 | 0.687 | insert
10.003 | 0.019 | 0.002 | 0.002 | p17

COPY test_many_ints_2 FROM 'test_many_ints.bin' BINARY :
Time | Delta | Row delta | Datum delta | Name
(s) | (s) | (us) | (us) |
-------|-------|-----------|-------------|----------------------
1.701 | --- | --- | --- | reading data only
3.122 | 1.421 | 1.421 | 0.055 | RecvFuncs
3.129 | 0.008 | 0.008 | 0.000 | heap_form_tuple
3.754 | 0.624 | 0.624 | 0.024 | triggers
5.639 | 1.885 | 1.885 | 0.073 | insert
6.009 | 0.370 | 0.370 | 0.014 | p17

We can see that :

- reading and parsing the data is still slow (actually, everything is
copied something like 3-4 times)
- RecvFuncs take quite long, too
- triggers use some time, although the table has no triggers ....? This is
suspicious...
- the actual insertion (which is really what we are interested in when
loading a table) is actually very fast

Ideally in COPY the insertion time in the table should take most of the
time used in the operation...

#11

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#10)

Re: COPY speedup

I'm doing some more exploration with oprofile...

I've got the glibc-debug package installed (on kubuntu), but oprofile
doesn't seem to know about it. I wonder what part of glibc gets 60% of the
run time... do I have to set a magic option in the postgres config ?

samples % image name app name symbol
name
155312 61.7420 libc-2.7.so libc-2.7.so
/lib/tls/i686/cmov/libc-2.7.so
35921 14.2799 postgres postgres
CopyOneRowTo
7485 2.9756 postgres postgres
CopySendData
5626 2.2365 postgres postgres
MemoryContextAlloc
5174 2.0568 postgres postgres
FunctionCall1
5167 2.0541 no-vmlinux no-vmlinux
/no-vmlinux
5087 2.0223 postgres postgres
AllocSetAlloc
4340 1.7253 postgres postgres int4out
3896 1.5488 postgres postgres
heap_deform_tuple

#12

Marko Kreen

markokr@gmail.com

over 16 years ago

In reply to: Pierre Frédéric Caillaud (#11)

Re: COPY speedup

On 8/18/09, Pierre Frédéric Caillaud <lists@peufeu.com> wrote:

I'm doing some more exploration with oprofile...

I've got the glibc-debug package installed (on kubuntu), but oprofile
doesn't seem to know about it. I wonder what part of glibc gets 60% of the
run time... do I have to set a magic option in the postgres config ?

AFAIK you need to run app with LD_LIBRARY_PATH=/usr/lib/debug,
otherwise the debug packages won't be used.

--
marko

#13

Pierre Frédéric Caillaud

lists@peufeu.com

over 16 years ago

In reply to: Marko Kreen (#12)

Re: COPY speedup

AFAIK you need to run app with LD_LIBRARY_PATH=/usr/lib/debug,
otherwise the debug packages won't be used.

I had stupidly put the LD_LIBRARY_PATH on make rather than on postgres,
ahem.
OK, it works, thanks.

I'm very carefully benchmarking everything everytime I make a
modification : sometimes just a simple change creates an unexpected
performance loss.
So the process is slow, but performance patch should not make it slower ;)