MD5 aggregate

Started by Dean Rasheedover 12 years ago32 messages

dean.a.rasheed@gmail.com

over 12 years ago

1 attachment(s)

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2
variants of md5(), so pure binary data can also be checksummed.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

Regards,
Dean

Attachments:

md5_agg.patchapplication/octet-stream; name=md5_agg.patchDownload

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 4c5af4b..fe531c5
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT NULLIF(value, '(none)') ...
*** 11588,11593 ****
--- 11588,11607 ----
       <row>
        <entry>
         <indexterm>
+         <primary>md5_agg</primary>
+        </indexterm>
+        <function>md5_agg(<replaceable class="parameter">expression</replaceable>)</function>
+       </entry>
+       <entry><type>text</type> or <type>bytea</type></entry>
+       <entry><type>text</type></entry>
+       <entry>
+        MD5 hash of the concatenated input values
+       </entry>
+      </row>
+ 
+      <row>
+       <entry>
+        <indexterm>
          <primary>min</primary>
         </indexterm>
         <function>min(<replaceable class="parameter">expression</replaceable>)</function>
*************** SELECT count(*) FROM sometable;
*** 11714,11720 ****
  
    <para>
     The aggregate functions <function>array_agg</function>,
!    <function>json_agg</function>,
     <function>string_agg</function>,
     and <function>xmlagg</function>, as well as similar user-defined
     aggregate functions, produce meaningfully different result values
--- 11728,11734 ----
  
    <para>
     The aggregate functions <function>array_agg</function>,
!    <function>json_agg</function>, <function>md5_agg</function>,
     <function>string_agg</function>,
     and <function>xmlagg</function>, as well as similar user-defined
     aggregate functions, produce meaningfully different result values
diff --git a/src/backend/libpq/md5.c b/src/backend/libpq/md5.c
new file mode 100644
index 4fc8318..852c335
*** a/src/backend/libpq/md5.c
--- b/src/backend/libpq/md5.c
***************
*** 1,14 ****
  /*
   *	md5.c
   *
!  *	Implements	the  MD5 Message-Digest Algorithm as specified in
!  *	RFC  1321.	This  implementation  is a simple one, in that it
!  *	needs  every  input  byte  to  be  buffered  before doing any
!  *	calculations.  I  do  not  expect  this  file  to be used for
!  *	general  purpose  MD5'ing  of large amounts of data, only for
!  *	generating hashed passwords from limited input.
   *
!  *	Sverre H. Huseby <sverrehu@online.no>
   *
   *	Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
   *	Portions Copyright (c) 1994, Regents of the University of California
--- 1,9 ----
  /*
   *	md5.c
   *
!  *	Implements the MD5 Message-Digest Algorithm as specified in RFC 1321.
   *
!  *	Original coding by Sverre H. Huseby <sverrehu@online.no>
   *
   *	Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
   *	Portions Copyright (c) 1994, Regents of the University of California
***************
*** 27,89 ****
   *	PRIVATE FUNCTIONS
   */
  
- 
- /*
-  *	The returned array is allocated using malloc.  the caller should free it
-  *	when it is no longer needed.
-  */
- static uint8 *
- createPaddedCopyWithLength(const uint8 *b, uint32 *l)
- {
- 	uint8	   *ret;
- 	uint32		q;
- 	uint32		len,
- 				newLen448;
- 	uint32		len_high,
- 				len_low;		/* 64-bit value split into 32-bit sections */
- 
- 	len = ((b == NULL) ? 0 : *l);
- 	newLen448 = len + 64 - (len % 64) - 8;
- 	if (newLen448 <= len)
- 		newLen448 += 64;
- 
- 	*l = newLen448 + 8;
- 	if ((ret = (uint8 *) malloc(sizeof(uint8) * *l)) == NULL)
- 		return NULL;
- 
- 	if (b != NULL)
- 		memcpy(ret, b, sizeof(uint8) * len);
- 
- 	/* pad */
- 	ret[len] = 0x80;
- 	for (q = len + 1; q < newLen448; q++)
- 		ret[q] = 0x00;
- 
- 	/* append length as a 64 bit bitcount */
- 	len_low = len;
- 	/* split into two 32-bit values */
- 	/* we only look at the bottom 32-bits */
- 	len_high = len >> 29;
- 	len_low <<= 3;
- 	q = newLen448;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q] = (len_high & 0xff);
- 
- 	return ret;
- }
- 
  #define F(x, y, z) (((x) & (y)) | (~(x) & (z)))
  #define G(x, y, z) (((x) & (z)) | ((y) & ~(z)))
  #define H(x, y, z) ((x) ^ (y) ^ (z))
--- 22,27 ----
*************** doTheRounds(uint32 X[16], uint32 state[4
*** 181,233 ****
  	state[3] += d;
  }
  
! static int
! calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
  {
- 	register uint32 i,
- 				j,
- 				k,
- 				newI;
- 	uint32		l;
- 	uint8	   *input;
- 	register uint32 *wbp;
- 	uint32		workBuff[16],
- 				state[4];
- 
- 	l = len;
- 
  	state[0] = 0x67452301;
  	state[1] = 0xEFCDAB89;
  	state[2] = 0x98BADCFE;
  	state[3] = 0x10325476;
  
! 	if ((input = createPaddedCopyWithLength(b, &l)) == NULL)
! 		return 0;
  
! 	for (i = 0;;)
  	{
! 		if ((newI = i + 16 * 4) > l)
! 			break;
! 		k = i + 3;
! 		for (j = 0; j < 16; j++)
! 		{
! 			wbp = (workBuff + j);
! 			*wbp = input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k];
! 			k += 7;
! 		}
! 		doTheRounds(workBuff, state);
! 		i = newI;
  	}
! 	free(input);
  
! 	j = 0;
! 	for (i = 0; i < 4; i++)
  	{
  		k = state[i];
  		sum[j++] = (k & 0xff);
--- 119,201 ----
  	state[3] += d;
  }
  
! static inline void
! initializeState(uint32 state[4])
  {
  	state[0] = 0x67452301;
  	state[1] = 0xEFCDAB89;
  	state[2] = 0x98BADCFE;
  	state[3] = 0x10325476;
+ }
  
! static inline void
! processBlock(const uint8 input[64], uint32 state[4])
! {
! 	uint32		buff[16];
! 	uint32		i;
! 	uint32		j;
! 	uint32	   *bp;
  
! 	for (i = 0, j = 3; i < 16; i++)
  	{
! 		bp = (buff + i);
! 		*bp = input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j];
! 		j += 7;
  	}
! 	doTheRounds(buff, state);
! }
  
! static void
! calculateFinalSum(const uint8 input[64], uint32 state[4],
! 				  uint32 len_low, uint32 len_high, uint8 sum[16])
! {
! 	uint8		buff[128];
! 	uint32		offset;
! 	uint32		size;
! 	uint32		i;
! 	uint32		j;
! 	uint32		k;
! 
! 	/* Last partial data block */
! 	offset = (len_low >> 3) & 63;
! 	if (offset > 0)
! 		memcpy(buff, input, offset);
! 
! 	/* Pad to a complete block minus 8 bytes for the bit count */
! 	size = offset < 56 ? 56 : 120;
! 	buff[offset++] = 0x80;
! 	while (offset < size)
! 	   buff[offset++] = 0x00;
! 
! 	/* Append the 64-bit bit count */
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset] = (len_high & 0xff);
! 
! 	/* Process these last 1 or 2 blocks */
! 	processBlock(buff, state);
! 	if (size > 56)
! 	   processBlock(buff+64, state);
! 
! 	/* Produce the final MD5 sum */
! 	for (i = j = 0; i < 4; i++)
  	{
  		k = state[i];
  		sum[j++] = (k & 0xff);
*************** calculateDigestFromBuffer(const uint8 *b
*** 238,244 ****
  		k >>= 8;
  		sum[j++] = (k & 0xff);
  	}
! 	return 1;
  }
  
  static void
--- 206,245 ----
  		k >>= 8;
  		sum[j++] = (k & 0xff);
  	}
! }
! 
! static void
! calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
! {
! 	uint32		state[4];
! 	uint32		len_low;
! 	uint32		len_high;
! 
! 	initializeState(state);
! 
! 	if (b == NULL)
! 	{
! 		/* Treated the same as empty input */
! 		len_low = 0;
! 		len_high = 0;
! 	}
! 	else
! 	{
! 		/* Calculate the 64-bit bit count */
! 		len_low = len << 3;
! 		len_high = len >> 29;
! 
! 		/* Process any complete blocks of data */
! 		while (len >= 64)
! 		{
! 			processBlock(b, state);
! 			b += 64;
! 			len -= 64;
! 		}
! 	}
! 
! 	/* Produce the final MD5 sum from any remaining data */
! 	calculateFinalSum(b, state, len_low, len_high, sum);
  }
  
  static void
*************** pg_md5_hash(const void *buff, size_t len
*** 291,299 ****
  {
  	uint8		sum[16];
  
! 	if (!calculateDigestFromBuffer(buff, len, sum))
! 		return false;
! 
  	bytesToHex(sum, hexsum);
  	return true;
  }
--- 292,298 ----
  {
  	uint8		sum[16];
  
! 	calculateDigestFromBuffer(buff, len, sum);
  	bytesToHex(sum, hexsum);
  	return true;
  }
*************** pg_md5_hash(const void *buff, size_t len
*** 301,308 ****
  bool
  pg_md5_binary(const void *buff, size_t len, void *outbuf)
  {
! 	if (!calculateDigestFromBuffer(buff, len, outbuf))
! 		return false;
  	return true;
  }
  
--- 300,306 ----
  bool
  pg_md5_binary(const void *buff, size_t len, void *outbuf)
  {
! 	calculateDigestFromBuffer(buff, len, outbuf);
  	return true;
  }
  
*************** pg_md5_encrypt(const char *passwd, const
*** 343,345 ****
--- 341,444 ----
  
  	return ret;
  }
+ 
+ /*
+  * pg_md5_init
+  *
+  *		Initialize or reset the specified MD5State for performing a new
+  *		cumulative MD5 computation.
+  */
+ bool
+ pg_md5_init(MD5State *state)
+ {
+ 	if (state == NULL)
+ 		return false;
+ 
+ 	state->len_low = 0;
+ 	state->len_high = 0;
+ 
+ 	initializeState(state->state);
+ 
+ 	return true;
+ }
+ 
+ /*
+  * pg_md5_update
+  *
+  *		Update the specified MD5 message digest by adding the specified data.
+  *		This may be called multiple times after the MD5State has been
+  *		initialized with pg_md5_init, and before calling pg_md5_final_hash.
+  */
+ bool
+ pg_md5_update(MD5State *state, const void *buff, size_t len)
+ {
+ 	uint32		offset;
+ 	const uint8 *input;
+ 	uint32		nbits;
+ 
+ 	if (state == NULL)
+ 		return false;
+ 
+ 	/* Size of previous partial data block */
+ 	offset = (state->len_low >> 3) & 63;
+ 	input = (uint8 *) buff;
+ 
+ 	/* Update the bit count */
+ 	nbits = (uint32) (len << 3);
+ 	state->len_low += nbits;
+ 	state->len_high += len >> 29;
+ 	if (state->len_low < nbits)
+ 		state->len_high++; /* overflow */
+ 
+ 	/* Process the previous partial block of data */
+ 	if (offset > 0)
+ 	{
+ 		int			copy = offset + len > 64 ? 64 - offset : len;
+ 
+ 		memcpy(state->buffer + offset, input, copy);
+ 		if (offset + copy < 64)
+ 			return true; /* still don't have a complete block of data */
+ 
+ 		processBlock(state->buffer, state->state);
+ 		input += copy;
+ 		len -= copy;
+     }
+ 
+ 	/* Process any complete blocks of data */
+ 	while (len >= 64)
+ 	{
+ 		processBlock(input, state->state);
+ 		input += 64;
+ 		len -= 64;
+ 	}
+ 
+ 	/* Save any remaining partial data block for next time */
+ 	if (len > 0)
+ 		memcpy(state->buffer, input, len);
+ 
+ 	return true;
+ }
+ 
+ /*
+  * pg_md5_final_hash
+  *
+  *		Compute the final MD5 sum of the data added to the specified MD5State.
+  *
+  *		Once this has been called, it should not be called again, and no
+  *		further data should be added to the MD5State without first resetting
+  *		it using pg_md5_init.
+  */
+ bool
+ pg_md5_final_hash(MD5State *state, char *hexsum)
+ {
+ 	uint8		sum[16];
+ 
+ 	if (state == NULL)
+ 		return false;
+ 
+ 	calculateFinalSum(state->buffer, state->state,
+ 					  state->len_low, state->len_high, sum);
+ 	bytesToHex(sum, hexsum);
+ 
+ 	return 1;
+ }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
new file mode 100644
index 56349e7..0da2377
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*************** md5_text(PG_FUNCTION_ARGS)
*** 3643,3652 ****
  	len = VARSIZE_ANY_EXHDR(in_text);
  
  	/* get the hash result */
! 	if (pg_md5_hash(VARDATA_ANY(in_text), len, hexsum) == false)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OUT_OF_MEMORY),
! 				 errmsg("out of memory")));
  
  	/* convert to text and return it */
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
--- 3643,3649 ----
  	len = VARSIZE_ANY_EXHDR(in_text);
  
  	/* get the hash result */
! 	pg_md5_hash(VARDATA_ANY(in_text), len, hexsum);
  
  	/* convert to text and return it */
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
*************** md5_bytea(PG_FUNCTION_ARGS)
*** 3664,3678 ****
  	char		hexsum[MD5_HASH_LEN + 1];
  
  	len = VARSIZE_ANY_EXHDR(in);
! 	if (pg_md5_hash(VARDATA_ANY(in), len, hexsum) == false)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OUT_OF_MEMORY),
! 				 errmsg("out of memory")));
  
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
  }
  
  /*
   * Return the size of a datum, possibly compressed
   *
   * Works on any data type
--- 3661,3778 ----
  	char		hexsum[MD5_HASH_LEN + 1];
  
  	len = VARSIZE_ANY_EXHDR(in);
! 	pg_md5_hash(VARDATA_ANY(in), len, hexsum);
  
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
  }
  
  /*
+  * md5_agg - Create an md5 hash of the input values.
+  *
+  * This is equivalent to using string_agg with an empty delimiter to
+  * concatenate all the input values, and then taking the MD5 sum of the
+  * result.  However, doing that would use an excessive amount of memory,
+  * whereas md5_agg uses only a small fixed-size state structure.
+  */
+ 
+ static MD5State *
+ makeMD5State(FunctionCallInfo fcinfo)
+ {
+ 	MD5State   *state;
+ 	MemoryContext aggcontext;
+ 	MemoryContext oldcontext;
+ 
+ 	if (!AggCheckCallContext(fcinfo, &aggcontext))
+ 	{
+ 		/* cannot be called directly because of internal-type argument */
+ 		elog(ERROR, "md5_agg_transfn called in non-aggregate context");
+ 	}
+ 
+ 	oldcontext = MemoryContextSwitchTo(aggcontext);
+ 	state = (MD5State *) palloc(sizeof(MD5State));
+ 	pg_md5_init(state);
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return state;
+ }
+ 
+ Datum
+ md5_text_agg_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value to the MD5 sum, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		text	   *in_text = PG_GETARG_TEXT_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5State(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in_text);
+ 		pg_md5_update(state, VARDATA_ANY(in_text), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_agg() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_bytea_agg_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value to the MD5 sum, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		bytea	   *in = PG_GETARG_BYTEA_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5State(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in);
+ 		pg_md5_update(state, VARDATA_ANY(in), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_agg() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_agg_finalfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	/* cannot be called directly because of internal-type argument */
+ 	Assert(AggCheckCallContext(fcinfo, NULL));
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	if (state != NULL)
+ 	{
+ 		char		hexsum[MD5_HASH_LEN + 1];
+ 
+ 		pg_md5_final_hash(state, hexsum);
+ 
+ 		PG_RETURN_TEXT_P(cstring_to_text(hexsum));
+ 	}
+ 	else
+ 		PG_RETURN_NULL();
+ }
+ 
+ /*
   * Return the size of a datum, possibly compressed
   *
   * Works on any data type
diff --git a/src/include/catalog/pg_aggregate.h b/src/include/catalog/pg_aggregate.h
new file mode 100644
index 6fb10a9..29977ba
*** a/src/include/catalog/pg_aggregate.h
--- b/src/include/catalog/pg_aggregate.h
*************** DATA(insert ( 3545	bytea_string_agg_tran
*** 235,240 ****
--- 235,244 ----
  /* json */
  DATA(insert ( 3175	json_agg_transfn	json_agg_finalfn		0	2281	_null_ ));
  
+ /* md5 */
+ DATA(insert ( 3179	md5_text_agg_transfn	md5_agg_finalfn		0	2281	_null_ ));
+ DATA(insert ( 3181	md5_bytea_agg_transfn	md5_agg_finalfn		0	2281	_null_ ));
+ 
  /*
   * prototypes for functions in pg_aggregate.c
   */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index b5be075..1544605
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DESCR("MD5 hash");
*** 3519,3524 ****
--- 3519,3535 ----
  DATA(insert OID =  2321 (  md5	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "17" _null_ _null_ _null_ _null_ md5_bytea _null_ _null_ _null_ ));
  DESCR("MD5 hash");
  
+ DATA(insert OID = 3177 (  md5_text_agg_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 25" _null_ _null_ _null_ _null_ md5_text_agg_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3178 (  md5_agg_finalfn		PGNSP PGUID 12 1 0 0 0 f f f f f f i 1 0 25 "2281" _null_ _null_ _null_ _null_ md5_agg_finalfn _null_ _null_ _null_ ));
+ DESCR("aggregate final function");
+ DATA(insert OID = 3179 (  md5_agg				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "25" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 text hashing aggregate");
+ DATA(insert OID = 3180 (  md5_bytea_agg_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 17" _null_ _null_ _null_ _null_ md5_bytea_agg_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3181 (  md5_agg				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "17" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 bytea hashing aggregate");
+ 
  /* crosstype operations for date vs. timestamp and timestamptz */
  DATA(insert OID = 2338 (  date_lt_timestamp		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "1082 1114" _null_ _null_ _null_ _null_ date_lt_timestamp _null_ _null_ _null_ ));
  DATA(insert OID = 2339 (  date_le_timestamp		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "1082 1114" _null_ _null_ _null_ _null_ date_le_timestamp _null_ _null_ _null_ ));
diff --git a/src/include/libpq/md5.h b/src/include/libpq/md5.h
new file mode 100644
index cfa8df5..c0b55d7
*** a/src/include/libpq/md5.h
--- b/src/include/libpq/md5.h
***************
*** 21,30 ****
--- 21,43 ----
  #define isMD5(passwd)	(strncmp(passwd, "md5", 3) == 0 && \
  						 strlen(passwd) == MD5_PASSWD_LEN)
  
+ typedef struct MD5State
+ {
+ 	uint32		len_low;	/* low part of 64-bit bit count */
+ 	uint32		len_high;	/* high part of 64-bit bit count */
+ 	uint32		state[4];	/* MD5 digest state buffer */
+ 	uint8		buffer[64];	/* input data buffer */
+ } MD5State;
  
+ /* functions to compute the MD5 sum of a single input value */
  extern bool pg_md5_hash(const void *buff, size_t len, char *hexsum);
  extern bool pg_md5_binary(const void *buff, size_t len, void *outbuf);
  extern bool pg_md5_encrypt(const char *passwd, const char *salt,
  			   size_t salt_len, char *buf);
  
+ /* cumulative MD5 computation functions */
+ extern bool pg_md5_init(MD5State *state);
+ extern bool pg_md5_update(MD5State *state, const void *buff, size_t len);
+ extern bool pg_md5_final_hash(MD5State *state, char *hexsum);
+ 
  #endif
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index 667c58b..e1cbcf7
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum to_hex32(PG_FUNCTION_ARGS);
*** 779,784 ****
--- 779,787 ----
  extern Datum to_hex64(PG_FUNCTION_ARGS);
  extern Datum md5_text(PG_FUNCTION_ARGS);
  extern Datum md5_bytea(PG_FUNCTION_ARGS);
+ extern Datum md5_text_agg_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_bytea_agg_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_agg_finalfn(PG_FUNCTION_ARGS);
  
  extern Datum unknownin(PG_FUNCTION_ARGS);
  extern Datum unknownout(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index d379c0d..c069b22
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** select string_agg(v, decode('ee', 'hex')
*** 1153,1156 ****
--- 1153,1195 ----
   \xffeeaa
  (1 row)
  
+ -- md5_agg tests
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),('bbbb'),('cccc')) g(a);
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  ccb3bf4d77b887690b3b89663823d13d | ccb3bf4d77b887690b3b89663823d13d
+ (1 row)
+ 
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),(null),('cccc')) g(a);
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  39797f2e1f4fc388384fccfd75dcd5b6 | 39797f2e1f4fc388384fccfd75dcd5b6
+ (1 row)
+ 
+ select md5_agg(a), md5(string_agg(a,'')) from (values(null),(null)) g(a);
+  md5_agg | md5 
+ ---------+-----
+          | 
+ (1 row)
+ 
+ select md5_agg(distinct f1 order by f1),
+        md5(string_agg(distinct f1, '' order by f1)) from varchar_tbl;
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  f271345903df8fde90583255d5c5dcd4 | f271345903df8fde90583255d5c5dcd4
+ (1 row)
+ 
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  6c33e22e546f0c70faa7e2f674760ac4 | 6c33e22e546f0c70faa7e2f674760ac4
+ (1 row)
+ 
+ truncate bytea_test_table;
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+  md5_agg | md5 
+ ---------+-----
+          | 
+ (1 row)
+ 
  drop table bytea_test_table;
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
new file mode 100644
index 281c695..1babf96
*** a/src/test/regress/expected/strings.out
--- b/src/test/regress/expected/strings.out
*************** select md5('') = 'd41d8cd98f00b204e98009
*** 1278,1359 ****
--- 1278,1443 ----
   t
  (1 row)
  
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values('')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values(''::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
   TRUE 
  ------
   t
+ (1 row)
+ 
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea)) t(v);
+  TRUE 
+ ------
+  t
  (1 row)
  
  --
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
new file mode 100644
index 38d4757..a89c0ba
*** a/src/test/regress/sql/aggregates.sql
--- b/src/test/regress/sql/aggregates.sql
*************** select string_agg(v, '') from bytea_test
*** 441,444 ****
--- 441,456 ----
  select string_agg(v, NULL) from bytea_test_table;
  select string_agg(v, decode('ee', 'hex')) from bytea_test_table;
  
+ -- md5_agg tests
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),('bbbb'),('cccc')) g(a);
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),(null),('cccc')) g(a);
+ select md5_agg(a), md5(string_agg(a,'')) from (values(null),(null)) g(a);
+ 
+ select md5_agg(distinct f1 order by f1),
+        md5(string_agg(distinct f1, '' order by f1)) from varchar_tbl;
+ 
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+ truncate bytea_test_table;
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+ 
  drop table bytea_test_table;
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
new file mode 100644
index e7841aa..8385d57
*** a/src/test/regress/sql/strings.sql
--- b/src/test/regress/sql/strings.sql
*************** select to_hex(256::bigint*256::bigint*25
*** 455,486 ****
--- 455,500 ----
  -- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
  --
  select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values('')) t(v);
  
  select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a')) t(v);
  
  select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc')) t(v);
  
  select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest')) t(v);
  
  select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz')) t(v);
  
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')) t(v);
  
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890')) t(v);
  
  select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values(''::bytea)) t(v);
  
  select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a'::bytea)) t(v);
  
  select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc'::bytea)) t(v);
  
  select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest'::bytea)) t(v);
  
  select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz'::bytea)) t(v);
  
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea)) t(v);
  
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea)) t(v);
  
  --
  -- test behavior of escape_string_warning and standard_conforming_strings options

Peter Eisentraut

peter_e@gmx.net

over 12 years ago

In reply to: Dean Rasheed (#1)

Re: MD5 aggregate

On 6/13/13 5:35 AM, Dean Rasheed wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

That seems somewhat useful.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

I think it would be better if you split this into two separate patches.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Marko Kreen

markokr@gmail.com

over 12 years ago

In reply to: Dean Rasheed (#1)

Re: MD5 aggregate

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 12 years ago

In reply to: Marko Kreen (#3)

Re: MD5 aggregate

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Benedikt Grundmann

bgrundmann@janestreet.com

over 12 years ago

In reply to: Tom Lane (#4)

Re: MD5 aggregate

On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com>

wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane

xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).

Show quoted text

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Stephen Frost

sfrost@snowman.net

over 12 years ago

In reply to: Tom Lane (#4)

Re: MD5 aggregate

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together. Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen

Andrew Dunstan

andrew@dunslane.net

over 12 years ago

In reply to: Stephen Frost (#6)

Re: MD5 aggregate

On 06/14/2013 09:40 AM, Stephen Frost wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together. Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

I'd rather go the other way, processing the records without having to
process them otherwise at all. Turning things into text must slow things
down, surely.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Stephen Frost

sfrost@snowman.net

over 12 years ago

In reply to: Andrew Dunstan (#7)

Re: MD5 aggregate

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Tom Lane (#4)

Re: MD5 aggregate

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tom Lane

tgl@sss.pgh.pa.us

over 12 years ago

In reply to: Dean Rasheed (#9)

Re: MD5 aggregate

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature. If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result. Sounds a bit slow though.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Stephen Frost (#8)

Re: MD5 aggregate

On 14 June 2013 15:19, Stephen Frost <sfrost@snowman.net> wrote:

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

md5_agg(record) ?

Yes, I like it.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Andres Freund

andres@2ndquadrant.com

over 12 years ago

In reply to: Dean Rasheed (#11)

Re: MD5 aggregate

On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:

On 14 June 2013 15:19, Stephen Frost <sfrost@snowman.net> wrote:

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

md5_agg(record) ?

Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there's so many different
representations (short, long, compressed, external, external compressed)
of the same data that a md5 without normalizing that wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproduceable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Hannu Krosing

hannu@2ndQuadrant.com

over 12 years ago

In reply to: Tom Lane (#10)

Re: MD5 aggregate

On 06/14/2013 04:47 PM, Tom Lane wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature. If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result. Sounds a bit slow though.

What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic Oï¿½

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Hannu Krosing (#13)

Re: MD5 aggregate

On 14 June 2013 16:09, Hannu Krosing <hannu@2ndquadrant.com> wrote:

What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do and index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Craig Ringer

craig@2ndquadrant.com

over 12 years ago

In reply to: Stephen Frost (#6)

Re: MD5 aggregate

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/14/2013 09:40 PM, Stephen Frost wrote:

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.

Until I started looking at the follow-up discussion I didn't realise it
wasn't supposed to.

If it isn't the md5sum of the ordered rows, it shouldn't be called
'md5'. It might still be useful, but please don't call it md5.

The proposals to make it produce the same result with different row
orderings sound useful in the context of SQL; if it produced a different
result with different orderings I'd want a way to force an explicit
ORDER BY clause on the aggregate and error out if one wasn't present.
Making it ignore order means that it's no longer md5, though.

row_sums(...) ?

- --
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRu+XMAAoJELBXNkqjr+S2mkkH/j8gi8d07dI6+G742f0U+v0J
u8DGhtDQuuWHalqlaUDOssmi4fRDg99OzLlR+Mid0yGL/UfFMoL47H+kNRoMkuzV
stUz3vf5rp8TbqEnikT3EwEKIuzaWrae0Fn3TKIYXVSRVvWjGzRSZsvJZsdfcS7T
7lZ9sf6QGekT9bAi6BIFsG7Z1bFLb6Q6AeTsX04++dLBCrjm96CSyisBswY5J2qg
zD0WrK6IOsSn9ljlIZRGSTtP+tdM5mOi/DHdeEd+glGx5YKQ9t9yq++oayoqb9mp
hPtsBo6UwcMaylPA2vQnhVi0q2bl9FMa+QGpaWBe6YfXLPF4PWhET2OixkRat1w=
=ncT/
-----END PGP SIGNATURE-----

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Craig Ringer

craig@2ndquadrant.com

over 12 years ago

In reply to: Dean Rasheed (#1)

Re: MD5 aggregate

On 06/13/2013 05:35 PM, Dean Rasheed wrote:

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

That's a very useful thing to be able to do, but I'm hesitant to make
the fact that it uses md5 too prominent in the name if it doesn't
produce a result that an external user could reasonably expect from
md5'ing the same data.

I imagine having an md5_agg(text) and md5(bytea) that was the more
efficient, streaming equivalent of:

md5(string_agg(the_col,''))

would be rather handy.

It'd be less useful for other types (floats, integers, etc) unless we
had a way to get the binary representations of those in a well defined
form, like int8le(1) . Casting to 'text' would be sufficient for most of
the purposes I can imagine, though, and for those that it wouldn't
things would quickly get so complicated that you'd want to be using a
PL/something function anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Dean Rasheed (#1)

Re: MD5 aggregate

On 13 June 2013 10:35, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

2). Produce an ordered MD5 sum compatible with COPY, whose result
would match that of running unix md5sum on the COPY output. Given all
the possible COPY options that would affect the result (format,
delimiters, headers, quoting, escaping, ...), I think that such a
thing would only reasonably be possible as an extension to the COPY
command itself.

I guess in its simplest form this would just be a new option "MD5" to
COPY that would cause it to pipe its output to the md5 aggregator and
then send the final sum to the COPY destination at the end (e.g.,
"COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data
in foo).

I still think my original md5_agg() has its uses, since what it
produces is comparable with external md5 sums, and is directly
available to SQL, but (1) is probably the most useful for quickly
comparing 2 tables. I'm much less convinced about the value of (2),
but on the face of it, it doesn't seem like it would be hard to
implement.

Thoughts?

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Dean Rasheed (#17)

2 attachment(s)

Re: MD5 aggregate

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Attachments:

md5_agg_v2.patchapplication/octet-stream; name=md5_agg_v2.patchDownload

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 4c5af4b..b31d824
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT NULLIF(value, '(none)') ...
*** 11588,11593 ****
--- 11588,11622 ----
       <row>
        <entry>
         <indexterm>
+         <primary>md5_agg</primary>
+        </indexterm>
+        <function>md5_agg(<replaceable class="parameter">expression</replaceable>)</function>
+       </entry>
+       <entry><type>text</type> or <type>bytea</type></entry>
+       <entry><type>text</type></entry>
+       <entry>
+        MD5 hash of the concatenated input values
+       </entry>
+      </row>
+ 
+      <row>
+       <entry>
+        <indexterm>
+         <primary>md5_total</primary>
+        </indexterm>
+        <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
+       </entry>
+       <entry><type>text</type> or <type>bytea</type></entry>
+       <entry><type>text</type></entry>
+       <entry>
+        sum of the MD5 hash values of the input values, independent of their
+        order
+       </entry>
+      </row>
+ 
+      <row>
+       <entry>
+        <indexterm>
          <primary>min</primary>
         </indexterm>
         <function>min(<replaceable class="parameter">expression</replaceable>)</function>
*************** SELECT count(*) FROM sometable;
*** 11714,11720 ****
  
    <para>
     The aggregate functions <function>array_agg</function>,
!    <function>json_agg</function>,
     <function>string_agg</function>,
     and <function>xmlagg</function>, as well as similar user-defined
     aggregate functions, produce meaningfully different result values
--- 11743,11749 ----
  
    <para>
     The aggregate functions <function>array_agg</function>,
!    <function>json_agg</function>, <function>md5_agg</function>,
     <function>string_agg</function>,
     and <function>xmlagg</function>, as well as similar user-defined
     aggregate functions, produce meaningfully different result values
diff --git a/src/backend/libpq/md5.c b/src/backend/libpq/md5.c
new file mode 100644
index 4fc8318..b677066
*** a/src/backend/libpq/md5.c
--- b/src/backend/libpq/md5.c
***************
*** 1,14 ****
  /*
   *	md5.c
   *
!  *	Implements	the  MD5 Message-Digest Algorithm as specified in
!  *	RFC  1321.	This  implementation  is a simple one, in that it
!  *	needs  every  input  byte  to  be  buffered  before doing any
!  *	calculations.  I  do  not  expect  this  file  to be used for
!  *	general  purpose  MD5'ing  of large amounts of data, only for
!  *	generating hashed passwords from limited input.
   *
!  *	Sverre H. Huseby <sverrehu@online.no>
   *
   *	Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
   *	Portions Copyright (c) 1994, Regents of the University of California
--- 1,9 ----
  /*
   *	md5.c
   *
!  *	Implements the MD5 Message-Digest Algorithm as specified in RFC 1321.
   *
!  *	Original coding by Sverre H. Huseby <sverrehu@online.no>
   *
   *	Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
   *	Portions Copyright (c) 1994, Regents of the University of California
***************
*** 27,89 ****
   *	PRIVATE FUNCTIONS
   */
  
- 
- /*
-  *	The returned array is allocated using malloc.  the caller should free it
-  *	when it is no longer needed.
-  */
- static uint8 *
- createPaddedCopyWithLength(const uint8 *b, uint32 *l)
- {
- 	uint8	   *ret;
- 	uint32		q;
- 	uint32		len,
- 				newLen448;
- 	uint32		len_high,
- 				len_low;		/* 64-bit value split into 32-bit sections */
- 
- 	len = ((b == NULL) ? 0 : *l);
- 	newLen448 = len + 64 - (len % 64) - 8;
- 	if (newLen448 <= len)
- 		newLen448 += 64;
- 
- 	*l = newLen448 + 8;
- 	if ((ret = (uint8 *) malloc(sizeof(uint8) * *l)) == NULL)
- 		return NULL;
- 
- 	if (b != NULL)
- 		memcpy(ret, b, sizeof(uint8) * len);
- 
- 	/* pad */
- 	ret[len] = 0x80;
- 	for (q = len + 1; q < newLen448; q++)
- 		ret[q] = 0x00;
- 
- 	/* append length as a 64 bit bitcount */
- 	len_low = len;
- 	/* split into two 32-bit values */
- 	/* we only look at the bottom 32-bits */
- 	len_high = len >> 29;
- 	len_low <<= 3;
- 	q = newLen448;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	len_low >>= 8;
- 	ret[q++] = (len_low & 0xff);
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q++] = (len_high & 0xff);
- 	len_high >>= 8;
- 	ret[q] = (len_high & 0xff);
- 
- 	return ret;
- }
- 
  #define F(x, y, z) (((x) & (y)) | (~(x) & (z)))
  #define G(x, y, z) (((x) & (z)) | ((y) & ~(z)))
  #define H(x, y, z) ((x) ^ (y) ^ (z))
--- 22,27 ----
*************** doTheRounds(uint32 X[16], uint32 state[4
*** 181,233 ****
  	state[3] += d;
  }
  
! static int
! calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
  {
- 	register uint32 i,
- 				j,
- 				k,
- 				newI;
- 	uint32		l;
- 	uint8	   *input;
- 	register uint32 *wbp;
- 	uint32		workBuff[16],
- 				state[4];
- 
- 	l = len;
- 
  	state[0] = 0x67452301;
  	state[1] = 0xEFCDAB89;
  	state[2] = 0x98BADCFE;
  	state[3] = 0x10325476;
  
! 	if ((input = createPaddedCopyWithLength(b, &l)) == NULL)
! 		return 0;
  
! 	for (i = 0;;)
  	{
! 		if ((newI = i + 16 * 4) > l)
! 			break;
! 		k = i + 3;
! 		for (j = 0; j < 16; j++)
! 		{
! 			wbp = (workBuff + j);
! 			*wbp = input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k--];
! 			*wbp <<= 8;
! 			*wbp |= input[k];
! 			k += 7;
! 		}
! 		doTheRounds(workBuff, state);
! 		i = newI;
  	}
! 	free(input);
  
! 	j = 0;
! 	for (i = 0; i < 4; i++)
  	{
  		k = state[i];
  		sum[j++] = (k & 0xff);
--- 119,201 ----
  	state[3] += d;
  }
  
! static inline void
! initializeState(uint32 state[4])
  {
  	state[0] = 0x67452301;
  	state[1] = 0xEFCDAB89;
  	state[2] = 0x98BADCFE;
  	state[3] = 0x10325476;
+ }
  
! static inline void
! processBlock(const uint8 input[64], uint32 state[4])
! {
! 	uint32		buff[16];
! 	uint32		i;
! 	uint32		j;
! 	uint32	   *bp;
  
! 	for (i = 0, j = 3; i < 16; i++)
  	{
! 		bp = (buff + i);
! 		*bp = input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j--];
! 		*bp <<= 8;
! 		*bp |= input[j];
! 		j += 7;
  	}
! 	doTheRounds(buff, state);
! }
  
! static void
! calculateFinalSum(const uint8 input[64], uint32 state[4],
! 				  uint32 len_low, uint32 len_high, uint8 sum[16])
! {
! 	uint8		buff[128];
! 	uint32		offset;
! 	uint32		size;
! 	uint32		i;
! 	uint32		j;
! 	uint32		k;
! 
! 	/* Last partial data block */
! 	offset = (len_low >> 3) & 63;
! 	if (offset > 0)
! 		memcpy(buff, input, offset);
! 
! 	/* Pad to a complete block minus 8 bytes for the bit count */
! 	size = offset < 56 ? 56 : 120;
! 	buff[offset++] = 0x80;
! 	while (offset < size)
! 	   buff[offset++] = 0x00;
! 
! 	/* Append the 64-bit bit count */
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	len_low >>= 8;
! 	buff[offset++] = (len_low & 0xff);
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset++] = (len_high & 0xff);
! 	len_high >>= 8;
! 	buff[offset] = (len_high & 0xff);
! 
! 	/* Process these last 1 or 2 blocks */
! 	processBlock(buff, state);
! 	if (size > 56)
! 	   processBlock(buff+64, state);
! 
! 	/* Produce the final MD5 sum */
! 	for (i = j = 0; i < 4; i++)
  	{
  		k = state[i];
  		sum[j++] = (k & 0xff);
*************** calculateDigestFromBuffer(const uint8 *b
*** 238,259 ****
  		k >>= 8;
  		sum[j++] = (k & 0xff);
  	}
- 	return 1;
  }
  
  static void
! bytesToHex(uint8 b[16], char *s)
  {
  	static const char *hex = "0123456789abcdef";
  	int			q,
  				w;
  
! 	for (q = 0, w = 0; q < 16; q++)
  	{
! 		s[w++] = hex[(b[q] >> 4) & 0x0F];
! 		s[w++] = hex[b[q] & 0x0F];
  	}
! 	s[w] = '\0';
  }
  
  /*
--- 206,260 ----
  		k >>= 8;
  		sum[j++] = (k & 0xff);
  	}
  }
  
  static void
! calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
! {
!     uint32      state[4];
! 	uint32		len_low;
! 	uint32		len_high;
! 
! 	initializeState(state);
! 
! 	if (b == NULL)
! 	{
! 		/* Treated the same as empty input */
! 		len_low = 0;
! 		len_high = 0;
! 	}
! 	else
! 	{
! 		/* Calculate the 64-bit bit count */
! 		len_low = len << 3;
! 		len_high = len >> 29;
! 
! 		/* Process any complete blocks of data */
! 		while (len >= 64)
! 		{
! 			processBlock(b, state);
! 			b += 64;
! 			len -= 64;
! 		}
! 	}
! 
! 	/* Produce the final MD5 sum from any remaining data */
! 	calculateFinalSum(b, state, len_low, len_high, sum);
! }
! 
! static void
! bytesToHex(uint8 *input, size_t input_len, char *output)
  {
  	static const char *hex = "0123456789abcdef";
  	int			q,
  				w;
  
! 	for (q = 0, w = 0; q < input_len; q++)
  	{
! 		output[w++] = hex[(input[q] >> 4) & 0x0F];
! 		output[w++] = hex[input[q] & 0x0F];
  	}
! 	output[w] = '\0';
  }
  
  /*
*************** pg_md5_hash(const void *buff, size_t len
*** 291,308 ****
  {
  	uint8		sum[16];
  
! 	if (!calculateDigestFromBuffer(buff, len, sum))
! 		return false;
! 
! 	bytesToHex(sum, hexsum);
  	return true;
  }
  
  bool
  pg_md5_binary(const void *buff, size_t len, void *outbuf)
  {
! 	if (!calculateDigestFromBuffer(buff, len, outbuf))
! 		return false;
  	return true;
  }
  
--- 292,306 ----
  {
  	uint8		sum[16];
  
! 	calculateDigestFromBuffer(buff, len, sum);
! 	bytesToHex(sum, 16, hexsum);
  	return true;
  }
  
  bool
  pg_md5_binary(const void *buff, size_t len, void *outbuf)
  {
! 	calculateDigestFromBuffer(buff, len, outbuf);
  	return true;
  }
  
*************** pg_md5_encrypt(const char *passwd, const
*** 343,345 ****
--- 341,525 ----
  
  	return ret;
  }
+ 
+ /*
+  * pg_md5_init
+  *
+  *		Initialize or reset the specified MD5State for performing a new
+  *		cumulative MD5 computation.
+  */
+ void
+ pg_md5_init(MD5State *state)
+ {
+ 	state->len_low = 0;
+ 	state->len_high = 0;
+ 	initializeState(state->state);
+ }
+ 
+ /*
+  * pg_md5_update
+  *
+  *		Update the specified MD5 message digest by adding the specified data.
+  *		This may be called multiple times after the MD5State has been
+  *		initialized with pg_md5_init, and before calling pg_md5_final_hash.
+  */
+ void
+ pg_md5_update(MD5State *state, const void *buff, size_t len)
+ {
+ 	uint32		offset;
+ 	const uint8 *input;
+ 	uint32		nbits;
+ 
+ 	/* Size of previous partial data block */
+ 	offset = (state->len_low >> 3) & 63;
+ 	input = (uint8 *) buff;
+ 
+ 	/* Update the bit count */
+ 	nbits = (uint32) (len << 3);
+ 	state->len_low += nbits;
+ 	state->len_high += len >> 29;
+ 	if (state->len_low < nbits)
+ 		state->len_high++; /* overflow */
+ 
+ 	/* Process the previous partial block of data */
+ 	if (offset > 0)
+ 	{
+ 		int			copy = offset + len > 64 ? 64 - offset : len;
+ 
+ 		memcpy(state->buffer + offset, input, copy);
+ 		if (offset + copy < 64)
+ 			return; /* still don't have a complete block of data */
+ 
+ 		processBlock(state->buffer, state->state);
+ 		input += copy;
+ 		len -= copy;
+     }
+ 
+ 	/* Process any complete blocks of data */
+ 	while (len >= 64)
+ 	{
+ 		processBlock(input, state->state);
+ 		input += 64;
+ 		len -= 64;
+ 	}
+ 
+ 	/* Save any remaining partial data block for next time */
+ 	if (len > 0)
+ 		memcpy(state->buffer, input, len);
+ }
+ 
+ /*
+  * pg_md5_final_hash
+  *
+  *		Compute the final MD5 sum of the data added to the specified MD5State.
+  *
+  *		Once this has been called, it should not be called again, and no
+  *		further data should be added to the MD5State without first resetting
+  *		it using pg_md5_init.
+  */
+ void
+ pg_md5_final_hash(MD5State *state, char *hexsum)
+ {
+ 	uint8		sum[16];
+ 
+ 	calculateFinalSum(state->buffer, state->state,
+ 					  state->len_low, state->len_high, sum);
+ 	bytesToHex(sum, 16, hexsum);
+ }
+ 
+ /*
+  * pg_md5s_init
+  *
+  *		Initialize the specified MD5SumState for calculating a sum of MD5
+  *		sums.  sum_len is the number of bytes to use for the sum of MD5 sums.
+  *		This must be at least 16 (the size of a single MD5 sum).  If the
+  *		overall sum exceeds this size it will wrap around.
+  */
+ void
+ pg_md5s_init(MD5SumState *state, size_t sum_len)
+ {
+ 	int			i;
+ 
+ 	/* The sum array in an MD5SumState is at least 16 bytes in size */
+ 	state->sum_len = sum_len;
+ 	if (state->sum_len < 16)
+ 		state->sum_len = 16;
+ 
+ 	/* Initialize the array to 0 */
+ 	for (i = 0; i < state->sum_len; i++)
+ 		state->sum[i] = 0;
+ }
+ 
+ /*
+  * pg_md5s_update
+  *
+  *		Update the specified sum of MD5 sums by adding the MD5 sum of the
+  *		specified data to the total.  This may be called multiple times after
+  *		the MD5SumState has been initialized with pg_md5s_init, and before
+  *		calling pg_md5s_final_hash.
+  */
+ void
+ pg_md5s_update(MD5SumState *state, const void *buff, size_t len)
+ {
+ 	uint8		md5[16];
+ 	int			i;
+ 	int			j;
+ 	uint32		sum;
+ 	uint32		carry;
+ 
+ 	/* Calculate the MD5 sum of the input */
+ 	calculateDigestFromBuffer(buff, len, md5);
+ 
+ 	/* Add it to our sum of MD5 sums */
+ 	i = state->sum_len - 1;
+ 	j = 15;
+ 	carry = 0;
+ 
+ 	while (j >= 0)
+ 	{
+ 		sum = (uint32) state->sum[i] + md5[j] + carry;
+ 		state->sum[i] = (uint8) (sum & 0xFF);
+ 		carry = sum >> 8;
+ 		i--;
+ 		j--;
+ 	}
+ 	while (carry > 0 && i >= 0)
+ 	{
+ 		sum = (uint32) state->sum[i] + carry;
+ 		state->sum[i] = (uint8) (sum & 0xFF);
+ 		carry = sum >> 8;
+ 		i--;
+ 	}
+ }
+ 
+ /*
+  * pg_md5s_final_hash
+  *
+  *		Return the final sum of MD5 sums of the data added to the specified
+  *		MD5SumState.
+  */
+ void
+ pg_md5s_final_hash(MD5SumState *state, char *hexsum)
+ {
+ 	size_t		len;
+ 	uint8	   *sum;
+ 
+ 	/* Skip leading zeros */
+ 	len = state->sum_len;
+ 	sum = state->sum;
+ 
+ 	while (len > 0 && *sum == 0)
+ 	{
+ 		sum++;
+ 		len--;
+ 	}
+ 
+ 	/* Output the sum as hexadecimal */
+ 	if (len > 0)
+ 		bytesToHex(sum, len, hexsum);
+ 	else
+ 	{
+ 		hexsum[0] = '0';
+ 		hexsum[1] = '\0';
+ 	}
+ }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
new file mode 100644
index 56349e7..722e425
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*************** md5_text(PG_FUNCTION_ARGS)
*** 3643,3652 ****
  	len = VARSIZE_ANY_EXHDR(in_text);
  
  	/* get the hash result */
! 	if (pg_md5_hash(VARDATA_ANY(in_text), len, hexsum) == false)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OUT_OF_MEMORY),
! 				 errmsg("out of memory")));
  
  	/* convert to text and return it */
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
--- 3643,3649 ----
  	len = VARSIZE_ANY_EXHDR(in_text);
  
  	/* get the hash result */
! 	pg_md5_hash(VARDATA_ANY(in_text), len, hexsum);
  
  	/* convert to text and return it */
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
*************** md5_bytea(PG_FUNCTION_ARGS)
*** 3664,3678 ****
  	char		hexsum[MD5_HASH_LEN + 1];
  
  	len = VARSIZE_ANY_EXHDR(in);
! 	if (pg_md5_hash(VARDATA_ANY(in), len, hexsum) == false)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OUT_OF_MEMORY),
! 				 errmsg("out of memory")));
  
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
  }
  
  /*
   * Return the size of a datum, possibly compressed
   *
   * Works on any data type
--- 3661,3889 ----
  	char		hexsum[MD5_HASH_LEN + 1];
  
  	len = VARSIZE_ANY_EXHDR(in);
! 	pg_md5_hash(VARDATA_ANY(in), len, hexsum);
  
  	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
  }
  
  /*
+  * md5_agg - Create an md5 hash of the input values.
+  *
+  * This is equivalent to using string_agg with an empty delimiter to
+  * concatenate all the input values, and then taking the MD5 sum of the
+  * result.  However, doing that would use an excessive amount of memory,
+  * whereas md5_agg uses only a small fixed-size state structure.
+  */
+ 
+ static MD5State *
+ makeMD5State(FunctionCallInfo fcinfo)
+ {
+ 	MD5State   *state;
+ 	MemoryContext aggcontext;
+ 	MemoryContext oldcontext;
+ 
+ 	if (!AggCheckCallContext(fcinfo, &aggcontext))
+ 	{
+ 		/* cannot be called directly because of internal-type argument */
+ 		elog(ERROR, "md5_agg_transfn called in non-aggregate context");
+ 	}
+ 
+ 	oldcontext = MemoryContextSwitchTo(aggcontext);
+ 	state = (MD5State *) palloc(sizeof(MD5State));
+ 	pg_md5_init(state);
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return state;
+ }
+ 
+ Datum
+ md5_text_agg_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value to the MD5 sum, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		text	   *in_text = PG_GETARG_TEXT_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5State(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in_text);
+ 		pg_md5_update(state, VARDATA_ANY(in_text), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_agg() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_bytea_agg_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value to the MD5 sum, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		bytea	   *in = PG_GETARG_BYTEA_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5State(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in);
+ 		pg_md5_update(state, VARDATA_ANY(in), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_agg() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_agg_finalfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5State   *state;
+ 
+ 	/* cannot be called directly because of internal-type argument */
+ 	Assert(AggCheckCallContext(fcinfo, NULL));
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5State *) PG_GETARG_POINTER(0);
+ 
+ 	if (state != NULL)
+ 	{
+ 		char		hexsum[MD5_HASH_LEN + 1];
+ 
+ 		pg_md5_final_hash(state, hexsum);
+ 
+ 		PG_RETURN_TEXT_P(cstring_to_text(hexsum));
+ 	}
+ 	else
+ 		PG_RETURN_NULL();
+ }
+ 
+ /*
+  * md5_total - Create a sum of md5 hashes of the input values.
+  *
+  * This computes the md5 hash of each non-null input value and adds it to the
+  * total, treating it like a 128-bit integer.  The result is then the final
+  * sum of md5 sums, as a hexadecimal number, with at least 32 digits.
+  *
+  * An md5 hash is 16 bytes;  use an addition 8 bytes for the sum of md5 hashes
+  * so that it won't wrap round until at least 2^64 values have been added.
+  */
+ 
+ #define MD5_SUM_LEN		24
+ 
+ static MD5SumState *
+ makeMD5SumState(FunctionCallInfo fcinfo)
+ {
+ 	MD5SumState *state;
+ 	MemoryContext aggcontext;
+ 	MemoryContext oldcontext;
+ 
+ 	if (!AggCheckCallContext(fcinfo, &aggcontext))
+ 	{
+ 		/* cannot be called directly because of internal-type argument */
+ 		elog(ERROR, "md5_total_transfn called in non-aggregate context");
+ 	}
+ 
+ 	oldcontext = MemoryContextSwitchTo(aggcontext);
+ 	state = (MD5SumState *) palloc(sizeof(MD5SumState) +
+ 									MD5_SUM_LEN - MD5_HASH_LEN/2);
+ 	pg_md5s_init(state, MD5_SUM_LEN);
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return state;
+ }
+ 
+ Datum
+ md5_text_total_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5SumState *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5SumState *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value's MD5 sum to the sum of MD5 sums, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		text	   *in_text = PG_GETARG_TEXT_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5SumState(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in_text);
+ 		pg_md5s_update(state, VARDATA_ANY(in_text), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_total() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_bytea_total_transfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5SumState *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5SumState *) PG_GETARG_POINTER(0);
+ 
+ 	/* Add the value's MD5 sum to the sum of MD5 sums, unless it is null */
+ 	if (!PG_ARGISNULL(1))
+ 	{
+ 		bytea	   *in = PG_GETARG_BYTEA_PP(1);
+ 		size_t		len;
+ 
+ 		if (state == NULL)
+ 			state = makeMD5SumState(fcinfo);
+ 
+ 		len = VARSIZE_ANY_EXHDR(in);
+ 		pg_md5s_update(state, VARDATA_ANY(in), len);
+ 	}
+ 
+ 	/*
+ 	 * The transition type for md5_total() is declared to be "internal",
+ 	 * which is a pass-by-value type the same size as a pointer.
+ 	 */
+ 	PG_RETURN_POINTER(state);
+ }
+ 
+ Datum
+ md5_total_finalfn(PG_FUNCTION_ARGS)
+ {
+ 	MD5SumState *state;
+ 
+ 	/* cannot be called directly because of internal-type argument */
+ 	Assert(AggCheckCallContext(fcinfo, NULL));
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (MD5SumState *) PG_GETARG_POINTER(0);
+ 
+ 	if (state != NULL)
+ 	{
+ 		char		hexsum[sizeof(state->sum) * 2 + 1];
+ 
+ 		pg_md5s_final_hash(state, hexsum);
+ 
+ 		PG_RETURN_TEXT_P(cstring_to_text(hexsum));
+ 	}
+ 	else
+ 		PG_RETURN_NULL();
+ }
+ 
+ /*
   * Return the size of a datum, possibly compressed
   *
   * Works on any data type
diff --git a/src/include/catalog/pg_aggregate.h b/src/include/catalog/pg_aggregate.h
new file mode 100644
index 6fb10a9..5182f12
*** a/src/include/catalog/pg_aggregate.h
--- b/src/include/catalog/pg_aggregate.h
*************** DATA(insert ( 3545	bytea_string_agg_tran
*** 235,240 ****
--- 235,246 ----
  /* json */
  DATA(insert ( 3175	json_agg_transfn	json_agg_finalfn		0	2281	_null_ ));
  
+ /* md5 */
+ DATA(insert ( 3179	md5_text_agg_transfn	md5_agg_finalfn		0	2281	_null_ ));
+ DATA(insert ( 3181	md5_bytea_agg_transfn	md5_agg_finalfn		0	2281	_null_ ));
+ DATA(insert ( 3184	md5_text_total_transfn	md5_total_finalfn		0	2281	_null_ ));
+ DATA(insert ( 3186	md5_bytea_total_transfn	md5_total_finalfn		0	2281	_null_ ));
+ 
  /*
   * prototypes for functions in pg_aggregate.c
   */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index b5be075..d7cf3b4
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DESCR("MD5 hash");
*** 3519,3524 ****
--- 3519,3546 ----
  DATA(insert OID =  2321 (  md5	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "17" _null_ _null_ _null_ _null_ md5_bytea _null_ _null_ _null_ ));
  DESCR("MD5 hash");
  
+ DATA(insert OID = 3177 (  md5_text_agg_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 25" _null_ _null_ _null_ _null_ md5_text_agg_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3178 (  md5_agg_finalfn		PGNSP PGUID 12 1 0 0 0 f f f f f f i 1 0 25 "2281" _null_ _null_ _null_ _null_ md5_agg_finalfn _null_ _null_ _null_ ));
+ DESCR("aggregate final function");
+ DATA(insert OID = 3179 (  md5_agg				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "25" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 text hashing aggregate");
+ DATA(insert OID = 3180 (  md5_bytea_agg_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 17" _null_ _null_ _null_ _null_ md5_bytea_agg_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3181 (  md5_agg				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "17" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 bytea hashing aggregate");
+ 
+ DATA(insert OID = 3182 (  md5_text_total_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 25" _null_ _null_ _null_ _null_ md5_text_total_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3183 (  md5_total_finalfn		PGNSP PGUID 12 1 0 0 0 f f f f f f i 1 0 25 "2281" _null_ _null_ _null_ _null_ md5_total_finalfn _null_ _null_ _null_ ));
+ DESCR("aggregate final function");
+ DATA(insert OID = 3184 (  md5_total				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "25" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 text hash sum aggregate");
+ DATA(insert OID = 3185 (  md5_bytea_total_transfn	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2281 "2281 17" _null_ _null_ _null_ _null_ md5_bytea_total_transfn _null_ _null_ _null_ ));
+ DESCR("aggregate transition function");
+ DATA(insert OID = 3186 (  md5_total				PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 25 "17" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
+ DESCR("MD5 bytea hash sum aggregate");
+ 
  /* crosstype operations for date vs. timestamp and timestamptz */
  DATA(insert OID = 2338 (  date_lt_timestamp		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "1082 1114" _null_ _null_ _null_ _null_ date_lt_timestamp _null_ _null_ _null_ ));
  DATA(insert OID = 2339 (  date_le_timestamp		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "1082 1114" _null_ _null_ _null_ _null_ date_le_timestamp _null_ _null_ _null_ ));
diff --git a/src/include/libpq/md5.h b/src/include/libpq/md5.h
new file mode 100644
index cfa8df5..bd49f92
*** a/src/include/libpq/md5.h
--- b/src/include/libpq/md5.h
***************
*** 21,30 ****
--- 21,56 ----
  #define isMD5(passwd)	(strncmp(passwd, "md5", 3) == 0 && \
  						 strlen(passwd) == MD5_PASSWD_LEN)
  
+ /* State information needed in a cumulative MD5 computation. */
+ typedef struct MD5State
+ {
+ 	uint32		len_low;	/* low part of 64-bit bit count */
+ 	uint32		len_high;	/* high part of 64-bit bit count */
+ 	uint32		state[4];	/* MD5 digest state buffer */
+ 	uint8		buffer[64];	/* input data buffer */
+ } MD5State;
+ 
+ /* State information for computing a sum of MD5 sums. */
+ typedef struct MD5SumState
+ {
+ 	size_t		sum_len;	/* length of sum (>= 16) */
+ 	uint8		sum[16];	/* sum of MD5 sums --- VARIABLE LENGTH ARRAY */
+ } MD5SumState;
  
+ /* functions to compute the MD5 sum of a single input value */
  extern bool pg_md5_hash(const void *buff, size_t len, char *hexsum);
  extern bool pg_md5_binary(const void *buff, size_t len, void *outbuf);
  extern bool pg_md5_encrypt(const char *passwd, const char *salt,
  			   size_t salt_len, char *buf);
  
+ /* cumulative MD5 computation functions */
+ extern void pg_md5_init(MD5State *state);
+ extern void pg_md5_update(MD5State *state, const void *buff, size_t len);
+ extern void pg_md5_final_hash(MD5State *state, char *hexsum);
+ 
+ /* functions to compute a sum of MD5 sums */
+ extern void pg_md5s_init(MD5SumState *state, size_t sum_len);
+ extern void pg_md5s_update(MD5SumState *state, const void *buff, size_t len);
+ extern void pg_md5s_final_hash(MD5SumState *state, char *hexsum);
+ 
  #endif
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index 667c58b..0ed6e33
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum to_hex32(PG_FUNCTION_ARGS);
*** 779,784 ****
--- 779,790 ----
  extern Datum to_hex64(PG_FUNCTION_ARGS);
  extern Datum md5_text(PG_FUNCTION_ARGS);
  extern Datum md5_bytea(PG_FUNCTION_ARGS);
+ extern Datum md5_text_agg_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_bytea_agg_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_agg_finalfn(PG_FUNCTION_ARGS);
+ extern Datum md5_text_total_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_bytea_total_transfn(PG_FUNCTION_ARGS);
+ extern Datum md5_total_finalfn(PG_FUNCTION_ARGS);
  
  extern Datum unknownin(PG_FUNCTION_ARGS);
  extern Datum unknownout(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index d379c0d..c069b22
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** select string_agg(v, decode('ee', 'hex')
*** 1153,1156 ****
--- 1153,1195 ----
   \xffeeaa
  (1 row)
  
+ -- md5_agg tests
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),('bbbb'),('cccc')) g(a);
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  ccb3bf4d77b887690b3b89663823d13d | ccb3bf4d77b887690b3b89663823d13d
+ (1 row)
+ 
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),(null),('cccc')) g(a);
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  39797f2e1f4fc388384fccfd75dcd5b6 | 39797f2e1f4fc388384fccfd75dcd5b6
+ (1 row)
+ 
+ select md5_agg(a), md5(string_agg(a,'')) from (values(null),(null)) g(a);
+  md5_agg | md5 
+ ---------+-----
+          | 
+ (1 row)
+ 
+ select md5_agg(distinct f1 order by f1),
+        md5(string_agg(distinct f1, '' order by f1)) from varchar_tbl;
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  f271345903df8fde90583255d5c5dcd4 | f271345903df8fde90583255d5c5dcd4
+ (1 row)
+ 
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+              md5_agg              |               md5                
+ ----------------------------------+----------------------------------
+  6c33e22e546f0c70faa7e2f674760ac4 | 6c33e22e546f0c70faa7e2f674760ac4
+ (1 row)
+ 
+ truncate bytea_test_table;
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+  md5_agg | md5 
+ ---------+-----
+          | 
+ (1 row)
+ 
  drop table bytea_test_table;
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
new file mode 100644
index 281c695..1babf96
*** a/src/test/regress/expected/strings.out
--- b/src/test/regress/expected/strings.out
*************** select md5('') = 'd41d8cd98f00b204e98009
*** 1278,1359 ****
--- 1278,1443 ----
   t
  (1 row)
  
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values('')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890')) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values(''::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
   TRUE 
  ------
   t
  (1 row)
  
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea)) t(v);
+  TRUE 
+ ------
+  t
+ (1 row)
+ 
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
   TRUE 
  ------
   t
+ (1 row)
+ 
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea)) t(v);
+  TRUE 
+ ------
+  t
  (1 row)
  
  --
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
new file mode 100644
index 38d4757..a89c0ba
*** a/src/test/regress/sql/aggregates.sql
--- b/src/test/regress/sql/aggregates.sql
*************** select string_agg(v, '') from bytea_test
*** 441,444 ****
--- 441,456 ----
  select string_agg(v, NULL) from bytea_test_table;
  select string_agg(v, decode('ee', 'hex')) from bytea_test_table;
  
+ -- md5_agg tests
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),('bbbb'),('cccc')) g(a);
+ select md5_agg(a), md5(string_agg(a,'')) from (values('aaaa'),(null),('cccc')) g(a);
+ select md5_agg(a), md5(string_agg(a,'')) from (values(null),(null)) g(a);
+ 
+ select md5_agg(distinct f1 order by f1),
+        md5(string_agg(distinct f1, '' order by f1)) from varchar_tbl;
+ 
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+ truncate bytea_test_table;
+ select md5_agg(v), md5(string_agg(v, '')) from bytea_test_table;
+ 
  drop table bytea_test_table;
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
new file mode 100644
index e7841aa..8385d57
*** a/src/test/regress/sql/strings.sql
--- b/src/test/regress/sql/strings.sql
*************** select to_hex(256::bigint*256::bigint*25
*** 455,486 ****
--- 455,500 ----
  -- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
  --
  select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values('')) t(v);
  
  select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a')) t(v);
  
  select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc')) t(v);
  
  select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest')) t(v);
  
  select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz')) t(v);
  
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')) t(v);
  
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890')) t(v);
  
  select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ select md5_agg(v) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE" from (values(''::bytea)) t(v);
  
  select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ select md5_agg(v) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE" from (values('a'::bytea)) t(v);
  
  select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ select md5_agg(v) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE" from (values('abc'::bytea)) t(v);
  
  select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ select md5_agg(v) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE" from (values('message digest'::bytea)) t(v);
  
  select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ select md5_agg(v) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE" from (values('abcdefghijklmnopqrstuvwxyz'::bytea)) t(v);
  
  select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ select md5_agg(v) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE" from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea)) t(v);
  
  select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ select md5_agg(v) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE" from (values('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea)) t(v);
  
  --
  -- test behavior of escape_string_warning and standard_conforming_strings options

md5-100m-row-test.sqlapplication/octet-stream; name=md5-100m-row-test.sqlDownload

#19

Marko Kreen

markokr@gmail.com

over 12 years ago

In reply to: Dean Rasheed (#18)

Re: MD5 aggregate

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
bigger than available RAM). If you want to promote such use you should
also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints. Just let it
happen. Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
or SpookyHash. You can get C implementation from pghashlib.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

David Fetter

david@fetter.org

over 12 years ago

In reply to: Dean Rasheed (#18)

Review [was Re: MD5 aggregate]

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Submission review:

Is the patch in a patch format which has context? (eg: context diff format)

Yes.

Does it apply cleanly to the current git master?

Yes.

Does it include reasonable tests, necessary doc patches, etc?

Yes.

Usability review:

Does the patch actually implement that?

Yes.

Do we want that?

Enough do.

Do we already have it?

No.

Does it follow SQL spec, or the community-agreed behavior?

No, and yes, respectively.

Does it include pg_dump support (if applicable)?

N/A

Are there dangers?

Patch changes the MD5 implementation, which could
theoretically result in backward incompatibility if the
changes are not 100% backward-compatible.

Have all the bases been covered?

Yes.

Feature test:

Does the feature work as advertised?

Yes.

Are there corner cases the author has failed to consider?

Not that I've found so far.

Are there any assertion failures or crashes?

No.

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes. For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
shackle@postgres:5493=# \g
Time: 948.623 ms
shackle@postgres:5493=# \g
Time: 939.819 ms

and here from master + the patch:

WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 1129.178 ms
shackle@postgres:5494=# \g
Time: 1134.013 ms
shackle@postgres:5494=# \g
Time: 1124.387 ms
shackle@postgres:5494=# \g
Time: 1119.733 ms
shackle@postgres:5494=# \g
Time: 1104.906 ms
shackle@postgres:5494=# \g
Time: 1137.055 ms
shackle@postgres:5494=# \g
Time: 1128.977 ms
shackle@postgres:5494=# \g
Time: 1143.619 ms

Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

If it claims to improve performance, does it?

Possibly for the aggregate.

Does it slow down other things?

Not that I've found.

Coding review:

Does it follow the project coding guidelines?

Yes.

Are there portability issues?

Not that I can see.

Will it work on Windows/BSD etc?

Not yet tested.

Are the comments sufficient and accurate?

Yes.

Does it do what it says, correctly?

As far as I can tell.

Does it produce compiler warnings?

No.

Can you make it crash?

My efforts so far have failed.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

David Fetter

david@fetter.org

over 12 years ago

In reply to: David Fetter (#20)

Re: Review [was Re: MD5 aggregate]

On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes. For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
shackle@postgres:5493=# \g
Time: 948.623 ms
shackle@postgres:5493=# \g
Time: 939.819 ms

and here from master + the patch:

WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 1129.178 ms
shackle@postgres:5494=# \g
Time: 1134.013 ms
shackle@postgres:5494=# \g
Time: 1124.387 ms
shackle@postgres:5494=# \g
Time: 1119.733 ms
shackle@postgres:5494=# \g
Time: 1104.906 ms
shackle@postgres:5494=# \g
Time: 1137.055 ms
shackle@postgres:5494=# \g
Time: 1128.977 ms
shackle@postgres:5494=# \g
Time: 1143.619 ms

Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

The above was with -O0. Please find below with vanilla ./configure (-O2):

Master:
shackle@postgres:5493=# WITH t1 AS (SELECT
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 469.101 ms
shackle@postgres:5493=# \g
Time: 464.122 ms
shackle@postgres:5493=# \g
Time: 461.411 ms
shackle@postgres:5493=# \g
Time: 458.222 ms
shackle@postgres:5493=# \g
Time: 463.616 ms
shackle@postgres:5493=# \g
Time: 463.983 ms
shackle@postgres:5493=# \g
Time: 453.770 ms
shackle@postgres:5493=# \g
Time: 456.729 ms

MD5:
WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 436.862 ms
shackle@postgres:5494=# \t
Showing only tuples.
shackle@postgres:5494=# \t
Tuples only is off.
shackle@postgres:5494=# \g
Time: 438.686 ms
shackle@postgres:5494=# \g
Time: 443.789 ms
shackle@postgres:5494=# \g
Time: 452.522 ms
shackle@postgres:5494=# \g
Time: 447.824 ms
shackle@postgres:5494=# \g
Time: 448.547 ms
shackle@postgres:5494=# \g
Time: 446.901 ms
shackle@postgres:5494=# \g
Time: 448.537 ms

Average, stddev population for master: 461.36925, 4.5883631545
Average, stddev population for md5: 445.4585, 4.9892286729

so slightly faster on MD5, as expected.

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: David Fetter (#21)

Re: Review [was Re: MD5 aggregate]

On 21 June 2013 21:04, David Fetter <david@fetter.org> wrote:

On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes. For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
shackle@postgres:5493=# \g
Time: 948.623 ms
shackle@postgres:5493=# \g
Time: 939.819 ms

and here from master + the patch:

WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 1129.178 ms
shackle@postgres:5494=# \g
Time: 1134.013 ms
shackle@postgres:5494=# \g
Time: 1124.387 ms
shackle@postgres:5494=# \g
Time: 1119.733 ms
shackle@postgres:5494=# \g
Time: 1104.906 ms
shackle@postgres:5494=# \g
Time: 1137.055 ms
shackle@postgres:5494=# \g
Time: 1128.977 ms
shackle@postgres:5494=# \g
Time: 1143.619 ms

Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

The above was with -O0. Please find below with vanilla ./configure (-O2):

Master:
shackle@postgres:5493=# WITH t1 AS (SELECT
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 469.101 ms
shackle@postgres:5493=# \g
Time: 464.122 ms
shackle@postgres:5493=# \g
Time: 461.411 ms
shackle@postgres:5493=# \g
Time: 458.222 ms
shackle@postgres:5493=# \g
Time: 463.616 ms
shackle@postgres:5493=# \g
Time: 463.983 ms
shackle@postgres:5493=# \g
Time: 453.770 ms
shackle@postgres:5493=# \g
Time: 456.729 ms

MD5:
WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 436.862 ms
shackle@postgres:5494=# \t
Showing only tuples.
shackle@postgres:5494=# \t
Tuples only is off.
shackle@postgres:5494=# \g
Time: 438.686 ms
shackle@postgres:5494=# \g
Time: 443.789 ms
shackle@postgres:5494=# \g
Time: 452.522 ms
shackle@postgres:5494=# \g
Time: 447.824 ms
shackle@postgres:5494=# \g
Time: 448.547 ms
shackle@postgres:5494=# \g
Time: 446.901 ms
shackle@postgres:5494=# \g
Time: 448.537 ms

Average, stddev population for master: 461.36925, 4.5883631545
Average, stddev population for md5: 445.4585, 4.9892286729

so slightly faster on MD5, as expected.

Cheers,
David.

Thanks for the review and testing!

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

Noah Misch

noah@leadboat.com

over 12 years ago

In reply to: Dean Rasheed (#18)

Re: MD5 aggregate

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

I took a look at this patch. First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

+         <primary>md5_total</primary>
+        </indexterm>
+        <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
+       </entry>
+       <entry><type>text</type> or <type>bytea</type></entry>
+       <entry><type>text</type></entry>
+       <entry>
+        sum of the MD5 hash values of the input values, independent of their
+        order

This is not specific enough. We would need to provide either an algorithm
specification or a literature reference. However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL. I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

*** a/src/backend/libpq/md5.c
--- b/src/backend/libpq/md5.c
***************
*** 1,14 ****
/*
*	md5.c
*
!  *	Implements	the  MD5 Message-Digest Algorithm as specified in
!  *	RFC  1321.	This  implementation  is a simple one, in that it
!  *	needs  every  input  byte  to  be  buffered  before doing any
!  *	calculations.  I  do  not  expect  this  file  to be used for
!  *	general  purpose  MD5'ing  of large amounts of data, only for
!  *	generating hashed passwords from limited input.

In other words, performance wasn't a design criterion. I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy. What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Noah Misch (#23)

Re: MD5 aggregate

On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

I took a look at this patch. First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

I didn't really set out with the aim of optimising the existing md5()
functions at all, but merely to restructure the code in a way that
would expose the necessary API for md5_agg(). As the timings up-thread
show, the performance gain is really very modest, although the memory
saving might be more of a factor in some setups. In fact, for the
sorts of queries shown, the vast majority of the time is spent
elsewhere, as can be seen by replacing md5() with a more trivial
function such as length().

+         <primary>md5_total</primary>
+        </indexterm>
+        <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
+       </entry>
+       <entry><type>text</type> or <type>bytea</type></entry>
+       <entry><type>text</type></entry>
+       <entry>
+        sum of the MD5 hash values of the input values, independent of their
+        order
This is not specific enough. We would need to provide either an algorithm
specification or a literature reference. However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL. I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

Yes, I can totally understand that point-of-view. I wasn't really
intending to write md5_total() at the outset - it was more of an
intellectual exercise following some of the previous points raised. So
on reflection, I think I can also agree that that probably doesn't
belong in core.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

*** a/src/backend/libpq/md5.c
--- b/src/backend/libpq/md5.c
***************
*** 1,14 ****
/*
*  md5.c
*
!  *  Implements      the  MD5 Message-Digest Algorithm as specified in
!  *  RFC  1321.      This  implementation  is a simple one, in that it
!  *  needs  every  input  byte  to  be  buffered  before doing any
!  *  calculations.  I  do  not  expect  this  file  to be used for
!  *  general  purpose  MD5'ing  of large amounts of data, only for
!  *  generating hashed passwords from limited input.
In other words, performance wasn't a design criterion. I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy. What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Thanks for the review.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Peter Eisentraut

peter_e@gmx.net

over 12 years ago

In reply to: Dean Rasheed (#24)

Re: MD5 aggregate

On 6/26/13 4:04 PM, Dean Rasheed wrote:

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

Well, in general, I'd rather see the sophisticated stuff in core and the
hacks in extensions.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Noah Misch

noah@leadboat.com

over 12 years ago

In reply to: Dean Rasheed (#24)

Re: MD5 aggregate

On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:

On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

All true, but I don't see those facts justifying core inclusion.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
several mirrors of that answer, and a few posts on this thread.

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Understood; feel free to back off from any performance improvements in which
you didn't wish to involve yourself.

If we do end up moving forward with md5_agg(), I note that the pgcrypto
version already has an initialize/accumulate/finalize API division. A patch
importing that code largely-unchanged would be easier to verify than a patch
restructuring the src/backend/libpq/md5.c implementation. The two patches
would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Noah Misch (#26)

Re: MD5 aggregate

On 26 June 2013 22:48, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:

On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

All true, but I don't see those facts justifying core inclusion.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
several mirrors of that answer, and a few posts on this thread.

I found more than that using Google. It's not uncommon for people to
use Google and then pick the first "accepted" answer on Stack
Overflow, in which case they might well pick one of the answers here
[1]: http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql
lists they might find this [4]/messages/by-id/4E5F469C.5070104@compulab.co.il, or deduce from this [5]/messages/by-id/e94d85500801301153u6b976e31m89e311c7134a0160@mail.gmail.com that it isn't
possible.

[1]: http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2]: http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3]: https://ralphholz.de/?q=node/298

[4]: /messages/by-id/4E5F469C.5070104@compulab.co.il

[5]: /messages/by-id/e94d85500801301153u6b976e31m89e311c7134a0160@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Understood; feel free to back off from any performance improvements in which
you didn't wish to involve yourself.

If we do end up moving forward with md5_agg(), I note that the pgcrypto
version already has an initialize/accumulate/finalize API division. A patch
importing that code largely-unchanged would be easier to verify than a patch
restructuring the src/backend/libpq/md5.c implementation. The two patches
would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.

I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM (
SELECT i, md5_agg(chr(32+(i+j*31)%224))
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

which all produce the same overall sum.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Peter Eisentraut (#25)

Re: MD5 aggregate

On 26 June 2013 21:46, Peter Eisentraut <peter_e@gmx.net> wrote:

On 6/26/13 4:04 PM, Dean Rasheed wrote:

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29

Marko Kreen

markokr@gmail.com

over 12 years ago

In reply to: Dean Rasheed (#28)

Re: MD5 aggregate

On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On 26 June 2013 21:46, Peter Eisentraut <peter_e@gmx.net> wrote:

On 6/26/13 4:04 PM, Dean Rasheed wrote:

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

The problem with md5_agg() is that it's only useful in toy scenarios.

It's more useful give people script that does same sum(hash(row))
on dump file than try to run MD5 on ordered rows.

Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
people want way to check if 2 tables or table and dump are same.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Robert Haas

robertmhaas@gmail.com

over 12 years ago

In reply to: Marko Kreen (#29)

Re: MD5 aggregate

On Thu, Jun 27, 2013 at 7:29 AM, Marko Kreen <markokr@gmail.com> wrote:

On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On 26 June 2013 21:46, Peter Eisentraut <peter_e@gmx.net> wrote:

On 6/26/13 4:04 PM, Dean Rasheed wrote:

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

The problem with md5_agg() is that it's only useful in toy scenarios.

It's more useful give people script that does same sum(hash(row))
on dump file than try to run MD5 on ordered rows.

Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
people want way to check if 2 tables or table and dump are same.

I think you're trying to tell Dean to write the patch that you want
instead of the patch that he wants. There are certainly other things
that could be done that some people might sometimes prefer, but that
doesn't mean what he did isn't useful.

That having been said, I basically agree with Noah: I think this would
be a useful extension (perhaps even in contrib?) but I don't think we
need to install it by default. It's useful, but it's also narrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Peter Eisentraut

peter_e@gmx.net

over 12 years ago

In reply to: Dean Rasheed (#27)

Re: MD5 aggregate

On 6/27/13 4:19 AM, Dean Rasheed wrote:

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.

It's not clear what these people wanted this functionality for. They
all wanted to analyze a table to compare with another table (or the same
table later). Either, they wanted this to detect data changes, in which
case the right tool is a checksum, not a cryptographic hash. We already
have several checksum implementations in core, so we could expose on of
them. Or they wanted this to protect their data from tampering, in
which case the right tool is a cryptographic hash, but Noah argues that
a sum of MD5 hashes is not cryptographically sound. (And in any case,
we don't put cryptographic functionality into the core.)

The reason md5_agg is proposed here and in all those cited posts is
presumably because the md5() function was already there anyway. The the
md5() function is there because the md5 code was already there anyway
because of the authentication. Let's not add higher-order
already-there-anyway code. ;-)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32

Dean Rasheed

dean.a.rasheed@gmail.com

over 12 years ago

In reply to: Peter Eisentraut (#31)

Re: MD5 aggregate

On 27 June 2013 17:47, Peter Eisentraut <peter_e@gmx.net> wrote:

On 6/27/13 4:19 AM, Dean Rasheed wrote:

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.

It's not clear what these people wanted this functionality for. They
all wanted to analyze a table to compare with another table (or the same
table later). Either, they wanted this to detect data changes, in which
case the right tool is a checksum, not a cryptographic hash. We already
have several checksum implementations in core, so we could expose on of
them. Or they wanted this to protect their data from tampering, in
which case the right tool is a cryptographic hash, but Noah argues that
a sum of MD5 hashes is not cryptographically sound. (And in any case,
we don't put cryptographic functionality into the core.)

The reason md5_agg is proposed here and in all those cited posts is
presumably because the md5() function was already there anyway. The the
md5() function is there because the md5 code was already there anyway
because of the authentication. Let's not add higher-order
already-there-anyway code. ;-)

OK fair enough. It's looking like there are more people who don't want
md5_agg() in core, or want something different, than who do want it.
Also, if we're taking the view that the existing md5() function is
only for hashing passwords, then it's probably not worth trying to
optimise it.

At this stage it's probably best to mark this as returned with
feedback, and I'll consider the options for rewriting it, but not
during this commitfest.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers