MD5 aggregate

Started by Dean Rasheedalmost 13 years ago32 messageshackers
Jump to latest
#1Dean Rasheed
dean.a.rasheed@gmail.com

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2
variants of md5(), so pure binary data can also be checksummed.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

Regards,
Dean

Attachments:

md5_agg.patchapplication/octet-stream; name=md5_agg.patchDownload+556-223
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Dean Rasheed (#1)
Re: MD5 aggregate

On 6/13/13 5:35 AM, Dean Rasheed wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

That seems somewhat useful.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

I think it would be better if you split this into two separate patches.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Marko Kreen
markokr@gmail.com
In reply to: Dean Rasheed (#1)
Re: MD5 aggregate

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#3)
Re: MD5 aggregate

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Tom Lane (#4)
Re: MD5 aggregate

On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com>

wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane

xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).

Show quoted text

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#4)
Re: MD5 aggregate

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together. Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Stephen Frost (#6)
Re: MD5 aggregate

On 06/14/2013 09:40 AM, Stephen Frost wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together. Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

I'd rather go the other way, processing the records without having to
process them otherwise at all. Turning things into text must slow things
down, surely.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#7)
Re: MD5 aggregate

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen

#9Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#4)
Re: MD5 aggregate

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#9)
Re: MD5 aggregate

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature. If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result. Sounds a bit slow though.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Stephen Frost (#8)
Re: MD5 aggregate

On 14 June 2013 15:19, Stephen Frost <sfrost@snowman.net> wrote:

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

md5_agg(record) ?

Yes, I like it.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Andres Freund
andres@anarazel.de
In reply to: Dean Rasheed (#11)
Re: MD5 aggregate

On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:

On 14 June 2013 15:19, Stephen Frost <sfrost@snowman.net> wrote:

* Andrew Dunstan (andrew@dunslane.net) wrote:

I'd rather go the other way, processing the records without having
to process them otherwise at all. Turning things into text must slow
things down, surely.

That's certainly an interesting idea also..

md5_agg(record) ?

Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there's so many different
representations (short, long, compressed, external, external compressed)
of the same data that a md5 without normalizing that wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproduceable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#10)
Re: MD5 aggregate

On 06/14/2013 04:47 PM, Tom Lane wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 14 June 2013 14:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature. If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result. Sounds a bit slow though.

What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic O�

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Hannu Krosing (#13)
Re: MD5 aggregate

On 14 June 2013 16:09, Hannu Krosing <hannu@2ndquadrant.com> wrote:

What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do and index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Craig Ringer
craig@2ndquadrant.com
In reply to: Stephen Frost (#6)
Re: MD5 aggregate

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/14/2013 09:40 PM, Stephen Frost wrote:

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.

Until I started looking at the follow-up discussion I didn't realise it
wasn't supposed to.

If it isn't the md5sum of the ordered rows, it shouldn't be called
'md5'. It might still be useful, but please don't call it md5.

The proposals to make it produce the same result with different row
orderings sound useful in the context of SQL; if it produced a different
result with different orderings I'd want a way to force an explicit
ORDER BY clause on the aggregate and error out if one wasn't present.
Making it ignore order means that it's no longer md5, though.

row_sums(...) ?

- --
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRu+XMAAoJELBXNkqjr+S2mkkH/j8gi8d07dI6+G742f0U+v0J
u8DGhtDQuuWHalqlaUDOssmi4fRDg99OzLlR+Mid0yGL/UfFMoL47H+kNRoMkuzV
stUz3vf5rp8TbqEnikT3EwEKIuzaWrae0Fn3TKIYXVSRVvWjGzRSZsvJZsdfcS7T
7lZ9sf6QGekT9bAi6BIFsG7Z1bFLb6Q6AeTsX04++dLBCrjm96CSyisBswY5J2qg
zD0WrK6IOsSn9ljlIZRGSTtP+tdM5mOi/DHdeEd+glGx5YKQ9t9yq++oayoqb9mp
hPtsBo6UwcMaylPA2vQnhVi0q2bl9FMa+QGpaWBe6YfXLPF4PWhET2OixkRat1w=
=ncT/
-----END PGP SIGNATURE-----

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Craig Ringer
craig@2ndquadrant.com
In reply to: Dean Rasheed (#1)
Re: MD5 aggregate

On 06/13/2013 05:35 PM, Dean Rasheed wrote:

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

That's a very useful thing to be able to do, but I'm hesitant to make
the fact that it uses md5 too prominent in the name if it doesn't
produce a result that an external user could reasonably expect from
md5'ing the same data.

I imagine having an md5_agg(text) and md5(bytea) that was the more
efficient, streaming equivalent of:

md5(string_agg(the_col,''))

would be rather handy.

It'd be less useful for other types (floats, integers, etc) unless we
had a way to get the binary representations of those in a well defined
form, like int8le(1) . Casting to 'text' would be sufficient for most of
the purposes I can imagine, though, and for those that it wouldn't
things would quickly get so complicated that you'd want to be using a
PL/something function anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#1)
Re: MD5 aggregate

On 13 June 2013 10:35, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

2). Produce an ordered MD5 sum compatible with COPY, whose result
would match that of running unix md5sum on the COPY output. Given all
the possible COPY options that would affect the result (format,
delimiters, headers, quoting, escaping, ...), I think that such a
thing would only reasonably be possible as an extension to the COPY
command itself.

I guess in its simplest form this would just be a new option "MD5" to
COPY that would cause it to pipe its output to the md5 aggregator and
then send the final sum to the COPY destination at the end (e.g.,
"COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data
in foo).

I still think my original md5_agg() has its uses, since what it
produces is comparable with external md5 sums, and is directly
available to SQL, but (1) is probably the most useful for quickly
comparing 2 tables. I'm much less convinced about the value of (2),
but on the face of it, it doesn't seem like it would be hard to
implement.

Thoughts?

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#17)
Re: MD5 aggregate

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Attachments:

md5_agg_v2.patchapplication/octet-stream; name=md5_agg_v2.patchDownload+803-235
md5-100m-row-test.sqlapplication/octet-stream; name=md5-100m-row-test.sqlDownload
#19Marko Kreen
markokr@gmail.com
In reply to: Dean Rasheed (#18)
Re: MD5 aggregate

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
bigger than available RAM). If you want to promote such use you should
also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints. Just let it
happen. Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
or SpookyHash. You can get C implementation from pghashlib.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20David Fetter
david@fetter.org
In reply to: Dean Rasheed (#18)
Review [was Re: MD5 aggregate]

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Submission review:

Is the patch in a patch format which has context? (eg: context diff format)

Yes.

Does it apply cleanly to the current git master?

Yes.

Does it include reasonable tests, necessary doc patches, etc?

Yes.

Usability review:

Does the patch actually implement that?

Yes.

Do we want that?

Enough do.

Do we already have it?

No.

Does it follow SQL spec, or the community-agreed behavior?

No, and yes, respectively.

Does it include pg_dump support (if applicable)?

N/A

Are there dangers?

Patch changes the MD5 implementation, which could
theoretically result in backward incompatibility if the
changes are not 100% backward-compatible.

Have all the bases been covered?

Yes.

Feature test:

Does the feature work as advertised?

Yes.

Are there corner cases the author has failed to consider?

Not that I've found so far.

Are there any assertion failures or crashes?

No.

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes. For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
shackle@postgres:5493=# \g
Time: 948.623 ms
shackle@postgres:5493=# \g
Time: 939.819 ms

and here from master + the patch:

WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 1129.178 ms
shackle@postgres:5494=# \g
Time: 1134.013 ms
shackle@postgres:5494=# \g
Time: 1124.387 ms
shackle@postgres:5494=# \g
Time: 1119.733 ms
shackle@postgres:5494=# \g
Time: 1104.906 ms
shackle@postgres:5494=# \g
Time: 1137.055 ms
shackle@postgres:5494=# \g
Time: 1128.977 ms
shackle@postgres:5494=# \g
Time: 1143.619 ms

Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

If it claims to improve performance, does it?

Possibly for the aggregate.

Does it slow down other things?

Not that I've found.

Coding review:

Does it follow the project coding guidelines?

Yes.

Are there portability issues?

Not that I can see.

Will it work on Windows/BSD etc?

Not yet tested.

Are the comments sufficient and accurate?

Yes.

Does it do what it says, correctly?

As far as I can tell.

Does it produce compiler warnings?

No.

Can you make it crash?

My efforts so far have failed.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21David Fetter
david@fetter.org
In reply to: David Fetter (#20)
#22Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: David Fetter (#21)
#23Noah Misch
noah@leadboat.com
In reply to: Dean Rasheed (#18)
#24Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Noah Misch (#23)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Dean Rasheed (#24)
#26Noah Misch
noah@leadboat.com
In reply to: Dean Rasheed (#24)
#27Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Noah Misch (#26)
#28Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Peter Eisentraut (#25)
#29Marko Kreen
markokr@gmail.com
In reply to: Dean Rasheed (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Marko Kreen (#29)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: Dean Rasheed (#27)
#32Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Peter Eisentraut (#31)