md5 of table

Started by Sim Zacksover 14 years ago16 messagesgeneral
Jump to latest
#1Sim Zacks
sim@compulab.co.il

Is there a way to get an md5 or other hash of an entire table?

I want to be able to easily compare 2 tables in different databases.

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the butt.

If I could return an md5 of the entire table, then I could check if the
tables have the same hash and be confident enough that the tables were
identical.

Thanks
Sim

#2Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Sim Zacks (#1)
Re: md5 of table

On Thu, Sep 01, 2011 at 11:47:24AM +0300, Sim Zacks wrote:

Is there a way to get an md5 or other hash of an entire table?

I want to be able to easily compare 2 tables in different databases.

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the
butt.

If I could return an md5 of the entire table, then I could check if
the tables have the same hash and be confident enough that the tables
were identical.

One option might be to pg_dump in an appropriate format and
md5-compare the output ?

Another option might be to

- cross-check columns/column types
- query from information_schema
- compare row counts
- may need a lock
- compare table sizes
- may need vaccum ?

If all three match that may be good enough ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#3Achilleas Mantzios
achill@matrix.gatewaynet.com
In reply to: Karsten Hilbert (#2)
Re: md5 of table

md5 has size limitations, the second approach seems more practical.

Στις Thursday 01 September 2011 12:30:45 ο/η Karsten Hilbert έγραψε:

On Thu, Sep 01, 2011 at 11:47:24AM +0300, Sim Zacks wrote:

Is there a way to get an md5 or other hash of an entire table?

I want to be able to easily compare 2 tables in different databases.

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the
butt.

If I could return an md5 of the entire table, then I could check if
the tables have the same hash and be confident enough that the tables
were identical.

One option might be to pg_dump in an appropriate format and
md5-compare the output ?

Another option might be to

- cross-check columns/column types
- query from information_schema
- compare row counts
- may need a lock
- compare table sizes
- may need vaccum ?

If all three match that may be good enough ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

--
Achilleas Mantzios

#4Sim Zacks
sim@compulab.co.il
In reply to: Sim Zacks (#1)
Re: md5 of table

On 09/01/2011 12:26 PM, Pavel Stehule wrote:

Hello

postgres=# create table tt(a int, b varchar);
CREATE TABLE
postgres=# insert into tt values(10,'hello');
INSERT 0 1

postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from tt;
md5
----------------------------------
20a92a676f52699e613da1bb114bd6f0
(1 row)

Regards

Pavel Stehule

Would be perfect, but 8.2 can't cast a UDT to text.
ERROR: cannot cast type tt to text
LINE 1: select tt::text from tt

Thanks
Sim

#5Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Sim Zacks (#4)
Re: md5 of table

On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:

On 09/01/2011 12:26 PM, Pavel Stehule wrote:

Hello

postgres=# create table tt(a int, b varchar);
CREATE TABLE
postgres=# insert into tt values(10,'hello');
INSERT 0 1

postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from

I do that as well, but it might have questionable performance when
your table has 16M rows, and is 50GB +

--
GJ

#6Sim Zacks
sim@compulab.co.il
In reply to: Sim Zacks (#1)
Re: md5 of table

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html style="direction: ltr;">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<style>body p { margin-bottom: 0cm; margin-top: 0pt; } </style>
</head>
<body style="direction: ltr;"
bidimailui-detected-decoding-type="UTF-8" bgcolor="#ffffff"
text="#000000">
<br>
<blockquote
cite="mid:CAFj8pRCf0tsUQs307Xoj1SiPjbdm8yjhKQkn=X=bQGUAUUMpGA@mail.gmail.com"
type="cite">
<pre wrap="">
I am not sure if this will work, but you can try it

<a class="moz-txt-link-freetext" href="http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Cast_to_varchar&quot;&gt;http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Cast_to_varchar&lt;/a&gt;

Pavel
</pre>
</blockquote>
<br>
I appreciate your help, but UDTs don't have input/ouput functions
unless you define them manually and I need this for all of my
tables. <br>
<br>
Thanks<br>
Sim<br>
</body>
</html>

#7Sim Zacks
sim@compulab.co.il
In reply to: Grzegorz Jaśkiewicz (#5)
Re: md5 of table

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html style="direction: ltr;">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<style>body p { margin-bottom: 0cm; margin-top: 0pt; } </style>
</head>
<body style="direction: ltr;"
bidimailui-detected-decoding-type="UTF-8" bgcolor="#ffffff"
text="#000000">
On 09/01/2011 01:35 PM, Grzegorz Jaśkiewicz wrote:
<blockquote
cite="mid:CAJY59_h6TqWfh3NKxfDCq1Pb9hBcYGBqxb5h+czWcgLj4ETeGQ@mail.gmail.com"
type="cite">
<pre wrap="">On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <a class="moz-txt-link-rfc2396E" href="mailto:sim@compulab.co.il">&lt;sim@compulab.co.il&gt;</a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On 09/01/2011 12:26 PM, Pavel Stehule wrote:
</pre>
<blockquote type="cite">
<pre wrap="">
Hello

postgres=# create table tt(a int, b varchar);
CREATE TABLE
postgres=# insert into tt values(10,'hello');
INSERT 0 1

postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from
</pre>
</blockquote>
</blockquote>
<pre wrap="">I do that as well, but it might have questionable performance when
your table has 16M rows, and is 50GB +
</pre>
</blockquote>
<p>I don't need performance. This is for regression testing for a
new database version. I want to run my functions in the old db and
the new db and when it modifies a table, I want to be able to
check that the tables are the same.</p>
<p>Sim<br>
</p>
</body>
</html>

#8Scott Marlowe
scott.marlowe@gmail.com
In reply to: Achilleas Mantzios (#3)
Re: md5 of table

On Thu, Sep 1, 2011 at 3:48 AM, Achilleas Mantzios
<achill@matrix.gatewaynet.com> wrote:

md5 has size limitations, the second approach seems more practical.

Really? I was not aware of size limits of md5, what are they?

#9Achilleas Mantzios
achill@matrix.gatewaynet.com
In reply to: Scott Marlowe (#8)
Re: md5 of table

Στις Thursday 01 September 2011 15:50:21 ο/η Scott Marlowe έγραψε:

On Thu, Sep 1, 2011 at 3:48 AM, Achilleas Mantzios
<achill@matrix.gatewaynet.com> wrote:

md5 has size limitations, the second approach seems more practical.

Really? I was not aware of size limits of md5, what are they?

sorry, i was wrong. i dont know why i had this impression,
just checked with a 43GB table on a freebsd machine and went fine.

--
Achilleas Mantzios

#10Merlin Moncure
mmoncure@gmail.com
In reply to: Grzegorz Jaśkiewicz (#5)
Re: md5 of table

2011/9/1 Grzegorz Jaśkiewicz <gryzman@gmail.com>:

On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:

On 09/01/2011 12:26 PM, Pavel Stehule wrote:

Hello

postgres=# create table tt(a int, b varchar);
CREATE TABLE
postgres=# insert into tt values(10,'hello');
INSERT 0 1

postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from

I do that as well, but it might have questionable performance when
your table has 16M rows, and is 50GB +

you need order by for that to work. I would do it like this:
select md5(array(select foo from foo order by foo_pkey)::text);

it's great quick'n'dirty, but not much scalable beyond millions.

OP:

I thought about using dblink and the EXCEPT query, but then I need to know the field list of each query result, which is a pain in the butt.

That is not correct. As long as the table definitions are precisely
the same, you can move records across dblink without specifying
fields. You do this by using record type for the composite which
dblink sends across as text.

merlin

#11Sim Zacks
sim@compulab.co.il
In reply to: Merlin Moncure (#10)
Re: md5 of table

OP:

I thought about using dblink and the EXCEPT query, but then I need to know the field list of each query result, which is a pain in the butt.

That is not correct. As long as the table definitions are precisely
the same, you can move records across dblink without specifying
fields. You do this by using record type for the composite which
dblink sends across as text.

merlin

Do you have a quick example? This is what I have tried:

select * from tbla
except
select * from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select * from tbla')

The error I get back is:
ERROR: a column definition list is required for functions returning
"record"

Sim

#12Robert Treat
xzilla@users.sourceforge.net
In reply to: Merlin Moncure (#10)
Re: md5 of table

2011/9/1 Merlin Moncure <mmoncure@gmail.com>:

2011/9/1 Grzegorz Jaśkiewicz <gryzman@gmail.com>:

On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:

On 09/01/2011 12:26 PM, Pavel Stehule wrote:

Hello

postgres=# create table tt(a int, b varchar);
CREATE TABLE
postgres=# insert into tt values(10,'hello');
INSERT 0 1

postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from

I do that as well, but it might have questionable performance when
your table has 16M rows, and is 50GB +

you need order by for that to work.  I would do it like this:
select md5(array(select foo from foo order by foo_pkey)::text);

it's great quick'n'dirty, but not much scalable beyond millions.

I've always liked doing this with my pager:

[robert@client-168] export PAGER=md5
-=[11:40:25 Thu Sep 01]=---=[ pagila-0.10.1 ]=-
[robert@client-168] psql -hlocalhost -dpagila
psql (9.0.4, server 9.1beta3)
WARNING: psql version 9.0, server version 9.1.
Some psql features might not work.
Type "help" for help.

pagila=# select * from actor order by actor_id;
f381ebdefe0aada9c0bc14e657962c1f

Robert Treat
conjecture: xzilla.net
consulting: omniti.com

#13Merlin Moncure
mmoncure@gmail.com
In reply to: Sim Zacks (#11)
Re: md5 of table

2011/9/1 Sim Zacks <sim@compulab.co.il>:

OP:

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the butt.

That is not correct.  As long as the table definitions are precisely
the same, you can move records across dblink without specifying
fields.  You do this by using record type for the composite which
dblink sends across as text.

merlin

Do you have a quick example? This is what I have tried:

select * from tbla
except
select * from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select * from tbla')

The error I get back is:
ERROR:  a column definition list is required for functions returning
"record"

sure:
select tbla from tbla
except
select t::tbla from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

there's a bunch of ways to do that -- you can also do the md5 on the
remote side so you can just send the digests.
select * from tbla
except
select (t::tbla).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

should also work.

This *might* work -- I didn't try. It's been a while since I've used
stock dblink.
select * from tbla
except
select (t).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla from tbla') R(t tbla);

merlin

#14Scott Marlowe
scott.marlowe@gmail.com
In reply to: Achilleas Mantzios (#9)
Re: md5 of table

On Thu, Sep 1, 2011 at 7:56 AM, Achilleas Mantzios
<achill@matrix.gatewaynet.com> wrote:

Στις Thursday 01 September 2011 15:50:21 ο/η Scott Marlowe έγραψε:

Really?  I was not aware of size limits of md5, what are they?

sorry, i was wrong. i dont know why i had this impression,
just checked with a 43GB table on a freebsd machine and went fine.

Well, it might have been an older version or a 32 bit version or
something you were thinking of. Sometimes a long memory helps us get
the wrong answer. :)

#15Vincent de Phily
vincent.dephily@mobile-devices.fr
In reply to: Sim Zacks (#1)
Re: md5 of table

On Thursday 01 September 2011 11:47:24 Sim Zacks wrote:

Is there a way to get an md5 or other hash of an entire table?

I want to be able to easily compare 2 tables in different databases.

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the butt.

If I could return an md5 of the entire table, then I could check if the
tables have the same hash and be confident enough that the tables were
identical.

Thanks
Sim

You might also want to take a look at

http://pgfoundry.org/projects/pg-comparator/

which can give a more nuanced view of db differences and tries to be smart
about performance. It looks a bit stale; I haven't used it in ages, but it
used to be a trusty part of our test suite.
--
Vincent de Phily

#16Sim Zacks
sim@compulab.co.il
In reply to: Merlin Moncure (#13)
Re: md5 of table

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the butt.

That is not correct. As long as the table definitions are precisely
the same, you can move records across dblink without specifying
fields. You do this by using record type for the composite which
dblink sends across as text.

merlin

Do you have a quick example? This is what I have tried:

select * from tbla
except
select * from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select * from tbla')

The error I get back is:
ERROR: a column definition list is required for functions returning
"record"

sure:
select tbla from tbla
except
select t::tbla from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

We tried something like that. Unfortunately, in 8.2 you can't cast a row
type as text.

there's a bunch of ways to do that -- you can also do the md5 on the
remote side so you can just send the digests.
select * from tbla
except
select (t::tbla).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

should also work.

This *might* work -- I didn't try. It's been a while since I've used
stock dblink.
select * from tbla
except
select (t).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla from tbla') R(t tbla);

merlin

This looks like it might work for us. At least I would only need the
table name for the field list instead of the entire column list.