Using bytea field...

Started by Andre Lopesabout 15 years ago10 messagesgeneral
Jump to latest
#1Andre Lopes
lopes80andre@gmail.com

Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example if
the file already exists in the database, this is possible with bytea?

Best Regads,

#2Andy Colson
andy@squeakycode.net
In reply to: Andre Lopes (#1)
Re: Using bytea field...

On 3/8/2011 12:28 PM, Andre Lopes wrote:

Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example if
the file already exists in the database, this is possible with bytea?

Best Regads,

You dont need to use both base64 and bytea. You can store base64 in
text field... or just store the photo as-is into bytea.

To answer your question: it would be faster if you computed an md5 (or
sha or whatever) and stored it in the db, then you could check to see if
an image exists by searching for the md5, which would be way faster, an
send a lot less data over the wire.

-Andy

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Andre Lopes (#1)
Re: Using bytea field...

Not sure if it is possible directly but have you considered (or you might have to) generating an MD5 hash of the data (possibly after encoding) and then comparing the hashes? For a small image it may not matter but if you plan on making the check with any frequency (and multiple times against the same record) doing a one-time hash generation is going to be quite a bit more efficient.

Just make sure you know how you are going to keep the hash and the binary contents in-sync.

You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to opine on that particular option.

David J.

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andre Lopes
Sent: Tuesday, March 08, 2011 1:29 PM
To: postgresql Forums
Subject: [GENERAL] Using bytea field...

Hi,

I'm using a bytea field to store small images in base64. I need to know if I can compare bytea rows for equality. To know for example if the file already exists in the database, this is possible with bytea?

Best Regads,

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Josh Kupershmidt
schmiddy@gmail.com
In reply to: David G. Johnston (#3)
Re: Using bytea field...

On Tue, Mar 8, 2011 at 8:00 PM, David Johnston <polobo@yahoo.com> wrote:

You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to opine on that particular option.

Yes, a functional index on MD5(your_bytea_column) will work and is the
way to go for equality comparisons on bytea values. You can use the
built-in MD5() function or one of the various hash functions in
pgcrypto. You can also save some index size by only storing the
bytea-encoded md5 result in the index, something like:

CREATE INDEX bigcol_idx ON foo (decode(MD5(bigcol), 'hex'));

And then run formulate your queries similarly so they use the index:

test=# EXPLAIN ANALYZE SELECT * FROM foo
WHERE (decode(md5(bigcol), 'hex')) = (decode(md5('4'), 'hex'));

QUERY PLAN

-------------------------------------------------------------------------------
---------------------------------
Index Scan using bigcol_idx on foo (cost=0.00..8.28 rows=1 width=4) (actual t
ime=0.032..0.034 rows=1 loops=1)
Index Cond: (decode(md5(bigcol), 'hex'::text) = '\xa87ff679a2f3e71d9181a67b7
542122c'::bytea)
Total runtime: 0.095 ms
(3 rows)

Josh

#5Sim Zacks
sim@compulab.co.il
In reply to: Andy Colson (#2)
Re: Using bytea field...

MD5 is not collision resistant (using the immortal words of wikipedia
http://en.wikipedia.org/wiki/MD5).

This means that it is possible that multiple images will return the same
md5 hash.

The question is, if it screws up and says that an image already exists
and then returns a different image when querying for it, how bad would
that be.

I've seen a lot of discussions in the past few years about how
problematic the md5 approach is.

Sim

On 03/08/2011 09:06 PM, Andy Colson wrote:

Show quoted text

On 3/8/2011 12:28 PM, Andre Lopes wrote:

Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example if
the file already exists in the database, this is possible with bytea?

Best Regads,

You dont need to use both base64 and bytea. You can store base64 in
text field... or just store the photo as-is into bytea.

To answer your question: it would be faster if you computed an md5 (or
sha or whatever) and stored it in the db, then you could check to see
if an image exists by searching for the md5, which would be way
faster, an send a lot less data over the wire.

-Andy

#6Sim Zacks
sim@compulab.co.il
In reply to: Sim Zacks (#5)
Re: Using bytea field...

On 03/09/2011 01:27 PM, Sim Zacks wrote:

This means that it is possible that multiple images will return the
same md5 hash.

OTOH, if you had an indexed md5 hash and compared the image only to the
matches, that would be a fast and accurate querying method

#7Radosław Smogura
rsmogura@softperience.eu
In reply to: Sim Zacks (#5)
Re: Using bytea field...

On Wed, 09 Mar 2011 13:27:16 +0200, Sim Zacks wrote:

MD5 is not collision resistant (using the immortal words of wikipedia
http://en.wikipedia.org/wiki/MD5).

This means that it is possible that multiple images will return the
same md5 hash.

The question is, if it screws up and says that an image already
exists and then returns a different image when querying for it, how
bad would that be.

I've seen a lot of discussions in the past few years about how
problematic the md5 approach is.

Sim

On 03/08/2011 09:06 PM, Andy Colson wrote:

On 3/8/2011 12:28 PM, Andre Lopes wrote:

Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example
if
the file already exists in the database, this is possible with
bytea?

Best Regads,

You dont need to use both base64 and bytea. You can store base64 in
text field... or just store the photo as-is into bytea.

To answer your question: it would be faster if you computed an md5
(or sha or whatever) and stored it in the db, then you could check to
see if an image exists by searching for the md5, which would be way
faster, an send a lot less data over the wire.

-Andy

Every hash algorithm is not collision resistant, and in theory You need
to check equality of images. Hash function says that if hashes are
different, then images are different. It not says if hashes are same
then images are same.

Regards,
Radek

#8Andy Colson
andy@squeakycode.net
In reply to: Sim Zacks (#5)
Re: Using bytea field...

On 03/08/2011 09:06 PM, Andy Colson wrote:

On 3/8/2011 12:28 PM, Andre Lopes wrote:

Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example if
the file already exists in the database, this is possible with bytea?

Best Regads,

You dont need to use both base64 and bytea. You can store base64 in
text field... or just store the photo as-is into bytea.

To answer your question: it would be faster if you computed an md5 (or
sha or whatever) and stored it in the db, then you could check to see
if an image exists by searching for the md5, which would be way
faster, an send a lot less data over the wire.

-Andy

On 3/9/2011 5:27 AM, Sim Zacks wrote:

MD5 is not collision resistant (using the immortal words of wikipedia
http://en.wikipedia.org/wiki/MD5).

This means that it is possible that multiple images will return the same
md5 hash.

The question is, if it screws up and says that an image already exists
and then returns a different image when querying for it, how bad would
that be.

I've seen a lot of discussions in the past few years about how
problematic the md5 approach is.

Sim

It'll never happen:

http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability

Sure you CAN go out of your way to generate collisions, but I'd bet
money you never see one from your setup.

The probability is extremely slim. And if thats too much of a chance,
use sha2, its mind numbingly slim.

If you were doing cryptography it would be a problem, yes, but not
checking file equality.

-Andy

#9Sim Zacks
sim@compulab.co.il
In reply to: Andy Colson (#8)
Re: Using bytea field...

The question is, if it screws up and says that an image already exists
and then returns a different image when querying for it, how bad would
that be.

It'll never happen:

http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability

Sure you CAN go out of your way to generate collisions, but I'd bet
money you never see one from your setup.

The probability is extremely slim. And if thats too much of a chance,
use sha2, its mind numbingly slim.

If you were doing cryptography it would be a problem, yes, but not
checking file equality.

-Andy

Never is a long time. The question that I asked is precisely: how much
money you would bet that you'll never hit a collision. It depends on the
use case. If you are talking about privacy issues, which can include
lawsuits, loss of reputation and/or damages, then I wouldn't take that
risk, even on sha2. Especially not with all the publicly available
documentation explaining why not to do it. If you are talking about a
minor inconvenience or professional pride because the wrong image showed
up, or the right image was never stored, then it may be worth the risk.

Sim

#10Merlin Moncure
mmoncure@gmail.com
In reply to: Sim Zacks (#9)
Re: Using bytea field...

On Thu, Mar 10, 2011 at 2:09 AM, Sim Zacks <sim@compulab.co.il> wrote:

The question is, if it screws up and says that an image already exists
and then returns a different image when querying for it, how bad would
that be.

It'll never happen:

http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability

Sure you CAN go out of your way to generate collisions, but I'd bet
money you never see one from your setup.

The probability is extremely slim.  And if thats too much of a chance,
use sha2, its mind numbingly slim.

If you were doing cryptography it would be a problem, yes, but not
checking file equality.

-Andy

Never is a long time. The question that I asked is precisely: how much money
you would bet that you'll never hit a collision. It depends on the use case.
If you are talking about privacy issues, which can include lawsuits, loss of
reputation and/or damages, then I wouldn't take that risk, even on sha2.
Especially not with all the publicly available documentation explaining why
not to do it.  If you are talking about a minor inconvenience or
professional pride because the wrong image showed up, or the right image was
never stored, then it may be worth the risk.

Regardless of the intended use, I would bet every dollar I've ever
made, will make, could borrow, beg steal, etc vs 1 of your dollars and
happily collect it when I won the bet. See here:
(http://en.wikipedia.org/wiki/Birthday_attack) and look at the table
of odds vs population size...your statement is not in line with
mathematical reality, and from a risk standpoint there is a large
number of things to be looking at before sha2 collision such as drive
bit error rates, spontaneous combustion, etc.

AFAIK, even sha1 collisions have never been found in the wild, and the
zfs deduplication system uses sha1 to deduplicate disk blocks, as does
bit torrent. In fact many computing systems you rely on make hash
safety assumptions weaker than sha2.

Schneier speculates that we may see a collision soon here:
http://blog.valerieaurora.org/2009/06/25/sha-1-collision-expected-within-a-year/.
A small number of duplicate accidental md5 hashes have been found in
the wild.

merlin