Drawbacks of using BYTEA for PK?
Are there any drawbacks of using BYTEA for PK compared to using a
primitive/atomic data types like INT/SERIAL? (like significant
performance hit, peculiar FK behaviour, etc).
I plan to use BYTEA for GUID (of course, temporarily I hope, until
PostgreSQL officially supports GUID data type), since it seems to be the
most convenient+compact compared to other data types currently
available. I use GUIDs for most PK columns.
--
dave
David Garamond wrote:
Are there any drawbacks of using BYTEA for PK compared to using a
primitive/atomic data types like INT/SERIAL? (like significant
performance hit, peculiar FK behaviour, etc).I plan to use BYTEA for GUID (of course, temporarily I hope, until
PostgreSQL officially supports GUID data type), since it seems to be
the most convenient+compact compared to other data types currently
available. I use GUIDs for most PK columns.
GUID? Isn't that really nothing more than an MD5 on a sequence?
SELECT (MD5(NEXTVAL('my_table_seq'))) AS my_guid;
Since 7.4 has the md5 function built-in, there's your support ;-)
Now just add that to your table's trigger and your good to go.
I think in MS products, they format the guid with dashes in the
style 8-4-4-4-12 but it still looks to me like a 32 character hex
string or a 16 byte (128 bit) value. You can choose to store the
value however you like, I'm not sure what would be optimal, but
bits are bits, right?
Dante
On Sunday 11 January 2004 22:05, D. Dante Lorenso wrote:
David Garamond wrote:
Are there any drawbacks of using BYTEA for PK compared to using a
primitive/atomic data types like INT/SERIAL? (like significant
performance hit, peculiar FK behaviour, etc).I plan to use BYTEA for GUID (of course, temporarily I hope, until
PostgreSQL officially supports GUID data type), since it seems to be
the most convenient+compact compared to other data types currently
available. I use GUIDs for most PK columns.GUID? Isn't that really nothing more than an MD5 on a sequence?
SELECT (MD5(NEXTVAL('my_table_seq'))) AS my_guid;
I think the point of a GUID is it's supposed to be unique across any number of
machines without requiring those machines to coordinate their use of GUID
values.
I think the typical approach is to use something like:
hash_fn( network_mac_address || other_hopefully_unique_constant ||
sequence_val )
and make sure that the probability of getting collisions is acceptably low.
ISTR a long discussion a year or two back on one of the lists, for those that
are interested.
--
Richard Huxton
Archonet Ltd
D. Dante Lorenso wrote:
GUID? Isn't that really nothing more than an MD5 on a sequence?
SELECT (MD5(NEXTVAL('my_table_seq'))) AS my_guid;
I know there are several algorithms to generate GUID, but this is
certainly inadequate :-) You need to make sure that the generated GUID
will be unique throughout cyberspace (or to be more precise, the GUID
should have a very very small chance of colliding with other people's
GUID). Even OID is not a good seed at all.
Perhaps I can make a GUID by MD5( two random numbers || a timestamp || a
unique seed like MD5 of '/sbin/ifconfig' output)...
Since 7.4 has the md5 function built-in, there's your support ;-)
Well, until there's a GUID or INT128 or BIGBIGINT builtin type I doubt
many people will regard PostgreSQL as fully supporting GUID. I believe
there's the pguuid project in GBorg site that does something like this.
--
dave
David Garamond <lists@zara.6.isreserved.com> writes:
Perhaps I can make a GUID by MD5( two random numbers || a timestamp || a
unique seed like MD5 of '/sbin/ifconfig' output)...
Adding an MD5 hash contributes *absolutely zero*, except waste of space,
to any attempt to make a GUID. The hash will add no uniqueness that was
not there before.
regards, tom lane
Tom Lane wrote:
David Garamond <lists@zara.6.isreserved.com> writes:
Perhaps I can make a GUID by MD5( two random numbers || a timestamp || a
unique seed like MD5 of '/sbin/ifconfig' output)...Adding an MD5 hash contributes *absolutely zero*, except waste of space,
to any attempt to make a GUID. The hash will add no uniqueness that was
not there before.
Of course, in the above case, MD5 is used to compress the "uniqueness"
(which should be more than 128-bit, comprised of: a) [good] random
number; b) timestamp; c) a "node ID" element, either from /sbin/config
output which contain MAC address, or from the hash of harddisk content,
etc) into a 128-bit space.
--
dave
Tom Lane wrote:
Adding an MD5 hash contributes *absolutely zero*, except waste of space,
to any attempt to make a GUID. The hash will add no uniqueness that was
not there before.
The cool thing about a 'GUID' (or in my example a hashed sequence number
[sure
toss in some entropy if you want it]) is that if you happen to reference
that
value as a primary key on a table, the URL that passes the argument can not
be guessed at easily. For example using a sequence:
http://domain.com/application/load_record.html?customer_id=12345
Then, users of the web will assume that you have at most 12345
customers. And
they can try to look up information on other customers by doing:
http://domain.com/application/load_record.html?customer_id=12346
http://domain.com/application/load_record.html?customer_id=12344
...basically walking the sequence. Sure, you will protect against this with
access rights, BUT...seeing the sequence is a risk and not something you
want
to happen. NOW, if you use a GUID:
http://domain.com/application/load_record.html?customer_id=f46d6296-5362-2526-42e3-1b8ce9dcccc1
Right, so now try to guess the next value in this sequence. It's a little
more protective and obfuscated (an advantage in using GUIDs).
Dante
On Mon, 12 Jan 2004, D. Dante Lorenso wrote:
Tom Lane wrote:
Adding an MD5 hash contributes *absolutely zero*, except waste of space,
to any attempt to make a GUID. The hash will add no uniqueness that was
not there before.The cool thing about a 'GUID' (or in my example a hashed sequence number
[sure
toss in some entropy if you want it]) is that if you happen to reference
that
value as a primary key on a table, the URL that passes the argument can not
be guessed at easily. For example using a sequence:http://domain.com/application/load_record.html?customer_id=12345
Then, users of the web will assume that you have at most 12345
customers. And
they can try to look up information on other customers by doing:http://domain.com/application/load_record.html?customer_id=12346
http://domain.com/application/load_record.html?customer_id=12344...basically walking the sequence. Sure, you will protect against this with
access rights, BUT...seeing the sequence is a risk and not something you
want
to happen. NOW, if you use a GUID:
Security != obscurity.
While using GUIDs may make it harder to get hacked, it in no way actually
increases security. Real security comes from secure code, period.
"scott.marlowe" <scott.marlowe@ihs.com> writes:
they can try to look up information on other customers by doing:
http://domain.com/application/load_record.html?customer_id=12346
http://domain.com/application/load_record.html?customer_id=12344...basically walking the sequence. Sure, you will protect against this with
access rights, BUT...seeing the sequence is a risk and not something you
want
to happen. NOW, if you use a GUID:Security != obscurity.
While using GUIDs may make it harder to get hacked, it in no way actually
increases security. Real security comes from secure code, period.
Well, uh, you're both wrong.
On the one hand if your GUIDs are just an MD5 of a sequence then they're just
as guessable as the sequence. The attacker can try MD5 of various numbers
until he finds the one he is (it's probably on the web site somewhere anyways)
and then run MD5 himself on whatever number he feels.
On the other hand it is possible to do this right. Include a secret of some
kind in the MD5 hash, something that's not publically available. That secret
is in essence the password to the scheme. Now it's not really "obscurity" any
more than any password based scheme is "security through obscurity".
However even that isn't ideal, since you have to be able to change the
password periodically in case it's leaked. I believe there are techniques to
solve this though I can' think of any off the top of my head.
But if your only threat model is people attacking based on the publicly
visible information then an MD5 of the combination of a sequence and a secret
is a perfectly reasonable approach.
In the past I happily exposed the sequence but used an MD5 of the sequence and
a secret as a protection against spoofing. I find exposing the sequence is
very convenient for programming and debugging problems. Spoofing is a serious
security hazard, but worrying about leaking information like the size of the
customer database is usually a sign of people hoping for security through
obscurity.
--
greg
Greg Stark wrote:
On the other hand it is possible to do this right. Include a secret of some
kind in the MD5 hash, something that's not publically available. That secret
is in essence the password to the scheme. Now it's not really "obscurity" any
more than any password based scheme is "security through obscurity".However even that isn't ideal, since you have to be able to change the
password periodically in case it's leaked. I believe there are techniques to
solve this though I can' think of any off the top of my head.But if your only threat model is people attacking based on the publicly
visible information then an MD5 of the combination of a sequence and a secret
is a perfectly reasonable approach.
We're originally talking about using MD5 as a means to generate unique
ID right (and not to store password hash to be checked against later)?
Then this "secret key" is unnecessary. Just get some truly random bits
(if the number of bits is 128, then you can use it as it is. If the
number of bits is > 128, you can hash it using MD5 to get 128 bit. If
the number of bits is < 128, you're "screwed" anyway :-)
--
dave
Quoting Greg Stark <gsstark@mit.edu>:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
they can try to look up information on other customers by doing:
http://domain.com/application/load_record.html?customer_id=12346
http://domain.com/application/load_record.html?customer_id=12344...basically walking the sequence. Sure, you will protect against this
with
access rights, BUT...seeing the sequence is a risk and not something you
want
to happen. NOW, if you use a GUID:Security != obscurity.
While using GUIDs may make it harder to get hacked, it in no way actually
increases security. Real security comes from secure code, period.Well, uh, you're both wrong.
On the one hand if your GUIDs are just an MD5 of a sequence then they're
just
as guessable as the sequence. The attacker can try MD5 of various numbers
until he finds the one he is (it's probably on the web site somewhere
anyways)
and then run MD5 himself on whatever number he feels.On the other hand it is possible to do this right. Include a secret of some
kind in the MD5 hash, something that's not publically available. That secret
is in essence the password to the scheme. Now it's not really "obscurity"
any
more than any password based scheme is "security through obscurity".However even that isn't ideal, since you have to be able to change the
password periodically in case it's leaked. I believe there are techniques to
solve this though I can' think of any off the top of my head.But if your only threat model is people attacking based on the publicly
visible information then an MD5 of the combination of a sequence and a
secret
is a perfectly reasonable approach.In the past I happily exposed the sequence but used an MD5 of the sequence
and
a secret as a protection against spoofing. I find exposing the sequence is
very convenient for programming and debugging problems. Spoofing is a
serious
security hazard, but worrying about leaking information like the size of the
customer database is usually a sign of people hoping for security through
obscurity.--
greg---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
Its not a question of right or wrong. Its the method. One thing I see here is
a failing to use several security methods at different layers. That really is
necessary for a production environment. If you want customer id's kept private,
then you need a private connection or to not expose them. Using an MD5 hash to
"hide" them will slow your app down by some delta and not protect your
connection. Granted garbling that id with a password is somewhat more secure
but your connection could still be attacked or even hijacked.
In the URL's you gave above, why are you not using HTTPS (i.e. authentication)?
What about using a crytographic cookies to identify your session and link that
to you userid (after authorization)?
'Just seems like you're not using the right tool (method) for the job here.
$-0.02
--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com
____________________________________
This email account is being host by:
VCSN, Inc : http://vcsn.com
David Garamond wrote:
Perhaps I can make a GUID by MD5( two random numbers || a timestamp || a
unique seed like MD5 of '/sbin/ifconfig' output)...
As long as you don't use RFC1918 addresses, the IPv4 address(es) of the
host should be unique for the Internet. Append/prepend a 32 bit
timestamp and you have a 64bit unique identifier that is "universally"
unique (to one second).
Greg Stark wrote:
... worrying about leaking information like the size of the
customer database is usually a sign of people hoping for security through
obscurity.
To prevent the size of your database being guessed at from the serial
numbers of your customers' accounts, don't issue the numbers sequentially.
One simplistic method of non-sequential assignment is: generate a random
number between "00...00" and "99...99"*, check if it's already in use -
if not, issue it, if so, regenerate. When presenting the number, always
format it as an N-digit number with leading zeroes - for Perl
programmers, this would be achieved along the lines of printf("%010d",
$account_number)
Thus you will end up with customer numbers evenly spread over the number
space. This will prevent people inferring the size of your database (or
company) through the account numbers they observe.
To protect the customer's account from being accessed by unauthorised
persons, use form-based password access (not HTTP basic**) and/or X.509
certificates over a secure connection.
As Scotty says, "use the right tool for the right job!"
HTH
Alex Satrapa
*make the number space much larger than your expected number of
accounts. This reduces collisions in random number generation. Another
option is to increment through the number space in the event of a
collision, rather than generating another random number.
**using form-based access, the user can log out when leaving the
terminal. Using HTTP basic, the browser is likely to remember their
login for the entire session, and sometimes even between sessions.
On Tue, Jan 13, 2004 at 10:15:47AM +1100, Alex Satrapa wrote:
**using form-based access, the user can log out when leaving the
terminal. Using HTTP basic, the browser is likely to remember their
login for the entire session, and sometimes even between sessions.
You can persuade the browser to forget the password just by sending it
a 401. Unfortunately, the user then has to know to hit 'cancel' on the
resulting dialog box.
they can try to look up information on other customers by doing:
http://domain.com/application/load_record.html?customer_id=12346
http://domain.com/application/load_record.html?customer_id=12344...basically walking the sequence. Sure, you will protect against this
to happen. NOW, if you use a GUID:
Security != obscurity.
While using GUIDs may make it harder to get hacked, it in no way actually
increases security. Real security comes from secure code, period.Well, uh, you're both wrong.
On the one hand if your GUIDs are just an MD5 of a sequence then they're
just as guessable as the sequence.Its not a question of right or wrong. Its the method. One thing I see here is
a failing to use several security methods at different layers....why are you not using HTTPS (i.e. authentication)?
What about using a crytographic cookies to identify your session and link that
to you userid (after authorization)?
Ok, my point is not one of security as much as the obscurity. I have the
security aspect already covered whereby I only select the customer
record from
the database where the logged in account has access to the record. So, if
you are not the admin or the actual customer, the select will return a code
indicating that you do not have permission to view the given record.
Maybe a better example of my problem is with records throughout the system
like invoices, customer data, etc... If any of these items use a sequence
and that sequence is global to the table in the database and the number is
exposed externally, then it is possible to infer the success of the company
underneath, is it not?
For instance, if I generate sequential numbers for invoice ids and the
customer
sees #123 as an invoice number one month and sees #128 the next month,
it might
imply that there are only 4 customers getting invoiced each month.
Another example ... let's say customers can create 'Widgets' in their
account.
There might be a page that lists all their 'widgets'. If you click on the
widget, you can edit it. A link to do this might look as follows:
http://.../account/widget_list.html
http://.../account/widget_edit.html?widget_id=12345
Well, if the widget_id is a sequence (global to the widget table), then
by creating
one widget, customer would get widget id (WIDG_1) and another widget
(WIDG_2),
the customer could see that the widget_id increased by only an amount of
N = WIDG_2 - WIDG_1
and would therefore provide the assumption that the number of customers
creating
widgets in total does not exceed N. I don't see this as much of a
problem about
'security' in the respect of who can access the data as much as who can make
conclusions about the company beind the data.
See what I mean? What do you propose as the best solution for this?
Not expose
the sequences to the user and use user-enumerated ids? Then a trigger
on the
table would assign ids like:
SELECT (MAX(widget_id)+1) INTO NEW.widget_id
WHERE cust_id = NEW.cust_id;
But I think after several hundred customer records, this trigger would start
getting slow. I don't know really, any ideas?
Dante
"D. Dante Lorenso" <dante@lorenso.com> writes:
Maybe a better example of my problem is with records throughout the system
like invoices, customer data, etc... If any of these items use a sequence
and that sequence is global to the table in the database and the number is
exposed externally, then it is possible to infer the success of the company
underneath, is it not?
Except that's exactly the way business has always been done. Though people
usually start new accounts with check# 50000 or something like that for
precisely that reason. But it's still pretty transparent, and they don't
really worry about it too much.
What you're saying is fundamentally valid, but I tend to think these kinds of
concerns are just generically overblown.
My only comment was that just taking an MD5 of the sequence gives you no
security. At the very least you have to include a secret. Even then I suspect
there are further subtle cryptographic issues. There always are.
--
greg
----- Original Message -----
From: "Alex Satrapa" <alex@lintelsys.com.au>
As long as you don't use RFC1918 addresses, the IPv4 address(es) of the
host should be unique for the Internet. Append/prepend a 32 bit
timestamp and you have a 64bit unique identifier that is "universally"
unique (to one second).
Aarrgh... So if you have 2 inserts in the same second, you have key
collision? Why not append a sequence to that so you have: Unique address
|| timestamp || sequence value. In a case such as this I can see why you
might want to use md5() to hash that value.
Best Wishes,
Chris Travers
Answers inline.
----- Original Message -----
From: "Greg Stark" <gsstark@mit.edu>
On the one hand if your GUIDs are just an MD5 of a sequence then they're
just
as guessable as the sequence. The attacker can try MD5 of various numbers
until he finds the one he is (it's probably on the web site somewhere
anyways)
and then run MD5 himself on whatever number he feels.
On the other hand it is possible to do this right. Include a secret of
some
kind in the MD5 hash, something that's not publically available. That
secret
is in essence the password to the scheme. Now it's not really "obscurity"
any
more than any password based scheme is "security through obscurity".
You still have the following problem: the PK is not really used for very
much in this case except referencing data. This is done internally
(invoices, etc), so the application is presumed to know the ID when looking
up a customer. Nothing you do will prevent any attack based on searching
the database, i.e. select customer_id from customers; if such an attack is
possible in an application. I actually think that developers should enforce
security as far back (towards the database) as possible, so if this needs to
be prevented, using a view which only provides access to the customers
required is the preferred solution. You could also use triggers.
If, however, you want a global unique id which will never collide with any
other records (f. ex. for distributed server solutions), then you have
another problem-- MD5 is NOT guaranteed to be unique. Think about it-- if
the return digest is of a set length, then there must be many different
values which will create that same digest. Instead MD5 is designed to
prevent deliberate duplication, which is not what we are talking about here
(accidental duplication) and so you may want to be cautious about hashing
your keys. In this case, a more open, transparent key would be better. For
example:
machine identifier || sequence.
You *could* hash these, but it is unnecessary and may actually create
collisions if the machine identifier is sufficiently large. However,
mac_address || ipv4 address should be sufficient, I would think. It would
still be attackable in your view, so you could add a timestamp :-) but
again, I see limited utility of guids as a security feature.
Best Wishes,
Chris Travers
From: "Keith C. Perry" <netadmin@vcsn.com>
Using an MD5 hash to
"hide" them will slow your app down by some delta and not protect your
connection. Granted garbling that id with a password is somewhat more
secure
but your connection could still be attacked or even hijacked.
In the URL's you gave above, why are you not using HTTPS (i.e.
authentication)?
What about using a crytographic cookies to identify your session and link
that
to you userid (after authorization)?
Https I can see. I am having difficulty understanding how you could use
cryptographic cookies to prevent session hijacking though given the current
setup. Also you could use ssl between the web server and PostgreSQL to
secure that connection.
As a side question: Does PostgreSQL support using Kerberos for encrypted
connections (beyond authentication), or do you need to use SSL for that?
Best Wishes,
Chris Travers
Sounds to me you have concerns more along the lines of counterintelligence.
Maybe a better example of my problem is with records throughout the system
like invoices, customer data, etc... If any of these items use a sequence
and that sequence is global to the table in the database and the number is
exposed externally, then it is possible to infer the success of the
company
underneath, is it not?
IMO, the solution here is to start your sequences at an arbitrary value
(preferably not round) such as 1543691. Therefore the first customer
doesn't know that you don't have 1.5M other customers :-) This could be
calculated for each sequence with a formula such as
SELECT (random() * 1000000 + 1000000)::bigint;
For instance, if I generate sequential numbers for invoice ids and the
customer
sees #123 as an invoice number one month and sees #128 the next month,
it might
imply that there are only 4 customers getting invoiced each month.
Another solution I have seen is to use a formula for your invoices based on:
Letter key for invoice type followed by YYYYMMDD followed by a numeric
sequence. This also helps to obscure things since the customer may not know
how often you reset the sequence (could be every month, or every day). The
letter key can uniquely identify your server on your network thereby
creating a GUID. In other words your sequence need only be unique to a
given time frame. You could even add a timestamp and a sequence that wraps
around after 9 :-) That way as long as you don't create 10 invoices in the
same second you are OK.
http://.../account/widget_list.html
http://.../account/widget_edit.html?widget_id=12345
Provided that each customer is only creating one widget at a time, you could
then take the customer_id and append to it a value of a customer-specific
sequence. You could even have this as a compound primary key. That way,
each customer can only determine how many widgets they have created :-)
See what I mean? What do you propose as the best solution for this?
Create GUIDS which contain only the information you want. No need to hash.
See above for examples.
Best Wishes,
Chris Travers