Using varchar primary keys.

Started by Tim Uckun · about 13 years ago · 23 messages · general
#1Tim Uckun
timuckun@gmail.com

Consider the following scenario.

I have a typical tagging structure. There is a table called tags and a
table called taggings. The taggings table acts as a many-to-many join
table between the taggers and the tags.

The tags table has two fields, id and tag. id is a serial data type.

The taggings table has a tag_id field along with context, tagger_id, etc.

I don't think there is even one query in this scenario which does not join
the taggings table to the tags table, so I am wondering if there is any
advantage at all to having that id field in the tags table. Why shouldn't I
just put the tag itself as the primary key? The tags are quite short, so if
pg is smart about it I would guess they would take up less space than an
integer in some cases.
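For concreteness, the described schema might look something like this (a
sketch; the column types for context and tagger_id are assumptions):

```sql
CREATE TABLE tags (
    id  serial PRIMARY KEY,
    tag text NOT NULL UNIQUE
);

CREATE TABLE taggings (
    tag_id    integer NOT NULL REFERENCES tags (id),
    tagger_id integer NOT NULL,
    context   text
);
```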

I guess the same scenario happens in many of my lookup tables. I have all
these tables with just two fields in them: an id field and some varchar
field. Then every time I want to display a record I join with five
other tables so I can look up all the associated lookup values and display
the text value to the user.

Given that you can still set foreign keys and cascade commands to adjust
child records either way, the tables would be properly normalized, so I am
wondering what I am gaining by using these serial ID fields.

Is this false economy?

#2Misa Simic
misa.simic@gmail.com
In reply to: Tim Uckun (#1)
Re: Using varchar primary keys.

Hi Tim,

The "natural" vs. "surrogate" key debate has gone on for far too long,
with no official winner...

I think it is more a "religious" issue than a real one...

My advice is simple: pick the approach that best fits your needs, taking
into account both now and tomorrow (the probability of change)...

The SQL standard and the normal forms are fine with both religions... And
it really has very small impact on speed and storage size (the real
things), not worth mentioning.

Personally I could be counted in the "surrogate" religion, but in the
described scenario, with tables of just two columns (a pk and a unique
column), I would never suggest it, simply because from a practical,
common-sense view it doesn't make sense...

So for the described scenarios you should probably pick the "natural"
approach...

Kind Regards,

Misa

#3Modulok
modulok@gmail.com
In reply to: Tim Uckun (#1)
Re: Using varchar primary keys.

On 3/31/13, Tim Uckun <timuckun@gmail.com> wrote:

[...] Why shouldn't I just put the tag itself as the primary key? [...]
I am wondering what I am gaining by using these serial ID fields.

You can. Arguably, if the natural key is relatively short and consists of
only one column, E.F. Codd would probably have encouraged you to use it.
I'd keep the serial column for the convenience of identifying individual
records on the command line, especially if the varchar can store characters
that are not easily typed in a console.

As for the primary key being a 'natural' key vs. a 'synthetic' one,
that's a huge debate, akin to emacs vs. vi. Depending on your situation,
there's nothing wrong with using natural keys. It is discussed at length,
and in some cases encouraged, here:

http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

-Modulok-

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Gavan Schneider
pg-gts@snkmail.com
In reply to: Tim Uckun (#1)
Re: Using varchar primary keys.

On 1/4/13 at 10:35 AM, Tim Uckun wrote:

Consider the following scenario.

Since I don't consider myself an expert I am talking in basic
terms below. Please don't misunderstand this plodding as
anything other than me attempting to be as clear as possible.

These comments reflect my understanding of Joe Celko's view as
expressed in his books. I fully understand that many respectable
SQL masters (respectfully) disagree with Celko in this. My take
is to use his approach as a starting point and only deviate when
a different approach is obviously much easier for me. (I doubt
it's better. :). YMMV.

I have a typical tagging structure. There is a table called tags, there is
a table called taggings. The taggings table acts as a many to many join
table between the taggers and the tags.

The tags table has two fields id and tag. id is a serial data type.

Celko would suggest that a table's name is the plural of its primary
key, so a table named tags has tag as the name of its primary
key. And this is a big hint: the sequential index is not needed.
He further (very strongly) suggests that primary keys take
their natural values and not be referenced by arbitrary
(sequential) values, i.e., avoid pseudo-keys, as neither SQL nor
modern databases need them.

The taggings has a tag_id field along with the context, tagger_id etc.

So this would instead have two columns named "tag" and "tagger"
(each referencing the column of the same name in tables "tags"
and "taggers") and whatever is implied by the "etc". If the pair
(tag, tagger) is meant to be unique then it is a good candidate
for the primary key; otherwise further terms need to be dragged
into the composite key (e.g., a timestamp). Once again there
is no logical need for a sequential integer to be stored as well.
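A minimal sketch of that natural-key layout (assuming a timestamp as the
extra term; the taggers table is implied but not shown):

```sql
CREATE TABLE tags (
    tag text PRIMARY KEY
);

CREATE TABLE taggings (
    tag       text        NOT NULL REFERENCES tags,
    tagger    text        NOT NULL,  -- would reference taggers(tagger)
    tagged_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tag, tagger, tagged_at)
);
```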

....

Given that you can still set foreign keys and cascade commands to adjust
child records either way the tables would be properly normalized so I am
wondering what I am gaining by using these serial ID fields.

Is this false economy?

I think so.

If the tags table has both a sequential integer and the unique tag
value then there is likely storage and I/O associated with two
indexes, along with storage of both the tag and its associated integer.
If the list of tags is short enough it might be suitable to use
an ENUM to enforce both integrity and brevity.

If the list of tags is longer and/or might need updating then a
single-column table will allow for this. In general, I think
lists of things (e.g., tag values) should be kept within the DB
so they can be used for integrity checking, etc. I don't like
the idea of externalising this job to the interface application.
Also, as you have mentioned, you can elegantly handle table
renovation, e.g., if a given tag needs its name changed, it can
be done via a foreign key constraint with cascade. (Elegant here
refers to the syntax; the actual I/O implications may be very
ugly depending on the scale of data that needs rewriting. :)

In another situation the tags table might well contain columns
such as: tag, tag_full_name, tag_origin,
tag_authority_certificate/expiry/whatever. In such an instance
the tag is better if it's human readable/meaningful, so reports
can be useful without always joining back for the full name.
Sequential integers rarely fulfil this role, as implied by the
original question.

Regards
Gavan Schneider


#5Julian
tempura@internode.on.net
In reply to: Modulok (#3)
Re: Using varchar primary keys.

On 01/04/13 12:19, Modulok wrote:

On 3/31/13, Tim Uckun <timuckun@gmail.com> wrote:

[...]

You can. Arguably, if the natural key is relatively short and consists of only
one column E.F. Codd would probably have encourage you to use it. I'd leave the
serial column for convenience of identifying individual records on the command
line, especially if the varchar can store characters that are not easily typed
in a console.

As for the primary key being a 'natural key' vs. a 'synthetic' one,
that's a huge debate akin to emacs vs. vi. Depending on your situation, there's
nothing wrong with using natural keys. It is discussed at length and
in some cases encouraged here::

http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

-Modulok-

I disagree with almost everything in that blog post:

Shouldn't the first question be to figure out why even a bigserial
(int8) would not suffice as a unique identifier for row data? There are
reasons why it won't - what are they?
(There's a great blog post by a Twitter developer on the development of
their unique identifier (PK) - couldn't find it.)

Natural Keys have a purpose but when do they exist in the database?

For example, a Surrogate Key that is exposed to the world may get wrapped
in a check-bit algorithm, or perhaps even an HMAC-encoded key, to validate
the genuineness of incoming requests before an expensive database lookup
is initiated.

So is there a need to store the Natural Keys generated in the
application? It might be useful to external parties, though they may even
call it "foos_pks".

What about the Natural Keys of an external source? Should that be stored
in a lookup table along with the integer based Surrogate Key?
Maybe I'll call it "bars_pks".

As to the original question, my response may seem off topic, but
exposing PKs to the world has some pitfalls (and not all to do with
security), and character-based keys have pitfalls compared to
Surrogate Keys that stay out of the displayed data.

CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (parent_tag text REFERENCES tags);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

--Spelling error.

UPDATE tags SET tag = 'foo' WHERE tag = 'fu';

This will fail unless you declare the foreign key with ON UPDATE CASCADE.
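With the cascade in place, the same sequence succeeds and the child rows
follow the rename (a sketch of the corrected definition):

```sql
CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (
    parent_tag text REFERENCES tags ON UPDATE CASCADE
);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

-- Fixing the spelling error now propagates to child_tag.parent_tag.
UPDATE tags SET tag = 'foo' WHERE tag = 'fu';
```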

Some things to think about.

Regards,
Jules.


#6Tim Uckun
timuckun@gmail.com
In reply to: Julian (#5)
Re: Using varchar primary keys.

Natural Keys have a purpose but when do they exist in the database?

In my case it's the tags. Table tags has only two columns id and tag. Each
one has a unique index. I have many other similar lookup tables. For
example things like order_status, item_type etc.

What about the Natural Keys of an external source? Should that be stored
in a lookup table along with the integer based Surrogate Key?
Maybe I'll call it "bars_pks".

I always keep those. The API's I expose to clients and partners allow them
to modify the records they send and I let them refer to items by their own
primary keys. This has always worked out well for me.

--Spelling error.

UPDATE tags SET tag = 'foo' WHERE tag = 'fu';

This will fail unless you ON UPDATE CASCADE.

Yes, of course you'd need the ON UPDATE CASCADE. Then again, maybe I don't
even need that tags table. Tags could just be a view (select distinct tag
from child_tags). I am not saying that's efficient or desirable, but it's
possible.
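Something like this (a sketch, assuming a child_tags table holding the raw
tag strings):

```sql
CREATE VIEW tags AS
SELECT DISTINCT tag FROM child_tags;
```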

#7Jasen Betts
jasen@xnet.co.nz
In reply to: Tim Uckun (#1)
Re: Using varchar primary keys.

On 2013-04-01, Tim Uckun <timuckun@gmail.com> wrote:


Natural Keys have a purpose but when do they exist in the database?

In my case it's the tags. Table tags has only two columns id and tag. Each
one has a unique index. I have many other similar lookup tables. For
example things like order_status, item_type etc.

how about using an enum instead of this table?

--
⚂⚃ 100% natural


#8Tim Uckun
timuckun@gmail.com
In reply to: Jasen Betts (#7)
Re: Using varchar primary keys.

how about using an enum instead of this table?

That's an interesting idea. Are enums mutable?

#9Jasen Betts
jasen@xnet.co.nz
In reply to: Tim Uckun (#8)
Re: Using varchar primary keys.

On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:

how about using an enum instead of this table?

That's an interesting idea. Are enums mutable?

Since 9.1 you can add values.

http://www.postgresql.org/docs/9.1/static/sql-altertype.html
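For example (a sketch per the 9.1 ALTER TYPE docs; the type and value names
are hypothetical, and note that enum values cannot be dropped once added):

```sql
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

-- Since 9.1, values can be appended or placed relative to existing ones:
ALTER TYPE order_status ADD VALUE 'cancelled';
ALTER TYPE order_status ADD VALUE 'returned' AFTER 'delivered';
```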

--
⚂⚃ 100% natural


#10Merlin Moncure
mmoncure@gmail.com
In reply to: Jasen Betts (#9)
Re: Using varchar primary keys.

On Mon, Apr 1, 2013 at 1:22 AM, Jasen Betts <jasen@xnet.co.nz> wrote:

On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:

how about using an enum instead of this table?

That's an interesting idea. Are enums mutable?

since 9.1 you can add values.

http://www.postgresql.org/docs/9.1/static/sql-altertype.html

It's an interesting idea, but I don't think enums are designed to act
as a primary key except in cases where the data is basically static
and is relatively small. For starters, any manipulation of the enum
requires a lock.

enums can be a real life saver when you need custom ordering built
into a string, especially if that ordering is floated over a composite
index.

merlin


#11jesusthefrog
jesusthefrog@gmail.com
In reply to: Merlin Moncure (#10)
Re: Using varchar primary keys.

On the topic of 'natural' versus 'synthetic' primary keys, I am generally
in the camp that an extra ID field won't cost you too much, and while one
may not need it for a simple table (e.g. id, name), one might add any
number of columns later, and you'll be glad to have it.

I am, however, against using sequences (or serial integers in Postgres) for
reasons of scaling and replication across multiple copies of a database
running on different servers.

My preferred method is to give every table an ID column of UUID type and
generate a UUID using the uuid-ossp contrib module. This also prevents
someone not familiar with the database design from using an ID somewhere
they should not (as is possible with natural PKs) or treating the ID as an
integer, not an identifier (as is all too common with serial integers).
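A minimal sketch of that approach (the table and column names are
hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE dealers (
    id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    name text NOT NULL
);
```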


--
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GIT d+ s: a-- C++++(++) UL+++ P++++$ L+++>++++ E- W+ N o? K w--- O-
M- V? PS+++ PE(-) Y+ PGP->+++ t+>++ 5+++++ X(+) R>+ tv b+
DI++ D++>+++ G+ e* h! !r y**
------END GEEK CODE BLOCK------

#12Tim Uckun
timuckun@gmail.com
In reply to: jesusthefrog (#11)
Re: Using varchar primary keys.

On Tue, Apr 2, 2013 at 8:35 AM, jesusthefrog <jesusthefrog@gmail.com> wrote:

On the topic of 'natural' versus 'synthetic' primary keys, I am generally
in the camp that an extra ID field won't cost you too much, and while one
may not need it for a simple table (i.e. id, name) one might add any number
of columns later, and you'll be glad to have it.

Nothing prevents you from adding more columns if you use varchar primary
keys.

My preferred method is to give every table an ID column of UUID type and
generate a UUID using the uuid-ossp contrib module. This also prevents
someone not familiar with the database design from using an ID somewhere
they should not (as is possible with natural PKs) or treating the ID as an
integer, not an identifier (as is all too common with serial integers).

This would be a concern if you had multi-master writes. As far as I know
Postgres does not have a true multi-master replication system, so all the
writes have to happen on one server, right?

As for UUIDs, I use them sometimes, but I tend to also use one serial
column because when I am communicating with people it makes it so much
easier to say "dealer number X" than "dealer number SOME_HUGE_UUID". I
often have to talk to people about the data, and UUIDs make it very
difficult to communicate with humans.

#13Julian
tempura@internode.on.net
In reply to: jesusthefrog (#11)
Re: Using varchar primary keys.

On 02/04/13 06:35, jesusthefrog wrote:

On the topic of 'natural' versus 'synthetic' primary keys, I am
generally in the camp that an extra ID field won't cost you too much,
and while one may not need it for a simple table (i.e. id, name) one
might add any number of columns later, and you'll be glad to have it.

Definitely - a lookup table can be extended to hold all kinds of
statistics and translations on a "tag". I didn't look too deeply into
the actual usage (not sure if any real examples were given). But ON
UPDATE CASCADE is not a feature I would recommend in this context
(referencing a PK), if at all.

I am, however, against using sequences (or serial integers in Postgres)
for reasons of scaling and replication across multiple copies of a
database running on different servers.

My preferred method is to give every table an ID column of UUID type and
generate a UUID using the uuid-ossp contrib module. This also prevents
someone not familiar with the database design from using an ID somewhere
they should not (as is possible with natural PKs) or treating the ID as
an integer, not an identifier (as is all too common with serial integers).

It was Instagram's own implementation, but it mentions Twitter's
"Snowflake". An interesting read; it still only requires a bigint.
http://goo.gl/gRlXC

IMO uuid is convenient but large, and for all that data it doesn't hold
anything useful. I have been tempted, though.

Jules.


#14Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: jesusthefrog (#11)
Re: Using varchar primary keys.

On 02/04/13 08:35, jesusthefrog wrote:

[...]

I use synthetic primary keys. I want to minimise changes to the database
when the user, directly or because of some law change, changes a
'natural' value. Using synthetic primary keys minimises changes to a
database when a 'natural' value changes - if people's names are part of
many natural keys, then when people change their name (like when a woman
gets married), only one table needs to change. Likewise the customer
number the manager swore would never change: now they want to change
from a numeric key to an alphanumeric one.

Using 'natural' values for a primary key seems to be deliberately adding
potential time bombs. Almost as bad as the misguided idea that using
'sudo' is safer than the alternative when executing commands as root!

Cheers,
Gavin

#15Joe Van Dyk
joe@tanga.com
In reply to: Tim Uckun (#12)
Re: Using varchar primary keys.

On Mon, Apr 1, 2013 at 1:11 PM, Tim Uckun <timuckun@gmail.com> wrote:

[...]

This would be a concern if you had multi master writes . As far as I know
Postgres does not have a true multi master replication system so all the
writes have to happen on one server right?

As for UUIDs I use them sometimes but I tend to also use one serial column
because when I am communicating with people it makes it so much easier to
say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to
talk to people about the data and UUIDs make it very difficult to
communicate with humans.

I've been wishing for a smaller uuid type for a while. Say you need to
assign an Order #. Customers might be referencing the number, so you don't
want it to be too long. But you don't want Order #'s to be guessable, or
to have characters/digits transposed accidentally.

I've been using a unique text column with a default of random_characters(12)

CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
STABLE
AS $function$
SELECT array_to_string(array((
    SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
        FROM mod((random()*31)::int, 31)+1 FOR 1)
    FROM generate_series(1, $1))),'');
$function$;

This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
easily be mistyped or misread.

#16Merlin Moncure
mmoncure@gmail.com
In reply to: Joe Van Dyk (#15)
Re: Using varchar primary keys.

On Tue, Apr 2, 2013 at 10:34 AM, Joe Van Dyk <joe@tanga.com> wrote:

[...]

I've been wishing for a smaller uuid type for a while. Say you need to
assign a Order #. Customers might be referencing the number, so you don't
want it to be too long. But you don't want Order #'s to be guessable or have
the characters/digits be transposed accidently.

I've been using a unique text column with a default of random_characters(12)

CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
STABLE
AS $function$
SELECT array_to_string(array((
SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
FROM mod((random()*31)::int, 31)+1 FOR 1)
FROM generate_series(1, $1))),'');
$function$;

This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
easily be mistyped or misread.

This is pseudo-random and can be guessed, which is maybe dangerous
depending on circumstance. For a stronger random stream, go to
pgcrypto's gen_random_bytes(). Also, now you have to worry about
collisions - the whole point of uuid is to try and keep you from
having to deal with that.
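A sketch of that alternative (the helper name is hypothetical; unlike the
STABLE function above it is marked VOLATILE, it emits hex rather than a
restricted alphabet, and a UNIQUE constraint is still needed to catch
collisions):

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hex-encode n cryptographically random bytes from pgcrypto.
CREATE OR REPLACE FUNCTION random_token(n integer)
RETURNS text
LANGUAGE sql
VOLATILE
AS $$
  SELECT encode(gen_random_bytes($1), 'hex');
$$;
```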

My historical comments in this debate are noted. To summarize, I
strongly believe that natural keys are often (but not always) better.

merlin


#17Samantha Atkins
sjatkins@me.com
In reply to: Merlin Moncure (#16)
Re: Using varchar primary keys.

Natural keys are in user data space. Thus they are not guaranteed invariant and therefore cannot serve as persistent identity.

Also natural keys have the considerable defect of being of different type and arity per logical entity type. This means that very nice things like dynamic relationships and persistent collections of references to multiple types of things cannot be built. It also increases ORM and data cache complexity.

These are considerable weaknesses.

You cannot guess a uuid from, say, uuid.uuid4(). If you think you can, I look forward to seeing your proof.

- samantha

On Apr 2, 2013, at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Apr 2, 2013 at 10:34 AM, Joe Van Dyk <joe@tanga.com> wrote:

On Mon, Apr 1, 2013 at 1:11 PM, Tim Uckun <timuckun@gmail.com> wrote:

On Tue, Apr 2, 2013 at 8:35 AM, jesusthefrog <jesusthefrog@gmail.com>
wrote:

On the topic of 'natural' versus 'synthetic' primary keys, I am generally
in the camp that an extra ID field won't cost you too much, and while one
may not need it for a simple table (i.e. id, name) one might add any number
of columns later, and you'll be glad to have it.

Nothing prevents you from adding more columns if you use varchar primary
keys.

My preferred method is to give every table an ID column of UUID type and
generate a UUID using the uuid-ossp contrib module. This also prevents
someone not familiar with the database design from using an ID somewhere
they should not (as is possible with natural PKs) or treating the ID as an
integer, not an identifier (as is all too common with serial integers).

This would be a concern if you had multi-master writes. As far as I know
Postgres does not have a true multi-master replication system, so all the
writes have to happen on one server, right?

As for UUIDs I use them sometimes but I tend to also use one serial column
because when I am communicating with people it makes it so much easier to
say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to
talk to people about the data and UUIDs make it very difficult to
communicate with humans.

I've been wishing for a smaller uuid type for a while. Say you need to
assign an Order #. Customers might be referencing the number, so you don't
want it to be too long. But you don't want Order #'s to be guessable or have
the characters/digits be transposed accidentally.

I've been using a unique text column with a default of random_characters(12)

CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
VOLATILE -- uses random(), which is volatile, so this must not be marked STABLE
AS $function$
SELECT array_to_string(array((
SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
FROM mod((random()*31)::int, 31)+1 FOR 1)
FROM generate_series(1, $1))),'');
$function$;

This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
easily be mistyped or misread.

This is pseudo random and can be guessed, which is maybe dangerous
depending on circumstance. For stronger random stream go to
pgcrypto.gen_random_bytes(). Also, now you have to worry about
collisions -- the whole point of uuid is to try and keep you from
having to deal with that.

My historical comments in this debate are noted. To summarize, I
strongly believe that natural keys are often (but not always) better.

merlin

#18Merlin Moncure
mmoncure@gmail.com
In reply to: Samantha Atkins (#17)
Re: Using varchar primary keys.

On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:

Natural keys are in user data space. Thus they are not guaranteed invariant and therefore cannot serve as persistent identity.

That is true, but irrelevant in most real world cases. Also, nothing
is keeping you from using an extra marker if/when you need to provide
an invariant lookup.

Also natural keys have the considerable defect of being of different type and arity per logical entity type. This means that very nice things like dynamic relationships and persistent collections of references to multiple types of things cannot be built. It also increases ORM and data cache complexity.

OO evangelism.

These are considerable weaknesses.

You cannot guess a uuid from say uuid.uuid4(). If you think you can I look forward to seeing your proof.

I never claimed that. I said that postgresql random() can be guessed,
which it can, since it's based on lrand48.

merlin
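
Merlin's point (that a non-cryptographic generator is predictable once its
seed or state is known) can be illustrated with any seeded PRNG. A minimal
sketch in Python, using the stdlib random module as a stand-in, not
PostgreSQL's actual lrand48-based generator:

```python
import random

# Anyone who learns (or brute-forces) the seed can reproduce
# the entire "random" stream the server produced.
server = random.Random(12345)                      # the "server's" generator
ids = [server.randrange(10**6) for _ in range(3)]  # IDs handed out

attacker = random.Random(12345)                    # attacker recovers the seed
guessed = [attacker.randrange(10**6) for _ in range(3)]

assert ids == guessed  # identical streams: the IDs were never secret
```

The same logic applies to keys built from PostgreSQL's random(): the output
space may be large, but the entropy is only as big as the generator's state.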

#19Joe Van Dyk
joe@tanga.com
In reply to: Merlin Moncure (#16)
Re: Using varchar primary keys.

On Tue, Apr 2, 2013 at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Apr 2, 2013 at 10:34 AM, Joe Van Dyk <joe@tanga.com> wrote:

On Mon, Apr 1, 2013 at 1:11 PM, Tim Uckun <timuckun@gmail.com> wrote:

On Tue, Apr 2, 2013 at 8:35 AM, jesusthefrog <jesusthefrog@gmail.com>
wrote:

On the topic of 'natural' versus 'synthetic' primary keys, I am generally
in the camp that an extra ID field won't cost you too much, and while one
may not need it for a simple table (i.e. id, name) one might add any number
of columns later, and you'll be glad to have it.

Nothing prevents you from adding more columns if you use varchar primary
keys.

My preferred method is to give every table an ID column of UUID type and
generate a UUID using the uuid-ossp contrib module. This also prevents
someone not familiar with the database design from using an ID somewhere
they should not (as is possible with natural PKs) or treating the ID as an
integer, not an identifier (as is all too common with serial integers).

This would be a concern if you had multi-master writes. As far as I know
Postgres does not have a true multi-master replication system, so all the
writes have to happen on one server, right?

As for UUIDs I use them sometimes but I tend to also use one serial column
because when I am communicating with people it makes it so much easier to
say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to
talk to people about the data and UUIDs make it very difficult to
communicate with humans.

I've been wishing for a smaller uuid type for a while. Say you need to
assign an Order #. Customers might be referencing the number, so you don't
want it to be too long. But you don't want Order #'s to be guessable or
have the characters/digits be transposed accidentally.

I've been using a unique text column with a default of random_characters(12)

CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
STABLE
AS $function$
SELECT array_to_string(array((
SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
FROM mod((random()*31)::int, 31)+1 FOR 1)
FROM generate_series(1, $1))),'');
$function$;

This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
easily be mistyped or misread.

This is pseudo random and can be guessed, which is maybe dangerous
depending on circumstance. For stronger random stream go to
pgcrypto.gen_random_bytes(). Also, now you have to worry about
collisions -- the whole point of uuid is to try and keep you from
having to deal with that.

Right, but it's better than using serials as far as being guessable.

The probability of collisions is fairly low if you are using 12 or more
characters (with 30 possible characters). Not sure what the math is on the
probability of collisions (birthday problem) though... and you could have a
trigger that checked for the existence of a matching key before
inserts/updates.
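
For the curious, the birthday-problem math mentioned above is easy to
approximate: with N possible keys, issuing n of them collides with
probability roughly 1 - e^(-n^2 / 2N). A sketch (the alphabet size and
length are the thread's numbers, not exact for the SQL function above):

```python
import math

def collision_probability(n_keys, alphabet_size=30, length=12):
    """Approximate birthday-problem collision probability:
    p ~ 1 - exp(-n^2 / (2 * N)), where N = alphabet_size ** length."""
    space = alphabet_size ** length
    return 1 - math.exp(-(n_keys * n_keys) / (2 * space))

# 30^12 is about 5.3e17 possible 12-char keys:
print(collision_probability(10**6))  # a million keys: roughly one in a million
print(collision_probability(10**9))  # a billion keys: over 50% -- check for dupes
```

So a uniqueness constraint (or the trigger suggested above) is still worth
having once the table gets large.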

And UUIDs would be too long for lots of purposes where people are
working with the numbers, and where there might be external constraints on
how long the IDs can be.

An example use case:
https://www.tanga.com/deals/cd8f90c81a/oral-b-sensitive-clean-6-extra-soft-replacement-brush-heads

where "cd8f90c81a" is the 'uuid' for that product. It's non-guessable
enough, and I don't want to put a full UUID in the URL.
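
On the application side, the same kind of short slug can be drawn from a
cryptographically strong source, which addresses Merlin's guessability
concern. A sketch using Python's stdlib secrets module (function name and
default length are illustrative, not from the thread):

```python
import secrets

# Ambiguity-reduced alphabet, mirroring the SQL function's
# (no 0/O, 1/l/i, which are easily mistyped or misread).
ALPHABET = 'abcdefghjkmnpqrstuvwxyz23456789'

def random_slug(length=10):
    """Short, hard-to-guess identifier from a CSPRNG -- the
    application-side analogue of pgcrypto's gen_random_bytes()."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

print(random_slug())  # a 10-character URL-friendly slug
```

secrets.choice draws from os.urandom-backed entropy, so the stream cannot be
reproduced from a seed the way random()-based output can.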

My historical comments in this debate are noted. To summarize, I
strongly believe that natural keys are often (but not always) better.

#20Julian
tempura@internode.on.net
In reply to: Merlin Moncure (#18)
Re: Using varchar primary keys.

On 03/04/13 06:37, Merlin Moncure wrote:

On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:

Natural keys are in user data space. Thus they are not guaranteed invariant and therefore cannot serve as persistent identity.

Can't find Samantha's original post. I agree, but done right they can be
(persistent and unique).

That is true, but irrelevant in most real world cases. Also, nothing
is keeping you from using an extra marker if/when you need to provide
an invariant lookup.

Also natural keys have the considerable defect of being of different type and arity per logical entity type. This means that very nice things like dynamic relationships and persistent collections of references to multiple types of things cannot be built. It also increases ORM and data cache complexity.

OO evangelism.

These are considerable weaknesses.

The Google cache is filled with "foo vs bar", i.e. natural vs surrogate,
and it's amazing what you get; even "surrogate key" on Wikipedia comes
with an "its sources remain unclear because it lacks inline citations"
disclaimer.
I consider it pretty much a non-debate.

You cannot guess a uuid from say uuid.uuid4(). If you think you can I look forward to seeing your proof.

I never claimed that. I said that postgresql random() can be guessed,
which it can, since it's based on lrand48.

merlin

Trying to get back on topic with the original post.

I have the iso (?) country code table, I have no problem with the PK
being the char(2) country code.
These natural codes/keys are thought out, debated and eventually decided
by other people.
But I also don't have a problem with adding an integer (serial) column
and making that my PK, although that PK might be pretty useless to the
rest of the world.

So that's that: having to really think it out is probably a good sign
that you should stick to a surrogate unless you are really sure.
(Again, I don't advocate ON UPDATE CASCADE as a solution should you
change your mind.)

As to the whole natural vs surrogate/synthetic key debate, as I
mentioned in an earlier post I use them both. The question is when is it
worthy of a PK.

1)cust_id=123 (surrogate: PK)
vs
2)cust_id=1231 (natural: checkbit such as barcode data etc)
vs
3)cust_id=<natural: uuencoded binary that spits out "123" after being
decoded by the app>

For me, 2) is more likely to be a PK than 3), but it is entirely
possible that neither would be a PK.

Global/Universal unique identifiers, useful with replication and
database partitioning (in my instance)

1)cust_id=<uuid>

vs

2)cust_id=<shard_id>-<something_extra>-<cust_id>

1) will work, but 128 bits is a lot of random data that could be useful to
the app.
2) cust_id is not as universally unique, but if that was ever a problem
I could also wrap it in an encoded binary with a small change to the
app and no change to the database; now it resembles something truly random.

The difference is 2) is more useful and contains "routing" data.
These are all natural and exposed to the world. But the question still
remains are they worthy of being a PK within the database of origin?
So far the answer from me is "doesn't have to be" but everyone else
might think it is, they may even make it a PK.
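
Option 2) can be sketched roughly as follows; the field layout and helper
names are hypothetical, and the wrapping is plain base64 encoding (cosmetic
obfuscation, not encryption, matching the "encoded binary" idea above):

```python
import base64

def make_cust_id(shard_id: int, seq: int, extra: str = 'a') -> str:
    """Hypothetical composite key: routing data plus a per-shard sequence."""
    return f"{shard_id}-{extra}-{seq}"

def obfuscate(cust_id: str) -> str:
    """Wrap the readable key so it resembles something random externally."""
    return base64.urlsafe_b64encode(cust_id.encode()).decode().rstrip('=')

def deobfuscate(token: str) -> str:
    """Recover the readable key (restore the stripped base64 padding)."""
    pad = '=' * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + pad).decode()

cid = make_cust_id(7, 123)
token = obfuscate(cid)
assert deobfuscate(token) == cid  # round-trips; shard routing is recoverable
```

The database never sees the wrapped form, so the choice of PK inside the
database of origin stays independent of what gets exposed to the world.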

Regards,
Jules.

#21Gavan Schneider
pg-gts@snkmail.com
In reply to: Julian (#20)
#22Merlin Moncure
mmoncure@gmail.com
In reply to: Gavan Schneider (#21)
#23Jasen Betts
jasen@xnet.co.nz
In reply to: Tim Uckun (#1)