pervasiveness of surrogate (also called synthetic) keys

Started by Jim Irreralmost 15 years ago52 messagesgeneral
Jump to latest
#1Jim Irrer
irrer@umich.edu

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that many
situations benefit from them, but are they really as pervasive
as he claims?

Thanks,

- Jim

#2Andy Colson
andy@squeakycode.net
In reply to: Jim Irrer (#1)
Re: pervasiveness of surrogate (also called synthetic) keys

On 4/28/2011 12:29 PM, Jim Irrer wrote:

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that many
situations benefit from them, but are they really as pervasive
as he claims?

Thanks,

- Jim

I dont see how you could know unless you went to all the "large
installations" and asked. But since its a good idea, and you "should"
do it that way, and because I'm pessimistic, I'd say only 5% of RDB
users do it that way.

Oh! Joke: Why do DB Admins make better lovers? They use surrogates!

Anyway, I'm not a large install, but I use em. That's gotta count for
something.

Really, how could you count? Was there a poll someplace? Ask for some
data. Otherwise seems like BS to me.

-Andy

#3Rob Sargent
robjsargent@gmail.com
In reply to: Andy Colson (#2)
Re: pervasiveness of surrogate (also called synthetic) keys

On 04/28/2011 11:44 AM, Andy Colson wrote:

On 4/28/2011 12:29 PM, Jim Irrer wrote:

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that
many
situations benefit from them, but are they really as pervasive
as he claims?

Thanks,

- Jim

I dont see how you could know unless you went to all the "large
installations" and asked. But since its a good idea, and you "should" do
it that way, and because I'm pessimistic, I'd say only 5% of RDB users
do it that way.

Oh! Joke: Why do DB Admins make better lovers? They use surrogates!

Anyway, I'm not a large install, but I use em. That's gotta count for
something.

Really, how could you count? Was there a poll someplace? Ask for some
data. Otherwise seems like BS to me.

-Andy

Hm, I get the feeling that only the good folks at Hibernate seem to
think using a "natural key" is the _only_ way to go.

#4Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Jim Irrer (#1)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 01:29:31PM -0400, Jim Irrer wrote:

common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations.

94.68536% of all the claims I ever hear are obviously pulled out of
thin air.

What conclusion does your colleague want to draw from this
overwhelming (if perhaps statistically dubious) penetration? Surely
the argument doesn't conclude, "Therefore we should do that too?" I
seem to recall my mother making some remark about others jumping off
cliffs.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

#5Joshua D. Drake
jd@commandprompt.com
In reply to: Jim Irrer (#1)
Re: pervasiveness of surrogate (also called synthetic) keys

On 04/28/2011 10:29 AM, Jim Irrer wrote:

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that
many
situations benefit from them, but are they really as pervasive
as he claims?

Well there is no fact to back that up but, I will say that most toolkits
require the use of a synthetic key, rails, django etc....

JD

Thanks,

- Jim

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Developement
Organizers of the PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579

#6Scott Ribe
scott_ribe@elevated-dev.com
In reply to: Rob Sargent (#3)
Re: pervasiveness of surrogate (also called synthetic) keys

On Apr 28, 2011, at 11:53 AM, Rob Sargent wrote:

Hm, I get the feeling that only the good folks at Hibernate seem to think using a "natural key" is the _only_ way to go.

Well, natural keys are quite obviously the way to go, when they exist. The problem is, they usually don't really exist. What's usually proposed as a natural key, will upon further investigation, either not be guaranteed unique, or not guaranteed to be unchanging, or both.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice

#7Daniele Varrazzo
daniele.varrazzo@gmail.com
In reply to: Joshua D. Drake (#5)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 7:26 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

Well there is no fact to back that up but, I will say that most toolkits
require the use of a synthetic key, rails, django etc....

Usually such tools are born with surrogate keys only, because it's
easier, and either grow up developing natural keys (e.g. the Django
ORM, SQLAlchemy) or fade into uselessness (e.g. SQLObject). But this
speaks more about tools than about the merits of the natural keys: if
the toolkit doesn't support them it's seriously getting in the way in
this and probably in other matters too.

-- Daniele

#8Merlin Moncure
mmoncure@gmail.com
In reply to: Jim Irrer (#1)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations.  I agree that many
situations benefit from them, but are they really as pervasive
as he claims?

They are fairly pervasive, and increasingly so, which I find to be
really unfortunate. Personally I think rote use of surrogate keys is
terrible and leads to bad table designs, especially if you don't
identify the true natural key with, say, a unique constraint. This
pushes duplicate enforcement out of the database and into application
code, or (even worse) the user. What everyone misses in the database
is that natural keys force good database design...if you can't
discover one, you probably have a design problem. There are of
course exceptions, but they are limited.

The main/best case for surrogates is defense against cascading updates
-- obviously if a single record change can turn into 1000's of updates
statements you absolutely want to avoid that if you can. But in a
typical database this only applies to particular tables. 'One off'
updates are of course no problem. In a 'infinite performance' world,
this would still be an update, because of locks.

The performance case either way is complex. Generally, I find natural
keys to be faster all things considered but this is going to be
situationally dependent. Although integer indexes are tighter and
faster, natural keys give back by skipping entire joins, sorts, etc.
They also tend to have better tuple/page efficiency so that when
searching for a range of records you touch less pages. Essentially,
you get more traffic routed through a smaller number, albeit bigger,
indexes. Natural key tables also tend to be 'self documenting' which
is a nice property.

A lot of bad arguments made against natural keys are made, for example:
*) natural keys can change (so what? unless you are worried about cascades)
*) SSN are reused! (SSN is obviously not a complete key if you want to
identify a person)
*) most tables don't have unique natural keys (let's see em)
etc

merlin

#9David G. Johnston
david.g.johnston@gmail.com
In reply to: Andy Colson (#2)
Re: pervasiveness of surrogate (also called synthetic) keys

Any system that generates transactional data has to use some kind of
synthetic key. I guess you could rely upon some form of timestamp but from
a usability standpoint that is not really a good decision. Inventory also
requires synthetic keys - whether you decide what they are or someone else
does. A serial field is the common form of a synthetic key but any assigned
identifier can be defined in the same way. You can even argue that a field
that concatenates the parts of the natural key qualifies as synthetic.

To actually use the data in the database it is often helpful to be able to
uniquely identify a row using a single field.

My personal take is that normalization is important. Once you have
normalized you will find tables with compound natural primary keys. If
these tables are going to be queried in such a way as to return a single
record (WHERE clause) it is likely that a synthetic key would be beneficial.
If they are normally going to be used only with the context of other core
tables (basically JOIN ON clause) then it is less important to have a single
representative value.

Justifying anything simply based upon what other people do is superficial.

As for the colleague; the basic response is "what is your source". If they
are trying to convince you to include synthetic keys on specific tables ask
them to explain what specific benefit will be gained by doing so and let
them know that "you should always have a synthetic key" is not a helpful
response.

David J.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andy Colson
Sent: Thursday, April 28, 2011 1:44 PM
To: Jim Irrer
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] pervasiveness of surrogate (also called synthetic)
keys

On 4/28/2011 12:29 PM, Jim Irrer wrote:

A colleague of mine insists that using surrogate keys is the common
practice by an overwhelming margin in relational databases and that
they are used in 99 percent of large installations. I agree that many
situations benefit from them, but are they really as pervasive as he
claims?

Thanks,

- Jim

I dont see how you could know unless you went to all the "large
installations" and asked. But since its a good idea, and you "should"
do it that way, and because I'm pessimistic, I'd say only 5% of RDB users do
it that way.

Oh! Joke: Why do DB Admins make better lovers? They use surrogates!

Anyway, I'm not a large install, but I use em. That's gotta count for
something.

Really, how could you count? Was there a poll someplace? Ask for some
data. Otherwise seems like BS to me.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make
changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Merlin Moncure (#8)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 03:39:19PM -0500, Merlin Moncure wrote:

They are fairly pervasive, and increasingly so, which I find to be
really unfortunate. Personally I think rote use of surrogate keys is
terrible and leads to bad table designs, especially if you don't
identify the true natural key with, say, a unique constraint.

I was recently asked on this list why GNUmed uses all those
surrogate keys.

I should have added to my answer that we DO make *extensive*
use of all sorts of built-in constraints and custom triggers
to enforce "natural" keys. I must agree with a recent poster
that what appears to identify as a natural key often really
isn't or else becomes not so later on. It's vastly easier to
then deal with that by re-defining constraints without
having to touch primary keys.

This
pushes duplicate enforcement out of the database and into application
code, or (even worse) the user. What everyone misses in the database
is that natural keys force good database design...if you can't
discover one, you probably have a design problem.

Sure but that doesn't mean you need to actually *use*
natural keys as primary keys - enforce them with all sorts
of constraints, sure - but don't assume you properly figured
out the perfect schema the first time around.

I've so far found it *good* to have duplicate enforcement:

- the database enforces what we agree on the final data
*should* look like

- the UI tries to lure the user into entering "valid" data

Now, those ("duplicative") database constraints have saved
our butt more than once preventing faulty patient data to be
inserted into medical records.

A lot of bad arguments made against natural keys are made, for example:
*) natural keys can change (so what? unless you are worried about cascades)

I find it is not so much that they can change:

Sure, it doesn't matter whether a certain code reads
"C03EB21" or "C03EB22".

but rather that they tend to go non-unique whenever the
whims of meatspace decide it's now more convenient to allow
dupes:

Suddenly there must be two records with code "C03EB21".

*) SSN are reused! (SSN is obviously not a complete key if you want to
identify a person)
*) most tables don't have unique natural keys (let's see em)

Now, those two arguments are bogus, I agree.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#11Merlin Moncure
mmoncure@gmail.com
In reply to: Karsten Hilbert (#10)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 4:07 PM, Karsten Hilbert
<Karsten.Hilbert@gmx.net> wrote:

On Thu, Apr 28, 2011 at 03:39:19PM -0500, Merlin Moncure wrote:

They are fairly pervasive, and increasingly so, which I find to be
really unfortunate.  Personally I think rote use of surrogate keys is
terrible and leads to bad table designs, especially if you don't
identify the true natural key with, say, a unique constraint.

I was recently asked on this list why GNUmed uses all those
surrogate keys.

I should have added to my answer that we DO make *extensive*
use of all sorts of built-in constraints and custom triggers
to enforce "natural" keys. I must agree with a recent poster
that what appears to identify as a natural key often really
isn't or else becomes not so later on. It's vastly easier to
then deal with that by re-defining constraints without
having to touch primary keys.

This
pushes duplicate enforcement out of the database and into application
code, or (even worse) the user.  What everyone misses in the database
is that natural keys force good database design...if you can't
discover one, you probably have a design problem.

Sure but that doesn't mean you need to actually *use*
natural keys as primary keys - enforce them with all sorts
of constraints, sure - but don't assume you properly figured
out the perfect schema the first time around.

I've so far found it *good* to have duplicate enforcement:

- the database enforces what we agree on the final data
 *should* look like

- the UI tries to lure the user into entering "valid" data

Now, those ("duplicative") database constraints have saved
our butt more than once preventing faulty patient data to be
inserted into medical records.

A lot of bad arguments made against natural keys are made, for example:
*) natural keys can change (so what? unless you are worried about cascades)

I find it is not so much that they can change:

       Sure, it doesn't matter whether a certain code reads
       "C03EB21" or "C03EB22".

but rather that they tend to go non-unique whenever the
whims of meatspace decide it's now more convenient to allow
dupes:

       Suddenly there must be two records with code "C03EB21".

*) SSN are reused! (SSN is obviously not a complete key if you want to
identify a person)
*) most tables don't have unique natural keys (let's see em)

Now, those two arguments are bogus, I agree.

pretty much agree on all points. I don't really think primary keys
tend to change very much in terms of schema but when they do it can be
a real headache.

I took a quick look at the gnumed schema and found it to be generally
very thorough and excellent. If you're going to use surrogate keys,
that's they way to do it. That's a neat trick btw to use inheritance
for the auditing feature...how is it working out for you? Any general
comments on postgresql with regards to your product?

merlin

#12Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Merlin Moncure (#11)
Re: pervasiveness of surrogate (also called synthetic) keys

On Fri, Apr 29, 2011 at 10:14:07AM -0500, Merlin Moncure wrote:

I took a quick look at the gnumed schema and found it to be generally
very thorough and excellent. If you're going to use surrogate keys,
that's they way to do it.

Good to know since I'm only a lowly medical doctor not
having much schooling in database matters beyond this list,
the PostgreSQL docs, and the Celko book.

That's a neat trick btw to use inheritance
for the auditing feature...how is it working out for you?

That works very nicely for us. Same thing with aggregating
clinical narrative across diverse tables.

Any general comments on postgresql with regards to your product?

We have found it to be very dependable and professionally
maintained. We've never lost any patient data due to crashes
(for what that's worth). The breadth of constraints one can
define saved our behinds several times by preventing buggy
applications from storing faulty patient data.

One thing that'd be helpful to have would be ON CONNECT
triggers - that would make it much safer to support HIPAA
requirements (I'm aware of the apparent fallacy of a faulty
ON CONNECT trigger preventing superuser access - that can be
overcome by not running ON CONNECT triggers in single-user
rescue mode).

Another would be database wide asserts on data. Of course,
better support of inheritance in terms of definably
propagating constraints and triggers would be nice :-)

But that's a lot to ask.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#13Scott Marlowe
scott.marlowe@gmail.com
In reply to: Karsten Hilbert (#10)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thu, Apr 28, 2011 at 3:07 PM, Karsten Hilbert
<Karsten.Hilbert@gmx.net> wrote:

I must agree with a recent poster
that what appears to identify as a natural key often really
isn't or else becomes not so later on. It's vastly easier to
then deal with that by re-defining constraints without
having to touch primary keys.

I have run into this. Often what happens is that the data model
directly affects the assumptions coders make. In a system I worked
on, the username was unique / primary key. Code was written that
assumed username would always be unique across the app. Then the app
was deployed to multiple non-associated groups, where user names from
two different organizations might be the same and they couldn't be
changed. Talk about some ugly code changes needed to be made to fix
all the bugs. ugh. An initial assumption that a serial value would
be unique, but usernames might not be would have resulted in a much
cleaner design for that system. Or even that the PK was org||username
or something.

#14Andres Freund
andres@anarazel.de
In reply to: Scott Ribe (#6)
Re: pervasiveness of surrogate (also called synthetic) keys

On Thursday, April 28, 2011 08:31:09 PM Scott Ribe wrote:

Well, natural keys are quite obviously the way to go, when they exist. The
problem is, they usually don't really exist. What's usually proposed as a
natural key, will upon further investigation, either not be guaranteed
unique, or not guaranteed to be unchanging, or both.

There is no fundamental problem with changing primary keys. Sure, there are
challenges and situations where thats annoying and problematic, but it's not
as bad as often assumed.

Andres

#15Greg Smith
gsmith@gregsmith.com
In reply to: Karsten Hilbert (#12)
Re: pervasiveness of surrogate (also called synthetic) keys

On 05/01/2011 06:12 PM, Karsten Hilbert wrote:

Good to know since I'm only a lowly medical doctor not
having much schooling in database matters beyond this list,
the PostgreSQL docs, and the Celko book.

This debate exists at all levels of experience, and the only thing that
changes as you get more experienced people involved is an increase in
anecdotes on each side. The sole time I ever found myself arguing with
Joe Celko is over an article he wrote recommending natural keys, using
an example from the automotive industry. Problem was, the specific
example he gave was flat out wrong. I was working in automotive MIS at
the time, and the thing he was saying would never change did, in fact,
change every year--in only a fraction of a percent of cases, in an
extremely subtle way that snuck up on people and wreaked much
confusion. That's typical for an early natural key design: you get it
working fine in V1.0, only to discover months or years down the road
there's a case you never considered you don't model correctly, and it
may take some sort of conversion to fix.

The reason why there's a strong preference for surrogate keys is that
they always work and you can avoid ever needing to come up with a better
design. if you just use them and forget about it. The position Merlin
has advocated here, that there should always be a natural key available
if you know the data well enough, may be true. But few people are good
enough designers to be sure they've made the decision correctly, and the
downsides of being wrong can be a long, painful conversion process.
Easier for most people to just eliminate the possibility of making a
mistake by using auto-generated surrogate keys, where the primary
problem you'll run into is merely using more space/resources than you
might otherwise need to have. It minimizes the worst-case--mistake make
in the model, expensive re-design--by adding overhead that makes the
average case more expensive. Software design usually has enough risks
that any time you can eliminate one just by throwing some resources at
it, that's normally the right thing to do.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#16Jim Irrer
irrer@umich.edu
In reply to: Greg Smith (#15)
Re: pervasiveness of surrogate (also called synthetic) keys

I've been carefully reading all of the comments with great interest.

Thanks very much for the thoughtful responses - very enlightening.

- Jim (the topic originator)

Jim Irrer irrer@umich.edu (734) 647-4409
University of Michigan Hospital Radiation Oncology
519 W. William St. Ann Arbor, MI 48103

On Mon, May 2, 2011 at 11:10 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Show quoted text

On 05/01/2011 06:12 PM, Karsten Hilbert wrote:

Good to know since I'm only a lowly medical doctor not
having much schooling in database matters beyond this list,
the PostgreSQL docs, and the Celko book.

This debate exists at all levels of experience, and the only thing that
changes as you get more experienced people involved is an increase in
anecdotes on each side. The sole time I ever found myself arguing with Joe
Celko is over an article he wrote recommending natural keys, using an
example from the automotive industry. Problem was, the specific example he
gave was flat out wrong. I was working in automotive MIS at the time, and
the thing he was saying would never change did, in fact, change every
year--in only a fraction of a percent of cases, in an extremely subtle way
that snuck up on people and wreaked much confusion. That's typical for an
early natural key design: you get it working fine in V1.0, only to discover
months or years down the road there's a case you never considered you don't
model correctly, and it may take some sort of conversion to fix.

The reason why there's a strong preference for surrogate keys is that they
always work and you can avoid ever needing to come up with a better design.
if you just use them and forget about it. The position Merlin has advocated
here, that there should always be a natural key available if you know the
data well enough, may be true. But few people are good enough designers to
be sure they've made the decision correctly, and the downsides of being
wrong can be a long, painful conversion process. Easier for most people to
just eliminate the possibility of making a mistake by using auto-generated
surrogate keys, where the primary problem you'll run into is merely using
more space/resources than you might otherwise need to have. It minimizes
the worst-case--mistake make in the model, expensive re-design--by adding
overhead that makes the average case more expensive. Software design
usually has enough risks that any time you can eliminate one just by
throwing some resources at it, that's normally the right thing to do.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Merlin Moncure
mmoncure@gmail.com
In reply to: Greg Smith (#15)
Re: pervasiveness of surrogate (also called synthetic) keys

On Mon, May 2, 2011 at 10:10 AM, Greg Smith <greg@2ndquadrant.com> wrote:

On 05/01/2011 06:12 PM, Karsten Hilbert wrote:

Good to know since I'm only a lowly medical doctor not
having much schooling in database matters beyond this list,
the PostgreSQL docs, and the Celko book.

This debate exists at all levels of experience, and the only thing that
changes as you get more experienced people involved is an increase in
anecdotes on each side.  The sole time I ever found myself arguing with Joe
Celko is over an article he wrote recommending natural keys, using an
example from the automotive industry.  Problem was, the specific example he
gave was flat out wrong.  I was working in automotive MIS at the time, and
the thing he was saying would never change did, in fact, change every
year--in only a fraction of a percent of cases, in an extremely subtle way
that snuck up on people and wreaked much confusion.  That's typical for an
early natural key design:  you get it working fine in V1.0, only to discover
months or years down the road there's a case you never considered you don't
model correctly, and it may take some sort of conversion to fix.

The reason why there's a strong preference for surrogate keys is that they
always work and you can avoid ever needing to come up with a better design.
if you just use them and forget about it.  The position Merlin has advocated
here, that there should always be a natural key available if you know the
data well enough, may be true.  But few people are good enough designers to
be sure they've made the decision correctly, and the downsides of being
wrong can be a long, painful conversion process.  Easier for most people to
just eliminate the possibility of making a mistake by using auto-generated
surrogate keys, where the primary problem you'll run into is merely using
more space/resources than you might otherwise need to have.  It minimizes
the worst-case--mistake make in the model, expensive re-design--by adding
overhead that makes the average case more expensive.  Software design
usually has enough risks that any time you can eliminate one just by
throwing some resources at it, that's normally the right thing to do.

There are many practical arguments advocating the use of surrogate
keys. Faster updates, easier schema changes, performance, maintenance
costs down the line, better tools integration, etc. These arguments
basically involve trade-offs that can be justifiably be used to push
you one way or the other. That's all well and good.

However, I tend to disagree with arguments that you are better off not
identifying natural keys at all. To my mind, any database that has a
table with no discernible key but whose records are referred to via
another table's foreign key has a schema that is in a State of Error.
A surrogate key is just added information to the database -- why does
it play that record X out of N identical candidates should be the one
mapped? Is that logic repeatable? What are the ramifications for
joins that do not flow through the ID columns? Typically what it means
is that the rules that guard against duplicate information entry are
not, in fact in the database at all but in the application, and bad
data can now get into your database by a much broader array of causes.
The last and best defense against a nasty and common class of data
errors has been removed. The more complex and your database is, the
more it tends to be used a by a large array of clients, possibly even
spanning multiple computer languages -- thus the need for a root
system of constraint checking that is declarative and easily
understood.

Sure, requirements change, models change, but at any particular point
and time a model with as little as possible (read: none) outside
inputs should be able to provably demonstrate verifiable facts. With
a natural key database (or a surrogate key database with defined keys
that are not used for RI) you have inherent constraint checking that a
purely surrogate database simply doesn't have. Whatever the software
maintenance costs are, which is itself a complex and debatable topic,
I'll go with a strategy that gives a better defense against bad or
ambiguous data.

merlin

#18Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#15)
Re: pervasiveness of surrogate (also called synthetic) keys

On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote:

The position Merlin
has advocated here, that there should always be a natural key available
if you know the data well enough, may be true. But few people are good
enough designers to be sure they've made the decision correctly, and the
downsides of being wrong can be a long, painful conversion process.
Easier for most people to just eliminate the possibility of making a
mistake by using auto-generated surrogate keys, where the primary
problem you'll run into is merely using more space/resources than you
might otherwise need to have. It minimizes the worst-case--mistake make
in the model, expensive re-design--by adding overhead that makes the
average case more expensive.

Once you really try to define "natural" and "surrogate" keys, I think a
lot of the arguments disappear. I wrote about this a few years back:

http://thoughts.j-davis.com/2007/12/11/terminology-confusion/

In particular, I think you are falsely assuming that a natural key must
be generated from an outside source (or some source outside of your
control), and is therefore not reliably unique.

You can generate your own keys, and if you hand them out to customers
and include them on paperwork, they are now a part of the reality that
your database models -- and therefore become natural keys. Invoice
numbers, driver's license numbers, etc., are all natural keys, because
they are known about, and used, in reality. Usernames are, too, the only
difference is that you let the user choose it.

In contrast, a pointer or a UUID typically does not represent reality,
because no humans ever see it and no computer systems outside yours know
about it. So, it's merely an implementation detail and should not be a
part of the model.

Regards,
Jeff Davis

#19Craig Ringer
craig@2ndquadrant.com
In reply to: Jeff Davis (#18)
Re: pervasiveness of surrogate (also called synthetic) keys

On 03/05/11 08:25, Jeff Davis wrote:

You can generate your own keys, and if you hand them out to customers
and include them on paperwork, they are now a part of the reality that
your database models -- and therefore become natural keys. Invoice
numbers, driver's license numbers, etc., are all natural keys, because
they are known about, and used, in reality. Usernames are, too, the only
difference is that you let the user choose it.

I've repeatedly run into situations where I generate a key that seems
entirely sensible, making a generated primary key part of the business
processes ... then external constraints force me to change the format of
that key or start accepting keys from outside. "Oh, we need to move to
14-digit client IDs because <x-system> that we interact with requires
them". "We want invoice numbers to include a Luhn check digit, can you
add that?". Etc.

I'm now strongly in favour of keeping an internal key that users never
see, and having separate user-visible identifiers. The users can demand
that those identifiers change format or generation method and it's an
easy change in only one place. Want two different keys? I can do that
too. Record a key that matches some external system? That's easy. Want
to be able to edit/override/rename keys? Yep, that's fuss free too, and
it won't affect my audit history (which uses the real internal keys) or
have to cascade to foreign key relationships in append-only ledger tables.

I use a mix of surrogate and natural keys, depending on the situation. I
see little point in surrogate keys for simple lookup tables, but find
them invaluable in audited tables with lots of foreign key relationships
that interact with other business systems.

--
Craig Ringer

#20Scott Marlowe
scott.marlowe@gmail.com
In reply to: Craig Ringer (#19)
Re: pervasiveness of surrogate (also called synthetic) keys

On Mon, May 2, 2011 at 7:43 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:

I'm now strongly in favour of keeping an internal key that users never
see, and having separate user-visible identifiers. The users can demand
that those identifiers change format or generation method and it's an

It's far easier to change a unique constraint on the fly than a PK in
a busy production database.

#21Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Davis (#18)
#22Dann Corbit
DCorbit@connx.com
In reply to: Rob Sargent (#21)
#23Craig Ringer
craig@2ndquadrant.com
In reply to: Rob Sargent (#21)
#24Greg Smith
gsmith@gregsmith.com
In reply to: Rob Sargent (#21)
#25John R Pierce
pierce@hogranch.com
In reply to: Greg Smith (#24)
#26Craig Ringer
craig@2ndquadrant.com
In reply to: Greg Smith (#24)
#27Rob Sargent
robjsargent@gmail.com
In reply to: John R Pierce (#25)
#28Rob Sargent
robjsargent@gmail.com
In reply to: Craig Ringer (#26)
#29Craig Ringer
craig@2ndquadrant.com
In reply to: Rob Sargent (#28)
#30Scott Ribe
scott_ribe@elevated-dev.com
In reply to: Rob Sargent (#27)
#31Greg Smith
gsmith@gregsmith.com
In reply to: John R Pierce (#25)
#32Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Craig Ringer (#23)
#33Rick Genter
rick.genter@gmail.com
In reply to: Craig Ringer (#23)
#34Merlin Moncure
mmoncure@gmail.com
In reply to: Craig Ringer (#26)
#35Jeff Davis
pgsql@j-davis.com
In reply to: Rob Sargent (#21)
#36Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Davis (#35)
#37Jeff Davis
pgsql@j-davis.com
In reply to: Rob Sargent (#36)
#38Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#24)
#39Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Davis (#37)
#40Greg Smith
gsmith@gregsmith.com
In reply to: Merlin Moncure (#34)
#41Greg Smith
gsmith@gregsmith.com
In reply to: Jeff Davis (#38)
#42David G. Johnston
david.g.johnston@gmail.com
In reply to: Greg Smith (#40)
#43Greg Smith
gsmith@gregsmith.com
In reply to: David G. Johnston (#42)
#44Misa Simic
misa.simic@gmail.com
In reply to: Merlin Moncure (#8)
#45Merlin Moncure
mmoncure@gmail.com
In reply to: Misa Simic (#44)
#46Misa Simic
misa.simic@gmail.com
In reply to: Merlin Moncure (#45)
#47David G. Johnston
david.g.johnston@gmail.com
In reply to: Misa Simic (#46)
#48Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: David G. Johnston (#47)
#49Merlin Moncure
mmoncure@gmail.com
In reply to: Misa Simic (#46)
#50Merlin Moncure
mmoncure@gmail.com
In reply to: Greg Smith (#43)
#51Misa Simic
misa.simic@gmail.com
In reply to: Merlin Moncure (#50)
#52Scott Marlowe
scott.marlowe@gmail.com
In reply to: Greg Smith (#40)