FW: Surrogate keys (Was: enums)

Started by Jim Nasbyalmost 20 years ago34 messages

jnasby@pervasive.com

almost 20 years ago

Ooops, fat-finger'd -hackers...

-----Original Message-----
Adding -hackers back to the list.

From: Leandro Guimarães Faria Corcete Dutra
Em Seg, 2006-01-16 às 12:52 -0600, Jim C. Nasby escreveu:

On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote:

For UPDATEs and INSERTs, the "proper" primary key also

needs to be

checked, but keys are used for more than just checking

uniqueness:

they're also often used in JOINs. Joining against a

single integer

I'd think it quite a different proposition (I'd think

faster in terms

of performance) than joining against, say, a text column or a
composite key.

How different is that?

Comparing two ints is much, much faster than comparing two text fields. For a small number of comparisons, it doesn't matter. When you're joining tables together, it's a different story.

a) the optimizer does a really poor job on multi-column

index statistics

Then it should eventually be fixed?

It's on the to-do, but it's not an easy nut to crack.

b) If each parent record will have many children, the space

savings from

using a surrogate key can be quite large

Only where the surrogate is significantly smaller than
the natural?

#define significant

Here's a real-life example: the primary table for stats.distributed.net has about 120M rows. One field in that table (participant_id) links back to the participant table; it's an int. If instead we used participant_name and that averaged 8 characters in length, that would grow the main table by 1GB (8 chars takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The machine that stats runs on has 4G of memory, so cutting 1G of wasted space out of that table helps quite a bit.

(In actuality, there isn't participant_name... participants are identified by email address (not a great idea, but I wasn't around when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalized email out of that main table things got substantially faster. That was a number of years ago, so the table was probably 15-25% of it's current size, but it still made a huge difference.)

c) depending on how you view things, putting actual keys

all over the

place is denormalized

How come?

See my other reply... :)

Generally, I just use surrogate keys for everything unless

performance

dictates something else.

Shouldn't it be the other way round, for the user's sake?

Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even make the views updatable. But here's two other things to consider:

In many cases you can't define a single field as a unique key. So you end up with having to add many extra keys to all your join clauses. Not very friendly, and prone to error.

Not every language has equal support for text comparisons (and in my experience, almost all real keys are mostly text).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

Jim C. Nasby

jnasby@pervasive.com

almost 20 years ago

In reply to: Jim Nasby (#1)

Re: FW: Surrogate keys (Was: enums)

Comparing two ints is much, much faster than comparing two text
fields. For a small number of comparisons, it doesn't matter. When
you're joining tables together, it's a different story.

That is where data independence would come handy... like a better enum,
with possreps and hidden implementation.

Forgive me my ignorance, but are ints inherently faster to compare than
strings, or is it just an implementation detail? Ideally, if this is so
a fully data-independent system would create a hash behind the back of
user in order to get performance.

The CPU can do an integer comparison with one instruction; it can't do
that with a text string. (Well, theoretically if the string was 3/4
bytes exactly (dependin on null termination) you could compare just as
fast, but I'm pretty certain that no compiler is that fancy.)

Here's a real-life example: the primary table for
stats.distributed.net has about 120M rows. One field in that table
(participant_id) links back to the participant table; it's an int. If
instead we used participant_name and that averaged 8 characters in
length, that would grow the main table by 1GB (8 chars takes 8 bytes
instead of 4, plus there's the varlena header of 4 bytes). The machine
that stats runs on has 4G of memory, so cutting 1G of wasted space out
of that table helps quite a bit.

OK, hardly a typical example. As I think I left clear, my problem is
not using surrogate keys, but using them by default, or even
exclusively.

No? It's certainly not uncommon to have tables with 100M+ rows. And keep
in mind that this applies to every row of every table that has foreign
keys. I'd bet it's actually common to save 1G or more with surrogate
keys in moderately sized databases.

Of course, you do have to be intelligent here, too. The only key defined
on the table in my example is participant_id, project_id, date; there is
no surrogate key because there's no real reason to have one.

(In actuality, there isn't participant_name... participants are
identified by email address (not a great idea, but I wasn't around
when that was chosen). As you can imagine, email addresses are
substantially longer than 4 bytes. When we normalized email out of
that main table things got substantially faster. That was a number of
years ago, so the table was probably 15-25% of it's current size, but
it still made a huge difference.)

This isn't normalisation at all, as far as I understand it. It is just

I don't have the rules of normalization memorized enough to know what
form this breaks, but I'm 99% certain it breaks at least one of them.

Look at it this way: if someone wants to change their email address,
best case scenario is that you have cascading RI setup and it updates
thousands of rows in that table. Worst case scenario, you just de-linked
a whole bunch of data. But with a surrogate key, all you have to do is
update one row in one table and you're done.

that we don't have data independence... so you had to expose an
implementation detail?

Expose to what? The application? First, this is a pretty minor thing to
expose; second, if it's that big a concern you can completely hide it by
using a view.

But the reality is, dealing with a numeric ID can be a heck of a lot
easier than an email address. Look at URLs that embbed one versus the
other for a good example.

Why should it? It's trivial to create views that abstract surrogate
keys out, and if you really want to you can even make the views
updatable. But here's two other things to consider:

These views, in heavy querying environments, can be prohibitive.

"Normalize 'til it hurts; denormalize 'til it works."

Yes, the added overhead of rules for updates/inserts/deletes could start
to add up in performance-critical code. But if performance is that
critical you're far more likely to run into other bottlenecks first. And
worst-case, you abstract behind a stored procedure that just has the
right queries hard-coded.

As for select-only views you'll have a hard time showing any meaningful
performance penalty.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

Leandro Guimarães Faria Corcete Dutra

leandro@dutra.fastmail.fm

almost 20 years ago

In reply to: Jim C. Nasby (#2)

Re: FW: Surrogate keys (Was: enums)

Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:

Forgive me my ignorance, but are ints inherently faster to compare than
strings, or is it just an implementation detail? Ideally, if this is so
a fully data-independent system would create a hash behind the back of
user in order to get performance.

The CPU can do an integer comparison with one instruction; it can't do
that with a text string.

OK. Again, data independence should be the goal here.

OK, hardly a typical example. As I think I left clear, my problem is
not using surrogate keys, but using them by default, or even
exclusively.

No? It's certainly not uncommon to have tables with 100M+ rows.

No, but neither are they *that* common.

Certainly, lots of database have a few of them. But then, they have
dozens, hundreds, thousands of much smaller tables.

And keep
in mind that this applies to every row of every table that has foreign
keys. I'd bet it's actually common to save 1G or more with surrogate
keys in moderately sized databases.

Only if you have quite some children, because otherwise, in the main
tables, the surrogate keys add a field, an index and a sequence to an
otherwise smaller table and index.

Of course, you do have to be intelligent here, too. The only key defined
on the table in my example is participant_id, project_id, date; there is
no surrogate key because there's no real reason to have one.

Quite.

(In actuality, there isn't participant_name... participants are
identified by email address (not a great idea, but I wasn't around
when that was chosen). As you can imagine, email addresses are
substantially longer than 4 bytes. When we normalized email out of
that main table things got substantially faster. That was a number of
years ago, so the table was probably 15-25% of it's current size, but
it still made a huge difference.)

This isn't normalisation at all, as far as I understand it. It is just

I don't have the rules of normalization memorized enough to know what
form this breaks, but I'm 99% certain it breaks at least one of them.

No, never.

Normalisation is about eliminating redundancy and, therefore, update
anomalies. Making all the table dependent on only the keys and the
whole keys, by projecting relations to eliminate entity mixups.

What you mention is actually exposing an implementation detail, namely
an integer that serves as a hash of the key.

Look at it this way: if someone wants to change their email address,
best case scenario is that you have cascading RI setup and it updates
thousands of rows in that table. Worst case scenario, you just de-linked
a whole bunch of data. But with a surrogate key, all you have to do is
update one row in one table and you're done.

OK, if you have lots of linked data. But most tables are really dead
ends.

that we don't have data independence... so you had to expose an
implementation detail?

Expose to what? The application? First, this is a pretty minor thing to
expose; second, if it's that big a concern you can completely hide it by
using a view.

As someone said, you end up with ids everywhere, and no
user-understandable data at all...

But the reality is, dealing with a numeric ID can be a heck of a lot
easier than an email address. Look at URLs that embbed one versus the
other for a good example.

Again, implementation details... levels mixup.

Why should it? It's trivial to create views that abstract surrogate
keys out, and if you really want to you can even make the views
updatable. But here's two other things to consider:

These views, in heavy querying environments, can be prohibitive.

"Normalize 'til it hurts; denormalize 'til it works."

Lack of data implementation biting us again.

Yes, the added overhead of rules for updates/inserts/deletes could start
to add up in performance-critical code. But if performance is that
critical you're far more likely to run into other bottlenecks first. And
worst-case, you abstract behind a stored procedure that just has the
right queries hard-coded.

As for select-only views you'll have a hard time showing any meaningful
performance penalty.

Yet real user-defined data types could make it all much simpler.

-- 
+55 (11) 5685 2219               xmpp:leandrod@jabber.org
+55 (11) 9406 7191                       Yahoo!: lgcdutra    
+55 (11) 5686 9607         MSN: leandro@dutra.fastmail.fm
+55 (11) 4390 5383                      ICQ/AIM: 61287803

Michael Glaesemann

grzm@myrealbox.com

almost 20 years ago

In reply to: Leandro Guimarães Faria Corcete Dutra (#3)

Re: Surrogate keys (Was: enums)

On Jan 19, 2006, at 9:31 , Leandro Guimarães Faria Corcete Dutra wrote:

OK. Again, data independence should be the goal here.

Again, implementation details... levels mixup.

Lack of data implementation biting us again.

Yet real user-defined data types could make it all much simpler.

Again, again, and again, patches welcome! PostgreSQL is an open-
source project, and people contribute in a variety of ways, two of
which include submitting code and sponsoring others to develop code.
If you look at the todo list, there are *lots* of things people would
like to see improved in PostgreSQL, but the pace at which PostgreSQL
is improved and what is improved is driven in large part by what
people are willing to do themselves or sponsor. If these are things
you're interested in (and it certainly appears you are), why not
contribute?

Michael Glaesemann
grzm myrealbox com

Dann Corbit

DCorbit@connx.com

almost 20 years ago

In reply to: Leandro Guimarães Faria Corcete Dutra (#3)

Re: FW: Surrogate keys (Was: enums)

Maybe it goes better into Advocacy or something, but I have found a quote by database big-wigs that I strongly disagree with:

From:
http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

We have this.
"PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary key is not available.
Second generation systems support the notion of a primary key, which is a user-assigned unique identifier. If a primary key exists for a collection that is known never to change, for example social security number, student registration number, or employee number, then no additional system-assigned UID is required. An immutable primary key has an extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning. Consequently, in data interchange or debugging this may be an advantage. If no primary key is available for a collection, then it is imperative that a system-assigned UID be provided. Because SQL supports update through a cursor, second generation systems must be able to update the last record retrieved, and this is only possible if it can be uniquely identified. If no primary key serves this purpose, the system must include an extra UID. Therefore, several second generation systems already obey this proposition. Moreover, as will be noted in Proposition 2.3, some collections, e.g. views, do not necessarily have system assigned UIDs, so building a system that requires them is likely to be proven undesirable. We close our discussion on Tenet 1 with a final proposition that deals with the notion of rules."

This is a bad idea.

Let's take the example of a Social Security Number.

Not everyone has one:
http://www.ssa.gov/pubs/10002.html#how2

If people do have one, they can definitely change it. If someone has stolen a SSN, then the wronged party is able to get their SSN changed:
http://101-identitytheft.com/ssn.htm
The odds of this happening are low, but if you cannot handle it, then the damage caused is considerable.
Now what happens if you want to have customers outside of the USA? {Don't worry, we'll never go global...} I hope that my objections are very plain and obvious.

The primary key should be immutable, meaning that its value should not be changed during the course of normal operations of the database. What natural key is immutable? The answer is that such an attribute does not exist. To use them for such a purpose is begging for trouble.

I saw the argument that there is a great volume of space wasted by adding a column that does not naturally occur in the data. That argument is simply absurd. Consider a database with 10 billion rows of data in it. Each of those tables gets an 8 byte primary key added for every row, resulting in 80 GB consumed. The cost of 80 GB is perhaps $200. With a database that large (where the extra space consumed by an artificial key column has a cost that can easily be measured) the odds of a problem arising due to a natural column changing its value are huge. The cost of such a tragedy is certainly more than the $200 pittance!

If there is an argument that we also have the parent key values propagated into the child tables as foreign keys, that argument has no merit. The other attribute that would have been chosen would also be propagated. And so (for instance) there is no savings to propagating a SSN field into child tables verses propagating an 8 byte integer.

I also saw an argument that the propagated ID values are confusing to end-users. That is the fault of the database designer who game them a stupid name. If they were things like InvoiceID and LineItemID then there will not be the same sort of confusion. The meaning and purpose of the column is immediately apparent. As an alternative, the ubiquitous OID name for a column on a table is also very transparent. Of course, when it is used in a foreign key, it must be given a role name to avoid confusion in that case.

At any rate, the use of natural keys is a mistake made by people who have never had to deal with very large database systems.

IMO-YMMV.

Show quoted text

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Leandro Guimarães Faria Corcete Dutra
Sent: Wednesday, January 18, 2006 4:31 PM
To: Jim C. Nasby
Cc: pgsql-hackers@postgresql.org
Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums)

Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:

Forgive me my ignorance, but are ints inherently faster to compare

than

strings, or is it just an implementation detail? Ideally, if this is

so

a fully data-independent system would create a hash behind the back of
user in order to get performance.

The CPU can do an integer comparison with one instruction; it can't do
that with a text string.

OK. Again, data independence should be the goal here.

OK, hardly a typical example. As I think I left clear, my problem

is

not using surrogate keys, but using them by default, or even
exclusively.

No? It's certainly not uncommon to have tables with 100M+ rows.

No, but neither are they *that* common.

Certainly, lots of database have a few of them. But then, they have
dozens, hundreds, thousands of much smaller tables.

And keep
in mind that this applies to every row of every table that has foreign
keys. I'd bet it's actually common to save 1G or more with surrogate
keys in moderately sized databases.

Only if you have quite some children, because otherwise, in the main
tables, the surrogate keys add a field, an index and a sequence to an
otherwise smaller table and index.

Of course, you do have to be intelligent here, too. The only key defined
on the table in my example is participant_id, project_id, date; there is
no surrogate key because there's no real reason to have one.

Quite.

(In actuality, there isn't participant_name... participants are
identified by email address (not a great idea, but I wasn't around
when that was chosen). As you can imagine, email addresses are
substantially longer than 4 bytes. When we normalized email out of
that main table things got substantially faster. That was a number of
years ago, so the table was probably 15-25% of it's current size, but
it still made a huge difference.)

This isn't normalisation at all, as far as I understand it. It is

just

I don't have the rules of normalization memorized enough to know what
form this breaks, but I'm 99% certain it breaks at least one of them.

No, never.

Normalisation is about eliminating redundancy and, therefore, update
anomalies. Making all the table dependent on only the keys and the
whole keys, by projecting relations to eliminate entity mixups.

What you mention is actually exposing an implementation detail,
namely
an integer that serves as a hash of the key.

Look at it this way: if someone wants to change their email address,
best case scenario is that you have cascading RI setup and it updates
thousands of rows in that table. Worst case scenario, you just de-linked
a whole bunch of data. But with a surrogate key, all you have to do is
update one row in one table and you're done.

OK, if you have lots of linked data. But most tables are really
dead
ends.

that we don't have data independence... so you had to expose an
implementation detail?

Expose to what? The application? First, this is a pretty minor thing to
expose; second, if it's that big a concern you can completely hide it by
using a view.

As someone said, you end up with ids everywhere, and no
user-understandable data at all...

But the reality is, dealing with a numeric ID can be a heck of a lot
easier than an email address. Look at URLs that embbed one versus the
other for a good example.

Again, implementation details... levels mixup.

Why should it? It's trivial to create views that abstract surrogate
keys out, and if you really want to you can even make the views
updatable. But here's two other things to consider:

These views, in heavy querying environments, can be prohibitive.

"Normalize 'til it hurts; denormalize 'til it works."

Lack of data implementation biting us again.

Yes, the added overhead of rules for updates/inserts/deletes could start
to add up in performance-critical code. But if performance is that
critical you're far more likely to run into other bottlenecks first. And
worst-case, you abstract behind a stored procedure that just has the
right queries hard-coded.

As for select-only views you'll have a hard time showing any meaningful
performance penalty.

Yet real user-defined data types could make it all much simpler.
--
+55 (11) 5685 2219               xmpp:leandrod@jabber.org
+55 (11) 9406 7191                       Yahoo!: lgcdutra
+55 (11) 5686 9607         MSN: leandro@dutra.fastmail.fm
+55 (11) 4390 5383                      ICQ/AIM: 61287803
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Import Notes

Resolved by subject fallback

Michael Glaesemann

grzm@myrealbox.com

almost 20 years ago

In reply to: Dann Corbit (#5)

Re: Surrogate keys (Was: enums)

On Jan 19, 2006, at 10:34 , Dann Corbit wrote:

http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

"PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be
assigned by the DBMS only if a user-defined primary key is not
available.

An immutable primary key has an extra advantage over a system-
assigned unique identifier because it has a natural, human readable
meaning. Consequently, in data interchange or debugging this may be
an advantage. If no primary key is available for a collection,
then it is imperative that a system-assigned UID be provided.

Dann Corbit:

The primary key should be immutable, meaning that its value should
not be changed during the course of normal operations of the
database. What natural key is immutable? The answer is that such
an attribute does not exist. To use them for such a purpose is
begging for trouble.

As far as I can tell, the only difference between your position,
Dann, and Date and Darwen's, is that you think no natural key is
immutable. If you *could* find an immutable natural key, would it be
an acceptable key for you? Date and Darwen say explicitly that if no
immutable (natural) (primary) key is available a system-assigned UID
is required. If you think there is no immutable natural key
available, Darwen and Date would agree that you should use a system-
generated key. Or do you think I'm misreading you or The Third
Manifesto?

Michael Glaesemann
grzm myrealbox com

Dann Corbit

DCorbit@connx.com

almost 20 years ago

In reply to: Michael Glaesemann (#6)

Re: Surrogate keys (Was: enums)

-----Original Message-----
From: Michael Glaesemann [mailto:grzm@myrealbox.com]
Sent: Wednesday, January 18, 2006 5:48 PM
To: Dann Corbit
Cc: Leandro Guimarães Faria Corcete Dutra; Jim C. Nasby; pgsql-
hackers@postgresql.org
Subject: Re: [HACKERS] Surrogate keys (Was: enums)

On Jan 19, 2006, at 10:34 , Dann Corbit wrote:

http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

"PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be
assigned by the DBMS only if a user-defined primary key is not
available.

<snip />

An immutable primary key has an extra advantage over a system-
assigned unique identifier because it has a natural, human readable
meaning. Consequently, in data interchange or debugging this may be
an advantage. If no primary key is available for a collection,
then it is imperative that a system-assigned UID be provided.

<snip />

Dann Corbit:

The primary key should be immutable, meaning that its value should
not be changed during the course of normal operations of the
database. What natural key is immutable? The answer is that such
an attribute does not exist. To use them for such a purpose is
begging for trouble.

As far as I can tell, the only difference between your position,
Dann, and Date and Darwen's, is that you think no natural key is
immutable. If you *could* find an immutable natural key, would it be
an acceptable key for you? Date and Darwen say explicitly that if no
immutable (natural) (primary) key is available a system-assigned UID
is required. If you think there is no immutable natural key
available, Darwen and Date would agree that you should use a system-
generated key. Or do you think I'm misreading you or The Third
Manifesto?

If you could find an immutable natural key, it would be the *BEST* thing to use. Unfortunately, I believe that immutable natural keys are rarer than horse feathers and pickle smoke. Furthermore, because of statements like the one that I collected and pasted from the above document, I believe that people will choose totally inappropriate things (I have seen it many times and had to deal with the repercussions) to use as natural keys (e.g. SSN) and cause enormous damage through those choices.

But I suppose on a sort of "mathematical" level the statement is fully true.

Import Notes

Resolved by subject fallback

Josh Berkus

josh@agliodbs.com

almost 20 years ago

In reply to: Dann Corbit (#5)

Re: FW: Surrogate keys (Was: enums)

Dann,

The primary key should be immutable, meaning that its value should not be
changed during the course of normal operations of the database.

Why? I don't find this statement to be self-evident. Why would we have ON
UPDATE CASCADE if keys didn't change sometimes?

At any rate, the use of natural keys is a mistake made by people who have
never had to deal with very large database systems.

Oh, I guess I'm dumb then. The biggest database system I ever had to deal
with was merely 5 TB ...

Anyway, my opinion on this, in detail, will be on the ITToolBox blog. You can
argue with me there.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Tom Lane

tgl@sss.pgh.pa.us

almost 20 years ago

In reply to: Michael Glaesemann (#6)

Re: Surrogate keys (Was: enums)

Michael Glaesemann <grzm@myrealbox.com> writes:

As far as I can tell, the only difference between your position,
Dann, and Date and Darwen's, is that you think no natural key is
immutable.

D&D's examples of "natural" keys are worth a second look though:

If a primary key exists for a collection that is known never to change,
for example social security number, student registration number, or
employee number, then no additional system-assigned UID is required.

The problem with SSN is that somebody other than you controls it.
If you are the college registrar, then you control the student's
registration number, and you don't have to change it. In fact, guess
what: you probably generated it in the same way as a surrogate key.

I'd argue that all of these are in reality the exact same thing as
a surrogate key --- from the point of view of the issuing authority.
But from anyone else's point of view, they are external data and you
can't hang your own database design on the assumption that they won't
change.

regards, tom lane

#10

Greg Stark

gsstark@mit.edu

almost 20 years ago

In reply to: Tom Lane (#9)

Re: Surrogate keys (Was: enums)

If a primary key exists for a collection that is known never to change,
for example social security number, student registration number, or
employee number, then no additional system-assigned UID is required.

In point of fact Social security numbers *can* change.

--
greg

#11

Leandro Guimarães Faria Corcete Dutra

leandro@dutra.fastmail.fm

almost 20 years ago

In reply to: Michael Glaesemann (#4)

Re: Surrogate keys (Was: enums)

Em Qui, 2006-01-19 às 09:54 +0900, Michael Glaesemann escreveu:

On Jan 19, 2006, at 9:31 , Leandro Guimarães Faria Corcete Dutra wrote:

If these are things
you're interested in (and it certainly appears you are), why not
contribute?

'Cause unfortunately am not a hacker...

-- 
+55 (11) 5685 2219               xmpp:leandrod@jabber.org
+55 (11) 9406 7191                       Yahoo!: lgcdutra    
+55 (11) 5686 9607         MSN: leandro@dutra.fastmail.fm
+55 (11) 4390 5383                      ICQ/AIM: 61287803

#12

David Fetter

david@fetter.org

almost 20 years ago

In reply to: Tom Lane (#9)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote:

Michael Glaesemann <grzm@myrealbox.com> writes:

As far as I can tell, the only difference between your position,
Dann, and Date and Darwen's, is that you think no natural key is
immutable.

D&D's examples of "natural" keys are worth a second look though:

If a primary key exists for a collection that is known never to
change, for example social security number, student registration
number, or employee number, then no additional system-assigned
UID is required.

The problem with SSN is that somebody other than you controls it.

No, that's not the big problem. The big problem is that it's very
likely illegal for you to use it for anything unless you happen to be
the Social Security Administration.

If you are the college registrar, then you control the student's
registration number, and you don't have to change it. In fact,
guess what: you probably generated it in the same way as a surrogate
key.

True.

I'd argue that all of these are in reality the exact same thing as a
surrogate key --- from the point of view of the issuing authority.
But from anyone else's point of view, they are external data and you
can't hang your own database design on the assumption that they
won't change.

Right :)

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

#13

Martijn van Oosterhout

kleptog@svana.org

almost 20 years ago

In reply to: David Fetter (#12)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 12:50:52AM -0800, David Fetter wrote:

On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote:
No, that's not the big problem. The big problem is that it's very
likely illegal for you to use it for anything unless you happen to be
the Social Security Administration.

Actually no. From reading here[1]http://www.cpsr.org/prevsite/cpsr/privacy/ssn/ssn.faq.html#IsItIllegalToAsk it appears anyone is allowed to ask
you your SSN and they can do what they like with it.

What you're describing is more like the TFN in Australia. Apart from
the fact you're not required to have one or provide it if asked, if
you're not a bank, or share registrary or some other such institution,
you're not allowed to ask for it, let alone store it. Medicare number
the same, if you're not a health service provider, you can't ask for
it.

Anyway, this doesn't mean an SSN is a good key, for all sorts of other
reasons people have already stated.

[1]: http://www.cpsr.org/prevsite/cpsr/privacy/ssn/ssn.faq.html#IsItIllegalToAsk

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#14

Pollard, Mike

mpollard@cincom.com

almost 20 years ago

In reply to: Martijn van Oosterhout (#13)

Re: Surrogate keys (Was: enums)

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:

- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:

A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

A physical address - how about longitude/latitude/height from sea level?

The point here is two-fold. First, what we call 'natural' is frequently
itself a surrogate key (yes, even your name is really just a surrogate
key. As with all surrogate keys, it is a sequence of symbols that you
use to represent yourself). The second point is even when you find a
truly 'natural' key (something not arbitrarily made up by anyone, and
uniquely identifying the data in question), it may be completely and
utterly inappropriate to use in a database.

What is 'natural' anyway? If someone phones in an order, we usually
assign an order number to that request. This order number is not the
actual order, and the customer couldn't care a less what it is, but I've
never heard a DBA argue we should get rid of it (well, to be fair, I've
never discussed order numbers with a DBA at all). After all, would it
make sense for the key for that order to be the customer's name, the
date/time of the order, all the items ordered, and the address to ship
the order? That isn't a key, but it's the only 'natural' thing that
identifies that order that immediately comes to my mind.

On the other hand, would anyone argue that an order_item table should
have a surrogate key? Well, I wouldn't. The key for the order_item
table should be something like the order number and the inventory item
number together (IMHO).

The point? Surrogate keys and natural keys are two tools in the
database arsenal. Just as it is unwise to use a hammer to drive a screw
just because you don't believe in screwdrivers, it is unwise to just off
hand discard either method of specifying a key. Rather, use
intelligence and education (one of which is discussions such as this) in
deciding how best to represent your data to aide in performance, ease of
use, and adaptability.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.

Import Notes

Resolved by subject fallback

#15

Noname

mark@mark.mielke.cc

almost 20 years ago

In reply to: Pollard, Mike (#14)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 09:37:12AM -0500, Pollard, Mike wrote:

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:
- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:
A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)

A physical address - how about longitude/latitude/height from sea level?

Planet? Solar system? Galaxy? Universe? :-)

I agreed with what you had to say (the stuff I deleted). Just felt
like being funny. Not sure if I'm successful. Hehe...

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#16

Chris Browne

cbbrowne@acm.org

almost 20 years ago

In reply to: Pollard, Mike (#14)

Re: Surrogate keys

mark@mark.mielke.cc writes:

On Thu, Jan 19, 2006 at 09:37:12AM -0500, Pollard, Mike wrote:

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:
- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:
A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)

That is arguably about the best choice possible, for a human being, as
DNA isn't supposed to be able to change (much).

Mind you, there do exist odd cases where a person might have two sets
of DNA in different parts of their body. This commonly (well, it's
not really very common...) happens when non-identical twins share a
blood supply; that twins were involved may not be noticed if one does
not survive to birth...

A physical address - how about longitude/latitude/height from sea level?

Planet? Solar system? Galaxy? Universe? :-)

I agreed with what you had to say (the stuff I deleted). Just felt
like being funny. Not sure if I'm successful. Hehe...

Well, that's useful for representing a key for a piece of real estate.

It's fairly interestingly useless for representing a human attribute,
at least in terms of being a primary key...
--
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/linux.html
"It's a pretty rare beginner who isn't clueless. If beginners weren't
clueless, the infamous Unix learning cliff wouldn't be a problem."
-- david parsons

#17

Jim C. Nasby

jnasby@pervasive.com

almost 20 years ago

In reply to: Pollard, Mike (#14)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 09:37:12AM -0500, Pollard, Mike wrote:

The point? Surrogate keys and natural keys are two tools in the
database arsenal. Just as it is unwise to use a hammer to drive a screw
just because you don't believe in screwdrivers, it is unwise to just off
hand discard either method of specifying a key. Rather, use
intelligence and education (one of which is discussions such as this) in
deciding how best to represent your data to aide in performance, ease of
use, and adaptability.

There is one thing to consider: consistency. If you mix and match
'natural' keys and surrogate keys as PK, then how do you know which one
you're supposed to be joining on? How does everyone else on the team
know?

Sure, there's many examples where you don't really need a surrogate key.
But there's just as many (if not more) where you want a surrogate key so
that you don't have to deal with the pain of a multiple-field key. (Note
that I don't consider simply defining a multiple-field key to be unique
as painful). So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#18

Thomas Hallgren

thomas@tada.se

almost 20 years ago

In reply to: Josh Berkus (#8)

Re: FW: Surrogate keys (Was: enums)

Josh Berkus wrote:

Why? I don't find this statement to be self-evident. Why would we have ON
UPDATE CASCADE if keys didn't change sometimes?

Many times you will have references to a specific row from somewhere outside of your
database. Perhaps you have a federation of web services that collaborate or other arbitrary
URL's that contain the key.

It might be harder to create remote row sets, middle tier caches, and other similar
constructs if you cannot trust that the primary key is immutable. Such mechanisms often
trust that the primary key can be used to refetch the data and that it has been deleted if
it's no longer found.

Anyway, my opinion on this, in detail, will be on the ITToolBox blog. You can
argue with me there.

That will be interesting reading. I didn't find it (yet). Can you please post an URL here?

Regards,
Thomas Hallgren

#19

Josh Berkus

josh@agliodbs.com

almost 20 years ago

In reply to: Thomas Hallgren (#18)

Re: FW: Surrogate keys (Was: enums)

Thomas,

Many times you will have references to a specific row from somewhere
outside of your database. Perhaps you have a federation of web services
that collaborate or other arbitrary URL's that contain the key.

It might be harder to create remote row sets, middle tier caches, and
other similar constructs if you cannot trust that the primary key is
immutable. Such mechanisms often trust that the primary key can be used
to refetch the data and that it has been deleted if it's no longer found.

Sure. And that's a good reason to need an immutable surrogate key.
However, Dann was arguing that immutable surrogate keys are an "a
priori" good thing for all applications, in no need of justification,
which I don't buy.

That will be interesting reading. I didn't find it (yet). Can you please
post an URL here?

Sure, it was posted on -hackers earlier:
http://blogs.ittoolbox.com/database/soup/archives/007327.asp

--Josh

#20

Josh Berkus

josh@agliodbs.com

almost 20 years ago

In reply to: Jim C. Nasby (#17)

Re: Surrogate keys (Was: enums)

Jim,

So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.

"The lazy man's guide to SQL database design", but Jim Nasby.

;-)

--Josh

#21

Noname

mark@mark.mielke.cc

almost 20 years ago

In reply to: Josh Berkus (#20)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:

So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.

"The lazy man's guide to SQL database design", but Jim Nasby.
;-)

Hehe... I was thinking the same thing. I've definately seen cases
where the use of surrogate keys verges on ridiculous. It hasn't
harmed the application, except it terms of complexity. It still works.
It still performs fine. The SQL queries are awful looking. :-)

That's where I would tend to draw the line. For me, I find
implementation and maintenance to be the most expensive part of my
applications. My data hasn't yet become large enough to make disk
space, compute resources, or I/O bandwidth a serious concern.

If I think the use of surrogate keys may make my life harder, I'll try
not to use them. If I think they may make my life easier, I'll use
them without blinking an eye. Harder vs. easier = cost to implement.

Cheers,
mark

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#22

Dann Corbit

DCorbit@connx.com

almost 20 years ago

In reply to: Noname (#21)

Re: Surrogate keys (Was: enums)

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Josh Berkus
Sent: Thursday, January 19, 2006 10:09 AM
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Surrogate keys (Was: enums)

Martjin,

In any of these either misspellings, changes of names, ownership or
even structure over time render the obvious useless as keys. There

are

techniques for detecting and reducing duplication but the point is

that

for any of these duplicates *can* be valid data.

Please point me out where, in the writings of E.F. Codd or in the SQL
Standard, it says that keys have to be immutable for the life of the

row.

Only do that for data that you care about. If you think that the data
has no value, there is no need to have a way to identify a row.

Duplicate *values* can be valid data. Duplicate *tuples* show some
serious flaws in your database design. If you have a personnel
directory on which you've not bothered to define any unique

constraints

other than the ID column, then you can't match your data to reality.

you have two rows with the same first and last name, you don't know if
they are two different people or the same person, duplicated. Which
will be a big problem come paycheck time.

Per E.F. Codd, each tuple is a *unique* predicate (or key) comprising

set of values definining a *unique* data entity. i.e. "The employeee
named "John" "Little" at extension "4531". There is nothing anywhere
said about keys never changing.

This is Databases 101 material. Really!

I give it an 'F.'

When the data changes, the problems generated are not just due to
repercussions related to the child and parent tables related through the
primary key.

Someone has an invoice, and they call in with a question. A combination
of their name and address was used as a primary key. They moved, and
sent in a forwarding address. The DBA was smart enough to design the
database to cascade results, so that there are no orphan records and we
have not compromised the structure of the database.
The customer calls in with a question about an old invoice.
"We have no record of that transaction."

I was a DBA for a database for a company with many millions of customers
worldwide (e.g. the product registration table was 24 GB).

Their design had natural keys in it. It caused dozens of problems,
every single day.

I content that most people are not smart enough to decide when a natural
key is a good idea. The engineers that designed the database were
probably pretty smart, since it sort of worked and had thousands of
tables and hundreds of millions of rows in it. But one bad decision on
a natural key will cause literally millions of dollars of damage.

The primary defense I have heard so far is that the Oids are hard to
understand. They are nothing in comparison with understanding what to
do when you have 25 changes to primary keys on various tables every
single day.

Once you get used to Oids, I find it hard to believe that any
intelligent person finds them confusing. Confusion resulting from
having primary keys that are a moving target? Now that's confusion for
you.

IMO-YMMV.

I think it is time for me to give it a rest, though. My experience may
be very atypical and I feel strangely passionate about it.

Import Notes

Resolved by subject fallback

#23

Jim C. Nasby

jnasby@pervasive.com

almost 20 years ago

In reply to: Noname (#21)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 02:01:14PM -0500, mark@mark.mielke.cc wrote:

On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:

So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.

"The lazy man's guide to SQL database design", but Jim Nasby.
;-)

Hehe... I was thinking the same thing. I've definately seen cases
where the use of surrogate keys verges on ridiculous. It hasn't
harmed the application, except it terms of complexity. It still works.
It still performs fine. The SQL queries are awful looking. :-)

Got an example?

That's where I would tend to draw the line. For me, I find
implementation and maintenance to be the most expensive part of my
applications. My data hasn't yet become large enough to make disk
space, compute resources, or I/O bandwidth a serious concern.

Which is exactly what my thought process is. If you mix surrogate and
non-surrogate keys, how do you know which table has which? Sure, while
you're actively writing the code it's not an issue, but what about 6
months later? What about if someone else picks up the code?

I know Josh was poking fun with his comment about me being lazy, but
lazy can make for better code. I can go back to code I wrote 3 years ago
and I know that 99% of tables will have something_id (where something is
almost certain to be the table name) as a surrogate key to join on;
there's no need for me to go and figure out what does and what doesn't
have a surrogate key. The 1% that don't fall into that generally aren't
an issue because they're normally very large tables that nothing joins
to.

There's actually an article floating around somewhere about how lazy
coders are good coders... :)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#24

Jim C. Nasby

jnasby@pervasive.com

almost 20 years ago

In reply to: Dann Corbit (#22)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 11:22:24AM -0800, Dann Corbit wrote:

In any of these either misspellings, changes of names, ownership or
even structure over time render the obvious useless as keys. There

are

techniques for detecting and reducing duplication but the point is

that

for any of these duplicates *can* be valid data.

Please point me out where, in the writings of E.F. Codd or in the SQL
Standard, it says that keys have to be immutable for the life of the

row.

Only do that for data that you care about. If you think that the data
has no value, there is no need to have a way to identify a row.

Erm... if you don't care, why are you storing it? :)

I was a DBA for a database for a company with many millions of customers
worldwide (e.g. the product registration table was 24 GB).

Their design had natural keys in it. It caused dozens of problems,
every single day.

I content that most people are not smart enough to decide when a natural
key is a good idea. The engineers that designed the database were
probably pretty smart, since it sort of worked and had thousands of
tables and hundreds of millions of rows in it. But one bad decision on
a natural key will cause literally millions of dollars of damage.

The primary defense I have heard so far is that the Oids are hard to
understand. They are nothing in comparison with understanding what to
do when you have 25 changes to primary keys on various tables every
single day.

Once you get used to Oids, I find it hard to believe that any
intelligent person finds them confusing. Confusion resulting from
having primary keys that are a moving target? Now that's confusion for
you.

Well, I wouldn't use OIDs as in the PostgreSQL OID, but I agree. If
nothing else an ID gives you a fallback... if you absolutely can't find
a customer (or whatever else) through natural keys, you ask them for
their customer ID/number, which has no reason to ever change.

BTW, if you want to see a mess*, take a look at the distributed.net stats
code, which unfortunately uses email as the means to identify
participants. It made perfect sense originally, anyone running the
client was bound to have an email address, and they all had to be
unique, right? Worked great until the first person contacted us
wondering how to change his email address in stats because he'd changed
ISPs. If you look at todays statscode (at least the database portion of
it) approximately 50% of it is there to deal with people retiring one
email address into another, and I'd say that 90%+ of the bugs are in
this code. Had we just required new users to register to get a nice
shiny unique numeric ID (or a unique username...), none of that code
would exist.

* note that I'm not trying to rag on any of the numerous people who've
been involved in the stats code over the years, but it is insightful to
look at some of the 'dumb mistakes' that have been made and the large
amount of pain that it's caused.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#25

Noname

mark@mark.mielke.cc

almost 20 years ago

In reply to: Jim C. Nasby (#23)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 01:48:18PM -0600, Jim C. Nasby wrote:

On Thu, Jan 19, 2006 at 02:01:14PM -0500, mark@mark.mielke.cc wrote:

On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:

So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.

"The lazy man's guide to SQL database design", but Jim Nasby.
;-)

Hehe... I was thinking the same thing. I've definately seen cases
where the use of surrogate keys verges on ridiculous. It hasn't
harmed the application, except it terms of complexity. It still works.
It still performs fine. The SQL queries are awful looking. :-)

Got an example?

Sure, but I have to be vague, because it's a company DB. :-)

The DB has a primary table, that maps the primary key to a surrogate
key. The surrogate key is used in several 1:1 and 1:N relationships.
Pretty straight forward. (I tend to put the primary key in the most
primary table that would have a 1:1 relationship, of which there is
one in this database - but whatever) The primary key is an identifier
used for all inputs and outputs to the application. It is used by
manual and automatic processes internal and external to the company.
Definately a primary key / surrogate key scenario.

The problem here, is that the primary key *is* a natural key. It
is generated to be unique, and it is immutable. There are no interfaces
provided to allow the rename of the key. It is a short character
string of 5 to 20 characters.

All queries to the table are joined with this primary key/surrogate
key table, to allow lookup by the primary key, for records only
identified by the surrogate key.

The database is only likely to have a few thousands records, with
the 1:N relationships not exceeding 5 or 10, and not recursive.
For performance, or disk space, it doesn't really matter which way
they went.

The confusion, though, of joining using a surrogate, that is
intended to be opaque (the value is never queried), ensures that
the program has no simple queries. All queries involve at least
one join.

I said almost ridiculous. It's not enough for me to complain, and
request a re-design. I don't really care what it does, as long as
it accepts my data, and allows me to query my data. But, it does
seem silly to me.

That's where I would tend to draw the line. For me, I find
implementation and maintenance to be the most expensive part of my
applications. My data hasn't yet become large enough to make disk
space, compute resources, or I/O bandwidth a serious concern.

Which is exactly what my thought process is. If you mix surrogate and
non-surrogate keys, how do you know which table has which? Sure, while
you're actively writing the code it's not an issue, but what about 6
months later? What about if someone else picks up the code?

It's usually pretty obvious, looking at a database diagram. You look
up the primary key, and see that it only shows up in one table. :-)

I know Josh was poking fun with his comment about me being lazy, but
lazy can make for better code. I can go back to code I wrote 3 years ago
and I know that 99% of tables will have something_id (where something is
almost certain to be the table name) as a surrogate key to join on;
there's no need for me to go and figure out what does and what doesn't
have a surrogate key. The 1% that don't fall into that generally aren't
an issue because they're normally very large tables that nothing joins
to.

I don't disagree with you. I just don't mind deciding to use a surrogate
key if I'm unsure, and not using a surrogate if it seems more effort than
gain.

There's actually an article floating around somewhere about how lazy
coders are good coders... :)

Dunno where it started, but that's one of the tenets of the developers
of Perl. Of course, with Perl 6, they admitted to having made quite a
few deisgn errors with Perl 5 and earlier... :-)

Cheers,
mark

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#26

Christopher Kings-Lynne

chriskl@familyhealth.com.au

almost 20 years ago

In reply to: Noname (#15)

Re: Surrogate keys (Was: enums)

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)

No then you have problems with identical twins :)

Chris

#27

Michael Glaesemann

grzm@myrealbox.com

almost 20 years ago

In reply to: Christopher Kings-Lynne (#26)

Re: Surrogate keys (Was: enums)

On Jan 20, 2006, at 10:50 , Christopher Kings-Lynne wrote:

Yes. Representation of the DNA is probably best. But - that's a
lot of
data to use as a key in multiple tables. :-)

No then you have problems with identical twins :)

And, looking forward, clones.

Michael Glaesemann
grzm myrealbox com

#28

Richard Huxton

dev@archonet.com

almost 20 years ago

In reply to: Dann Corbit (#22)

Re: Surrogate keys (Was: enums)

Dann Corbit wrote:

When the data changes, the problems generated are not just due to
repercussions related to the child and parent tables related through the
primary key.

Someone has an invoice, and they call in with a question. A combination
of their name and address was used as a primary key. They moved, and
sent in a forwarding address. The DBA was smart enough to design the
database to cascade results, so that there are no orphan records and we
have not compromised the structure of the database.
The customer calls in with a question about an old invoice.
"We have no record of that transaction."

Aside:
Even if not using name+address as a primary key, a separate record
should be kept of these details *at the time of the invoice* otherwise
you'll never be able to match up a printed invoice with its digital
source. Usually of course this is by inv_name, inv_address columns in
the invoice header, but it could be by some fancy temporal versioning on
client details.

--
Richard Huxton
Archonet Ltd

#29

Martijn van Oosterhout

kleptog@svana.org

almost 20 years ago

In reply to: Richard Huxton (#28)

Re: Surrogate keys (Was: enums)

On Fri, Jan 20, 2006 at 01:58:16PM +0000, Richard Huxton wrote:

Aside:
Even if not using name+address as a primary key, a separate record
should be kept of these details *at the time of the invoice* otherwise
you'll never be able to match up a printed invoice with its digital
source. Usually of course this is by inv_name, inv_address columns in
the invoice header, but it could be by some fancy temporal versioning on
client details.

Aside: Using name+address is terrible way for customers to identify
themselves. After you have a couple of experiences of ex-spouses trying
to hijack accounts you get fairly strict about what they have to know.
Bill identifiers, account identifiers, other surrogate keys are fine
but information that's reasonably general knowledge is nice, but not
enough.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#30

Mike Rylander

mrylander@gmail.com

almost 20 years ago

In reply to: Pollard, Mike (#14)

Re: Surrogate keys (Was: enums)

On 1/19/06, Pollard, Mike <mpollard@cincom.com> wrote:

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:

- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:

A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

Unless the person in question happens to be a chimera (yes, they do exist).

;-)

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

#31

Christopher Browne

cbbrowne@acm.org

almost 20 years ago

In reply to: Pollard, Mike (#14)

Re: Surrogate keys

On 1/19/06, Pollard, Mike <mpollard@cincom.com> wrote:

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:

- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:

A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

Unless the person in question happens to be a chimera (yes, they do exist).

;-)

Yeah, I saw that episode of CSI! :-)
--
"cbbrowne","@","gmail.com"
http://linuxdatabases.info/info/lsf.html
"Cat's motto: No matter what you've done wrong, always try to make it
look like the dog did it." -- Unknown

#32

Mike Rylander

mrylander@gmail.com

almost 20 years ago

In reply to: Christopher Browne (#31)

Re: Surrogate keys

On 1/21/06, Christopher Browne <cbbrowne@acm.org> wrote:

On 1/19/06, Pollard, Mike <mpollard@cincom.com> wrote:

Martijn van Oosterhout wrote:

Please provides natural keys for any of the following:

- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

Ahh, a challenge. Hmm, not sure about all of them, but here goes:

A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan. Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

Unless the person in question happens to be a chimera (yes, they do exist).

;-)

Yeah, I saw that episode of CSI! :-)

Heh. I didn't realize they did that already. I was thinking of the
show "I Am My Own Twin" from the Discovery Health Channel.

Couldn't find a link from the official page, but:
http://www.globalspin.com/mt/archives/000547.html .

--
"cbbrowne","@","gmail.com"
http://linuxdatabases.info/info/lsf.html
"Cat's motto: No matter what you've done wrong, always try to make it
look like the dog did it." -- Unknown

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

#33

Bruno Wolff III

bruno@wolff.to

almost 20 years ago

In reply to: Tom Lane (#9)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 00:06:41 -0500,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

The problem with SSN is that somebody other than you controls it.
If you are the college registrar, then you control the student's
registration number, and you don't have to change it. In fact, guess
what: you probably generated it in the same way as a surrogate key.

I work for a University and even the numbers assigned by us get changed on a
regular basis as it is very easy for people to get entered into the
system multiple times. (And for a while campus ids were SSNs by default and we
are still in the process of making them different for everyone.) There are
several effectively surrogate keys (campus id and emplid), but they don't map
1 to 1 to real people. I believe we keep a history of campus ids, and delete
emplids for duplicates.

#34

Bruno Wolff III

bruno@wolff.to

almost 20 years ago

In reply to: Noname (#15)

Re: Surrogate keys (Was: enums)

On Thu, Jan 19, 2006 at 09:53:11 -0500,
mark@mark.mielke.cc wrote:

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)

On a simple level, this would be a problem for twins.
There are other complications as well. People are going to have slightly
different DNA in different cells due to mutations. Though you could probably
do some averaging over a number of cells to get a single value.
For people that have had transplants, you could probably define something
for doing the sample for original material.