indexes
I notice a lot of places where people use the approach of creating an index and
a unique key like:
CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)
instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)
If the name is NEVER going to change, is there any advantage to doing this?
If there are many-to-many reference tables (like name-to-friends) is this any
different?
I've seen this a lot, but I've always assumed that with the condition that
'name' would NEVER change, there was no advantage.
I asked this question here awhile ago. It's a fairly common question,
and it's known as the surrogate vs natural key debate.
Using a natural key has the advantage of performance. With a surrogate
key, most RDBMS systems will have to maintain two indexes. Natural keys
can also make your database more readable, and can eliminate the need to
do joins for foreign keys in some cases.
Surrogate keys are useful because you can very easily change your data
structure with a bit less SQL magick. A lot of Object Relational
Mappers always create surrogate keys, too, although I suspect that's
mostly a way to make the ORM more portable by guaranteeing that a
primary key will exist.
The only other time surrogate keys are very useful is when all your
candidate keys have values that change fairly often, since the primary
key ought to be as stable as possible.
Some developers also feel more comfortable with an id field. Having
that metadata feels like a safety net for some reason.
--
Brandon Aiken
CS/IT Systems Engineer
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Allison
Sent: Friday, November 24, 2006 9:54 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] indexes
I notice a lot of places where people use the approach of creating an
index and
a unique key like:
CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)
instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)
If the name is NEVER going to change, is there any advantage to doing
this?
If there are many-to-many reference tables (like name-to-friends) is
this any
different?
I've seen this a lot, but I've always assumed that with the condition
that
'name' would NEVER change, there was no advantage.
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
On Fri, 2006-11-24 at 09:54 -0500, Tom Allison wrote:
I notice a lot of places where people use the approach of creating an index and
a unique key like:CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)If the name is NEVER going to change, is there any advantage to doing this?
If there are many-to-many reference tables (like name-to-friends) is this any
different?
THe point of the first table is to have a artificial key that allows
easy access to the row.
It is easier to say: select * from foo where id = 5;
I've seen this a lot, but I've always assumed that with the condition that
'name' would NEVER change, there was no advantage.
Technically, it also violates normal form as your primary key should be
on data that is representative. Although this:
CREATE TABLE users ( id SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)
Would make more sense because id is representative of the users.id which
is representative from an application stand point.
Sincerely,
Joshua D. Drake
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
It depends how it's going to be used. If you are going to reference
this table in other tables a lot and/or rarely care about what the
name actually is, then the two-column approach is going to be more
efficient. Numbers are smaller and easier to compare than strings.
On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
Show quoted text
I notice a lot of places where people use the approach of creating
an index and a unique key like:CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)If the name is NEVER going to change, is there any advantage to
doing this?
If there are many-to-many reference tables (like name-to-friends)
is this any different?I've seen this a lot, but I've always assumed that with the
condition that 'name' would NEVER change, there was no advantage.---------------------------(end of
broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
But that requires that you haul an artificial construct around.
On 11/24/06 12:38, Ben wrote:
It depends how it's going to be used. If you are going to reference this
table in other tables a lot and/or rarely care about what the name
actually is, then the two-column approach is going to be more efficient.
Numbers are smaller and easier to compare than strings.On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
I notice a lot of places where people use the approach of creating an
index and a unique key like:CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)If the name is NEVER going to change, is there any advantage to doing
this?
If there are many-to-many reference tables (like name-to-friends) is
this any different?I've seen this a lot, but I've always assumed that with the condition
that 'name' would NEVER change, there was no advantage.
- --
Ron Johnson, Jr.
Jefferson LA USA
Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p
nMmJ64MHVNfE91EZIsJNwts=
=piIg
-----END PGP SIGNATURE-----
Yes, it does. So of course it depends on how you use it to know
what's going to be more efficient. For instance, if the rows in this
table contain strings of more than a few bytes, and more than a
couple tables reference this table with a foreign key, then you will
quickly start to save space by using a numeric primary key, even if
it is an artificial construct.
For the kind of work I find myself doing, it's rare that it would be
more efficient to not have the artificial construct. But that doesn't
mean one is always better than the other.
On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote:
Show quoted text
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1But that requires that you haul an artificial construct around.
On 11/24/06 12:38, Ben wrote:
It depends how it's going to be used. If you are going to
reference this
table in other tables a lot and/or rarely care about what the name
actually is, then the two-column approach is going to be more
efficient.
Numbers are smaller and easier to compare than strings.On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
I notice a lot of places where people use the approach of
creating an
index and a unique key like:CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)If the name is NEVER going to change, is there any advantage to
doing
this?
If there are many-to-many reference tables (like name-to-friends) is
this any different?I've seen this a lot, but I've always assumed that with the
condition
that 'name' would NEVER change, there was no advantage.- --
Ron Johnson, Jr.
Jefferson LA USAIs "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p
nMmJ64MHVNfE91EZIsJNwts=
=piIg
-----END PGP SIGNATURE--------------------------------(end of
broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that
your
message can get through to the mailing list cleanly
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
True. That doesn't mean I like it...
Real-life example: We design and run customer service centers for
many toll roads in the US Northeast.
So, we have a set of tables like this:
T_AGENCY (
AGENCY_ID INTEGER PRIMARY KEY,
AGENCY_CODE CHAR(2),
AGENCY_NAME CHAR(40),
etc,
etc );
T_PLAZA (
PLAZA_ID INTEGER PRIMARY KEY,
EXTERNAL_PLAZA_ID CHAR(5),
PLAZA_NAME CHAR(40),
AGENCY_ID INTEGER FOREIGN KEY REFERENCES T_AGENCY.AGENCY_ID,
etc,
etc );
T_LANE (
LANE_ID INTEGER PRIMARY KEY,
EXTERNAL_LANE_ID CHAR(5),
LANE_NAME CHAR(40),
PLAZA_ID INTEGER FOREIGN KEY REFERENCES T_PLAZA.PLAZA_ID,
etc,
etc);
So, in our various transaction and summary tables, we have the
LANE_ID column instead of natural key AGENCY_CODE,
EXTERNAL_PLAZA_ID, EXTERNAL_LANE_ID.
Since we have multiple projects, each with their own "database", and
each database is federated (because they are so large), each actual
database must keep a copy of T_AGENCY, T_PLAZA & T_LANE. The
agencies pass transaction data around, since they all use the same
brand of toll tag, and motorists can use their tag at any other
agency's lanes and have it "work", so if the synthetic keys of these
tables were to ever get out of sync, Bad Things would happen.
On 11/24/06 18:42, Ben wrote:
Yes, it does. So of course it depends on how you use it to know what's
going to be more efficient. For instance, if the rows in this table
contain strings of more than a few bytes, and more than a couple tables
reference this table with a foreign key, then you will quickly start to
save space by using a numeric primary key, even if it is an artificial
construct.For the kind of work I find myself doing, it's rare that it would be
more efficient to not have the artificial construct. But that doesn't
mean one is always better than the other.On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote:
But that requires that you haul an artificial construct around.
On 11/24/06 12:38, Ben wrote:
It depends how it's going to be used. If you are going to reference this
table in other tables a lot and/or rarely care about what the name
actually is, then the two-column approach is going to be more efficient.
Numbers are smaller and easier to compare than strings.On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
I notice a lot of places where people use the approach of creating an
index and a unique key like:CREATE TABLE foo (
idx SERIAL PRIMARY KEY,
name varchar(32) UNIQUE NOT NULL
)instead of
CREATE TABLE foo (
name varchar(32) PRIMARY KEY
)If the name is NEVER going to change, is there any advantage to doing
this?
If there are many-to-many reference tables (like name-to-friends) is
this any different?I've seen this a lot, but I've always assumed that with the condition
that 'name' would NEVER change, there was no advantage.
- --
Ron Johnson, Jr.
Jefferson LA USA
Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
iD4DBQFFZ69TS9HxQb37XmcRAgAAAJUT1HnH/ocUKLxbow6GAg2Gz1wpAJwPQuJV
ZodGoV0BF2NgKn2YTrEoSQ==
=Y+x1
-----END PGP SIGNATURE-----
Ben wrote:
Yes, it does. So of course it depends on how you use it to know what's
going to be more efficient. For instance, if the rows in this table
contain strings of more than a few bytes, and more than a couple tables
reference this table with a foreign key, then you will quickly start to
save space by using a numeric primary key, even if it is an artificial
construct.For the kind of work I find myself doing, it's rare that it would be
more efficient to not have the artificial construct. But that doesn't
mean one is always better than the other.
So let me see if I understand this correctly.
If the real-world primary key is large (say up to 100 characters in length) then
the disadvantage is that you are duplicating this referenced key in several
other tables, each element taking up 100 characters. Space is wasted when
compared to int4 ID's. But not really sure if this is a performance problem for
SELECT except for the space required (varchar(128) vs. int4).
Having two keys, a primary_key of int4 and a unique key of varchar(128) would be
very ugly on INSERT/DELETE/UPDATE queries because of the index overhead.
One table may have rows on the order of 100's.
Another table will be 1,000,000.
The many-to-many join would be.. 100,000,000's
And maybe there I would have need for smaller physical index variable types...
Tom,
If the real-world primary key is large (say up to 100 characters in length)
then
the disadvantage is that you are duplicating this referenced key in
several
other tables, each element taking up 100 characters. Space is wasted when
compared to int4 ID's. But not really sure if this is a performance
problem for
SELECT except for the space required (varchar(128) vs. int4).
What is worth a try is to check an md5 or sha hash of those 100chars and use
that as a key.
with a proper hashing algorythm (proper= fitting to your data) collisions
should not happen; and you have a shorter key to connect the tables.
best wishes
harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.
On Sat, Nov 25, 2006 at 12:14:11PM -0500, Tom Allison wrote:
If the real-world primary key is large (say up to 100 characters in length)
then the disadvantage is that you are duplicating this referenced key in
several other tables, each element taking up 100 characters. Space is
wasted when compared to int4 ID's. But not really sure if this is a
performance problem for SELECT except for the space required (varchar(128)
vs. int4).
Well, it kinda sucks for joining because comparing strings may take
tens to hundreds of times as long as compairng integers. It's not just
byte-wise comparison but you have to be locale sensetive about it.
My main problem with using any kind of string as "natural key" is that
once you start passing it around you have worry about the encoding of
said string and when it goes a round-trip to a client, will what you
get back still be the same? If you can guarentee ASCII you might be ok,
but otherwise... Integers have none of these issues.
Having two keys, a primary_key of int4 and a unique key of varchar(128)
would be very ugly on INSERT/DELETE/UPDATE queries because of the index
overhead.
In general my tables are queried several orders of magnitude more often
than they are updated, so index update cost isn't all that relevent.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
Well, it kinda sucks for joining because comparing strings may take
tens to hundreds of times as long as compairng integers. It's not just
byte-wise comparison but you have to be locale sensetive about it.My main problem with using any kind of string as "natural key" is that
once you start passing it around you have worry about the encoding of
said string and when it goes a round-trip to a client, will what you
get back still be the same? If you can guarentee ASCII you might be ok,
but otherwise... Integers have none of these issues.
That sounds like there would be a purpose for a locale-less type of
string, as ugly as that is.
That's not a feature request, just something to digest ;)
Having two keys, a primary_key of int4 and a unique key of varchar(128)
would be very ugly on INSERT/DELETE/UPDATE queries because of the index
overhead.In general my tables are queried several orders of magnitude more often
than they are updated, so index update cost isn't all that relevent.
Indeed. Basically if expensive calculations can be pre-calculated I tend
to move them into triggers that store the pre-calculated result. Usually
there's only a handfull of people modifying data, while there are many
people requesting it.
Regards,
--
Alban Hertroys
alban@magproductions.nl
magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede
// Integrate Your World //