Searching data across tables, some large

Started by Robert Fitzpatrickalmost 19 years ago29 messagesgeneral
Jump to latest
#1Robert Fitzpatrick
lists@webtent.net

Thanks to Richard for the help earlier this week with performance
questions, once I was able to get pgsql and mssql using the same
resources and doing some tuning, I was able to get comparable results.

The issue still though, I have this view that I designed with the
thought in mind to provide all fields the user will want to search in
the PHP web app. However, I have found when joining a couple of large
tables, the view count reaches the 40+ million records, this does not
seem to work in pgsql (or in mssql), especially if distinct is needed.
Should I even be attempting this approach?

I have never used TSearch and wonder if that is the solution to this
type of search? From a quick read of some TSearch info I see indexes are
setup on a column basis in a table? So, I could create an index column
for say the first name and last name fields in a contact table, but
cannot create a field with information from different tables? Also, not
all fields are text that I need to search, some boolean for instance,
can I search TSearch index fields and other fields at the same time. Not
sure how TSearch works, just wanted to get an opinion that that may be
what I need before delving into it too much.

--
Robert

#2Ericson Smith
esconsult1@gmail.com
In reply to: Robert Fitzpatrick (#1)
Re: Searching data across tables, some large

Yeah, we've used Tsearch with joins and searches on other fields on
the tsearch table no problem. Tsearches are just another part of a
WHERE clause.

Regards
- Ericson Smith
Developer
http://www.funadvice.com

Show quoted text

On 5/23/07, Robert Fitzpatrick <lists@webtent.net> wrote:

Thanks to Richard for the help earlier this week with performance
questions, once I was able to get pgsql and mssql using the same
resources and doing some tuning, I was able to get comparable results.

The issue still though, I have this view that I designed with the
thought in mind to provide all fields the user will want to search in
the PHP web app. However, I have found when joining a couple of large
tables, the view count reaches the 40+ million records, this does not
seem to work in pgsql (or in mssql), especially if distinct is needed.
Should I even be attempting this approach?

I have never used TSearch and wonder if that is the solution to this
type of search? From a quick read of some TSearch info I see indexes are
setup on a column basis in a table? So, I could create an index column
for say the first name and last name fields in a contact table, but
cannot create a field with information from different tables? Also, not
all fields are text that I need to search, some boolean for instance,
can I search TSearch index fields and other fields at the same time. Not
sure how TSearch works, just wanted to get an opinion that that may be
what I need before delving into it too much.

--
Robert

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

#3Chuck D.
pgsql-list@nullmx.com
In reply to: Ericson Smith (#2)
Geographic data sources, queries and questions

Greetings all,

I have a couple issues regarding geographic names databases.

1) The first is this. I have 3 tables. Country, state and city. Country has
a country_id to identify a country, state has a state_id and country_id to
identify a state, and city has a city_id, state_id and country_id (for easy
reference) to identify it. I then have a table for users that stores their
city, state and country ID's along with other info about them.

My problem came recently when I questioned the integrity of the data and
needed to make some changes. I thought to myself that maybe storing the ID
wasn't as good as storing the ISO or FIPS 2 letter abbreviation. The only
problem the abbreviation could changed at some point by the regulating bodies
and all rows in all tables would need to be updated.

The question is, for the purposes of querying or searching is it better to
store and search a 2 byte integer that is indexed for country or state ID's,
or is it better to store and search a 2 byte CHAR abbreviation?

2) I've spent an accumulated total of around a month and a half trying to
consolidate geographic name data from several free sources on the net and
realize this isn't the best use of my time and errors will be had. Does
anyone know of a reliable source of geo data that isn't costly? Most want to
charge a server license, annual rate, etc. I'm not sure about the free
sources because one I used actually had mixed values in a column and drove me
nuts. I primarily need:

country
state
county if applicable
city
latitude
longitude

This is primarily input from an HTML form to calculate distances between
users.

Anyone who has any experience with geo name data I would appreciate hearing
your solution.

#4Robert Fitzpatrick
lists@webtent.net
In reply to: Ericson Smith (#2)
Re: Searching data across tables, some large

On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote:

Yeah, we've used Tsearch with joins and searches on other fields on
the tsearch table no problem. Tsearches are just another part of a
WHERE clause.

And can there be Tsearch fields built based on fields in different
tables? Where can I find the best docs for TSearch2? I'm looking for
information pertaining to how TSearch can help in my situation, instead
of building a view with a lot of joins causing absurd amount of rows
returned.

Or am I going to need to limit the tables a user can search at one time
regardless? For instance, seems if I join my table of clients with
contacts to return all the contacts with company info, no problem. I can
also join my clients with activity *or* comments table. But if I try to
left join all three of those tables in one view against clients, seems
to return way too many rows. Of course, I will not want all those rows,
but I can't get it to respond to queries. Meanwhile, I can create a view
with just the comments table joined to clients, while the count is over
1 million records, select statements execute quickly.

Is TSearch for me? Thanks for the advice!

--
Robert

#5George Pavlov
gpavlov@mynewplace.com
In reply to: Chuck D. (#3)
Re: Geographic data sources, queries and questions

seems hard to enforce integrity in your model. how are you going to
ensure that the user's city-state-country combo a valid one? (well, you
can, but it is a pain). ask yourself: can a city be in more than one
country? probably not (even if the name is the same it is not the same
city!). can a state be in more than one country? etc., etc.

seems much cleaner to have cities have a key to states, states to
countries. otherwise might as well just have a big denormalized table
and skip the whole relational thing...

numeric ids vs chars, when properly indexed, should perform about the
same (even if there is a small difference this is not something one
should really worry about; hey, there aren't even that many cities in
the world!)

i would go with a unique internal id (in fact that IS what i do) you can
store the FIPS/ISO code in a neighboring field, but i am not sure it is
good enough for a primary key.

Show quoted text

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D.
Sent: Wednesday, May 23, 2007 4:22 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Geographic data sources, queries and questions

Greetings all,

I have a couple issues regarding geographic names databases.

1) The first is this. I have 3 tables. Country, state and
city. Country has
a country_id to identify a country, state has a state_id and
country_id to
identify a state, and city has a city_id, state_id and
country_id (for easy
reference) to identify it. I then have a table for users
that stores their
city, state and country ID's along with other info about them.

My problem came recently when I questioned the integrity of
the data and
needed to make some changes. I thought to myself that maybe
storing the ID
wasn't as good as storing the ISO or FIPS 2 letter
abbreviation. The only
problem the abbreviation could changed at some point by the
regulating bodies
and all rows in all tables would need to be updated.

The question is, for the purposes of querying or searching is
it better to
store and search a 2 byte integer that is indexed for country
or state ID's,
or is it better to store and search a 2 byte CHAR abbreviation?

2) I've spent an accumulated total of around a month and a
half trying to
consolidate geographic name data from several free sources on
the net and
realize this isn't the best use of my time and errors will be
had. Does
anyone know of a reliable source of geo data that isn't
costly? Most want to
charge a server license, annual rate, etc. I'm not sure
about the free
sources because one I used actually had mixed values in a
column and drove me
nuts. I primarily need:

country
state
county if applicable
city
latitude
longitude

This is primarily input from an HTML form to calculate
distances between
users.

Anyone who has any experience with geo name data I would
appreciate hearing
your solution.

#6Robert Fitzpatrick
lists@webtent.net
In reply to: Robert Fitzpatrick (#4)
Re: Searching data across tables, some large

On Wed, 2007-05-23 at 19:48 -0400, Robert Fitzpatrick wrote:

On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote:

Yeah, we've used Tsearch with joins and searches on other fields on
the tsearch table no problem. Tsearches are just another part of a
WHERE clause.

And can there be Tsearch fields built based on fields in different
tables? Where can I find the best docs for TSearch2? I'm looking for
information pertaining to how TSearch can help in my situation, instead
of building a view with a lot of joins causing absurd amount of rows
returned.

Of course, what was I thinking! After reading through the TSearch stuff
I see the objective is to index multiple columns. There would be no way
to reference a unique record from an index of columns over multiple
tables.

So, it is looking like a build of the query string is going to have to
take place on whichever tables the fields are in while limiting what
fields the user can search (depending whether fields are in different
tables), that versus a view with everything is what I should be doing?

--
Robert

#7Jorge Godoy
jgodoy@gmail.com
In reply to: Chuck D. (#3)
Re: Geographic data sources, queries and questions

"Chuck D." <pgsql-list@nullmx.com> writes:

1) The first is this. I have 3 tables. Country, state and city. Country has
a country_id to identify a country, state has a state_id and country_id to
identify a state, and city has a city_id, state_id and country_id (for easy
reference) to identify it. I then have a table for users that stores their
city, state and country ID's along with other info about them.

I don't believe this is good design. You'll have to have a trigger or
something to verify that the country_id+state_id on the city table are
exactly equal to the country_id+state_id on the state table. If you
don't, you might have something like (using US city names...) "country:
USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New
York".

It isn't a problem of "any country and any state" on the city table, but
a problem of "this state inside that particular country". I'd drop the
country column.

My problem came recently when I questioned the integrity of the data and
needed to make some changes. I thought to myself that maybe storing the ID
wasn't as good as storing the ISO or FIPS 2 letter abbreviation. The only
problem the abbreviation could changed at some point by the regulating bodies
and all rows in all tables would need to be updated.

You have integrity problems because you denormalized your model too much
and tried to attach the same information on two different places without
requiring those to be equal.

The question is, for the purposes of querying or searching is it better to
store and search a 2 byte integer that is indexed for country or state ID's,
or is it better to store and search a 2 byte CHAR abbreviation?

It all depends: surrogate primary keys or ... :-) (old flame starter)

--
Jorge Godoy <jgodoy@gmail.com>

#8Chuck D.
pgsql-list@nullmx.com
In reply to: George Pavlov (#5)
Re: Geographic data sources, queries and questions

On Wednesday 23 May 2007 17:56, you wrote:

seems hard to enforce integrity in your model. how are you going to
ensure that the user's city-state-country combo a valid one? (well, you
can, but it is a pain). ask yourself: can a city be in more than one
country? probably not (even if the name is the same it is not the same
city!). can a state be in more than one country? etc., etc.

No, a city state combination belongs to one country as far as I know. I
really just left the country id in the city table as an easy means of getting
around while I was trying to understand the different data sources. I was
doing a lot of typing at the console and it was easier than joins :) It
would be dropped for sure as it adds another 4 MEGS to the table not
including indexes, and it violates BC normal form.

numeric ids vs chars, when properly indexed, should perform about the
same (even if there is a small difference this is not something one
should really worry about; hey, there aren't even that many cities in
the world!)

Ya, about 2 million city entries total.

i would go with a unique internal id (in fact that IS what i do) you can
store the FIPS/ISO code in a neighboring field, but i am not sure it is
good enough for a primary key.

OK, cool. I *prefer* the id as that is what I've been used to, but when I
used oracle in school we would have been taught to use the CHAR. I just
don't like using the CHAR because some data sets use FIPS and some use ISO
and both are bound to change. After all those are just abbreviations. Seems
like something to be referenced. I think I'll stick with the ID's instead of
CHARs then as they seem more comfortable.

Speaking of which, if you are using a similar database what source did you use
for geographic data? I'm having troubles with a reliable set.

Show quoted text

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D.
Sent: Wednesday, May 23, 2007 4:22 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Geographic data sources, queries and questions

Greetings all,

I have a couple issues regarding geographic names databases.

1) The first is this. I have 3 tables. Country, state and
city. Country has
a country_id to identify a country, state has a state_id and
country_id to
identify a state, and city has a city_id, state_id and
country_id (for easy
reference) to identify it. I then have a table for users
that stores their
city, state and country ID's along with other info about them.

My problem came recently when I questioned the integrity of
the data and
needed to make some changes. I thought to myself that maybe
storing the ID
wasn't as good as storing the ISO or FIPS 2 letter
abbreviation. The only
problem the abbreviation could changed at some point by the
regulating bodies
and all rows in all tables would need to be updated.

The question is, for the purposes of querying or searching is
it better to
store and search a 2 byte integer that is indexed for country
or state ID's,
or is it better to store and search a 2 byte CHAR abbreviation?

2) I've spent an accumulated total of around a month and a
half trying to
consolidate geographic name data from several free sources on
the net and
realize this isn't the best use of my time and errors will be
had. Does
anyone know of a reliable source of geo data that isn't
costly? Most want to
charge a server license, annual rate, etc. I'm not sure
about the free
sources because one I used actually had mixed values in a
column and drove me
nuts. I primarily need:

country
state
county if applicable
city
latitude
longitude

This is primarily input from an HTML form to calculate
distances between
users.

Anyone who has any experience with geo name data I would
appreciate hearing
your solution.

#9Chuck D.
pgsql-list@nullmx.com
In reply to: Jorge Godoy (#7)
Re: Geographic data sources, queries and questions

On Wednesday 23 May 2007 18:59, you wrote:

I don't believe this is good design. You'll have to have a trigger or
something to verify that the country_id+state_id on the city table are
exactly equal to the country_id+state_id on the state table. If you
don't, you might have something like (using US city names...) "country:
USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New
York".

It isn't a problem of "any country and any state" on the city table, but
a problem of "this state inside that particular country". I'd drop the
country column.

You are right, this is a bad design. The country_id on the city table has to
go.

#10Tilmann Singer
tils-pgsql@tils.net
In reply to: Chuck D. (#3)
Re: Geographic data sources, queries and questions

* Chuck D. <pgsql-list@nullmx.com> [20070524 01:26]:

2) I've spent an accumulated total of around a month and a half trying to
consolidate geographic name data from several free sources on the net and
realize this isn't the best use of my time and errors will be had. Does
anyone know of a reliable source of geo data that isn't costly? Most want to
charge a server license, annual rate, etc. I'm not sure about the free
sources because one I used actually had mixed values in a column and drove me
nuts. I primarily need:

country
state
county if applicable
city
latitude
longitude

This is primarily input from an HTML form to calculate distances between
users.

Anyone who has any experience with geo name data I would appreciate hearing
your solution.

We are using this data which seems to be fairly extensive and
accurate, and is free:

http://earth-info.nga.mil/gns/html/gis_countryfiles.htm

I haven't fully understood the meaning of all the fiels in there
however, we're using it only to compute alternative spellings for city
and country names and came up with these conditions that seem to
return the desired results:

For city alternatives:

select lower(full_name) as full_name from geo_names gn1 where gn1.ufi in
(select ufi from geo_names gn2 where lower(gn2.full_name)=:city
and gn2.fc='P')

For country alternatives:

select lower(full_name) as full_name from geo_names gn1 where
gn1.ufi in
(select ufi from geo_names gn2 where
lower(gn2.full_name)=:country and gn2.fc='A' and gn2.dsg='PCLI')

Til

#11Berend Tober
btober@ct.metrocast.net
In reply to: Tilmann Singer (#10)
Re: Geographic data sources, queries and questions

I don't believe this is good design. You'll have to

have a trigger or

something to verify that the country_id+state_id on the

city table are

exactly equal to the country_id+state_id on the state

table. If you

don't, you might have something like (using US city

names...) "country:

USA -> state: NY" -> "country: Zimbabwe -> state: NY ->

city: New

York".

It isn't a problem of "any country and any state" on the

city table, but

a problem of "this state inside that particular

country". I'd drop the

country column.

You are right, this is a bad design. The country_id on

the city table has to

go.

I'm not sure it is a bad design. Country has a country_id.
That's the primary key. State has a state_id, and exactly
one country, so really state has a compound primary key,
namely (country_id, state_id). And similarly to produce
relational integrity between state and city, city needs to
reference the state primary key, which means state has to
have all three (country_id, state_id, city_id) as it's
primary key.

This ties in with a subject dear to my heart and discussed
at great length starting here:

"http://archives.postgresql.org/pgsql-general/2006-08/msg00535.php&quot;

The tie-in is not that you necessarily need the infamous
"gap-less" sequence, but that you don't necessarily need
three sequences per se.

While it would be temptingly easy to simply declare all
three separately in each table as

country:
country_id SERIAL, ...

state:
country_id integer,
state_id SERIAL, ...

city:
country_id integer,
state_id integer,
city_id SERIAL, ...

with that naive approach, every row in state has a unique
state_id, and every row in city has a unique city_id. Then
you'll notice that values of country_id are repeated in
state, and state_id values are repeated in city.

And then you'll realize that really it is the combination of
(country_id, state_id) that defines a unique state, and
(country_id, state_id, city_id) that defines a unique city.
It would require the use of stored programs and triggers to
manage these compound keys. But that's what stored programs
and triggers are for.

If you allow the fact that two countries could lay claim to
the same geographic sub-region, then you need a separate
table for an n-m relation.

Then throw in postal codes. Not only can a city have
multiple postal codes, but a postal code can serve more than
one city. And the cities served might be in different
states! (I used to have an example of that, but I can't find
it right now.)

Same with telephone area codes. (You are going there,
eventually, right?)

#12John D. Burger
john@mitre.org
In reply to: Tilmann Singer (#10)
Re: Geographic data sources, queries and questions

Tilmann Singer wrote:

We are using this data which seems to be fairly extensive and
accurate, and is free:

http://earth-info.nga.mil/gns/html/gis_countryfiles.htm

We use that, but it is only non-US, so we combine it with this:

http://geonames.usgs.gov/domestic/download_data.htm

We also have a hodge-podge of other sources, but those are the main
ones. (By the way, we have found USGS to very amenable to dumping
their data in arbitrary ways. Those state files essentially try to
fit everything into a single CSV format, but they have given us other
custom dumps.)

Note that both of these main sources have multiple names for the same
location, so our schema is highly normalized - we have a separate
table for names (so the string "Springfield" occurs in only one
place :). Because we are interested in all sorts of geographic
entities, not just city/state/country, we have only a single table
for these, with fields for type, lat/long, primary name, and a few
other things. All other relationships are represented in separate
linking tables, using our internal IDs for locations and names, e.g.,
location_has_name, location_contained_in_location, etc. As far as
FIPS and ISO codes are concerned, we have a separate table mapping
(locationID, standards body) to codes.

We are interested in sharing this stuff, so I'd be happy to pass
along the schema and/or the data, although all of it is kind of beta.

- John D. Burger
MITRE

#13Chuck D.
pgsql-list@nullmx.com
In reply to: John D. Burger (#12)
Re: Geographic data sources, queries and questions

On Thursday 24 May 2007 13:02, John D. Burger wrote:

We also have a hodge-podge of other sources, but those are the main
ones. (By the way, we have found USGS to very amenable to dumping
their data in arbitrary ways. Those state files essentially try to
fit everything into a single CSV format, but they have given us other
custom dumps.)

Note that both of these main sources have multiple names for the same
location, so our schema is highly normalized - we have a separate
table for names (so the string "Springfield" occurs in only one
place :). Because we are interested in all sorts of geographic
entities, not just city/state/country, we have only a single table
for these, with fields for type, lat/long, primary name, and a few
other things. All other relationships are represented in separate
linking tables, using our internal IDs for locations and names, e.g.,
location_has_name, location_contained_in_location, etc. As far as
FIPS and ISO codes are concerned, we have a separate table mapping
(locationID, standards body) to codes.

We are interested in sharing this stuff, so I'd be happy to pass
along the schema and/or the data, although all of it is kind of beta.

John,

I'd be happy to take a look at the schema and data.

I decided to put together the USGS stuff, the maxmind free stuff and the
GeoNames project files and in the end I had countries with no states, states
with no cities and cities with no states. Some data sources said a country
had 40 states, another said it had 50. It was difficult to try and figure
out because I don't know geo stuff enough to verify it.

How can I go about getting a peek at the schema and the sources at least?

#14Noname
mkwong8888@hotmail.com
In reply to: Chuck D. (#13)
Re: Geographic data sources, queries and questions

You can try the free sample database from http://www.geodatasource.com

#15John D. Burger
john@mitre.org
In reply to: Chuck D. (#13)
Re: Geographic data sources, queries and questions

Chuck D. wrote:

I decided to put together the USGS stuff, the maxmind free stuff
and the
GeoNames project files and in the end I had countries with no
states, states
with no cities and cities with no states. Some data sources said a
country
had 40 states, another said it had 50. It was difficult to try and
figure
out because I don't know geo stuff enough to verify it.

Yeah, all of our sources data has various degrees of noise. There
were even locations mis-typed as =countries= in the official NGA
downloads - you'd think their validation would at least identify
spurious countries :). We developed a set of heuristics for deciding
when two locations (usually but not always from two different
sources) were in fact the same entity. This was an area that needed
more work, however, when the project ended. In addition, different
sources had made different ontological decisions about what was
what. For instance, does the US have 50 states - what about the US
Virgin Islands, etc?

This was a few years ago - if we were to start up again, I suspect we
would investigate working with whoever is behind geonames.org, as
they seem to have the same kind of goals we did. Anyway, I will send
our schema under separate cover, and I will investigate sending you
the data as well.

- John D. Burger
MITRE

#16brian
brian@zijn-digital.com
In reply to: John D. Burger (#15)
Re: Geographic data sources, queries and questions

John D. Burger wrote:

For instance, does the US have 50 states - what about the US Virgin Islands, etc?

Off-topic, but the US Virgin Islands are an "unincorporated United
States insular area" ("territory"--note the lowercase t). The Dept. Of
the Interrior addresses this in their FAQ:

http://www.doi.gov/oia/FAQ/FAQindex.htm#4

You'll be surprised and amazed at the number of US "terms of
sovereignty" that exist.

b

#17John D. Burger
john@mitre.org
In reply to: brian (#16)
Re: Geographic data sources, queries and questions

Off-topic, but the US Virgin Islands are an "unincorporated United
States insular area" ("territory"--note the lowercase t). The Dept.
Of the Interrior addresses this in their FAQ:

http://www.doi.gov/oia/FAQ/FAQindex.htm#4

You'll be surprised and amazed at the number of US "terms of
sovereignty" that exist.

Yah, that's my point - some data sources might lump all these
together as state/province level entities, and some might not.

- John D. Burger
MITRE

#18Michael Glaesemann
grzm@seespotcode.net
In reply to: Berend Tober (#11)
Re: Geographic data sources, queries and questions

On May 24, 2007, at 8:57 , btober wrote:

I'm not sure it is a bad design. Country has a country_id.
That's the primary key. State has a state_id, and exactly
one country, so really state has a compound primary key,
namely (country_id, state_id).

While each state may have a single state_id and a single country,
that does not imply a composite (or compound) primary key. There's
been a lot written on database normalization, so I won't go into
depth here, but a quick way to think about it is how each row is
uniquely identified. For example;

Is each state uniquely identified by state_id? If so, that's a
(primary) key for the states table. That's often what people are
trying to do when they make a table of the form:

-- Listing 1
CREATE TABLE states
(
state_id INTEGER PRIMARY KEY
, state_name TEXT NOT NULL
);

If you are only dealing with one country, each state is (hopefully)
uniquely identified by its name as well, so you could add a UNIQUE
constraint to the state_name column, e..g,

-- Listing 2
CREATE TABLE states
(
state_id INTEGER PRIMARY KEY
, state_name TEXT NOT NULL UNIQUE
);

From a logical point of view, PRIMARY KEY is equivalent to NOT NULL
UNIQUE, so there's no logical difference between Listing 2 and the
following:

-- Listing 3
CREATE TABLE states
(
state_id INTEGER NOT NULL UNIQUE
, state_name TEXT PRIMARY KEY
);

The state_id column is what is often referred to as a surrogate key:
it holds no information that really identifies the state in any real
sense. One integer is as good as another to identify the state. On
the other hand, the state_name column *is* associated with each state
in a real sense. Assigning arbitrary names to states would be less
than useful.

If you're handling more than one country, you'll most likely want to
associate the states with their respective countries.

-- Listing 4
CREATE TABLE countries
(
country_id INTEGER PRIMARY KEY
);

CREATE TABLE states
(
state_id INTEGER PRIMARY KEY
, state_name TEXT NOT NULL
, country_id INTEGER NOT NULL
REFERENCES countries (country_id)
);

Note that there's no UNIQUE constraint on state_name. You may have
more than one state with the same state_name around the world so you
may want to make sure that for each country, each state_name is unique:

-- Listing 5
CREATE TABLE states
(
state_id INTEGER PRIMARY KEY
, state_name TEXT NOT NULL
, country_id INTEGER NOT NULL
REFERENCES countries (country_id)
, UNIQUE (country_id, state_name)
);

And similarly to produce
relational integrity between state and city, city needs to
reference the state primary key, which means state has to
have all three (country_id, state_id, city_id) as it's
primary key.

While each city does belong to a given state, each state in turn is
associated with a given country. To find out what country a given
city belongs to you'd just join through the states table. For example:

-- Listing 6
CREATE TABLE cities
(
city_id INTEGER PRIMARY KEY
, city_name TEXT NOT NULL
, state_id INTEGER NOT NULL
REFERENCES states (state_id)
, UNIQUE (state_id, city_name)
);

I've also gone ahead and provided a UNIQUE constraint to prevent
city_name duplicates in the same state.

A PRIMARY KEY constraint of the form PRIMARY KEY (country_id,
state_id, city_id) would mean that the for each country_id and
state_id combination each city_id is unique. This means you could
potentially have the same city in multiple states in the same country
or in various countries and states. And there's nothing to prevent
something along the lines of (Mexico City, Nebraska, Canada). Note
that (Omaha, Nebraska, United States of America) would happily exist
in the same cities table!

To find the countries for each city:

-- Listing 7

SELECT city_name, state_name, country_name
FROM cities
NATURAL JOIN states
NATURAL JOIN countries;

While it would be temptingly easy to simply declare all
three separately in each table as

country:
country_id SERIAL, ...

state:
country_id integer,
state_id SERIAL, ...

city:
country_id integer,
state_id integer,
city_id SERIAL, ...

with that naive approach, every row in state has a unique
state_id, and every row in city has a unique city_id. Then
you'll notice that values of country_id are repeated in
state, and state_id values are repeated in city.

This "naive" approach (other than adding country_id to the city
table) actually looks like proper normalization. The repetition you
have here is just providing the country for each state and the state
for each city. That's not duplication of information, if you want to
associate states with countries and cities with states.

And then you'll realize that really it is the combination of
(country_id, state_id) that defines a unique state, and
(country_id, state_id, city_id) that defines a unique city.
It would require the use of stored programs and triggers to
manage these compound keys. But that's what stored programs
and triggers are for.

As above, if you've got your database schema designed properly, you
don't need any stored procedures or triggers (other than those
provided under the covers by the foreign keys) to maintain the proper
referential integrity. Each city has a unique state (the state_id
column) and a unique country (joined through the states table). But
enforcing (city_id, state_id, country_id) uniqueness allows all kinds
of city/state/country mismatches.

If you allow the fact that two countries could lay claim to
the same geographic sub-region, then you need a separate
table for an n-m relation.

This could be interesting :) Especially if the regions don't share
exactly the same borders! :)

Then throw in postal codes. Not only can a city have
multiple postal codes, but a postal code can serve more than
one city.

-- Listing 8

CREATE TABLE postal_codes
(
postal_code TEXT PRIMARY KEY
);

CREATE TABLE city_postal_codes
(
city_id INTEGER NOT NULL
REFERENCES cities (city_id)
, postal_code TEXT NOT NULL
REFERENCES postal_codes (postal_code)
, PRIMARY KEY (city_id, postal_code)
);

And the cities served might be in different
states! (I used to have an example of that, but I can't find
it right now.)

In Listing 8 there's nothing enforcing unique (state_id, postal_code)
associations, so you wouldn't run into trouble there. If you wanted
to find out which postal codes service which states you can use a join:

-- Listing 9

SELECT DISTINCT state_name, postal_code
FROM states
NATURAL JOIN city_postal_codes;

Same with telephone area codes. (You are going there,
eventually, right?)

Given the mobility of telephone numbers nowadays, you may not be
concerned with strict associations with telephone numbers,
subscribers, cities, and states.

Anyway, this has ended up much longer than I intended, but I didn't
want this to go unanswered. I've found Chris Date's books very
helpful, in particular "Introduction to Database Systems"[1](http://www.amazon.com/Introduction-Database-Systems-Eighth/dp/ 0321197844/) and
"Database in Depth: Relational Theory for Practice"[2](http://www.amazon.com/Database-Depth-Relational-Theory- Practitioners/dp/0596100124/).

Hope this helps.

Michael Glaesemann
grzm seespotcode net

[1]: (http://www.amazon.com/Introduction-Database-Systems-Eighth/dp/ 0321197844/)
0321197844/)
[2]: (http://www.amazon.com/Database-Depth-Relational-Theory- Practitioners/dp/0596100124/)
Practitioners/dp/0596100124/)

#19Oliver Elphick
olly@lfix.co.uk
In reply to: Michael Glaesemann (#18)
Re: Geographic data sources, queries and questions

On Tue, 2007-05-29 at 13:49 -0500, Michael Glaesemann wrote:

If you're handling more than one country, you'll most likely want to
associate the states with their respective countries.

-- Listing 4
CREATE TABLE countries
(
country_id INTEGER PRIMARY KEY
);

CREATE TABLE states
(
state_id INTEGER PRIMARY KEY
, state_name TEXT NOT NULL
, country_id INTEGER NOT NULL
REFERENCES countries (country_id)
);

Note that there's no UNIQUE constraint on state_name. You may have
more than one state with the same state_name around the world so you
may want to make sure that for each country, each state_name is
unique:

You have assumed that state codes are unique integers, but for a
worldwide database that is probably a bad design. The USA knows its
states by two-letter codes, as does India and one should surely not
invent a new set of codes for them. I would make this field a
VARCHAR(3) with an upper-case constraint. Furthermore, these codes are
not going to be unique. For instance MH is the US abbreviation for the
Marshall Islands [US Post Office] and also the Indian abbreviation for
Maharashtra [Wikipedia]. In such a case I would always make the country
code part of the primary key and not just an attribute. Again this
saves your having to invent a new set of codes when one exists already.

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
Do you want to know God? http://www.lfix.co.uk/knowing_god.html

#20John D. Burger
john@mitre.org
In reply to: Oliver Elphick (#19)
Re: Geographic data sources, queries and questions

Oliver Elphick wrote:

You have assumed that state codes are unique integers, but for a
worldwide database that is probably a bad design. The USA knows its
states by two-letter codes, as does India and one should surely not
invent a new set of codes for them. I would make this field a
VARCHAR(3) with an upper-case constraint.

In fact, the US postal codes are not what most govt. data sources use
- they are mandated to use FIPS codes, which are numeric and are not
guaranteed to be stable!!!

Furthermore, these codes are
not going to be unique. For instance MH is the US abbreviation for
the
Marshall Islands [US Post Office] and also the Indian abbreviation for
Maharashtra [Wikipedia]. In such a case I would always make the
country
code part of the primary key and not just an attribute. Again this
saves your having to invent a new set of codes when one exists
already.

Even ISO country codes are not guaranteed to be stable - I think
Yugoslavia is one example where a code has been recycled recently.
As I said, we found the simplest approach was to use our own internal
IDs for these things, and have a table mapping these to the codes
used in various standards.

- John D. Burger
MITRE

#21Michael Glaesemann
grzm@seespotcode.net
In reply to: Oliver Elphick (#19)
#22Michael Glaesemann
grzm@seespotcode.net
In reply to: John D. Burger (#20)
#23Ron Johnson
ron.l.johnson@cox.net
In reply to: Michael Glaesemann (#22)
#24John D. Burger
john@mitre.org
In reply to: Ron Johnson (#23)
#25Bruce Momjian
bruce@momjian.us
In reply to: John D. Burger (#24)
#26PFC
lists@peufeu.com
In reply to: Bruce Momjian (#25)
#27Ron Johnson
ron.l.johnson@cox.net
In reply to: John D. Burger (#24)
#28Michael Glaesemann
grzm@seespotcode.net
In reply to: Bruce Momjian (#25)
#29Chuck D.
pgsql-list@nullmx.com
In reply to: John D. Burger (#12)