Large table search question

Started by John Wells · almost 22 years ago · 9 messages · general
#1 John Wells
jb@sourceillustrated.com

Guys,

I have a general question about designing databases for large data sets.

I was speaking with a colleague about an application we're preparing to
build. One of the application's tables will potentially contain 2 million
or more names, with (at least) the fields first_name, last_name,
middle_name and prefix.

A common lookup the application will require is the full name, so prefix +
first_name + middle_name + last_name.

My friend's suggestion was to create a "lookup field" in the table itself,
which would contain a concatenation of these fields created during insert.
So, for each record, we'd have each individual field and then a
full_name field that would contain the combination of the individual fields.
His argument is that this will make lookups in this manner extremely fast
and efficient.

I agree with his assertion, but get the feeling that this is sort of an
ugly design. Would a compound index on these fields really be less
efficient?
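For concreteness, the two designs under discussion might be sketched like this (table, column, and index names are illustrative only):

```sql
-- Option A: denormalized lookup field, filled at insert time
CREATE TABLE people (
    prefix      text,
    first_name  text,
    middle_name text,
    last_name   text,
    full_name   text  -- concatenation of the four fields above
);

INSERT INTO people (prefix, first_name, middle_name, last_name, full_name)
VALUES ('Mr', 'John', 'Q', 'Public', 'Mr John Q Public');

-- Option B: no extra column, just a compound index
CREATE INDEX people_name_idx
    ON people (prefix, first_name, middle_name, last_name);
```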

Thanks for your help!

John

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Wells (#1)
Re: Large table search question

"John Wells" <jb@sourceillustrated.com> writes:

A common lookup the application will require is the full name, so prefix +
first_name + middle_name + last_name.

My friend's suggestion was to create a "lookup field" in the table itself,
which would contain a concatenation of these fields created during insert.
So, for each record, we'd have each individual field and then a
full_name field that would contain the combination of the individual fields.
His argument is that this will make lookups in this manner extremely fast
and efficient.

Not unless you then add an index on that field, which would imply doubly
redundant storage of the data (primary fields, lookup field, lookup
field's index).

You don't actually need the lookup field in Postgres: you can create the
computed index directly. For instance

create index fooi on foo ((first_name || middle_name || last_name));

select * from foo
where (first_name || middle_name || last_name) = 'JohnQPublic';

This is still kinda grim on storage space, but at least it's 2x not 3x.
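One caveat with plain concatenation: distinct names can compare equal, since 'JohnQ' || 'Public' and 'John' || 'QPublic' produce the same string. A delimited variant of the same expression (fooi2 is an illustrative name) sidesteps that:

```sql
-- Delimiters keep adjacent fields from running together
create index fooi2 on foo ((first_name || '|' || middle_name || '|' || last_name));

select * from foo
where (first_name || '|' || middle_name || '|' || last_name) = 'John|Q|Public';
```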

regards, tom lane

#3 Richard Huxton
dev@archonet.com
In reply to: John Wells (#1)
Re: Large table search question

John Wells wrote:

Guys,

I have a general question about designing databases for large data sets.

I was speaking with a colleague about an application we're preparing to
build. One of the application's tables will potentially contain 2 million
or more names, containing (at least) the fields first_name, last_name,
middle_name and prefix.

A common lookup the application will require is the full name, so prefix +
first_name + middle_name + last_name.

My friend's suggestion was to create a "lookup field" in the table itself,
which would contain a concatenation of these fields created during insert.
So, for each record, we'd having each individual field and then a
full_name field that would contain the combination of the ind. fields.
His argument is that this will make lookups in this manner extremely fast
and efficient.

Might, might not. No figures to back up his argument. It'll certainly
make updates slower and less efficient. In fact, since each row will
store the data twice you'll get fewer rows per disk page, which means
(potentially) more disk reads when you need to fetch several rows.

I agree with his assertion, but get the feeling that this is sort of an
ugly design. Would a compound index on these fields really be less
efficient?

Doubtful, I'd certainly not try his solution until I'd tried the simple
way first.

If you really want to try your friend's approach on PG you can build a
functional index. As of 7.4, these can be arbitrary expressions rather
than just function calls, so you can do something like:

CREATE INDEX my_idx_1 ON table1 ( prefix || ' ' || first_name ...);

If you're using 7.3.x you'll need to wrap that expression in a function
and index the function instead.
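For 7.3.x, that wrapping might look like this (full_name_of is an illustrative name; note the doubled single quotes in the body, since dollar quoting arrived later):

```sql
-- 7.3-style functional index: index a call to an immutable function
CREATE FUNCTION full_name_of(text, text, text, text) RETURNS text AS '
    SELECT $1 || '' '' || $2 || '' '' || $3 || '' '' || $4;
' LANGUAGE SQL IMMUTABLE;

CREATE INDEX my_idx_2 ON table1 (full_name_of(prefix, first_name, middle_name, last_name));
```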

In your case though, I'd just build a compound index and leave it at that.
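The compound-index route, sketched (name_idx is illustrative):

```sql
CREATE INDEX name_idx ON table1 (last_name, first_name, middle_name, prefix);

-- The planner can use name_idx when the leading columns are constrained:
SELECT * FROM table1
WHERE last_name = 'Public' AND first_name = 'John';
```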

--
Richard Huxton
Archonet Ltd

#4 Stijn Vanroye
s.vanroye@farcourier.com
In reply to: Richard Huxton (#3)
Re: Large table search question

I don't want to butt in, I just find this an interesting discussion and would like to add my 2 cents:

I have read this in the manual (PostgreSQL 7.4beta4 documentation, Chapter 11.3 Multicolumn Indexes):
Quote:
"Multicolumn indexes should be used sparingly. Most of the time, an index on a single column is sufficient and saves space and time. Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized."
This makes me doubt how useful multi-column indexes are for performance. Furthermore, the same chapter states that a multicolumn index will only be used by the planner if the fields of the index are combined with the AND operator in the WHERE clause of your select.

We had a table with 6 million+ records, and a few tests with EXPLAIN revealed that none of the multi-column indexes were actually used! This despite regular ANALYZE runs, and the data never changes (no mutations).

I don't seem to fully grasp the above. Am I better off using several single-field indexes, or do multicolumn indexes offer an advantage? If so, in which case? Just made me wonder...

Regards,

Stijn Vanroye


#5 Richard Huxton
dev@archonet.com
In reply to: Stijn Vanroye (#4)
Re: Large table search question

Stijn Vanroye wrote:

I don't want to but in, I just find this an interesting discussion
and would like to add my 2 cents:

I have read this in the manual (PostgreSQL 7.4beta4 documentation,
Chapter 11.3 Multicolumn Indexes). Quote: "Multicolumn indexes should
be used sparingly. Most of the time, an index on a single column is
sufficient and saves space and time. Indexes with more than three
columns are unlikely to be helpful unless the usage of the table is
extremely stylized." This makes me doubt how useful multi-column
indexes are for performance. Furthermore, the same chapter states that
a multicolumn index will only be used by the planner if the fields of
the index are combined with the AND operator in the WHERE clause of
your select.

We had a table with 6 million+ records, and a few tests with EXPLAIN
revealed that none of the multi-column indexes were actually used!
This despite regular ANALYZE runs, and the data never changes
(no mutations).

Indeed - in many cases the additional costs of keeping a multi-column
index up to date, and of reading it outweigh the benefits on the few
queries that actually use them.

Looking at John's example, if he defined an index on (first_name,
last_name) then ordering by last_name can't ever use that index. If you
execute a lot of queries for last_name="Smith" AND first_name="John"
then it might well help, since there are a lot of "Smith"s to choose
from. On the other hand, my last_name is "Huxton" and there aren't many
of those in the phone book, so if your data has a lot of "Huxton"s
rather than "Smith"s then you might just want an index on last_name.
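Which plan the planner actually picks is easy to check directly (table name illustrative):

```sql
-- EXPLAIN ANALYZE shows whether an index scan or a sequential scan was chosen
EXPLAIN ANALYZE
SELECT * FROM table1
WHERE last_name = 'Smith' AND first_name = 'John';
```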

I don't seem to grasp the full meaning of the above. Am I better of
using several single field indices, or do mulitcolumn indices offer
an advantage? If so in which case? Just made me wander...

There's really no alternative to testing. The statistics tables are very
useful here. Unless you have good reason not to, always turn
statistics gathering on, and take snapshots regularly to keep an eye on
where PG is exerting the most effort.
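One such snapshot, assuming the standard pg_stat_user_indexes view:

```sql
-- Per-index usage counters; rarely-scanned indexes are deletion candidates
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```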
--
Richard Huxton
Archonet Ltd

#6 Stijn Vanroye
s.vanroye@farcourier.com
In reply to: Richard Huxton (#5)
Re: Large table search question[Scanned]

Thanks for the reply. I was afraid it would come down to testing each individual situation.
The table I mentioned (6 million+ records) actually is a phonebook, and searching and filtering is possible on almost any combination of fields. So there's an index on each individual field now and that's it. It works relatively fast now anyway.

Regards.


#7 Richard Huxton
dev@archonet.com
In reply to: Stijn Vanroye (#6)
Re: Large table search question[Scanned]

Stijn Vanroye wrote:

Thanks for the reply. I was afraid it would come down to testing each
individual situation. The table I mentioned (6 million+ records)
actually is a phonebook. And searching and filtering is possible on
almost any combination of fields. So there's an index on each
individual field now and that's it. Works relatively fast now anyway.

My recommendation: keep an eye on the stats tables, and introduce
compound indexes one at a time.

--
Richard Huxton
Archonet Ltd

#8 Bruce Momjian
bruce@momjian.us
In reply to: Richard Huxton (#5)
Re: Large table search question

Richard Huxton <dev@archonet.com> writes:

If you execute a lot of queries for last_name="Smith" AND first_name="John"
then it might well help, there are a lot of "Smith"s to choose from.

I think this is the crux of the argument. Even for a common last name like
Smith, you're going to be talking about what, a few thousand records? Probably
selective enough that the extra column isn't really necessary and you pay the
cost for that extra column on every single update and even on other lookups.

On the other hand this logic applies best to DSS style databases where you're
looking for the fastest average throughput. For OLTP databases it may not
hold: if 'Smith' breaks your performance guarantee then the application could
break. For systems like that it may be worth paying a 1% penalty everywhere
that's within your time budget to avoid paying a 100% penalty on the rare
query that would cause failures, even if on average that means a performance
loss.

In practice I find two column indexes are not uncommon, especially on
many-to-many relationship tables. Three column indexes are rare but they do
happen. Offhand I don't ever recall defining any indexes with four or more
columns.
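A typical case of that two-column pattern, sketched with illustrative names:

```sql
-- Many-to-many link table: the primary key is itself a two-column index
CREATE TABLE person_group (
    person_id integer NOT NULL,
    group_id  integer NOT NULL,
    PRIMARY KEY (person_id, group_id)
);

-- A second index covers lookups from the other direction
CREATE INDEX person_group_rev ON person_group (group_id, person_id);
```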

There's really no alternative to testing. The statistics tables are very useful
here. Unless you have good reason not to, always turn statistics gathering
on, and take snapshots regularly to keep an eye on where PG is exerting the most
effort.

IMHO many people rely too heavily on empirical data rather than having a good
sense of what they want to be happening. Statistics can obscure situations
like what I described above.

I do have statistics on though, and have the same thinking about always
leaving it on, but I'm unclear how to make use of these data. What tools
should I be looking at to use them?

--
greg

#9 John Wells
jb@sourceillustrated.com
In reply to: Bruce Momjian (#8)
Re: Large table search question

Gents,

Thanks for the detailed discussion on this. I appreciate your insight and
time...definitely has given me a few things to consider and test (and
learn, I might add!).

John