Performance issues of one vs. two split tables.
Is there any benefit of splitting up a table into two tables that will
always have a one-to-one relationship?
Say I have a "user" table that has first, last, email, password, and
last_accessed columns. This user table will be accessed often. (It's
not really "user", but that's not important in this discussion)
Say that there's also about 10 columns of settings or preferences for
each user. Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?
Another related question:
Say I have a table "color" that has about 20 different rows ("red",
"blue", "green", etc.). I want the user to be able to select zero or
more favorite colors. I would typically use a link table:
create table favorite_colors (
color int references color(id),
user int references user(id)
);
Now, that table can have a large number of rows if I have a large
number of users and if everyone likes all the colors.
For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?
table user_prefs (
...
favorite_colors bit varying,
...
);
Where each bit represents the primary key of the colors table.
Seems like poor design, but I'm wondering if there might be overriding
concerns at times.
For example, if I have 1 million users and they each like all colors
and thus have a 20 million row link table how much space would be
saved by using a bit column as above?
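For a back-of-the-envelope answer (my own rough numbers, not measured): each heap row in Postgres carries roughly 28 bytes of tuple-header and item-pointer overhead on top of the two 4-byte ints, any index adds more, while a 20-bit varbit is only a few bytes per user. Something like:

```sql
-- rough sizing sketch; the 28-byte per-row overhead and the 9-byte
-- varbit estimate are approximations, and indexes are ignored
SELECT pg_size_pretty((28 + 8) * 20000000::bigint) AS link_table_heap_approx,
       pg_size_pretty(9 * 1000000::bigint)         AS varbit_bytes_approx;
```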
--
Bill Moseley
moseley@hank.org
Say I have a "user" table that has first, last, email, password, and
last_accessed columns. This user table will be accessed often. (It's
not really "user", but that's not important in this discussion)
Say that there's also about 10 columns of settings or preferences for
each user. Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?
I did something like that on MySQL some time ago.
In the Users table there was stuff that other users need to see (like his
login name, etc), and stuff that only this user needs to see (like his
preferences).
So, when displaying posts in the forum, for instance, only a small part
of the fields in the Users table was needed; the rest was just dead
weight that kept the table from fitting in RAM.
So I split the table, and it was faster.
However, you can also buy more RAM...
On Mon, May 14, 2007 at 10:52:13PM +0200, PFC wrote:
Say I have a "user" table that has first, last, email, password, and
last_accessed columns. This user table will be accessed often. (It's
not really "user", but that's not important in this discussion)
Say that there's also about 10 columns of settings or preferences for
each user. Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?
I did something like that on MySQL some time ago.
In the Users table there was stuff that other users need to see
(like his login name, etc), and stuff that only this user needs to see
(like his preferences).
So, when displaying posts in the forum, for instance, only a small
part of the fields in the Users table was needed, the rest was just dead
weight, that made the table unable to fit in RAM.
Well, that's part of my question. If not selecting those columns in
the common selects how much "dead weight" is brought along due to the
extra columns defined in the table, if any?
--
Bill Moseley
moseley@hank.org
On Mon, 14 May 2007, PFC wrote:
I did something like that on MySQL some time ago.
In the Users table there was stuff that other users need to see (like
his login name, etc), and stuff that only this user needs to see (like his
preferences).
So, when displaying posts in the forum, for instance, only a small
part of the fields in the Users table was needed, the rest was just dead
weight, that made the table unable to fit in RAM.
So I split the table, and it was faster.
However, you can also buy more RAM...
Or, use Views without paying more.
Rich
--
Richard B. Shepard, Ph.D. | The Environmental Permitting
Applied Ecosystem Services, Inc. | Accelerator(TM)
<http://www.appl-ecosys.com> Voice: 503-667-4517 Fax: 503-667-8863
On 5/14/07, Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Mon, 14 May 2007, PFC wrote:
I did something like that on MySQL some time ago.
In the Users table there was stuff that other users need to see (like
his login name, etc), and stuff that only this user needs to see (like his
preferences).
So, when displaying posts in the forum, for instance, only a small
part of the fields in the Users table was needed, the rest was just dead
weight, that made the table unable to fit in RAM.
So I split the table, and it was faster.
However, you can also buy more RAM...
Or, use Views without paying more.
Well, views are not going to help with memory consumption here.
It is the table contents that get cached in the buffer cache, not the
view's contents. So if you have a view which returns only one
column from a 15-column table, you will be caching that 15-column
data nonetheless. A view, as the name states, is converted into
a select on a real table.
As for the actual colors table -- you might consider keeping colors
as an int[] (integer array) column, though I'll never admit I suggested
that to you. :D
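A minimal sketch of that int[] idea (the column and id values here are invented, and note you lose the per-element foreign-key check):

```sql
-- store favorite color ids inline as an integer array
ALTER TABLE user_prefs ADD COLUMN favorite_colors integer[];

UPDATE user_prefs SET favorite_colors = ARRAY[1, 4, 7]
 WHERE user_id = 42;

-- users who like color 4
SELECT user_id FROM user_prefs WHERE 4 = ANY (favorite_colors);
```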
Regards,
Dawid
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
Well, views are not going to help with memory consumption here.
It is the table contents that get cached in the buffer cache, not the
view's contents. So if you have a view which returns only one
column from a 15-column table, you will be caching that 15-column
data nonetheless. A view, as the name states, is converted into
a select on a real table.
Are you saying that in Postgresql:
select first_name, last_name from user_table;
uses the same memory as this?
select first_name, last_name,
password, email,
[10 other columns]
from user_table;
--
Bill Moseley
moseley@hank.org
On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
Well, views are not going to help with memory consumption here.
It is the table contents that get cached in the buffer cache, not the
view's contents. So if you have a view which returns only one
column from a 15-column table, you will be caching that 15-column
data nonetheless. A view, as the name states, is converted into
a select on a real table.
Are you saying that in Postgresql:
select first_name, last_name from user_table;
uses the same memory as this?
select first_name, last_name,
password, email,
[10 other columns]
from user_table;
Yes. You read the whole page (8KB) into the buffer cache,
then extract these columns from that buffer. From the
buffer cache's point of view, the whole tuple is contained in the
cache.
Say, if you first SELECT fname, lname FROM user_table;
and then you issue SELECT * FROM user_table; -- the
second select will be returned from buffer cache -- since
all rows are already in the cache.
Having separate caches for every possible SELECT [column list]
would be, well, not quite efficient.
Now, select fname, lname will take less private memory,
but this memory will be freed as soon as the query finishes,
so it won't help our cache much.
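On newer versions you can watch this directly with EXPLAIN's BUFFERS option (it doesn't exist on 8.x, so treat this as illustrative): both scans report the same shared-buffer figures, since the same pages are fetched either way.

```sql
EXPLAIN (ANALYZE, BUFFERS) SELECT fname, lname FROM user_table;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM user_table;
-- the "Buffers: shared ..." lines for the two seq scans should match
```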
Regards,
Dawid
Say, if you first SELECT fname, lname FROM user_table;
and then you issue SELECT * FROM user_table; -- the
second select will be returned from buffer cache -- since
all rows are already in the cache.
...Unless your table contains some large TEXT columns that have been
stored out of line (TOASTed) by postgres.
Can anyone provide input on this question? I'm curious how to look at
this from a disk and memory usage perspective. Would using a bit
column type help much?
I'm not thrilled by the loss of referential integrity.
On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
Say I have a table "color" that has about 20 different rows ("red",
"blue", "green", etc.). I want the user to be able to select zero or
more favorite colors. I would typically use a link table:
create table favorite_colors (
color int references color(id),
user int references user(id)
);
Now, that table can have a large number of rows if I have a large
number of users and if everyone likes all the colors.
For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?
table user_prefs (
...
favorite_colors bit varying,
...
);
Where each bit represents the primary key of the colors table.
Seems like poor design, but I'm wondering if there might be overriding
concerns at times.
For example, if I have 1 million users and they each like all colors
and thus have a 20 million row link table how much space would be
saved by using a bit column as above?
--
Bill Moseley
moseley@hank.org
I'm sure there's a point where you'd be saving a "substantial" amount
of disk space using a non-normalized scheme, but, like you say, you'd
be missing out on other things. In general, disks are cheap while the
man-hours spent trying to fix data corruption are not.
On May 15, 2007, at 7:35 AM, Bill Moseley wrote:
Can anyone provide input on this question? I'm curious how to look at
this from a disk and memory usage perspective. Would using a bit
column type help much?
I'm not thrilled by the loss of referential integrity.
On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
Say I have a table "color" that has about 20 different rows ("red",
"blue", "green", etc.). I want the user to be able to select zero or
more favorite colors. I would typically use a link table:
create table favorite_colors (
color int references color(id),
user int references user(id)
);
Now, that table can have a large number of rows if I have a large
number of users and if everyone likes all the colors.
For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?
table user_prefs (
...
favorite_colors bit varying,
...
);
Where each bit represents the primary key of the colors table.
Seems like poor design, but I'm wondering if there might be overriding
concerns at times.
For example, if I have 1 million users and they each like all colors
and thus have a 20 million row link table how much space would be
saved by using a bit column as above?
--
Bill Moseley
moseley@hank.org
On Tue, May 15, 2007 at 07:51:44AM +0200, Dawid Kuroczko wrote:
On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
Well, views are not going to help with memory consumption here.
It is the table contents that get cached in the buffer cache, not the
view's contents. So if you have a view which returns only one
column from a 15-column table, you will be caching that 15-column
data nonetheless. A view, as the name states, is converted into
a select on a real table.
Are you saying that in Postgresql:
select first_name, last_name from user_table;
uses the same memory as this?
select first_name, last_name,
password, email,
[10 other columns]
from user_table;
Yes. You read the whole page (8KB) into the buffer cache,
then extract these columns from that buffer. From the
buffer cache's point of view, the whole tuple is contained in the
cache.
Sorry, I don't mean to drag this thread out much longer. But, I have
one more question regarding joins.
Say I have a customer table and an order table. I want a list of all
order id's for a given customer.
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
See, I've seen this splitting of one-to-one tables a number of times
(such as the user and user_preferences example) and I'm not sure if
that's just poor schema design, premature optimization, or someone
making smart use of their knowledge of the internal workings of
Postgresql....
--
Bill Moseley
moseley@hank.org
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
No, it just pulls the columns you ask from the table, nothing less,
nothing more.
Splitting tables (vertical partitioning) is used to reduce the size of
the working set that has to fit in RAM... this is a different reason than
what you're thinking about.
lists@peufeu.com (PFC) writes:
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
No, it just pulls the columns you ask from the table, nothing
less, nothing more.
That's not quite 100% accurate.
In order to construct the join, the entire pages of the relevant
tuples in tables "order" and "customer" will need to be drawn into
memory.
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.
They may not be drawn into the return set, but they will still be
drawn into memory.
If you alter tables "customer" and "order", taking some columns off,
and stowing them in separate tables, then you'll find that more tuples
of "customer" and "order" will fit into a buffer page, and that the
join will be assembled with somewhat less memory usage.
Whether or not that is a worthwhile change to make will vary
considerably.
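A sketch of such a vertical split, with invented column names; a view can paper over the split for callers that want everything (though, as noted upthread, the view doesn't change what gets cached when you actually select both halves):

```sql
-- hot, narrow table: scanned and updated often
CREATE TABLE users_hot (
    id            serial PRIMARY KEY,
    login         text NOT NULL,
    last_accessed timestamp
);

-- cold, wide table: 1:1 with users_hot, read rarely
CREATE TABLE users_cold (
    user_id     integer PRIMARY KEY REFERENCES users_hot(id),
    email       text,
    preferences text
);

-- convenience view reassembling the original shape
CREATE VIEW users_all AS
SELECT h.id, h.login, h.last_accessed, c.email, c.preferences
  FROM users_hot h
  LEFT JOIN users_cold c ON c.user_id = h.id;
```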
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxdatabases.info/info/languages.html
There are two kinds of pedestrians -- the quick and the dead.
One other possible reason for splitting the table up in two chunks is to
grant different rights on the 2 sets of columns.
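For example (role and table names invented; before column-level privileges existed, separate tables or views were the usual way to get this effect):

```sql
-- everyone may read the public half; only the app role sees preferences
GRANT SELECT ON users TO PUBLIC;
REVOKE ALL ON user_preferences FROM PUBLIC;
GRANT SELECT, UPDATE ON user_preferences TO app_role;
```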
Susan Cassidy
Bill Moseley <moseley@hank.org>
Sent by: pgsql-general-owner@postgresql.org
05/15/2007 09:44 AM
To
Postgres General <pgsql-general@postgresql.org>
cc
Subject
Re: [GENERAL] Performance issues of one vs. two split tables.
Sorry, I don't mean to drag this thread out much longer. But, I have
one more question regarding joins.
Say I have a customer table and an order table. I want a list of all
order id's for a given customer.
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
See, I've seen this splitting of one-to-one tables a number of time
(such as the user and user_preferences example) and I'm not sure if
that's just poor schema design, premature optimization, or someone
making smart use of their knowledge of the internal workings of
Postgresql....
--
Bill Moseley
moseley@hank.org
On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
lists@peufeu.com (PFC) writes:
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
No, it just pulls the columns you ask from the table, nothing
less, nothing more.
That's not quite 100% accurate.
In order to construct the join, the entire pages of the relevant
tuples in tables "order" and "customer" will need to be drawn into
memory.
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.
Is that specific to Postgresql? From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used. Perhaps there's an overriding
reason for this.
If you alter tables "customer" and "order", taking some columns off,
and stowing them in separate tables, then you'll find that more tuples
of "customer" and "order" will fit into a buffer page, and that the
join will be assembled with somewhat less memory usage.
Whether or not that is a worthwhile change to make will vary
considerably.
Makes designing the schema a bit tough. ;)
--
Bill Moseley
moseley@hank.org
On May 15, 2007, at 12:02 PM, Bill Moseley wrote:
On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
lists@peufeu.com (PFC) writes:
SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id
Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.
No, it just pulls the columns you ask from the table, nothing
less, nothing more.
That's not quite 100% accurate.
In order to construct the join, the entire pages of the relevant
tuples in tables "order" and "customer" will need to be drawn into
memory.
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.
Is that specific to Postgresql? From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used. Perhaps there's an overriding
reason for this.
The columns are next to each other on the disk. You need to read
the entire block off disk into system cache, so you'll be reading all
the columns of all the rows in that block into memory.
That's just the way that most (all?) modern filesystems work, and so the
way that most filesystem based databases are going to work. I've seen
some databases that don't store all the main columns of a table together
on disk, but they're fairly rare.
Pushing data into lookaside tables either manually or automatically
via toast changes the tradeoffs.
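To put rough numbers on the tuples-per-page effect (8 KB pages; the ~28 bytes of per-row overhead and the row widths are my own guesses):

```sql
SELECT 8192 / (28 + 300) AS wide_rows_per_page,    -- ~300-byte user rows
       8192 / (28 + 40)  AS narrow_rows_per_page;  -- ~40-byte split rows
```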
Cheers,
Steve
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.
Is that specific to Postgresql? From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used. Perhaps there's an overriding
reason for this.
Anything else would seem odd to me. Pulling a page into memory
typically has OS support, and is thus very fast. Picking and
choosing bits and pieces to read would be prohibitively slow.
Moreover, caching only those bits and pieces would require
complicated code to decide whether the cached data is relevant to the
next query. Validating cached data at the page level is much
simpler, and thus faster.
Or so I assume ...
- John D. Burger
MITRE
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.
Yeah, that's what I meant ;)
All the columns are loaded (except the TOASTed ones which are not
mentioned in the query) into memory, but only the requested ones are
processed and returned to the client...
Is that specific to Postgresql?
Nope. All databases do more or less the same.
From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used. Perhaps there's an overriding
reason for this.
Yeah, where would you put this data if you didn't put it where it is now?
If you alter tables "customer" and "order", taking some columns off,
and stowing them in separate tables, then you'll find that more tuples
of "customer" and "order" will fit into a buffer page, and that the
join will be assembled with somewhat less memory usage.
Whether or not that is a worthwhile change to make will vary
considerably.
Makes designing the schema a bit tough. ;)
"Premature optimization is the root of all evil"
Build a test database, fill it with data, and experiment.
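generate_series makes that easy; a throwaway sketch (names invented):

```sql
-- a million wide rows to experiment with
CREATE TABLE test_users AS
SELECT g                AS id,
       'user_' || g     AS login,
       repeat('x', 200) AS padding   -- stands in for the extra columns
  FROM generate_series(1, 1000000) AS g;
ANALYZE test_users;
-- now compare plans/timings against a copy without the padding column
```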
On Tue, May 15, 2007 at 09:20:53PM +0200, PFC wrote:
From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used. Perhaps there's an overriding
reason for this.
Yeah, where would you put this data if you didn't put it where it is now?
Swish-e isn't a database by any means, but it does have a way to store
column-like metadata for each "row". When it does a search it only
explicitly pulls from disk the metadata that it's asked to return.
Granted, the OS is reading from disk more than the application is
asking for, but the application is only allocating memory for the data
it's going to return. And the metadata is not always stored
together on disk.
Without knowing Pg internals I wasn't aware of how the actual table
data was organized and fetched into memory.
"Premature optimization is the root of all evil"
Exactly what prompted this thread. ;)
--
Bill Moseley
moseley@hank.org
On May 14, 2007, at 4:37 PM, Bill Moseley wrote:
Say that there's also about 10 columns of settings or preferences for
each user. Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?
When you have, say, 65 million users, it makes sense to push the
ancillary info to another table to keep from having to copy too much
data when you update the main info (like last access time).
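A sketch of that arrangement (table names invented): the frequently-written timestamp lives in its own narrow table, so each update rewrites a tiny row instead of the whole wide user row.

```sql
CREATE TABLE user_activity (
    user_id       integer PRIMARY KEY REFERENCES users(id),
    last_accessed timestamp NOT NULL DEFAULT now()
);

-- the hot write path touches only this narrow tuple
UPDATE user_activity SET last_accessed = now() WHERE user_id = 42;
```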