Are there performance advantages in storing bulky field in separate table?

Started by Ian Mayoabout 17 years ago17 messagesgeneral

ianmayo@tesco.net

about 17 years ago

Hi all,
I'm designing a Postgresql database, and would appreciate this design advice.

I've got a fairly straightforward table that's similar to a blog table
(entryId, date, title, author, etc). There is, however, the
requirement to allow a single, fairly bulky binary attachment to
around 1% of the rows.

There will be a few million rows, and I value efficient searches by
date, title, and author.

Would there be a performance advantage in storing the attachment in a
separate table - linked by entryId foreign key? Or shall I just
include it as an ALLOW NULL field my blog table?

[of course, I'd appreciate redirection to the 'right' list if this is
the wrong one].

cheers,
Ian

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Ian Mayo (#1)

Re: Are there performance advantages in storing bulky field in separate table?

Ian Mayo <ianmayo@tesco.net> writes:

I've got a fairly straightforward table that's similar to a blog table
(entryId, date, title, author, etc). There is, however, the
requirement to allow a single, fairly bulky binary attachment to
around 1% of the rows.

There will be a few million rows, and I value efficient searches by
date, title, and author.

Would there be a performance advantage in storing the attachment in a
separate table - linked by entryId foreign key?

No. You'd basically be manually reinventing the TOAST mechanism;
or the large object mechanism, if you choose to store the blob
as a large object rather than a plain bytea field. Either way,
it won't physically be in the same table as the main row data.

If you're curious, this goes into some of the gory details:
http://www.postgresql.org/docs/8.3/static/storage-toast.html

regards, tom lane

ianmayo@tesco.net

about 17 years ago

In reply to: Tom Lane (#2)

Re: Are there performance advantages in storing bulky field in separate table?

Cheers Tom,

On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ian Mayo <ianmayo@tesco.net> writes:

[snip]

No. You'd basically be manually reinventing the TOAST mechanism;
or the large object mechanism, if you choose to store the blob
as a large object rather than a plain bytea field. Either way,
it won't physically be in the same table as the main row data.

fine, that keeps the design simpler

If you're curious, this goes into some of the gory details:
http://www.postgresql.org/docs/8.3/static/storage-toast.html

Oooh, no, much too gory for me.

regards, tom lane

thanks again.

One more thing: hey, did you hear? I just got some advice from Tom Lane!

Ian

sam@samason.me.uk

about 17 years ago

In reply to: Ian Mayo (#3)

Re: Are there performance advantages in storing bulky field in separate table?

On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:

One more thing: hey, did you hear? I just got some advice from Tom Lane!

Statistically speaking; he's the person most likely to answer you by
quite a long way. Out of the ~24k emails going back to Oct 2007 I've
got from pgsql-general the most common people who wrote them are:

who num mails of total
Tom Lane 1,935 8.0%
Scott Marlowe 1,077 4.5%
Alvaro Herrera 521 2.2%
Joshua Drake 468 1.9%
Richard Huxton 432 1.8%
Craig Ringer 338 1.4%
Ivan Sergio Borgonovo 314 1.3%
Sam Mason 310 1.3%
Raymond O'Donnell 270 1.1%
Martijn van Oosterhout 264 1.1%
Greg Smith 252 1.0%

The remaining ~2000 distinct addresses were less than one percent each.

I didn't expect to see myself there; ho hum, maybe I should spend less
time on email! It's also somewhat biased as I only have archives as
long as I've posted. I just tried pulling numbers from markmail.org and
get somewhat different results. Here it knows about 161k messages and
the top twenty posters are:

Tom Lane 14,147 8.8%
Bruce Momjian 3,400 2.1%
Scott Marlowe 3,112 1.9%
Richard Huxton 2,738 1.7%
Martijn van Oosterhout 2,480 1.5%
Alvaro Herrera 1,853 1.2%
Stephan Szabo 1,783 1.1%
Joshua D. Drake 1,720 1.1%
Peter Eisentraut 1,488 0.9%
Michael Fuhr 1,328 0.8%
Bruno Wolff III 1,201 0.7%
Andrew Sullivan 985 0.6%
Doug McNaught 773 0.5%
Jan Wieck 764 0.5%
Ron Johnson 764 0.5%
Jim C. Nasby 745 0.5%
Magnus Hagander 665 0.4%
Marc G. Fournier 630 0.4%
Dennis Gearon 602 0.4%
The Hermit Hacker 601 0.4%

I've not been able to merge people where they use different email
address like with my own archive, but manually fiddled Scott Marlowe as
he appeared as "scott.marlowe" as well. I'm glad to see I drop off the
bottom now!

--
Sam http://samason.me.uk/

Ivan Sergio Borgonovo

mail@webthatworks.it

about 17 years ago

In reply to: Sam Mason (#4)

Re: Are there performance advantages in storing bulky field in separate table?

On Wed, 8 Apr 2009 17:39:02 +0100
Sam Mason <sam@samason.me.uk> wrote:

On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:

One more thing: hey, did you hear? I just got some advice from
Tom Lane!

Statistically speaking; he's the person most likely to answer you
by quite a long way. Out of the ~24k emails going back to Oct
2007 I've got from pgsql-general the most common people who wrote
them are:

who num mails of total
Tom Lane 1,935 8.0%
Scott Marlowe 1,077 4.5%
Alvaro Herrera 521 2.2%
Joshua Drake 468 1.9%
Richard Huxton 432 1.8%
Craig Ringer 338 1.4%
Ivan Sergio Borgonovo 314 1.3%

I just wrote privately to Tom that I'm ashamed I ask so much and
answer so few.
But well I'm an exception ;) I'm the top of non-contributors.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

xzilla@users.sourceforge.net

about 17 years ago

In reply to: Ian Mayo (#3)

Re: Are there performance advantages in storing bulky field in separate table?

On Wednesday 08 April 2009 11:56:35 Ian Mayo wrote:

Cheers Tom,

On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ian Mayo <ianmayo@tesco.net> writes:

[snip]

No. You'd basically be manually reinventing the TOAST mechanism;
or the large object mechanism, if you choose to store the blob
as a large object rather than a plain bytea field. Either way,
it won't physically be in the same table as the main row data.

fine, that keeps the design simpler

Maybe I've been reading too much Pascal again lately, but if only 1% of your
rows are going to have data in this column, personally, I'd put it in a
separate table.

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

ianmayo@tesco.net

about 17 years ago

In reply to: Robert Treat (#6)

Re: Are there performance advantages in storing bulky field in separate table?

On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat
<xzilla@users.sourceforge.net> wrote:

Maybe I've been reading too much Pascal again lately, but if only 1% of your
rows are going to have data in this column, personally, I'd put it in a
separate table.

thanks for that Robert - it does match my (completely groundless)
first impression.

In the nature of debate, would you mind passing on the pascal-related
reasons why you'd put the data in another table?

cheers,
Ian

rm_pg@cheapcomplexdevices.com

about 17 years ago

In reply to: Sam Mason (#4)

Re: Are there performance advantages in storing bulky field in separate table?

Sam Mason wrote:

On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:

One more thing: hey, did you hear? I just got some advice from Tom Lane!

Statistically speaking; he's the person most likely to answer you by

Even so, this might be the #1 advantage of Postgres over Oracle (cost
being #2).

Unless you're one of their ten biggest customers, I imagine it'd take
quite some time to similar support from the core team's counterparts
of the other big databases.

cbbrowne@acm.org

about 17 years ago

In reply to: Ian Mayo (#1)

Re: Are there performance advantages in storing bulky field in separate table?

ianmayo@tesco.net (Ian Mayo) writes:

On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat
<xzilla@users.sourceforge.net> wrote:

Maybe I've been reading too much Pascal again lately, but if only 1% of your
rows are going to have data in this column, personally, I'd put it in a
separate table.

thanks for that Robert - it does match my (completely groundless)
first impression.

In the nature of debate, would you mind passing on the pascal-related
reasons why you'd put the data in another table?

Fabian Pascal's thesis is that you shouldn't have NULLs altogether, as
this leads to having to support the 3-or-more-valued logic of NULLs.
The "Third Manifesto" declines to support having NULLs in relations.

Hugh Darwen wrote the relevant paper explaining how to avoid them:
http://web.onetel.com/~hughdarwen/TheThirdManifesto/Missing-info-without-nulls.pdf

I tend to agree that it is reasonable to go to *some* effort to avoid
having NULL values.

Unfortunately, it seems to me that Darwen's papers elaboration on the
issue doesn't present a solution that is without points to criticize.
He does nicely describe how you may indicate various reasons why you
might have missing information. This is both good and bad...

- It is good because it provides unambiguous ways to determine why
the data was missing. NULL leaves that ambiguous.

- It is Not So Good because it replaces the 3-value-logic of NULLs
with an "as many values for logic as we have kinds of unknown
values," which is more like a 5- or 6-value logic.

http://en.wikipedia.org/wiki/Fabian_Pascal

"Pascal is known for his sharp criticisms of the data management
industry, trade press, current state of higher education, Western
culture and alleged media bias. Pascal advocates strict adherence to
the principles of the relational model, and argues that departing from
the model in the name of pragmatism is responsible for serious data
management troubles. Criticism of Pascal's advocacy often centers
around his polemical style, which some perceive as overly
confrontational and unprofessional.

He has retired from the technological industry and now does political
commentary, specially on Middle East issues."
--
let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/linuxxian.html
"The only thing better than TV with the sound off is Radio with the
sound off." -- Dave Moon

xzilla@users.sourceforge.net

about 17 years ago

In reply to: Ian Mayo (#7)

Re: Are there performance advantages in storing bulky field in separate table?

On Wednesday 08 April 2009 15:30:28 Ian Mayo wrote:

On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat

<xzilla@users.sourceforge.net> wrote:

Maybe I've been reading too much Pascal again lately, but if only 1% of
your rows are going to have data in this column, personally, I'd put it
in a separate table.

thanks for that Robert - it does match my (completely groundless)
first impression.

In the nature of debate, would you mind passing on the pascal-related
reasons why you'd put the data in another table?

You can be sure that discussion of this topic in this forum will soon be
visited by religious zealots, but the short answer is "nulls are bad, mmkay".
A slightly longer answer would be that, as a general rule, attributes of your
relations that only apply to 1% of the rows are better represented as a one
to N relationship using a second table. For a longer answer, see
http://www.databasedesign-resource.com/null-values-in-a-database.html
or http://www.dbazine.com/ofinterest/oi-articles/pascal27

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

sam@samason.me.uk

about 17 years ago

In reply to: Robert Treat (#10)

Re: Are there performance advantages in storing bulky field in separate table?

On Wed, Apr 08, 2009 at 05:06:44PM -0400, Robert Treat wrote:

A slightly longer answer would be that, as a general rule, attributes
of your relations that only apply to 1% of the rows are better
represented as a one to N relationship using a second table.

Have you tried to maintain a non-trivial schema that does this? I'd be
interested to know how it works because I've only tried to work with
small examples that do this and it gets difficult to maintain very
quickly.

For a longer answer, see
http://www.databasedesign-resource.com/null-values-in-a-database.html
or http://www.dbazine.com/ofinterest/oi-articles/pascal27

Both of those articles seem to be written by people who struggle with,
or have incomplete mental models of, the semantics of NULL values. The
second also appears to be designed to sell a book so is of course going
to be presenting biased viewpoints.

How would outer joins work without some concept of a missing value.
Once you allow these missing values as the result of an outer join
you would be deliberately introducing limits if you couldn't also save
these values back into tables. I would say that defaulting columns to
allowing NULLs was a mistake though.

I'd be happy without NULLs in databases if there was some other way
to handle missing values. Parametric polymorphism and some sort of
option[1]http://www.standardml.org/Basis/option.html or Maybe[2]http://www.haskell.org/onlinereport/maybe.html type is what springs to mind for me. NULL would
be represented as NONE or Nothing respectively and non-NULL values as
(SOME v) or (Just v).

--
Sam http://samason.me.uk/

[1]: http://www.standardml.org/Basis/option.html
[2]: http://www.haskell.org/onlinereport/maybe.html

Thomas Kellerer

spam_eater@gmx.net

about 17 years ago

In reply to: Robert Treat (#10)

Re: Are there performance advantages in storing bulky field in separate table?

Robert Treat wrote on 08.04.2009 23:06:

http://www.databasedesign-resource.com/null-values-in-a-database.html

That is a very - hmm - strange article.

One of the proofs that nulls are bad is that "SELECT * FROM theTable" (theTable
being empty) returns nothing, whereas SELECT COUNT(*) FROM theTable returns a
single row with 0 (zero):

"This last example is even worse: The SELECT * returns 'No rows selected', but
the SELECT COUNT(*) returns ONE row with the value 0!"

I stopped reading the article at that point...

rm_pg@cheapcomplexdevices.com

about 17 years ago

In reply to: Robert Treat (#10)

Re: Are there performance advantages in storing bulky field in separate table?

Robert Treat wrote:

You can be sure that discussion of this topic in this forum will soon be
visited by religious zealots, but the short answer is "nulls are bad, mmkay".
A slightly longer answer would be that, as a general rule, attributes of your
relations that only apply to 1% of the rows are better represented as a one

To fulfill your prophecy of zealotry, I've got a number of tables
with columns that are mostly null that I can't think of that nice a
way of refactoring. I'd love ideas to improve the design, though.

One example's an address table. Most addresses have a few fields
that are typically present (building number, city, state, etc).
Others, as described in various government's address standards,
are fields that are typically absent. For example in US addressing
rules, the "Urbanization Name" line:
http://www.usps.com/ncsc/addressstds/addressformats.htm
MRS MARIA SUAREZ Name
URB LAS GLADIOLAS Urbanization name
150 CALLE A House no. and st. name
SAN JUAN PR 00926-3232 City, state, and ZIP+4
Similarly sparse columns in my address tables are,
titles, division/department Names and mailstop codes.
(described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm)

While I realize I could stick in some string (empty string, or
some other magic string like "urbanization name doesn't apply to
this address") into a table, it sure is convenient to put nulls
in those columns.

I'm quite curious what you'd suggest a well-designed address table
would look like without nulls.

Leif B. Kristensen

leif@solumslekt.org

about 17 years ago

In reply to: Ron Mayer (#8)

Re: Are there performance advantages in storing bulky field in separate table?

On Wednesday 8. April 2009, Ron Mayer wrote:

Sam Mason wrote:

On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:

One more thing: hey, did you hear? I just got some advice from
Tom Lane!

Statistically speaking; he's the person most likely to answer you by

Even so, this might be the #1 advantage of Postgres over Oracle (cost
being #2).

I'll subscribe to that. Like almost everybody else on this list, I've
got some excellent replies from Tom. And of course it's one of the
major advantages of PostgreSQL. (The cost being a close second.)
--
Leif Biberg Kristensen | Registered Linux User #338009
Me And My Database: http://solumslekt.org/blog/

dev@archonet.com

about 17 years ago

In reply to: Ivan Sergio Borgonovo (#5)

Re: Are there performance advantages in storing bulky field in separate table?

Ivan Sergio Borgonovo wrote:

On Wed, 8 Apr 2009 17:39:02 +0100
Sam Mason <sam@samason.me.uk> wrote:

who num mails of total
Tom Lane 1,935 8.0%
Scott Marlowe 1,077 4.5%
Alvaro Herrera 521 2.2%
Joshua Drake 468 1.9%
Richard Huxton 432 1.8%
Craig Ringer 338 1.4%
Ivan Sergio Borgonovo 314 1.3%

I just wrote privately to Tom that I'm ashamed I ask so much and
answer so few.
But well I'm an exception ;) I'm the top of non-contributors.

Not so fast there citizen. I'll thank you to note that I've not
contributed any code either, and for a significant number of years too :-)

--
Richard Huxton
Archonet Ltd

Grzegorz Jaśkiewicz

gryzman@gmail.com

about 17 years ago

In reply to: Tom Lane (#2)

Re: Are there performance advantages in storing bulky field in separate table?

If I may, I got an instance once, where table with bytea field was
pretty slow. Turned out, that queries modified everything apart from
bytea bit.
moving it to separate table actually helped performance.

But that only will happen providing that you have the
bytea/text/whatever that won't change, once inserted.

xzilla@users.sourceforge.net

about 17 years ago

In reply to: Ron Mayer (#13)

Re: Are there performance advantages in storing bulky field in separate table?

On Wednesday 08 April 2009 18:25:25 Ron Mayer wrote:

Robert Treat wrote:

You can be sure that discussion of this topic in this forum will soon be
visited by religious zealots, but the short answer is "nulls are bad,
mmkay". A slightly longer answer would be that, as a general rule,
attributes of your relations that only apply to 1% of the rows are better
represented as a one

To fulfill your prophecy of zealotry, I've got a number of tables
with columns that are mostly null that I can't think of that nice a
way of refactoring. I'd love ideas to improve the design, though.

One example's an address table. Most addresses have a few fields
that are typically present (building number, city, state, etc).
Others, as described in various government's address standards,
are fields that are typically absent. For example in US addressing
rules, the "Urbanization Name" line:
http://www.usps.com/ncsc/addressstds/addressformats.htm
MRS MARIA SUAREZ Name
URB LAS GLADIOLAS Urbanization name
150 CALLE A House no. and st. name
SAN JUAN PR 00926-3232 City, state, and ZIP+4
Similarly sparse columns in my address tables are,
titles, division/department Names and mailstop codes.
(described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm)

While I realize I could stick in some string (empty string, or
some other magic string like "urbanization name doesn't apply to
this address") into a table, it sure is convenient to put nulls
in those columns.

I'm quite curious what you'd suggest a well-designed address table
would look like without nulls.

The decision here would depend on your perticular sect of the anti-null
religion, but you have a couple of choices:

1) Break these fields out into one or more tables, containing entries only for
those address that have the additional information. Ideally you might be able
to do something like "extended_address_info" where all of these fields could
be kept, all of them being non-null. I suspect you can't do the ideal, so
you'd end up with a bunch of tables.

This would be used by the "normalization trumps nullification" sect

2) Given that all of these columns have an authoritarian source of what should
be allowed, you could use the "magic string" approach without requiring too
much magic, and these columns could even be a foriegn key into a table
containing the authoritarian options.

This could be justified by the all nulls are bad sect, but might also be used
by a null using crowd who take a strict approach to nulls meaning "unknown
value", since here it isn't that the value is unknown; there isn't a valid
value for these columns. (Adding the magic string to your FK table creates a
valid reference value for those entries that would otherwise not match)

Personally, if you force me into a "well-designed address table *without*
nulls" decision, I would take this latter approach. HTH

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com