Columns as arrays... more efficient than columns?
Table is like:
create table foo (
    number int,
    subset int,
    value int
);
select * from foo;
 number | subset | value
--------+--------+-------
      1 |      1 |     1
      1 |      2 |     2
      1 |      3 |    10
      1 |      4 |     3
Current query is like:
select number,
avg(case when subset = 1 then value else null end) as v1,
avg(case when subset = 2 then value else null end) as v2,
avg(case when subset = 3 then value else null end) as v3,
avg(case when subset = 4 then value else null end) as v4
from foo
group by number
Results:
 number | v1 | v2 | v3 | v4
--------+----+----+----+----
      1 |  1 |  2 | 10 |  3
I'm thinking of denormalising it a bit and storing it either as an array or in a new table holding the end result, like the above. I just want to know which is more efficient. Users can then just do a
select * from new_foo where number = 'X';
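For reference, a minimal sketch of the array variant being weighed here, assuming the schema above (the names foo_arr and vals are hypothetical; PostgreSQL arrays are 1-based):

```sql
-- Hypothetical array layout: one row per number, the four
-- subset values collapsed into a single int[] column.
create table foo_arr (
    number int primary key,
    vals   int[]  -- vals[1] = v1, vals[2] = v2, ...
);

-- Populate it from the normalised table.
insert into foo_arr
select number,
       array[
           avg(case when subset = 1 then value end)::int,
           avg(case when subset = 2 then value end)::int,
           avg(case when subset = 3 then value end)::int,
           avg(case when subset = 4 then value end)::int
       ]
from foo
group by number;

-- The whole-row lookup users would issue:
select * from foo_arr where number = 1;
```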
Thanks.
Nobody has any comments on this??
On Thu, 2007-09-06 at 12:22 +0800, Ow Mun Heng wrote:
On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:
Nobody has any comments on this??
Don't do it.
Michael Glaesemann
grzm seespotcode net
On Thu, 2007-09-06 at 19:52 -0500, Michael Glaesemann wrote:
On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:
Nobody has any comments on this??
Don't do it.
Don't do what?
Don't denormalise the table?
Don't put them into arrays?
Thing is, the end result always has to be in 1 row, and all the columns are needed for data analysis.
So, it's between a select * from foo vs. a select avg(...), avg(...) ... group by ...
Michael Glaesemann wrote:
On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:
Nobody has any comments on this??
Don't do it.
HAHAHAHAHAHAHA....
Joshua D. Drake
On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:
Don't denormalise the table?
Yes. Don't denormalize the tables.
don't put them into arrays?
Yes. Don't use arrays. Caveat: if the data is *naturally* an array
and you will not be doing any relational operations on individual
elements of the arrays, then it makes sense to use arrays. Treat
arrays as you would any other opaque type.
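A short sketch of the "opaque type" usage described above (table and column names are invented for illustration): the array is only ever written and read whole, never filtered or joined on element by element:

```sql
-- Each row's samples column is naturally a vector and is
-- treated as one opaque value.
create table readings (
    id      serial primary key,
    samples int[] not null
);

insert into readings (samples) values (array[1, 2, 10, 3]);

-- Fine: fetch the whole array as a unit.
select samples from readings where id = 1;

-- To avoid: relational operations on individual elements,
-- e.g. where samples[2] = 2, or joining on samples[1].
```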
Michael Glaesemann
grzm seespotcode net
On Thu, 2007-09-06 at 20:20 -0500, Michael Glaesemann wrote:
On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:
Don't denormalise the table?
Yes. Don't denormalize the tables.
I would believe performance would be better with it being denormalised (in this case).
don't put them into arrays?
Yes. Don't use arrays. Caveat: if the data is *naturally* an array
and you will not be doing any relational operations on individual
elements of the arrays, then it makes sense to use arrays. Treat
arrays as you would any other opaque type.
Data is naturally an array, and will be used as an array in any case, since there will not be queries where users select any one of the values in that array, but always the whole array itself.
The data will be used in this form:
 code | v1 | v2 | v3 | v4
------+----+----+----+----
 A    |  1 |  2 | 10 | 23
 B    | 10 | 12 | 15 | 22
 C    | 11 | 24 | 18 | 46
 D    | 21 | 22 | 20 | 41
which will be imported into statistical software/Excel for further manipulation.
If I give it to them in the normalised form, it'll take them an additional 30 minutes or so to get it back into the form above. And it'll make the queries more efficient too:
index on Code,
select * from foo where code = 'B';
By denormalising, I will also get the benefit of reducing the number of rows by a factor of 20 (20 rows = 1 code).
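A sketch of the denormalised summary table under consideration, built along the lines of the avg(case ...) pivot query from the first message but keyed on code as in this message (the name new_foo comes from the first message; the one-off CREATE TABLE AS refresh strategy is an assumption):

```sql
-- Materialise the pivoted averages once from the normalised data.
create table new_foo as
select code,
       round(avg(case when subset = 0 then value end), 0) as v0,
       round(avg(case when subset = 1 then value end), 0) as v1,
       round(avg(case when subset = 2 then value end), 0) as v2,
       round(avg(case when subset = 3 then value end), 0) as v3,
       round(avg(case when subset = 4 then value end), 0) as v4
from foo
group by code;

-- Index for the per-code lookup.
create index new_foo_code_idx on new_foo (code);

-- The simple query users would then run:
select * from new_foo where code = 'B';
```

This table goes stale as foo grows, so it would need to be rebuilt or maintained by triggers; that maintenance cost is part of the trade-off being discussed.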
On 9/6/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
Arrays are interesting and have some useful properties. However, we must first discuss the problems... first and foremost, if you need to read any particular item off the array you must read the entire array from disk, and you must write all items back to disk for writes. Also, they cause some problems with constraints and other issues that come up with de-normalization tactics.
However, if a particular piece of data is actually expressed as an array of items (the polygon type comes to mind), then why not?
That said, let's look at a better way to express this query. What jumps out at me right away is:
select number, subset, avg(value) from foo group by subset;
Does this give you the answer that you need? If not, we can proceed and look at why arrays may or may not be appropriate (I suspect I am not seeing the whole picture here).
merlin
On Sep 6, 2007, at 20:46 , Ow Mun Heng wrote:
On Thu, 2007-09-06 at 20:20 -0500, Michael Glaesemann wrote:
On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:
Don't denormalise the table?
Yes. Don't denormalize the tables.
I would believe performance would be better with it being denormalised (in this case).
I assume you've arrived at that conclusion because you have (a) shown that the performance with a normalized schema does not meet your needs; (b) benchmarked the normalized schema under production conditions; (c) benchmarked the denormalized schema under production conditions; and (d) shown that performance is improved in the denormalized case. I'm interested to see the results of your comparisons.
Regardless, it sounds like you've already made up your mind. Why ask for comments?
Michael Glaesemann
grzm seespotcode net
On Thu, 2007-09-06 at 21:53 -0400, Merlin Moncure wrote:
On 9/6/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
arrays are interesting and have some useful properties. however, we must first discuss the problems... first and foremost, if you need to read any particular item off the array you must read the entire array from disk, and you must write all items back to disk for writes. also, they cause some problems with constraints and other issues that come up with de-normalization tactics.
I see. Didn't know that. Good to know.
select number, subset, avg(value) from foo group by subset;
does this give you the answer that you need?
No, it doesn't:
select * from foo order by subset;
code | subset | value
------+--------+-------
A | 0 | 98
A | 1 | 20
A | 2 | 98
A | 3 | 98
A | 4 | 98
=> select code, subset, avg(value) from foo group by subset;
ERROR: column "foo.code" must appear in the GROUP BY clause or be used
in an aggregate function
=> select code, subset, avg(value) from foo group by subset, code;
code | subset | avg
------+--------+---------------------
A | 3 | 98.0000000000000000
A | 1 | 20.0000000000000000
A | 4 | 98.0000000000000000
A | 0 | 98.0000000000000000
A | 2 | 98.0000000000000000
=> select code,
       round(avg(case when subset = '0' then value else null end), 0) as v0,
       round(avg(case when subset = '1' then value else null end), 0) as v1,
       round(avg(case when subset = '2' then value else null end), 0) as v2,
       round(avg(case when subset = '3' then value else null end), 0) as v3,
       round(avg(case when subset = '4' then value else null end), 0) as v4
   from foo
   group by code;
code | v0 | v1 | v2 | v3 | v4
------+----+----+----+----+----
A | 98 | 20 | 98 | 98 | 98
On Thu, 2007-09-06 at 20:57 -0500, Michael Glaesemann wrote:
On Sep 6, 2007, at 20:46 , Ow Mun Heng wrote:
I would believe performance would be better with it being denormalised (in this case).
I assume you've arrived at that conclusion because you have
(a) shown that the performance with a normalized schema does not meet your needs;
(b) benchmarked the normalized schema under production conditions;
(c) benchmarked the denormalized schema under production conditions; and
(d) shown that performance is improved in the denormalized case.
I'm interested to see the results of your comparisons.
Regardless, it sounds like you've already made up your mind. Why ask for comments?
You've assumed wrong. I've not arrived at any conclusion but am merely exploring my options on which way would be the best to tread. I'm asking the list because I'm new to PG, and after reading all those articles on highscalability etc., the majority of them are all using some kind of denormalised tables.
Right now, there's 8 million rows of data in this one table, growing at a rapid rate of ~2 million/week. I can significantly reduce this number down to 200K (I think, by denormalising it) and shrink the table size.
I would appreciate your guidance on this before I go knock my head on
the wall. :-)
Ow Mun Heng wrote:
=> select code, subset, avg(value) from foo group by subset, code;
code | subset | avg
------+--------+---------------------
A | 3 | 98.0000000000000000
A | 1 | 20.0000000000000000
A | 4 | 98.0000000000000000
A | 0 | 98.0000000000000000
A | 2 | 98.0000000000000000
An alternative way to get the output below would be to feed your aggregate query above to the crosstab() function in contrib/tablefunc.
Joe
=> select code,
       round(avg(case when subset = '0' then value else null end), 0) as v0,
       round(avg(case when subset = '1' then value else null end), 0) as v1,
       round(avg(case when subset = '2' then value else null end), 0) as v2,
       round(avg(case when subset = '3' then value else null end), 0) as v3,
       round(avg(case when subset = '4' then value else null end), 0) as v4
   from foo
   group by code;
code | v0 | v1 | v2 | v3 | v4
------+----+----+----+----+----
A | 98 | 20 | 98 | 98 | 98
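A sketch of the crosstab() route Joe mentions, assuming contrib/tablefunc is installed in the database. Everything is cast to text because the contrib documentation of that era restricts row_name and value to text; the output column names and the source query shape are this sketch's assumptions:

```sql
-- Pivot with crosstab() instead of repeated avg(case ...) columns.
-- The source query must return (row_name, category, value),
-- ordered so each row_name's categories arrive together.
select *
from crosstab(
    'select code::text, subset::text, round(avg(value), 0)::text
     from foo
     group by code, subset
     order by 1, 2'
) as ct(code text, v0 text, v1 text, v2 text, v3 text, v4 text);
```

The text results can be cast back to numeric in an outer query if needed.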
On Thu, 2007-09-06 at 20:19 -0700, Joe Conway wrote:
Ow Mun Heng wrote:
An alternative way to get the output would be to feed your aggregate query to the crosstab() function in contrib/tablefunc.
I just looked at it, and it seems like:
...
row_name and value must be of type text
...
On 09/06/07 21:26, Ow Mun Heng wrote:
I've not arrived at any conclusion but am merely exploring my options on which way would be the best to tread. I'm asking the list because I'm new to PG, and after reading all those articles on highscalability etc., the majority of them are all using some kind of denormalised tables.
Correlation != causation.
There *might* be a causal relationship between high scalability and
table denormalization, but I seriously doubt it.
Right now, there's 8 million rows of data in this one table, growing at a rapid rate of ~2 million/week. I can significantly reduce this number down to 200K (I think, by denormalising it) and shrink the table size.
Even presuming you only insert data SIX hours per day, that's only
13.3 inserts per second. Not very impressive.
- --
Ron Johnson, Jr.
Jefferson LA USA
Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
On 09/06/07 20:53, Merlin Moncure wrote:
[snip]
arrays are interesting and have some useful properties. however, we must first discuss the problems... first and foremost, if you need to read any particular item off the array you must read the entire array from disk, and you must write all items back to disk for writes.
Reads and writes are done at the page level, so I'm not sure this is
valid.
also,
they cause some problems with constraints and other issues that come
up with de-normalization tactics.
ACK.
On Fri, 2007-09-07 at 00:18 -0500, Ron Johnson wrote:
On 09/06/07 21:26, Ow Mun Heng wrote:
I've not arrived at any conclusion but am merely exploring my options on which way would be the best to tread. I'm asking the list because I'm new to PG, and after reading all those articles on highscalability etc., the majority of them are all using some kind of denormalised tables.
Correlation != causation.
There *might* be a causal relationship between high scalability and table denormalization, but I seriously doubt it.
I can't refute you on this since I have no experience in this arena; only what I read on highscalability.com (IIRC).
Right now, there's 8 million rows of data in this one table, growing at a rapid rate of ~2 million/week. I can significantly reduce this number down to 200K (I think, by denormalising it) and shrink the table size.
Even presuming you only insert data SIX hours per day, that's only 13.3 inserts per second. Not very impressive.
Data is inserted 24 hours a day, but not at the same rate each second/minute. The problem isn't really the data insertion; it's already inserted in a normalised manner. It's the selection of data (OLTP datahouse) which takes a longer time, and which is the area of worry.
On 09/07/07 02:49, Ow Mun Heng wrote:
Data is inserted 24 hours a day, but not at the same rate each second/minute. The problem isn't really the data insertion; it's already inserted in a normalised manner. It's the selection of data (OLTP datahouse) which takes a longer time, and which is the area of worry.
Datahouse or "data warehouse"?
On 9/7/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
On 09/06/07 20:53, Merlin Moncure wrote:
arrays are interesting and have some useful properties. however, we must first discuss the problems... first and foremost, if you need to read any particular item off the array you must read the entire array from disk, and you must write all items back to disk for writes.
Reads and writes are done at the page level, so I'm not sure this is valid.
Sure it is... since the denormalized record is much larger (especially in array scenarios), the tuple is much larger, meaning the page will fill up much more quickly, meaning more dead pages, more vacuuming, etc. Besides that, the server has to do some work presenting the array as part of the read, which is overhead. I didn't go into a lot of detail, but the reasoning is sound. Here is a quick example showing the problem.
create table denormalized
(
data int[]
);
create table normalized
(
id int primary key,
datum int
);
insert into normalized select v, v from generate_series(1, 100) v;
insert into denormalized select array(select generate_series(1,100));
create sequence rotator maxvalue 100 cycle;
-- bench denormalized (d.sql) --
update denormalized set data[n] = data[n] + 1 from (select nextval('rotator') as n) q;
merlin@mernix:~$ pgbench -c 4 -t 1000 -f d.sql
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
number of clients: 4
number of transactions per client: 1000
number of transactions actually processed: 4000/4000
tps = 2452.188456 (including connections establishing)
tps = 2465.262905 (excluding connections establishing)
INFO: "denormalized": found 0 removable, 1 nonremovable row versions in 223 pages
-- bench normalized (n.sql) --
update normalized set datum = datum + 1 where id = (select nextval('rotator'));
merlin@mernix:~$ pgbench -c 4 -t 1000 -f n.sql
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
number of clients: 4
number of transactions per client: 1000
number of transactions actually processed: 4000/4000
tps = 6494.402637 (including connections establishing)
tps = 6594.087741 (excluding connections establishing)
INFO: "normalized": found 0 removable, 100 nonremovable row versions in 38 pages
merlin
On 09/07/07 07:49, Merlin Moncure wrote:
Reads and writes are done at the page level, so I'm not sure this is valid.
Sure it is... since the denormalized record is much larger (especially in array scenarios), the tuple is much larger, meaning the page will fill up much more quickly, meaning more dead pages, more vacuuming, etc. Besides that, the server has to do some work presenting the array as part of the read, which is overhead.
We agree.
What I meant was that reads and writes are done at the page level no matter whether the table is normalized or not. Thus, to say "if you need to read any particular item off the array you must read the entire array from disk and you must write all items back to disk for writes" is... irrelevant.
That's probably an imprecise word, but it's all I can think of at the moment.
We also agree regarding big records filling pages faster.
What is the best way to store CLOB data in PostgreSQL?
Currently I store my CLOB data as text. The data consists of large XML files. When I access the text field using the getString method in JDBC, the program is not able to hold such a huge string in memory. If I could stream this data to and from the database, that would be good. Is there a way to do this?
Are there any other solutions to this problem?
Thanks in advance
shar joe