Putting many related fields as an array

Started by Ow Mun Heng · almost 17 years ago · 5 messages · general
#1Ow Mun Heng
Ow.Mun.Heng@wdc.com

Hi,

I'm currently building some aggregate tables for some data. The data
will be used for slice/dice analysis, and we want to be able to
play with/manipulate it so that I can get mean and stddev figures.

Eg: for each original column, e.g.

population_in_town : (I get derivatives)
- mean # of ppl in each town
- stddev # of ppl in each town (the stddev calculation already uses 2
extra columns, for the sum of squared counts and the qty of ppl)
- count of ppl
- count of rows where # of ppl < 100 (to get a percentage of the population)
- count of rows where # of ppl < 500

Hence, I'm seeing a 1:5 column growth here if I store them as separate
columns.

eg:
| sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |

I'm thinking of lumping them into 1 column via an array instead of into
5 different columns. Not sure how to go about this, hence the email to
the list.

something like {244,455,1234,43,23}

query can be done like

sum_of_count / qty = mean
sqrt((qty * sum_of_count_squared - sum_of_count^2) / (qty * (qty - 1))) = stddev
(qty_lt_100 / qty) = % < 100
(qty_lt_500 / qty) = % < 500
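
If the five running sums are stored as a single array in the order shown
above, the derived statistics can be computed by subscripting the array
(PostgreSQL arrays are 1-based by default). A minimal sketch, assuming a
hypothetical aggregate table town_stats with an array column stats laid
out as {sum_of_count, sum_of_count_squared, qty, qty_lt_100, qty_lt_500}:

-- hypothetical table; the element order is an assumption
CREATE TABLE town_stats (
    town   text PRIMARY KEY,
    stats  numeric[]  -- {sum_of_count, sum_of_count_squared, qty, qty_lt_100, qty_lt_500}
);

SELECT town,
       stats[1] / stats[3]                          AS mean,
       sqrt((stats[3] * stats[2] - stats[1]^2)
            / (stats[3] * (stats[3] - 1)))          AS stddev,
       100.0 * stats[4] / stats[3]                  AS pct_lt_100,
       100.0 * stats[5] / stats[3]                  AS pct_lt_500
FROM town_stats;

The stddev line is the usual sample-variance-from-running-sums identity,
sqrt((n*sum_sq - sum^2) / (n*(n-1))).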

Then there's the issue of the speed/responsiveness of doing it this way.

Any help would be appreciated.

#2Sam Mason
sam@samason.me.uk
In reply to: Ow Mun Heng (#1)
Re: Putting many related fields as an array

On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:

| sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |

I'm thinking of lumping them into 1 column via an array instead of into
5 different columns. Not sure how to go about this, hence the email to
the list.

The normal array constructor should work:

SELECT ARRAY[MIN(v),MAX(v),AVG(v),STDDEV(v)]
FROM (VALUES (1),(3),(4)) x(v);

Not sure why this is better than using separate columns though. Maybe a
new datatype and a custom aggregate would be easier to work with?
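
For what it's worth, the "new datatype" route could be a composite type,
so each summary stays one column but the parts keep their names. All
type, table, and column names below are made up for illustration:

-- one composite column instead of five scalar ones
CREATE TYPE agg_stats AS (
    sum_count     numeric,
    sum_count_sq  numeric,
    qty           bigint,
    qty_lt_100    bigint,
    qty_lt_500    bigint
);

CREATE TABLE town_summary (
    town  text PRIMARY KEY,
    pop   agg_stats
);

-- fields are accessed with (col).field syntax
SELECT town, (pop).sum_count / (pop).qty AS mean
FROM town_summary;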

--
Sam http://samason.me.uk/

#3Ow Mun Heng
Ow.Mun.Heng@wdc.com
In reply to: Sam Mason (#2)
Re: Putting many related fields as an array

On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:

| sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |

I'm thinking of lumping them into 1 column via an array instead of into
5 different columns. Not sure how to go about this, hence the email to
the list.

The normal array constructor should work:

SELECT ARRAY[MIN(v),MAX(v),AVG(v),STDDEV(v)]
FROM (VALUES (1),(3),(4)) x(v);

Not sure why this is better than using separate columns though. Maybe a
new datatype and a custom aggregate would be easier to work with?

The issue here is the # of columns needed to populate the table.

The table I'm summarizing has somewhere between 50 and 100+ columns; if
the 1:5 growth is used as a yardstick, then the table will get awfully
wide quickly.

I need to know how to do it first, then test accordingly for performance and
corner cases.

#4Sam Mason
sam@samason.me.uk
In reply to: Ow Mun Heng (#3)
Re: Putting many related fields as an array

On Tue, May 12, 2009 at 08:06:25PM +0800, Ow Mun Heng wrote:

On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:

Not sure why this is better than using separate columns though. Maybe a
new datatype and a custom aggregate would be easier to work with?

The issue here is the # of columns needed to populate the table.

The table I'm summarizing has somewhere between 50 and 100+ columns; if
the 1:5 growth is used as a yardstick, then the table will get awfully
wide quickly.

I need to know how to do it first, then test accordingly for performance and
corner cases.

Yes, those are going to be pretty wide tables! Maybe if you can make
the source tables a bit "narrower" it will help things; PG has to read
entire rows from the table, so if your queries are only touching a few
columns then it's going to need a lot more disk bandwidth to get a
specific number of rows back from the table.
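
One way to act on that suggestion is to split the wide source table
vertically, so queries touching only the hot columns read fewer pages.
A hypothetical sketch (table and column names invented):

-- frequently-queried columns in a narrow table
CREATE TABLE measurements_core (
    id    bigint PRIMARY KEY,
    ts    timestamptz,
    town  text
);

-- rarely-queried columns split off, joined back by id when needed
CREATE TABLE measurements_extra (
    id      bigint PRIMARY KEY REFERENCES measurements_core,
    col_51  numeric,
    col_52  numeric
    -- ... remaining wide columns
);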

--
Sam http://samason.me.uk/

#5Andrew Gould
andrewlylegould@gmail.com
In reply to: Ow Mun Heng (#3)
Re: Putting many related fields as an array

On Tue, May 12, 2009 at 7:06 AM, Ow Mun Heng <ow.mun.heng@wdc.com> wrote:

On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:

| sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |

I'm thinking of lumping them into 1 column via an array instead of into
5 different columns. Not sure how to go about this, hence the email to
the list.

The normal array constructor should work:

SELECT ARRAY[MIN(v),MAX(v),AVG(v),STDDEV(v)]
FROM (VALUES (1),(3),(4)) x(v);

Not sure why this is better than using separate columns though. Maybe a
new datatype and a custom aggregate would be easier to work with?

The issue here is the # of columns needed to populate the table.

The table I'm summarizing has somewhere between 50 and 100+ columns; if
the 1:5 growth is used as a yardstick, then the table will get awfully
wide quickly.

I need to know how to do it first, then test accordingly for performance
and corner cases.

I apologize for coming into this conversation late. I used to do analysis
of a public use data flat file that had one row per patient and up to 24
diagnosis codes, each in a different column. Is this analogous to your
situation? I found it was worth the effort to convert the flat file into a
relational data model where the patients' diagnosis codes were in one column
in a separate table. This model also makes more complex analysis easier.

Since there were several types of fields that needed to be combined into
their own tables, I found it took less time to convert the flat file to the
relational model using a script prior to importing the data into the
database server. A Python script would read the original file and create 5
clean, tab-delimited files that were ready to be imported.
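
In SQL terms, the relational model described above replaces the 24
diagnosis columns with a child table; the schema names here are
illustrative only:

CREATE TABLE patient (
    patient_id  bigint PRIMARY KEY
    -- other per-patient fields
);

-- one row per (patient, diagnosis) instead of 24 diagnosis columns
CREATE TABLE patient_diagnosis (
    patient_id  bigint REFERENCES patient,
    dx_code     text
);

-- e.g. count patients per diagnosis code
SELECT dx_code, count(DISTINCT patient_id)
FROM patient_diagnosis
GROUP BY dx_code;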

I hope this helps.

Andrew