PG for DataWarehouse type Queries

Started by Ow Mun Heng · over 18 years ago · 8 messages · general
#1Ow Mun Heng
Ow.Mun.Heng@wdc.com

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

Joins are done in the form of left joins (sometimes on the same tables,
due to normalisation).

Is 30 min - 2 hours too long, or is this considered "normal"?

#2Joshua Tolley
eggyknap@gmail.com
In reply to: Ow Mun Heng (#1)
Re: PG for DataWarehouse type Queries

On 8/3/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

Joins are done in the form of left joins (sometimes on the same tables,
due to normalisation).

Is 30 min - 2 hours too long, or is this considered "normal"?

That depends entirely on your query, your dataset, the machine you're
using, etc. Your best bet is probably to post an EXPLAIN ANALYZE of
your query along with some description of what's going on and what
you're running this all on (pg version, machine specs, etc.) and see
if someone can come up with an acceptable way to make the query
faster.
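
[For reference, a sketch of what such a post might contain; the table and column names below are hypothetical, not from the original query:]

```sql
-- EXPLAIN ANALYZE actually executes the query and reports per-node
-- row counts and timings, so it will run as long as the query itself.
EXPLAIN ANALYZE
SELECT a.serial_no, b.test_result
FROM raw_measurements a
LEFT JOIN test_results b ON a.serial_no = b.serial_no
WHERE a.test_date >= '2007-07-01';
```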

- Josh

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ow Mun Heng (#1)
Re: PG for DataWarehouse type Queries

Ow Mun Heng <Ow.Mun.Heng@wdc.com> writes:

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

Joins are done in the form of left joins (sometimes on the same tables,
due to normalisation).

Is 30 min - 2 hours too long, or is this considered "normal"?

This question is nearly content-free. How many joins is "lots"? How
many result rows are you expecting? What PG version are you using?
What have you got work_mem set to? What does EXPLAIN say the plan is
(EXPLAIN ANALYZE output would be better, but you'll have to wait for the
results...)?
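
[For anyone following along, the settings Tom asks about can be inspected and adjusted per session; a minimal sketch, with an illustrative value:]

```sql
SHOW server_version;  -- which PG version this is
SHOW work_mem;        -- memory each sort/hash operation may use

-- Raise it for the current session only, before a big reporting query.
-- The value below is illustrative; tune to available RAM and concurrency.
SET work_mem = '64MB';
```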

regards, tom lane

#4André Volpato
andre.volpato@ecomtecnologia.com.br
In reply to: Ow Mun Heng (#1)
Re: PG for DataWarehouse type Queries

Ow Mun Heng escreveu:

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

Joins are done in the form of left joins (sometimes on the same tables,
due to normalisation).

Is 30 min - 2 hours too long, or is this considered "normal"?

What do you mean by 'data warehouse queries'?
Is there any OLAP server between your app and Postgres?

--
[]'s,

André Volpato
ECOM Tecnologia Ltda
andre.volpato@ecomtecnologia.com.br
(41) 3014 2322

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Ow Mun Heng (#1)
Re: PG for DataWarehouse type Queries

On Fri, 2007-08-03 at 15:12 +0800, Ow Mun Heng wrote:

Is 30 min - 2 hours too long, or is this considered "normal"?

Yes.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#6Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: PG for DataWarehouse type Queries

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Ow Mun Heng <Ow.Mun.Heng@wdc.com> writes:

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

One more question to pile on to the others: what version of Postgres is this?

I seem to recall some older versions had trouble with tables with hundreds of
columns, but that's just guessing in the dark without actual facts to look at.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#7Ow Mun Heng
Ow.Mun.Heng@wdc.com
In reply to: Joshua Tolley (#2)
Re: PG for DataWarehouse type Queries

On Fri, 2007-08-03 at 07:55 -0600, Josh Tolley wrote:

On 8/3/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:

Can anyone shed some light on this? I just would like to know if queries
for raw data (not aggregates) are expected to take a long time.
Running times are between 30 minutes and 2 hours for large dataset pulls.

Involves lots of joins on very large tables (min 1 million rows each
table, 300 columns per table).

Joins are done in the form of left joins (sometimes on the same tables,
due to normalisation).

Is 30 min - 2 hours too long, or is this considered "normal"?

That depends entirely on your query, your dataset, the machine you're
using, etc. Your best bet is probably to post an EXPLAIN ANALYZE of
your query along with some description of what's going on and what
you're running this all on (pg version, machine specs, etc.) and see
if someone can come up with an acceptable way to make the query
faster.

Sorry this generated more questions. I was vague because it wasn't really
running on PG (yet).

The actual query is running on SQL Server. I'm considering offloading some
of those queries to a PG server (I have 8.2.4 running for testing purposes).
(Very low-end desktop whitebox spec, e.g. 1 GB RAM, 3x500 GB drives in
RAID 1/1+0/0+1, P4 2 GHz, with the tables pre-joined and de-normalised
every, say, 5 min, and table partitioning performed based on date ranges.)

Current queries in SQL Server take between 30 min and 2 hours (4 GB
mem / quad dual-core Xeon).

Queries involve 11 joins (of 5 main tables, of which 3 are large and 2 are
fact tables).

Large tables have > 1 million rows; fact tables max ~200 rows.

The query is of the form:

select
    A.A,
    A.B,
    C.A,
    D.D
from
    (select
     ...
    ) A
left outer join
    (select ...
     ...
    ) B
    on A.A = B.A
left outer join
    (select
     ...
    ) C

and so on..

I'm hoping that by pre-joining them and using table partitioning, the
performance will greatly increase. (I'm still testing.)
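
[On 8.2, date-range partitioning means table inheritance plus CHECK constraints and constraint_exclusion; a minimal sketch of the setup described above, with hypothetical table and column names:]

```sql
-- Parent table holds no rows itself; each child carries a CHECK
-- constraint describing its date range.
CREATE TABLE prejoined (serial_no text, test_date date, result numeric);

CREATE TABLE prejoined_2007_07 (
    CHECK (test_date >= '2007-07-01' AND test_date < '2007-08-01')
) INHERITS (prejoined);

CREATE TABLE prejoined_2007_08 (
    CHECK (test_date >= '2007-08-01' AND test_date < '2007-09-01')
) INHERITS (prejoined);

-- Let the planner skip children whose CHECK contradicts the WHERE clause:
SET constraint_exclusion = on;
SELECT * FROM prejoined WHERE test_date = '2007-08-15';
-- should scan only prejoined_2007_08 (plus the empty parent)
```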

BTW, this question was just to get a feel for whether such long running
times are "normal" for ETL-type queries.

Sorry for the noise (since this is not really PG-related yet);
I'm still researching the feasibility etc.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ow Mun Heng (#7)
Re: PG for DataWarehouse type Queries

Ow Mun Heng <Ow.Mun.Heng@wdc.com> writes:

Queries involves 11 Joins (of 5 Main tables, of which 3 are Large, 2 are
fact tables)

It would be a good idea to increase join_collapse_limit to more than 11.
This will increase the planning time, but the investment will likely be
worthwhile for such a large query.
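
[For the archives, that setting is session-local and just needs to exceed the number of joined relations; the value below is illustrative:]

```sql
-- join_collapse_limit defaults to 8; beyond that many relations the
-- planner stops searching join orders across the whole list. Raising it
-- lets it consider more orders, at the cost of longer planning time.
SET join_collapse_limit = 12;  -- > the 11 joins in this query

-- from_collapse_limit works analogously for subqueries being pulled up
-- into the join list, and may be worth raising too.
```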

regards, tom lane