Table partition for very large table

Started by Yudie Pg · about 21 years ago · 13 messages · general
#1 Yudie Pg
yudiepg@gmail.com

I have a table with more than 4 million records, and when I do a select
query it gives me an "out of memory" error.
Does Postgres have a feature like table partitioning to handle tables
with a very large number of records?
Just wondering, what do you guys do to deal with very large tables?

Thanks!

#2 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Yudie Pg (#1)
Re: Table partition for very large table

On Mon, 2005-03-28 at 11:32, Yudie Gunawan wrote:

I have a table with more than 4 million records, and when I do a select
query it gives me an "out of memory" error.
Does Postgres have a feature like table partitioning to handle tables
with a very large number of records?
Just wondering, what do you guys do to deal with very large tables?

Is this a straight "select * from table" or is there more being done to
the data?

If it's a straight select, you are likely running out of memory to hold
the result set, and need to look at using a cursor to grab the result in
pieces.

#3 Joshua D. Drake
jd@commandprompt.com
In reply to: Yudie Pg (#1)
Re: Table partition for very large table

On Mon, 2005-03-28 at 11:32 -0600, Yudie Gunawan wrote:

I have a table with more than 4 million records, and when I do a select
query it gives me an "out of memory" error.
Does Postgres have a feature like table partitioning to handle tables
with a very large number of records?
Just wondering, what do you guys do to deal with very large tables?

You can use a cursor to help with this problem and fetch through the
records.

http://www.postgresql.org/docs/8.0/interactive/sql-declare.html
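For example, fetching the rows in batches through a cursor might look like this (a sketch; the cursor name and batch size are illustrative):

```sql
BEGIN;

-- Declare a cursor over the large result set instead of
-- materializing all rows on the client at once.
DECLARE big_cur CURSOR FOR
    SELECT groupnum, sku, url FROM prdt_old;

-- Fetch the rows in manageable batches; repeat until FETCH
-- returns no more rows.
FETCH FORWARD 10000 FROM big_cur;

CLOSE big_cur;
COMMIT;
```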

Sincerely,

Joshua D. Drake


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

--
Command Prompt, Inc., Your PostgreSQL solutions company. 503-667-4564
Custom programming, 24x7 support, managed services, and hosting
Open Source Authors: plPHP, pgManage, Co-Authors: plPerlNG
Reliable replication, Mammoth Replicator - http://www.commandprompt.com/

#4 Michael Fuhr
mike@fuhr.org
In reply to: Yudie Pg (#1)
Re: Table partition for very large table

On Mon, Mar 28, 2005 at 11:32:04AM -0600, Yudie Gunawan wrote:

I have a table with more than 4 million records, and when I do a select
query it gives me an "out of memory" error.

What's the query and how are you issuing it? Where are you seeing
the error? This could be a client problem: the client might be
trying to fetch all rows before doing anything with them, thereby
exhausting all memory. If that's the case then a cursor might be
useful.

Does Postgres have a feature like table partitioning to handle tables
with a very large number of records?

Let's identify the problem before guessing how to fix it.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

#5 Yudie Pg
yudiepg@gmail.com
In reply to: Yudie Pg (#1)
Re: Table partition for very large table

I actually need to join 2 tables. Both of them are similar and have
more than 4 million records.

CREATE TABLE prdt_old (
groupnum int4 NOT NULL,
sku varchar(30) NOT NULL,
url varchar(150)
);

CREATE TABLE prdt_new (
groupnum int4 NOT NULL,
sku varchar(30) NOT NULL,
url varchar(150) NOT NULL
);

The query returns the group number and sku from the old table where
there is no url in the prdt_new table.

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

I already have a resolution for this problem, where I separate the
query for each group.

But when I raised this question, I was hoping that PostgreSQL has some
kind of table optimization for large record counts. In my experience
it is faster to query several smaller, chopped-up tables than to query
a single huge table. I heard Oracle has some kind of table
partitioning that acts like a single table.

#6 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Yudie Pg (#5)
Re: Table partition for very large table

On Mon, 2005-03-28 at 13:02, Yudie Gunawan wrote:

I actually need to join 2 tables. Both of them are similar and have
more than 4 million records.

CREATE TABLE prdt_old (
groupnum int4 NOT NULL,
sku varchar(30) NOT NULL,
url varchar(150)
);

CREATE TABLE prdt_new (
groupnum int4 NOT NULL,
sku varchar(30) NOT NULL,
url varchar(150) NOT NULL
);

The query returns the group number and sku from the old table where
there is no url in the prdt_new table.

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

I already have a resolution for this problem, where I separate the
query for each group.

But when I raised this question, I was hoping that PostgreSQL has some
kind of table optimization for large record counts. In my experience
it is faster to query several smaller, chopped-up tables than to query
a single huge table. I heard Oracle has some kind of table
partitioning that acts like a single table.

Hold on, let's diagnose the real problem before we look for solutions.
What does explain <query> tell you? Have you analyzed the database?
What are your postgresql.conf settings?

#7 Yudie Pg
yudiepg@gmail.com
In reply to: Scott Marlowe (#6)
Re: Table partition for very large table

Hold on, let's diagnose the real problem before we look for solutions.
What does explain <query> tell you? Have you analyzed the database?

This is the QUERY PLAN:

Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
  Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
        ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)

What are your postgresql.conf settings?

Which specific setting do you suspect needs to be changed?

#8 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Yudie Pg (#7)
Re: Table partition for very large table

On Mon, 2005-03-28 at 13:50, Yudie Gunawan wrote:

Hold on, let's diagnose the real problem before we look for solutions.
What does explain <query> tell you? Have you analyzed the database?

This is the QUERY PLAN:

Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
  Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
        ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)

What are your postgresql.conf settings?

Which specific setting do you suspect needs to be changed?

sort_mem, also known as work_mem (in 8.0).

Also, this is important: have you analyzed the tables? I'm guessing no,
since the estimates are 1,000 rows, but the hash join is getting a little
bit more than that. :)

Analyze your database and then run the query again.
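In concrete terms, that might look like this (a sketch; the work_mem value is only an illustrative starting point, not a recommendation):

```sql
-- Refresh planner statistics for both tables.
ANALYZE prdt_old;
ANALYZE prdt_new;

-- Optionally raise the per-operation sort/hash memory for this session
-- (sort_mem before 8.0, work_mem in 8.0; the value is in kilobytes
-- and is purely illustrative).
SET work_mem = 65536;  -- about 64MB
```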

#9 Martijn van Oosterhout
kleptog@svana.org
In reply to: Yudie Pg (#7)
Re: Table partition for very large table

Looks like you need to create some indexes, probably on (groupnum) and
possibly on (groupnum,sku) on both tables.

Hope this helps,

On Mon, Mar 28, 2005 at 01:50:06PM -0600, Yudie Gunawan wrote:

Hold on, let's diagnose the real problem before we look for solutions.
What does explain <query> tell you? Have you analyzed the database?

This is the QUERY PLAN:

Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
  Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
        ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)

What are your postgresql.conf settings?

Which specific setting do you suspect needs to be changed?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#10 Yudie Pg
yudiepg@gmail.com
In reply to: Scott Marlowe (#8)
Re: Table partition for very large table

Also, this is important: have you analyzed the tables? I'm guessing no,
since the estimates are 1,000 rows, but the hash join is getting a little
bit more than that. :)

Analyze your database and then run the query again.

I analyzed the tables and it decreased the number of rows in the
nested loop in the query plan. But then it stalls, or possibly times
out, when I execute the query. This workaround to optimize the
database does not seem to help cut the query time.

What about table partitioning? Does anyone know about it?

#11 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Yudie Pg (#10)
Re: Table partition for very large table

On Mon, 2005-03-28 at 15:38, Yudie Pg wrote:

Also, this is important: have you analyzed the tables? I'm guessing no,
since the estimates are 1,000 rows, but the hash join is getting a little
bit more than that. :)

Analyze your database and then run the query again.

I analyzed the tables and it decreased the number of rows in the
nested loop in the query plan. But then it stalls, or possibly times
out, when I execute the query. This workaround to optimize the
database does not seem to help cut the query time.

What about table partitioning? Does anyone know about it?

Hold your horses there. Calm down. We'll get it running faster. Our
first step was to get the analyzer to find out the right count of how
many rows you have in your table.

There aren't any built-in table partitions, and they might or might not
help if they did exist anyway.

First we had to get the patient's heart beating, now we'll work on the
exercise program.

This is a huge amount of data you're running across. What does explain
<yourquery> say now? If you can let it run, then you might want to try
explain analyze <yourquery> as well, but that has to run the whole
query.

Now, are you running the original query you listed:

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

???

Possibly helpful indexes would be:

create index prdt_new_url_dx on prdt_new (url);
create index prdt_new_sku_dx on prdt_new (sku);
create index prdt_old_sku_dx on prdt_old (sku);
create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

Don't necessarily make them all; it really depends on how many rows
match and whatnot.

#12 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Scott Marlowe (#11)
Re: Table partition for very large table

On Mon, 2005-03-28 at 16:02, Scott Marlowe wrote:

On Mon, 2005-03-28 at 15:38, Yudie Pg wrote:

Also, this is important: have you analyzed the tables? I'm guessing no,
since the estimates are 1,000 rows, but the hash join is getting a little
bit more than that. :)

Analyze your database and then run the query again.

I analyzed the tables and it decreased the number of rows in the
nested loop in the query plan. But then it stalls, or possibly times
out, when I execute the query. This workaround to optimize the
database does not seem to help cut the query time.

What about table partitioning? Does anyone know about it?

Hold your horses there. Calm down. We'll get it running faster. Our
first step was to get the analyzer to find out the right count of how
many rows you have in your table.

There aren't any built-in table partitions, and they might or might not
help if they did exist anyway.

First we had to get the patient's heart beating, now we'll work on the
exercise program.

This is a huge amount of data you're running across. What does explain
<yourquery> say now? If you can let it run, then you might want to try
explain analyze <yourquery> as well, but that has to run the whole
query.

Now, are you running the original query you listed:

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

???

Possibly helpful indexes would be:

create index prdt_new_url_dx on prdt_new (url);
create index prdt_new_sku_dx on prdt_new (sku);
create index prdt_old_sku_dx on prdt_old (sku);
create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

Don't necessarily make them all; it really depends on how many rows
match and whatnot.

Oh, and look at indexing these two columns as well:

pn.groupnum = po.groupnum
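A composite index matching the full join condition is one way to do that (a sketch; the index names are illustrative):

```sql
-- One index per table covering both join columns, so the planner can
-- look up matching (groupnum, sku) pairs directly.
CREATE INDEX prdt_old_group_sku_dx ON prdt_old (groupnum, sku);
CREATE INDEX prdt_new_group_sku_dx ON prdt_new (groupnum, sku);
```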

#13 Yudie Pg
yudiepg@gmail.com
In reply to: Scott Marlowe (#12)
Re: Table partition for very large table

create index prdt_new_url_dx on prdt_new (url);
create index prdt_new_sku_dx on prdt_new (sku);
create index prdt_old_sku_dx on prdt_old (sku);
create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

I added the indexes and redid the analyze. The query plan looks better,
but when I execute the query it still can't finish all at once (I've
been waiting more than 30 minutes) and it seems to time out (error:
"canceling query due to user request") :-B
Maybe you can help analyze this query plan (the second one) to see
what makes it slow?

QUERY PLAN (BEFORE ANALYZE):

Merge Left Join  (cost=1886617.54..1960855.12 rows=4979571 width=19)
  Merge Cond: (("outer"."?column3?" = "inner"."?column4?") AND ("outer".groupnum = "inner".groupnum))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Sort  (cost=969258.98..981707.91 rows=4979571 width=19)
        Sort Key: (mc.sku)::text, mc.groupnum
        ->  Seq Scan on prdt_old mc  (cost=0.00..297611.71 rows=4979571 width=19)
  ->  Sort  (cost=917358.56..928785.51 rows=4570779 width=82)
        Sort Key: (mi.sku)::text, mi.groupnum
        ->  Seq Scan on prdt_new mi  (cost=0.00..126438.79 rows=4570779 width=82)

QUERY PLAN (AFTER ANALYZE):

Nested Loop Left Join  (cost=0.00..233277.42 rows=23747 width=17)
  Join Filter: ("inner".groupnum = "outer".groupnum)
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Index Scan using idx_prdtold_sku on prdt_old mc  (cost=0.00..112211.85 rows=23747 width=17)
  ->  Index Scan using idx_prdtnew_sku on prdt_new mi  (cost=0.00..5.08 rows=1 width=82)
        Index Cond: ((mi.sku)::text = ("outer".sku)::text)

I experimented with indexes before, but I wasn't sure about creating
an index on the url field, because it is a text field and not the main
key for this query. I just learned that we can create an index with a
condition (in this case, where url is null); I guess that may cut some
of the query time.

I separated the query by groupnum instead of querying the whole table,
which caused the bottleneck effect. It works, but yes, it's slow (less
than one hour), though it goes through.
I wish I could do this simultaneously.
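For comparison, the same anti-join can also be phrased with NOT EXISTS, which the planner may handle differently (a sketch, untested against this data, and assuming (groupnum, sku) matches at most one row in prdt_new):

```sql
INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
WHERE NOT EXISTS (
    SELECT 1
    FROM prdt_new pn
    WHERE pn.groupnum = po.groupnum
      AND pn.sku = po.sku
      -- a matching row with a non-empty url means the product
      -- is still live, i.e. not expired
      AND pn.url <> ''
);
```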