Scaling PostgreSQL-9

Started by sandeep prakash dhumale · over 15 years ago · 5 messages · general
#1 sandeep prakash dhumale
sandy9940@rediffmail.com

Hello All,

Need some help in scaling PostgreSQL:

I have a table with 400M records and 5 int columns, with an index on only 1 column.

Rows are updated by a Perl script which takes 10k numbers in one transaction and fires a single update per number in a loop on the database, keeping track of the result returned. If zero rows are returned, it does an insert at a later stage. In short, if the record is present in the DB it gets updated, and if not it gets inserted. More than 80% of the records are always already in the DB, so updates dominate.

We need to speed up this process, as it takes about 150 sec to complete a 10k batch. From the database logs, each update takes about 15 ms on average.
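For reference, each iteration of the loop effectively runs something like the following (table and column names here are hypothetical; the post only says the table holds mobile numbers with a few int columns):

```sql
-- Hypothetical names: "subscribers", "status", "expiry_date" are
-- illustrative, not from the original post.
UPDATE subscribers
   SET status = 1, expiry_date = current_date + 30
 WHERE mobile_number = 919812345678;

-- If the UPDATE reported 0 rows, the script later issues:
INSERT INTO subscribers (mobile_number, status, expiry_date)
VALUES (919812345678, 1, current_date + 30);
```

Note that 10,000 round trips at ~15 ms each already accounts for the full 150 s, so the per-row pattern, not any single statement, is the bottleneck.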

I tried a bulk delete of 1M numbers followed by a COPY of the same, but no luck so far: the delete and the copy each take more than an hour.

Few Details:

PostgreSQL 9.0.0 on openSUSE 11.0 (64-bit); shared_buffers=2.5GB, effective_cache_size=20GB, checkpoint_segments=200 (raised for bulk loading)

Server: Dell dual quad-core, 32GB RAM, DB partition on RAID 10 and pg_xlog on RAID 1.

P.S. Previously we had a Slony read-only slave on 8.4.2, where the delete was fast (about 7 min) and the copy took about 6 min. We moved to PostgreSQL 9 with a read-only standby slave to remove the overhead caused by Slony's triggers (the slave was also always lagging behind during bulk loads on the master), in the hope of speeding up the process.

Any help would be much appreciated ...

With Regards
sandy

#2 Vick Khera
vivek@khera.org
In reply to: sandeep prakash dhumale (#1)
Re: Scaling PostgreSQL-9

On Tue, Sep 28, 2010 at 6:31 AM, sandeep prakash dhumale <
sandy9940@rediffmail.com> wrote:

I have a table with 400M records with 5 int columns having index only on 1
column.

How is your data used? Is the update done by the primary key? Are the
queries segmented in some way that may divide the data based on one of the
other columns?

You should investigate using partitions to hold your data. I'd recommend at
least 100 partitions. I've done this with great success by dividing some
tables along one of the foreign keys. My table was just a pure relation
relating the PKs of two other tables. After analyzing the queries that were
most often run, we decided to split along the one which resulted in the
fewest partitions being referenced per search.

By splitting, we reduced a 200M+ row table into 100 tables of ~2M rows, and
the constraint exclusion (or altering the queries to directly access the
proper partition) reduced our query times dramatically.
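On 9.0 there is no declarative partitioning (that arrived much later, in PostgreSQL 10), so a setup along these lines uses table inheritance plus CHECK constraints. A minimal sketch, with hypothetical table, column, and range values:

```sql
-- Parent table; on PostgreSQL 9.0, partitioning is done via inheritance.
CREATE TABLE subscribers (
    mobile_number bigint NOT NULL,
    status        int,
    expiry_date   date
);

-- One child table per key range; the CHECK constraint is what allows
-- the planner's constraint exclusion to skip irrelevant partitions.
CREATE TABLE subscribers_p00 (
    CHECK (mobile_number >= 9190000000000 AND mobile_number < 9190100000000)
) INHERITS (subscribers);

-- Indexes are not inherited: each child needs its own.
CREATE UNIQUE INDEX subscribers_p00_key ON subscribers_p00 (mobile_number);

-- ... and so on for the remaining partitions ...
```

Queries and updates would then go either through the parent (relying on constraint exclusion) or directly to the matching child table.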

#3 sandeep prakash dhumale
sandy9940@rediffmail.com
In reply to: Vick Khera (#2)
Re: Re: [GENERAL] Scaling PostgreSQL-9

On Tue, 28 Sep 2010 17:45:16 +0530, Vick Khera wrote:

On Tue, Sep 28, 2010 at 6:31 AM, sandeep prakash dhumale wrote:

I have a table with 400M records with 5 int columns having index only on 1 column.

How is your data used?  Is the update done by the primary key?  Are the queries segmented in some way that may divide the data based on one of the other columns?

You should investigate using partitions to hold your data.  I'd recommend at least 100 partitions.  I've done this with great success by dividing some tables along one of the foreign keys.  My table was just a pure relation relating the PKs of two other tables.  After analyzing the queries that were most often run, we decided to split along the one which resulted in the fewest partitions being referenced per search.

By splitting, we reduced a 200M+ row table into 100 tables of ~2M rows, and the constraint exclusion (or altering the queries to directly access the proper partition) reduced our query times dramatically.

Thanks Vivek for your reply,

We did investigate partitioning, but it doesn't look like an ideal candidate here; perhaps you can shed some more light on it.

The table contains unique mobile numbers, and updates are keyed on the mobile number. Initially we thought of partitioning by ranges of the mobile series and ended up with about 50 partitions (which can be increased to 100 as per your suggestion). The problem we faced was that updates were also slow, because the update was based on the mobile number while the constraint was on the mobile series. Moreover, SELECT queries with an IN clause of random mobile numbers may end up scanning all the tables.

The table has mobile number, status, and expiry date. I cannot partition on expiry date, as all SELECTs are on mobile number.

Please suggest...

Sandy

#4 Vick Khera
vivek@khera.org
In reply to: sandeep prakash dhumale (#3)
Re: Scaling PostgreSQL-9

On Tue, Sep 28, 2010 at 8:56 AM, Sandy <sandy9940@rediffmail.com> wrote:

Table has mobile number,status and expiry date. I can not partition on
expiry date as all SELECT's are on mobile number.

Then partition on the mobile number. If your updates and queries are all
tied to that, then it is the ideal candidate.

You should also make sure that you have proper indexes on each partition to
let the query executor quickly decide that a given table is not of any use
(you really don't want a sequential scan on each partition).

You will get speedups if you can achieve one or both of: getting the query
planner to use constraint exclusion to eliminate partitions, and getting
the index scans to quickly skip over partitions which will not have the data
you want.
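Both conditions can be checked with EXPLAIN (again using hypothetical names): the plan for a single-number lookup should touch exactly one child table, via an index scan.

```sql
-- constraint_exclusion defaults to 'partition' since 8.4, which covers
-- inheritance children; setting it explicitly makes the intent clear.
SET constraint_exclusion = partition;

EXPLAIN
SELECT status, expiry_date
  FROM subscribers
 WHERE mobile_number = 919812345678;
-- Desired plan: an Index Scan on a single child table. A Seq Scan on
-- every child means either the per-partition index is missing or the
-- planner cannot prove the CHECK constraints from this WHERE clause
-- (e.g. modulus-style CHECKs, which 9.0 cannot prove from an equality).
```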

#5 Igor Neyman
ineyman@perceptron.com
In reply to: sandeep prakash dhumale (#1)
Re: Scaling PostgreSQL-9

> -----Original Message-----
> From: sandeep prakash dhumale [mailto:sandy9940@rediffmail.com]
> Sent: Tuesday, September 28, 2010 6:32 AM
> To: pgsql-general@postgresql.org
> Subject: Scaling PostgreSQL-9
>
> Hello All,
>
> Need some help in scaling PostgreSQL:
>
> I have a table with 400M records with 5 int columns having
> index only on 1 column.
>
> Rows are updated by a perl script which takes 10k numbers in
> one transactions and fires single single update in a loop on
> database keeping track of the result returned . If zero
> returned then at later stage it does an insert. In short if
> the record is present in the DB then it gets updated and if
> not then get inserted. > 80% the records are always there in
> the DB so updates are more.
>
> We need to speed up this process as it takes about 150 sec to
> complete 10k batch. From database logs on the avg each update
> takes about 15ms.

Your problem is that you process one record at a time in your loop,
meaning you have to make 10k round trips to the database to process
10k records.

Try creating a "staging" table in the database and copy all the
records from your source into it, e.g. using the COPY command if
your source is a file. Then, with a couple of SQL statements:

insert ... where not exists (select ...)
update ... where exists ...

insert the new and update the existing records.
Here you work with data sets inside the database,
which should be much faster than a procedural Perl script.
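Concretely, the staging approach could look like this (the file path, table, and column names are all hypothetical):

```sql
CREATE TEMP TABLE staging (
    mobile_number bigint,
    status        int,
    expiry_date   date
);

-- Bulk-load the whole batch in one statement instead of 10k round trips.
COPY staging FROM '/path/to/batch.csv' WITH CSV;

-- One set-based UPDATE for the >80% of numbers that already exist:
UPDATE subscribers s
   SET status = st.status, expiry_date = st.expiry_date
  FROM staging st
 WHERE s.mobile_number = st.mobile_number;

-- One set-based INSERT for the rest:
INSERT INTO subscribers (mobile_number, status, expiry_date)
SELECT st.mobile_number, st.status, st.expiry_date
  FROM staging st
 WHERE NOT EXISTS (SELECT 1 FROM subscribers s
                    WHERE s.mobile_number = st.mobile_number);
```

INSERT ... ON CONFLICT only appeared in 9.5, so on 9.0 this two-statement form (run in one transaction; concurrent writers touching the same numbers would need locking) is the usual pattern.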

> I tried to do a bulk delete of 1M numbers and copy of the
> same but no luck so far. Delete and copy also take a longer
> time more than 1 hour each.

How did you do the copy? Again using a Perl script to loop through
1M records one at a time?

> Few Details:
>
> PostgreSQL 9.0.0 on 11.0 open SuSe-64bit,Shared
> buffer=2.5GB,effective_cache_size=20GB,checkpoint_segments=200
> (raised for bulkloading)
>
> Server: dell dual quadcore ,32GB RAM, DB partition on RAID 10
> and pg_xlog on RAID 1.
>
> p.s. Previously we were having slony read only slave on 8.4.2
> , where delete was fast about 7 min and copy 6 min, we moved
> to PostgreSQL 9 for read only Stand by slave to remove
> overhead caused by slony due to triggers (also the slave was
> always lagging in case of bulkloads on master) in the hope
> of speeding up the process.
>
> Any help would be much appriciated ...
>
> With Regards
> sandy

Regards,
Igor Neyman