Performance while loading data and indexing
Hello all,
Some time back I posted a query about building a site with a 150GB database. In the
last couple of weeks, lots of things were tested at my place and there are some
results and again some concerns.
This is a long post. Please be patient and read through. If we win this, I guess we
have a good marketing/advocacy case here..;-)
First the problems (For those who do not read beyond first page)
1) Database load time from flat file using copy is very high
2) Creating index takes huge amount of time.
3) Any suggestions for runtime as data load and query will be going in
parallel.
Now the details. Note that this is a test run only..
Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
RedHat7.2/PostgreSQL7.1.3
Database in flat file:
125,000,000 records of around 100 bytes each.
Flat file size 12GB
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
Important postgresql.conf settings
sort_mem = 12000
shared_buffers = 24000
fsync=true (Sad but true. Left untouched.. Will that make a difference on
SCSI?)
wal_buffers = 65536
wal_files = 64
Now the requirements
Initial flat data load: 250GB of data. This has gone up since my last query. It
was 150GB earlier..
Ongoing inserts: 5000/sec.
Number of queries: 4800 queries/hour
Query response time: 10 sec.
Now questions.
1) Instead of copying from a single 12GB data file, will a parallel copy from
say 5 files will speed up the things?
Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
setup..
2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further
addition to improve create index performance?
3) 5K concurrent inserts with an index on, will this need additional CPU
power? Like deploying it on dual RISC CPUs etc?
4) Query performance is not a problem. Though at 4.8K queries per hour the
expected response time from each query is 10 sec., my guess is some serious CPU
power will be chewed there too..
5) Will upgrading to 7.2.2/7.3 beta help?
All in all, in the test, we didn't see the performance where hardware is
saturated to its limits. So effectively we are not able to get postgresql
making use of it. Just pushing WAL and shared buffers does not seem to be the
solution.
If you guys have any suggestions, let me know. I need them all..
MySQL is almost out because it's been creating an index for the last 17 hours. I
don't think it will keep up with 5K inserts per sec. with an index. SAP DB is
under evaluation too. But PostgreSQL is the favourite as of now because it works.
So I need to come up with solutions to problems that will occur in the near
future..
;-)
TIA..
Bye
Shridhar
--
Law of Procrastination: Procrastination avoids boredom; one never has the
feeling that there is nothing important to do.
On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
Some time back I posted a query to build a site with 150GB of database. In last
couple of weeks, lots of things were tested at my place and there are some
results and again some concerns.
2) Creating index takes huge amount of time.
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further
addition to improve create index performance?
Just a thought. If I sort the table before making an index, would it be faster
than creating index on raw table? And/or if at all, how do I sort the table
without duplicating it?
Just a wild thought..
Bye
Shridhar
--
linux: the choice of a GNU generation(ksh@cis.ufl.edu put this on Tshirts in
'93)
I'll preface this by saying that while I have a large database, it doesn't
require quite the performance you're talking about here.
On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
1) Database load time from flat file using copy is very high
2) Creating index takes huge amount of time.
3) Any suggestions for runtime as data load and query will be going in
parallel.
You're loading all the data in one copy. I find that INSERTs are mostly
limited by indexes. While index lookups are cheap, they are not free and
each index needs to be updated for each row.
I found using partial indexes to only index the rows you actually use can
help with the loading. It's a bit obscure though.
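A minimal sketch of the partial-index idea described above; the table, columns, and the date predicate are all made up for illustration:

```sql
-- Index only the slice of rows the queries actually hit; inserts whose
-- datetime falls outside the predicate never touch this index at all.
CREATE INDEX recent_idx ON tablename (esn, min, datetime)
    WHERE datetime > '2002-09-01';
```

A query can only use such an index when the planner can prove the query's WHERE clause implies the index predicate, which is part of why it's "a bit obscure".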
As for parallel loading, you'll be limited mostly by your I/O bandwidth.
Have you measured it to make sure it's up to speed?
Now the details. Note that this is a test run only..
Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
RedHat7.2/PostgreSQL7.1.3
Database in flat file:
125,000,000 records of around 100 bytes each.
Flat file size 12GB
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
So you're loading at a rate of 860KB per sec. That's not too fast. How many
indexes are active at that time? Triggers and foreign keys also take their
toll.
Important postgresql.conf settings
sort_mem = 12000
shared_buffers = 24000
fsync=true (Sad but true. Left untouched.. Will that make a difference on
SCSI?)
wal_buffers = 65536
wal_files = 64
fsync IIRC only affects the WAL buffers now but it may be quite expensive,
especially considering it's running on every transaction commit. Oh, your
WAL files are on a separate disk from the data?
Initial flat data load: 250GB of data. This has gone up since last query. It
was 150GB earlier..
Ongoing inserts: 5000/sec.
Number of queries: 4800 queries/hour
Query response time: 10 sec.
That looks quite achievable.
1) Instead of copying from a single 12GB data file, will a parallel copy from
say 5 files will speed up the things?
Limited by I/O bandwidth. On Linux, vmstat can tell you how many blocks are
being loaded and stored per second. Try it. As long as sync() doesn't get
done too often, it should help.
Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
setup..
No, it's not. You should be able to do better.
2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further
addition to improve create index performance?
Should be fine. Admittedly your indexes are taking rather long to build.
3) 5K concurrent inserts with an index on, will this need additional CPU
power? Like deploying it on dual RISC CPUs etc?
It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
give you a decent idea.
4) Query performance is not a problem. Though at 4.8K queries per hour the
expected response time from each query is 10 sec., my guess is some serious CPU
power will be chewed there too..
Should be fine.
5) Will upgrading to 7.2.2/7.3 beta help?
Possibly, though it may be worth it just for the features/bugfixes.
All in all, in the test, we didn't see the performance where hardware is
saturated to its limits. So effectively we are not able to get postgresql
making use of it. Just pushing WAL and shared buffers does not seem to be the
solution.
If you guys have any suggestions, let me know. I need them all..
Find the bottleneck: CPU, I/O or memory?
MySQL is almost out because it's been creating an index for the last 17 hours. I don't
think it will keep up with 5K inserts per sec. with index. SAP DB is under
evaluation too. But postgresql is most favourite as of now because it works. So
I need to come up with solutions to problems that will occur in near future..
;-)
17 hours! Ouch. Either way, you should be able to do much better. Hope this
helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.
On 26 Sep 2002 at 10:51, paolo.cassago@talentmanager.c wrote:
Hi,
it seems you have to cluster it, I don't think you have another choice.
Hmm.. That didn't occur to me... I guess some real-time clustering like usogres
would do. Unless it turns out to be a performance hog..
But this is just insert and select. No updates no deletes(Unless customer makes
a 180 degree turn) So I doubt if clustering will help. At the most I can
replicate data across machines and spread queries on them. Replication overhead
as a down side and low query load on each machine as upside..
I'm retrieving the configuration of our postgres servers (I'm out of office
now), so I can send it to you. I was quite desperate about performance, and
I was thinking to migrate the data to an Oracle database. Then I found this
configuration on the net, and I had a successful increase in performance.
In this case, we have settled on postgresql because we/our customer wants to keep the
costs down..:-) They are even asking now if it's possible to keep hardware
costs down as well. That's getting some funny responses here, but I digress..
Maybe this can help you.
Why do you use copy to insert records? I usually use perl scripts, and they
work well.
Performance reasons. As I said in one of my posts earlier, putting up to 100K
records in one transaction, in steps of 10K, did not reach the performance of copy.
As Tom said rightly, it was a 4-to-1 ratio despite using transactions..
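For reference, the two load paths being compared look roughly like this (the file path and the column values are placeholders, not from the original post):

```sql
-- Single-statement bulk load; the server parses the flat file directly:
COPY tablename FROM '/path/to/flatfile';

-- Versus batched INSERTs: one transaction per 10K rows still lost ~4:1 to COPY
BEGIN;
INSERT INTO tablename VALUES ('1234567890', '0987654321', '2002-09-26 14:05:00');
-- ... ~10,000 more INSERTs ...
COMMIT;
```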
Thanks once again..
Bye
Shridhar
--
Secretary's Revenge: Filing almost everything under "the".
Hi Shridhar,
Shridhar Daithankar wrote:
<snip>
3) Any suggestions for runtime as data load and query will be going in
parallel.
That sounds unusual. From reading this, it *sounds* like you'll be
running queries against an incomplete dataset, or maybe just running the
queries that affect the tables loaded thus far (during the initial
load).
<snip>
fsync=true (Sad but true. Left untouched.. Will that make a difference on
SCSI?)
Definitely. Have directly measured a ~ 2x tps throughput increase on
FreeBSD when leaving fsync off whilst performance measuring stuff
recently (PG 7.2.2). Like anything it'll depend on workload, phase of
moon, etc, but it's a decent indicator.
<snip>
Now questions.
1) Instead of copying from a single 12GB data file, will a parallel copy from
say 5 files will speed up the things?
Not sure yet. Haven't yet done enough performance testing (on the cards
very soon though).
Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
setup..
fsync = off would help during the data load, but not a good idea if
you're going to be running queries against it at the same time.
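The toggle mentioned above is a one-line postgresql.conf change, shown here as a sketch; remember to flip it back and restart once the bulk load is done:

```
fsync = false    # bulk-load window only; restore to true for live running
```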
Am still getting the hang of performance tuning stuff. Have a bunch of
Ultra160 hardware for the Intel platform, and am testing against it as
time permits.
Not as high end as I'd like, but it's a start.
:-)
Regards and best wishes,
Justin Clift
<snip>
Bye
Shridhar
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
Some time back I posted a query to build a site with 150GB of database. In
last
couple of weeks, lots of things were tested at my place and there are some
results and again some concerns.
2) Creating index takes huge amount of time.
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
I never tried 150GB of data, but I have tried 10GB, and that worked fine for me.
Maybe it will help if you post your table schema, including which indexes you
use, and the average size of one tuple.
On 26 Sep 2002 at 11:17, Mario Weilguni wrote:
On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
Some time back I posted a query to build a site with 150GB of database. In
last
couple of weeks, lots of things were tested at my place and there are some
results and again some concerns.
2) Creating index takes huge amount of time.
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
I never tried 150GB of data, but I have tried 10GB, and that worked fine for me.
Maybe it will help if you post your table schema, including which indexes you
use, and the average size of one tuple.
Well the test runs were for 10GB of data. Schema is attached. Read in fixed
fonts.. Last nullable fields are dummies but may be used in future, and varchars
are not acceptable (not my requirement). Tuple size is around 100 bytes..
The index creation query was
CREATE INDEX index1 ON tablename (esn,min,datetime);
What if I put datetime ahead? It's likely that the datetime field will have a high
degree of locality, being log data..
Bye
Shridhar
--
brain, v: [as in "to brain"] To rebuke bluntly, but not pointedly; to dispel a
source of error in an opponent. -- Ambrose Bierce, "The Devil's Dictionary"
On 26 Sep 2002 at 19:17, Justin Clift wrote:
Shridhar Daithankar wrote:
<snip>
3) Any suggestions for runtime as data load and query will be going in
parallel.
That sounds unusual. From reading this, it *sounds* like you'll be
running queries against an incomplete dataset, or maybe just running the
queries that affect the tables loaded thus far (during the initial
load).
That's correct. Load the data so far and keep inserting data as and when it
is generated.
They don't mind running against data so far. It's not very accurate stuff
IMO...
fsync=true (Sad but true. Left untouched.. Will that make a difference on
SCSI?)
Definitely. Have directly measured a ~ 2x tps throughput increase on
FreeBSD when leaving fsync off whilst performance measuring stuff
recently (PG 7.2.2). Like anything it'll depend on workload, phase of
moon, etc, but it's a decent indicator.
I didn't know even that matters with SCSI.. Will check out..
fsync = off would help during the data load, but not a good idea if
you're going to be running queries against it at the same time.
That's OK for the reasons mentioned above. It wouldn't be out of place to
expect a UPS on such an installation...
Bye
Shridhar
--
Hoare's Law of Large Problems: Inside every large problem is a small problem
struggling to get out.
On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:
On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
1) Database load time from flat file using copy is very high
2) Creating index takes huge amount of time.
3) Any suggestions for runtime as data load and query will be going in
parallel.
You're loading all the data in one copy. I find that INSERTs are mostly
limited by indexes. While index lookups are cheap, they are not free and
each index needs to be updated for each row.
I found using partial indexes to only index the rows you actually use can
help with the loading. It's a bit obscure though.
As for parallel loading, you'll be limited mostly by your I/O bandwidth.
Have you measured it to make sure it's up to speed?
Well. It's like this, as of now: CreateDB->create table->create index->Select.
So loading is not slowed by an index. As for your hint about vmstat, will check it
out.
So you're loading at a rate of 860KB per sec. That's not too fast. How many
indexes are active at that time? Triggers and foreign keys also take their
toll.
Nothing except the table where data is loaded..
fsync IIRC only affects the WAL buffers now but it may be quite expensive,
especially considering it's running on every transaction commit. Oh, your
WAL files are on a separate disk from the data?
No. Same RAID 5 disks..
It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
give you a decent idea.
I guess we forgot to monitor system parameters. Next on my list is running
vmstat, top and tuning bdflush.
Find the bottleneck: CPU, I/O or memory?
Understood..
MySQL is almost out because it's been creating an index for the last 17 hours. I don't
think it will keep up with 5K inserts per sec. with an index. SAP DB is under
evaluation too. But postgresql is most favourite as of now because it works. So
I need to come up with solutions to problems that will occur in near future..
;-)
17 hours! Ouch. Either way, you should be able to do much better. Hope this
helps,
Heh.. no wonder this evaluation is taking more than 2 weeks.. MySQL was running
out of disk space while creating an index and crashing. An upgrade to MySQL helped
there but no numbers as yet..
Thanks once again...
Bye
Shridhar
--
Boren's Laws: (1) When in charge, ponder. (2) When in trouble, delegate. (3)
When in doubt, mumble.
On Thursday 26 Sep 2002 9:35 am, Shridhar Daithankar wrote:
[questions re: large database]
Before reading my advice please bear in mind you are operating way beyond the
scale of anything I have ever built.
Now the details. Note that this is a test run only..
Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
RedHat7.2/PostgreSQL7.1.3
Database in flat file:
125,000,000 records of around 100 bytes each.
Flat file size 12GB
Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.
Important postgresql.conf settings
[snipped setting details for moment]
Have you tried putting the wal files, syslog etc on separate disks/volumes? If
you've settled on Intel, about the only thing you can optimise further is the
disks.
Oh - and the OS - make sure you're running a (good) recent kernel for that
sort of hardware, I seem to remember some substantial changes in the 2.4
series regarding multi-processor.
Now the requirements
Initial flat data load: 250GB of data. This has gone up since last query.
It was 150GB earlier..
Ongoing inserts: 5000/sec.
Number of queries: 4800 queries/hour
Query response time: 10 sec.
Is this 5000 rows in say 500 transactions, or 5000 insert transactions per
second? How many concurrent clients is this? Similarly for the 4800 queries,
how many concurrent clients is this? Are they expected to return approx 150
rows as in your test?
Now questions.
1) Instead of copying from a single 12GB data file, will a parallel copy
from say 5 files will speed up the things?
If the CPU is the bottle-neck then it should, but it's difficult to say
without figures.
Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
setup..
What is saturating during the flat-file load? Something must be maxed in top /
iostat / vmstat.
[snip]
5) Will upgrading to 7.2.2/7.3 beta help?
It's unlikely to hurt.
All in all, in the test, we didn't see the performance where hardware is
saturated to its limits.
Something *must* be.
What are your disaster recovery plans? I can see problems with taking backups
if this beast is live 24/7.
- Richard Huxton
Shridhar Daithankar wrote:
<snip>
fsync=true (Sad but true. Left untouched.. Will that make a difference on
SCSI?)
Definitely. Have directly measured a ~ 2x tps throughput increase on
FreeBSD when leaving fsync off whilst performance measuring stuff
recently (PG 7.2.2). Like anything it'll depend on workload, phase of
moon, etc, but it's a decent indicator.
I didn't know even that matters with SCSI.. Will check out..
Cool. When testing it had FreeBSD 4.6.2 installed on one drive along
with the PostgreSQL 7.2.2 binaries, it had the data on a second drive
(mounted as /pgdata), and it had the pg_xlog directory mounted on a
third drive. Swap had its own drive as well.
Everything is UltraSCSI, etc. Haven't yet tested for a performance
difference through moving the indexes to another drive after creation
though. That apparently has the potential to help as well.
:-)
Regards and best wishes,
Justin Clift
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
Well the test runs were for 10GB of data. Schema is attached. Read in fixed
fonts.. Last nullable fields are dummies but may be used in future, and varchars
are not acceptable (not my requirement). Tuple size is around 100 bytes..
The index creation query was
CREATE INDEX index1 ON tablename (esn,min,datetime);
What if I put datetime ahead? It's likely that the datetime field will have a
high degree of locality, being log data..
Just an idea, I noticed you use char(10) for esn and min, and use this as
index. Are these really fixed len fields all having 10 bytes? Otherwise
varchar(10) would be better, because your tables, and especially the indices
will be probably much smaller.
what average length do you have for min and esn?
On Thu, Sep 26, 2002 at 03:01:35PM +0530, Shridhar Daithankar wrote:
The index creation query was
CREATE INDEX index1 ON tablename (esn,min,datetime);
What if I put datetime ahead? It's likely that the datetime field will have a high
degree of locality, being log data..
The order of fields depends on what you're using it for. For example, you
can use the above index for a query using the conditions:
esn = 'aaa'
esn = 'bbb' and min = 'xxx'
but not for queries with only
datetime = '2002-09-26'
min = 'ddd' and datetime = '2002-10-02'
The fields can only be used left to right. This is where a single
multicolumn index differs from multiple indexes of different columns.
Have you used EXPLAIN ANALYSE to determine whether your indexes are being
used optimally?
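Putting the left-to-right rule next to the index in question, a sketch (the literal values are invented):

```sql
-- index1 is on (esn, min, datetime), so these can use it:
EXPLAIN SELECT * FROM tablename WHERE esn = 'aaa';
EXPLAIN SELECT * FROM tablename WHERE esn = 'bbb' AND min = 'xxx';

-- but these skip the leading column and fall back to a sequential scan:
EXPLAIN SELECT * FROM tablename WHERE datetime = '2002-09-26';
EXPLAIN SELECT * FROM tablename WHERE min = 'ddd' AND datetime = '2002-10-02';
```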
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.
Shridhar Daithankar wrote:
On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:
<snip>
fsync IIRC only affects the WAL buffers now but it may be quite expensive,
especially considering it's running on every transaction commit. Oh, your
WAL files are on a separate disk from the data?
No. Same RAID 5 disks..
Not sure if this is a good idea. Would have to think deeply about the
controller and drive optimisation/load characteristics.
If it's any help, when I was testing recently with WAL on a separate
drive, the WAL logs were doing more read&writes per second than the main
data drive. This would of course be affected by the queries you are
running against the database. I was just running Tatsuo's TPC-B stuff,
and the OSDB AS3AP tests.
I guess we forgot to monitor system parameters. Next on my list is running
vmstat, top and tuning bdflush.
That'll just be the start of it for serious performance tuning and
learning how PostgreSQL works. :)
<snip>
Thanks once again...
Bye
Shridhar
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
On 26 Sep 2002 at 11:50, Mario Weilguni wrote:
Well the test runs were for 10GB of data. Schema is attached. Read in fixed
fonts.. Last nullable fields are dummies but may be used in future and varchars
are not acceptable (not my requirement). Tuple size is around 100 bytes..
The index creation query was
CREATE INDEX index1 ON tablename (esn,min,datetime);
What if I put datetime ahead? It's likely that the datetime field will have a
high degree of locality, being log data..
Just an idea, I noticed you use char(10) for esn and min, and use this as
index. Are these really fixed len fields all having 10 bytes? Otherwise
varchar(10) would be better, because your tables, and especially the indices
will probably be much smaller.
What average length do you have for min and esn?
10 bytes. Those are id numbers.. like phone numbers, they always have all the digits
filled in..
Bye
Shridhar
--
Bradley's Bromide: If computers get too powerful, we can organize them into a
committee -- that will do them in.
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:
RedHat7.2/PostgreSQL7.1.3
I'd suggest a newer release of Postgres ... 7.1.3 is pretty old ...
Create unique composite index on 2 char and a timestamp field: 25226 sec.
What do you mean by "char" exactly? If it's really char(N), how much
are you paying in padding space? There are very very few cases where
I'd not say to use varchar(N), or text, instead. Also, does it have to
be character data? If you could use an integer or float datatype
instead the index operations should be faster (though I can't say by
how much). Have you thought carefully about the order in which the
composite index columns are listed?
sort_mem = 12000
To create an index of this size, you want to push sort_mem as high as it
can go without swapping. 12000 sounds fine for the global setting, but
in the process that will create the index, try setting sort_mem to some
hundreds of megs or even 1Gb. (But be careful: the calculation of space
actually used by CREATE INDEX is off quite a bit in pre-7.3 releases
:-(. You should probably expect the actual process size to grow to two
or three times what you set sort_mem to. Don't let it get so big as to
swap.)
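The per-session override suggested above might look like this (the 500MB figure is an arbitrary assumption sized to the 4GB box; sort_mem is specified in KB):

```sql
-- Raise sort_mem only in the session doing the build; the global 12000 stays.
SET sort_mem = 512000;   -- ~500MB; watch the process size and don't let it swap
CREATE INDEX index1 ON tablename (esn, min, datetime);
RESET sort_mem;
```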
wal_buffers = 65536
The above is a complete waste of memory space, which would be better
spent on letting the kernel expand its disk cache. There's no reason
for wal_buffers to be more than a few dozen.
regards, tom lane
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:
On 26 Sep 2002 at 11:50, Mario Weilguni wrote:
Just an idea, I noticed you use char(10) for esn and min, and use this as
index. Are these really fixed len fields all having 10 bytes?
10 bytes. Those are id numbers.. like phone numbers always have all the digits
filled in..
If they are numbers, can you store them as bigints instead of char(N)?
regards, tom lane
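A sketch of this suggestion against the schema under discussion (only esn, min and datetime are from the thread; everything else is illustrative):

```sql
CREATE TABLE tablename (
    esn      bigint NOT NULL,     -- was char(10); a 10-digit id fits in bigint
    min      bigint NOT NULL,     -- was char(10)
    datetime timestamp NOT NULL
    -- ... remaining dummy fields ...
);
CREATE UNIQUE INDEX index1 ON tablename (esn, min, datetime);
```

One caveat: bigint drops leading zeros, so this only works if the ids are purely numeric values rather than fixed-width strings.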
Justin Clift <justin@postgresql.org> writes:
On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:
fsync IIRC only affects the WAL buffers now but it may be quite expensive,
especially considering it's running on every transaction commit. Oh, your
WAL files are on a separate disk from the data?
Not sure if this is a good idea. Would have to think deeply about the
controller and drive optimisation/load characteristics.
If it's any help, when I was testing recently with WAL on a separate
drive, the WAL logs were doing more read&writes per second than the main
data drive.
... but way fewer seeks. For anything involving lots of updating
transactions (and certainly 5000 separate insertions per second would
qualify; can those be batched??), it should be a win to put WAL on its
own spindle, just to get locality of access to the WAL.
regards, tom lane
On 26 Sep 2002 at 10:33, Tom Lane wrote:
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:
RedHat7.2/PostgreSQL7.1.3
I'd suggest a newer release of Postgres ... 7.1.3 is pretty old ...
I agree.. downloading 7.2.2 right away..
Create unique composite index on 2 char and a timestamp field: 25226 sec.
What do you mean by "char" exactly? If it's really char(N), how much
are you paying in padding space? There are very very few cases where
I'd not say to use varchar(N), or text, instead. Also, does it have to
be character data? If you could use an integer or float datatype
instead the index operations should be faster (though I can't say by
how much). Have you thought carefully about the order in which the
composite index columns are listed?
I have forwarded the idea of turning things into numbers. If it causes a speedup
in index lookup/creation, it would do. Looks like bigint is the order of the
day..
sort_mem = 12000
To create an index of this size, you want to push sort_mem as high as it
can go without swapping. 12000 sounds fine for the global setting, but
in the process that will create the index, try setting sort_mem to some
hundreds of megs or even 1Gb. (But be careful: the calculation of space
actually used by CREATE INDEX is off quite a bit in pre-7.3 releases
:-(. You should probably expect the actual process size to grow to two
or three times what you set sort_mem to. Don't let it get so big as to
swap.)
Great. I was skeptical about pushing it beyond 100MB. Now I can push it to the corners..
wal_buffers = 65536
The above is a complete waste of memory space, which would be better
spent on letting the kernel expand its disk cache. There's no reason
for wal_buffers to be more than a few dozen.
That was a rather desperate move. Nothing was improving performance and then we
started pushing numbers up.. Will dial it back.. Same goes for 64 WAL files.. A GB
looks like a waste to me..
I might have found the bottleneck, although by accident. MySQL was running out
of space while creating an index, so my friend shut down MySQL and tried to move
things by hand to create links. He noticed that even things like cp were
terribly slow, and it hit us: maybe the culprit is the file system, ext3 in
this case.
My friend argues for ext2 to eliminate journalling overhead, but I favour
reiserfs personally, having used it in pgbench with 10M rows on a paltry 20GB IDE
disk for 25 tps..
We will be attempting reiserfs and/or XFS if required. I know how much speed
difference exists between reiserfs and ext2. Would not be surprised if
everything just starts screaming in one go..
Bye
Shridhar
--
Cropp's Law: The amount of work done varies inversely with the time spent in the
office.
On 26 Sep 2002 at 10:42, Tom Lane wrote:
Justin Clift <justin@postgresql.org> writes:
If it's any help, when I was testing recently with WAL on a separate
drive, the WAL logs were doing more read&writes per second than the main
data drive.
... but way fewer seeks. For anything involving lots of updating
transactions (and certainly 5000 separate insertions per second would
qualify; can those be batched??), it should be a win to put WAL on its
own spindle, just to get locality of access to the WAL.
Probably they will be a single transaction. If possible we will bunch more of
them together.. like 5 seconds of data pushed down in a single transaction, but
not sure it's possible..
This is a bit like replication, but from a live Oracle machine to postgres, from
the information I have. So there should be some chance of tuning there..
Bye
Shridhar
--
Langsam's Laws: (1) Everything depends. (2) Nothing is always. (3) Everything
is sometimes.