Major Performance decrease after some hours
Hi all,
I have a Tomcat application with PostgreSQL 8.1.4 running which
performs about 10000 inserts/deletes and updates on a database every
2-4 minutes, and after some hours of load testing the top output says
0.0% idle, 6-7% system load, load average 32, 31, 28, and there are
many exceptions at statement execution like:
An I/O error occurred while sending to the backend.
There are 10-15 postmaster processes running which use all the CPU power.
A restart of Tomcat and then PostgreSQL results in the same situation.
Some postgres processes are in DELETE waiting or SELECT waiting.
VACUUM runs through in just about 1-2 seconds, is run via cron
every minute, and reports that the FSM settings are high enough.
The situation only gets better if Tomcat is stopped and PostgreSQL is
restarted -> 90% idle.
The Tomcat application talks to another application on the server
which uses many CPU cycles at Tomcat startup, and my observation is
that after this startup with 0.0% idle, PostgreSQL can't recover from
this situation.
The setup: Dual Xeon server machines (Dell PowerEdge 2850) in
clusters using Heartbeat1. Each server has only one hard disk; no RAID
configuration is used.
- Each cluster has 4 DRBD devices, one of them for the PostgreSQL data
- Two of these clusters use the same PostgreSQL installation to
share the data; the database can be moved from one cluster to the
other in case of failure
- OS: Debian Sarge with PostgreSQL 8.1.4
- Two cron jobs are configured: a "vacuumdb --all" every minute and a
"vacuumdb --all --analyze" every 23 minutes (see the crontab sketch
after this list)
- There are 3 applications using the PostgreSQL installation, each
with its own database.
- The main application is based on Tomcat 4.1-30; it is partly a
web application, partly a terminal login protocol based on HTTP and
XML, so database access is of course done using JDBC
- A cron job performs a pg_dump of the main database every 4 hours
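Put together, the cron jobs look roughly like this (a sketch; the
paths, the database name, and cron's */23 approximation of "every 23
minutes" are assumptions):
# m    h   dom mon dow  command
*      *   *   *   *    vacuumdb --all --quiet
*/23   *   *   *   *    vacuumdb --all --analyze --quiet
0      */4 *   *   *    pg_dump maindb > /var/backups/maindb.sql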
Any ideas are welcome.
thx,
Peter
Hi again,
I decreased the load of the test and the system has been stable for some
hours now, no exceptions so far, top says 80-90% idle.
The question is: what are the reasons for this behaviour? Is a hardware
upgrade required? How can I check how bad the condition of the
database is?
thx,
Peter
Hi,
a few random questions...
I have a Tomcat application with PostgreSQL 8.1.4 running which
performs about 10000 inserts/deletes and updates on a database every
2-4 minutes, and after some hours of load testing the top output says
0.0% idle, 6-7% system load, load average 32, 31, 28
Who is responsible for these high load values?
Do you see many postmaster processes in top?
and there are
many exceptions at statement execution like:
An I/O error occurred while sending to the backend.
Is this from the application?
How many connections does the application open in parallel?
Using JDBC, I guess?
- Each Cluster has 4 drbd Devices, one for the PostgreSQL data
- Two of these clusters are using the same PostgreSQL installation to
share the data, the database can be moved from one cluster to the
other in case of failure
Just to be 100% sure: just one server at a time runs PostgreSQL on
that shared data disk, right?
Bye,
Chris.
Peter Bauer wrote:
[...]
There are 10-15 postmaster processes running which use all the CPU power.
A restart of tomcat and then postgresql results in the same situation.
Some postgres processes are in DELETE waiting or SELECT waiting.
VACUUM runs through in just about 1-2 seconds and is run via cron
every minute and reports that the FSM settings are high enough.
Pure speculation: are you sure you aren't vacuuming too aggressively?
The DELETE waiting and SELECT waiting sound to me like they are waiting
for a lock that another vacuum is holding. You've said that you spawn a
vacuum process by cron every minute; that may be fine when the server
isn't too loaded, but when vacuums begin to take longer, cron will spawn
more and more vacuums, killing your machine.
I'd add a little check in the cron script to see whether there isn't already
a vacuum process running, or use an independent script:
while true
do
    vacuumdb --all
    sleep 60
done
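Or, as a cron-friendly variant, a small wrapper that skips the run
when the previous vacuum is still going (a sketch; the lock file path
is an assumption and the check is not race-free):
#!/bin/sh
# skip this run if the previous vacuumdb is still alive
LOCK=/var/run/vacuumdb.pid
if [ -f "$LOCK" ] && kill -0 "$(cat "$LOCK")" 2>/dev/null; then
    exit 0
fi
echo $$ > "$LOCK"
vacuumdb --all
rm -f "$LOCK"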
[...]
HTH,
--
MaXX
2006/10/1, Chris Mair <chrisnospam@1006.org>:
Hi,
a few random questions...
I have a Tomcat application with PostgreSQL 8.1.4 running which
performs about 10000 inserts/deletes and updates on a database every
2-4 minutes, and after some hours of load testing the top output says
0.0% idle, 6-7% system load, load average 32, 31, 28
Who is responsible for these high load values?
Do you see many postmaster processes in top?
Yes, there are about 10 postmaster processes in top which eat up all
of the CPU cycles in equal parts.
and there are
many exceptions at statement execution like:
An I/O error occurred while sending to the backend.
Is this from the application?
Yes, this happens when the SQL statements are executed.
How many connections does the application open in parallel?
30-40; one Tomcat runs on the same machine as PostgreSQL, another runs
on the other cluster and uses the database over the network.
Using JDBC, I guess?
Yes, postgresql-8.1-407.jdbc3 is used.
- Each Cluster has 4 drbd Devices, one for the PostgreSQL data
- Two of these clusters are using the same PostgreSQL installation to
share the data, the database can be moved from one cluster to the
other in case of failure
Just to be 100% sure: just one server at a time runs PostgreSQL on
that shared data disk, right?
One server of a cluster runs the database and Tomcat; the other
one just runs Tomcat, so they share this common database.
Bye,
Chris.
thx,
Peter
MaXX wrote:
[...]
Pure speculation: are you sure you aren't vacuuming too aggressively?
Have you tried autovacuum? You can manually tweak the settings so that
your hot-spot tables are vacuumed aggressively. One of the main upsides
to autovacuum is that it saves cycles, that is, it only vacuums when you
need it.
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
Yes, there are about 10 postmaster processes in top which eat up all
of the CPU cycles in equal parts.
What are these processes doing exactly --- can you show us the queries
they're executing? It might be worth attaching to a few of them with
gdb to get stack traces, to see if there's any commonality about the
traces:
$ gdb /path/to/postgres PID-of-backend
gdb> bt
gdb> quit
are you sure? y
$ <repeat with other backends, and/or same backend after a few sec>
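To sample several backends in one go, something like this should work
too (a sketch; it assumes pgrep is available and the backends run as
user postgres):
$ echo bt > /tmp/bt.cmds
$ for pid in $(pgrep -u postgres postmaster); do
>   echo "=== backend $pid ==="
>   gdb -batch -x /tmp/bt.cmds /path/to/postgres "$pid"
> done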
regards, tom lane
2006/10/1, MaXX <bs139412@skynet.be>:
Peter Bauer wrote:
[...]
There are 10-15 postmaster processes running which use all the CPU power.
A restart of tomcat and then postgresql results in the same situation.
Some postgres processes are in DELETE waiting or SELECT waiting.
VACUUM runs through in just about 1-2 seconds and is run via cron
every minute and reports that the FSM settings are high enough.
Pure speculation: are you sure you aren't vacuuming too aggressively?
The previous configuration was to vacuum every 10 minutes. This resulted
in 60-70% system load (hard disk, I suppose), so I thought more vacuums
would be a good idea.
It won't hurt to ensure that you will never have 2 vacuum processes
running at the same time...
The DELETE waiting and SELECT waiting sound to me like they are waiting
for a lock that another vacuum is holding. You've said that you spawn a
vacuum process by cron every minute; that may be fine when the server
isn't too loaded, but when vacuums begin to take longer, cron will spawn
more and more vacuums, killing your machine.
When the postmaster processes eat up all CPU cycles, the vacuum still
only takes some seconds and there are no postgres VACUUM processes
hanging around, so I don't think they hamper each other.
I have the impression that the database "condition" is getting worse
over time, so the queries take longer and longer, and at some point the
I/O exceptions start because the data cannot be written to disk in
time. I just don't know how to pinpoint this bad "condition" or
whatever gets worse over time.
You can connect to your server with pgAdmin and go to "Tools -> Server
Status" to identify which process is causing trouble on a Sunday ;-)
(you may have to turn on some logging options in postgresql.conf). I find
it more friendly than psql, but it's a matter of taste.
You are running pg on a Linux platform; try systat and iostat. (I'm
running FreeBSD, I suppose those two are available on Linux.)
Long-running transactions are not your friends either... I once made a typo
in a script which prevented commits from happening... not good...
This was the first thing we checked when these problems appeared. The
only locking is done in functions, which should not be a problem. Of
course it is possible that there are bugs which prevent some
transactions from being committed, but it's a pretty big application, so
I would need some advice for finding and debugging such problems.
Is it possible to check which transactions are currently running,
which thread or Java process runs them, and for how long?
I will attach a logfile to the reply to Tom's mail which contains all
SQL statements executed from the start of the load test to the point
the problems began, so maybe you can find something in it.
thx,
Peter
2006/10/1, Matthew T. O'Connor <matthew@zeut.net>:
Autovacuum is enabled and runs in addition to the configured cron
vacuums. I will attach logfiles to the reply to Tom's mail, so please
have a look at whether autovacuum can be reconfigured to replace the
cron jobs.
thx,
Peter
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
Attached you can find the PostgreSQL logfiles and a logfile which
contains all SQL statements executed in the relevant time together
with the exceptions thrown. I also attached a file with all used
PL/pgSQL functions. Since we were not able to find a problem for so
long, I think it makes no sense to filter the information because we
are probably not looking at the right spots.
I hope the pgsql-general moderators are not going to approve an 8MB
message :-(. I'll hang onto my copy though, just in case it proves
useful.
It might be worth attaching to a few of them with
gdb to get stack traces,
Should this be done when the problems are appearing, during normal
operation, or both to compare the results?
When the problems are appearing.
regards, tom lane
2006/10/2, Tom Lane <tgl@sss.pgh.pa.us>:
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
Attached you can find the PostgreSQL logfiles and a logfile which
contains all SQL statements executed in the relevant time together
with the exceptions thrown. I also attached a file with all used
PL/pgSQL functions. Since we were not able to find a problem for so
long, I think it makes no sense to filter the information because we
are probably not looking at the right spots.
I hope the pgsql-general moderators are not going to approve an 8MB
message :-(. I'll hang onto my copy though, just in case it proves
useful.
Sorry, I put the files on http://dagobert.apus.co.at/pglogfiles/ so
please use them.
It might be worth attaching to a few of them with
gdb to get stack traces,
Should this be done when the problems are appearing, during normal
operation, or both to compare the results?
When the problems are appearing.
OK, I will try to reproduce it and post the stack traces.
thx,
Peter
On Sun, Oct 01, 2006 at 12:55:51PM +0200, MaXX wrote:
Pure speculation: are you sure you aren't vacuuming too aggressively?
The DELETE waiting and SELECT waiting sound to me like they are waiting
for a lock that another vacuum is holding.
How would one determine the lock situation definitively? Is there
an internal mechanism that can be queried?
Ray Stell <stellr@cns.vt.edu> writes:
How would one determine the lock situation definitively? Is there
an internal mechanism that can be queried?
pg_locks view.
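For instance, to see who is waiting and on what (a sketch for 8.1;
"mydb" is a placeholder, and current_query is only populated when
stats_command_string is enabled):
$ psql -d mydb -c "
    SELECT l.pid, l.relation::regclass AS relation, l.mode, l.granted,
           a.current_query
    FROM pg_locks l
    LEFT JOIN pg_stat_activity a ON a.procpid = l.pid
    WHERE NOT l.granted;"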
regards, tom lane
Hi all,
inspired by the last posting, "Weird disk write load caused by
PostgreSQL?", I increased work_mem from 1 MB to 7 MB and did some
load testing with vacuum every 10 minutes. The system load (hard disk)
went down and everything was very stable at 80% idle for nearly 24 hours!
I am currently performing some pgbench runs to evaluate the hardware
and configuration for the system, but I think the biggest problems are
solved so far.
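For reference, the change amounts to this in postgresql.conf (8.1
takes work_mem in kB and accepts no unit suffixes), followed by a
reload; the data directory path is just an example:
work_mem = 7168        # 7 MB; the default is 1024
$ pg_ctl reload -D /var/lib/postgresql/8.1/main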
thx everybody,
Peter
It appears to me that work_mem is a more significant configuration
option than previously assumed by many PostgreSQL users, myself
included. As with many database optimizations, it's an obscure
problem to diagnose because you generally only observe it through I/O
activity.
One possibility would be to log a warning whenever work_mem is
exceeded (or exceeded by a certain ratio). I would also love a couple
of new statistics counters tracking the amount of work memory used
and the amount of work memory that has spilled over into pgsql_tmp.
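Until such counters exist, a crude way to watch for spill on 8.1 is
to poll the per-database pgsql_tmp directories while the workload
runs (the data directory path is an assumption):
$ watch -n 5 'du -sh /var/lib/postgresql/8.1/main/base/*/pgsql_tmp 2>/dev/null'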
Alexander.
I finished the little benchmarking on our server and the results are
quite curious.
With the numbers from http://sitening.com/tools/postgresql-benchmark/
in mind i did
./pgbench -i pgbench
and then performed some pgbench tests, for example
./pgbench -c 1 -t 1000 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 50.703609 (including connections establishing)
tps = 50.709265 (excluding connections establishing)
So our server with two 3.00 GHz Xeon CPUs and 2GB RAM has about 5% of the
performance of the server described in the article!
I did some tests on a Xen machine running on my workstation and the
results are about 400-500 tps, which seems quite reasonable.
I also tried to disable drbd and put the data directory elsewhere, but
the performance was the same.
any ideas?
thx,
Peter
I forgot to mention that top does not show a noticeable increase of
CPU or system load during the pgbench runs (postmaster has 4-8% CPU).
Shouldn't the machine be busy during such a test?
thx,
Peter
If you are on Linux, I recommend iostat(1) and vmstat(8) over top.
iostat will report I/O transfer statistics; it's how I discovered
that work_mem buffers were spilling over to disk files. For vmstat,
look in particular at the load (i.e., how many processes are competing
for the scheduler) in the first field ("r") and how many processes
are blocked by I/O waits ("b").
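Concretely (iostat comes with the sysstat package on Debian):
$ vmstat 5           # watch the r and b columns
$ iostat -x 5        # per-device stats; look at await and %util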
Alexander.
It seems that the machine doesn't really care about the pgbench run. I did a
pgbench -c 10 -t 10000 -s 10 pgbench
and here is the output of vmstat 1 100, which was started some
seconds before pgbench:
vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 10236 1240952 111324 553908 0 0 3 2 1 3 5 3 92 0
0 0 10236 1240944 111324 553908 0 0 0 0 167 48 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 20 169 52 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 44 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 164 44 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 47 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 47 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 164 49 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 40 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 41 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 45 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 48 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 164 42 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 45 0 0 100 0
0 0 10236 1240936 111332 553908 0 0 0 0 165 41 0 0 100 0
0 2 10236 1237688 111332 550256 0 0 0 10976 449 1793 13 1 86 0
0 2 10236 1237688 111332 550256 0 0 0 1928 345 3206 2 1 97 0
0 2 10236 1237680 111340 550256 0 0 0 1969 352 3012 4 1 95 0
0 2 10236 1237600 111340 550336 0 0 0 2096 353 2986 2 0 98 0
0 2 10236 1237608 111340 550336 0 0 0 1932 351 2985 1 2 97 0
0 2 10236 1237600 111340 550336 0 0 0 1836 350 3097 4 1 95 0
0 2 10236 1237600 111340 550336 0 0 0 1852 353 2971 3 1 95 0
4 1 10236 1237600 111340 550336 0 0 0 1975 372 2682 3 0 97 0
0 2 10236 1237608 111340 550336 0 0 0 2056 372 2458 2 1 96 0
0 2 10236 1237600 111340 550336 0 0 0 2028 377 2360 1 1 98 0
0 2 10236 1237588 111340 550336 0 0 0 2000 372 2630 3 1 95 0
0 2 10236 1237588 111340 550336 0 0 0 2044 372 2326 3 0 97 0
0 2 10236 1237588 111340 550336 0 0 0 1976 372 2171 2 0 98 0
0 2 10236 1237588 111340 550336 0 0 0 1972 383 2275 4 1 95 0
1 2 10236 1237588 111340 550336 0 0 0 1924 382 2500 3 1 95 0
0 2 10236 1237588 111340 550336 0 0 0 1804 372 2798 3 1 96 0
0 2 10236 1237588 111340 550336 0 0 0 1900 374 2974 6 2 92 0
0 2 10236 1237588 111340 550336 0 0 0 1820 361 2871 2 2 95 0
0 2 10236 1237576 111340 550336 0 0 0 1876 366 2762 4 1 95 0
0 2 10236 1237576 111340 550336 0 0 0 1904 370 2724 3 0 96 0
0 2 10236 1237576 111340 550336 0 0 0 1972 378 2585 6 0 93 0
6 1 10236 1237576 111340 550336 0 0 0 1800 371 2838 1 1 98 0
0 2 10236 1237576 111340 550336 0 0 0 1792 362 2826 3 1 96 0
0 2 10236 1237576 111340 550344 0 0 0 1804 362 3068 3 0 96 0
1 1 10236 1237560 111340 550360 0 0 0 1936 373 2718 2 1 96 0
6 1 10236 1237552 111340 550360 0 0 0 1788 365 2447 4 1 95 0
0 1 10236 1237552 111340 550360 0 0 0 1804 368 2362 5 2 93 0
0 2 10236 1237544 111340 550368 0 0 0 1908 370 2434 3 4 93 0
0 2 10236 1237544 111340 550368 0 0 0 1848 369 2360 4 1 94 0
0 2 10236 1237504 111340 550408 0 0 0 1796 358 2655 3 1 96 0
0 2 10236 1237496 111340 550416 0 0 0 1988 374 2491 4 1 95 0
0 2 10236 1237488 111340 550424 0 0 0 1960 372 2111 2 1 97 0
0 2 10236 1237488 111340 550424 0 0 0 1760 360 2433 4 1 95 0
0 2 10236 1237488 111340 550424 0 0 0 1944 374 2064 2 1 97 0
0 2 10236 1237496 111340 550424 0 0 0 1868 373 2169 3 0 97 0
4 1 10236 1237476 111340 550432 0 0 0 1868 372 2170 3 1 96 0
I did the same on the Xen machine:
vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 4 3964 5916 99288 0 0 1 21 34 53 0 0 99 0
0 0 4 3964 5916 99288 0 0 0 0 50 20 0 0 100 0
0 0 4 3964 5916 99288 0 0 0 0 38 20 0 0 100 0
0 0 4 3964 5916 99288 0 0 0 0 33 18 0 0 100 0
0 0 4 3964 5916 99288 0 0 0 0 28 20 0 1 99 0
0 0 4 3964 5924 99288 0 0 0 28 50 25 0 0 100 0
0 0 4 3964 5924 99288 0 0 0 0 24 19 0 0 100 0
0 0 4 3968 5924 99288 0 0 0 0 37 20 0 0 100 0
0 0 4 3968 5924 99288 0 0 0 0 50 24 0 0 100 0
0 0 4 3772 5924 99292 0 0 0 0 33 29 0 0 100 0
0 0 4 3748 5924 99292 0 0 0 0 31 19 0 0 100 0
0 0 4 3744 5924 99292 0 0 0 0 43 25 0 0 100 0
1 0 12 2560 5496 100048 0 0 44 17004 1896 812 26 10 56 8
2 0 16 2388 5080 95152 0 0 180 25788 2505 11372 54 19 2 25
1 1 16 2576 5084 94956 0 0 0 6824 1215 23437 50 19 1 31
2 0 16 2668 5084 94872 0 0 0 6588 1188 22923 51 12 2 35
1 1 16 2600 5088 94840 0 0 0 8664 1701 22326 51 10 2 37
0 1 16 2432 5100 94968 0 0 0 5492 1183 9985 26 6 2 66
0 1 16 2464 5112 95048 0 0 0 2404 495 5670 18 5 1 76
1 0 16 2596 5112 94980 0 0 32 6036 1082 21986 42 16 2 41
1 0 16 2244 5124 95268 0 0 0 7740 1526 20645 37 14 2 48
2 1 16 2540 5108 95064 0 0 0 7016 1343 18769 46 12 2 41
0 1 16 2752 5108 94668 0 0 0 5244 1165 8660 16 6 1 77
0 2 16 2780 5116 94668 0 0 8 648 80 95 0 1 0 99
4 0 16 2736 5140 94716 0 0 0 1160 363 2556 9 2 1 88
1 0 16 2268 5148 95036 0 0 8 5112 853 21498 67 13 2 18
2 1 16 2788 5048 94676 0 0 8 7876 1535 21278 65 12 2 21
0 1 16 2764 5060 94788 0 0 0 5372 1203 7024 21 6 1 72
1 0 16 2648 5076 94932 0 0 12 3112 596 10241 24 9 2 65
1 0 16 2728 5056 94772 0 0 24 6152 1142 19822 47 16 4 34
1 0 16 2504 5068 95068 0 0 0 7196 1387 20474 50 17 2 32
1 1 16 2652 5064 94844 0 0 8 7108 1018 17572 56 8 3 33
0 1 16 2568 5076 94916 0 0 0 4460 1003 5825 14 1 1 84
0 1 16 2572 5076 94924 0 0 0 704 52 90 0 0 2 98
1 0 16 2300 5096 95116 0 0 0 3688 762 14174 52 9 2 37
1 1 16 2436 5080 95080 0 0 0 7256 1407 19964 66 21 1 12
1 0 16 2640 5092 94904 0 0 0 6504 1223 19809 69 15 1 15
0 1 16 2528 5104 94960 0 0 0 4964 1165 4573 18 3 1 78
1 1 16 2332 5116 95064 0 0 0 2492 521 7197 15 5 1 79
1 1 16 2580 5080 94900 0 0 0 5076 863 19775 60 14 2 24
2 0 16 2728 5020 94732 0 0 0 7636 1533 19246 51 15 1 33
0 0 16 2484 5032 94980 0 0 0 6068 1200 16340 50 16 2 32
1 0 16 2316 5044 95036 0 0 0 3940 934 3570 10 5 3 82
0 2 16 2788 5024 94628 0 0 0 640 63 46 0 0 0 100
0 1 16 2876 5060 94636 0 0 0 1212 402 602 2 0 3 95
1 1 16 2580 5072 94876 0 0 0 6440 1255 17866 68 11 2 19
1 0 16 2636 5084 94788 0 0 0 7188 1405 18753 66 16 1 17
0 1 16 2580 5084 94828 0 0 0 544 116 2877 12 1 0 87
2 1 16 2536 5104 94908 0 0 0 2968 656 5413 26 5 2 67
The only difference I see is that there is always at least one process
waiting for I/O.
thx,
Peter
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
tps = 50.703609 (including connections establishing)
tps = 50.709265 (excluding connections establishing)
That's about what you ought to expect for a single transaction stream
running on honest disk hardware (ie, disks that don't lie about write
complete). You can't commit a transaction more often than once per disk
revolution, because you have to wait for the current WAL file endpoint
to pass under the heads again. If there are multiple clients then
"ganging" concurrent commits is possible, but you tested only one.
The benchmark you reference might have been done on disks with
battery-backed write cache. Or it might have been just plain unsafe (ie, the
equivalent of fsync off, but in hardware :-()
regards, tom lane