Strange behaviour under heavy load

Started by Dmitry Koterovabout 19 years ago13 messagesgeneral
Jump to latest
#1Dmitry Koterov
dmitry@koterov.ru

Hello.

I have found that sometimes heavy loaded PostgreSQL begins to run all the
queries slower than usual, sometimes - 5 and more times slower. I cannot
reprocude that, but symptoms are the following: queries work very fast for
5-10 minutes, and after that - significant slowdown (every query, even a
simple one, works 5-10 and ever more times slower), disk write activity
grows too (but not everytime - I don't know exactli if there is a direct
correlation). 2-3 seconds, then - performance restores back to normal.

Autovacuum is turned off.

Machine has 2 processors, huge memory, fast SCSI disks.

I understand that there is too less information. Please advice what to
monitor for better problem discovering.

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Dmitry Koterov (#1)
Re: Strange behaviour under heavy load

hmm,

looks like checkpoint ?
Check checkpoint settings in postgresql.conf.

Oleg
On Tue, 27 Mar 2007, Dmitry Koterov wrote:

Hello.

I have found that sometimes heavy loaded PostgreSQL begins to run all the
queries slower than usual, sometimes - 5 and more times slower. I cannot
reprocude that, but symptoms are the following: queries work very fast for
5-10 minutes, and after that - significant slowdown (every query, even a
simple one, works 5-10 and ever more times slower), disk write activity
grows too (but not everytime - I don't know exactli if there is a direct
correlation). 2-3 seconds, then - performance restores back to normal.

Autovacuum is turned off.

Machine has 2 processors, huge memory, fast SCSI disks.

I understand that there is too less information. Please advice what to
monitor for better problem discovering.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3Dmitry Koterov
dmitry@koterov.ru
In reply to: Oleg Bartunov (#2)
Re: Strange behaviour under heavy load

How to tune them?

Now I have:

checkpoint_segments = 5 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 900 # range 30-3600, in seconds
checkpoint_warning = 30 # in seconds, 0 is off

No checkpoint warnings in pgsql logs.

Show quoted text

On 3/27/07, Oleg Bartunov <oleg@sai.msu.su> wrote:

hmm,

looks like checkpoint ?
Check checkpoint settings in postgresql.conf.

Oleg
On Tue, 27 Mar 2007, Dmitry Koterov wrote:

Hello.

I have found that sometimes heavy loaded PostgreSQL begins to run all

the

queries slower than usual, sometimes - 5 and more times slower. I cannot
reprocude that, but symptoms are the following: queries work very fast

for

5-10 minutes, and after that - significant slowdown (every query, even a
simple one, works 5-10 and ever more times slower), disk write activity
grows too (but not everytime - I don't know exactli if there is a direct
correlation). 2-3 seconds, then - performance restores back to normal.

Autovacuum is turned off.

Machine has 2 processors, huge memory, fast SCSI disks.

I understand that there is too less information. Please advice what to
monitor for better problem discovering.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#4Dmitry Koterov
dmitry@koterov.ru
In reply to: Dmitry Koterov (#3)
Re: Strange behaviour under heavy load

New information.

When I run CHECKPOINT manually in psql console, effect is fully reproduced.
So - it is a checkpoint slowdown.

The quesion is - how to make CHECKPOINT work faster?

Show quoted text

On 3/27/07, Dmitry Koterov <dmitry@koterov.ru> wrote:

How to tune them?

Now I have:

checkpoint_segments = 5 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 900 # range 30-3600, in seconds
checkpoint_warning = 30 # in seconds, 0 is off

No checkpoint warnings in pgsql logs.

On 3/27/07, Oleg Bartunov <oleg@sai.msu.su> wrote:

hmm,

looks like checkpoint ?
Check checkpoint settings in postgresql.conf.

Oleg
On Tue, 27 Mar 2007, Dmitry Koterov wrote:

Hello.

I have found that sometimes heavy loaded PostgreSQL begins to run all

the

queries slower than usual, sometimes - 5 and more times slower. I

cannot

reprocude that, but symptoms are the following: queries work very fast

for

5-10 minutes, and after that - significant slowdown (every query, even

a

simple one, works 5-10 and ever more times slower), disk write

activity

grows too (but not everytime - I don't know exactli if there is a

direct

correlation). 2-3 seconds, then - performance restores back to normal.

Autovacuum is turned off.

Machine has 2 processors, huge memory, fast SCSI disks.

I understand that there is too less information. Please advice what to
monitor for better problem discovering.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru ),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/&lt;http://www.sai.msu.su/%7Emegera/&gt;
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#5Dmitry Koterov
dmitry@koterov.ru
In reply to: Dmitry Koterov (#4)
Re: Strange behaviour under heavy load

Disc write activity timeline looks like the following:

CHECKPOINT ->
Large disc write (30-40M/s), 1 second ->
4-5 seconds: no read and write activity at all, but server works slow ->
2-3 seconds: disc write activity about 4-5 M/s ->
restore normal speed

Show quoted text

On 3/28/07, Dmitry Koterov <dmitry@koterov.ru> wrote:

New information.

When I run CHECKPOINT manually in psql console, effect is fully
reproduced.
So - it is a checkpoint slowdown.

The quesion is - how to make CHECKPOINT work faster?

On 3/27/07, Dmitry Koterov <dmitry@koterov.ru> wrote:

How to tune them?

Now I have:

checkpoint_segments = 5 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 900 # range 30-3600, in seconds
checkpoint_warning = 30 # in seconds, 0 is off

No checkpoint warnings in pgsql logs.

On 3/27/07, Oleg Bartunov < oleg@sai.msu.su> wrote:

hmm,

looks like checkpoint ?
Check checkpoint settings in postgresql.conf.

Oleg
On Tue, 27 Mar 2007, Dmitry Koterov wrote:

Hello.

I have found that sometimes heavy loaded PostgreSQL begins to run

all the

queries slower than usual, sometimes - 5 and more times slower. I

cannot

reprocude that, but symptoms are the following: queries work very

fast for

5-10 minutes, and after that - significant slowdown (every query,

even a

simple one, works 5-10 and ever more times slower), disk write

activity

grows too (but not everytime - I don't know exactli if there is a

direct

correlation). 2-3 seconds, then - performance restores back to

normal.

Autovacuum is turned off.

Machine has 2 processors, huge memory, fast SCSI disks.

I understand that there is too less information. Please advice what

to

monitor for better problem discovering.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet ( www.astronet.ru
),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/&lt;http://www.sai.msu.su/%7Emegera/&gt;
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of
broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that
your
message can get through to the mailing list cleanly

#6Jaime Silvela
JSilvela@Bear.com
In reply to: Dmitry Koterov (#5)
unexpected data beyond EOF and character encoding

I'm doing dry runs to migrate my database from 8.1.3 to 8.2.3, with
pg_dumpall | gg_restore, using the executables from 8.2.3.
I'm seeing these messages

ERROR: unexpected data beyond EOF in block 4506 of relation
"coverage_test_val"
HINT: This has been seen to occur with buggy kernels; consider updating
your system.
CONTEXT: COPY coverage_test_val, line 378036: "533359 61566
2007-02-01 DELINQUENCY6 16 0.356509255634 0.5 7"
ERROR: unexpected data beyond EOF in block 414 of relation "rating"
HINT: This has been seen to occur with buggy kernels; consider updating
your system.
CONTEXT: COPY rating, line 56387: "840942 19548 2005-10-25
1 25 \N 1 \N"

The migration otherwise finishes successfully, but those two tables with
the EOF problem have count(*) = 0.

Can this really be a buggy kernel? Is there a list of known Linux
releases that are buggy? Of course, upgrading the kernel on a production
machine is the last thing I want to try...
And here's a theory: a while ago, someone installed pl/R on the database
to do statistics. Executing some pl/R functions would produce a crash of
all the connected clients, and I'd see "unexpected EOF" messages. I
immediately said we needed pl/R uninstalled, and the crashes stopped.
The encoding of my database is SQL_ASCII, which from what I've been
reading, is a non-encoding and will not perform any validation. Would it
be possible that some invalid characters crept into tables related to
those pl/R crashes? Would porting to UTF8 fix this?

Is there a way I can fix those two cases of unexpected EOF so I can at
least perform the upgrade?

Thank you,
Jaime

***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice. You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of: (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jaime Silvela (#6)
Re: unexpected data beyond EOF and character encoding

Jaime Silvela <JSilvela@Bear.com> writes:

I'm seeing these messages

ERROR: unexpected data beyond EOF in block 4506 of relation
"coverage_test_val"
HINT: This has been seen to occur with buggy kernels; consider updating
your system.

So what kernel are you running, on what hardware? Is the problem
reproducible if you drop the target database and start over? The
known cases of this problem have been under heavy concurrent update
load (multiple sessions trying to extend a table concurrently).
I don't entirely see how you could get it if the COPY is the only
thing writing that table ...

regards, tom lane

#8Jaime Silvela
JSilvela@Bear.com
In reply to: Tom Lane (#7)
Re: unexpected data beyond EOF and character encoding

The kernel is Linux 2.6.5
I'll drop and redo this evening after hours - the migration takes a
couple of hours. The source database is under medium load, with a couple
of heavy updates, but the target database is a clean 8.2.3 install, and
only being written to by pg_restore.
So I take it that this happens on the restore part, not the dump part then?

Thanks
Jaime

Tom Lane wrote:

Jaime Silvela <JSilvela@Bear.com> writes:

I'm seeing these messages

ERROR: unexpected data beyond EOF in block 4506 of relation
"coverage_test_val"
HINT: This has been seen to occur with buggy kernels; consider updating
your system.

So what kernel are you running, on what hardware? Is the problem
reproducible if you drop the target database and start over? The
known cases of this problem have been under heavy concurrent update
load (multiple sessions trying to extend a table concurrently).
I don't entirely see how you could get it if the COPY is the only
thing writing that table ...

regards, tom lane

***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice. You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of: (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jaime Silvela (#8)
Re: unexpected data beyond EOF and character encoding

Jaime Silvela <JSilvela@bear.com> writes:

The kernel is Linux 2.6.5

2.6.5.what (give us full uname -a output please)? On what hardware?
If memory serves, we had some reason to think that it only occurred on
specific 64-bit multi-CPU platforms, so I'm not just asking idly.
The error check was put in as a consequence of this thread:
http://archives.postgresql.org/pgsql-admin/2006-09/msg00250.php

So I take it that this happens on the restore part, not the dump part then?

It's definitely a write failure not a read failure. Per the code comments:

* We get here only in the corner case where we are trying to extend
* the relation but we found a pre-existing buffer marked BM_VALID.
* This can happen because mdread doesn't complain about reads beyond
* EOF --- which is arguably bogus, but changing it seems tricky ---
* and so a previous attempt to read a block just beyond EOF could
* have left a "valid" zero-filled buffer. Unfortunately, we have
* also seen this case occurring because of buggy Linux kernels that
* sometimes return an lseek(SEEK_END) result that doesn't account for
* a recent write. In that situation, the pre-existing buffer would
* contain valid data that we don't want to overwrite. Since the
* legitimate cases should always have left a zero-filled buffer,
* complain if not PageIsNew.

regards, tom lane

#10Jaime Silvela
JSilvela@Bear.com
In reply to: Tom Lane (#9)
Re: unexpected data beyond EOF and character encoding

Sorry, forgot about the hardware.
Here's the uname -a:
Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005
i686 i686 i386 GNU/Linux

From hwinfo --cpu
01: None 00.0: 10103 CPU
[Created at cpu.290]
Unique ID: rdCR.j8NaKXDZtZ6
Hardware Class: cpu
Arch: Intel
Vendor: "GenuineIntel"
Model: 15.4.3 "Intel(R) Xeon(TM) CPU 3.60GHz"
Features:
fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,lm,pni,monitor,ds_cpl,tm2,est,cid
Clock: 3600 MHz
Cache: 2048 kb
Units/Processor: 2
Config Status: cfg=new, avail=yes, need=no, active=unknown

From hwinfo --memory
01: None 00.0: 10102 Main Memory
[Created at memory.59]
Unique ID: rdCR.CxwsZFjVASF
Hardware Class: memory
Model: "Main Memory"
Memory Range: 0x00000000-0xfff5ba3f (rw)
Memory Size: 4 GB
Config Status: cfg=no, avail=yes, need=no, active=unknown

We're running on 32-bit mode for compatibility with some libraries.
How can I determine whether this is due to a buggy kernel?

Tom Lane wrote:

Jaime Silvela <JSilvela@bear.com> writes:

The kernel is Linux 2.6.5

2.6.5.what (give us full uname -a output please)? On what hardware?
If memory serves, we had some reason to think that it only occurred on
specific 64-bit multi-CPU platforms, so I'm not just asking idly.
The error check was put in as a consequence of this thread:
http://archives.postgresql.org/pgsql-admin/2006-09/msg00250.php

So I take it that this happens on the restore part, not the dump part then?

It's definitely a write failure not a read failure. Per the code comments:

* We get here only in the corner case where we are trying to extend
* the relation but we found a pre-existing buffer marked BM_VALID.
* This can happen because mdread doesn't complain about reads beyond
* EOF --- which is arguably bogus, but changing it seems tricky ---
* and so a previous attempt to read a block just beyond EOF could
* have left a "valid" zero-filled buffer. Unfortunately, we have
* also seen this case occurring because of buggy Linux kernels that
* sometimes return an lseek(SEEK_END) result that doesn't account for
* a recent write. In that situation, the pre-existing buffer would
* contain valid data that we don't want to overwrite. Since the
* legitimate cases should always have left a zero-filled buffer,
* complain if not PageIsNew.

regards, tom lane

***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice. You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of: (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jaime Silvela (#10)
Re: unexpected data beyond EOF and character encoding

Jaime Silvela <JSilvela@bear.com> writes:

Here's the uname -a:
Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005
i686 i686 i386 GNU/Linux

The previous thread says

SLES support helped upgrade our SLES 9, sp3, linux kernel from 2.6.5-7.244
to 2.6.5-7.282. Since that we haven't had any blocks of rows beign
re-written or blanked out by the kernel. The new kernel is handling the
wirtes much better.

While I can't say that his 2.6.5-7.244 is the same as yours, it does
seem a bit too close for comfort.

regards, tom lane

#12Jaime Silvela
JSilvela@Bear.com
In reply to: Tom Lane (#11)
Re: unexpected data beyond EOF and character encoding

Wow, thanks! I'll definitely get the Unix operations guys to investigate.
Do you know if this is tracked as an issue for Linux? I'd like some
reassurance that whatever release Unix Ops is able to upgrade to fixes
the bug.

I'll try to get one of our machines upgraded and test there, and will
let you know if I find anything on my part.

Thanks again
Jaime

Tom Lane wrote:

Jaime Silvela <JSilvela@bear.com> writes:

Here's the uname -a:
Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005
i686 i686 i386 GNU/Linux

The previous thread says

SLES support helped upgrade our SLES 9, sp3, linux kernel from 2.6.5-7.244
to 2.6.5-7.282. Since that we haven't had any blocks of rows beign
re-written or blanked out by the kernel. The new kernel is handling the
wirtes much better.

While I can't say that his 2.6.5-7.244 is the same as yours, it does
seem a bit too close for comfort.

regards, tom lane

***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice. You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of: (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jaime Silvela (#12)
Re: unexpected data beyond EOF and character encoding

Jaime Silvela <JSilvela@bear.com> writes:

Wow, thanks! I'll definitely get the Unix operations guys to investigate.
Do you know if this is tracked as an issue for Linux? I'd like some
reassurance that whatever release Unix Ops is able to upgrade to fixes
the bug.

I did some idle searching when we were looking at the previous report
but couldn't find any specific kernel patches about this. It's
presumably known on the kernel side but maybe not under the specific
"lseek reports stale EOF position" guise that we know it as. If you
manage to track it down please do post back here, because we'd all
like to know.

regards, tom lane