corrupted statistics file "pg_stat_tmp/pgstat.stat"

Started by Carl von Clausewitzover 13 years ago6 messagesgeneral
Jump to latest
#1Carl von Clausewitz
clausewitz45@gmail.com

Hi All,

I’ve just made a clean install for PostgreSQL 9.1 (postgresql-server-9.1.4, postgresql-contrib-9.1.4) on a FreeBSD 9 (FreeBSD 9.0-RELEASE-p3 FreeBSD 9.0-RELEASE-p3 #0: Tue Jun 12 02:52:29 UTC 2012 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64).

I’ve restored from TAR backup our databases, and everything looked fine. Without changing any setting in postgresql.conf (or in kernel settings) – only “track_counts=on”, after 2-3 days, I’m receiving huge number (~5-10 PCS in every second) of error messages like that in /var/log/postgresql.log:

*** Aug 15 06:27:26 eurodb postgres[77652]: [43-1] WARNING: corrupted statistics file "pg_stat_tmp/pgstat.stat"

The logs are going to the syslog, and there are no more Warning, Error, or other messages. If I do a full Vacuum on the database, and I restart the system, everything works fine, no more error messages, until 2-3 days – and everything starts from the beginning.

Two weeks ago, we just bought a new server, I’ve installed the same version of FreeBSD and PostgreSQL, and there is the same problem on that machine too – without any usage, or load, this is a test server for us now.

The disks (SATA3 HDD), and the controllers has been tested, they are OK. All the directories’ permissons are the defaults, pg_stat_temp is owned by pgsql:pgsql – so I think this is not a permission issue.

I’m trying to tune my config with pgtune, and I’ve made some changes on sysctl for the following:

kern.ipc.semmni="512"

kern.ipc.semmns="1024"

kern.ipc.semume="64"

kern.ipc.semmnu="512"

I will restart the machines today, and will see, what’s going on. Did anybody has seen this problem before? On our early version 9.0.8 PostgreSQL server, there wasn’t any kind of problem, but it was a FreeBSD 8.2. How can I investigate, what could be the problem? Kernel settings? Or any other idea?

Thank you in advance.

Best regards,

Csaba

#2Amitabh Kant
amitabhkant@gmail.com
In reply to: Carl von Clausewitz (#1)
Re: corrupted statistics file "pg_stat_tmp/pgstat.stat"

I run the same config (FreeBSD 9 with PG 9.1.x) on couple of servers, and
they seem to be working fine without any error messages. The only other
setting I have in my sysctl.conf is kern.maxfiles .

Amitabh

On Wed, Aug 15, 2012 at 1:27 PM, Carl von Clausewitz <clausewitz45@gmail.com

Show quoted text

wrote:

Hi All,****

** **

I’ve just made a clean install for PostgreSQL 9.1
(postgresql-server-9.1.4, postgresql-contrib-9.1.4) on a FreeBSD 9 (FreeBSD
9.0-RELEASE-p3 FreeBSD 9.0-RELEASE-p3 #0: Tue Jun 12 02:52:29 UTC 2012
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64). *
***

** **

I’ve restored from TAR backup our databases, and everything looked fine.
Without changing any setting in postgresql.conf (or in kernel settings) –
only “track_counts=on”, after 2-3 days, I’m receiving huge number (~5-10
PCS in every second) of error messages like that in /var/log/postgresql.log:
****

** **

*** Aug 15 06:27:26 eurodb postgres[77652]: [43-1] WARNING: corrupted
statistics file "pg_stat_tmp/pgstat.stat"****

** **

The logs are going to the syslog, and there are no more Warning, Error, or
other messages. If I do a full Vacuum on the database, and I restart the
system, everything works fine, no more error messages, until 2-3 days – and
everything starts from the beginning. ****

** **

Two weeks ago, we just bought a new server, I’ve installed the same
version of FreeBSD and PostgreSQL, and there is the same problem on that
machine too – without any usage, or load, this is a test server for us now.
****

** **

The disks (SATA3 HDD), and the controllers has been tested, they are OK.
All the directories’ permissons are the defaults, pg_stat_temp is owned by
pgsql:pgsql – so I think this is not a permission issue. ****

** **

I’m trying to tune my config with pgtune, and I’ve made some changes on
sysctl for the following: ****

** **

kern.ipc.semmni="512"****

kern.ipc.semmns="1024"****

kern.ipc.semume="64"****

kern.ipc.semmnu="512"****

** **

I will restart the machines today, and will see, what’s going on. Did
anybody has seen this problem before? On our early version 9.0.8 PostgreSQL
server, there wasn’t any kind of problem, but it was a FreeBSD 8.2. How can
I investigate, what could be the problem? Kernel settings? Or any other
idea? ****

** **

Thank you in advance. ****

** **

Best regards,****

Csaba****

#3Alban Hertroys
haramrae@gmail.com
In reply to: Carl von Clausewitz (#1)
Re: corrupted statistics file "pg_stat_tmp/pgstat.stat"

On 15 August 2012 09:57, Carl von Clausewitz <clausewitz45@gmail.com> wrote:

I’ve restored from TAR backup our databases, and everything looked fine.

What exactly is in that TAR backup? Is that a tar/gzipped PG dump or a
file-system snapshot?
If the latter:
- did you halt the database while creating the snapshot or at least
force a checkpoint?
- is the version of PG that you're restoring to exactly the same (same
CPU type, same major/minor version of PG) as the one that the backup
is from?

I suspect that you restored a database with corrupt statistics and
that those only get found out once the corrupted statistics are
needed.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Carl von Clausewitz (#1)
Re: corrupted statistics file "pg_stat_tmp/pgstat.stat"

"Carl von Clausewitz" <clausewitz45@gmail.com> writes:

I’ve restored from TAR backup our databases, and everything looked fine. Without changing any setting in postgresql.conf (or in kernel settings) – only “track_counts=on”, after 2-3 days, I’m receiving huge number (~5-10 PCS in every second) of error messages like that in /var/log/postgresql.log:
*** Aug 15 06:27:26 eurodb postgres[77652]: [43-1] WARNING: corrupted statistics file "pg_stat_tmp/pgstat.stat"

Huh. The stats collector process ought to rewrite that file fairly
often, so this suggests it's consistently failing to rewrite it.

You might take a look at what the file looks like after a day or so
of normal operation (eg, how big is it, how often does it get updated)
and then compare to what it looks like after the errors start.

Also, try strace'ing the stats collector process for a little while
(long enough to capture a stats file rewrite sequence) during normal
operation, and then again after the errors start.

I don't want to speculate too much in advance of the data, but I'm
wondering about a ulimit setting that limits how much data the stats
collector can write during its lifetime (ulimit -f or local equivalent).
That would eventually cause problems for any postgres process, but if
you did accidentally have one in place when starting the postmaster,
maybe the stats collector would be first to show symptoms.

regards, tom lane

#5Carl von Clausewitz
clausewitz45@gmail.com
In reply to: Alban Hertroys (#3)
Re: corrupted statistics file "pg_stat_tmp/pgstat.stat"

Hi,

I've made the backups with pg_dump, compressed, and there wasn't any kind of error messages. I've made the restoration with pgAdmin (but it also uses pg_dump for restore).

How can I check if the statistics are corrupt?

regards,
Csaba

-----Original Message-----
From: Alban Hertroys [mailto:haramrae@gmail.com]
Sent: Wednesday, August 15, 2012 3:33 PM
To: Carl von Clausewitz
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] corrupted statistics file "pg_stat_tmp/pgstat.stat"

On 15 August 2012 09:57, Carl von Clausewitz <clausewitz45@gmail.com> wrote:

I’ve restored from TAR backup our databases, and everything looked fine.

What exactly is in that TAR backup? Is that a tar/gzipped PG dump or a file-system snapshot?
If the latter:
- did you halt the database while creating the snapshot or at least force a checkpoint?
- is the version of PG that you're restoring to exactly the same (same CPU type, same major/minor version of PG) as the one that the backup is from?

I suspect that you restored a database with corrupt statistics and that those only get found out once the corrupted statistics are needed.

--
If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.

#6Carl von Clausewitz
clausewitz45@gmail.com
In reply to: Tom Lane (#4)
Re: corrupted statistics file "pg_stat_tmp/pgstat.stat"

Hi,

I've made the kernel changes that I wrote in my original e-mail, and I've created some additional logging (both csvlog and syslog), to gather more informations.

/boot/loader.conf:

kern.ipc.semmni="512"
kern.ipc.semmns="1024"
kern.ipc.semume="64"
kern.ipc.semmnu="512"

/etc/sysctl.cong:

kern.ipc.shmall=262144
kern.ipc.shmmax=1073742336
kern.ipc.semmap=256

pgTune made this config changes for me in /usr/local/pgsql/data/postgresql.conf (the server has 4GB RAM)
default_statistics_target = 50 # pgtune wizard 2012-08-15
maintenance_work_mem = 240MB # pgtune wizard 2012-08-15
constraint_exclusion = on # pgtune wizard 2012-08-15
checkpoint_completion_target = 0.9 # pgtune wizard 2012-08-15
effective_cache_size = 2816MB # pgtune wizard 2012-08-15
work_mem = 24MB # pgtune wizard 2012-08-15
wal_buffers = 8MB # pgtune wizard 2012-08-15
checkpoint_segments = 16 # pgtune wizard 2012-08-15
shared_buffers = 960MB # pgtune wizard 2012-08-15
max_connections = 80 # pgtune wizard 2012-08-15

After a day, the file is 412kb large. I've just installed strace, and I try to capture a 2-4 hours work, and check what is going on.

ulimit (& ulimit -f) output is unlimited.

I'll be back (:-)) within few days with the results. Thank you all the informations.

Regards,
Csaba

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 15, 2012 3:34 PM
To: Carl von Clausewitz
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] corrupted statistics file "pg_stat_tmp/pgstat.stat"

"Carl von Clausewitz" <clausewitz45@gmail.com> writes:

I’ve restored from TAR backup our databases, and everything looked fine. Without changing any setting in postgresql.conf (or in kernel settings) – only “track_counts=on”, after 2-3 days, I’m receiving huge number (~5-10 PCS in every second) of error messages like that in /var/log/postgresql.log:
*** Aug 15 06:27:26 eurodb postgres[77652]: [43-1] WARNING: corrupted statistics file "pg_stat_tmp/pgstat.stat"

Huh. The stats collector process ought to rewrite that file fairly often, so this suggests it's consistently failing to rewrite it.

You might take a look at what the file looks like after a day or so of normal operation (eg, how big is it, how often does it get updated) and then compare to what it looks like after the errors start.

Also, try strace'ing the stats collector process for a little while (long enough to capture a stats file rewrite sequence) during normal operation, and then again after the errors start.

I don't want to speculate too much in advance of the data, but I'm wondering about a ulimit setting that limits how much data the stats collector can write during its lifetime (ulimit -f or local equivalent).
That would eventually cause problems for any postgres process, but if you did accidentally have one in place when starting the postmaster, maybe the stats collector would be first to show symptoms.

regards, tom lane