pg_stat_tmp

Started by Tim Kane · over 12 years ago · 5 messages · general
#1 Tim Kane
tim.kane@gmail.com

Hi all,

The past few days I’ve been encountering the following error, followed by a
full db restart and recovery

2013-12-16 07:12:53 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device

This occurs at a time of moderate load, during the same set of operations
each morning.
Interestingly, when I execute this manually at any other time of day, the
process completes normally.

I presume that the pg_stat_tmp location is system-wide and likely is not
impacted by temp_tablespaces. The root partition, where PostgreSQL is
installed, does not have a lot of disk available (4GB).

My first instinct here is to symlink pg_stat_tmp to another disk with a
little more room to breathe. However, I’m surprised that pgstat.tmp would
grow to be so large in the first place; possibly there is something else at
play here.

I should note that this issue has only started occurring recently; no major
changes have been introduced.

Thanks for any advice on this.

Tim

#2 Magnus Hagander
magnus@hagander.net
In reply to: Tim Kane (#1)
Re: pg_stat_tmp

On Mon, Dec 16, 2013 at 2:57 PM, Tim Kane <tim.kane@gmail.com> wrote:

Hi all,

The past few days I’ve been encountering the following error, followed by
a full db restart and recovery

2013-12-16 07:12:53 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device

This occurs at a time of moderate load, during the same set of operations
each morning.
Interestingly, when I execute this manually at any other time of day, the
process completes normally.

I presume that the *pg_stat_tmp* location is system-wide and likely is
not impacted by *temp_tablespaces*
The root partition, where postgresql is installed does *not* have a lot
of disk available (4GB).

My first instinct here is to symlink *pg_stat_tmp* against another disk
with a little more room to breathe, however I’m surprised that pgstat.tmp
would grow to be so large in the first place – possibly there is something
else at play here.

I should note that this issue has only recently occurred, no major changes
have been introduced.

Thanks for any advice on this.

You can control the location of the file with the parameter
stats_temp_directory.

The size is most likely because the directory needs space for >1 copy of
the file, since it's written to a separate file and then rename()d over
the existing one for atomicity. That said, even with that it shouldn't grow
*that* large, unless you've got a huge number of tables. Certainly not into
the gigabyte range. So it could be that it's just the first indicator that
fires, but that there's actually something else using up your space. But
setting stats_temp_directory is a good first way to find that out.
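
For example, something along these lines (a sketch only; the path below is
an illustration, and whatever directory you use must already exist and be
writable by the server user):

# in postgresql.conf; stats_temp_directory can be changed with a reload, no restart needed
stats_temp_directory = '/mnt/scratch/pg_stat_tmp'

-- then reload the configuration from psql as a superuser
SELECT pg_reload_conf();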

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#3 Jeff Janes
jeff.janes@gmail.com
In reply to: Tim Kane (#1)
Re: pg_stat_tmp

On Mon, Dec 16, 2013 at 5:57 AM, Tim Kane <tim.kane@gmail.com> wrote:

Hi all,

The past few days I’ve been encountering the following error, followed by
a full db restart and recovery

2013-12-16 07:12:53 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device

Is that the only thing in the logs? pg_stat_tmp problems should not bring
down your database. But problems with pg_xlog running out of space
certainly can--but they should also be logged.

This occurs at a time of moderate load, during the same set of operations
each morning.
Interestingly, when I execute this manually at any other time of day, the
process completes normally.

I presume that the *pg_stat_tmp* location is system-wide and likely is
not impacted by *temp_tablespaces*
The root partition, where postgresql is installed does *not* have a lot
of disk available (4GB).

My first instinct here is to symlink *pg_stat_tmp* against another disk
with a little more room to breathe, however I’m surprised that pgstat.tmp
would grow to be so large in the first place – possibly there is something
else at play here.

We don't know how large it is getting! If pg_stat_tmp shares the same
partition as pg_xlog, base (as in the default configuration), and pg_log,
then any of those things could be filling up the partition, and pg_stat_tmp
could just be the canary, not the culprit.

Anyway, you don't need to use a symlink, you could just change
stats_temp_directory to point someplace else.

Your best bet is to run "du" or something similar to figure out where your
space is actually going.
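
For instance (a sketch; the data directory path below is an assumption about
a typical 9.2 install, so substitute your own):

# largest consumers inside the cluster directory
du -sh /var/lib/postgresql/9.2/main/* | sort -h
# and overall usage of the root partition
df -h /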

Cheers,

Jeff

#4 Tim Kane
tim.kane@gmail.com
In reply to: Jeff Janes (#3)
Re: pg_stat_tmp

Thanks Jeff, Magnus

Thanks for the suggestions.
This morning the same issue occurred, but this time it also complained that
it failed to write to pg_clog (previous days, the only failure was in
writing to pg_stat_tmp)

2013-12-17 07:13:04 GMT DETAIL: Could not write to file "pg_clog/0004" at
offset 212992: No space left on device.
2013-12-17 07:13:05 GMT ERROR: could not access status of transaction 0
2013-12-17 07:13:05 GMT DETAIL: Could not write to file "pg_clog/0004" at
offset 212992: No space left on device.
2013-12-17 07:13:06 GMT ERROR: could not access status of transaction 0
2013-12-17 07:13:06 GMT DETAIL: Could not write to file "pg_clog/0004" at
offset 212992: No space left on device.
2013-12-17 07:13:07 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:07 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:08 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:08 GMT LOG: could not write temporary statistics file
"pg_stat_tmp/pgstat.tmp": No space left on dev2013-12-17 07:25:15 GMT
WARNING: terminating connection because of crash of another server
process
2013-12-17 07:25:15 GMT DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-12-17 07:25:15 GMT HINT: In a moment you should be able to reconnect
to the database and repeat your command.
2013-12-17 07:25:16 GMT LOG: all server processes terminated;
reinitializing
2013-12-17 07:25:17 GMT LOG: database system was interrupted; last known up
at 2013-12-17 07:08:22 GMT
2013-12-17 07:25:17 GMT LOG: database system was not properly shut down;
automatic recovery in progress
2013-12-17 07:25:17 GMT LOG: redo starts at 5BC/CF3F9888
2013-12-17 07:25:18 GMT LOG: could not open file
"pg_xlog/00000001000005BC000000D6" (log file 1468, segment 214): No such
file or directory
2013-12-17 07:25:18 GMT LOG: redo done at 5BC/D5FFFEC0
2013-12-17 07:25:18 GMT LOG: last completed transaction was at log time
2013-12-17 07:25:12.989653+00
2013-12-17 07:25:19 GMT LOG: database system is ready to accept connections
2013-12-17 07:25:19 GMT LOG: autovacuum launcher started

It’s never failed to write to pg_xlog though, or at least it’s never
reported a failure.

For now, I’ve altered the pg_stat_tmp path and we’ll see what happens
tomorrow. Ideally though, yes, I’d like to catch this happening at runtime
and get a better picture of what’s blowing out here.

Further to this, however, I notice that the base directory is consuming more
disk than I would have expected it to (all our relations are stored in
tablespaces assigned to other disks).
Looking for the larger culprits, I note the following files consuming 4.4GB:

9.2/main/base/74641/2260957
9.2/main/base/74641/2260957.1
9.2/main/base/74641/2260957.2
9.2/main/base/74641/2260957.3
9.2/main/base/74641/2260957.4

I notice the timestamps are just shy of the time we experience that
pg_stat_tmp issue.

-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.1
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.2
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.3
-rw------- 1 postgres postgres 328466432 Dec 17 06:56 2260957.4

They appear to be orphaned relation files, though I’m not sure this is definitive:

clone=# select pg_relation_filepath(2260957);
pg_relation_filepath
----------------------

(1 row)

I probably would have thought these to be temporary files, except that I
have temp_tablespaces set elsewhere.

Are these truly orphaned files that PostgreSQL no longer needs (possibly due
to a crash)? Is there a mechanism to scan the PGDATA directory to locate
such files?
FYI, we’re running 9.2.4.

Thanks again,

Tim

#5 Jeff Janes
jeff.janes@gmail.com
In reply to: Tim Kane (#4)
Re: pg_stat_tmp

On Tue, Dec 17, 2013 at 3:25 AM, Tim Kane <tim.kane@gmail.com> wrote:

Thanks Jeff, Magnus

Thanks for the suggestions.
This morning the same issue occurred, but this time it also complained
that it failed to write to pg_clog (previous days, the only failure was in
writing to pg_stat_tmp)

...

2013-12-17 07:13:08 GMT LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp":
No space left on dev2013-12-17 07:25:15 GMT WARNING: terminating
connection because of crash of another server
process

...

It’s never failed to write pg_xlog’s though – or at least, never reported
a failure.

If your log file is on the same partition as all the other stuff, you are
probably losing log messages because there is no room to log them. The
mangled line above tends to support this. Logging stopped for 12 minutes,
until the crash freed up some space so it could resume.
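
A quick way to confirm that (a sketch; both paths are assumptions about a
typical install, so substitute your actual data and log directories):

df -h /var/lib/postgresql/9.2/main    # filesystem holding the data directory
df -h /var/log/postgresql             # filesystem holding the server log

If both report the same device at or near 100%, missing log lines around the
time of the crash are exactly what you'd expect.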

For now, I’ve altered the pg_stat_tmp path and we’ll see what happens
tomorrow – Ideally though, yes.. I’d like to see this happening at runtime
and get a better picture of what’s blowing out here.

Further to this however, I notice that the base directory is consuming
more disk than I would have expected it to (all our relations are stored in
tablespaces assigned to other disks).
Looking for the larger culprits, I note the following files consuming 4.4GB

9.2/main/base/74641/2260957
9.2/main/base/74641/2260957.1
9.2/main/base/74641/2260957.2
9.2/main/base/74641/2260957.3
9.2/main/base/74641/2260957.4

I notice the timestamps are just shy of the time we experience that
pg_stat_tmp issue.

-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.1
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.2
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.3
-rw------- 1 postgres postgres 328466432 Dec 17 06:56 2260957.4

They appear to be orphaned oid’s, though I’m not sure this is definitive:

clone=# select pg_relation_filepath(2260957);

The transaction that was making the table probably never committed, so its
entry never becomes visible. I don't know of any good way of finding out
what the name of an uncommitted object will eventually be, which is rather
annoying when trying to monitor massive loads that happen inside a large
transaction.

(And what you should really be doing is: select relname from pg_class where
relfilenode = 2260957. The relfilenode is often the same as the oid, but not
always.)
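
For the files you listed, that check would look something like this (a
sketch; run it in the database that owns base/74641, and note that an empty
result only suggests an orphan if no bulk load is creating the table at that
moment):

clone=# SELECT relname, relkind, pg_relation_size(oid) AS bytes
clone-# FROM pg_class
clone-# WHERE relfilenode = 2260957;

If nothing comes back outside your load window, the files are most likely
leftovers from the earlier crash; verify that, and have a backup in hand,
before removing anything from base/ by hand.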

Cheers,

Jeff