Removing pgsql_tmp files
We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.
select pg_postmaster_start_time();
pg_postmaster_start_time
-------------------------------
2010-08-08 22:53:31.999804-04
(1 row)
I'm not sure why the temp files aren't being cleaned up. I can confirm we haven't had a backend crash in at least a week (from inspection of the log files). The oldest backend goes back to the end of October:
production=# select current_timestamp, min(backend_start) from pg_stat_activity;
now | min
-------------------------------+-------------------------------
2010-11-08 15:23:25.331311-05 | 2010-10-27 05:51:02.707688-04
(1 row)
To confirm which files are no longer being used, I've come up with the following query (where 16384 is the tablespace in question):
SELECT filename, pid, (pg_stat_file(dir || '/' || filename)).modification AS modified_at
FROM (SELECT CAST('pg_tblspc/16384/pgsql_tmp' AS TEXT) as dir, filename,
CAST(regexp_replace(filename, $r$^pgsql_tmp(\d+)\..*$$r$, $rr$\1$rr$, 'g') AS INT) as pid
FROM pg_ls_dir('pg_tblspc/16384/pgsql_tmp') AS the (filename)) AS temp_files (dir, filename, pid)
LEFT JOIN pg_stat_activity ON procpid = pid
WHERE procpid IS NULL
ORDER BY modified_at;
filename | pid | modified_at
-------------------+-------+------------------------
pgsql_tmp29522.1 | 29522 | 2010-08-09 00:26:42-04
pgsql_tmp31962.0 | 31962 | 2010-08-09 00:45:41-04
pgsql_tmp29258.0 | 29258 | 2010-08-09 00:46:01-04
pgsql_tmp1478.0 | 1478 | 2010-08-09 00:46:16-04
pgsql_tmp1482.2 | 1482 | 2010-08-09 00:46:18-04
pgsql_tmp29267.1 | 29267 | 2010-08-09 01:02:34-04
...
pgsql_tmp21928.0 | 21928 | 2010-11-08 00:32:48-05
pgsql_tmp20825.0 | 20825 | 2010-11-08 02:17:51-05
pgsql_tmp878.0 | 878 | 2010-11-08 02:25:23-05
pgsql_tmp8064.0 | 8064 | 2010-11-08 03:47:26-05
pgsql_tmp31645.0 | 31645 | 2010-11-08 07:09:40-05
pgsql_tmp25245.0 | 25245 | 2010-11-08 09:33:24-05
pgsql_tmp2302.3 | 2302 | 2010-11-08 09:59:45-05
pgsql_tmp17961.0 | 17961 | 2010-11-08 11:29:12-05
(2685 rows)
Does this query look reasonable? What other things should I take into account before I start deleting files from the file system? Why might these files not be cleaned up on their own?
Cheers,
Michael Glaesemann
michael.glaesemann@myyearbook.com
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.
That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.
Does this query look reasonable? What other things should I take into account before I start deleting files from the file system? Why might these files not be cleaned up on their own?
Personally, I'd not risk trying to match on PID; it should be sufficient
to delete anything with a timestamp older than the oldest active
backend. (Unless you've got some really long-lived sessions in
there...)
What PG version is this?
regards, tom lane
On Nov 8, 2010, at 16:03 , Tom Lane wrote:
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.
That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.
We don't log all of them, but I checked those we did. It looks like it's happening when queries are timing out. I'm seeing this pattern pretty consistently:
temporary file + query
canceling statement due to statement timeout
second temp file
Here's a sample:
pid | 877
sess_id | 4ccf7257.36d
sess_line | 16
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.0", size 87184416
pid | 877
sess_id | 4ccf7257.36d
sess_line | 17
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | ERROR
sql_state | 57014
message | canceling statement due to statement timeout
pid | 877
sess_id | 4ccf7257.36d
sess_line | 18
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.434-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.1", size 5480448
query |
It looks like the pgsql_tmp877.1 file is cleaned up as it doesn't appear in the pgsql_tmp directory.
Personally, I'd not risk trying to match on PID; it should be sufficient
to delete anything with a timestamp older than the oldest active
backend. (Unless you've got some really long-lived sessions in
there...)
That's easily-enough determined from pg_stat_activity.
What PG version is this?
select version();
version
-------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
(1 row)
Michael Glaesemann
michael.glaesemann@myyearbook.com
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
On Nov 8, 2010, at 16:03 , Tom Lane wrote:
That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.
We don't log all of them, but I checked those we did. It looks like it's happening when queries are timing out. I'm seeing this pattern pretty consistently:
temporary file + query
canceling statement due to statement timeout
second temp file
Here's a sample:
pid | 877
sess_id | 4ccf7257.36d
sess_line | 16
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.0", size 87184416
Oh, so you've got log_temp_files enabled?
Hmm. If you look at FileClose() in fd.c, you'll discover that that
"temporary file" log message is emitted immediately before unlink'ing
the file. It looks pretty safe ... but, scratching around, I notice
that there's a CHECK_FOR_INTERRUPTS at the end of ereport(). So a
cancel that was caught by that exact CHECK_FOR_INTERRUPTS call could
provoke this symptom. The window for this is larger than it might seem
since the CHECK_FOR_INTERRUPTS could be responding to an interrupt that
came in sometime before that.
I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.
regards, tom lane
Excerpts from Tom Lane's message of lun nov 08 22:29:28 -0300 2010:
Hmm. If you look at FileClose() in fd.c, you'll discover that that
"temporary file" log message is emitted immediately before unlink'ing
the file. It looks pretty safe ... but, scratching around, I notice
that there's a CHECK_FOR_INTERRUPTS at the end of ereport(). So a
cancel that was caught by that exact CHECK_FOR_INTERRUPTS call could
provoke this symptom. The window for this is larger than it might seem
since the CHECK_FOR_INTERRUPTS could be responding to an interrupt that
came in sometime before that.I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.
Would it work to put the removal inside a PG_CATCH block?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of lun nov 08 22:29:28 -0300 2010:
I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.
Would it work to put the removal inside a PG_CATCH block?
Well, that still begs the question of what to do exactly. After some
thought I believe the attached is the best fix.
regards, tom lane