Removing pgsql_tmp files

Started by Michael Glaesemannabout 15 years ago6 messages
#1Michael Glaesemann
michael.glaesemann@myyearbook.com

We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.

select pg_postmaster_start_time();
pg_postmaster_start_time
-------------------------------
2010-08-08 22:53:31.999804-04
(1 row)

I'm not sure why the temp files aren't being cleaned up. I can confirm we haven't had a backend crash in at least a week (from inspection of the log files). The oldest backend goes back to the end of October:

production=# select current_timestamp, min(backend_start) from pg_stat_activity;
now | min
-------------------------------+-------------------------------
2010-11-08 15:23:25.331311-05 | 2010-10-27 05:51:02.707688-04
(1 row)

To confirm which files are no longer being used, I've come up with the following query (where 16384 is the tablespace in question):

SELECT filename, pid, (pg_stat_file(dir || '/' || filename)).modification AS modified_at
FROM (SELECT CAST('pg_tblspc/16384/pgsql_tmp' AS TEXT) as dir, filename,
CAST(regexp_replace(filename, $r$^pgsql_tmp(\d+)\..*$$r$, $rr$\1$rr$, 'g') AS INT) as pid
FROM pg_ls_dir('pg_tblspc/16384/pgsql_tmp') AS the (filename)) AS temp_files (dir, filename, pid)
LEFT JOIN pg_stat_activity ON procpid = pid
WHERE procpid IS NULL
ORDER BY modified_at;

filename | pid | modified_at
-------------------+-------+------------------------
pgsql_tmp29522.1 | 29522 | 2010-08-09 00:26:42-04
pgsql_tmp31962.0 | 31962 | 2010-08-09 00:45:41-04
pgsql_tmp29258.0 | 29258 | 2010-08-09 00:46:01-04
pgsql_tmp1478.0 | 1478 | 2010-08-09 00:46:16-04
pgsql_tmp1482.2 | 1482 | 2010-08-09 00:46:18-04
pgsql_tmp29267.1 | 29267 | 2010-08-09 01:02:34-04
...

pgsql_tmp21928.0 | 21928 | 2010-11-08 00:32:48-05
pgsql_tmp20825.0 | 20825 | 2010-11-08 02:17:51-05
pgsql_tmp878.0 | 878 | 2010-11-08 02:25:23-05
pgsql_tmp8064.0 | 8064 | 2010-11-08 03:47:26-05
pgsql_tmp31645.0 | 31645 | 2010-11-08 07:09:40-05
pgsql_tmp25245.0 | 25245 | 2010-11-08 09:33:24-05
pgsql_tmp2302.3 | 2302 | 2010-11-08 09:59:45-05
pgsql_tmp17961.0 | 17961 | 2010-11-08 11:29:12-05
(2685 rows)

Does this query look reasonable? What other things should I take into account before I start deleting files from the file system? Why might these files not be cleaned up on their own?

Cheers,
Michael Glaesemann
michael.glaesemann@myyearbook.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Glaesemann (#1)
Re: Removing pgsql_tmp files

Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:

We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.

That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.

Does this query look reasonable? What other things should I take into account before I start deleting files from the file system? Why might these files not be cleaned up on their own?

Personally, I'd not risk trying to match on PID; it should be sufficient
to delete anything with a timestamp older than the oldest active
backend. (Unless you've got some really long-lived sessions in
there...)

What PG version is this?

regards, tom lane

#3Michael Glaesemann
michael.glaesemann@myyearbook.com
In reply to: Tom Lane (#2)
Re: Removing pgsql_tmp files

On Nov 8, 2010, at 16:03 , Tom Lane wrote:

Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:

We've got over 250GB of files in a pgsql_tmp directory, some with modification timestamps going back to August 2010 when the server was last restarted.

That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.

We don't log all of them, but I checked those we did. It looks like it's happening when queries are timing out. I'm seeing this pattern pretty consistently:

temporary file + query
canceling statement due to statement timeout
second temp file

Here's a sample:

pid | 877
sess_id | 4ccf7257.36d
sess_line | 16
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.0", size 87184416

pid | 877
sess_id | 4ccf7257.36d
sess_line | 17
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | ERROR
sql_state | 57014
message | canceling statement due to statement timeout

pid | 877
sess_id | 4ccf7257.36d
sess_line | 18
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.434-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.1", size 5480448
query |

It looks like the pgsql_tmp877.1 file is cleaned up as it doesn't appear in the pgsql_tmp directory.

Personally, I'd not risk trying to match on PID; it should be sufficient
to delete anything with a timestamp older than the oldest active
backend. (Unless you've got some really long-lived sessions in
there...)

That's easily-enough determined from pg_stat_activity.

What PG version is this?

select version();
version
-------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
(1 row)

Michael Glaesemann
michael.glaesemann@myyearbook.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Glaesemann (#3)
Re: Removing pgsql_tmp files

Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:

On Nov 8, 2010, at 16:03 , Tom Lane wrote:

That's very peculiar. Do you keep query logs? It would be useful to
try to correlate the temp files' PIDs and timestamps with the specific
queries that must have created them.

We don't log all of them, but I checked those we did. It looks like it's happening when queries are timing out. I'm seeing this pattern pretty consistently:

temporary file + query
canceling statement due to statement timeout
second temp file

Here's a sample:

pid | 877
sess_id | 4ccf7257.36d
sess_line | 16
filename | pgsql_tmp877.0
accessed_at | 2010-09-15 12:14:45-04
modified_at | 2010-11-01 22:37:00-04
logged_at | 2010-11-01 22:37:01.412-04
error | LOG
sql_state | 00000
message | temporary file: path "pg_tblspc/16384/pgsql_tmp/pgsql_tmp877.0", size 87184416

Oh, so you've got log_temp_files enabled?

Hmm. If you look at FileClose() in fd.c, you'll discover that that
"temporary file" log message is emitted immediately before unlink'ing
the file. It looks pretty safe ... but, scratching around, I notice
that there's a CHECK_FOR_INTERRUPTS at the end of ereport(). So a
cancel that was caught by that exact CHECK_FOR_INTERRUPTS call could
provoke this symptom. The window for this is larger than it might seem
since the CHECK_FOR_INTERRUPTS could be responding to an interrupt that
came in sometime before that.

I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.

regards, tom lane

#5Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#4)
Re: Removing pgsql_tmp files

Excerpts from Tom Lane's message of lun nov 08 22:29:28 -0300 2010:

Hmm. If you look at FileClose() in fd.c, you'll discover that that
"temporary file" log message is emitted immediately before unlink'ing
the file. It looks pretty safe ... but, scratching around, I notice
that there's a CHECK_FOR_INTERRUPTS at the end of ereport(). So a
cancel that was caught by that exact CHECK_FOR_INTERRUPTS call could
provoke this symptom. The window for this is larger than it might seem
since the CHECK_FOR_INTERRUPTS could be responding to an interrupt that
came in sometime before that.

I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.

Would it work to put the removal inside a PG_CATCH block?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: Removing pgsql_tmp files

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Tom Lane's message of lun nov 08 22:29:28 -0300 2010:

I think we need to re-order the operations there to ensure that the
unlink will still happen if the ereport gets interrupted.

Would it work to put the removal inside a PG_CATCH block?

Well, that still begs the question of what to do exactly. After some
thought I believe the attached is the best fix.

regards, tom lane