cleanup temporary files after crash
Hi,
Bug #16427 mentioned that temporary files are not removed after a crash. I
have heard similar complaints from some customers. In the bug report [1], I
proposed a new GUC to control whether temporary files are removed after
crash recovery. The current behavior is only useful for debugging purposes.
It also has an undesirable consequence: you have to restart the service to
reclaim the space. Interrupting service continuity isn't always an option,
yet with limited disk space you may have no choice but to restart the
service.
The new GUC cleanup_temp_files_after_crash is marked as SIGHUP, so you can
change it without a service interruption (for example, turning it off to
debug a crash). The default value is on, which changes the current
behavior. Documentation and tests are included.
[1]: /messages/by-id/CAH503wB28N1382YReXWjqpqZE6iqaxERoZUqnf02eNOYs0cZOA@mail.gmail.com
Regards,
--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Control-temporary-files-removal-after-crash.patch (text/x-patch; charset=US-ASCII)
From bcd5dcc00cade3a80ee83f88b15f980ac7cdd0f5 Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.taveira@2ndquadrant.com>
Date: Mon, 25 May 2020 00:08:20 -0300
Subject: [PATCH] Control temporary files removal after crash
A new GUC cleanup_temp_files_after_crash controls whether temporary
files are removed after a crash. Successive crashes could result in
useless storage usage until the service is restarted, which can be a
problem on hosts with limited resources; requiring this manual
intervention is undesirable in some environments. The GUC is marked as
SIGHUP, so you don't have to interrupt the service to change it.
The default introduces a minor backward incompatibility: Postgres will
now reclaim temporary file space after a crash. The main reason is that
we favor service continuity over debugging.
---
doc/src/sgml/config.sgml | 17 ++
src/backend/postmaster/postmaster.c | 5 +
src/backend/storage/file/fd.c | 12 +-
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/postmaster/postmaster.h | 1 +
src/test/recovery/t/022_crash_temp_files.pl | 194 ++++++++++++++++++
7 files changed, 235 insertions(+), 5 deletions(-)
create mode 100644 src/test/recovery/t/022_crash_temp_files.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..9e6265afe1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9553,6 +9553,23 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cleanup-temp-files-after-crash" xreflabel="cleanup_temp_files_after_crash">
+ <term><varname>cleanup_temp_files_after_crash</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>cleanup_temp_files_after_crash</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set to on, <productname>PostgreSQL</productname> will automatically
+ remove temporary files after a backend crash. If disabled, successive
+ crashes (while queries are using temporary files) could result in
+ accumulation of useless files. However, in some circumstances (e.g.
+ debugging), you might want to examine these temporary files. It defaults to <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
<term><varname>data_sync_retry</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..c906f577f8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -248,6 +248,7 @@ bool Db_user_namespace = false;
bool enable_bonjour = false;
char *bonjour_name;
bool restart_after_crash = true;
+bool cleanup_temp_files_after_crash = true;
/* PIDs of special child processes; 0 when not running */
static pid_t StartupPID = 0,
@@ -4014,6 +4015,10 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
+ /* remove leftover temporary files after a crash */
+ if (cleanup_temp_files_after_crash)
+ RemovePgTempFiles();
+
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index bd72a87ee3..b5c87787a2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2992,11 +2992,13 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* remove any leftover files created by OpenTemporaryFile and any leftover
* temporary relation files created by mdcreate.
*
- * NOTE: we could, but don't, call this during a post-backend-crash restart
- * cycle. The argument for not doing it is that someone might want to examine
- * the temp files for debugging purposes. This does however mean that
- * OpenTemporaryFile had better allow for collision with an existing temp
- * file name.
+ * During post-backend-crash restart cycle, this routine could be called if
+ * cleanup_temp_files_after_crash GUC is enabled. Multiple crashes while
+ * queries are using temp files could result in useless storage usage that can
+ * only be reclaimed by a service restart. The argument against enabling it is
+ * that someone might want to examine the temporary files for debugging
+ * purposes. This does however mean that OpenTemporaryFile had better allow for
+ * collision with an existing temp file name.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..44f070cb66 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1358,6 +1358,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"cleanup_temp_files_after_crash", PGC_SIGHUP, ERROR_HANDLING_OPTIONS,
+ gettext_noop("Remove temporary files after backend crash."),
+ NULL
+ },
+ &cleanup_temp_files_after_crash,
+ true,
+ NULL, NULL, NULL
+ },
{
{"log_duration", PGC_SUSET, LOGGING_WHAT,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..a4c3439211 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -757,6 +757,8 @@
#exit_on_error = off # terminate session on any error?
#restart_after_crash = on # reinitialize after backend crash?
+#cleanup_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index babc87dfc9..90cc7ccb62 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -29,6 +29,7 @@ extern bool log_hostname;
extern bool enable_bonjour;
extern char *bonjour_name;
extern bool restart_after_crash;
+extern bool cleanup_temp_files_after_crash;
#ifdef WIN32
extern HANDLE PostmasterHandle;
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
new file mode 100644
index 0000000000..dee3783a0c
--- /dev/null
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -0,0 +1,194 @@
+#
+# Test cleanup of temporary files after a crash.
+#
+# Verify that temporary files are removed after a backend crash when
+# cleanup_temp_files_after_crash is enabled, and retained when it is
+# disabled.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+use Time::HiRes qw(usleep);
+
+plan tests => 9;
+
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(60);
+
+my $node = get_new_node('node_crash');
+$node->init();
+$node->start();
+
+# By default, PostgresNode doesn't restart after a crash.
+# Reduce work_mem so that the query below spills to a temporary file
+# even with only a few rows.
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET cleanup_temp_files_after_crash = on;
+ ALTER SYSTEM SET log_connections = 1;
+ ALTER SYSTEM SET work_mem = '64kB';
+ ALTER SYSTEM SET restart_after_crash = on;
+ SELECT pg_reload_conf();]);
+
+# create table, insert rows
+$node->safe_psql(
+ 'postgres',
+ q[CREATE TABLE tab_crash (a text);
+ INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+
+#
+# Test cleanup temporary files after crash
+#
+
+# Run psql, keeping session alive, so we have an alive backend to kill.
+my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+my $killme = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node->connstr('postgres')
+ ],
+ '<',
+ \$killme_stdin,
+ '>',
+ \$killme_stdout,
+ '2>',
+ \$killme_stderr,
+ $psql_timeout);
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+my $pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Kill with SIGKILL
+my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'no temporary files');
+
+#
+# Test old behavior (doesn't cleanup temporary file after crash)
+#
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET cleanup_temp_files_after_crash = off;
+ SELECT pg_reload_conf();]);
+
+# Restart psql session
+($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+$killme->run();
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+$pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, $$in-progress-before-sigkill$$, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Kill with SIGKILL
+$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files -- should be there
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(1), 'one temporary file');
+
+# Restart should remove the temporary files
+$node->restart();
+
+# Check the temporary files -- should be gone
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'temporary file was removed');
+
+$node->stop();
+
+# Pump until string is matched, or timeout occurs
+sub pump_until
+{
+ my ($proc, $stream, $untl) = @_;
+ $proc->pump_nb();
+ while (1)
+ {
+ last if $$stream =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ if (not $proc->pumpable())
+ {
+ diag("aborting wait: program died");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ $proc->pump();
+ }
+ return 1;
+}
--
2.20.1
Hi Euler,
On Wed, Oct 28, 2020 at 11:16:26AM -0300, Euler Taveira wrote:
Hi,
Bug #16427 mentioned that temporary files are not removed after a crash. I
heard similar complaints from some customers. In the bug report [1], I
proposed a new GUC to control whether temporary files are removed after a
crash recovery. The current behavior is only useful for debugging purposes.
It also has an undesirable behavior: you have to restart the service to
reclaim the space. Interrupting the service continuity isn't always an
option and due to limited resources, you have no choice but to restart the
service.
The new GUC cleanup_temp_files_after_crash is marked as SIGHUP. Hence, you
can enable it to debug without service interruption. The default value is
on which means it changes the current behavior. Documentation and tests are
included.
I did a quick review and the patch seems fine to me. Let's wait for a
bit and see if there are any objections - if not, I'll get it committed
in the next CF.
One thing I'm not sure about is whether we should have the GUC as
proposed, or have a negative "keep_temp_files_after_restart" defaulting
to false. But I don't have a very good justification for the alternative
other than vague personal preference.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 28 Oct 2020 at 15:42, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I did a quick review and the patch seems fine to me. Let's wait for a
bit and see if there are any objections - if not, I'll get it committed
in the next CF.
Tomas, thanks for your review.
One thing I'm not sure about is whether we should have the GUC as
proposed, or have a negative "keep_temp_files_after_restart" defaulting
to false. But I don't have a very good justification for the alternative
other than vague personal preference.
I thought about not providing a GUC at all, or providing it in the
developer section. I've never heard of someone using those temporary files
to investigate an issue. Regarding a crash, all information is already
available, and temporary files don't provide extra details. This new GUC
is just to keep the previous behavior. I'm fine without the GUC, though.
--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
I thought about not providing a GUC at all or provide it in the developer
section. I've never heard someone saying that they use those temporary
files to investigate an issue. Regarding a crash, all information is already
available and temporary files don't provide extra details. This new
GUC is just to keep the previous behavior. I'm fine without the GUC, though.
The original behavior is as old as 4a5f38c4, and last time we talked
about that there were arguments for keeping the existing behavior of not
cleaning up files during a restart-after-crash scenario, for the sake of
being useful just "in case". I have never used that property myself,
TBH, and I have seen many more cases of users caring about the data
folder not running into ENOSPC, particularly if they don't use separate
partitions for pg_wal/ and the main data folder.
--
Michael
On 01.11.2020 04:25, Michael Paquier wrote:
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
I thought about not providing a GUC at all or provide it in the developer
section. I've never heard someone saying that they use those temporary
files to investigate an issue. Regarding a crash, all information is already
available and temporary files don't provide extra details. This new
GUC is just to keep the previous behavior. I'm fine without the GUC, though.
The original behavior is as old as 4a5f38c4, and last time we talked
about that there were arguments about still keeping the existing
behavior to not cleanup files during a restart-after-crash scenario
for the sake of being useful just "in case". I have never used that
property myself, TBH, and I have seen much more cases of users caring
about the data folder not facing an ENOSPC particularly if they don't
use different partitions for pg_wal/ and the main data folder.
--
Michael
Thank you, Euler for submitting this.
+1 for the feature. One of the platforms we support uses temp files a
lot, and we have faced this problem, while never actually using these
orphan files for debugging purposes.
I also think that the GUC is not needed here. This 'feature' was
internal from the very beginning, so users shouldn't care about
preserving old behavior. Without the GUC the patch is very simple;
please see the attached version. I also omitted the test, because I am
not sure it would be stable, given that RemovePgTempFiles() allows the
possibility of failure.
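Since cleanup is best-effort, a leftover file can survive under the very name a new backend wants to use, so the temp-file open must tolerate collisions. A minimal sketch of that property (invented names; not PostgreSQL's actual OpenTemporaryFile): opening with O_CREAT | O_TRUNC quietly reuses and empties a stale file, where O_EXCL would instead fail:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open a temporary file, tolerating a pre-existing (stale) file of
 * the same name: O_TRUNC discards any leftover contents from a
 * crashed backend instead of failing as O_EXCL would. Returns the
 * file descriptor, or -1 on error. */
int
open_temp_tolerating_collision(const char *path)
{
	return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}
```

With this approach, a second open of the same path succeeds and the file comes back empty, which is exactly the behavior needed when leftover temp files may still be lying around.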
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001_remove_temporary_files_after_crash.patch (text/x-patch; charset=UTF-8)
commit 417581b287f4060178595cd96465adc639639290
Author: anastasia <a.lubennikova@postgrespro.ru>
Date: Thu Nov 26 11:45:05 2020 +0300
Remove temporary files after a backend crash
Successive crashes could result in useless storage usage until the
service is restarted, which can be a problem on hosts with limited
resources; requiring this manual intervention is undesirable in some
environments.
The only reason we kept temp files was that someone might want to examine
them for debugging purposes. With this patch we favor service continuity
over debugging.
Author: Euler Taveira
Reviewer: Anastasia Lubennikova
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b7799ed1d24..a151250734f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3999,6 +3999,9 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
+ /* remove leftover temporary files after a crash */
+ RemovePgTempFiles();
+
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 05abcf72d68..d30bfc28541 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2992,15 +2992,15 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* remove any leftover files created by OpenTemporaryFile and any leftover
* temporary relation files created by mdcreate.
*
- * NOTE: we could, but don't, call this during a post-backend-crash restart
- * cycle. The argument for not doing it is that someone might want to examine
- * the temp files for debugging purposes. This does however mean that
- * OpenTemporaryFile had better allow for collision with an existing temp
- * file name.
+ * This is also called during a post-backend-crash restart cycle. The argument
+ * for doing it is that crashes of queries using temp files can result in
+ * useless storage usage that can only be reclaimed by a service restart.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
+ * Because this routine is not strict, OpenTemporaryFile allows for collision
+ * with an existing temp file name.
*/
void
RemovePgTempFiles(void)
On Thu, 26 Nov 2020 at 05:48, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
I also think that the GUC is not needed here. This 'feature' was
internal from the very beginning, so users shouldn't care about
preserving old behavior. Without the GUC the patch is very simple,
please see attached version. I also omit the test, because I am not sure
it will be stable given that the RemovePgTempFiles() allows the
possibility of failure.
Anastasia, thanks for reviewing it. As I said, I'm fine without the GUC.
However, if we decide to go with the GUC, the default behavior should be
to remove the temporary files after a crash.
Regards,
--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
I propose to rename the GUC to remove_temp_files_after_crash, I think
"remove" is a bit clearer than "cleanup". I've also reworded the sgml
docs a little bit.
Attached is a patch with those changes. Barring objections, I'll get
this committed in the next couple days.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Control-temporary-files-removal-after-crash-20210309.patch (text/x-patch; charset=UTF-8)
From 70cc3bf303585f1ffcecbace43d13d3d4f545f1b Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.taveira@2ndquadrant.com>
Date: Mon, 25 May 2020 00:08:20 -0300
Subject: [PATCH] Control temporary files removal after crash
A new GUC remove_temp_files_after_crash controls whether temporary
files are removed after a crash. Successive crashes could result in
useless storage usage until the service is restarted, which can be a
problem on hosts with limited resources; requiring this manual
intervention is undesirable in some environments. The GUC is marked as
SIGHUP, so you don't have to interrupt the service to change it.
The default introduces a minor backward incompatibility: Postgres will
now reclaim temporary file space after a crash. The main reason is that
we favor service continuity over debugging.
---
doc/src/sgml/config.sgml | 18 ++
src/backend/postmaster/postmaster.c | 5 +
src/backend/storage/file/fd.c | 12 +-
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/postmaster/postmaster.h | 1 +
src/test/recovery/t/022_crash_temp_files.pl | 194 ++++++++++++++++++
7 files changed, 236 insertions(+), 5 deletions(-)
create mode 100644 src/test/recovery/t/022_crash_temp_files.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..1b102960d6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9665,6 +9665,24 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-remove-temp-files-after-crash" xreflabel="remove_temp_files_after_crash">
+ <term><varname>remove_temp_files_after_crash</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>remove_temp_files_after_crash</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set to on, <productname>PostgreSQL</productname> will automatically
+ remove temporary files after a backend crash. When disabled, the temporary
+ files are retained after a crash, which may be useful in some circumstances
+ (e.g. during debugging). It may however result in accumulation of many
+ useless files, or possibly even running out of disk space.
+ It defaults to <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
<term><varname>data_sync_retry</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index edab95a19e..2338892c33 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -242,6 +242,7 @@ bool Db_user_namespace = false;
bool enable_bonjour = false;
char *bonjour_name;
bool restart_after_crash = true;
+bool remove_temp_files_after_crash = true;
/* PIDs of special child processes; 0 when not running */
static pid_t StartupPID = 0,
@@ -3975,6 +3976,10 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
+ /* remove leftover temporary files after a crash */
+ if (remove_temp_files_after_crash)
+ RemovePgTempFiles();
+
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..dfc3c2fe3e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3024,11 +3024,13 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* remove any leftover files created by OpenTemporaryFile and any leftover
* temporary relation files created by mdcreate.
*
- * NOTE: we could, but don't, call this during a post-backend-crash restart
- * cycle. The argument for not doing it is that someone might want to examine
- * the temp files for debugging purposes. This does however mean that
- * OpenTemporaryFile had better allow for collision with an existing temp
- * file name.
+ * During post-backend-crash restart cycle, this routine could be called if
+ * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
+ * queries are using temp files could result in useless storage usage that can
+ * only be reclaimed by a service restart. The argument against enabling it is
+ * that someone might want to examine the temp files for debugging purposes.
+ * This does however mean that OpenTemporaryFile had better allow for
+ * collision with an existing temp file name.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe2..64ac48b2e5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1361,6 +1361,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"remove_temp_files_after_crash", PGC_SIGHUP, ERROR_HANDLING_OPTIONS,
+ gettext_noop("Remove temporary files after backend crash."),
+ NULL
+ },
+ &remove_temp_files_after_crash,
+ true,
+ NULL, NULL, NULL
+ },
{
{"log_duration", PGC_SUSET, LOGGING_WHAT,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..25c1b89ffc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -759,6 +759,8 @@
#exit_on_error = off # terminate session on any error?
#restart_after_crash = on # reinitialize after backend crash?
+#remove_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index cfa59c4dc0..0efdd7c232 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -29,6 +29,7 @@ extern bool log_hostname;
extern bool enable_bonjour;
extern char *bonjour_name;
extern bool restart_after_crash;
+extern bool remove_temp_files_after_crash;
#ifdef WIN32
extern HANDLE PostmasterHandle;
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
new file mode 100644
index 0000000000..45a70a368e
--- /dev/null
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -0,0 +1,194 @@
+#
+# Test cleanup of temporary files after a crash.
+#
+# Verify that temporary files are removed after a backend crash when
+# remove_temp_files_after_crash is enabled, and retained when it is
+# disabled.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+use Time::HiRes qw(usleep);
+
+plan tests => 9;
+
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(60);
+
+my $node = get_new_node('node_crash');
+$node->init();
+$node->start();
+
+# By default, PostgresNode doesn't restart after a crash.
+# Reduce work_mem so that the query below spills to a temporary file
+# even with only a few rows.
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET remove_temp_files_after_crash = on;
+ ALTER SYSTEM SET log_connections = 1;
+ ALTER SYSTEM SET work_mem = '64kB';
+ ALTER SYSTEM SET restart_after_crash = on;
+ SELECT pg_reload_conf();]);
+
+# create table, insert rows
+$node->safe_psql(
+ 'postgres',
+ q[CREATE TABLE tab_crash (a text);
+ INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+
+#
+# Test cleanup temporary files after crash
+#
+
+# Run psql, keeping session alive, so we have an alive backend to kill.
+my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+my $killme = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node->connstr('postgres')
+ ],
+ '<',
+ \$killme_stdin,
+ '>',
+ \$killme_stdout,
+ '2>',
+ \$killme_stderr,
+ $psql_timeout);
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+my $pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Kill with SIGKILL
+my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'no temporary files');
+
+#
+# Test old behavior (doesn't cleanup temporary file after crash)
+#
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET remove_temp_files_after_crash = off;
+ SELECT pg_reload_conf();]);
+
+# Restart psql session
+($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+$killme->run();
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+$pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, $$in-progress-before-sigkill$$, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Kill with SIGKILL
+$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files -- should be there
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(1), 'one temporary file');
+
+# Restart should remove the temporary files
+$node->restart();
+
+# Check the temporary files -- should be gone
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'temporary file was removed');
+
+$node->stop();
+
+# Pump until string is matched, or timeout occurs
+sub pump_until
+{
+ my ($proc, $stream, $untl) = @_;
+ $proc->pump_nb();
+ while (1)
+ {
+ last if $$stream =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ if (not $proc->pumpable())
+ {
+ diag("aborting wait: program died");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ $proc->pump();
+ }
+ return 1;
+}
--
2.29.2
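The pump_until helper at the end of the test above is a generic poll-until-match loop: accumulate subprocess output, succeed when a pattern matches, and bail out when the timeout expires or the producer dies. A minimal sketch of the same loop shape in Python, stripped of the IPC::Run specifics (all names and the read_chunk callback convention are illustrative, not part of the patch):

```python
import re
import time

def pump_until(read_chunk, pattern, timeout=60.0, poll_interval=0.01):
    """Accumulate output from read_chunk() until `pattern` matches.

    read_chunk() returns newly available text ('' when nothing is
    pending) or None once the producer has died. Returns the buffer on
    success, or None on timeout/death, mirroring the 0/1 result of the
    Perl helper.
    """
    buf = ""
    deadline = time.monotonic() + timeout
    while True:
        if re.search(pattern, buf):
            return buf                  # pattern matched: success
        if time.monotonic() >= deadline:
            return None                 # like $psql_timeout->is_expired
        chunk = read_chunk()
        if chunk is None:
            return None                 # like "not $proc->pumpable()"
        buf += chunk
        if not chunk:
            time.sleep(poll_interval)   # nothing pending; avoid busy-wait

# Example: a fake stream that delivers its output in pieces, like psql
# echoing a query and then the backend pid.
pieces = iter(["SELECT pg_backend_pid();\n", "12345\n"])
result = pump_until(lambda: next(pieces, None), r"[0-9]+[\r\n]$")
```

The two failure exits (timeout, dead producer) match the two diag-and-return-0 branches in the Perl version; the test relies on both so a hung or crashed psql does not hang the whole TAP run.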
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
Thanks for taking care of this. I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.
I propose to rename the GUC to remove_temp_files_after_crash, I think
"remove" is a bit clearer than "cleanup". I've also reworded the sgml
docs a little bit.
"remove" sounds fine to me.
Attached is a patch with those changes. Barring objections, I'll get
this committed in the next couple days.
+ When set to on, <productname>PostgreSQL</productname> will automatically
Nit: using a <literal> markup for the "on" value.
+#remove_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
The indentation of the second line is incorrect here (Incorrect number
of spaces in tabs perhaps?), and there is no need for the '#' at the
beginning of the line.
--
Michael
On Tue, Mar 9, 2021, at 9:31 AM, Michael Paquier wrote:
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
Thanks for taking care of this. I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.
I propose to rename the GUC to remove_temp_files_after_crash, I think
"remove" is a bit clearer than "cleanup". I've also reworded the sgml
docs a little bit.
"remove" sounds fine to me.
+1.
Attached is a patch with those changes. Barring objections, I'll get
this committed in the next couple days.
+ When set to on, <productname>PostgreSQL</productname> will automatically
Nit: using a <literal> markup for the "on" value.
+#remove_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
The indentation of the second line is incorrect here (Incorrect number
of spaces in tabs perhaps?), and there is no need for the '#' at the
beginning of the line.
--
Michael
That was my fault. Editor automatically added #.
I'm not sure Tomas will include the tests. If so, the terminology should be adjusted too.
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -0,0 +1,194 @@
+#
+# Test cleanup of temporary files after a crash.
s/cleanup/remove/
--
Euler Taveira
EDB https://www.enterprisedb.com/
On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
Thanks for taking care of this. I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.
+1 for having it on by default.
I was also just looking at this patch and came here to say LGTM except
for two cosmetic things, below.
I propose to rename the GUC to remove_temp_files_after_crash, I think
"remove" is a bit clearer than "cleanup". I've also reworded the sgml
docs a little bit.
"remove" sounds fine to me.
+1
Attached is a patch with those changes. Barring objections, I'll get
this committed in the next couple days.
+ When set to on, <productname>PostgreSQL</productname> will automatically
Nit: using a <literal> markup for the "on" value.
Maybe should say "which is the default", like other similar things?
+#remove_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
The indentation of the second line is incorrect here (Incorrect number
of spaces in tabs perhaps?), and there is no need for the '#' at the
beginning of the line.
Yeah, that's wrong. For some reason that one file uses a tab size of
8, unlike the rest of the tree (I guess because people will read that
file in software with the more common setting of 8). If you do :set
tabstop=8 in vim, suddenly it all makes sense, but it is revealed that
this patch has it wrong, as you said. (Perhaps this file should have
some of those special Vim/Emacs control messages so we don't keep
getting this wrong?)
On Sun, Mar 14, 2021, at 11:01 PM, Thomas Munro wrote:
On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
Thanks for taking care of this. I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.
+1 for having it on by default.
I was also just looking at this patch and came here to say LGTM except
for two cosmetic things, below.
Thanks for taking a look at this patch. I'm not sure Tomas is preparing a new
patch that includes the suggested modifications, but I decided to do it. This
new version has the new GUC name (using "remove"). I also replaced "cleanup"
with "remove" in all the remaining places. As pointed out by Thomas, I reworded
the paragraph that describes the GUC, moving the default information to the
beginning of the sentence. I also added the <literal> markup as suggested by Michael.
The postgresql.conf.sample was fixed. The tests were slightly modified. I
reworded some comments and added a hack to avoid breaking the temporary file
test on slow machines. A usleep() after sending the query provides some time
for the query to create the temporary file. I used an arbitrary sleep (10 ms)
that seems to be sufficient.
+#remove_temp_files_after_crash = on # remove temporary files after
+# # backend crash?
The indentation of the second line is incorrect here (Incorrect number
of spaces in tabs perhaps?), and there is no need for the '#' at the
beginning of the line.
Yeah, that's wrong. For some reason that one file uses a tab size of
8, unlike the rest of the tree (I guess because people will read that
file in software with the more common setting of 8). If you do :set
tabstop=8 in vim, suddenly it all makes sense, but it is revealed that
this patch has it wrong, as you said. (Perhaps this file should have
some of those special Vim/Emacs control messages so we don't keep
getting this wrong?)
I hadn't noticed that this file uses ts=8. (This explains the misalignment that
I observed in some parameter comments.) I'm not sure if people care about the
indentation in this file. From the development perspective, we can use the same
number of spaces per tab as the source code, so it is not required to fix your
dev environment. However, regular users don't follow the dev guidelines, so
they will probably observe the misalignment while using Vim (which uses ts=8
by default), for example. For now, I will add
autocmd BufRead,BufNewFile postgresql.conf* setlocal ts=8
to my .vimrc. We should probably fix some settings that are misaligned, such as
parallel_setup_cost and shared_preload_libraries. The parameters
timezone_abbreviations and max_pred_locks_per_page are using spaces instead of
tabs and should probably be fixed too.
--
Euler Taveira
EDB https://www.enterprisedb.com/
Attachments:
v2-0001-Control-temporary-files-removal-after-crash.patchtext/x-patch; name=v2-0001-Control-temporary-files-removal-after-crash.patchDownload
From 8dfee5694210c14054170563ff01fc15a66a76b0 Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.taveira@2ndquadrant.com>
Date: Mon, 25 May 2020 00:08:20 -0300
Subject: [PATCH v2] Control temporary files removal after crash
A new GUC remove_temp_files_after_crash controls whether temporary
files are removed after a crash. Successive crashes could result in
useless storage usage until service is restarted. It could be the case
on host with inadequate resources. This manual intervention for some
environments is not desirable. This GUC is marked as SIGHUP hence you
don't have to interrupt the service to change it.
The current behavior introduces a backward incompatibility (minor one)
which means Postgres will reclaim temporary file space after a crash.
The main reason is that we favor service continuity over debugging.
---
doc/src/sgml/config.sgml | 18 ++
src/backend/postmaster/postmaster.c | 5 +
src/backend/storage/file/fd.c | 12 +-
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/postmaster/postmaster.h | 1 +
src/test/recovery/t/022_crash_temp_files.pl | 192 ++++++++++++++++++
7 files changed, 234 insertions(+), 5 deletions(-)
create mode 100644 src/test/recovery/t/022_crash_temp_files.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5e551b9f6d..c0b1b48136 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9648,6 +9648,24 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-remove-temp-files-after-crash" xreflabel="remove_temp_files_after_crash">
+ <term><varname>remove_temp_files_after_crash</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>remove_temp_files_after_crash</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set to <literal>on</literal>, which is the default,
+ <productname>PostgreSQL</productname> will automatically remove
+ temporary files after a backend crash. If disabled, successive crashes
+ (while queries are using temporary files) could result in accumulation
+ of useless files. However, in some circumstances (e.g. debugging),
+ you want to examine these temporary files.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
<term><varname>data_sync_retry</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e8af05c04e..fc9c769bc6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -242,6 +242,7 @@ bool Db_user_namespace = false;
bool enable_bonjour = false;
char *bonjour_name;
bool restart_after_crash = true;
+bool remove_temp_files_after_crash = true;
/* PIDs of special child processes; 0 when not running */
static pid_t StartupPID = 0,
@@ -3976,6 +3977,10 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
+ /* remove leftover temporary files after a crash */
+ if (remove_temp_files_after_crash)
+ RemovePgTempFiles();
+
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..8c75d135a2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3024,11 +3024,13 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* remove any leftover files created by OpenTemporaryFile and any leftover
* temporary relation files created by mdcreate.
*
- * NOTE: we could, but don't, call this during a post-backend-crash restart
- * cycle. The argument for not doing it is that someone might want to examine
- * the temp files for debugging purposes. This does however mean that
- * OpenTemporaryFile had better allow for collision with an existing temp
- * file name.
+ * During post-backend-crash restart cycle, this routine could be called if
+ * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
+ * queries are using temp files could result in useless storage usage that can
+ * only be reclaimed by a service restart. The argument against enabling it is
+ * that someone might want to examine the temporary files for debugging
+ * purposes. This does however mean that OpenTemporaryFile had better allow for
+ * collision with an existing temp file name.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..9b6170a6cd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1361,6 +1361,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"remove_temp_files_after_crash", PGC_SIGHUP, ERROR_HANDLING_OPTIONS,
+ gettext_noop("Remove temporary files after backend crash."),
+ NULL
+ },
+ &remove_temp_files_after_crash,
+ true,
+ NULL, NULL, NULL
+ },
{
{"log_duration", PGC_SUSET, LOGGING_WHAT,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f46c2dd7a8..1b67420c9f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -757,6 +757,8 @@
#exit_on_error = off # terminate session on any error?
#restart_after_crash = on # reinitialize after backend crash?
+#remove_temp_files_after_crash = on # remove temporary files after
+ # backend crash?
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index cfa59c4dc0..0efdd7c232 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -29,6 +29,7 @@ extern bool log_hostname;
extern bool enable_bonjour;
extern char *bonjour_name;
extern bool restart_after_crash;
+extern bool remove_temp_files_after_crash;
#ifdef WIN32
extern HANDLE PostmasterHandle;
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
new file mode 100644
index 0000000000..c37b227770
--- /dev/null
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -0,0 +1,192 @@
+# Test removal of temporary files after a crash.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+use Time::HiRes qw(usleep);
+
+plan tests => 9;
+
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(60);
+
+my $node = get_new_node('node_crash');
+$node->init();
+$node->start();
+
+# By default, PostgresNode doesn't restart after crash
+# Reduce work_mem to generate a temporary file with a small number of rows
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET remove_temp_files_after_crash = on;
+ ALTER SYSTEM SET log_connections = 1;
+ ALTER SYSTEM SET work_mem = '64kB';
+ ALTER SYSTEM SET restart_after_crash = on;
+ SELECT pg_reload_conf();]);
+
+# create table, insert rows
+$node->safe_psql(
+ 'postgres',
+ q[CREATE TABLE tab_crash (a text);
+ INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+
+# Run psql, keeping session alive, so we have an alive backend to kill.
+my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+my $killme = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node->connstr('postgres')
+ ],
+ '<',
+ \$killme_stdin,
+ '>',
+ \$killme_stdout,
+ '2>',
+ \$killme_stderr,
+ $psql_timeout);
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+my $pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Wait some time so the temporary file is generated by SELECT
+usleep(10_000);
+
+# Kill with SIGKILL
+my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'no temporary files');
+
+#
+# Test old behavior (don't remove temporary files after crash)
+#
+$node->safe_psql(
+ 'postgres',
+ q[ALTER SYSTEM SET remove_temp_files_after_crash = off;
+ SELECT pg_reload_conf();]);
+
+# Restart psql session
+($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+$killme->run();
+
+# Get backend pid
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+ 'acquired pid for SIGKILL');
+$pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Run the query that generates a temporary file and that will be killed before
+# it finishes. Since the query that generates the temporary file does not
+# return before the connection is killed, use a SELECT before to trigger
+# pump_until.
+$killme_stdin .= q[
+BEGIN;
+SELECT $$in-progress-before-sigkill$$;
+WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+];
+ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
+ 'select in-progress-before-sigkill');
+$killme_stdout = '';
+$killme_stderr = '';
+
+# Wait some time so the temporary file is generated by SELECT
+usleep(10_000);
+
+# Kill with SIGKILL
+$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# Close psql session
+$killme->finish;
+
+# Wait till server restarts
+$node->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Check for temporary files -- should be there
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(1), 'one temporary file');
+
+# Restart should remove the temporary files
+$node->restart();
+
+# Check the temporary files -- should be gone
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(0), 'temporary file was removed');
+
+$node->stop();
+
+# Pump until string is matched, or timeout occurs
+sub pump_until
+{
+ my ($proc, $stream, $untl) = @_;
+ $proc->pump_nb();
+ while (1)
+ {
+ last if $$stream =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ if (not $proc->pumpable())
+ {
+ diag("aborting wait: program died");
+ diag("stream contents: >>", $$stream, "<<");
+ diag("pattern searched for: ", $untl);
+
+ return 0;
+ }
+ $proc->pump();
+ }
+ return 1;
+}
--
2.20.1
On 3/17/21 2:34 AM, Euler Taveira wrote:
On Sun, Mar 14, 2021, at 11:01 PM, Thomas Munro wrote:
On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
Let's move this patch forward. Based on the responses, I agree the
default behavior should be to remove the temp files, and I think we
should have the GUC (on the off chance that someone wants to preserve
the temporary files for debugging or whatever other reason).
Thanks for taking care of this. I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.
+1 for having it on by default.
I was also just looking at this patch and came here to say LGTM except
for two cosmetic things, below.
Thanks for taking a look at this patch. I'm not sure Tomas is preparing a new
patch that includes the suggested modifications but I decided to do it. This
new version has the new GUC name (using "remove"). I also replaced "cleanup"
with "remove" in the all the remain places. As pointed by Thomas, I reworded
the paragraph that describes the GUC moving the default information to the
beginning of the sentence. I also added the "literal" as suggested by Michael.
The postgresql.conf.sample was fixed. The tests was slightly modified. I
reworded some comments and added a hack to avoid breaking the temporary file
test in slow machines. A usleep() after sending the query provides some time
for the query to create the temporary file. I used an arbitrarily sleep (10ms)
that seems to be sufficient.
Thanks. Pushed with some minor changes to docs wording.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hmm,
crake and florican seem to have failed because of this commit, with this
error in the new TAP test:
error running SQL: 'psql:<stdin>:1: ERROR: could not open directory
"base/pgsql_tmp": No such file or directory'
while running 'psql -XAtq -d port=64336 host=/tmp/sv1WjSvj3P
dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT COUNT(1)
FROM pg_ls_dir($$base/pgsql_tmp$$)' at
/home/andrew/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm
line 1572.
So it seems the pgsql_tmp directory does not exist, for some reason.
Considering the directory should be created for the first temp file,
that either means the query in the TAP test does not actually create a
temp file on those machines, or it gets killed too early.
The 500 rows used by the test seems fairly low, so maybe those machines
can do the sort entirely in memory?
The other option is that the sleep in the TAP test is a bit too short,
but those machines don't seem to be that slow.
Anyway, a TAP test relying on timing like this may not be the best idea,
so I wonder how else to test this ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/18/21 6:51 PM, Tomas Vondra wrote:
Hmm,
crake and florican seem to have failed because of this commit, with this
error in the new TAP test:
error running SQL: 'psql:<stdin>:1: ERROR: could not open directory
"base/pgsql_tmp": No such file or directory'
while running 'psql -XAtq -d port=64336 host=/tmp/sv1WjSvj3P
dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT COUNT(1)
FROM pg_ls_dir($$base/pgsql_tmp$$)' at
/home/andrew/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm
line 1572.
So it seems the pgsql_tmp directory does not exist, for some reason.
Considering the directory should be created for the first temp file,
that either means the query in the TAP test does not actually create a
temp file on those machines, or it gets killed too early.
The 500 rows used by the test seems fairly low, so maybe those machines
can do the sort entirely in memory?
The other option is that the sleep in the TAP test is a bit too short,
but those machines don't seem to be that slow.
Anyway, TAP test relying on timing like this may not be the best idea,
so I wonder how else to test this ...
I think a better way to test this would be to use a tuple lock:
setup:
create table t (a int unique);
session 1:
begin;
insert into t values (1);
... keep open ...
session 2:
begin;
set work_mem = '64kB';
insert into t select i from generate_series(1,10000) s(i);
... should block ...
Then, once the second session gets waiting on the tuple, kill the
backend. We might as well test that there actually is a temp file first,
and then test that it disappeared.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
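The point of the tuple-lock design above is determinism: session 2 first spills its temp file, then blocks on session 1's uncommitted unique-key insert, so the backend can be killed at a known point instead of after an arbitrary sleep. The blocking shape (not PostgreSQL's locking internals) can be mimicked with two threads and a plain lock; everything here is an illustrative stand-in:

```python
import threading
import time

row_lock = threading.Lock()   # stands in for the lock on uncommitted key 1
events = []

def session1():
    # BEGIN; INSERT INTO t VALUES (1);  -- transaction left open,
    # so the lock is taken and never released here
    row_lock.acquire()
    events.append("s1: holds key 1, transaction open")

def session2():
    # INSERT ... generate_series(...) spills a temp file, then hits key 1
    events.append("s2: temp file written, waiting on key 1")
    row_lock.acquire()        # blocks until session 1 ends; in the TAP
    events.append("s2: unblocked")  # test the backend is SIGKILLed first

s1 = threading.Thread(target=session1)
s1.start(); s1.join()

s2 = threading.Thread(target=session2, daemon=True)
s2.start()
time.sleep(0.1)               # let session 2 reach the blocking point

# Session 2 is now deterministically parked; this is the moment the TAP
# test kills the backend and then inspects base/pgsql_tmp.
blocked = s2.is_alive() and events[-1].startswith("s2: temp file")
```

Compared with the usleep() hack, nothing here depends on how fast the machine sorts; the kill happens strictly after the temp file exists.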
On Thu, Mar 18, 2021, at 4:20 PM, Tomas Vondra wrote:
I think a better way to test this would be to use a tuple lock:
I predicted such issues with this test. Your suggestion works for me. Maybe
you should use fewer rows in the session 2 query.
setup:
create table t (a int unique);
session 1:
begin;
insert into t values (1);
... keep open ...
session 2:
begin;
set work_mem = '64kB';
insert into t select i from generate_series(1,10000) s(i);
... should block ...
Then, once the second session gets waiting on the tuple, kill the
backend. We might as well test that there actually is a temp file first,
and then test that it disappeared.
Your suggestion works for me. Maybe you could use less rows in the session 2
query. I experimented with 1k rows and it generates a temporary file.
--
Euler Taveira
EDB https://www.enterprisedb.com/
On 3/18/21 9:06 PM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 4:20 PM, Tomas Vondra wrote:
I think a better way to test this would be to use a tuple lock:
I predicated such issues with this test. Your suggestion works for me. Maybe
you should use less rows in the session 2 query.
setup:
create table t (a int unique);
session 1:
begin;
insert into t values (1);
... keep open ...
session 2:
begin;
set work_mem = '64kB';
insert into t select i from generate_series(1,10000) s(i);
... should block ...
Then, once the second session gets waiting on the tuple, kill the
backend. We might as well test that there actually is a temp file first,
and then test that it disappeared.Your suggestion works for me. Maybe you could use less rows in the session 2
query. I experimented with 1k rows and it generates a temporary file.
OK. Can you prepare a patch with the proposed test approach?
FWIW I can reproduce this on a 32-bit ARM system (rpi4), where 500 rows
simply does not use a temp file, and with 1000 rows it works fine. On
the x86_64 the temp file is created even with 500 rows. So there clearly
is some platform dependency, not sure if it's due to 32/64 bits,
alignment or something else. In any case, the 500 rows seems to be just
on the threshold.
We need to do both - stop using the timing and increase the number of
rows, to consistently get temp files.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
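The platform sensitivity comes down to simple arithmetic: a sort spills to a temp file only once its in-memory footprint exceeds work_mem, and the per-tuple footprint differs between 32-bit and 64-bit builds. The sketch below uses made-up per-tuple byte costs chosen to match the observed thresholds; they are not PostgreSQL's actual tuplesort accounting:

```python
WORK_MEM = 64 * 1024          # work_mem = '64kB', as set in the TAP test

def spills(nrows, per_tuple_bytes):
    # A sort spills to a temp file once its memory use exceeds work_mem.
    return nrows * per_tuple_bytes > WORK_MEM

# Guessed cost of one int4 row in the sort (tuple header + datum +
# sort-slot + allocator overhead); smaller on a 32-bit build because
# pointers and alignment are smaller.
PER_TUPLE_64BIT = 80
PER_TUPLE_32BIT = 64          # threshold ~ 65536/64 = 1024 rows, near
                              # the ~1040 rows observed on 32-bit ARM
```

With these guesses, 1000 rows spill on a 64-bit box but fit in memory on a 32-bit one, while 2000 rows spill on both, which is exactly why bumping the row count (and dropping the timing) makes the test deterministic across platforms.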
On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
OK. Can you prepare a patch with the proposed test approach?
I'm on it.
FWIW I can reproduce this on a 32-bit ARM system (rpi4), where 500 rows
simply does not use a temp file, and with 1000 rows it works fine. On
the x86_64 the temp file is created even with 500 rows. So there clearly
is some platform dependency, not sure if it's due to 32/64 bits,
alignment or something else. In any case, the 500 rows seems to be just
on the threshold.
We need to do both - stop using the timing and increase the number of
rows, to consistently get temp files.
Yeah.
--
Euler Taveira
EDB https://www.enterprisedb.com/
On Thu, Mar 18, 2021, at 6:00 PM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
OK. Can you prepare a patch with the proposed test approach?
I'm on it.
What do you think about this patch?
--
Euler Taveira
EDB https://www.enterprisedb.com/
Attachments:
fix-crash-temp-files.patchtext/x-patch; name=fix-crash-temp-files.patchDownload
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index c37b227770..68e6fc20ad 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -5,7 +5,6 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
-use Time::HiRes qw(usleep);
plan tests => 9;
@@ -33,8 +32,7 @@ $node->safe_psql(
# create table, insert rows
$node->safe_psql(
'postgres',
- q[CREATE TABLE tab_crash (a text);
- INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+ q[CREATE TABLE tab_crash (a integer UNIQUE);]);
# Run psql, keeping session alive, so we have an alive backend to kill.
my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
@@ -62,6 +60,32 @@ chomp($pid);
$killme_stdout = '';
$killme_stderr = '';
+# Open a 2nd session that will block the 1st session. The UNIQUE constraint
+# will prevent the temporary file from the 1st session to be removed.
+my ($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+my $killme2 = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node->connstr('postgres')
+ ],
+ '<',
+ \$killme_stdin2,
+ '>',
+ \$killme_stdout2,
+ '2>',
+ \$killme_stderr2,
+ $psql_timeout);
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Run the query that generates a temporary file and that will be killed before
# it finishes. Since the query that generates the temporary file does not
# return before the connection is killed, use a SELECT before to trigger
@@ -69,22 +93,20 @@ $killme_stderr = '';
$killme_stdin .= q[
BEGIN;
SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 1000) s(i);
];
ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
- 'select in-progress-before-sigkill');
+ 'insert in-progress-before-sigkill');
$killme_stdout = '';
$killme_stderr = '';
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
# Kill with SIGKILL
my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
# Close psql session
$killme->finish;
+$killme2->finish;
# Wait till server restarts
$node->poll_query_until('postgres', 'SELECT 1', '1');
@@ -118,6 +140,20 @@ chomp($pid);
$killme_stdout = '';
$killme_stderr = '';
+# Restart the 2nd psql session
+($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+$killme2->run();
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Run the query that generates a temporary file and that will be killed before
# it finishes. Since the query that generates the temporary file does not
# return before the connection is killed, use a SELECT before to trigger
@@ -125,22 +161,20 @@ $killme_stderr = '';
$killme_stdin .= q[
BEGIN;
SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 1000) s(i);
];
ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
- 'select in-progress-before-sigkill');
+ 'insert in-progress-before-sigkill');
$killme_stdout = '';
$killme_stderr = '';
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
# Kill with SIGKILL
$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
# Close psql session
$killme->finish;
+$killme2->finish;
# Wait till server restarts
$node->poll_query_until('postgres', 'SELECT 1', '1');
On 3/18/21 10:44 PM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 6:00 PM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
OK. Can you prepare a patch with the proposed test approach?
I'm on it.
What do you think about this patch?
Well, that's better, but it still does not do the trick on the 32-bit
machine - in that case 1000 rows with int4 still fit into work_mem, so
the temp file is not created. Per my experiments about 1040 rows are
needed - soooo close ;-) So let's make it 2000.
We might as well check that the temp file actually exists, before
killing the backend. Just to be sure.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
Well, that's better, but it still does not do the trick on the 32-bit
machine - in that case 1000 rows with int4 still fit into work_mem, so
the temp file is not created. Per my experiments about 1040 rows are
needed - soooo close ;-) So let's make it 2000.
My 32-bit laptop needs some repairs so I blindly chose 1k rows.
We might as well check that the temp file actually exists, before
killing the backend. Just to be sure.
Do you mean with remove_temp_files_after_crash = on? New version attached.
--
Euler Taveira
EDB https://www.enterprisedb.com/
Attachments:
v2-fix-crash-temp-files.patch (text/x-patch)
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index c37b227770..38e935d641 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -5,9 +5,8 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
-use Time::HiRes qw(usleep);
-plan tests => 9;
+plan tests => 10;
# To avoid hanging while expecting some specific input from a psql
@@ -33,8 +32,7 @@ $node->safe_psql(
# create table, insert rows
$node->safe_psql(
'postgres',
- q[CREATE TABLE tab_crash (a text);
- INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+ q[CREATE TABLE tab_crash (a integer UNIQUE);]);
# Run psql, keeping session alive, so we have an alive backend to kill.
my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
@@ -62,6 +60,32 @@ chomp($pid);
$killme_stdout = '';
$killme_stderr = '';
+# Open a 2nd session that will block the 1st session. The UNIQUE constraint
+# will prevent the temporary file from the 1st session to be removed.
+my ($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+my $killme2 = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node->connstr('postgres')
+ ],
+ '<',
+ \$killme_stdin2,
+ '>',
+ \$killme_stdout2,
+ '2>',
+ \$killme_stderr2,
+ $psql_timeout);
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Run the query that generates a temporary file and that will be killed before
# it finishes. Since the query that generates the temporary file does not
# return before the connection is killed, use a SELECT before to trigger
@@ -69,15 +93,18 @@ $killme_stderr = '';
$killme_stdin .= q[
BEGIN;
SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 2000) s(i);
];
ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
- 'select in-progress-before-sigkill');
+ 'insert in-progress-before-sigkill');
$killme_stdout = '';
$killme_stderr = '';
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
+# Check for the existence of a temporary file
+is($node->safe_psql(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ qq(1), 'one temporary file');
# Kill with SIGKILL
my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
@@ -85,6 +112,7 @@ is($ret, 0, 'killed process with KILL');
# Close psql session
$killme->finish;
+$killme2->finish;
# Wait till server restarts
$node->poll_query_until('postgres', 'SELECT 1', '1');
@@ -118,6 +146,20 @@ chomp($pid);
$killme_stdout = '';
$killme_stderr = '';
+# Restart the 2nd psql session
+($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+$killme2->run();
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Run the query that generates a temporary file and that will be killed before
# it finishes. Since the query that generates the temporary file does not
# return before the connection is killed, use a SELECT before to trigger
@@ -125,22 +167,20 @@ $killme_stderr = '';
$killme_stdin .= q[
BEGIN;
SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 2000) s(i);
];
ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
- 'select in-progress-before-sigkill');
+ 'insert in-progress-before-sigkill');
$killme_stdout = '';
$killme_stderr = '';
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
# Kill with SIGKILL
$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
# Close psql session
$killme->finish;
+$killme2->finish;
# Wait till server restarts
$node->poll_query_until('postgres', 'SELECT 1', '1');
On 3/19/21 1:54 AM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
Well, that's better, but it still does not do the trick on the 32-bit
machine - in that case 1000 rows with int4 still fit into work_mem, so
the temp file is not created. Per my experiments about 1040 rows are
needed - soooo close ;-) So let's make it 2000.
My 32-bit laptop needs some repairs so I blindly chose 1k rows.
Yeah, it's not immediately obvious how many rows are needed. 2000 would
be enough, but I just bumped it up to 5000 for good measure.
We might as well check that the temp file actually exists, before
killing the backend. Just to be sure.
Do you mean with remove_temp_files_after_crash = on? New version attached.
On second thought, we don't need that check. I realized the second part
of the test (checking remove_temp_files_after_crash=off) actually does
verify that we've created the file, so that should be OK.
I've pushed the previous version with the 5000 rows. Let's see what
crake and florican think about this ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/19/21 2:21 AM, Tomas Vondra wrote:
On 3/19/21 1:54 AM, Euler Taveira wrote:
On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
Well, that's better, but it still does not do the trick on the 32-bit
machine - in that case 1000 rows with int4 still fit into work_mem, so
the temp file is not created. Per my experiments about 1040 rows are
needed - soooo close ;-) So let's make it 2000.
My 32-bit laptop needs some repairs so I blindly chose 1k rows.
Yeah, it's not immediately obvious how many rows are needed. 2000 would
be enough, but I just bumped it up to 5000 for good measure.
We might as well check that the temp file actually exists, before
killing the backend. Just to be sure.
Do you mean with remove_temp_files_after_crash = on? New version attached.
On second thought, we don't need that check. I realized the second part
of the test (checking remove_temp_files_after_crash=off) actually does
verify that we've created the file, so that should be OK.
I've pushed the previous version with the 5000 rows. Let's see what
crake and florican think about this ...
So crake and florican seem to be happy now. Not sure about lapwing yet.
But interestingly enough, prion and curculio got unhappy. They worked
fine with the older test, but now it fails with the "no such file or
directory" message. I wonder what makes them different from the other
x86_64 machines ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
So crake and florican seem to be happy now. Not sure about lapwing yet.
But interestingly enough, prion and curculio got unhappy. They worked
fine with the older test, but now it fails with the "no such file or
directory" message. I wonder what makes them different from the other
x86_64 machines ...
I'm confused about why the new query would use a temp file at all.
Maybe they aren't using the same plan?
regards, tom lane
On 3/19/21 3:57 AM, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
So crake and florican seem to be happy now. Not sure about lapwing yet.
But interestingly enough, prion and curculio got unhappy. They worked
fine with the older test, but now it fails with the "no such file or
directory" message. I wonder what makes them different from the other
x86_64 machines ...
I'm confused about why the new query would use a temp file at all.
Maybe they aren't using the same plan?
Because generate_series() stashes the results into a tuplestore, and the
test sets work_mem to just 64kB. Maybe I'm missing some detail, though.
The plan is extremely simple - just Function Scan feeding data into an
Insert, not sure what other plan could be used.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[ reads code... ]
... no, I think the problem is the test is still full of race conditions.
In the first place, waiting till you see the output of a SELECT that's
before the useful query is not enough to guarantee that the useful query
is done, or even started. That's broken on both sessions.
In the second place, even if the second INSERT has started, you don't know
that it's reached the point of blocking on the tuple conflict yet.
Which in turn means that it might not've filled its tuplestore yet.
In short, this script is designed to ensure that the test query can't
finish too soon, but that proves nothing about whether the test query
has even started. And since you also haven't really guaranteed that the
intended-to-be-blocking query is done, I don't think that the first
condition really holds either.
regards, tom lane
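The robust alternative Tom is pointing toward is to wait for an observable condition instead of sleeping for a fixed interval (the usleep the test originally relied on). As a minimal illustration of that polling pattern - sketched in Python rather than the test's Perl, with names of my own choosing:

```python
import time

def poll_until(predicate, timeout=180.0, interval=0.1):
    """Repeatedly evaluate predicate() until it returns true or the
    timeout expires.  Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Example: a condition that only becomes true on the third poll,
# standing in for "the temp file now exists" or "the lock is held".
calls = {"n": 0}
def eventually_true():
    calls["n"] += 1
    return calls["n"] >= 3

assert poll_until(eventually_true, timeout=5.0, interval=0.01)
```

The key property is that the wait ends as soon as the condition is observed, and a timeout (rather than a guessed sleep length) bounds the worst case - which is exactly what a fixed usleep(10_000) cannot guarantee on a slow or heavily loaded animal.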
On Fri, Mar 19, 2021, at 12:23 AM, Tom Lane wrote:
[ reads code... ]
... no, I think the problem is the test is still full of race conditions.
In the first place, waiting till you see the output of a SELECT that's
before the useful query is not enough to guarantee that the useful query
is done, or even started. That's broken on both sessions.
That's an ugly and fragile mechanism to work around the fact that pump_until
only reacts after the query returns.
In the second place, even if the second INSERT has started, you don't know
that it's reached the point of blocking on the tuple conflict yet.
Which in turn means that it might not've filled its tuplestore yet.
In short, this script is designed to ensure that the test query can't
finish too soon, but that proves nothing about whether the test query
has even started. And since you also haven't really guaranteed that the
intended-to-be-blocking query is done, I don't think that the first
condition really holds either.
In order to avoid the race condition between filling the tuplestore and
killing the backend, we could use poll_query_until() before SIGKILL to wait
for the temporary file to be created. Do you think this modification will
make this test more stable?
--
Euler Taveira
EDB https://www.enterprisedb.com/
Attachments:
v3-fix-crash-temp-files.patch (text/x-patch)
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..41a91ebd06 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -100,6 +100,11 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait till a temporary file is created
+$node->poll_query_until(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)', '1');
+
# Kill with SIGKILL
my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
@@ -168,6 +173,11 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait till a temporary file is created
+$node->poll_query_until(
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)', '1');
+
# Kill with SIGKILL
$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
On 3/19/21 3:17 PM, Euler Taveira wrote:
On Fri, Mar 19, 2021, at 12:23 AM, Tom Lane wrote:
[ reads code... ]
... no, I think the problem is the test is still full of race conditions.
In the first place, waiting till you see the output of a SELECT that's
before the useful query is not enough to guarantee that the useful query
is done, or even started. That's broken on both sessions.
That's an ugly and fragile mechanism to work around the fact that pump_until
only reacts after the query returns.
In the second place, even if the second INSERT has started, you don't know
that it's reached the point of blocking on the tuple conflict yet.
Which in turn means that it might not've filled its tuplestore yet.
In short, this script is designed to ensure that the test query can't
finish too soon, but that proves nothing about whether the test query
has even started. And since you also haven't really guaranteed that the
intended-to-be-blocking query is done, I don't think that the first
condition really holds either.
In order to avoid the race condition between filling the tuplestore and
killing the backend, we could use poll_query_until() before SIGKILL to wait
for the temporary file to be created. Do you think this modification will
make this test more stable?
Wow, I thought I understood the Perl code, but clearly I was wrong.
I think the solution is not particularly difficult.
For the initial insert, it's fine to switch the order of the SQL
commands, so that the insert happens first, and only then emit the string.
For the second insert, we can't do that, because it's expected to block
on the lock. So we have to verify that by looking at the lock directly.
I however don't understand what the pump_until function does. AFAICS it
should run the query in a loop, and bail out once it matches the
expected output. But that does not seem to happen, so when the query
gets executed before the lock is there, it results in infinite loop.
In the attached patch I've simulated this by random() < 0.5.
If I replace this with a wait loop in a plpgsql block, that works
perfectly fine (no infinite loops). Tested both on x86_64 and rpi.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
cleanup-tap.patch (text/x-patch)
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..88478d44c1 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -79,8 +79,8 @@ my $killme2 = IPC::Run::start(
# Insert one tuple and leave the transaction open
$killme_stdin2 .= q[
BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
];
pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
$killme_stdout2 = '';
@@ -100,6 +100,15 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+SELECT $$insert-tuple-lock-waiting$$ FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted AND random() < 0.5;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Kill with SIGKILL
my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
@@ -147,8 +156,8 @@ $killme2->run();
# Insert one tuple and leave the transaction open
$killme_stdin2 .= q[
BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
];
pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
$killme_stdout2 = '';
@@ -168,6 +177,15 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+SELECT $$insert-tuple-lock-waiting$$ FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted AND random() < 0.5;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Kill with SIGKILL
$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
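The same wait-for-the-lock idea the patch expresses in SQL - keep checking pg_locks until the batch insert shows up as blocked - can also be driven from the harness side. A hedged sketch in Python: run_query here is a hypothetical helper (any callable that executes SQL against the test server and returns the scalar result as a string), not an API from the TAP framework.

```python
import time

def wait_for_ungranted_lock(run_query, pid, timeout=180.0, interval=0.1):
    """Poll pg_locks until backend `pid` is waiting on an ungranted lock,
    or give up after `timeout` seconds.  run_query is a caller-supplied
    stand-in for executing SQL and returning the single scalar result."""
    sql = ("SELECT count(*) FROM pg_locks "
           "WHERE pid = %d AND NOT granted" % pid)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if int(run_query(sql)) > 0:
            return True          # the backend is now blocked on the lock
        time.sleep(interval)
    return False

# Fake runner for demonstration: the ungranted lock appears on the
# third poll, as it would once the INSERT hits the tuple conflict.
answers = iter(["0", "0", "1"])
assert wait_for_ungranted_lock(lambda sql: next(answers), pid=12345)
```

Once this returns true, the backend has necessarily started the INSERT and reached the conflicting tuple, which is the guarantee Tom asked for before sending SIGKILL.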
On 3/19/21 5:23 PM, Tomas Vondra wrote:
...
If I replace this with a wait loop in a plpgsql block, that works
perfectly fine (no infinite loops). Tested both on x86_64 and rpi.
For the record, here's the version with plpgsql block, which seems to be
working just fine.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
cleanup-tap-plpgsql-block.patch (text/x-patch)
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..c5624fe864 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -79,8 +79,8 @@ my $killme2 = IPC::Run::start(
# Insert one tuple and leave the transaction open
$killme_stdin2 .= q[
BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
];
pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
$killme_stdout2 = '';
@@ -100,6 +100,26 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+DO $c$
+DECLARE
+ c INT;
+BEGIN
+ LOOP
+ SELECT COUNT(*) INTO c FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted;
+ IF c > 0 THEN
+ EXIT;
+ END IF;
+ END LOOP;
+END; $c$;
+SELECT $$insert-tuple-lock-waiting$$;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Kill with SIGKILL
my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
@@ -147,8 +167,8 @@ $killme2->run();
# Insert one tuple and leave the transaction open
$killme_stdin2 .= q[
BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
];
pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
$killme_stdout2 = '';
@@ -168,6 +188,26 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
$killme_stdout = '';
$killme_stderr = '';
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+DO $c$
+DECLARE
+ c INT;
+BEGIN
+ LOOP
+ SELECT COUNT(*) INTO c FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted;
+ IF c > 0 THEN
+ EXIT;
+ END IF;
+ END LOOP;
+END; $c$;
+SELECT $$insert-tuple-lock-waiting$$;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
# Kill with SIGKILL
$ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
is($ret, 0, 'killed process with KILL');
On 3/19/21 5:26 PM, Tomas Vondra wrote:
On 3/19/21 5:23 PM, Tomas Vondra wrote:
...
If I replace this with a wait loop in a plpgsql block, that works
perfectly fine (no infinite loops). Tested both on x86_64 and rpi.
For the record, here's the version with plpgsql block, which seems to be
working just fine.
I've pushed this version, with the plpgsql block. I've tested it on all
the machines I have here, hopefully it'll make buildfarm happy too.
AFAICS I was mistaken about what the pump() function does - it clearly
does not run the command repeatedly, it just waits for the right output
to appear. So a busy loop in plpgsql seems like a reasonable solution.
Perhaps there's a better way to do this in TAP, not sure.
My brain hurts from reading too much Perl today ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
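The semantics Tomas describes - pump waits for output that is already on its way and never re-executes anything - can be demonstrated with a plain pipe. In this Python sketch, pump_until is a deliberately simplified stand-in for the IPC::Run helper, not its actual implementation:

```python
import re
import subprocess
import sys

def pump_until(stream, pattern):
    """Read from stream until pattern matches the accumulated output.
    Nothing is re-run; we only wait for more bytes to arrive, and we
    fail if EOF is reached before the pattern ever appears."""
    buf = ""
    while True:
        chunk = stream.read(1)
        if not chunk:
            return False          # EOF: the pattern never showed up
        buf += chunk
        if re.search(pattern, buf):
            return True

# The child emits a marker after a delay; pump_until simply waits for
# it rather than re-issuing any command.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import time; time.sleep(0.2); print('marker-ready')"],
    stdout=subprocess.PIPE, text=True)
assert pump_until(proc.stdout, r"marker-ready")
proc.wait()
```

This is why the random() < 0.5 trick could loop forever: the SELECT runs exactly once on the server, and if it happens to return no row, there is no further output for the pump to match against - hence the busy loop inside plpgsql, which keeps the single query alive until the condition holds.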
On Fri, Mar 19, 2021, at 2:27 PM, Tomas Vondra wrote:
I've pushed this version, with the plpgsql block. I've tested it on all
the machines I have here, hopefully it'll make buildfarm happy too.
Thanks for taking care of this issue.
AFAICS I was mistaken about what the pump() function does - it clearly
does not run the command repeatedly, it just waits for the right output
to appear. So a busy loop in plpgsql seems like a reasonable solution.
Perhaps there's a better way to do this in TAP, not sure.
Getting that fragile test right took me 3 times longer than the code itself. :-/
My brain hurts from reading too much Perl today ...
I feel the same when I have to deal with Perl code.
It seems the animals are happy with this fix.
--
Euler Taveira
EDB https://www.enterprisedb.com/
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
The current behavior is only useful for debugging purposes.
On Wed, 28 Oct 2020 at 15:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
One thing I'm not sure about is whether we should have the GUC as
proposed, or have a negative "keep_temp_files_after_restart" defaulting
to false. But I don't have a very good justification for the alternative
other than vague personal preference.
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
I thought about not providing a GUC at all, or providing it in the developer
section. I've never heard someone saying that they use those temporary
files to investigate an issue. Regarding a crash, all information is already
available and temporary files don't provide extra details. This new GUC is
just to keep the previous behavior. I'm fine without the GUC, though.
Should this GUC be classified as a developer option, and removed from
postgresql.conf.sample?
That was discussed initially in October but not since.
On Wed, Oct 28, 2020 at 11:16:26AM -0300, Euler Taveira wrote:
It also has an undesirable behavior: you have to restart the service to
reclaim the space.
BTW, that's not really true - you can remove the tempfiles while the server is
running. The complainant in bug#16427 was being careful - which is good.
I watch for and remove tempfiles older than 2 days. The worst consequence of
removing a tempfile would be a failed query.
--
Justin
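Justin's watch-and-remove routine could look roughly like the following - a sketch, not his actual script. The base/pgsql_tmp location is the standard layout for a default tablespace, the 2-day retention matches what he describes, and per his caveat the worst consequence of removing a file still in use would be a failed query.

```python
import os
import time

def cleanup_pgsql_tmp(pgdata, max_age_days=2):
    """Delete temporary files under base/pgsql_tmp older than
    max_age_days.  The pgdata path and retention period are
    assumptions to adjust for your installation."""
    tmpdir = os.path.join(pgdata, "base", "pgsql_tmp")
    cutoff = time.time() - max_age_days * 86400
    removed = []
    if not os.path.isdir(tmpdir):
        return removed            # nothing spilled to disk yet
    for name in os.listdir(tmpdir):
        path = os.path.join(tmpdir, name)
        # Only touch files matching the pgsql_tmp naming convention.
        if (name.startswith("pgsql_tmp") and os.path.isfile(path)
                and os.path.getmtime(path) < cutoff):
            os.remove(path)
            removed.append(name)
    return removed
```

Run from cron against the data directory, this reclaims the space without restarting the service, which is exactly the operational gap the remove_temp_files_after_crash GUC closes in-server.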
On Fri, Jun 11, 2021, at 9:43 PM, Justin Pryzby wrote:
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
The current behavior is only useful for debugging purposes.
On Wed, 28 Oct 2020 at 15:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
One thing I'm not sure about is whether we should have the GUC as
proposed, or have a negative "keep_temp_files_after_restart" defaulting
to false. But I don't have a very good justification for the alternative
other than vague personal preference.
On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:
I thought about not providing a GUC at all, or providing it in the developer
section. I've never heard someone saying that they use those temporary
files to investigate an issue. Regarding a crash, all information is already
available and temporary files don't provide extra details. This new GUC is
just to keep the previous behavior. I'm fine without the GUC, though.
Should this GUC be classified as a developer option, and removed from
postgresql.conf.sample?
It probably should.
That was discussed initially in October but not since.
Was it? I seem to have missed this suggestion.
I'm attaching a patch to fix it.
--
Euler Taveira
EDB https://www.enterprisedb.com/
Attachments:
v1-0001-Move-new-GUC-remove_temp_files_after_crash-to-ano.patch (text/x-patch)
From fbfd63fcb0d75045b56576cb220e7caea6af9e92 Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.taveira@enterprisedb.com>
Date: Tue, 22 Jun 2021 14:39:56 -0300
Subject: [PATCH v1] Move new GUC remove_temp_files_after_crash to another
section
The appropriate section for GUC remove_temp_files_after_crash is
probably Developer Options since the main goal of the feature is to
allow inspecting temporary files for debugging purposes. It was an
oversight in commit cd91de0d. Per report from Justin Pryzby.
---
doc/src/sgml/config.sgml | 41 +++++++++----------
src/backend/utils/misc/guc.c | 2 +-
src/backend/utils/misc/postgresql.conf.sample | 2 -
3 files changed, 20 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index eeba2caa43..f5a753e589 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9863,28 +9863,6 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
- <varlistentry id="guc-remove-temp-files-after-crash" xreflabel="remove_temp_files_after_crash">
- <term><varname>remove_temp_files_after_crash</varname> (<type>boolean</type>)
- <indexterm>
- <primary><varname>remove_temp_files_after_crash</varname> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- When set to <literal>on</literal>, which is the default,
- <productname>PostgreSQL</productname> will automatically remove
- temporary files after a backend crash. If disabled, the files will be
- retained and may be used for debugging, for example. Repeated crashes
- may however result in accumulation of useless files.
- </para>
-
- <para>
- This parameter can only be set in the <filename>postgresql.conf</filename>
- file or on the server command line.
- </para>
- </listitem>
- </varlistentry>
-
<varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
<term><varname>data_sync_retry</varname> (<type>boolean</type>)
<indexterm>
@@ -10912,6 +10890,25 @@ LOG: CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
</listitem>
</varlistentry>
+ <varlistentry id="guc-remove-temp-files-after-crash" xreflabel="remove_temp_files_after_crash">
+ <term><varname>remove_temp_files_after_crash</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>remove_temp_files_after_crash</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set to <literal>on</literal>, which is the default,
+ <productname>PostgreSQL</productname> will automatically remove
+ temporary files after a backend crash. If disabled, the files will be
+ retained and may be used for debugging, for example. Repeated crashes
+ may however result in accumulation of useless files. This parameter
+ can only be set in the <filename>postgresql.conf</filename> file or on
+ the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
<sect1 id="runtime-config-short">
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68b62d523d..d8b8f4a572 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1404,7 +1404,7 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
- {"remove_temp_files_after_crash", PGC_SIGHUP, ERROR_HANDLING_OPTIONS,
+ {"remove_temp_files_after_crash", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Remove temporary files after backend crash."),
NULL
},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ddbb6dc2be..a5a7174b0e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -768,8 +768,6 @@
#exit_on_error = off # terminate session on any error?
#restart_after_crash = on # reinitialize after backend crash?
-#remove_temp_files_after_crash = on # remove temporary files after
- # backend crash?
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
--
2.20.1
On Tue, Jun 22, 2021 at 04:12:51PM -0300, Euler Taveira wrote:
Was it? I seem to have missed this suggestion.
I'm attaching a patch to fix it.
Looks adapted to me.
--
Michael