Progress report removal of temp files and temp relation files using ereport_startup_progress
Hi,
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer during
which users have no clue about what's going on in the server before it
comes up online.
Here's a proposal to use ereport_startup_progress to report the
progress of the file removal.
Thoughts?
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-Progress-report-removal-of-temp-files-and-temp-re.patchapplication/octet-stream; name=v1-0001-Progress-report-removal-of-temp-files-and-temp-re.patchDownload
From 9f9da2ad7d1bc433bbef0ed44b2f37454d968ec2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 30 Apr 2022 05:35:22 +0000
Subject: [PATCH v1] Progress report removal of temp files and temp relation
files
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer
during which users have no clue about what's going on in the
server before it comes up online. This patch uses
ereport_startup_progress to report the progress of the file
removal.
---
src/backend/storage/file/fd.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..08c333e514 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3156,6 +3156,12 @@ RemovePgTempFiles(void)
DIR *spc_dir;
struct dirent *spc_de;
+ /*
+ * Prepare to report progress of the temporary and temporary relation files
+ * removal phase.
+ */
+ begin_startup_progress_phase();
+
/*
* First process temp files in pg_default ($PGDATA/base)
*/
@@ -3229,6 +3235,9 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
tmpdirname, temp_de->d_name);
+ ereport_startup_progress("removing temporary files under pgsql_tmp directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink_all ||
strncmp(temp_de->d_name,
PG_TEMP_FILE_PREFIX,
@@ -3319,6 +3328,9 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
dbspacedirname, de->d_name);
+ ereport_startup_progress("removing temporary relation files under pg_tblspc directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink(rm_path) < 0)
ereport(LOG,
(errcode_for_file_access(),
--
2.25.1
Hi Bharath,
On Sat, Apr 30, 2022 at 11:08 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer during
which users have no clue about what's going on in the server before it
comes up online.Here's a proposal to use ereport_startup_progress to report the
progress of the file removal.Thoughts?
The patch looks good to me.
With this patch, the user would at least know which directory is being
scanned and how much time has elapsed. It would be better to know how
much work is remaining. I could not find a way to estimate the number
of files in the directory so that we can extrapolate elapsed time and
estimate the remaining time. Well, we could loop the output of
opendir() twice, first to estimate and then for the actual work. This
might actually work, if the time to delete all the files is very high
compared to the time it takes to scan all the files/directories.
Another possibility is to scan the sorted output of opendir() thus
using the current file name to estimate remaining files in a very
crude and inaccurate way. That doesn't look attractive either. I can't
think of any better way to estimate the remaining time.
But at least with this patch, a user knows which files have been
deleted, guessing how far, in the directory structure, the process has
reached. S/he can then take a look at the remaining contents of the
directory to estimate how much it should wait.
--
Best Wishes,
Ashutosh Bapat
On Mon, May 2, 2022 at 6:26 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Hi Bharath,
On Sat, Apr 30, 2022 at 11:08 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:Hi,
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer during
which users have no clue about what's going on in the server before it
comes up online.Here's a proposal to use ereport_startup_progress to report the
progress of the file removal.Thoughts?
The patch looks good to me.
With this patch, the user would at least know which directory is being
scanned and how much time has elapsed.
There's a problem with the patch, the timeout mechanism isn't being
used by the postmaster process. Postmaster doesn't
InitializeTimeouts() and doesn't register STARTUP_PROGRESS_TIMEOUT, I
tried to make postmaster do that (attached a v2 patch) but make check
fails.
Now, I'm thinking if it's a good idea to let postmaster use timeouts at all?
It would be better to know how
much work is remaining. I could not find a way to estimate the number
of files in the directory so that we can extrapolate elapsed time and
estimate the remaining time. Well, we could loop the output of
opendir() twice, first to estimate and then for the actual work. This
might actually work, if the time to delete all the files is very high
compared to the time it takes to scan all the files/directories.Another possibility is to scan the sorted output of opendir() thus
using the current file name to estimate remaining files in a very
crude and inaccurate way. That doesn't look attractive either. I can't
think of any better way to estimate the remaining time.
I think 'how much work/how many files remaining to process' is a
generic problem, for instance, snapshot, mapping files, old WAL file
processing and so on. I don't think we can do much about it.
But at least with this patch, a user knows which files have been
deleted, guessing how far, in the directory structure, the process has
reached. S/he can then take a look at the remaining contents of the
directory to estimate how much it should wait.
Not sure we will be able to use the timeout mechanism within
postmaster. Another idea is to have a generic GUC something like
log_file_processing_traffic = {none, medium, high} (similar idea is
proposed for WAL files processing while replaying/recovering at [1]/messages/by-id/CALj2ACVnhbx4pLZepvdqOfeOekvZXJ2F=wJeConGzok+6kgCVA@mail.gmail.com),
default being none, when set to medium a log message gets emitted for
every say 128 or 256 (just a random number) files processed. when set
to high, log messages get emitted for every file processed (too
verbose). I think this generic GUC log_file_processing_traffic can be
used in many other file processing areas.
Thoughts?
[1]: /messages/by-id/CALj2ACVnhbx4pLZepvdqOfeOekvZXJ2F=wJeConGzok+6kgCVA@mail.gmail.com
Regards,
Bharath Rupireddy.
Attachments:
v2-0001-pgsql_tmp-ereport_startup_progress.patchapplication/octet-stream; name=v2-0001-pgsql_tmp-ereport_startup_progress.patchDownload
From 059099c469c40ee9d91a3b37b8960b36557e6030 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 4 May 2022 14:17:24 +0000
Subject: [PATCH v2] pgsql_tmp ereport_startup_progress
---
src/backend/postmaster/postmaster.c | 11 +++++++++++
src/backend/storage/file/fd.c | 15 +++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 964a56dec4..261a5a5911 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -116,6 +116,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
+#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
@@ -689,6 +690,11 @@ PostmasterMain(int argc, char *argv[])
pqsignal_pm(SIGXFSZ, SIG_IGN); /* ignored */
#endif
+ InitializeTimeouts(); /* establishes SIGALRM handler */
+
+ RegisterTimeout(STARTUP_PROGRESS_TIMEOUT,
+ startup_progress_timeout_handler);
+
/*
* Options setup
*/
@@ -1104,6 +1110,9 @@ PostmasterMain(int argc, char *argv[])
/* Write out nondefault GUC settings for child processes to use */
write_nondefault_variables(PGC_POSTMASTER);
+ /* Prepare to report progress of the temporary files removal phase */
+ begin_startup_progress_phase();
+
/*
* Clean out the temp directory used to transmit parameters to child
* processes (see internal_forkexec, below). We must do this before
@@ -1113,6 +1122,8 @@ PostmasterMain(int argc, char *argv[])
* conflicting Postgres processes in this data directory.
*/
RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+
+ disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..c793303bd8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -102,6 +102,7 @@
#include "storage/ipc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#include "utils/timeout.h"
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
@@ -3156,6 +3157,12 @@ RemovePgTempFiles(void)
DIR *spc_dir;
struct dirent *spc_de;
+ /*
+ * Prepare to report progress of the temporary and temporary relation files
+ * removal phase.
+ */
+ begin_startup_progress_phase();
+
/*
* First process temp files in pg_default ($PGDATA/base)
*/
@@ -3185,6 +3192,8 @@ RemovePgTempFiles(void)
FreeDir(spc_dir);
+ disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);
+
/*
* In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
* DataDir as well. However, that is *not* cleaned here because doing so
@@ -3229,6 +3238,9 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
tmpdirname, temp_de->d_name);
+ ereport_startup_progress("removing temporary files under pgsql_tmp directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink_all ||
strncmp(temp_de->d_name,
PG_TEMP_FILE_PREFIX,
@@ -3319,6 +3331,9 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
dbspacedirname, de->d_name);
+ ereport_startup_progress("removing temporary relation files under pg_tblspc directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink(rm_path) < 0)
ereport(LOG,
(errcode_for_file_access(),
--
2.25.1
On Thu, May 5, 2022 at 12:11 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, May 2, 2022 at 6:26 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:Hi Bharath,
On Sat, Apr 30, 2022 at 11:08 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:Hi,
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer during
which users have no clue about what's going on in the server before it
comes up online.Here's a proposal to use ereport_startup_progress to report the
progress of the file removal.Thoughts?
The patch looks good to me.
With this patch, the user would at least know which directory is being
scanned and how much time has elapsed.There's a problem with the patch, the timeout mechanism isn't being
used by the postmaster process. Postmaster doesn't
InitializeTimeouts() and doesn't register STARTUP_PROGRESS_TIMEOUT, I
tried to make postmaster do that (attached a v2 patch) but make check
fails.Now, I'm thinking if it's a good idea to let postmaster use timeouts at all?
Here's the v3 patch, which adds progress reports for temp file removal
under the pgsql_tmp directory and temporary relation files under the
pg_tblspc directory, regression tests pass with it.
Regards,
Bharath Rupireddy.
Attachments:
v3-0001-Progress-report-removal-of-temp-files-and-temp-re.patchapplication/x-patch; name=v3-0001-Progress-report-removal-of-temp-files-and-temp-re.patchDownload
From c31492191ffcd6222d7e42917cbd928351623780 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 2 Aug 2022 05:10:06 +0000
Subject: [PATCH v3] Progress report removal of temp files and temp relation
files
At times, there can be many temp files (under pgsql_tmp) and temp
relation files (under removal which after crash may take longer
during which users have no clue about what's going on in the
server before it comes up online. This patch uses
ereport_startup_progress to report the progress of the file
removal.
---
src/backend/postmaster/postmaster.c | 10 +++++++++-
src/backend/storage/file/fd.c | 13 +++++++++++++
src/backend/utils/misc/timeout.c | 11 +++++++++--
3 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e541b16bdb..40b53276be 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -116,6 +116,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
+#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
@@ -653,7 +654,6 @@ PostmasterMain(int argc, char *argv[])
pqsignal_pm(SIGINT, pmdie); /* send SIGTERM and shut down */
pqsignal_pm(SIGQUIT, pmdie); /* send SIGQUIT and die */
pqsignal_pm(SIGTERM, pmdie); /* wait for children and shut down */
- pqsignal_pm(SIGALRM, SIG_IGN); /* ignored */
pqsignal_pm(SIGPIPE, SIG_IGN); /* ignored */
pqsignal_pm(SIGUSR1, sigusr1_handler); /* message from child process */
pqsignal_pm(SIGUSR2, dummy_handler); /* unused, reserve for children */
@@ -688,6 +688,11 @@ PostmasterMain(int argc, char *argv[])
pqsignal_pm(SIGXFSZ, SIG_IGN); /* ignored */
#endif
+ InitializeTimeouts(); /* establishes SIGALRM handler */
+
+ RegisterTimeout(STARTUP_PROGRESS_TIMEOUT,
+ startup_progress_timeout_handler);
+
/*
* Options setup
*/
@@ -1119,6 +1124,9 @@ PostmasterMain(int argc, char *argv[])
/* Write out nondefault GUC settings for child processes to use */
write_nondefault_variables(PGC_POSTMASTER);
+ /* Prepare to report progress of the temporary files removal phase */
+ begin_startup_progress_phase();
+
/*
* Clean out the temp directory used to transmit parameters to child
* processes (see internal_forkexec, below). We must do this before
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f904f60c08..37c13a60ba 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -102,6 +102,7 @@
#include "storage/ipc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#include "utils/timeout.h"
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
@@ -3093,6 +3094,12 @@ RemovePgTempFiles(void)
DIR *spc_dir;
struct dirent *spc_de;
+ /*
+ * Prepare to report progress of the temporary and temporary relation files
+ * removal phase.
+ */
+ begin_startup_progress_phase();
+
/*
* First process temp files in pg_default ($PGDATA/base)
*/
@@ -3166,6 +3173,9 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
tmpdirname, temp_de->d_name);
+ ereport_startup_progress("removing temporary files under pgsql_tmp directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink_all ||
strncmp(temp_de->d_name,
PG_TEMP_FILE_PREFIX,
@@ -3256,6 +3266,9 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
snprintf(rm_path, sizeof(rm_path), "%s/%s",
dbspacedirname, de->d_name);
+ ereport_startup_progress("removing temporary relation files under pg_tblspc directory, elapsed time: %ld.%02d s, current file: %s",
+ rm_path);
+
if (unlink(rm_path) < 0)
ereport(LOG,
(errcode_for_file_access(),
diff --git a/src/backend/utils/misc/timeout.c b/src/backend/utils/misc/timeout.c
index 6f5e08bc30..cc4b9d6ad7 100644
--- a/src/backend/utils/misc/timeout.c
+++ b/src/backend/utils/misc/timeout.c
@@ -16,6 +16,7 @@
#include <sys/time.h>
+#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "storage/proc.h"
#include "utils/timeout.h"
@@ -375,8 +376,11 @@ handle_sig_alarm(SIGNAL_ARGS)
/*
* SIGALRM is always cause for waking anything waiting on the process
* latch.
+ *
+ * Postmaster has no latch associated with it.
*/
- SetLatch(MyLatch);
+ if (MyLatch)
+ SetLatch(MyLatch);
/*
* Always reset signal_pending, even if !alarm_enabled, since indeed no
@@ -494,7 +498,10 @@ InitializeTimeouts(void)
all_timeouts_initialized = true;
/* Now establish the signal handler */
- pqsignal(SIGALRM, handle_sig_alarm);
+ if (IsPostmasterEnvironment)
+ pqsignal_pm(SIGALRM, handle_sig_alarm);
+ else
+ pqsignal(SIGALRM, handle_sig_alarm);
}
/*
--
2.34.1