Enhance file_fdw to report processed and skipped tuples in COPY progress

Started by Fujii Masaoover 1 year ago9 messages
#1Fujii Masao
masao.fujii@oss.nttdata.com
1 attachment(s)

Hi,

Currently, file_fdw updates several columns in the pg_stat_progress_copy view,
like relid and bytes_processed, but it doesn't track tuples_processed or
tuples_skipped. Monitoring these would be particularly useful when handling
large data sets via file_fdw, as it helps track the progress of scan.

The attached patch updates file_fdw to add support for reporting
the number of tuples processed and skipped (due to on_error = 'ignore')
in the pg_stat_progress_copy view. What are your thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v1-0001-file_fdw-Report-tuples-processed-and-skipped-for-.patchtext/plain; charset=UTF-8; name=v1-0001-file_fdw-Report-tuples-processed-and-skipped-for-.patchDownload
From 9309195910d3b77614f330af8a80c38fcbc5c532 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Thu, 3 Oct 2024 17:59:11 +0900
Subject: [PATCH v1] file_fdw: Report tuples processed and skipped for COPY
 progress.

This commit enhances file_fdw to report the number of tuples processed
and skipped (due to on_error = 'ignore') in the pg_stat_progress_copy view.
These are shown in the tuples_processed and tuples_skipped columns,
respectively. Previously, they were not reported while other columns like
bytes_processed and bytes_total were updated.

Discussion: https://postgr.es/m/bbddc7a0-b71a-4d54-9560-c90a26a06c48@oss.nttdata.com
---
 contrib/file_fdw/file_fdw.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 043204c3e7..6b6bf7ff18 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -25,6 +25,7 @@
 #include "commands/copyfrom_internal.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "foreign/foreign.h"
@@ -35,6 +36,7 @@
 #include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "utils/acl.h"
+#include "utils/backend_progress.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/sampling.h"
@@ -773,6 +775,10 @@ retry:
 			 */
 			cstate->escontext->error_occurred = false;
 
+			/* Report that this tuple was skipped due to ON_ERROR = ignore */
+			pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
+										 cstate->num_errors);
+
 			/* Switch back to original memory context */
 			MemoryContextSwitchTo(oldcontext);
 
@@ -801,6 +807,9 @@ retry:
 	/* Remove error callback. */
 	error_context_stack = errcallback.previous;
 
+	/* Update the processed tuple count for COPY progress reporting */
+	pgstat_progress_incr_param(PROGRESS_COPY_TUPLES_PROCESSED, 1);
+
 	return slot;
 }
 
@@ -1253,6 +1262,10 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 			 */
 			cstate->escontext->error_occurred = false;
 
+			/* Report that this tuple was skipped due to ON_ERROR = ignore */
+			pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
+										 cstate->num_errors);
+
 			/* Repeat NextCopyFrom() until no soft error occurs */
 			continue;
 		}
@@ -1294,6 +1307,9 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 		}
 
 		*totalrows += 1;
+
+		/* Update the processed tuple count for COPY progress reporting */
+		pgstat_progress_update_param(PROGRESS_COPY_TUPLES_PROCESSED, *totalrows);
 	}
 
 	/* Remove error callback. */
-- 
2.45.2

#2Fabrízio de Royes Mello
fabriziomello@gmail.com
In reply to: Fujii Masao (#1)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On Thu, Oct 3, 2024 at 6:23 AM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

Hi,

Currently, file_fdw updates several columns in the pg_stat_progress_copy

view,

like relid and bytes_processed, but it doesn't track tuples_processed or
tuples_skipped. Monitoring these would be particularly useful when

handling

large data sets via file_fdw, as it helps track the progress of scan.

The attached patch updates file_fdw to add support for reporting
the number of tuples processed and skipped (due to on_error = 'ignore')
in the pg_stat_progress_copy view. What are your thoughts?

Awesome... no failures building and it worked as expected:

-[ RECORD 1 ]----+-----------
pid | 102570
datid | 16388
datname | fabrizio
relid | 16398
command | COPY FROM
type | FILE
bytes_processed | 1359675392
bytes_total | 2921579128
tuples_processed | 18605306
tuples_excluded | 0
tuples_skipped | 0

Regards,

--
Fabrízio de Royes Mello

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#1)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

Hi,

On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Hi,

Currently, file_fdw updates several columns in the pg_stat_progress_copy view,
like relid and bytes_processed, but it doesn't track tuples_processed or
tuples_skipped. Monitoring these would be particularly useful when handling
large data sets via file_fdw, as it helps track the progress of scan.

The attached patch updates file_fdw to add support for reporting
the number of tuples processed and skipped (due to on_error = 'ignore')
in the pg_stat_progress_copy view. What are your thoughts?

While the patch works fine and looks good to me, in the first place,
it seems to me that the fact that file_fdw uses the COPY progress
itself doesn't work properly. For example, unlike COPY command,
queries could have multiple scans on one or more flie_fdw foreign
tables when joining tables. I found the discussion for that[1]/messages/by-id/20230119054703.GB13860@telsasoft.com: there
was a proposal of disabling COPY progress for file_fdw but the votes
are split. I think it would be better to consider if we really want to
support COPY progress for file_fdw before supporting more progress
information.

[1]: /messages/by-id/20230119054703.GB13860@telsasoft.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#4Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Masahiko Sawada (#3)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On 2024/10/04 2:12, Masahiko Sawada wrote:

Hi,

On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Hi,

Currently, file_fdw updates several columns in the pg_stat_progress_copy view,
like relid and bytes_processed, but it doesn't track tuples_processed or
tuples_skipped. Monitoring these would be particularly useful when handling
large data sets via file_fdw, as it helps track the progress of scan.

The attached patch updates file_fdw to add support for reporting
the number of tuples processed and skipped (due to on_error = 'ignore')
in the pg_stat_progress_copy view. What are your thoughts?

While the patch works fine and looks good to me, in the first place,
it seems to me that the fact that file_fdw uses the COPY progress
itself doesn't work properly. For example, unlike COPY command,
queries could have multiple scans on one or more flie_fdw foreign
tables when joining tables. I found the discussion for that[1]: there
was a proposal of disabling COPY progress for file_fdw but the votes
are split. I think it would be better to consider if we really want to
support COPY progress for file_fdw before supporting more progress
information.

Yes, you're right. We need to address how to handle multiple commands
that trigger progress reporting when executed concurrently, at first.

The current progress reporting mechanism assumes only one command
triggering progress is running at a time, as each backend has just
one memory area for progress reporting. If multiple commands run simultaneously,
the progress data would be incorrect. As you mentioned, this could happen
when querying multiple file_fdw foreign tables, where multiple COPY commands
could execute concurrently.

However, this issue already exists without the proposed patch.
Since file_fdw already reports progress partially, querying multiple
file_fdw tables can lead to inaccurate or confusing progress reports.
You can even observe this when analyzing a file_fdw table and also
when copying to the table with a trigger that executes progress-reporting
commands.

So, I don’t think this issue should block the proposed patch.
In fact, progress reporting is already flawed in these scenarios,
regardless of whether the patch is applied.

On the other hand, in many cases where a single file_fdw table is scanned,
COPY progress reporting works correctly for file_fdw and is useful.
Therefore, I believe it's still worth improving file_fdw’s progress reporting.

To prevent misleading reports when multiple commands are run concurrently,
just idea, we could consider displaying NULL columns in the progress report
if this situation is detected, as a separate patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#5Yugo Nagata
nagata@sraoss.co.jp
In reply to: Fujii Masao (#4)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On Fri, 11 Oct 2024 10:53:10 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2024/10/04 2:12, Masahiko Sawada wrote:

Hi,

On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Hi,

Currently, file_fdw updates several columns in the pg_stat_progress_copy view,
like relid and bytes_processed, but it doesn't track tuples_processed or
tuples_skipped. Monitoring these would be particularly useful when handling
large data sets via file_fdw, as it helps track the progress of scan.

The attached patch updates file_fdw to add support for reporting
the number of tuples processed and skipped (due to on_error = 'ignore')
in the pg_stat_progress_copy view. What are your thoughts?

While the patch works fine and looks good to me, in the first place,
it seems to me that the fact that file_fdw uses the COPY progress
itself doesn't work properly. For example, unlike COPY command,
queries could have multiple scans on one or more flie_fdw foreign
tables when joining tables. I found the discussion for that[1]: there
was a proposal of disabling COPY progress for file_fdw but the votes
are split. I think it would be better to consider if we really want to
support COPY progress for file_fdw before supporting more progress
information.

Yes, you're right. We need to address how to handle multiple commands
that trigger progress reporting when executed concurrently, at first.

The current progress reporting mechanism assumes only one command
triggering progress is running at a time, as each backend has just
one memory area for progress reporting. If multiple commands run simultaneously,
the progress data would be incorrect. As you mentioned, this could happen
when querying multiple file_fdw foreign tables, where multiple COPY commands
could execute concurrently.

However, this issue already exists without the proposed patch.
Since file_fdw already reports progress partially, querying multiple
file_fdw tables can lead to inaccurate or confusing progress reports.
You can even observe this when analyzing a file_fdw table and also
when copying to the table with a trigger that executes progress-reporting
commands.

So, I don’t think this issue should block the proposed patch.
In fact, progress reporting is already flawed in these scenarios,
regardless of whether the patch is applied.

On the other hand, in many cases where a single file_fdw table is scanned,
COPY progress reporting works correctly for file_fdw and is useful.
Therefore, I believe it's still worth improving file_fdw’s progress reporting.

I think reporting tuples_processed and tuples_skipped columns additionally
in file_fdw is reasonable, since it already reports bytes_processed and bytes_total.

By the way, in the documentation of fild_fdw, it is not explicitly described
that file_fdw uses COPY internally, although I can find several wordings like "as COPY".
To prevent users to face unexpected experiences, how about explaining explicitly that
file_fdw uses COPY and updates pg_stat_progress_copy?

To prevent misleading reports when multiple commands are run concurrently,
just idea, we could consider displaying NULL columns in the progress report
if this situation is detected, as a separate patch.

Or, could we add additional field to pg_stat_progress_copy to
show how much commands are running COPY? It is also just idea.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#6Kirill Reshke
reshkekirill@gmail.com
In reply to: Fujii Masao (#4)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On Fri, 11 Oct 2024 at 06:53, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

However, this issue already exists without the proposed patch.
Since file_fdw already reports progress partially, querying multiple
file_fdw tables can lead to inaccurate or confusing progress reports.
You can even observe this when analyzing a file_fdw table and also
when copying to the table with a trigger that executes progress-reporting
commands.

So, I don’t think this issue should block the proposed patch.
In fact, progress reporting is already flawed in these scenarios,
regardless of whether the patch is applied.

Im +1 on this. To include or not to include these new fields, and to
fix multiple-run-bug is two separate topics IMO.
Also please notice review comments in[0]/messages/by-id/20241011153645.a348de1576a3f57092c68355@sraoss.co.jp.

[0]: /messages/by-id/20241011153645.a348de1576a3f57092c68355@sraoss.co.jp

Moved to the next CF.

--
Best regards,
Kirill Reshke

#7Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kirill Reshke (#6)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On 2024/11/30 15:23, Kirill Reshke wrote:

On Fri, 11 Oct 2024 at 06:53, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

However, this issue already exists without the proposed patch.
Since file_fdw already reports progress partially, querying multiple
file_fdw tables can lead to inaccurate or confusing progress reports.
You can even observe this when analyzing a file_fdw table and also
when copying to the table with a trigger that executes progress-reporting
commands.

So, I don’t think this issue should block the proposed patch.
In fact, progress reporting is already flawed in these scenarios,
regardless of whether the patch is applied.

On second thought, supporting progress tracking for COPY used by file_fdw
could increase the chances of multiple commands being tracked simultaneously
by a single backend. This means the command progress view might show
incorrect results more often.

As I mentioned before, this issue already exists, but it currently
only happens in rare cases. I don’t think the fact that the issue
already exists is a good reason to introduce more, and likely more common,
scenarios where it could occur.

With that in mind, I'm thinking of withdrawing this patch for now.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#8vignesh C
vignesh21@gmail.com
In reply to: Fujii Masao (#7)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On Wed, 5 Mar 2025 at 21:43, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2024/11/30 15:23, Kirill Reshke wrote:

On Fri, 11 Oct 2024 at 06:53, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

However, this issue already exists without the proposed patch.
Since file_fdw already reports progress partially, querying multiple
file_fdw tables can lead to inaccurate or confusing progress reports.
You can even observe this when analyzing a file_fdw table and also
when copying to the table with a trigger that executes progress-reporting
commands.

So, I don’t think this issue should block the proposed patch.
In fact, progress reporting is already flawed in these scenarios,
regardless of whether the patch is applied.

On second thought, supporting progress tracking for COPY used by file_fdw
could increase the chances of multiple commands being tracked simultaneously
by a single backend. This means the command progress view might show
incorrect results more often.

As I mentioned before, this issue already exists, but it currently
only happens in rare cases. I don’t think the fact that the issue
already exists is a good reason to introduce more, and likely more common,
scenarios where it could occur.

With that in mind, I'm thinking of withdrawing this patch for now.

I've updated the status to "withdrawn." Feel free to add it again
anytime if you change your mind.

Regards,
Vignesh

#9Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: vignesh C (#8)
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

On 2025/03/09 20:38, vignesh C wrote:

I've updated the status to "withdrawn." Feel free to add it again
anytime if you change your mind.

Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION