Add tuples_skipped to pg_stat_progress_copy

Started by torikoshiaalmost 2 years ago6 messages
#1torikoshia
torikoshia
torikoshia@oss.nttdata.com
1 attachment(s)

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

Attachments:

v1-0001-Add-tuples_skipped-to-pg_stat_progress_copy.patchtext/x-diff; name=v1-0001-Add-tuples_skipped-to-pg_stat_progress_copy.patch
#2Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: torikoshia (#1)
Re: Add tuples_skipped to pg_stat_progress_copy

On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#3torikoshia
torikoshia
torikoshia@oss.nttdata.com
In reply to: Masahiko Sawada (#2)
2 attachment(s)
Re: Add tuples_skipped to pg_stat_progress_copy

On 2024-01-17 14:47, Masahiko Sawada wrote:

On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of
skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise 
zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by
b725b7eec.

BTW, based on this patch, I think we can add another option which
specifies the maximum tolerable number of malformed rows.
I remember this was discussed in [1]/messages/by-id/752672.1699474336@sss.pgh.pa.us, and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached 0002 is WIP patch for this(I haven't added doc yet).

This may be better discussed in another thread, but any comments(e.g.
necessity of this option, option name) are welcome.

[1]: /messages/by-id/752672.1699474336@sss.pgh.pa.us
/messages/by-id/752672.1699474336@sss.pgh.pa.us

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

Attachments:

v2-0001-Add-tuples_skipped-to-pg_stat_progress_copy.patchtext/x-diff; name=v2-0001-Add-tuples_skipped-to-pg_stat_progress_copy.patch
v2-0002-Add-new-COPY-option-REJECT_LIMIT.patchtext/x-diff; name=v2-0002-Add-new-COPY-option-REJECT_LIMIT.patch
#4Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: torikoshia (#3)
Re: Add tuples_skipped to pg_stat_progress_copy

On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> wrote:

On 2024-01-17 14:47, Masahiko Sawada wrote:

On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of
skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by
b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

BTW, based on this patch, I think we can add another option which
specifies the maximum tolerable number of malformed rows.
I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached 0002 is WIP patch for this(I haven't added doc yet).

Yeah, it could be a good option.

This may be better discussed in another thread, but any comments(e.g.
necessity of this option, option name) are welcome.

I'd recommend forking a new thread for this option. As far as I
remember, there also was an opinion that "reject limit" stuff is not
very useful.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#5torikoshia
torikoshia
torikoshia@oss.nttdata.com
In reply to: Masahiko Sawada (#4)
Re: Add tuples_skipped to pg_stat_progress_copy

On 2024-01-24 17:05, Masahiko Sawada wrote:

On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:

On 2024-01-17 14:47, Masahiko Sawada wrote:

On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of
skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by
b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

Thanks!

BTW, based on this patch, I think we can add another option which
specifies the maximum tolerable number of malformed rows.
I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached 0002 is WIP patch for this(I haven't added doc yet).

Yeah, it could be a good option.

This may be better discussed in another thread, but any comments(e.g.
necessity of this option, option name) are welcome.

I'd recommend forking a new thread for this option. As far as I
remember, there also was an opinion that "reject limit" stuff is not
very useful.

OK, I'll make another thread for this.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#6Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: torikoshia (#5)
Re: Add tuples_skipped to pg_stat_progress_copy

On Thu, Jan 25, 2024 at 11:25 AM torikoshia <torikoshia@oss.nttdata.com> wrote:

On 2024-01-24 17:05, Masahiko Sawada wrote:

On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:

On 2024-01-17 14:47, Masahiko Sawada wrote:

On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
skip malformed data, but there is no way to watch the number of
skipped
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when there is a larger amount of skipped data than expected, for
example.

As described in commit log, it is expected to add more choices for
SAVE_ERROR_TO like 'log' and using such options may enable us to know
the number of skipped tuples during COPY, but exposed in
pg_stat_progress_copy would be easier to monitor.

What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by
b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

Thanks!

Pushed (commit 729439607).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com