[PATCH] Simple progress reporting for COPY command

Started by Josef Šimánekover 5 years ago26 messageshackers
Jump to latest
#1Josef Šimánek
josef.simanek@gmail.com

Hello,

finally I had some time to revisit patch and all comments from
/messages/by-id/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
and I have prepared simple version of COPY command progress reporting.

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)
lines_processed - bigint - amount of tuples processed

example output of progress for common use case (import from CSV file):

first console:
yr=# COPY test FROM '/home/retro/test.csv' (FORMAT CSV);

second console:
yr=# SELECT * FROM pg_stat_progress_copy;
pid | datid | datname | relid | bytes_processed | bytes_total |
lines_processed
--------+-------+---------+-------+-----------------+-------------+-----------------
803148 | 16384 | yr | 16394 | 998965248 | 1777777796 |
56730126
(1 row)

It is simple to get progress in percents for example by:

yr=# SELECT (bytes_processed/bytes_total::decimal)*100 FROM
pg_stat_progress_copy WHERE pid = 803148;
?column?
-------------------------
50.04287948706048525800

^ ~50% of file processed already

I did some dead simple benchmarking as well. The difference is not
huge. Each command works with 100 millions of tuples. Times are in
seconds.

test with progress master (32d6287) difference
------------------------- --------------- ------------------ ------------
COPY table TO 46.102 47.499 -1.397
COPY query TO 52.168 49.822 2.346
COPY table TO PROGRAM 52.345 51.882 0.463
COPY query TO PROGRAM 54.141 52.763 1.378
COPY table FROM 88.970 85.161 3.809
COPY table FROM PROGRAM 94.393 90.346 4.047

Properly formatted table (since I'm not sure everyone here would be
able to see the table formatted well) and the benchmark source is
present at https://github.com/simi/postgres/pull/6. I have also
included an example output in there.

I'll add this to the current commitfest as well.

Attachments:

001-copy-progress.patchtext/x-patch; charset=US-ASCII; name=001-copy-progress.patchDownload+168-5
#2Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Josef Šimánek (#1)
Re: [PATCH] Simple progress reporting for COPY command

On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote:

finally I had some time to revisit patch and all comments from
/messages/by-id/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
and I have prepared simple version of COPY command progress reporting.

Thanks for the patch.

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)

It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
having this parameter as 0 is an indication of the COPY command being
from STDIN? I'm not sure whether it's discussed or not previously, but
I personally prefer to have it as a separate column, say a varchar or
text column with values STDIN, FILE or PROGRAM may be. And also it
will be better if we have the format of the data streaming in, as a
separate column, with possible values may be CSV, TEXT, BINARY.

Since this progress reporting is for both COPY FROM and COPY TO,
wouldn't it be good to have that also as a separate column, say
command type with values FROM and TO?

Thoughts?

I'm sure some of the points would have already been discussed. I just
shared my thoughts here.

I have not looked into the patch in detail, but it's missing tests.
I'm sure we can not add the tests into copy.sql or copy2.sql, can we
think of adding test cases to TAP or under isolation? I'm not sure
whether other progress reporting commands have test cases.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#3Josef Šimánek
josef.simanek@gmail.com
In reply to: Bharath Rupireddy (#2)
Re: [PATCH] Simple progress reporting for COPY command

pá 1. 1. 2021 v 11:16 odesílatel Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> napsal:

On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote:

finally I had some time to revisit patch and all comments from
/messages/by-id/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
and I have prepared simple version of COPY command progress reporting.

Thanks for the patch.

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)

It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
having this parameter as 0 is an indication of the COPY command being
from STDIN? I'm not sure whether it's discussed or not previously, but
I personally prefer to have it as a separate column, say a varchar or
text column with values STDIN, FILE or PROGRAM may be. And also it
will be better if we have the format of the data streaming in, as a
separate column, with possible values may be CSV, TEXT, BINARY.

Since this progress reporting is for both COPY FROM and COPY TO,
wouldn't it be good to have that also as a separate column, say
command type with values FROM and TO?

Thoughts?

My first patch had more columns (direction - FROM/TO, file bool,
program bool), but it was subject of discussion what's the purpose.
You can check the query by looking at pg_stat_activity by pid to get
more info. To keep it simple I decided to go with a minimal set of
columns for now. It can be extended later. I'm ok to continue on
improving this with more feedback once merged.

I'm sure some of the points would have already been discussed. I just
shared my thoughts here.

I have not looked into the patch in detail, but it's missing tests.
I'm sure we can not add the tests into copy.sql or copy2.sql, can we
think of adding test cases to TAP or under isolation? I'm not sure
whether other progress reporting commands have test cases.

I have raised the question in an old thread as well since there are no
tests for other progress commands as well. I was told it is ok for now
to keep it untested, since there's no easy way to do that for now.

/messages/by-id/20200615001828.GA52676@paquier.xyz

Show quoted text

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josef Šimánek (#3)
Re: [PATCH] Simple progress reporting for COPY command

Hi,

I did take a quick look today, and I have a couple minor comments:

1) The catalog sgml docs seem to mention bytes_processed twice (one of
that should be bytes_total), and line_processed (should be "lines_").

2) I'm not quite sure about not including any info about the command.
For example pg_stat_progress_create_index includes info about the
command, although it's not very detailed. Not sure how useful would it
be to show just COPY FROM / COPY TO, without more details.

It's probably possible to extract this from pg_stat_activity, but that
may be rather cumbersome (parsing arbitrary SQL and all that). Also,
what if the COPY is called from a function etc.?

3) I wonder if we should have something like lines_estimated. For COPY
TO it's pretty simple to calculate it as

bytes_total / (bytes_processed / lines_processed)

but what about

COPY (query) TO file

In that case we don't know the total amount of bytes / rows, so we can't
calculate any estimate. So maybe we could peek into the query plan? But
I agree this is something we can add later.

4) This comment is a bit confusing, as it mixes "total" and "processed".
I'd just say "number of bytes processed so far" instead.

Other than that, it seems fine. I'm sure we could add more features, but
it seems like a good start - I plan to get this committed once I get a
patch fixing the docs issues.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Josef Šimánek
josef.simanek@gmail.com
In reply to: Tomas Vondra (#4)
Re: [PATCH] Simple progress reporting for COPY command

út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

Hi,

I did take a quick look today, and I have a couple minor comments:

Hi! Thanks for your time.

1) The catalog sgml docs seem to mention bytes_processed twice (one of
that should be bytes_total), and line_processed (should be "lines_").

Fixed in attached patch.

2) I'm not quite sure about not including any info about the command.
For example pg_stat_progress_create_index includes info about the
command, although it's not very detailed. Not sure how useful would it
be to show just COPY FROM / COPY TO, without more details.

It's probably possible to extract this from pg_stat_activity, but that
may be rather cumbersome (parsing arbitrary SQL and all that). Also,
what if the COPY is called from a function etc.?

Any idea where to discuss this? My usecase is really simple. I would
like to be able to see progress of COPY command by pid. There is a lot
of COPY info inside CopyToStateData and CopyFromStateData structs. The
only limitation I see is support of only int64 values for progress
reporting columns. I'm not sure if it is easily possible to expose for
example filename this way.

3) I wonder if we should have something like lines_estimated. For COPY
TO it's pretty simple to calculate it as

bytes_total / (bytes_processed / lines_processed)

but what about

COPY (query) TO file

In that case we don't know the total amount of bytes / rows, so we can't
calculate any estimate. So maybe we could peek into the query plan? But
I agree this is something we can add later.

If I remember well one of the original ideas was to be able to pass
custom bytes_total (or lines_total) by COPY options to be stored in
copy progress. I can add this in some following patch if still
welcomed.

For estimates I would prefer to add an additional column to not mix
those two together (or at least boolean estimated = true/false and
reuse bytes_total column). If query estimates are welcomed, I can take
a look at how to reach the query plan and expose those numbers when
the query is used to estimated_lines and potentially estimated_bytes
columns. It would be probably a little tricky to calculate
estimated_bytes for some column types.

Also currently only COPY FROM file supports bytes_total (it is 0 for
all other scenarios). I think it should be possible for PostgreSQL to
know the actual amount of lines query returns for some kind of
queries, but I have no idea where to look at this. If that's possible
to get, it would be one of the next steps to introduce additional
column lines_total.

4) This comment is a bit confusing, as it mixes "total" and "processed".
I'd just say "number of bytes processed so far" instead.

Fixed in attached patch.

Other than that, it seems fine. I'm sure we could add more features, but
it seems like a good start - I plan to get this committed once I get a
patch fixing the docs issues.

Patch is attached, it should be applied to the top of the previous
patch. Overall patch (having both patches merged together) could be
found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.

Show quoted text

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

002-copy-progress.patchtext/x-patch; charset=US-ASCII; name=002-copy-progress.patchDownload+3-3
#6Josef Šimánek
josef.simanek@gmail.com
In reply to: Josef Šimánek (#5)
Re: [PATCH] Simple progress reporting for COPY command

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

út 5. 1. 2021 v 2:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:

Show quoted text

út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

Hi,

I did take a quick look today, and I have a couple minor comments:

Hi! Thanks for your time.

1) The catalog sgml docs seem to mention bytes_processed twice (one of
that should be bytes_total), and line_processed (should be "lines_").

Fixed in attached patch.

2) I'm not quite sure about not including any info about the command.
For example pg_stat_progress_create_index includes info about the
command, although it's not very detailed. Not sure how useful would it
be to show just COPY FROM / COPY TO, without more details.

It's probably possible to extract this from pg_stat_activity, but that
may be rather cumbersome (parsing arbitrary SQL and all that). Also,
what if the COPY is called from a function etc.?

Any idea where to discuss this? My usecase is really simple. I would
like to be able to see progress of COPY command by pid. There is a lot
of COPY info inside CopyToStateData and CopyFromStateData structs. The
only limitation I see is support of only int64 values for progress
reporting columns. I'm not sure if it is easily possible to expose for
example filename this way.

3) I wonder if we should have something like lines_estimated. For COPY
TO it's pretty simple to calculate it as

bytes_total / (bytes_processed / lines_processed)

but what about

COPY (query) TO file

In that case we don't know the total amount of bytes / rows, so we can't
calculate any estimate. So maybe we could peek into the query plan? But
I agree this is something we can add later.

If I remember well one of the original ideas was to be able to pass
custom bytes_total (or lines_total) by COPY options to be stored in
copy progress. I can add this in some following patch if still
welcomed.

For estimates I would prefer to add an additional column to not mix
those two together (or at least boolean estimated = true/false and
reuse bytes_total column). If query estimates are welcomed, I can take
a look at how to reach the query plan and expose those numbers when
the query is used to estimated_lines and potentially estimated_bytes
columns. It would be probably a little tricky to calculate
estimated_bytes for some column types.

Also currently only COPY FROM file supports bytes_total (it is 0 for
all other scenarios). I think it should be possible for PostgreSQL to
know the actual amount of lines query returns for some kind of
queries, but I have no idea where to look at this. If that's possible
to get, it would be one of the next steps to introduce additional
column lines_total.

4) This comment is a bit confusing, as it mixes "total" and "processed".
I'd just say "number of bytes processed so far" instead.

Fixed in attached patch.

Other than that, it seems fine. I'm sure we could add more features, but
it seems like a good start - I plan to get this committed once I get a
patch fixing the docs issues.

Patch is attached, it should be applied to the top of the previous
patch. Overall patch (having both patches merged together) could be
found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

001-copy-progress-revivew-1.patchtext/x-patch; charset=US-ASCII; name=001-copy-progress-revivew-1.patchDownload+171-8
#7Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Josef Šimánek (#1)
Re: [PATCH] Simple progress reporting for COPY command

On Fri, 1 Jan 2021 at 02:25, Josef Šimánek <josef.simanek@gmail.com> wrote:

Hello,

finally I had some time to revisit patch and all comments from

/messages/by-id/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com

and I have prepared simple version of COPY command progress reporting.

+1

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Seems reasonable. One potential extention I could think of is progress
reporting of tuples processed before/after applying the COPY's WHERE
clause, when and where applicable.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)
lines_processed - bigint - amount of tuples processed

Could you update the name of this column to be consistent with the
'tuples'-terminology used in the other progress reporting views, i.e.
lines_processed -> tuples_processed? That adds consistency, and also
disambiguates the meaning in case of e.g. COPY FROM (format CSV), as
multiline CSV fields do exist, and we're not necessarily counting lines
from or to a file.

With regards,

Matthias van de Meent

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josef Šimánek (#6)
Re: [PATCH] Simple progress reporting for COPY command

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

One more question, though - I now realize the lines_processed ignores
rows skipped because of BEFORE INSERT triggers. I wonder if that's the
right thing to do? Imagine you know the number of lines in a file. You
can't really use (lines_processed / total_lines) to measure progress,
because that may ignore many "skipped" rows. So maybe this should be
changed to count all rows. OTOH we still have bytes_processed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#8)
Re: [PATCH] Simple progress reporting for COPY command

On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to work
correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section was
not indexed properly, and so on.

Some more doc fixes.

From 0616f448057905ab9b3218ebddfdd3af52e62bac Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 6 Jan 2021 19:08:11 -0600
Subject: [PATCH] doc review: COPY progress:
8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f

---
doc/src/sgml/monitoring.sgml | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43fe8ae383..3cdb1aff3c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6414,9 +6414,9 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   <para>
    Whenever <command>COPY</command> is running, the
    <structname>pg_stat_progress_copy</structname> view will contain one row
-   for each backend that is currently running <command>COPY</command> command.
-   The table bellow describes the information that will be reported and provide
-   information how to interpret it.
+   for each backend that is currently running a <command>COPY</command> command.
+   The table below describes the information that will be reported and provides
+   information about how to interpret it.
   </para>

<table id="pg-stat-progress-copy-view" xreflabel="pg_stat_progress_copy">
@@ -6445,7 +6445,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>datid</structfield> <type>text</type>
+       <structfield>datid</structfield> <type>oid</type>
       </para>
       <para>
        OID of the database to which this backend is connected.
@@ -6467,7 +6467,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
       </para>
       <para>
        OID of the table on which the <command>COPY</command> command is executed.
-       It is set to 0 if <command>SELECT</command> query is provided.
+       It is set to 0 if copying from a <command>SELECT</command> query.
       </para></entry>
      </row>

--
2.17.0

Attachments:

0001-doc-review-COPY-progress-8a4f618e7ae3cb11b0b37d0f06f.patchtext/x-diff; charset=us-asciiDownload+5-6
#10Josef Šimánek
josef.simanek@gmail.com
In reply to: Tomas Vondra (#8)
Re: [PATCH] Simple progress reporting for COPY command

st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

One more question, though - I now realize the lines_processed ignores
rows skipped because of BEFORE INSERT triggers. I wonder if that's the
right thing to do? Imagine you know the number of lines in a file. You
can't really use (lines_processed / total_lines) to measure progress,
because that may ignore many "skipped" rows. So maybe this should be
changed to count all rows. OTOH we still have bytes_processed.

I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.

Show quoted text

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#8)
Re: [PATCH] Simple progress reporting for COPY command

On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

How about including a column for command similar to
pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
that command will be useful in the context of COPY as there are many
variants of COPY.

--
With Regards,
Amit Kapila.

#12Josef Šimánek
josef.simanek@gmail.com
In reply to: Amit Kapila (#11)
Re: [PATCH] Simple progress reporting for COPY command

čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:

On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

How about including a column for command similar to
pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
that command will be useful in the context of COPY as there are many
variants of COPY.

From pg_stat_progress_create_index docs:

The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
REINDEX, or REINDEX CONCURRENTLY.

Which values should be present in copy progress? Just COPY FROM or COPY TO?

Show quoted text

With Regards,
Amit Kapila.

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josef Šimánek (#10)
Re: [PATCH] Simple progress reporting for COPY command

On 1/7/21 12:06 PM, Josef Šimánek wrote:

st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

One more question, though - I now realize the lines_processed ignores
rows skipped because of BEFORE INSERT triggers. I wonder if that's the
right thing to do? Imagine you know the number of lines in a file. You
can't really use (lines_processed / total_lines) to measure progress,
because that may ignore many "skipped" rows. So maybe this should be
changed to count all rows. OTOH we still have bytes_processed.

I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.

So we may either rename the column to "lines_inserted", or tweak the
code to count all processed lines. Or track both and have two columns.

regarss

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Josef Šimánek
josef.simanek@gmail.com
In reply to: Tomas Vondra (#13)
Re: [PATCH] Simple progress reporting for COPY command

čt 7. 1. 2021 v 16:54 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

On 1/7/21 12:06 PM, Josef Šimánek wrote:

st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

One more question, though - I now realize the lines_processed ignores
rows skipped because of BEFORE INSERT triggers. I wonder if that's the
right thing to do? Imagine you know the number of lines in a file. You
can't really use (lines_processed / total_lines) to measure progress,
because that may ignore many "skipped" rows. So maybe this should be
changed to count all rows. OTOH we still have bytes_processed.

I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.

So we may either rename the column to "lines_inserted", or tweak the
code to count all processed lines. Or track both and have two columns.

First I'll ensure lines_processed represents the actual amount of
processed lines. If reading from file and some lines are skipped due
to before insert trigger, I consider that one processed as well, even
if it is not inserted. If welcomed, I can add lines_inserted later as
well. But I'm not sure about the use case.

Also thanks to mentioning triggers, I think those could be used to
test the COPY progress (at least some variants). I'll check if I would
be able to add some test cases as well.

Show quoted text

regarss

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#9)
Re: [PATCH] Simple progress reporting for COPY command

On 1/7/21 2:17 AM, Justin Pryzby wrote:

On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote:

On 1/5/21 11:02 AM, Josef �im�nek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to work
correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section was
not indexed properly, and so on.

Some more doc fixes.

Thanks, pushed.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#8)
Re: [PATCH] Simple progress reporting for COPY command

On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If so desired, I'll split this off into a different thread & CF entry.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.

With regards,

Matthias van de Meent

Attachments:

v1-0001-Add-progress-reporting-for-filtered-rows.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-progress-reporting-for-filtered-rows.patchDownload+18-2
v1-0002-Rename-lines-to-tuples-in-COPY-progress-reporting.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Rename-lines-to-tuples-in-COPY-progress-reporting.patchDownload+12-13
#17Josef Šimánek
josef.simanek@gmail.com
In reply to: Matthias van de Meent (#16)
Re: [PATCH] Simple progress reporting for COPY command

čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
<boekewurm+postgres@gmail.com> napsal:

On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.

Show quoted text

If so desired, I'll split this off into a different thread & CF entry.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.

With regards,

Matthias van de Meent

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josef Šimánek (#17)
Re: [PATCH] Simple progress reporting for COPY command

On 1/7/21 7:56 PM, Josef Šimánek wrote:

čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
<boekewurm+postgres@gmail.com> napsal:

On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.

That's my understanding too.

If so desired, I'll split this off into a different thread & CF entry.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.

I'm not particularly attached to the "lines" naming, it just seemed OK
to me. So if there's consensus to rename this somehow, I'm OK with it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Josef Šimánek
josef.simanek@gmail.com
In reply to: Tomas Vondra (#18)
Re: [PATCH] Simple progress reporting for COPY command

čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:

On 1/7/21 7:56 PM, Josef Šimánek wrote:

čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
<boekewurm+postgres@gmail.com> napsal:

On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.

That's my understanding too.

If so desired, I'll split this off into a different thread & CF entry.

I see Matthias proposed to change "lines" to "tuples" - I only saw the
message after pushing, but I probably wouldn't make that change anyway.
The CSV docs seem to talk about lines, newlines etc. so it seems fine.
If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.

I'm not particularly attached to the "lines" naming, it just seemed OK
to me. So if there's consensus to rename this somehow, I'm OK with it.

The problem I do see here is it depends on the "way" of COPY. If
you're copying from CSV file to table, those are actually lines (since
1 line = 1 tuple). But copying from DB to file is copying tuples (but
1 tuple = 1 file line). Line works better here for me personally.

Once I'll fix the problem with triggers (and also another cases if
found), I think we can consider it lines. It will represent amount of
lines processed from file on COPY FROM and amount of lines written to
file in COPY TO form (at least in CSV format). I'm not sure how BINARY
format works, I'll check.

Show quoted text

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Josef Šimánek (#12)
Re: [PATCH] Simple progress reporting for COPY command

On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek <josef.simanek@gmail.com> wrote:

čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:

On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/5/21 11:02 AM, Josef Šimánek wrote:

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.

Yeah, the whole patch needs to be attached for the commitfest tester to
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks,
mainly to the docs - one place used pg_stat_copy_progress, the section
was not indexed properly, and so on.

How about including a column for command similar to
pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
that command will be useful in the context of COPY as there are many
variants of COPY.

From pg_stat_progress_create_index docs:

The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
REINDEX, or REINDEX CONCURRENTLY.

Which values should be present in copy progress? Just COPY FROM or COPY TO?

Can't we display the entire COPY command? I checked that
pg_stat_statements display the query so there shouldn't be a problem
to display the entire command.

--
With Regards,
Amit Kapila.

#21Josef Šimánek
josef.simanek@gmail.com
In reply to: Amit Kapila (#20)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Josef Šimánek (#21)
#23Josef Šimánek
josef.simanek@gmail.com
In reply to: Amit Kapila (#22)
#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Josef Šimánek (#23)
#25Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Josef Šimánek (#19)
#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Matthias van de Meent (#25)