[PATCH] Initial progress reporting for COPY command

Started by Josef Šimánekalmost 6 years ago34 messageshackers

josef.simanek@gmail.com

almost 6 years ago

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL
maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program |
lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 0 | TO | t | f |
3529943 | 24906226
(1 row)

"COPY test FROM '/tmp/ids';

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since oid
should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known (for
example when copying to file, it is 0)
direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both
directions (FROM/TO)
file_bytes_processed - amount of bytes processed when file is used
(otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I havefew initial notes and questions.

I'm using ftell to get current position in file to populate
file_bytes_processed without error handling (ftell can return -1L and also
populate errno on problems).

1. Is that a good way to get progress of file processing?
2. Is it safe in given context to not care about errors? If not, what to do
on error?

Some columns are not populated on certain COPY commands. For example when a
file is not used, file_bytes_processed is set to 0. Would it be better to
use NULL instead when the column is not related to the current command?
Same problem is for relid column.

I have not found any tests for progress reporting. Are there any? It would
need two backends running (one running COPY, one checking output of report
view). Is there any similar test I can inspire at? In theory, it should be
possible to use dblink_send_query to run async COPY command in the
background.

My initial (attached) patch also doesn't introduce documentation for this
system view. I can add that later once this patch is finalized (if that
happens).

Michael Paquier

michael@paquier.xyz

almost 6 years ago

In reply to: Josef Šimánek (#1)

Re: [PATCH] Initial progress reporting for COPY command

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL
maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

I have not found any tests for progress reporting. Are there any? It would
need two backends running (one running COPY, one checking output of report
view). Is there any similar test I can inspire at? In theory, it should be
possible to use dblink_send_query to run async COPY command in the
background.

We don't have any tests in core. I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at). So I think that it is
fine to not focus on that for this feature. The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

My initial (attached) patch also doesn't introduce documentation for this
system view. I can add that later once this patch is finalized (if that
happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.
--
Michael

Fujii Masao

masao.fujii@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#1)

Re: [PATCH] Initial progress reporting for COPY command

On 2020/06/14 21:32, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

file - bool - is file is used?
program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY command?

file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY command?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#1)

Re: [PATCH] Initial progress reporting for COPY command

I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can return -1L and also populate errno on problems).

1. Is that a good way to get progress of file processing?

IMO, it's better to handle the error cases. One possible case where
ftell can return -1 and set errno is when the total bytes processed is
more than LONG_MAX.

Will your patch handle file_bytes_processed reporting for COPY FROM
STDIN cases? For this case, ftell can't be used.

Instead of using ftell and worrying about the errors, a simple
approach could be to have a uint64 variable in CopyStateData to track
the number of bytes read whenever CopyGetData is called. This approach
can also handle the case of COPY FROM STDIN.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Michael Paquier (#2)

Re: [PATCH] Initial progress reporting for COPY command

po 15. 6. 2020 v 2:18 odesílatel Michael Paquier <michael@paquier.xyz>
napsal:

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech

PostgreSQL

maillist (

https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer
),

I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

Great. I will continue working on this.

I have not found any tests for progress reporting. Are there any? It

would

need two backends running (one running COPY, one checking output of

report

view). Is there any similar test I can inspire at? In theory, it should

be

possible to use dblink_send_query to run async COPY command in the
background.

We don't have any tests in core. I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at). So I think that it is
fine to not focus on that for this feature. The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

Thanks for the info. I'm focusing exactly at looking for right spots to
report the progress. I'll attach new patch with better places and
supporting more options of reporting (including STDIN, STDOUT) soon and
also I'll try to add it to commitfest.

Show quoted text

My initial (attached) patch also doesn't introduce documentation for this
system view. I can add that later once this patch is finalized (if that
happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.
--
Michael

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Fujii Masao (#3)

Re: [PATCH] Initial progress reporting for COPY command

po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com>
napsal:

On 2020/06/14 21:32, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech

PostgreSQL maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

file - bool - is file is used?
program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY command?

For STDOUT and STDIN file is true and program is false.

file_bytes_processed - amount of bytes processed when file is used

(otherwise 0), works for both direction (

FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY command?

For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach
new patch soon supporting those as well.

Show quoted text

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Michael Paquier (#2)

Re: [PATCH] Initial progress reporting for COPY command

po 15. 6. 2020 v 2:18 odesílatel Michael Paquier <michael@paquier.xyz>
napsal:

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech

PostgreSQL

maillist (

https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer
),

I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

I have not found any tests for progress reporting. Are there any? It

would

need two backends running (one running COPY, one checking output of

report

view). Is there any similar test I can inspire at? In theory, it should

be

possible to use dblink_send_query to run async COPY command in the
background.

We don't have any tests in core. I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at). So I think that it is
fine to not focus on that for this feature. The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

My initial (attached) patch also doesn't introduce documentation for this
system view. I can add that later once this patch is finalized (if that
happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.

I have added documentation, more code comments and I'll upload patch to
commit fest.

Show quoted text

--
Michael

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Bharath Rupireddy (#4)

Re: [PATCH] Initial progress reporting for COPY command

po 15. 6. 2020 v 7:34 odesílatel Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> napsal:

I'm using ftell to get current position in file to populate

file_bytes_processed without error handling (ftell can return -1L and also
populate errno on problems).

1. Is that a good way to get progress of file processing?

IMO, it's better to handle the error cases. One possible case where
ftell can return -1 and set errno is when the total bytes processed is
more than LONG_MAX.

Will your patch handle file_bytes_processed reporting for COPY FROM
STDIN cases? For this case, ftell can't be used.

Instead of using ftell and worrying about the errors, a simple
approach could be to have a uint64 variable in CopyStateData to track
the number of bytes read whenever CopyGetData is called. This approach
can also handle the case of COPY FROM STDIN.

Thanks for suggestion. I used this approach and latest patch supports both
STDIN and STDOUT now.

Show quoted text

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#1)

Re: [PATCH] Initial progress reporting for COPY command

Thanks for all comments. I have updated code to support more options
(including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation
file.

I'll do my best to get this to commitfest now.

ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com>
napsal:

Show quoted text

Hello, as proposed by Pavel Stěhule and discussed on local czech
PostgreSQL maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program |
lines_processed | file_bytes_processed

---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 0 | TO | t | f |
3529943 | 24906226
(1 row)

"COPY test FROM '/tmp/ids';

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program |
lines_processed | file_bytes_processed

---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 16385 | FROM | t | f |
121591999 | 957218816
(1 row)

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since oid
should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known (for
example when copying to file, it is 0)
direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both
directions (FROM/TO)
file_bytes_processed - amount of bytes processed when file is used
(otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I havefew initial notes and questions.

I'm using ftell to get current position in file to populate
file_bytes_processed without error handling (ftell can return -1L and also
populate errno on problems).

1. Is that a good way to get progress of file processing?
2. Is it safe in given context to not care about errors? If not, what to
do on error?

Some columns are not populated on certain COPY commands. For example when
a file is not used, file_bytes_processed is set to 0. Would it be better to
use NULL instead when the column is not related to the current command?
Same problem is for relid column.

I have not found any tests for progress reporting. Are there any? It would
need two backends running (one running COPY, one checking output of report
view). Is there any similar test I can inspire at? In theory, it should be
possible to use dblink_send_query to run async COPY command in the
background.

My initial (attached) patch also doesn't introduce documentation for this
system view. I can add that later once this patch is finalized (if that
happens).

#10

Fujii Masao

masao.fujii@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#6)

Re: [PATCH] Initial progress reporting for COPY command

On 2020/06/21 20:33, Josef Šimánek wrote:

po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> napsal:

On 2020/06/14 21:32, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

file - bool - is file is used?
program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY command?

For STDOUT and STDIN file is true and program is false.

Could you tell me why these columns are necessary in *progress* view?
If we want to see what copy command is actually running, we can see
pg_stat_activity, instead. For example,

SELECT pc.*, a.query FROM pg_stat_progress_copy pc, pg_stat_activity a WHERE pc.pid = a.pid;

file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY command?

For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well.

Thanks for the patch!

With the patch, pg_stat_progress_copy seems to report the progress of
the processing on file_fdw. Is this intentional?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#11

vignesh C

vignesh21@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#9)

Re: [PATCH] Initial progress reporting for COPY command

On Sun, Jun 21, 2020 at 5:11 PM Josef Šimánek <josef.simanek@gmail.com> wrote:

Thanks for all comments. I have updated code to support more options (including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation file.

I'll do my best to get this to commitfest now.

ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 0 | TO | t | f | 3529943 | 24906226
(1 row)

"COPY test FROM '/tmp/ids';

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 16385 | FROM | t | f | 121591999 | 957218816
(1 row)

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0)
direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both directions (FROM/TO)
file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.

Few comments:
@@ -713,6 +714,8 @@ CopyGetData(CopyState cstate, void *databuf, int
minread, int maxread)
break;
}

+ CopyUpdateBytesProgress(cstate, bytesread);
+
  return bytesread;
 }

This is actually the read data, actual processing will happen later
like in CopyReadLineText, it would be better if
CopyUpdateBytesProgress is done later, if not it will give the same
value even though it does multiple inserts on the table.
lines_processed will keep getting updated but file_bytes_processed
will not be updated.

 +pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed

You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#12

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Fujii Masao (#10)

Re: [PATCH] Initial progress reporting for COPY command

po 22. 6. 2020 v 4:48 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com>
napsal:

On 2020/06/21 20:33, Josef Šimánek wrote:

po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <

masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> napsal:

On 2020/06/14 21:32, Josef Šimánek wrote:

Hello, as proposed by Pavel Stěhule and discussed on local czech

PostgreSQL maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

file - bool - is file is used?
program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY

command?

For STDOUT and STDIN file is true and program is false.

Could you tell me why these columns are necessary in *progress* view?
If we want to see what copy command is actually running, we can see
pg_stat_activity, instead. For example,

SELECT pc.*, a.query FROM pg_stat_progress_copy pc, pg_stat_activity
a WHERE pc.pid = a.pid;

If that doesn't make any sense, I can remove those. I have not strong
opinion about those values. Those were just around when I was looking for
possible values to include in the progress report.

file_bytes_processed - amount of bytes processed when file is

used (otherwise 0), works for both direction (

FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY

command?

For my first patch nothing was reported on STDOUT/STDIN usage. I'll

attach new patch soon supporting those as well.

Thanks for the patch!

With the patch, pg_stat_progress_copy seems to report the progress of
the processing on file_fdw. Is this intentional?

Every action using internally COPY will be included in the progress report
view.
I have spotted for example pg_dump does that and is reported there as well.
I do not see any problem regarding this. For pg_dump it is consistent with
"pg_stat_activity" reporting COPY command in the query field.

Show quoted text

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#13

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: vignesh C (#11)

Re: [PATCH] Initial progress reporting for COPY command

po 22. 6. 2020 v 9:15 odesílatel vignesh C <vignesh21@gmail.com> napsal:

On Sun, Jun 21, 2020 at 5:11 PM Josef Šimánek <josef.simanek@gmail.com>
wrote:

Thanks for all comments. I have updated code to support more options

(including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at

https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html

documentation file.

I'll do my best to get this to commitfest now.

ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com>

napsal:

Hello, as proposed by Pavel Stěhule and discussed on local czech

PostgreSQL maillist (
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program |

lines_processed | file_bytes_processed

---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------

3347126 | 16384 | yr | 0 | TO | t | f |

3529943 | 24906226

(1 row)

"COPY test FROM '/tmp/ids';

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program |

lines_processed | file_bytes_processed

---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------

3347126 | 16384 | yr | 16385 | FROM | t | f |

121591999 | 957218816

(1 row)

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since

oid should be enough, but it is the same in CREATE INDEX)

relid - oid - oid of table related to COPY command, when not known (for

example when copying to file, it is 0)

direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both

directions (FROM/TO)

file_bytes_processed - amount of bytes processed when file is used

(otherwise 0), works for both direction (

FROM/TO) when file is used (file = t)

Patch is attached and can be found also at

https://github.com/simi/postgres/pull/5.

Few comments:
@@ -713,6 +714,8 @@ CopyGetData(CopyState cstate, void *databuf, int
minread, int maxread)
break;
}
+ CopyUpdateBytesProgress(cstate, bytesread);
+
return bytesread;
}
This is actually the read data, actual processing will happen later
like in CopyReadLineText, it would be better if
CopyUpdateBytesProgress is done later, if not it will give the same
value even though it does multiple inserts on the table.
lines_processed will keep getting updated but file_bytes_processed
will not be updated.

First I would like to explain what's reported (or at least I'm trying to
get reported) at bytes_processed column.

When exporting to file it should start at 0 and end up at the actual final
file size.
When importing from file, it should do the same. You can check file size
before you start COPY FROM and get actual progress looking at
bytes_processed.

This column is just a counter of bytes read from input on COPY FROM or
amount of bytes going through COPY TO.

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO
STDOUT;' > /tmp/ryba.txt
echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1 line
is 2 bytes here - "1" and newline character).
I'll try to check more complex COPY commands to ensure everything is in
sync.

If you have any ideas for testing queries, feel free to suggest.

+pg_stat_progress_copy| SELECT s.pid,

+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed
You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.

I was looking at the rest of reporting views and for me those seem to be
just basic ones providing just raw data to be used later in custom nice
friendly human-readable views built on the client side.
For example "pg_stat_progress_basebackup" also reports "backup_streamed" in
raw form.

Anyway if you would like to make this view more user-friendly, I can add
that. Just ping me.

Regards,

Show quoted text

Vignesh
EnterpriseDB: http://www.enterprisedb.com

#14

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Josef Šimánek (#9)

Re: [PATCH] Initial progress reporting for COPY command

On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:

Thanks for all comments. I have updated code to support more options
(including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation
file.

I'll do my best to get this to commitfest now.

I see we're not showing the total number of bytes the COPY is expected
to process, which makes it hard to estimate how far we actually are.
Clearly there are cases when we really don't know that (exports, import
from stdin/program), but why not to show file size for imports from a
file? I'd expect that to be the most common case.

I wonder if it made sense to show some estimates in the other cases. For
example when exporting query result, maybe we could show the estimated
number of rows and size? Of course, that's prone to estimation errors
and it's more a wild idea for the future, I don't expect this patch to
implement that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#14)

Re: [PATCH] Initial progress reporting for COPY command

po 22. 6. 2020 v 14:14 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:

Thanks for all comments. I have updated code to support more options
(including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation
file.

I'll do my best to get this to commitfest now.

I see we're not showing the total number of bytes the COPY is expected
to process, which makes it hard to estimate how far we actually are.
Clearly there are cases when we really don't know that (exports, import
from stdin/program), but why not to show file size for imports from a
file? I'd expect that to be the most common case.

For COPY FROM file fstat is done and info is available already at
https://github.com/postgres/postgres/blob/fe186b4c200b76a5c0f03379fe8645ed1c70a844/src/backend/commands/copy.c#L1934.
It should be easy to update some param (param6 for example) with file size
and expose it in report view. When not available, this column can be NULL.

Would that be enough?

On the other side everyone can check file size manually to get total value
expected and just compare to reported bytes_processed. Alt. "wc -l" can be
checked to get amount of lines and check lines_processed column to get
progress. Should it check amount of lines and populate another column with
lines total (using a configured separator) as well? AFAIK that would need
full file scan which can be slow for huge files.

I wonder if it made sense to show some estimates in the other cases. For
example when exporting query result, maybe we could show the estimated
number of rows and size? Of course, that's prone to estimation errors
and it's more a wild idea for the future, I don't expect this patch to
implement that.

My plan here was to expose numbers not being currently available and let
clients get the rest of info on their own.

For example:
- for "COPY (query) TO file" - EXPLAIN or COUNT variant of query could be
executed before to get the amount of expected rows
- for "COPY table FROM file" - file size or amount of lines in file can be
inspected first to get amount of expected rows or bytes to be processed

I see the current system view in my patch (and also all other report views
currently available) more as a scaffold to build own tools.

For example CLI tools can use this to provide some kind of progress.

Show quoted text

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Josef Šimánek (#15)

Re: [PATCH] Initial progress reporting for COPY command

On Mon, Jun 22, 2020 at 03:33:00PM +0200, Josef Šimánek wrote:

po 22. 6. 2020 v 14:14 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:

Thanks for all comments. I have updated code to support more options
(including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at
https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation
file.

I'll do my best to get this to commitfest now.

I see we're not showing the total number of bytes the COPY is expected
to process, which makes it hard to estimate how far we actually are.
Clearly there are cases when we really don't know that (exports, import
from stdin/program), but why not to show file size for imports from a
file? I'd expect that to be the most common case.

For COPY FROM file fstat is done and info is available already at
https://github.com/postgres/postgres/blob/fe186b4c200b76a5c0f03379fe8645ed1c70a844/src/backend/commands/copy.c#L1934.
It should be easy to update some param (param6 for example) with file size
and expose it in report view. When not available, this column can be NULL.

Would that be enough?

Yes, I think that'd be fine. The rows without a file should have NULL,
because we literally don't know what the value is. And 0 is a valid file
size, so we can't use it anyway.

On the other side everyone can check file size manually to get total value
expected and just compare to reported bytes_processed. Alt. "wc -l" can be
checked to get amount of lines and check lines_processed column to get
progress. Should it check amount of lines and populate another column with
lines total (using a configured separator) as well? AFAIK that would need
full file scan which can be slow for huge files.

Sure, but the extra `wc -l` is less convenient and you then need to
combine that with pg_stat_progress_copy. With the information right in
the view, you can do (100.0 * bytes_processed / bytes_total) and you get
the progress as a percentage. (I've omitted the NULL handling.)

As for the number of lines, I certainly don't think we need to scan the
file - that'd be far too expensive. What we might do is estimate it as

total_bytes / (processed_bytes / processed_rows)

but that's something people can easily do on their own. So I don't think
it needs to be part of the patch, and IMHO bytes_processed / bytes_total
is a sufficient measure of progress.

I wonder if it made sense to show some estimates in the other cases. For
example when exporting query result, maybe we could show the estimated
number of rows and size? Of course, that's prone to estimation errors
and it's more a wild idea for the future, I don't expect this patch to
implement that.

My plan here was to expose numbers not being currently available and let
clients get the rest of info on their own.

For example:
- for "COPY (query) TO file" - EXPLAIN or COUNT variant of query could be
executed before to get the amount of expected rows
- for "COPY table FROM file" - file size or amount of lines in file can be
inspected first to get amount of expected rows or bytes to be processed

I see the current system view in my patch (and also all other report views
currently available) more as a scaffold to build own tools.

For example CLI tools can use this to provide some kind of progress.

True, but I'd advise against putting this into v1 of the patch. Let's
keep it simple, get it committed and then maybe improve it later.

Some of these stats (like the estimates from a query) may be quite
unreliable, so I think it needs more discussion. We might invent
lines_estimated or something like that, for example.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17

vignesh C

vignesh21@gmail.com

almost 6 years ago

In reply to: Josef Šimánek (#13)

Re: [PATCH] Initial progress reporting for COPY command

On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline character).
I'll try to check more complex COPY commands to ensure everything is in sync.

If you have any ideas for testing queries, feel free to suggest.

For copy from statement you could attach the session, put a breakpoint
at CopyReadLineText, execution will hit this breakpoint for every
record it is doing COPY FROM and parallely check if
pg_stat_progress_copy is getting updated correctly. I noticed it was
showing the file read size instead of the actual processed bytes.

+pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed
You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.
I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to be used later in custom nice friendly human-readable views built on the client side.
For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.

Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.

I felt we could add pg_size_pretty to make the view more user friendly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#18

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: vignesh C (#17)

Re: [PATCH] Initial progress reporting for COPY command

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:

On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline character).
I'll try to check more complex COPY commands to ensure everything is in sync.

If you have any ideas for testing queries, feel free to suggest.

For copy from statement you could attach the session, put a breakpoint
at CopyReadLineText, execution will hit this breakpoint for every
record it is doing COPY FROM and parallely check if
pg_stat_progress_copy is getting updated correctly. I noticed it was
showing the file read size instead of the actual processed bytes.
+pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed
You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.
I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to be used later in custom nice friendly human-readable views built on the client side.
For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.

Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.
I felt we could add pg_size_pretty to make the view more user friendly.

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19

Pavel Stehule

pavel.stehule@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#18)

Re: [PATCH] Initial progress reporting for COPY command

út 23. 6. 2020 v 13:15 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:

On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com>

wrote:

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000))

TO STDOUT;' > /tmp/ryba.txt

echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1

line is 2 bytes here - "1" and newline character).

I'll try to check more complex COPY commands to ensure everything is in

sync.
If you have any ideas for testing queries, feel free to suggest.

For copy from statement you could attach the session, put a breakpoint
at CopyReadLineText, execution will hit this breakpoint for every
record it is doing COPY FROM and parallely check if
pg_stat_progress_copy is getting updated correctly. I noticed it was
showing the file read size instead of the actual processed bytes.
+pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed
You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.
I was looking at the rest of reporting views and for me those seem to
be just basic ones providing just raw data to be used later in custom nice
friendly human-readable views built on the client side.

For example "pg_stat_progress_basebackup" also reports

"backup_streamed" in raw form.

Anyway if you would like to make this view more user-friendly, I can

add that. Just ping me.

I felt we could add pg_size_pretty to make the view more user friendly.

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.

+1, *_pretty functions should be used on the client side only. Server side
(source) should be in raw format.

Regards

Pavel

Show quoted text

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20

Josef Šimánek

josef.simanek@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#18)

Re: [PATCH] Initial progress reporting for COPY command

út 23. 6. 2020 v 13:15 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:

On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com>

wrote:

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000))

TO STDOUT;' > /tmp/ryba.txt

echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1

line is 2 bytes here - "1" and newline character).

I'll try to check more complex COPY commands to ensure everything is in

sync.
If you have any ideas for testing queries, feel free to suggest.

For copy from statement you could attach the session, put a breakpoint
at CopyReadLineText, execution will hit this breakpoint for every
record it is doing COPY FROM and parallely check if
pg_stat_progress_copy is getting updated correctly. I noticed it was
showing the file read size instead of the actual processed bytes.
+pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed
You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.
I was looking at the rest of reporting views and for me those seem to
be just basic ones providing just raw data to be used later in custom nice
friendly human-readable views built on the client side.

For example "pg_stat_progress_basebackup" also reports

"backup_streamed" in raw form.

Anyway if you would like to make this view more user-friendly, I can

add that. Just ping me.

I felt we could add pg_size_pretty to make the view more user friendly.

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.