GSOC'17 project introduction: Parallel COPY execution with errors handling

Started by Alexey Kondratov · almost 9 years ago · 34 messages
#1Alexey Kondratov
kondratov.aleksey@gmail.com

Hi pgsql-hackers,

I'm planning to apply to GSOC'17 and my proposal consists currently of two parts:

(1) Add errors handling to COPY as a minimum program

Motivation: Using PG on a daily basis for years, I have found cases where you need to load (e.g. for further analytics) a bunch of not entirely consistent records with occasional type/column mismatches. Since PG throws an exception on the first error, currently the only solution is to preformat your data with some other tool and then load it into PG. However, it is frequently easier to drop the few bad records than to do such preprocessing for every data source you have.

I have done some research and found the item in PG's TODO https://wiki.postgresql.org/wiki/Todo#COPY, as well as a previous attempt to push a similar patch: /messages/by-id/603c8f070909141218i291bc983t501507ebc996a531@mail.gmail.com. There were no negative responses to that patch; it seems it was simply forgotten and never finalized.

As an example of the general idea, consider the read_csv method of the Python package pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). It uses a C parser which throws an error on the first column mismatch. However, it has two flags, error_bad_lines and warn_bad_lines, which, when set to False, make it drop bad lines or even suppress the warning messages about them.

(2) Parallel COPY execution as a maximum program

I guess there is not much to say about motivation: it simply should be faster on multicore CPUs.

There is also an entry about parallel COPY in PG's wiki https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better to have well-performing core functionality out of the box.

My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea, namely to simplify working with large amounts of data from different sources, but maybe it would be better to focus on a single task?
3) Is it realistic to mostly finish both parts during 3+ months of almost full-time work, or am I being too presumptuous?

I would very much appreciate any comments and criticism.

P.S. I know about the very interesting ready-made projects from the PG community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to solve your own problems, issues and questions, which are the product of your experience with software. That's why I dare to propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from Moscow, Russia, and have been heavily involved in software development since 2010. I have good skills in Python, Ruby, JavaScript, MATLAB, C, and Fortran development and a basic understanding of algorithm design and analysis.

Best regards,

Alexey

#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Alexey Kondratov (#1)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.aleksey@gmail.com>:

My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a
project and could be a mentor?
2) These two points share a general idea, namely to simplify working with large
amounts of data from different sources, but maybe it would be better to
focus on a single task?

I spent a lot of time on an implementation of (1) - maybe I can find the patch somewhere.
Both tasks have something in common - you have to divide the import into batches.

3) Is it realistic to mostly finish both parts during 3+ months of
almost full-time work, or am I being too presumptuous?

I think it is possible - I am not sure about all the details,
but a basic implementation can be done in 3 months.


#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#2)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

1) Is there anyone in the PG community who would be interested in such a
project and could be a mentor?
2) These two points share a general idea, namely to simplify working with large
amounts of data from different sources, but maybe it would be better to
focus on a single task?

I spent a lot of time on an implementation of (1) - maybe I can find the
patch somewhere. Both tasks have something in common - you have to divide
the import into batches.

The patch is in /dev/null :( - My implementation was based on subtransactions
of 1000 rows each. When some check failed, I threw the subtransaction away and
imported every row from that block in its own subtransaction. It was only a
prototype - I didn't look for a smarter implementation.
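For readers who want to picture that control flow, here is a minimal sketch of the batch-then-row-fallback idea Pavel describes; it is not his lost patch, and copy_rows_in_subxact() / report_bad_row() are hypothetical helpers (the subtransaction wrapper itself is sketched later in the thread, after message #16). "Row" is just a stand-in for whatever per-line representation the real COPY code would use.

/*
 * Sketch only: insert rows in fixed-size blocks inside subtransactions and
 * fall back to row-at-a-time replay when a block fails, so that only the
 * offending rows are skipped.
 */
#define COPY_BATCH_SIZE 1000

static void
copy_with_row_fallback(Row *rows, int nrows)
{
	int			start;
	int			i;

	for (start = 0; start < nrows; start += COPY_BATCH_SIZE)
	{
		int			end = start + COPY_BATCH_SIZE;

		if (end > nrows)
			end = nrows;

		if (copy_rows_in_subxact(rows, start, end))
			continue;			/* the whole block went in fine */

		/* Block failed: replay it one row at a time, skipping bad rows. */
		for (i = start; i < end; i++)
		{
			if (!copy_rows_in_subxact(rows, i, i + 1))
				report_bad_row(rows, i);	/* hypothetical logging hook */
		}
	}
}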

3) Is it realistic to mostly finish both parts during 3+ months of
almost full-time work, or am I being too presumptuous?

I think it is possible - I am not sure about all the details,
but a basic implementation can be done in 3 months.

Some data and some checks depend on order - that can be a problem in parallel
processing - you should define the corner cases.


#4Craig Ringer
craig@2ndquadrant.com
In reply to: Alexey Kondratov (#1)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

(1) Add errors handling to COPY as a minimum program

Huge +1 if you can do it in an efficient way.

I think the main barrier to doing so is that the naïve approach
creates a subtransaction for every row, which is pretty dire in
performance terms and burns transaction IDs very rapidly.

Most of our datatype I/O functions, etc, have no facility for being
invoked in a mode where they fail nicely and clean up after
themselves. We rely on unwinding the subtransaction's memory context
for error handling, for releasing any LWLocks that were taken, etc.
There's no try_timestamptz_in function or anything, just
timestamptz_in, and it ERROR's if it doesn't like its input. You
cannot safely PG_TRY / PG_CATCH such an exception and continue
processing to, say, write another row.

Currently we also don't have a way to differentiate between

* "this row is structurally invalid" (wrong number of columns, etc)
* "this row is structually valid but has fields we could not parse
into their data types"
* "this row looks structurally valid and has data types we could
parse, but does not satisfy a constraint on the destination table"

Nor do we have a way to write to any kind of failure-log table in the
database, since a simple approach relies on aborting subtransactions
to clean up failed inserts so it can't write anything for failed rows.
Not without starting a 2nd subxact to record the failure, anyway.

So, having said why it's hard, I don't really have much for you in
terms of suggestions for ways forward. User-defined data types,
user-defined constraints and triggers, etc mean anything involving
significant interface changes will be a hard sell, especially in
something pretty performance-sensitive like COPY.

I guess it'd be worth setting out your goals first. Do you want to
handle all the kinds of problems above? Malformed rows, rows with
malformed field values, and rows that fail to satisfy a constraint? or
just some subset?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5Stas Kelvich
stas.kelvich@gmail.com
In reply to: Craig Ringer (#4)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 23 Mar 2017, at 15:53, Craig Ringer <craig@2ndquadrant.com> wrote:

On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

(1) Add errors handling to COPY as a minimum program

Huge +1 if you can do it in an efficient way.

I think the main barrier to doing so is that the naïve approach
creates a subtransaction for every row, which is pretty dire in
performance terms and burns transaction IDs very rapidly.

Most of our datatype I/O functions, etc, have no facility for being
invoked in a mode where they fail nicely and clean up after
themselves. We rely on unwinding the subtransaction's memory context
for error handling, for releasing any LWLocks that were taken, etc.
There's no try_timestamptz_in function or anything, just
timestamptz_in, and it ERROR's if it doesn't like its input. You
cannot safely PG_TRY / PG_CATCH such an exception and continue
processing to, say, write another row.

Currently we also don't have a way to differentiate between

* "this row is structurally invalid" (wrong number of columns, etc)
* "this row is structually valid but has fields we could not parse
into their data types"
* "this row looks structurally valid and has data types we could
parse, but does not satisfy a constraint on the destination table"

Nor do we have a way to write to any kind of failure-log table in the
database, since a simple approach relies on aborting subtransactions
to clean up failed inserts so it can't write anything for failed rows.
Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows,
then a 2nd subtransaction seems like not a bad idea. We can try to
apply a batch in a subtx; if it fails on some row N, then insert rows [1, N)
in the next subtx, report an error, and commit that subtx. Row N+1 can be treated
as the beginning of the next batch.

But if there are problems with handling everything with
subtransactions, and since parallelism is mentioned anyway, what about
starting a bgworker that will perform the data insertion and will be controlled
by the backend?

For example, the backend can do the following (a registration sketch follows
the list):

* Start a bgworker (or even a parallel worker).
* Get a chunk of rows out of the file and try to apply them in a batch
as a subtransaction in the bgworker.
* If that fails, we can open a transaction in the backend itself and
raise a notice / move the failed rows to a special errors table.
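To make the first bullet concrete, here is a hedged sketch of how a dynamic background worker could be registered from the backend, using the PostgreSQL 9.6/10-era bgworker API. copy_chunk_worker_main and the chunk hand-off are hypothetical, and reporting errors back to the backend (e.g. via a shm_mq) is omitted entirely.

/* Hypothetical sketch: launch one worker per input chunk. */
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static BackgroundWorkerHandle *
launch_copy_chunk_worker(Datum chunk_info)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;
	pid_t		pid;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "copy_chunk_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel COPY worker");
	worker.bgw_main_arg = chunk_info;
	worker.bgw_notify_pid = MyProcPid;	/* lets us wait for startup */

	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
		ereport(ERROR,
				(errmsg("could not register background worker for COPY")));
	(void) WaitForBackgroundWorkerStartup(handle, &pid);

	return handle;
}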


#6Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Stas Kelvich (#5)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Pavel, Craig and Stas,

Thank you for your responses and valuable comments!

I have written a draft proposal: https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit

It seems that COPY is currently able to report the first error line and the error type (extra or missing columns, type parse error, etc.).
Thus, an approach similar to the one Stas described should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

Alexey


#7Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexey Kondratov (#6)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

Thank you for your responses and valuable comments!

I have written a draft proposal: https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit

It seems that COPY is currently able to report the first error line and the error
type (extra or missing columns, type parse error, etc.).
Thus, an approach similar to the one Stas described should work and, being
optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

I have the following questions about your proposal.

1. Suppose we have to insert N records
2. We create subtransaction with these N records
3. Error is raised on k-th line
4. Then, we can safely insert all lines from the 1st till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till N with another subtransaction
7. Repeat until the end of file

Do you assume that we start a new subtransaction in step 4, since the
subtransaction we started in step 2 is rolled back?

I am planning to use background worker processes for parallel COPY
execution. Each process will receive an equal piece of the input file. Since
the file is split by size, not by lines, each worker will start importing from
the first new line so as not to hit a broken line.

I think that the situation where the backend directly reads a file during COPY
is not typical. The more typical case is the \copy psql command. In that case
"COPY ... FROM stdin;" is actually executed while psql streams the data.
How can we apply parallel COPY in this case?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#8Alex K
kondratov.aleksey@gmail.com
In reply to: Alexander Korotkov (#7)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi Alexander!

I missed your reply, since the proposal submission deadline passed last
Monday and I haven't been checking the hackers mailing list very frequently.

(1) It seems that starting a new subtransaction at step 4 is not necessary.
We can gather all error lines in one pass and, at the end of the input,
start a single additional subtransaction with all the safe lines at once:
[1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error
line number.

That said, the only viable use case is probably the one where the number of
errors is relatively small compared to the total number of rows, because if the
input is in a totally inconsistent format, it seems useless to import it into
the db at all. Thus, it is not 100% clear to me whether there would be any real
difference in performance between starting a new subtransaction at step 4 and not doing so.

(2) Hmm, good question. As far as I know it is impossible to get the stdin
input size, thus it is impossible to distribute stdin directly to the
parallel workers. The first approach that comes to mind is to store the
stdin input in some kind of buffer/queue and then have the workers read it in
parallel. The question is how that would perform in the case of a large file; I
guess poorly, at least from the memory consumption perspective. But whether
parallel execution would still be faster is the next question.

Alexey


#9Robert Haas
robertmhaas@gmail.com
In reply to: Alex K (#8)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On Mon, Apr 10, 2017 at 11:39 AM, Alex K <kondratov.aleksey@gmail.com> wrote:

(1) It seems that starting a new subtransaction at step 4 is not necessary. We
can gather all error lines in one pass and, at the end of the input, start
a single additional subtransaction with all the safe lines at once: [1, ...,
k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.

The only way to recover from an error is to abort the subtransaction,
or to abort the toplevel transaction. Anything else is unsafe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Robert Haas (#9)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Yes, sure, I don't doubt it. The question was around step 4 in the following possible algorithm:

1. Suppose we have to insert N records
2. Start subtransaction with these N records
3. Error is raised on k-th line
4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
7. Repeat until the end of file

One can start a subtransaction with those (k - 1) safe lines and repeat this after each error line
OR
iterate till the end of the file and start only one subtransaction with all lines except the error lines.

Alexey


#11Robert Haas
robertmhaas@gmail.com
In reply to: Alexey Kondratov (#10)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On Mon, Apr 10, 2017 at 2:46 PM, Alexey Kondratov
<kondratov.aleksey@gmail.com> wrote:

Yes, sure, I don't doubt it. The question was around step 4 in the following possible algorithm:

1. Suppose we have to insert N records
2. Start subtransaction with these N records
3. Error is raised on k-th line
4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
7. Repeat until the end of file

One can start a subtransaction with those (k - 1) safe lines and repeat this after each error line

I don't understand what you mean by that.

OR
iterate till the end of the file and start only one subtransaction with all lines except the error lines.

That could involve buffering a huge file. Imagine a 300GB load.

Also consider how many XIDs whatever design is proposed will blow
through when loading 300GB of data. There's a nasty trade-off here
between XID consumption (and the aggressive vacuums it eventually
causes) and preserving performance in the face of errors - e.g. if you
make k = 100,000 you consume 100x fewer XIDs than if you make k =
1000, but you also have 100x the work to redo (on average) every time
you hit an error. If the data quality is poor (say, 50% of lines have
errors) it's almost impossible to avoid runaway XID consumption.
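As a back-of-the-envelope model of that trade-off (mine, not from the thread): with $N$ input rows, batch size $k$, and $E$ sparsely distributed bad rows, the replay-the-good-prefix strategy discussed above needs roughly $\text{subtransactions} \approx N/k + E$ while $\text{rows redone} \approx E \cdot k/2$, so multiplying $k$ by 100 divides the XID term by 100 and multiplies the expected redo work by 100.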

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12Nicolas Barbier
nicolas.barbier@gmail.com
In reply to: Robert Haas (#11)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-11 Robert Haas <robertmhaas@gmail.com>:

There's a nasty trade-off here between XID consumption (and the
aggressive vacuums it eventually causes) and preserving performance in
the face of errors - e.g. if you make k = 100,000 you consume 100x
fewer XIDs than if you make k = 1000, but you also have 100x the work
to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction when no error has
been encountered after N lines (N starts out at 1), then double
N and continue. When encountering an error, roll the current
subtransaction back and re-insert all the known good rows that were
rolled back (plus maybe put the erroneous row into a separate table
or whatever) in one new subtransaction and commit; then reset N to 1
and continue processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows
is not extremely high: whether the erroneous rows are all clumped
together, entirely randomly spread out over the file, or a combination
of both.

If the data quality is poor (say, 50% of lines have errors) it's
almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed. So the docs might need to suggest not to insert a 300 GB
file with 50% erroneous lines :-).

Greetings,

Nicolas


#13Robert Haas
robertmhaas@gmail.com
In reply to: Nicolas Barbier (#12)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:

2017-04-11 Robert Haas <robertmhaas@gmail.com>:

There's a nasty trade-off here between XID consumption (and the
aggressive vacuums it eventually causes) and preserving performance in
the face of errors - e.g. if you make k = 100,000 you consume 100x
fewer XIDs than if you make k = 1000, but you also have 100x the work
to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction when no error has
been encountered after N lines (N starts out at 1), then double
N and continue. When encountering an error, roll the current
subtransaction back and re-insert all the known good rows that were
rolled back (plus maybe put the erroneous row into a separate table
or whatever) in one new subtransaction and commit; then reset N to 1
and continue processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows
is not extremely high: whether the erroneous rows are all clumped
together, entirely randomly spread out over the file, or a combination
of both.

Right. I wouldn't suggest the exact algorithm you proposed; I think
you ought to vary between some lower limit >1, maybe 10, and some
upper limit, maybe 1,000,000, ratcheting up and down based on how
often you hit errors in some way that might not be as simple as
doubling. But something along those lines.
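To make that concrete, here is one way such a ratcheting policy could look; the limits match the numbers mentioned above, while the growth and shrink factors are just placeholders (this is an illustration, not anything from a real patch):

#include <stdbool.h>			/* in the backend this comes via postgres.h */

#define COPY_BATCH_MIN		10			/* lower limit > 1 */
#define COPY_BATCH_MAX		1000000		/* upper limit */

/* Pick the size of the next batch based on how the previous one went. */
static int
next_copy_batch_size(int current, bool last_batch_failed)
{
	if (last_batch_failed)
	{
		/* Errors are around: shrink to limit the work redone per error. */
		current /= 10;
		if (current < COPY_BATCH_MIN)
			current = COPY_BATCH_MIN;
	}
	else
	{
		/* Clean run: grow to save subtransactions (and thus XIDs). */
		current *= 2;
		if (current > COPY_BATCH_MAX)
			current = COPY_BATCH_MAX;
	}
	return current;
}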

If the data quality is poor (say, 50% of lines have errors) it's
almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed. So the docs might need to suggest not to insert a 300 GB
file with 50% erroneous lines :-).

Yep. But it does seem reasonably likely that someone might shoot
themselves in the foot anyway. Maybe we just live with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Robert Haas (#13)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 12 Apr 2017, at 20:23, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:

2017-04-11 Robert Haas <robertmhaas@gmail.com>:

If the data quality is poor (say, 50% of lines have errors) it's
almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed. So the docs might need to suggest not to insert a 300 GB
file with 50% erroneous lines :-).

Yep. But it does seem reasonably likely that someone might shoot
themselves in the foot anyway. Maybe we just live with that.

Moreover, if that file consists of one-byte lines (plus one byte of newline char),
then during its import xid wraparound will happen 18 times =)

I think it's reasonable at least to have something like a max_errors parameter
for COPY, set by default to 1000, for example. If the user hits that
limit, then it is a good moment to warn about possible xid consumption
with a bigger limit.

However, I think it is worth a quick bit of research into whether it is possible
to create a special code path for COPY in which errors don't cancel the
transaction, at least when COPY is called outside of a transaction block.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#15Craig Ringer
craig@2ndquadrant.com
In reply to: Stas Kelvich (#14)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 13 April 2017 at 01:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

However, I think it is worth a quick bit of research into whether it is possible
to create a special code path for COPY in which errors don't cancel the transaction.

Not really. Anything at any layer of the system expects to be able to ERROR:

* datatype input functions
* CHECK constraints
* FK constraints
* unique indexes
* user defined functions run by triggers
* interrupt signalling (think deadlock detector)
* ...

and we rely on ERROR unwinding any relevant memory contexts, releasing
lwlocks, etc.

When an xact aborts it may leave all sorts of mess on disk. Nothing
gets deleted, it's just ignored due to an aborted xmin.

Maybe some xid burn could be saved by trying harder to pre-validate
batches of data as much as possible before we write anything to the
heap, sorting obviously faulty data into buffers and doing as much
work as possible before allocating a new (sub)xid and writing to the
heap. We'd still abort but we'd only be aborting a vtxid.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16Alex K
kondratov.aleksey@gmail.com
In reply to: Craig Ringer (#15)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi pgsql-hackers,

Thank you again for all these replies. I have started working on this project
and have learned a lot of new stuff over the last month, so here are some new
thoughts about error handling in COPY. I decided to stick to the same thread,
since it has a neutral subject.

(1) One of my mentors--Alvaro Herrera--suggested that I have a look at
UPSERT. It may be worthwhile to achieve the same functionality as
ON CONFLICT DO NOTHING when COPY actually inserts tuples and error
handling is turned on. It could additionally reduce the number of failed
subtransactions and reduce XID consumption, while still ignoring some
common errors like unique index violations.

Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project scope, but
maybe there is a relatively simple way to internally perform tuple inserts
with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor of UPSERT in PostgreSQL. It would
be great if he could answer this question.

(2) Otherwise, I am still going to use subtransactions via
BeginInternalSubTransaction and PG_TRY / PG_CATCH with
ReleaseCurrentSubTransaction / RollbackAndReleaseCurrentSubTransaction.
To minimize XID consumption I will try to insert tuples in batches and
pre-validate them as much as possible (as was suggested earlier in the
thread). A rough sketch of such a wrapper is shown below.
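For illustration, a minimal sketch of that per-batch wrapper, following the standard pattern used for exception blocks elsewhere in the backend (e.g. PL/pgSQL); try_insert_batch() is a hypothetical stand-in for the actual heap-insert loop:

#include "postgres.h"
#include "access/xact.h"
#include "commands/copy.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

/* Returns true if the whole batch was inserted, false if it was rolled back. */
static bool
insert_batch_in_subxact(CopyState cstate, int batch_start, int batch_end)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	ResourceOwner oldowner = CurrentResourceOwner;
	bool		ok = true;

	BeginInternalSubTransaction(NULL);
	/* Keep running in the caller's memory context, as PL/pgSQL does. */
	MemoryContextSwitchTo(oldcontext);

	PG_TRY();
	{
		try_insert_batch(cstate, batch_start, batch_end);	/* hypothetical */

		ReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* Save the error, then abort and clean up the subtransaction. */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;

		ereport(WARNING,
				(errmsg("skipping batch due to error: %s", edata->message)));
		FreeErrorData(edata);
		ok = false;
	}
	PG_END_TRY();

	return ok;
}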

Alexey

#17Peter Geoghegan
pg@bowt.ie
In reply to: Alex K (#16)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On Wed, Jun 7, 2017 at 12:34 PM, Alex K <kondratov.aleksey@gmail.com> wrote:

(1) One of my mentors--Alvaro Herrera--suggested that I have a look at
UPSERT.

It may be worthwhile to achieve the same functionality as
ON CONFLICT DO NOTHING when COPY actually inserts tuples and error
handling is turned on. It could additionally reduce the number of failed
subtransactions and reduce XID consumption, while still ignoring some
common errors like unique index violations.

Alvaro and I talked about this informally at PGCon.

Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project scope, but
maybe there is a relatively simple way to internally perform tuple inserts
with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor of UPSERT in PostgreSQL. It would
be great if he could answer this question.

I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.

You seem to be talking about doing this as an optimization on top of a
base feature that does the main thing you want (captures all errors
within an implementation level subxact without passing them to the
client). That could make sense, as a way of preventing extreme bloat
for a very bad case where almost all inserts have conflicts. (This
seems quite possible, whereas it seems much less likely that users
would have an input file simply full of ill-formed tuples.)

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow. I'm not sure if it's
possible to avoid using subtransactions all together, but speculative
insertion would help if you find that you can do it without
subtransactions. Using subtransactions is always going to be a bit
ugly, because you'll need to continually reassess whether or not
you're batching insertions together at the right granularity (that is,
that you've weighed the rate of XID consumption against how much work
you lose when a batched transaction has to be "replayed" to include
things that are known to be valid). And, if you care about duplicate
violations, then you can't really be sure that replaying a "known
good" tuple will stay good from one moment to the next.

My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types. Is there really much value in
ignoring errors due to invalid encoding? It's not as if such problems
can be reliably detected today. If you use the wrong encoding, and
ignore some errors that COPY would generally raise, then there is an
excellent chance that you'll still insert some remaining rows with
text that has been incorrectly interpreted as valid in the database
encoding -- some text datums are bound to accidentally appear valid.
There are probably similar issues with other types. It's not clear
what the point is at which the user is no longer helped by ignoring
problems, because we cannot reliably detect *all* problems at the
level of each row.

If you must ignore errors within the input functions of types, then
maybe you can optionally let the user do that by way of a "dry run",
where the entire input file is examined for basic structural soundness
ahead of considering constraints. Any errors are saved then and there,
in a format that can be used to make sure that those entries are
skipped on a later COPY. As a further enhancement, in the future, the
user might then be able to define special transform functions that
correct the errors for those rows only. You kind of need to see all
the faulty rows together to do something like that, so a dry run could
make a lot of sense.
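As a stand-alone illustration of that dry-run idea (nothing PostgreSQL-specific, and the format handling is deliberately naive: tab-delimited, no quoting rules), one could pre-scan the file and record the structurally bad line numbers for a later load to skip:

#include <stdio.h>
#include <stdlib.h>

/*
 * Scan a tab-delimited file, check only the field count per line, and write
 * the offending line numbers to a side file.  A trailing line without a
 * newline is not checked; this is a sketch, not COPY's real parser.
 */
int
main(int argc, char **argv)
{
	FILE	   *in;
	FILE	   *report;
	int			expected_fields;
	int			fields = 1;
	int			c;
	long		lineno = 1;

	if (argc != 4)
	{
		fprintf(stderr, "usage: %s input.tsv bad_lines.txt nfields\n", argv[0]);
		return 1;
	}
	in = fopen(argv[1], "r");
	report = fopen(argv[2], "w");
	expected_fields = atoi(argv[3]);
	if (in == NULL || report == NULL || expected_fields < 1)
		return 1;

	while ((c = getc(in)) != EOF)
	{
		if (c == '\t')
			fields++;
		else if (c == '\n')
		{
			if (fields != expected_fields)
				fprintf(report, "%ld\n", lineno);	/* remember the bad line */
			fields = 1;
			lineno++;
		}
	}
	fclose(report);
	fclose(in);
	return 0;
}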

--
Peter Geoghegan


#18Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Peter Geoghegan (#17)
1 attachment(s)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Thank you for your comments, Peter; there are some points that I had not thought about before.

On 9 Jun 2017, at 01:09, Peter Geoghegan <pg@bowt.ie> wrote:

Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project scope, but
maybe there is a relatively simple way to internally perform tuple inserts
with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor of UPSERT in PostgreSQL. It would
be great if he could answer this question.

I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.

I am not going to start with "speculative insertion" right now, but it would be very
useful if you could give me a pointer on where to start. Maybe I will at least try to
evaluate the complexity of the problem.

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow.
...
My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types.

Initially I was thinking only about malformed rows, e.g. missing or extra columns.
Honestly, I did not know that there are so many levels and places where an error
can occur. So currently (and especially after your comments) I prefer to focus
only on the following list of errors:

1) File format issues
a. Fewer columns than needed
b. Extra columns

2) I have doubts about type mismatches. It is possible to imagine a situation where,
e.g., some integers are exported as int and some as "int", but I am not sure
that this is a common situation.

3) Some constraint violations, e.g. unique index.

The first appears to be easily achievable without subtransactions. I have created a
proof-of-concept version of COPY where error handling is turned on by default.
Please see the small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec)
or view it on GitHub: https://github.com/ololobus/postgres/pull/1/files.
It throws warnings instead of errors for malformed lines with missing/extra columns
and reports the line number.

The second is probably achievable without subtransactions via PG_TRY/PG_CATCH
around heap_form_tuple, since the tuple is not yet inserted into the heap.

But the third is questionable without subtransactions, since even if we check
constraints once, there may be various before/after triggers which can modify the
tuple so that it no longer satisfies them. The corresponding comment inside copy.c states:
"Note that a BR trigger might modify tuple such that the partition constraint is
no satisfied, so we need to check in that case." Thus, there may be different
situations here, as I understand it. However, it is a point where "speculative insertion"
is able to help.

These three cases should cover most real-life scenarios.

Is there really much value in ignoring errors due to invalid encoding?

Now I have some doubts about it too. If there is an encoding problem,
it probably affects the whole file, not just a few rows.

Alexey

Attachments:

copy-errors-v0.1.patch (application/octet-stream)
From 86fd522d40eefc6320ee4f963404454855d5dcfc Mon Sep 17 00:00:00 2001
From: Alex K <alex.lumir@gmail.com>
Date: Fri, 9 Jun 2017 23:41:51 +0300
Subject: [PATCH 1/2] Allow ignoring some errors during COPY FROM

---
 contrib/file_fdw/file_fdw.c |   4 +-
 src/backend/commands/copy.c | 294 ++++++++++++++++++++++++++------------------
 src/include/commands/copy.h |   2 +-
 3 files changed, 178 insertions(+), 122 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 277639f6e9..1d855ca6e7 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -689,7 +689,7 @@ fileIterateForeignScan(ForeignScanState *node)
 {
 	FileFdwExecutionState *festate = (FileFdwExecutionState *) node->fdw_state;
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
-	bool		found;
+	int		found;
 	ErrorContextCallback errcallback;
 
 	/* Set up callback to identify error line number. */
@@ -1080,7 +1080,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 	TupleDesc	tupDesc;
 	Datum	   *values;
 	bool	   *nulls;
-	bool		found;
+	int		    found;
 	char	   *filename;
 	bool		is_program;
 	List	   *options;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 0a33c40c17..d40d753c47 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -54,6 +54,16 @@
 #define OCTVALUE(c) ((c) - '0')
 
 /*
+    NextCopyFrom states:
+    0 – Error or stop
+    1 – Successfully read data
+    2 – Data with errors, skip
+*/
+#define NCF_STOP         0
+#define NCF_SUCCESS      1
+#define NCF_SKIP         2
+
+/*
  * Represents the different source/dest cases we need to worry about at
  * the bottom level
  */
@@ -139,6 +149,7 @@ typedef struct CopyStateData
 	int			cur_lineno;		/* line number for error messages */
 	const char *cur_attname;	/* current att for error messages */
 	const char *cur_attval;		/* current att value for error messages */
+    bool        ignore_errors;  /* ignore errors during COPY FROM */
 
 	/*
 	 * Working state for COPY TO/FROM
@@ -2310,6 +2321,7 @@ CopyFrom(CopyState cstate)
 	bool		useHeapMultiInsert;
 	int			nBufferedTuples = 0;
 	int			prev_leaf_part_index = -1;
+    int         next_cf_state; /* NextCopyFrom return state */
 
 #define MAX_BUFFERED_TUPLES 1000
 	HeapTuple  *bufferedTuples = NULL;	/* initialize to silence warning */
@@ -2519,116 +2531,125 @@ CopyFrom(CopyState cstate)
 		/* Switch into its memory context */
 		MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
 
-		if (!NextCopyFrom(cstate, econtext, values, nulls, &loaded_oid))
-			break;
-
-		/* And now we can form the input tuple. */
-		tuple = heap_form_tuple(tupDesc, values, nulls);
-
-		if (loaded_oid != InvalidOid)
-			HeapTupleSetOid(tuple, loaded_oid);
-
-		/*
-		 * Constraints might reference the tableoid column, so initialize
-		 * t_tableOid before evaluating them.
-		 */
-		tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
-
-		/* Triggers and stuff need to be invoked in query context. */
-		MemoryContextSwitchTo(oldcontext);
-
-		/* Place tuple in tuple slot --- but slot shouldn't free it */
-		slot = myslot;
-		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-
-		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
-		{
-			int			leaf_part_index;
-			TupleConversionMap *map;
-
-			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
-			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-											 cstate->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
-
-			/*
-			 * If this tuple is mapped to a partition that is not same as the
-			 * previous one, we'd better make the bulk insert mechanism gets a
-			 * new buffer.
-			 */
-			if (prev_leaf_part_index != leaf_part_index)
-			{
-				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
-			}
-
-			/*
-			 * Save the old ResultRelInfo and switch to the one corresponding
-			 * to the selected partition.
-			 */
-			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
-
-			/* We do not yet have a way to insert into a foreign partition */
-			if (resultRelInfo->ri_FdwRoutine)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot route inserted tuples to a foreign table")));
-
-			/*
-			 * For ExecInsertIndexTuples() to work on the partition's indexes
-			 */
-			estate->es_result_relation_info = resultRelInfo;
-
-			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
-			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
-			if (map)
-			{
-				Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-				tuple = do_convert_tuple(tuple, map);
-
-				/*
-				 * We must use the partition's tuple descriptor from this
-				 * point on.  Use a dedicated slot from this point on until
-				 * we're finished dealing with the partition.
-				 */
-				slot = cstate->partition_tuple_slot;
-				Assert(slot != NULL);
-				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-			}
+        next_cf_state = NextCopyFrom(cstate, econtext, values, nulls, &loaded_oid);
 
-			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
-		}
-
-		skip_tuple = false;
-
-		/* BEFORE ROW INSERT Triggers */
-		if (resultRelInfo->ri_TrigDesc &&
-			resultRelInfo->ri_TrigDesc->trig_insert_before_row)
-		{
-			slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
-
-			if (slot == NULL)	/* "do nothing" */
-				skip_tuple = true;
-			else	/* trigger might have changed tuple */
-				tuple = ExecMaterializeSlot(slot);
+		if (!next_cf_state) {
+			break;
 		}
+        else if (next_cf_state == NCF_SUCCESS)
+        {
+    		/* And now we can form the input tuple. */
+    		tuple = heap_form_tuple(tupDesc, values, nulls);
+
+    		if (loaded_oid != InvalidOid)
+    			HeapTupleSetOid(tuple, loaded_oid);
+
+    		/*
+    		 * Constraints might reference the tableoid column, so initialize
+    		 * t_tableOid before evaluating them.
+    		 */
+    		tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+    		/* Triggers and stuff need to be invoked in query context. */
+    		MemoryContextSwitchTo(oldcontext);
+
+    		/* Place tuple in tuple slot --- but slot shouldn't free it */
+    		slot = myslot;
+    		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
+
+    		/* Determine the partition to heap_insert the tuple into */
+    		if (cstate->partition_dispatch_info)
+    		{
+    			int			leaf_part_index;
+    			TupleConversionMap *map;
+
+    			/*
+    			 * Away we go ... If we end up not finding a partition after all,
+    			 * ExecFindPartition() does not return and errors out instead.
+    			 * Otherwise, the returned value is to be used as an index into
+    			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
+    			 * will get us the ResultRelInfo and TupleConversionMap for the
+    			 * partition, respectively.
+    			 */
+    			leaf_part_index = ExecFindPartition(resultRelInfo,
+    											 cstate->partition_dispatch_info,
+    												slot,
+    												estate);
+    			Assert(leaf_part_index >= 0 &&
+    				   leaf_part_index < cstate->num_partitions);
+
+    			/*
+    			 * If this tuple is mapped to a partition that is not same as the
+    			 * previous one, we'd better make the bulk insert mechanism gets a
+    			 * new buffer.
+    			 */
+    			if (prev_leaf_part_index != leaf_part_index)
+    			{
+    				ReleaseBulkInsertStatePin(bistate);
+    				prev_leaf_part_index = leaf_part_index;
+    			}
+
+    			/*
+    			 * Save the old ResultRelInfo and switch to the one corresponding
+    			 * to the selected partition.
+    			 */
+    			saved_resultRelInfo = resultRelInfo;
+    			resultRelInfo = cstate->partitions + leaf_part_index;
+
+    			/* We do not yet have a way to insert into a foreign partition */
+    			if (resultRelInfo->ri_FdwRoutine)
+    				ereport(ERROR,
+    						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+    				 errmsg("cannot route inserted tuples to a foreign table")));
+
+    			/*
+    			 * For ExecInsertIndexTuples() to work on the partition's indexes
+    			 */
+    			estate->es_result_relation_info = resultRelInfo;
+
+    			/*
+    			 * We might need to convert from the parent rowtype to the
+    			 * partition rowtype.
+    			 */
+    			map = cstate->partition_tupconv_maps[leaf_part_index];
+    			if (map)
+    			{
+    				Relation	partrel = resultRelInfo->ri_RelationDesc;
+
+    				tuple = do_convert_tuple(tuple, map);
+
+    				/*
+    				 * We must use the partition's tuple descriptor from this
+    				 * point on.  Use a dedicated slot from this point on until
+    				 * we're finished dealing with the partition.
+    				 */
+    				slot = cstate->partition_tuple_slot;
+    				Assert(slot != NULL);
+    				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
+    				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
+    			}
+
+    			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+    		}
+
+    		skip_tuple = false;
+
+    		/* BEFORE ROW INSERT Triggers */
+    		if (resultRelInfo->ri_TrigDesc &&
+    			resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+    		{
+    			slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
+
+    			if (slot == NULL)	/* "do nothing" */
+    				skip_tuple = true;
+    			else	/* trigger might have changed tuple */
+    				tuple = ExecMaterializeSlot(slot);
+    		}
+        }
+        else
+        {
+            skip_tuple = true;
+        }
 
 		if (!skip_tuple)
 		{
@@ -2925,6 +2946,7 @@ BeginCopyFrom(ParseState *pstate,
 	cstate->cur_lineno = 0;
 	cstate->cur_attname = NULL;
 	cstate->cur_attval = NULL;
+    cstate->ignore_errors = true;
 
 	/* Set up variables to avoid per-attribute overhead. */
 	initStringInfo(&cstate->attribute_buf);
@@ -3202,7 +3224,7 @@ NextCopyFromRawFields(CopyState cstate, char ***fields, int *nfields)
  * relation passed to BeginCopyFrom. This function fills the arrays.
  * Oid of the tuple is returned with 'tupleOid' separately.
  */
-bool
+int
 NextCopyFrom(CopyState cstate, ExprContext *econtext,
 			 Datum *values, bool *nulls, Oid *tupleOid)
 {
@@ -3220,11 +3242,20 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 	int		   *defmap = cstate->defmap;
 	ExprState **defexprs = cstate->defexprs;
 
+    int         error_level = ERROR;        /* Error level for COPY FROM input data errors */
+    int         exec_state = NCF_SUCCESS;   /* Return code */
+
 	tupDesc = RelationGetDescr(cstate->rel);
 	attr = tupDesc->attrs;
 	num_phys_attrs = tupDesc->natts;
 	attr_count = list_length(cstate->attnumlist);
 	nfields = file_has_oids ? (attr_count + 1) : attr_count;
+    
+    /* Set error level to WARNING, if errors handling is turned on */
+    if (cstate->ignore_errors)
+    {
+        error_level = WARNING;
+    }
 
 	/* Initialize all values for row to NULL */
 	MemSet(values, 0, num_phys_attrs * sizeof(Datum));
@@ -3240,13 +3271,17 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 
 		/* read raw fields in the next line */
 		if (!NextCopyFromRawFields(cstate, &field_strings, &fldct))
-			return false;
+			return NCF_STOP;
 
 		/* check for overflowing fields */
 		if (nfields > 0 && fldct > nfields)
-			ereport(ERROR,
+        {
+            exec_state = NCF_SKIP;
+			ereport(error_level,
 					(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 					 errmsg("extra data after last expected column")));
+        }
+
 
 		fieldno = 0;
 
@@ -3254,15 +3289,22 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 		if (file_has_oids)
 		{
 			if (fieldno >= fldct)
-				ereport(ERROR,
+            {
+                exec_state = NCF_SKIP;
+				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("missing data for OID column")));
+            }
+
 			string = field_strings[fieldno++];
 
 			if (string == NULL)
-				ereport(ERROR,
+            {
+                exec_state = NCF_SKIP;
+				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("null OID in COPY data")));
+            }
 			else if (cstate->oids && tupleOid != NULL)
 			{
 				cstate->cur_attname = "oid";
@@ -3270,9 +3312,13 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 				*tupleOid = DatumGetObjectId(DirectFunctionCall1(oidin,
 												   CStringGetDatum(string)));
 				if (*tupleOid == InvalidOid)
-					ereport(ERROR,
+                {
+                    exec_state = NCF_SKIP;
+					ereport(error_level,
 							(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 							 errmsg("invalid OID in COPY data")));
+                }
+
 				cstate->cur_attname = NULL;
 				cstate->cur_attval = NULL;
 			}
@@ -3285,10 +3331,14 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 			int			m = attnum - 1;
 
 			if (fieldno >= fldct)
-				ereport(ERROR,
+            {
+                exec_state = NCF_SKIP;
+				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("missing data for column \"%s\"",
 								NameStr(attr[m]->attname))));
+            }
+
 			string = field_strings[fieldno++];
 
 			if (cstate->convert_select_flags &&
@@ -3371,14 +3421,17 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 				ereport(ERROR,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("received copy data after EOF marker")));
-			return false;
+			return NCF_STOP;
 		}
 
 		if (fld_count != attr_count)
-			ereport(ERROR,
+        {
+            exec_state = NCF_SKIP;
+			ereport(error_level,
 					(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 					 errmsg("row field count is %d, expected %d",
 							(int) fld_count, attr_count)));
+        }
 
 		if (file_has_oids)
 		{
@@ -3393,9 +3446,12 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 														 -1,
 														 &isnull));
 			if (isnull || loaded_oid == InvalidOid)
-				ereport(ERROR,
+            {
+                exec_state = NCF_SKIP;
+				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("invalid OID in COPY data")));
+            }
 			cstate->cur_attname = NULL;
 			if (cstate->oids && tupleOid != NULL)
 				*tupleOid = loaded_oid;
@@ -3437,7 +3493,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 										 &nulls[defmap[i]]);
 	}
 
-	return true;
+	return exec_state;
 }
 
 /*
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index f081f2219f..c4d38dbea7 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -31,7 +31,7 @@ extern void ProcessCopyOptions(ParseState *pstate, CopyState cstate, bool is_fro
 extern CopyState BeginCopyFrom(ParseState *pstate, Relation rel, const char *filename,
 			  bool is_program, copy_data_source_cb data_source_cb, List *attnamelist, List *options);
 extern void EndCopyFrom(CopyState cstate);
-extern bool NextCopyFrom(CopyState cstate, ExprContext *econtext,
+extern int NextCopyFrom(CopyState cstate, ExprContext *econtext,
 			 Datum *values, bool *nulls, Oid *tupleOid);
 extern bool NextCopyFromRawFields(CopyState cstate,
 					  char ***fields, int *nfields);
-- 
2.11.0


From 7acc7b951c75f7e307f176d5ad991abe21592eb4 Mon Sep 17 00:00:00 2001
From: Alex K <alex.lumir@gmail.com>
Date: Sun, 11 Jun 2017 16:48:37 +0300
Subject: [PATCH 2/2] Report line number on each ERROR/WARNING

---
 src/backend/commands/copy.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d40d753c47..ed2cc87c8a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3279,7 +3279,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
             exec_state = NCF_SKIP;
 			ereport(error_level,
 					(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-					 errmsg("extra data after last expected column")));
+					 errmsg("extra data after last expected column at line %d", cstate->cur_lineno)));
         }
 
 
@@ -3293,7 +3293,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
                 exec_state = NCF_SKIP;
 				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-						 errmsg("missing data for OID column")));
+						 errmsg("missing data for OID column at line %d", cstate->cur_lineno)));
             }
 
 			string = field_strings[fieldno++];
@@ -3303,7 +3303,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
                 exec_state = NCF_SKIP;
 				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-						 errmsg("null OID in COPY data")));
+						 errmsg("null OID in COPY data at line %d", cstate->cur_lineno)));
             }
 			else if (cstate->oids && tupleOid != NULL)
 			{
@@ -3316,7 +3316,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
                     exec_state = NCF_SKIP;
 					ereport(error_level,
 							(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-							 errmsg("invalid OID in COPY data")));
+							 errmsg("invalid OID in COPY data at line %d", cstate->cur_lineno)));
                 }
 
 				cstate->cur_attname = NULL;
@@ -3335,8 +3335,8 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
                 exec_state = NCF_SKIP;
 				ereport(error_level,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-						 errmsg("missing data for column \"%s\"",
-								NameStr(attr[m]->attname))));
+						 errmsg("missing data for column \"%s\" at line %d",
+								NameStr(attr[m]->attname), cstate->cur_lineno)));
             }
 
 			string = field_strings[fieldno++];
-- 
2.11.0

#19Peter Geoghegan
pg@bowt.ie
In reply to: Alexey Kondratov (#18)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On Mon, Jun 12, 2017 at 3:52 AM, Alexey Kondratov
<kondratov.aleksey@gmail.com> wrote:

I am not going to start with "speculative insertion" right now, but it would be very
useful, if you give me a point, where to start. Maybe I will at least try to
evaluate the complexity of the problem.

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c:

* SpeculativeInsertionLockAcquire()

* HeapTupleHeaderSetSpeculativeToken()

* heap_insert() called with HEAP_INSERT_SPECULATIVE argument

* ExecInsertIndexTuples() with specInsert = true

* heap_finish_speculative()

* heap_abort_speculative()

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.
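
For reference, a rough sketch (not from this thread's patches) of the call sequence as
nodeModifyTable.c uses it, assuming PG 10-era signatures; resultRelationDesc, tuple,
slot, estate and arbiterIndexes stand in for whatever a COPY caller would supply:

    /* Hypothetical sketch of the nodeModifyTable.c-style speculative insertion sequence */
    uint32      specToken;
    List       *recheckIndexes;
    bool        specConflict = false;

    /* acquire the speculative-insertion lock and stamp the tuple with the token */
    specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
    HeapTupleHeaderSetSpeculativeToken(tuple->t_data, specToken);

    /* insert the heap tuple, marked as speculative */
    heap_insert(resultRelationDesc, tuple, estate->es_output_cid,
                HEAP_INSERT_SPECULATIVE, NULL);

    /* insert index entries; noDupErr = true reports conflicts instead of erroring out */
    recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self), estate,
                                           true, &specConflict, arbiterIndexes);

    /* confirm or back out the speculative insertion */
    if (specConflict)
        heap_abort_speculative(resultRelationDesc, tuple);
    else
        heap_finish_speculative(resultRelationDesc, tuple);

    /* wake up anyone waiting on our token */
    SpeculativeInsertionLockRelease(GetCurrentTransactionId());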

My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types.

Initially I was thinking only about malformed rows, e.g. less or extra columns.
Honestly, I did not know that there are so many levels and ways where error can occur.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worse
case. But, that's just my opinion, and it's possible that someone else
will disagree. Try to get a second opinion.

Limiting the feature to just skip rows on the basis of a formally
defined constraint failing (not including type input failure, or a
trigger throwing an error, and probably not including foreign key
failures because they're really triggers) might be a good approach.
MySQL's INSERT IGNORE is a bit like that, I think. (It doesn't *just*
ignore duplicate violations, unlike our ON CONFLICT DO NOTHING
feature).

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed". Plus maybe one or two other cases, like using speculative
insertion to back out of unique violation without consuming a subxact.
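
To make that concrete, the kind of signature change being floated might look roughly
like this (purely hypothetical; the emitError parameter name is made up here):

    /* Hypothetical variant: return false instead of throwing when emitError is false */
    extern bool ExecConstraints(ResultRelInfo *resultRelInfo,
                                TupleTableSlot *slot, EState *estate,
                                bool emitError);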

--
Peter Geoghegan


#20Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Peter Geoghegan (#19)
1 attachment(s)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:

I am not going to start with "speculative insertion" right now, but it would be very
useful, if you give me a point, where to start. Maybe I will at least try to
evaluate the complexity of the problem.

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worse
case.

Yes, and the worst case will probably be quite a frequent one, since it is not possible to do heap_multi_insert if BEFORE/INSTEAD triggers or partitioning exist (according to the current copy.c code). Thus, it will frequently fall back to a single heap_insert, and each insert wrapped in a subtransaction will consume XIDs too greedily and seriously affect performance. I like my previous idea less and less.

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed"

Currently ExecConstraints always throws an error, and I do not think it would be wise on my side to modify its behaviour.

I have updated my patch (rebased over the topmost master commit 94da2a6a9a05776953524424a3d8079e54bc5d94). Please find the patch file attached, or the always up-to-date version on GitHub: https://github.com/ololobus/postgres/pull/1/files

Currently, it catches all major errors in the input data:

1) Rows with missing/extra columns cause WARNINGs and are skipped

2) I found that input type format errors are thrown from the InputFunctionCall; and wrapped it up with PG_TRY/CATCH. I am not 100%

Alexey

Attachments:

copy-errors-v0.2.diff.zip (application/zip)
#21Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Peter Geoghegan (#19)
1 attachment(s)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

Sorry for a previous email, I have accidentally sent it unfinished.

On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie <mailto:pg@bowt.ie>> wrote:

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worse
case.

Yes, and the worst case will probably be quite a frequent one, since it is not
possible to do heap_multi_insert when BEFORE/INSTEAD triggers or partitioning
exist (according to the current copy.c code). Thus, it will frequently fall back
to a single heap_insert, and each insert wrapped in a subtransaction will
consume XIDs too greedily and seriously affect performance. I like my initial
idea less and less.

By the way, is it possible to use heap_multi_insert with speculative insertion too?

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed"

Currently, ExecConstraints always throws an error, and I do not think
it would be wise on my side to modify its behaviour.

I have updated my patch (rebased over the topmost master commit
94da2a6a9a05776953524424a3d8079e54bc5d94). Please find the patch
file attached, or the always up-to-date version on GitHub:
https://github.com/ololobus/postgres/pull/1/files

It catches all major errors in the input data:

1) Rows with missing/extra columns cause WARNINGs and are skipped

2) I found that input type format errors are thrown from
InputFunctionCall, and wrapped it up with PG_TRY/CATCH.

I am not sure that it is 100% transactionally safe, but it seems so,
since all these errors are handled before this point
https://github.com/postgres/postgres/blob/651902deb1551db8b401fdeab9bdb8a61cee7f9f/src/backend/commands/copy.c#L2628,
where the current COPY implementation has a mechanism to skip a tuple.
I use the same skip_tuple flag.

The patch passes the regression tests, except for a few that fail due to the
slightly changed error message texts.

Now, I think that it may be a good idea to separate all possible errors
into two groups:
– Malformed input data
– DB conflicts during insertion

The first is solved (I hope) well by the current patch. I can add, e.g., a
MAXERRORS flag to COPY, which will limit the number of errors.

The second may be solved with speculative insertion, using the same
ON CONFLICT DO syntax as in the INSERT statement.

This way, we do not use subtransactions at all, and we keep the behaviour
of INSERT and COPY predictable and consistent across the database. To me
this sounds much better than just swallowing all errors indiscriminately,
without any logic.

Alexey

Attachments:

copy-errors-v0.2.diff.zip (application/zip)
#22Alex K
kondratov.aleksey@gmail.com
In reply to: Alexey Kondratov (#21)
1 attachment(s)
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

On 16 Jun 2017, at 21:30, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.

After a closer look, I have figured out at least one difficulty: COPY
and INSERT follow different execution paths. INSERT goes through
the planner, while COPY does not. This leads to the absence of some required
attributes, like arbiterIndexes, which are available during INSERT via
PlanState/ModifyTableState. It is probably possible to get the same in
COPY, but it is not clear to me how.

Anyway, adding 'speculative insertion' to COPY is worth a
separate patch, and I would be glad to try implementing it.

In the meantime I have prepared a complete working patch with:

- ignoring of the input data formatting errors
- IGNORE_ERRORS parameter in the COPY options
- updated regression tests

Please find the patch attached, or check the web UI diff on GitHub as always:
https://github.com/ololobus/postgres/pull/1/files

Alexey

Attachments:

copy-errors-v1.0.diff.zip (application/zip)
#23Michael Paquier
michael.paquier@gmail.com
In reply to: Alex K (#22)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On Wed, Jun 21, 2017 at 11:37 PM, Alex K <kondratov.aleksey@gmail.com> wrote:

On 16 Jun 2017, at 21:30, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.

After a closer look, I have figured out at least one difficulty: COPY
and INSERT follow different execution paths. INSERT goes through
the planner, while COPY does not. This leads to the absence of some required
attributes, like arbiterIndexes, which are available during INSERT via
PlanState/ModifyTableState. It is probably possible to get the same in
COPY, but it is not clear to me how.

Anyway, adding 'speculative insertion' to COPY is worth a
separate patch, and I would be glad to try implementing it.

In the meantime I have prepared a complete working patch with:

- ignoring of the input data formatting errors
- IGNORE_ERRORS parameter in the COPY options
- updated regression tests

Please find the patch attached, or check the web UI diff on GitHub as always:
https://github.com/ololobus/postgres/pull/1/files

"git diff master --check" complains heavily, and the patch does not
apply anymore. The last patch is five months old as well, so I am marking
the patch as returned with feedback.
--
Michael

#24Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Michael Paquier (#23)
1 attachment(s)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi Michael,

"git diff master --check" complains heavily, and the patch does not
apply anymore. The last patch is five months old as well, so I am marking
the patch as returned with feedback.

I have rebased my branch and squashed all commits into one, since the
patch is quite small. Everything seems to be working fine now, patch
passes all regression tests.

The latest version is attached. I will sign it up for the next
commitfest right now, and I would really appreciate it if someone
reviewed it one day.

Patch diff may be also found on GitHub:
https://github.com/ololobus/postgres/pull/4/files

Best,

Alexey

Attachments:

copy-errors-v1.0.1.patch.zip (application/zip)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexey Kondratov (#24)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Alexey Kondratov wrote:

I have rebased my branch and squashed all commits into one, since the
patch is quite small. Everything seems to be working fine now, patch
passes all regression tests.

On a *very* quick look, please use an enum to return from NextCopyFrom
rather than 'int'. The chunks that change bool to int are very
odd-looking. This would move the comment that explains the value from
copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes
in the descriptions of those values; please don't.
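
For illustration, a minimal sketch of that suggestion in copy.h, reusing the value
names from the patch (the enum type name itself is just a placeholder):

    /* Sketch only: possible result codes for NextCopyFrom(), per the review comment */
    typedef enum NextCopyFromResult
    {
        NCF_SUCCESS,            /* row was read and parsed successfully */
        NCF_SKIP,               /* row was malformed and should be skipped */
        NCF_STOP                /* end of input, stop reading */
    } NextCopyFromResult;

    extern NextCopyFromResult NextCopyFrom(CopyState cstate, ExprContext *econtext,
                                           Datum *values, bool *nulls, Oid *tupleOid);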

But those are really just minor nitpicks while paging through. After
looking at the way you're handling errors, it seems that you've chosen
to go precisely the way people were saying that was not admissible: use
a PG_TRY block, and when things go south, simply log a WARNING and move
on. This is unsafe. For example: what happens if the error being
raised is out-of-memory? Failing to re-throw means you don't release
current transaction memory context. What if the error is that WAL
cannot be written because the WAL disk ran out of space? This would
just be reported as a warning over and over until the server log disk is
also full. What if your insertion fired a trigger that caused an update
which got a serializability exception? You cannot just move on, because
at that point session state (serializability session state) might be
busted until you abort the current transaction.

Or maybe I misunderstood the patch completely.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#26Alexey Kondratov
kondratov.aleksey@gmail.com
In reply to: Alvaro Herrera (#25)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On Fri, Dec 1, 2017 at 1:58 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On a *very* quick look, please use an enum to return from NextCopyFrom
rather than 'int'. The chunks that change bool to int are very
odd-looking. This would move the comment that explains the value from
copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes
in the descriptions of those values; please don't.

I will fix it, thank you.

Or maybe I misunderstood the patch completely.

I hope so. Here are my thoughts on how it all works; please correct me
where I am wrong:

1) First, I have simply changed ereport level to WARNING for specific
validations (extra or missing columns, etc) if the IGNORE_ERRORS option is
used. All these checks are inside NextCopyFrom. Thus, this patch
performs here pretty much the same as before, excepting that it is
possible to skip bad lines, and this part should be safe as well

2) About PG_TRY/CATCH. I use it to catch the only one specific
function call inside NextCopyFrom--it is InputFunctionCall--which is
used just to parse datatype from the input string. I have no idea how
WAL write or trigger errors could get here
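
The pattern in question is roughly the following (a reconstruction for illustration,
not the literal patch code; m, string, attr, in_functions, typioparams and exec_state
are as in NextCopyFrom):

    /* Reconstruction of the discussed pattern: downgrade a parse failure to a WARNING */
    PG_TRY();
    {
        values[m] = InputFunctionCall(&in_functions[m], string,
                                      typioparams[m], attr[m]->atttypmod);
    }
    PG_CATCH();
    {
        FlushErrorState();      /* discard the caught error without re-throwing */
        exec_state = NCF_SKIP;
        ereport(WARNING,
                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                 errmsg("invalid input for column \"%s\" at line %d, skipping row",
                        cstate->cur_attname, cstate->cur_lineno)));
    }
    PG_END_TRY();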

All of this is done before actually forming a tuple, putting it into
the heap, firing insert-related triggers, etc. I am not trying to
catch all errors during the row processing, only input data errors. So
why is it unsafe?

Best,

Alexey

#27Stephen Frost
sfrost@snowman.net
In reply to: Alexey Kondratov (#26)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Alexey,

* Alexey Kondratov (kondratov.aleksey@gmail.com) wrote:

On Fri, Dec 1, 2017 at 1:58 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On a *very* quick look, please use an enum to return from NextCopyFrom
rather than 'int'. The chunks that change bool to int are very
odd-looking. This would move the comment that explains the value from
copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes
in the descriptions of those values; please don't.

I will fix it, thank you.

Or maybe I misunderstood the patch completely.

I hope so. Here are my thoughts on how it all works; please correct me
where I am wrong:

1) First, I have simply changed ereport level to WARNING for specific
validations (extra or missing columns, etc) if the IGNORE_ERRORS option is
used. All these checks are inside NextCopyFrom. Thus, this patch
performs here pretty much the same as before, excepting that it is
possible to skip bad lines, and this part should be safe as well

2) About PG_TRY/CATCH. I use it to catch the only one specific
function call inside NextCopyFrom--it is InputFunctionCall--which is
used just to parse datatype from the input string. I have no idea how
WAL write or trigger errors could get here

All of this is done before actually forming a tuple, putting it into
the heap, firing insert-related triggers, etc. I am not trying to
catch all errors during the row processing, only input data errors. So
why is it unsafe?

The issue is that InputFunctionCall is actually calling out to other
functions and they could be allocating memory and doing other things.
What you're missing by using PG_TRY() here without doing a PG_RE_THROW()
is all the cleanup work that AbortTransaction() and friends do: ensuring
there's a valid memory context (cleaning up the ones that might have
been allocated by the input function), releasing locks, resetting user
and security contexts, and a whole slew of other things.

Is all of that always needed for an error thrown by an input function?
Hard to say, really, since input functions can be provided by
extensions. You might get away with doing this for int4in(), but we
also have to be thinking about extensions like PostGIS which introduce
their own geometry and geography data types that are a great deal more
complicated.

In the end, this isn't an acceptable approach to this problem. The
approach outlined previously using sub-transactions, which attempted
to insert some number of records and then, on failure, restarted and
inserted up until the failed record, skipped it, and started over
again with the next batch, seemed pretty reasonable to me. If you have
time to work on that, that'd be great; otherwise we'll probably need to
mark this as returned with feedback.
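
One possible shape of such a scheme, sketched very roughly (ReadBatch(), InsertOneRow()
and BATCH_SIZE are hypothetical placeholders, and this variant simply replays a failed
batch row by row; the memory-context and resource-owner bookkeeping follows the pattern
plpgsql uses around subtransactions):

    /* Illustrative sketch only: batch the inserts in a subtransaction, replay on failure */
    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;
    int         nrows;
    int         i;

    while ((nrows = ReadBatch(cstate, batch, BATCH_SIZE)) > 0)  /* hypothetical helper */
    {
        bool        failed = false;

        BeginInternalSubTransaction(NULL);
        MemoryContextSwitchTo(oldcontext);

        PG_TRY();
        {
            for (i = 0; i < nrows; i++)
                InsertOneRow(cstate, &batch[i]);    /* hypothetical helper */
            ReleaseCurrentSubTransaction();
        }
        PG_CATCH();
        {
            MemoryContextSwitchTo(oldcontext);
            FlushErrorState();
            RollbackAndReleaseCurrentSubTransaction();
            failed = true;
        }
        PG_END_TRY();

        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;

        if (!failed)
            continue;

        /* replay the failed batch one row per subtransaction, skipping bad rows */
        for (i = 0; i < nrows; i++)
        {
            BeginInternalSubTransaction(NULL);
            MemoryContextSwitchTo(oldcontext);

            PG_TRY();
            {
                InsertOneRow(cstate, &batch[i]);
                ReleaseCurrentSubTransaction();
            }
            PG_CATCH();
            {
                MemoryContextSwitchTo(oldcontext);
                FlushErrorState();
                RollbackAndReleaseCurrentSubTransaction();
                ereport(WARNING,
                        (errmsg("skipping bad row in current batch")));
            }
            PG_END_TRY();

            MemoryContextSwitchTo(oldcontext);
            CurrentResourceOwner = oldowner;
        }
    }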

Thanks!

Stephen

#28Craig Ringer
craig@2ndquadrant.com
In reply to: Stephen Frost (#27)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On 23 January 2018 at 15:21, Stephen Frost <sfrost@snowman.net> wrote:

Alexey,

* Alexey Kondratov (kondratov.aleksey@gmail.com) wrote:

On Fri, Dec 1, 2017 at 1:58 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On a *very* quick look, please use an enum to return from NextCopyFrom
rather than 'int'. The chunks that change bool to int are very
odd-looking. This would move the comment that explains the value from
copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes
in the descriptions of those values; please don't.

I will fix it, thank you.

Or maybe I misunderstood the patch completely.

I hope so. Here are my thoughts on how it all works; please correct me
where I am wrong:

1) First, I have simply changed ereport level to WARNING for specific
validations (extra or missing columns, etc) if the IGNORE_ERRORS option is
used. All these checks are inside NextCopyFrom. Thus, this patch
performs here pretty much the same as before, excepting that it is
possible to skip bad lines, and this part should be safe as well

2) About PG_TRY/CATCH. I use it to catch the only one specific
function call inside NextCopyFrom--it is InputFunctionCall--which is
used just to parse datatype from the input string. I have no idea how
WAL write or trigger errors could get here

All of this is done before actually forming a tuple, putting it into
the heap, firing insert-related triggers, etc. I am not trying to
catch all errors during the row processing, only input data errors. So
why is it unsafe?

The issue is that InputFunctionCall is actually calling out to other
functions and they could be allocating memory and doing other things.
What you're missing by using PG_TRY() here without doing a PG_RE_THROW()
is all the cleanup work that AbortTransaction() and friends do: ensuring
there's a valid memory context (cleaning up the ones that might have
been allocated by the input function), releasing locks, resetting user
and security contexts, and a whole slew of other things.

Is all of that always needed for an error thrown by an input function?
Hard to say, really, since input functions can be provided by
extensions. You might get away with doing this for int4in(), but we
also have to be thinking about extensions like PostGIS which introduce
their own geometry and geography data types that are a great deal more
complicated.

For example, an input function could conceivably take an LWLock, do some
work in the heapam, or whatever.

We don't have much in the way of rules about what input functions can or
cannot do, so you can't assume much about their behaviour and what must /
must not be cleaned up. Nor can you just reset the state in a heavy handed
manner like (say) plpgsql does.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#29Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Craig Ringer (#28)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On 1/22/18 21:33, Craig Ringer wrote:

We don't have much in the way of rules about what input functions can or
cannot do, so you can't assume much about their behaviour and what must
/ must not be cleaned up. Nor can you just reset the state in a heavy
handed manner like (say) plpgsql does.

I think one thing to try would be to define a special kind of exception
that can safely be caught and ignored. Then, input functions can
communicate benign parse errors by doing their own cleanup first, then
throwing this exception, and then the COPY subsystem can deal with it.
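
Purely as an illustration of that idea (no such "benign" error class existed at the
time; ERRCODE_INVALID_TEXT_REPRESENTATION merely stands in for it here, and skip_row
is a placeholder flag), the catching side might look something like:

    /* Hypothetical: catch only a designated "benign" errcode, re-throw everything else */
    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        values[m] = InputFunctionCall(&in_functions[m], string,
                                      typioparams[m], attr[m]->atttypmod);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();

        if (edata->sqlerrcode != ERRCODE_INVALID_TEXT_REPRESENTATION)
            PG_RE_THROW();      /* anything else remains a hard error */

        FlushErrorState();
        FreeErrorData(edata);
        skip_row = true;        /* placeholder flag: caller skips this line */
    }
    PG_END_TRY();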

Anyway, this patch is clearly not doing this, so I'm setting it to
returned with feedback.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30Craig Ringer
craig@2ndquadrant.com
In reply to: Peter Eisentraut (#29)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On 3 March 2018 at 13:08, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 1/22/18 21:33, Craig Ringer wrote:

We don't have much in the way of rules about what input functions can or
cannot do, so you can't assume much about their behaviour and what must
/ must not be cleaned up. Nor can you just reset the state in a heavy
handed manner like (say) plpgsql does.

I think one thing to try would be to define a special kind of exception
that can safely be caught and ignored. Then, input functions can
communicate benign parse errors by doing their own cleanup first, then
throwing this exception, and then the COPY subsystem can deal with it.

That makes sense. We'd only use the error code range in question when it
was safe to catch without re-throw, and we'd have to enforce rules around
using a specific memory context. Of course no LWLocks could be held, but
that's IIRC true when throwing anyway unless you plan to proc_exit() in
your handler.

People will immediately ask for it to handle RI errors too, so something
similar would be needed there. But frankly, Pg's RI handling for bulk
loading desperately needs a major change in how it works to make it
efficient anyway; the current model of individual row triggers is horrible
for bulk load performance.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig Ringer (#30)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Craig Ringer <craig@2ndquadrant.com> writes:

On 3 March 2018 at 13:08, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

I think one thing to try would be to define a special kind of exception
that can safely be caught and ignored. Then, input functions can
communicate benign parse errors by doing their own cleanup first, then
throwing this exception, and then the COPY subsystem can deal with it.

That makes sense. We'd only use the error code range in question when it
was safe to catch without re-throw, and we'd have to enforce rules around
using a specific memory context.

I don't think that can possibly work. It would only be safe if, between
the thrower and the catcher, there were no other levels of control
operating according to the normal error-handling rules. But input
functions certainly cannot assume that they are only called by COPY,
so how could they safely throw a "soft error"?

regards, tom lane

#32Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tom Lane (#31)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

On 3/3/18 00:48, Tom Lane wrote:

I don't think that can possibly work. It would only be safe if, between
the thrower and the catcher, there were no other levels of control
operating according to the normal error-handling rules. But input
functions certainly cannot assume that they are only called by COPY,
so how could they safely throw a "soft error"?

That assumes that throwing a soft error in a context that does not
handle it specially is not safe. I'd imagine in such situations the
soft error just behaves like a normal exception.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#33Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Eisentraut (#32)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2018-03-03 15:02 GMT+01:00 Peter Eisentraut <peter.eisentraut@2ndquadrant.com>:

On 3/3/18 00:48, Tom Lane wrote:

I don't think that can possibly work. It would only be safe if, between
the thrower and the catcher, there were no other levels of control
operating according to the normal error-handling rules. But input
functions certainly cannot assume that they are only called by COPY,
so how could they safely throw a "soft error"?

That assumes that throwing a soft error in a context that does not
handle it specially is not safe. I'd imagine in such situations the
soft error just behaves like a normal exception.

What does this soft error solve? If somebody uses fault-tolerant COPY, then he
will not be satisfied when only format errors are ignored. Any constraint
errors, and maybe trigger errors, should be ignored too.

But it is true that some other databases raise a soft error on format issues
and don't raise an exception. Isn't it better to use some alternative
algorithm, like:

under a subtransaction, try to insert a block of 1000 lines; when something
fails, do a rollback and retry the import in blocks of 100 lines, trying to
find the wrong line?

Or another way: for this specific case, reduce the cost of a subtransaction?
This is a special case, a subtransaction under one command.

Regards

Pavel


--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#32)
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 3/3/18 00:48, Tom Lane wrote:

I don't think that can possibly work. It would only be safe if, between
the thrower and the catcher, there were no other levels of control
operating according to the normal error-handling rules. But input
functions certainly cannot assume that they are only called by COPY,
so how could they safely throw a "soft error"?

That assumes that throwing a soft error in a context that does not
handle it specially is not safe. I'd imagine in such situations the
soft error just behaves like a normal exception.

My point is that neither the thrower of the error, nor the catcher of it,
can possibly guarantee that there were no levels of control in between
that were expecting their resource leakages to be cleaned up by normal
error recovery. As a counterexample consider the chain of control

COPY -> domain_in -> SQL function in CHECK -> int4in

If int4in throws a "soft" error, and COPY catches it and skips
error cleanup, we've likely got problems, depending on what the SQL
function did. This could even break domain_in, although probably any
such effort would've taken care to harden domain_in against the issue.
But the possibility of SQL or PL functions having done nearly-arbitrary
things in between means that most of the backend is at hazard.

You could imagine making something like this work, but it's gonna take
very invasive and bug-inducing changes to our error cleanup rules,
affecting a ton of places that have no connection to the goal.

regards, tom lane