Why doesn't "copy ... from stdin" return immediately when reading invalid data?

Started by Nicolas Grilly · almost 15 years ago · 3 messages
#1 Nicolas Grilly
nicolas@gardentechno.com

Hello,

I am importing gigabytes of data into PostgreSQL, and I don't want to wait
10 minutes just to discover an error in the 10th line of my input file.

I tried the command "\copy ... from stdin" in psql, and it looks like psql
has to read the entire input before reporting a potential error, even if the
invalid value is in one of the first rows.

Is this a limitation of the PostgreSQL protocol, of the libpq library, or of
the psql tool?

Is the copy protocol (i.e., PQputCopyData and PQputCopyEnd) designed to send
gigabytes of data with just one "copy ... from stdin" query, and is there a
way to be notified of a potential error before calling PQputCopyEnd? Or do I
have to send my data in small chunks (for example, batches of 10,000
rows), issue a PQputCopyEnd, check for errors, and continue with the next
chunk?
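[Editor's note: the chunked strategy asked about here can be sketched without a database. In this minimal Python sketch, `send_chunk` is a hypothetical stand-in for one full COPY round-trip (PQexec → PQputCopyData… → PQputCopyEnd → PQgetResult) and raises an exception when the server rejects the chunk; the names and the chunk size are illustrative assumptions, not anything from libpq itself.]

```python
from itertools import islice

def chunks(rows, size):
    """Yield lists of at most `size` rows from the iterable `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def load(rows, send_chunk, chunk_size=10000):
    """Send `rows` in chunks; stop at the first chunk the server rejects.

    `send_chunk` is a hypothetical callable standing in for one complete
    COPY ... FROM STDIN round-trip; it raises ValueError when the server
    reports an error.  Returns the number of rows successfully loaded.
    """
    loaded = 0
    for chunk in chunks(rows, chunk_size):
        # Any server-side error surfaces here, after at most
        # chunk_size rows, instead of after the whole input.
        send_chunk(chunk)
        loaded += len(chunk)
    return loaded
```

With a chunk size of 10,000, a bad value on line 10 of the input is reported after the first chunk's PQputCopyEnd rather than after the entire multi-gigabyte file.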

Thanks for your help and advice.

Regards,

Nicolas Grilly

#2 John R Pierce
pierce@hogranch.com
In reply to: Nicolas Grilly (#1)
Re: Why doesn't "copy ... from stdin" return immediately when reading invalid data?

On 02/02/11 10:20 AM, Nicolas Grilly wrote:

Is the copy protocol (i.e., PQputCopyData and PQputCopyEnd) designed to
send gigabytes of data with just one "copy ... from stdin" query, and
is there a way to be notified of a potential error before calling
PQputCopyEnd? Or do I have to send my data in small chunks (for
example, batches of 10,000 rows), issue a PQputCopyEnd, check for errors,
and continue with the next chunk?

I would batch the data, maybe 1,000 lines or even 100 lines at a time if
these errors are at all frequent. Put the errored batches in an
exception list or something so you can sort them out later.
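[Editor's note: the batch-plus-exception-list idea could look like the following database-free Python sketch; `copy_batch` is a hypothetical stand-in for one COPY ... FROM STDIN round-trip that raises when the server rejects the batch, and all names here are illustrative, not part of any real driver API.]

```python
def load_with_exceptions(rows, copy_batch, batch_size=1000):
    """Send rows in small batches; keep batches that errored for later review.

    `copy_batch(batch)` stands in for one COPY ... FROM STDIN round-trip
    and raises ValueError when the server rejects the batch.
    Returns (rows_loaded, errored_batches).
    """
    loaded = 0
    errored = []
    batch = []

    def flush():
        nonlocal loaded
        if not batch:
            return
        try:
            copy_batch(list(batch))
            loaded += len(batch)
        except ValueError:
            # Keep the whole rejected batch so it can be sorted out later.
            errored.append(list(batch))
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    flush()  # send the final partial batch, if any
    return loaded, errored
```

Smaller batches localize an error to fewer rows at the cost of more COPY round-trips, which is the trade-off behind the 100-vs-1,000-line suggestion above.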

#3 Nicolas Grilly
nicolas@gardentechno.com
In reply to: John R Pierce (#2)
Re: Why doesn't "copy ... from stdin" return immediately when reading invalid data?

I have analyzed the PostgreSQL protocol using Wireshark (an open source
packet analyzer), and I observed that the PostgreSQL backend, while doing a
COPY ... FROM STDIN, reports errors as soon as possible (especially errors
related to invalid data).

Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN
is not a limitation of the underlying protocol; it is a limitation (or a
design choice) of the libpq library.

It looks like this is a well-known issue, because it is listed on the TODO
list:
http://wiki.postgresql.org/wiki/Todo#COPY

And was discussed before:
http://archives.postgresql.org/pgsql-hackers/2008-04/msg01169.php

Do you think it is possible to change that behavior, or work around it?

While reading the libpq source code, I noticed that the function pqParseInput3 (file
fe-protocol3.c) ignores error responses while the connection is
in the PGASYNC_COPY_IN state. Maybe we could make a special case for the COPY FROM
subprotocol and handle errors early, in order to make them available to
PQgetResult? Is it feasible in a simple way, or is it a bad idea?

Regards,

Nicolas Grilly

On Wed, Feb 2, 2011 at 20:06, John R Pierce <pierce@hogranch.com> wrote:

On 02/02/11 10:20 AM, Nicolas Grilly wrote:

Is the copy protocol (i.e., PQputCopyData and PQputCopyEnd) designed to send
gigabytes of data with just one "copy ... from stdin" query, and is there a
way to be notified of a potential error before calling PQputCopyEnd? Or do I
have to send my data in small chunks (for example, batches of 10,000 rows),
issue a PQputCopyEnd, check for errors, and continue with the next chunk?

I would batch the data, maybe 1,000 lines or even 100 lines at a time if
these errors are at all frequent. Put the errored batches in an exception
list or something so you can sort them out later.