BUG #13888: pg_dump write error

Started by Vladimir Kunschikovover 10 years ago11 messagesbugs

kunschikov@gmail.com

over 10 years ago

The following bug has been logged on the website:

Bug reference: 13888
Logged by: Vladimir Kunschikov
Email address: kunschikov@gmail.com
PostgreSQL version: 9.4.5
Operating system: Linux
Description:

Database backup in heavy loaded environment sometimes fails with the
following error:

pg_dump: [parallel archiver] could not write to output file: Success

Subsequent rerun of the pg_dump solves that problem.
More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in
ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
There can be some error checking instead of this macro.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Alvaro Herrera

alvherre@2ndquadrant.com

over 10 years ago

In reply to: Vladimir Kunschikov (#1)

Re: BUG #13888: pg_dump write error

kunschikov@gmail.com wrote:

Database backup in heavy loaded environment sometimes fails with the
following error:

pg_dump: [parallel archiver] could not write to output file: Success

Subsequent rerun of the pg_dump solves that problem.
More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in
ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
There can be some error checking instead of this macro.

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Vladimir Kunschikov

kunschikov@gmail.com

over 10 years ago

In reply to: Alvaro Herrera (#2)

Re: BUG #13888: pg_dump write error

Attached small refactoring of the ahwrite() which includes checking of the
errno.
It was made with postgres-9.4.5 sources but applicable to 9.4.1 and even to
9.5.0 with small fuss at ExecuteSqlCommandBuf() call
It could be done with small check of the errno with rerun without any
refactoring.

On Mon, Jan 25, 2016 at 7:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Show quoted text

kunschikov@gmail.com wrote:

Database backup in heavy loaded environment sometimes fails with the
following error:

pg_dump: [parallel archiver] could not write to output file: Success

Subsequent rerun of the pg_dump solves that problem.
More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement

in

ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
There can be some error checking instead of this macro.

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Michael Paquier

michael@paquier.xyz

over 10 years ago

In reply to: Alvaro Herrera (#2)

Re: BUG #13888: pg_dump write error

On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

kunschikov@gmail.com wrote:

Database backup in heavy loaded environment sometimes fails with the
following error:

pg_dump: [parallel archiver] could not write to output file: Success

Subsequent rerun of the pg_dump solves that problem.
More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in
ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
There can be some error checking instead of this macro.

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Michael Paquier (#4)

Re: BUG #13888: pg_dump write error

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?

I do not understand these claims that there isn't an error check there.
There surely is. But fwrite() didn't set errno.

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here. Should we?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Michael Paquier

michael@paquier.xyz

over 10 years ago

In reply to: Tom Lane (#5)

Re: BUG #13888: pg_dump write error

On Tue, Jan 26, 2016 at 11:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?

I do not understand these claims that there isn't an error check there.
There surely is. But fwrite() didn't set errno.

Yep, and the error message provided let's the user think that there
has been a failure, with a success. I am wondering about something
like that:
-#define WRITE_ERROR_EXIT \
+#define WRITE_ERROR_EXIT(errno) \
        do { \
-               exit_horribly(modulename, "could not write to output
file: %s\n", \
-                                         strerror(errno)); \
+               if (errno != 0) \
+                       exit_horribly(modulename, "could not write to
output file: %s\n", \
+                                                 strerror(errno)); \
+               else \
+                       exit_horribly(modulename, "could not write to
output file\n"); \
        } while (0)
A more invasive approach would be to actually replace most of the
READ/WRITE_ERROR_EXIT stuff by more appropriate error checks with
customized error messages. And I would prefer that, this helps having
a deeper understanding of exactly where the error occurred when
running pg_dump.

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here. Should we?

I would keep the code simple, we don't do that in the backend I recall.
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Vladimir Kunschikov

kunschikov@gmail.com

over 10 years ago

In reply to: Tom Lane (#5)

Re: BUG #13888: pg_dump write error

pg_dump: could not write to output file: Success

We don't make any attempt to support filesystems that require retries

Filesystem: VFAT on usb media.Seems like short write was caused not by
filesystem but by gzwrite(), since pg_dump was called with `-F d` switch:

pg_dump -U *** -d **** -v -j 2 -t "alert_counters" -t "aggr_alert_counters"
-t "alerts*" -t "payloads*" -F d -f
"/media/usb0/backup_963871436_2016-01-18_09-16-12"

I think there was EAGAIN error. We are solving this problem now by
suggesting to users rerun of the backup script. I've attached small patch
to the ahwrite() function in my previous posting (it is waiting for the
moderation approval ).

On Tue, Jan 26, 2016 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?

I do not understand these claims that there isn't an error check there.
There surely is. But fwrite() didn't set errno.

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here. Should we?

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

over 10 years ago

In reply to: Tom Lane (#5)

Re: BUG #13888: pg_dump write error

Tom Lane wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?

I do not understand these claims that there isn't an error check there.
There surely is. But fwrite() didn't set errno.

Yeah, that's what I was remembering actually:
/messages/by-id/20150608174336.GM133018@postgresql.org

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here. Should we?

Sounds likely.

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Alvaro Herrera (#8)

Re: BUG #13888: pg_dump write error

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

I do not understand these claims that there isn't an error check there.
There surely is. But fwrite() didn't set errno.

Yeah, that's what I was remembering actually:
/messages/by-id/20150608174336.GM133018@postgresql.org

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here. Should we?

Sounds likely.

Actually, now that I think about it while not in an airport, what I
think we really need is logic along the lines of

errno = 0;
nbytes = fwrite(...);
if (nbytes != expected && errno == 0)
errno = ENOSPC;

ie, assume a short write implies out-of-disk-space. I believe that
is what we do in most (hopefully all) cases in the backend; see for
example UpdateControlFile() in xlog.c.

IMO it is not really WRITE_ERROR_EXIT's place to assume that errno == 0
should be replaced by ENOSPC; in particular, that would be incorrect,
or at least insufficient, unless there's an explicit initialization of
"errno = 0;" before the fwrite call. So I'm inclined to put the
responsibility to do all of the above logic on the fwrite call sites,
rather than splitting it between callers and WRITE_ERROR_EXIT.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Tom Lane (#9)

Re: BUG #13888: pg_dump write error

On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Actually, now that I think about it while not in an airport, what I
think we really need is logic along the lines of

errno = 0;
nbytes = fwrite(...);
if (nbytes != expected && errno == 0)
errno = ENOSPC;

ie, assume a short write implies out-of-disk-space. I believe that
is what we do in most (hopefully all) cases in the backend; see for
example UpdateControlFile() in xlog.c.

There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found a case where large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there aren't other cases where that should be done, die to large writes.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#11

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Andres Freund (#10)

Re: BUG #13888: pg_dump write error

Andres Freund <andres@anarazel.de> writes:

On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

ie, assume a short write implies out-of-disk-space. I believe that
is what we do in most (hopefully all) cases in the backend; see for
example UpdateControlFile() in xlog.c.

There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found a case where large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there aren't other cases where that should be done, die to large writes.

I could support making pg_dump do that if there's a way to do it when
writing through zlib (which is probably the normal case these days).
I'm not at all sure that gzwrite() is retryable after a partial write;
that would likely have consequences for the internal state of the
compressor.

In any case, if you get nbytes == 0 and errno == 0, substituting ENOSPC
seems like the right thing.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

BUG #13888: pg_dump write error

Attachments: