Aborted Transaction During Vacuum

Started by G. Anthony Reinaover 26 years ago13 messages

reina@nsi.edu

over 26 years ago

I've been getting a strange error from the vacuum command. When I try
'vacuum verbose analyze' the vacuum goes through the tables fine until
just after finishing one particular table. Then I get the error:

NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
We have lost the connection to the backend, so further processing is
impossible. Terminating.

When I try to vacuum the tables individually, I get no problems with
aborted backends.

Does anyone know what is going on here?

-Tony

PostgreSQL 6.5.1 on RH Linux 6.0, PII/400 MHz, 512 Meg RAM, 18 Gig HD

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: G. Anthony Reina (#1)

Re: [HACKERS] Aborted Transaction During Vacuum

"G. Anthony Reina" <reina@nsi.edu> writes:

I've been getting a strange error from the vacuum command. When I try
'vacuum verbose analyze' the vacuum goes through the tables fine until
just after finishing one particular table. Then I get the error:

NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.

Interesting. Is there any additional message appearing in the
postmaster log? Is a core file being generated? (look in the
data/base/ subdirectory for the database in question) If there
is a corefile, a debugger backtrace from it would be helpful.

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofThu12Aug1999163332-070037B359CC.C9CCB870@nsi.edu | Resolved by subject fallback

G. Anthony Reina

reina@nsi.edu

over 26 years ago

In reply to: Tom Lane (#2)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom Lane wrote:

Interesting. Is there any additional message appearing in the
postmaster log? Is a core file being generated? (look in the
data/base/ subdirectory for the database in question) If there
is a corefile, a debugger backtrace from it would be helpful.

regards, tom lane

Tom,

I tried the 'vacuum verbose analyze' again today. I get the same error
with the AbortTransaction. There is a core file generated but no pg_vlock
file. The core is over 1 Gig in size (38 Megs gzipped) so I'm not so sure
you'd want to get that (you can have it if you want it though!). Plus,
there seems to be nothing written to the postmaster.log file (I re-started
the postmaster before the vacuum using 'nohup postmaster -i -B 15000 -o -F

postmaster.log&').

Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
like like a bug you patched for him a few days ago:

It does sound very like the bug that was found a few days back where the

pg_vlock file gets deleted by a mistaken call to vc_abort(). This only

gets called in the analyze code.

We're probably talking about the same bug. When I try just 'vacuum verbose'
without the analyze, the vacuum completes just fine. So it must be within
the analyze code.

-Tony

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: G. Anthony Reina (#3)

Re: [HACKERS] Aborted Transaction During Vacuum

"G. Anthony Reina" <reina@nsi.edu> writes:

I tried the 'vacuum verbose analyze' again today. I get the same error
with the AbortTransaction. There is a core file generated but no pg_vlock
file. The core is over 1 Gig in size (38 Megs gzipped) so I'm not so sure
you'd want to get that (you can have it if you want it though!).

It wouldn't do me any good anyway without a copy of the executable and a
copy of gdb built for whatever platform you are on. I was hoping you
could run gdb on the corefile there and just send the backtrace output
("gdb postgres-executable core-file" and then say "bt").

Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
like like a bug you patched for him a few days ago:

I doubt it's the same bug --- the error message generated by the other
bug was different...

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri13Aug1999102810-070037B455AA.99AC7CA2@nsi.edu | Resolved by subject fallback

Oliver Elphick

olly@lfix.co.uk

over 26 years ago

In reply to: Tom Lane (#4)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom Lane wrote:

Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
like like a bug you patched for him a few days ago:

I doubt it's the same bug --- the error message generated by the other
bug was different...

The error message I posted was the log output at level 3, which he doesn't
seem to be running; if you switch to logging at that level, look for
messages that mention pg_vlock.

As far as I recall, the only error message he has listed so far is from
psql, which merely records the unlink failure at the end of the vacuum.
Everything else looks remarkably like the vc_abort bug.

--
Vote against SPAM: http://www.politik-digital.de/spam/
========================================
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
"Watch ye therefore, and pray always, that ye may be
accounted worthy to escape all these things that shall
come to pass, and to stand before the Son of man."
Luke 21:36

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: Oliver Elphick (#5)

Re: [HACKERS] Aborted Transaction During Vacuum

"Oliver Elphick" <olly@lfix.co.uk> writes:

As far as I recall, the only error message he has listed so far is from
psql, which merely records the unlink failure at the end of the vacuum.
Everything else looks remarkably like the vc_abort bug.

But the vc_abort problem didn't cause a backend coredump --- it just
reported an error and failed to complete the vacuum, no?

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri13Aug1999190443+0100199908131804.TAA16836@linda.lfix.co.uk | Resolved by subject fallback

Oliver Elphick

olly@lfix.co.uk

over 26 years ago

In reply to: Tom Lane (#6)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom Lane wrote:

"Oliver Elphick" <olly@lfix.co.uk> writes:

As far as I recall, the only error message he has listed so far is from
psql, which merely records the unlink failure at the end of the vacuum.
Everything else looks remarkably like the vc_abort bug.

But the vc_abort problem didn't cause a backend coredump --- it just
reported an error and failed to complete the vacuum, no?

I got a coredump too; I never mentioned it, because I found the proximate
cause and curing that made it go away. When unlink failed in vc_shutdown,
it called ELOG and a segfault occurred a little later.

I forgot about that when your patch fixed the original problem.

Import Notes

Reply to msg id not found: MessagefromTomLanetgl@sss.pgh.pa.usofFri13Aug1999140901EDT.5526.934567741@sss.pgh.pa.us | Resolved by subject fallback

G. Anthony Reina

reina@nsi.edu

over 26 years ago

In reply to: Tom Lane (#6)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom Lane wrote:

But the vc_abort problem didn't cause a backend coredump --- it just
reported an error and failed to complete the vacuum, no?

regards, tom lane

Here's the error message again:

NOTICE: --Relation ex_ellipse_proc--
NOTICE: --Relation ex_ellipse_proc--
NOTICE: Pages 2545: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 660, MaxLen 660; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 1/0 sec.
NOTICE: Pages 2545: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 660, MaxLen 660; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 1/0 sec.
NOTICE: Index pkex_ellipse_proc: Pages 138; Tuples 30535. Elapsed 0/0 sec.
NOTICE: Index pkex_ellipse_proc: Pages 138; Tuples 30535. Elapsed 0/0 sec.
NOTICE: --Relation ex_ellipse_cell--
NOTICE: --Relation ex_ellipse_cell--
NOTICE: Pages 370: Changed 0, Reapped 0, Empty 0, New 0; Tup 6109: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 80, MaxLen 2736; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 0/0 sec.
NOTICE: Pages 370: Changed 0, Reapped 0, Empty 0, New 0; Tup 6109: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 80, MaxLen 2736; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 0/0 sec.
NOTICE: Index pkex_ellipse_cell: Pages 42; Tuples 6109. Elapsed 0/0 sec.
NOTICE: Index pkex_ellipse_cell: Pages 42; Tuples 6109. Elapsed 0/0 sec.
NOTICE: --Relation ex_ellipse_opto--
NOTICE: --Relation ex_ellipse_opto--
NOTICE: Pages 26356: Changed 0, Reapped 0, Empty 0, New 0; Tup 71475: Vac
0, Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 1760, MaxLen 6108; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 3/2 sec.
NOTICE: Pages 26356: Changed 0, Reapped 0, Empty 0, New 0; Tup 71475: Vac
0, Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 1760, MaxLen 6108; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 3/2 sec.
NOTICE: Index pkex_ellipse_opto: Pages 357; Tuples 71475. Elapsed 0/0 sec.
NOTICE: Index pkex_ellipse_opto: Pages 357; Tuples 71475. Elapsed 0/0 sec.
NOTICE: --Relation ex_ellipse_opto_proc--
NOTICE: --Relation ex_ellipse_opto_proc--
NOTICE: Pages 14742: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac
0, Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 1816, MaxLen 5900; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 1/1 sec.
NOTICE: Pages 14742: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac
0, Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 1816, MaxLen 5900; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
Pages 0/0. Elapsed 1/1 sec.
NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
sec.
NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
sec.
ERROR: vacuum: can't destroy lock file!
NOTICE: AbortTransaction and not in in-progress state
NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
We have lost the connection to the backend, so further processing is
impossible. Terminating.
[postgres@bigred ~]$

-Tony

G. Anthony Reina

reina@nsi.edu

over 26 years ago

In reply to: G. Anthony Reina (#8)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom Lane wrote:

However, there should have been an "ERROR" message if something reported
an error. You said you only saw "NOTICE: AbortTransaction and not in
in-progress state" and not any "ERROR" before or after it? The NOTICE
presumably came out of AbortTransaction itself, after whatever went
wrong went wrong...

Yes, I have an ERROR message (either I didn't notice it before or it is
new):

NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
sec.
NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
sec.
ERROR: vacuum: can't destroy lock file!
NOTICE: AbortTransaction and not in in-progress state
NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
We have lost the connection to the backend, so further processing is
impossible. Terminating.

It looks like the error is either occuring on table ex_ellipse_opto_proc or
the next table in the list ex_ellipse_proc. However, I think the error is
more general than that. It appears to occur just before the last table in
the database gets vacuumed. Here's the list of my tables:

Database    = db01
 +------------------+----------------------------------+----------+
 |  Owner           |             Relation             |   Type   |
 +------------------+----------------------------------+----------+
 | postgres         | center_out                       | table    |
 | postgres         | center_out_analog                | table    |
 | postgres         | center_out_analog_proc           | table    |
 | postgres         | center_out_cell                  | table    |
 | postgres         | center_out_cell_proc             | table    |
 | postgres         | center_out_opto                  | table    |
 | postgres         | center_out_opto_proc             | table    |
 | postgres         | center_out_pref_direction        | table    |
 | postgres         | center_out_proc                  | table    |
 | postgres         | electrode                        | table    |
 | postgres         | ellipse                          | table    |
 | postgres         | ellipse_analog                   | table    |
 | dan              | ellipse_analog_proc              | table    |
 | postgres         | ellipse_cell                     | table    |
 | dan              | ellipse_cell_proc                | table    |
 | postgres         | ellipse_opto                     | table    |
 | dan              | ellipse_opto_proc                | table    |
 | dan              | ellipse_proc                     | table    |
 | dan              | ex_ellipse                       | table    |
 | dan              | ex_ellipse_analog_proc           | table    |
 | dan              | ex_ellipse_cell                  | table    |
 | dan              | ex_ellipse_cell_proc             | table    |
 | dan              | ex_ellipse_opto                  | table    |
 | dan              | ex_ellipse_opto_proc             | table
|                             <---- ERROR occurs somewhere after here
 | dan              | ex_ellipse_proc                  | table    |
 +------------------+----------------------------------+----------+

Yesterday, I was adding tables in one by one from a previous pg_dump. The
error didn't pop up until after I had about 9 or 10 tables restored. I
didn't think about it then, but it may have always occured after the second
to last table in the list. But don't hold me to that.

In any case, I'll try to re-build everything like you've asked to get a
better error message. Maybe if I go through step-by-step again. You'll be
able to help me find where the error is taking place.

Thanks Tom and Oliver. I'll get back to you when I finish the rebuild.

-Tony

Import Notes

Reference msg id not found: 5637.934570558@sss.pgh.pa.us | Resolved by subject fallback

#10

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: G. Anthony Reina (#9)

Re: [HACKERS] Aborted Transaction During Vacuum

"G. Anthony Reina" <reina@nsi.edu> writes:

ERROR: vacuum: can't destroy lock file!
NOTICE: AbortTransaction and not in in-progress state
NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.

Ah-hah! Oliver is right, then, at least in part --- that error message
from vacuum suggests that the vc_abort bug *is* biting you. However,
there may be more going on, because what Oliver and others observed did
not include a coredump (at least not that I heard about).

You can probably suppress the problem by installing the patch I posted
to pgsql-patches a few days ago. However, I'd appreciate it if you'd
first try to reproduce the problem again with debug/assert turned on,
so that we can get some idea whether there is an additional bug that's
only biting you and not the previous reporters.

BTW, if vc_abort is involved then the occurrence of the problem probably
depends on whether any other backends are using the database and what
they are doing. (The vc_abort bug escaped notice up till last week
because it doesn't trigger when vacuum is the only thing running.)
Do you have other clients running when you do the vacuum? What are
they doing?

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri13Aug1999134205-070037B4831D.C9283D9E@nsi.edu | Resolved by subject fallback

#11

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: Tom Lane (#10)

Re: [HACKERS] Aborted Transaction During Vacuum

"Oliver Elphick" <olly@lfix.co.uk> writes:

But the vc_abort problem didn't cause a backend coredump --- it just
reported an error and failed to complete the vacuum, no?

I got a coredump too; I never mentioned it, because I found the proximate
cause and curing that made it go away. When unlink failed in vc_shutdown,
it called ELOG and a segfault occurred a little later.

Ah, I wish I'd known that. So what Tony is seeing is exactly the same
behavior you observed. OK, I feel better now --- I thought the coredump
was probably some platform-specific misbehavior that only Tony was seeing.

We still need to figure out what is causing it, because I can see no
reason for a coredump after vc_shutdown elog()s. Something is being
clobbered that should not be. But it sounds like installing the
vc_abort patch will get Tony on his feet, and then we can look for the
secondary bug at our leisure.

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri13Aug1999192831+0100199908131828.TAA17562@linda.lfix.co.uk | Resolved by subject fallback

#12

G. Anthony Reina

reina@nsi.edu

over 26 years ago

In reply to: Tom Lane (#11)

Re: [HACKERS] Aborted Transaction During Vacuum

Tom and Oliver,

I added the patch and re-built everything. No more problems with the
vacuum analyze (knock on wood since this is still Friday the 13th).

Tom, does this patch address the problem you found in your
EndPortalAllocMode comments? I'm just wondering whether another patch
will be needed or if this one should cover the problem.

Thanks again for the help guys.
-Tony

Tom wrote:

(a) vacuum.c does a CommitTransaction to commit its final table's
worth of fixes.

(b) CommitTransaction calls EndPortalAllocMode.

(c) vacuum calls vc_shutdown, which tries to remove pg_vlock,

and reports an error when the unlink() call fails.

Show quoted text

(d) during error cleanup, AbortTransaction is called.

(e) AbortTransaction calls EndPortalAllocMode. There has been no
intervening StartPortalAllocMode, so the portal's context stack
is empty. EndPortalAllocMode tries to free a nonexistent memory
context (or, if you have asserts turned on, dies with an assert
failure). Ka-boom.

Import Notes

Resolved by subject fallback

#13

Tom Lane

tgl@sss.pgh.pa.us

over 26 years ago

In reply to: G. Anthony Reina (#12)

Re: [HACKERS] Aborted Transaction During Vacuum

"G. Anthony Reina" <reina@nsi.edu> writes:

Tom, does this patch address the problem you found in your
EndPortalAllocMode comments? I'm just wondering whether another patch
will be needed or if this one should cover the problem.

I think you need not worry. EndPortalAllocMode has a bug that ought to
be fixed IMHO, but it's only triggered if an error is reported outside
of any transaction context, and there is darn little backend processing
that happens outside of any transaction context. This vacuum shutdown
check might be the only such error, in fact. In short, it's something
to clean up for 6.6, but I doubt it's worth issuing a 6.5 patch for.

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri13Aug1999211301-070037B4ECCD.DF56BA7D@nsi.edu | Resolved by subject fallback