ALTER TABLE ALTER COLUMN SET TYPE crash

Started by Robins Tharakanover 5 years ago10 messagesbugs
Jump to latest
#1Robins Tharakan
tharakan@gmail.com

Hi,

Unlike a recently reported similar issue, executing the following ALTER
TABLE on the regression database crashes Postgres (master).

Admittedly it doesn't do anything constructive (and am new to the tool),
but do let me know if such reports are interesting and / or if you need
more details for reproduction.

Steps to reproduce:
==================

ubuntu@laptop:~$ dropdb --if-exists tempdel && createdb tempdel
ubuntu@laptop:~$ ./pg_regress --use-existing --schedule=serial_schedule
--dbname=tempdel >/dev/null
ubuntu@laptop:~$ sleep 5 && psql -c "SELECT version();" tempdel
version
--------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
(1 row)

ubuntu@laptop:~$ psql -c "begin; ALTER TABLE ONLY public.xacttest ALTER
COLUMN a SET DATA TYPE TEXT USING public.max_xacttest(); ROLLBACK;"
tempdel
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
ubuntu@laptop:~$ sleep 5 && psql -c "SELECT 1;" tempdel
?column?
----------
1
(1 row)

-
robins tharakan

#2Michael Paquier
michael@paquier.xyz
In reply to: Robins Tharakan (#1)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On Tue, Aug 25, 2020 at 02:14:06PM +1000, Robins Tharakan wrote:

Unlike a recently reported similar issue, executing the following ALTER
TABLE on the regression database crashes Postgres (master).

Admittedly it doesn't do anything constructive (and am new to the tool),
but do let me know if such reports are interesting and / or if you need
more details for reproduction.

Such reports are constructive! I can reproduce the crash here down to
9.5. From what I can see, the problem comes from ATRewriteTable() ->
ExecEvalExpr() when we evaluate expressions with inputs coming from
the old tuple. It looks like a memory corruption issue or a context
issue at quick glance, and I cannot get a clean backtrace.
--
Michael

#3Bruce Momjian
bruce@momjian.us
In reply to: Michael Paquier (#2)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On Tue, Aug 25, 2020 at 03:05:11PM +0900, Michael Paquier wrote:

On Tue, Aug 25, 2020 at 02:14:06PM +1000, Robins Tharakan wrote:

Unlike a recently reported similar issue, executing the following ALTER
TABLE on the regression database crashes Postgres (master).

Admittedly it doesn't do anything constructive (and am new to the tool),
but do let me know if such reports are interesting and / or if you need
more details for reproduction.

Such reports are constructive! I can reproduce the crash here down to
9.5. From what I can see, the problem comes from ATRewriteTable() ->
ExecEvalExpr() when we evaluate expressions with inputs coming from
the old tuple. It looks like a memory corruption issue or a context
issue at quick glance, and I cannot get a clean backtrace.

OK, I have generated a attached backtrace by manually walking down
through the stack using break points. Notice len2 on the top line:

#0 varstr_cmp (arg1=0x55e0ef64262c "\252\061\306BT\026", len1=21,
arg2=0x7f274db17ebc "-\317\303=T\026", len2=-4, collid=100) at
varlena.c:1633 -------

-4 is not a good len. I will do more research to see why that -4 len is
happening.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

Attachments:

bt.txttext/plain; charset=us-asciiDownload
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

Bruce Momjian <bruce@momjian.us> writes:

On Tue, Aug 25, 2020 at 03:05:11PM +0900, Michael Paquier wrote:

On Tue, Aug 25, 2020 at 02:14:06PM +1000, Robins Tharakan wrote:

Unlike a recently reported similar issue, executing the following ALTER
TABLE on the regression database crashes Postgres (master).

Such reports are constructive! I can reproduce the crash here down to
9.5. From what I can see, the problem comes from ATRewriteTable() ->
ExecEvalExpr() when we evaluate expressions with inputs coming from
the old tuple. It looks like a memory corruption issue or a context
issue at quick glance, and I cannot get a clean backtrace.

I think the nature of the problem (and Robins' other report too) is pretty
clear. We have a SQL or plpgsql function that's trying to access a table
that is inconsistent during an ALTER TABLE operation. The function would
be locked out from seeing that transient state if it were in another
session, thanks to normal locking rules; but the lock acquisition rules
don't prevent same-session accesses.

ALTER TABLE and related utility commands contain guards against the
reverse form of this problem: CheckTableNotInUse will bitch if there's
some already-active outer query referencing the table. But we haven't
thought about the possibility of one of these commands executing
user-defined code midway through.

regards, tom lane

#5Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On Tue, Aug 25, 2020 at 12:14:20PM -0400, Tom Lane wrote:

I think the nature of the problem (and Robins' other report too) is pretty
clear. We have a SQL or plpgsql function that's trying to access a table
that is inconsistent during an ALTER TABLE operation. The function would
be locked out from seeing that transient state if it were in another
session, thanks to normal locking rules; but the lock acquisition rules
don't prevent same-session accesses.

ALTER TABLE and related utility commands contain guards against the
reverse form of this problem: CheckTableNotInUse will bitch if there's
some already-active outer query referencing the table. But we haven't
thought about the possibility of one of these commands executing
user-defined code midway through.

That does explain why it fails on the _second_ time through those
lower-level functions.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

#6Michael Paquier
michael@paquier.xyz
In reply to: Bruce Momjian (#5)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On Tue, Aug 25, 2020 at 01:35:11PM -0400, Bruce Momjian wrote:

On Tue, Aug 25, 2020 at 12:14:20PM -0400, Tom Lane wrote:

I think the nature of the problem (and Robins' other report too) is pretty
clear. We have a SQL or plpgsql function that's trying to access a table
that is inconsistent during an ALTER TABLE operation. The function would
be locked out from seeing that transient state if it were in another
session, thanks to normal locking rules; but the lock acquisition rules
don't prevent same-session accesses.

There are already some safeguards to prevent directly the use of
aggregates in USING, and here we have a function that itself calls an
aggregate on the table.

ALTER TABLE and related utility commands contain guards against the
reverse form of this problem: CheckTableNotInUse will bitch if there's
some already-active outer query referencing the table. But we haven't
thought about the possibility of one of these commands executing
user-defined code midway through.

Hmm. That's not an easy problem as this would need restrictions for
high-level commands directly in the executor.. And it is not like you
should restrict everything either.
--
Michael

#7Robins Tharakan
tharakan@gmail.com
In reply to: Michael Paquier (#2)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On Tue, 25 Aug 2020 at 16:05, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 25, 2020 at 02:14:06PM +1000, Robins Tharakan wrote:

Admittedly it doesn't do anything constructive (and am new to the tool),
but do let me know if such reports are interesting and / or if you need
more details for reproduction.

Such reports are constructive!

--

Michael

Thanks all for picking this up promptly and Michael for confirming
the need. I'll try to add support for more DDLs and see if it can throw up
similar corner-cases.

For now though, I need to see how to exclude some functions. The above
use-case has pretty much rendered my test-setup useless - since it easily
triggers engine restarts every few minutes.
-
robins

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#6)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Aug 25, 2020 at 01:35:11PM -0400, Bruce Momjian wrote:

On Tue, Aug 25, 2020 at 12:14:20PM -0400, Tom Lane wrote:

I think the nature of the problem (and Robins' other report too) is pretty
clear. We have a SQL or plpgsql function that's trying to access a table
that is inconsistent during an ALTER TABLE operation. The function would
be locked out from seeing that transient state if it were in another
session, thanks to normal locking rules; but the lock acquisition rules
don't prevent same-session accesses.

There are already some safeguards to prevent directly the use of
aggregates in USING, and here we have a function that itself calls an
aggregate on the table.

That's an independent issue though. It stems mostly from not wanting
to use the full-scale planner or executor for subexpressions of utility
commands. (Of course, the PL function handler does so internally, but
that's its problem.)

The core issue here is that the table's catalog entries, as visible within
this transaction, don't match what's in its disk file. ALTER TABLE knows
that and is careful not to touch the old table using the new rowtype ---
but nothing else knows that. So I think we need to block other code
from touching the table-under-modification. As you say, there's not
any existing infrastructure that will serve for that directly. We might
be able to invent something comparable to the existing relcache entry
refcount, but counting exclusive opens ("there can be only one").

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

I wrote:

The core issue here is that the table's catalog entries, as visible within
this transaction, don't match what's in its disk file. ALTER TABLE knows
that and is careful not to touch the old table using the new rowtype ---
but nothing else knows that. So I think we need to block other code
from touching the table-under-modification. As you say, there's not
any existing infrastructure that will serve for that directly. We might
be able to invent something comparable to the existing relcache entry
refcount, but counting exclusive opens ("there can be only one").

I initially tried to fix this with a hard restriction that you can't open
a relation that's marked as being exclusively accessed, as in the 0001
patch below. That crashed and burned most spectacularly. To understand
the second attempt, it's helpful to enumerate some of the problems:

1. We have felt free to allow ALTER TABLE and similar commands to call
subroutines that re-open the target relation. Refactoring to the point
where that wouldn't happen (by means of always passing down an open
Relation) seems entirely impractical.

2. Keeping the exclusive-access state in relcache entries, as I first
tried to do, simply doesn't work, because ALTER TABLE doesn't
hold the relcache entry open throughout --- for instance, the initial
pin is released just after Phase 1 in ATController. I did try not
doing that, but it breaks other things. Besides I'm pretty sure that
that's designed that way intentionally, to reduce the number of forced
rebuilds of the relcache entry.

3. We can't have CheckTableNotInUse enforcing this either, because trying
to do so fails in numerous situation. ALTER TABLE itself is recursive to
some extent, and it turns out that most of the other callers neither need
nor want any such restriction. For instance there's actually a regression
test that fails if TRUNCATE tries to lock out accesses by called code.

In view of point 1, I have set this up so that we only check for
disallowed re-entrancy in the parse/rewrite/plan/execute code path.
(The checks added there are in the places where we first acquire lock
on any given relation.) Perhaps there are other places where we need
to check, but this at least plugs the primary hole.

In view of point 2, there's a new separate hashtable that holds the
exclusive-marking state. The extra cycles to probe it are slightly
annoying, but in the big scheme of things I doubt it's measurable.
(In any case, in a typical session that has never done ALTER TABLE,
no lookups are required.)

As for point 3, right now only ALTER TABLE is applying such marking
at all, and it's careful to release it as soon as consistency has
been restored. I looked through the other CheckTableNotInUse
callers and tentatively concluded that none of the other ones have
a problem; but more eyeballs on that would be a good thing.

Anyway, 0001 is just for amusement's sake; 0002 is the proposed patch.

regards, tom lane

Attachments:

0001-exclusive-access-FAIL.patchtext/x-diff; charset=us-ascii; name=0001-exclusive-access-FAIL.patchDownload+14-1
0002-exclusive-access-checks.patchtext/x-diff; charset=us-ascii; name=0002-exclusive-access-checks.patchDownload+258-12
#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robins Tharakan (#7)
Re: ALTER TABLE ALTER COLUMN SET TYPE crash

On 2020-Aug-26, Robins Tharakan wrote:

For now though, I need to see how to exclude some functions. The above
use-case has pretty much rendered my test-setup useless - since it easily
triggers engine restarts every few minutes.

An easy workaround from Postgres POV might be to patch it to return an
error on ATExecAlterColumnType or so.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services