logical decoding bug when mapped relation with toast contents is rewritten repeatedly

Started by Andres Freundover 7 years ago4 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

(Tomas, CCing you because you IIRC mentioned encountered an issue like
this)

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR: could not map filenode "base/$X/$Y" to relation OID
error.

After discarding like 30 different theories, I have found the cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth) heap inserts with the toast data happen into the new toast
relation, triggered by:

static void
raw_heap_insert(RewriteState state, HeapTuple tup)
...
/*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*
* Note: below this point, heaptup is the data we actually intend to store
* into the relation; tup is the caller's original untoasted data.
*/
if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
{
/* toast table entries should never be recursively toasted */
Assert(!HeapTupleHasExternal(tup));
heaptup = tup;
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
HEAP_INSERT_SKIP_FSM |
(state->rs_use_wal ?
0 : HEAP_INSERT_SKIP_WAL));
else
heaptup = tup;

At that point the new toast relation does *NOT* appear to be a system
catalog, it's just appears as an "independent" table. Therefore we do
not trigger, in heap_insert():

/*
* RelationIsLogicallyLogged
* True if we need to log enough information to extract the data from the
* WAL stream.
*
* We don't log information for unlogged tables (since they don't WAL log
* anyway) and for system tables (their content is hard to make sense of, and
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
!IsCatalogRelation(relation))

/*
* For logical decoding, we need the tuple even if we're doing a full
* page write, so make sure it's included even if we take a full-page
* image. (XXX We could alternatively store a pointer into the FPW).
*/
if (RelationIsLogicallyLogged(relation))
{
xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
bufflags |= REGBUF_KEEP_DATA;
}

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
system table. Which we currently do not allow do be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the
following check:
if (!RelationIsLogicallyLogged(relation))
goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table. But if there's been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc. There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't mvcc.

Therefore the catalog-rewrite escape hatch of:
/*
* Catalog tuple without data, emitted while catalog was
* in the process of being rewritten.
*/
if (reloid == InvalidOid &&
change->data.tp.newtuple == NULL &&
change->data.tp.oldtuple == NULL)
goto change_done;
does not trigger and we run into:
else if (reloid == InvalidOid)
elog(ERROR, "could not map filenode \"%s\" to relation OID",
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));

commenting out this error / converting it into a warning makes this case
harmless, but could obviously be problematic in other scenarios.

I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach. Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.

The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.

Now I'm having a beer outside.

Greetings,

Andres Freund

Attachments:

decoding-fix.difftext/x-diff; charset=us-asciiDownload+826-5
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly

Hi,

On 09/14/2018 04:10 AM, Andres Freund wrote:

Hi,

(Tomas, CCing you because you IIRC mentioned encountered an issue like
this)

I might have mentioned an issue with this symptom recently, but that
turned out to be already fixed by da10d6a8a9 (before the minor version
with that fix got released).

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR: could not map filenode "base/$X/$Y" to relation OID
error.

After discarding like 30 different theories, I have found the cause:

Yeah. These issues are not exactly trivial to investigate ;-)

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth) heap inserts with the toast data happen into the new toast
relation, triggered by:

static void
raw_heap_insert(RewriteState state, HeapTuple tup)
...
/*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*
* Note: below this point, heaptup is the data we actually intend to store
* into the relation; tup is the caller's original untoasted data.
*/
if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
{
/* toast table entries should never be recursively toasted */
Assert(!HeapTupleHasExternal(tup));
heaptup = tup;
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
HEAP_INSERT_SKIP_FSM |
(state->rs_use_wal ?
0 : HEAP_INSERT_SKIP_WAL));
else
heaptup = tup;

At that point the new toast relation does *NOT* appear to be a system
catalog, it's just appears as an "independent" table. Therefore we do
not trigger, in heap_insert():

Hmm, can't we change that? Recognizing the new TOAST table as a catalog
would fix the issue, no?

/*
* RelationIsLogicallyLogged
* True if we need to log enough information to extract the data from the
* WAL stream.
*
* We don't log information for unlogged tables (since they don't WAL log
* anyway) and for system tables (their content is hard to make sense of, and
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
!IsCatalogRelation(relation))

/*
* For logical decoding, we need the tuple even if we're doing a full
* page write, so make sure it's included even if we take a full-page
* image. (XXX We could alternatively store a pointer into the FPW).
*/
if (RelationIsLogicallyLogged(relation))
{
xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
bufflags |= REGBUF_KEEP_DATA;
}

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
system table. Which we currently do not allow do be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the
following check:
if (!RelationIsLogicallyLogged(relation))
goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table. But if there's been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc. There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't mvcc.

Therefore the catalog-rewrite escape hatch of:
/*
* Catalog tuple without data, emitted while catalog was
* in the process of being rewritten.
*/
if (reloid == InvalidOid &&
change->data.tp.newtuple == NULL &&
change->data.tp.oldtuple == NULL)
goto change_done;
does not trigger and we run into:
else if (reloid == InvalidOid)
elog(ERROR, "could not map filenode \"%s\" to relation OID",
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));

commenting out this error / converting it into a warning makes this case
harmless, but could obviously be problematic in other scenarios.

Yeah, that seems like a bad idea. That error already caught a couple of
bugs (including da10d6a8a9 and this one), and I have a hunch those are
not the last ones.

I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach. Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.

The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.

Now I'm having a beer outside.

After discarding 30 theories? Have two.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#2)
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly

Hi,

On 2018-09-14 16:13:46 +0200, Tomas Vondra wrote:

I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach. Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.

The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.

Now I'm having a beer outside.

Yeah, that seems like a bad idea. That error already caught a couple of
bugs (including da10d6a8a9 and this one), and I have a hunch those are
not the last ones.

One problem with this is that that means upgrading won't fix an existing
instance of the problem, but turning the ERROR into a WARNING would. I
personally think that's *NOT* enough justification for relaxing the
error, given that recreating the slot would fix the issue, but I see how
other people can reasonably differ. I can't really see a reasonably
complex approach that solves the issue in a "cake but have it too"
way...

After discarding 30 theories? Have two.

I will neither confirm nor deny. ;)

Greetings,

Andres Freund

#4Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly

Hi,

On 2018-09-13 19:10:46 -0700, Andres Freund wrote:

(Tomas, CCing you because you IIRC mentioned encountered an issue like
this)

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR: could not map filenode "base/$X/$Y" to relation OID
error.

After discarding like 30 different theories, I have found the cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth) heap inserts with the toast data happen into the new toast
relation, triggered by:

At that point the new toast relation does *NOT* appear to be a system
OA catalog, it's just appears as an "independent" table. Therefore we do
not trigger, in heap_insert():

/*
* RelationIsLogicallyLogged
* True if we need to log enough information to extract the data from the
* WAL stream.
*
* We don't log information for unlogged tables (since they don't WAL log
* anyway) and for system tables (their content is hard to make sense of, and
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
!IsCatalogRelation(relation))

/*
* For logical decoding, we need the tuple even if we're doing a full
* page write, so make sure it's included even if we take a full-page
* image. (XXX We could alternatively store a pointer into the FPW).
*/
if (RelationIsLogicallyLogged(relation))
{
xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
bufflags |= REGBUF_KEEP_DATA;
}

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
system table. Which we currently do not allow do be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the
following check:
if (!RelationIsLogicallyLogged(relation))
goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table. But if there's been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc. There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't mvcc.

Therefore the catalog-rewrite escape hatch of:
/*
* Catalog tuple without data, emitted while catalog was
* in the process of being rewritten.
*/
if (reloid == InvalidOid &&
change->data.tp.newtuple == NULL &&
change->data.tp.oldtuple == NULL)
goto change_done;
does not trigger and we run into:
else if (reloid == InvalidOid)
elog(ERROR, "could not map filenode \"%s\" to relation OID",
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));

commenting out this error / converting it into a warning makes this case
harmless, but could obviously be problematic in other scenarios.

I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach. Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.

The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.

I've pushed this now. I added more tests,which found an issue around
with the change around rewrites of non-mapped catalog tables.

Greetings,

Andres Freund