Some pgq table rewrite incompatibility with logical decoding?
I am hoping someone here can shed some light on this issue - I apologize if
this isn't the right place to ask, but I'm almost sure some of you were
involved in pgq's development and might be able to answer this.
We are actually running 2 replication technologies on a few of our dbs,
skytools and pglogical. Although we are moving towards only using logical
decoding-based replication, right now we have both for different purposes.
There seems to be a table rewrite happening on table pgq.event_58_1 (it has
happened twice now), and it ends up in the decoding stream, resulting in the
following error:
ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
OID"
In retracing what happened, we discovered that this relfilenode was
rewritten. But somehow it is ending up in the logical decoding stream and is
"undecodable". This is pretty disastrous, because the only real way to fix it
is to advance the replication slot and lose data.
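For anyone retracing such an error, here is a minimal sketch of how a filenode
can be checked against the catalog (an illustration only, assuming the default
tablespace; the numbers and table name are taken from the error above;
pg_filenode_relation and pg_relation_filenode are standard built-in functions):

    -- Does the filenode from the error message still map to any relation?
    -- The first argument is the tablespace OID; 0 means the database's
    -- default tablespace (the "base/16418/..." path above).
    SELECT pg_filenode_relation(0, 1173394526);

    -- Compare with the table's current filenode; a rewrite (TRUNCATE,
    -- VACUUM FULL, CLUSTER, ...) assigns a new one.
    SELECT pg_relation_filenode('pgq.event_58_1');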
The only obvious table rewrite I can find in the pgq codebase is a truncate
in pgq.maint_rotate_tables.sql. But there isn't anything surprising
there. If anyone has any ideas as to what might cause this so that we
could somehow mitigate the possibility of this happening again until we
move off pgq, that would be much appreciated.
Thanks,
Jeremy
Hi,
On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
I am hoping someone here can shed some light on this issue - I apologize if
this isn't the right place to ask, but I'm almost sure some of you were
involved in pgq's development and might be able to answer this.
We are actually running 2 replication technologies on a few of our dbs,
skytools and pglogical. Although we are moving towards only using logical
decoding-based replication, right now we have both for different purposes.
There seems to be a table rewrite happening on table pgq.event_58_1 (it has
happened twice now), and it ends up in the decoding stream, resulting in the
following error:
ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
OID"
In retracing what happened, we discovered that this relfilenode was
rewritten. But somehow it is ending up in the logical decoding stream and is
"undecodable". This is pretty disastrous, because the only real way to fix it
is to advance the replication slot and lose data.
The only obvious table rewrite I can find in the pgq codebase is a truncate
in pgq.maint_rotate_tables.sql. But there isn't anything surprising
there. If anyone has any ideas as to what might cause this so that we
could somehow mitigate the possibility of this happening again until we
move off pgq, that would be much appreciated.
I suspect the issue might be that pgq does some updates to catalog
tables. Is that indeed the case?
Greetings,
Andres Freund
On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
I am hoping someone here can shed some light on this issue - I apologize if
this isn't the right place to ask, but I'm almost sure some of you were
involved in pgq's development and might be able to answer this.
We are actually running 2 replication technologies on a few of our dbs,
skytools and pglogical. Although we are moving towards only using logical
decoding-based replication, right now we have both for different purposes.
There seems to be a table rewrite happening on table pgq.event_58_1 (it has
happened twice now), and it ends up in the decoding stream, resulting in the
following error:
ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
OID"
In retracing what happened, we discovered that this relfilenode was
rewritten. But somehow it is ending up in the logical decoding stream and is
"undecodable". This is pretty disastrous, because the only real way to fix it
is to advance the replication slot and lose data.
The only obvious table rewrite I can find in the pgq codebase is a truncate
in pgq.maint_rotate_tables.sql. But there isn't anything surprising
there. If anyone has any ideas as to what might cause this so that we
could somehow mitigate the possibility of this happening again until we
move off pgq, that would be much appreciated.
I suspect the issue might be that pgq does some updates to catalog
tables. Is that indeed the case?
I also suspected this. The only case of this I found is that it does
deletes and inserts on pg_autovacuum. I could not quickly find anything
else, but I'm not sure whether I'm missing something in some of the C code.
Thanks,
Jeremy
On 06/25/2018 07:48 PM, Jeremy Finzel wrote:
On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
I am hoping someone here can shed some light on this issue - I apologize if
this isn't the right place to ask, but I'm almost sure some of you were
involved in pgq's development and might be able to answer this.
We are actually running 2 replication technologies on a few of our dbs,
skytools and pglogical. Although we are moving towards only using logical
decoding-based replication, right now we have both for different purposes.
There seems to be a table rewrite happening on table pgq.event_58_1 (it has
happened twice now), and it ends up in the decoding stream, resulting in the
following error:
ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
OID"
In retracing what happened, we discovered that this relfilenode was
rewritten. But somehow it is ending up in the logical decoding stream and is
"undecodable". This is pretty disastrous, because the only real way to fix it
is to advance the replication slot and lose data.
The only obvious table rewrite I can find in the pgq codebase is a truncate
in pgq.maint_rotate_tables.sql. But there isn't anything surprising
there. If anyone has any ideas as to what might cause this so that we
could somehow mitigate the possibility of this happening again until we
move off pgq, that would be much appreciated.
I suspect the issue might be that pgq does some updates to catalog
tables. Is that indeed the case?
I also suspected this. The only case of this I found is that it does
deletes and inserts on pg_autovacuum. I could not quickly find anything
else, but I'm not sure whether I'm missing something in some of the C code.
I don't think that's true, for two reasons.
Firstly, I don't think pgq updates catalogs directly; it simply
truncates the queue tables when rotating them (which updates the
relfilenode in pg_class, of course).
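(A minimal illustration of that point, with a hypothetical table name; a
plain TRUNCATE assigns a brand new relfilenode:)

    -- Hypothetical stand-in for a pgq queue table.
    CREATE TABLE demo_queue (id bigint, payload text);

    -- Filenode currently recorded for the table.
    SELECT pg_relation_filenode('demo_queue');

    TRUNCATE demo_queue;

    -- The filenode reported now differs from the one above, because
    -- TRUNCATE rewrites the table into a new file.
    SELECT pg_relation_filenode('demo_queue');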
Secondly, we're occasionally seeing this on systems that do not use pgq,
but that do VACUUM FULL on custom "queue" tables. The symptoms are
exactly the same ("ERROR: could not map filenode"). It's pretty damn
rare and we don't have direct access to the systems, so investigation is
difficult :-( Our current hypothesis is that it's somewhat related to
subtransactions (because of triggers with exception blocks).
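(For illustration, a sketch with hypothetical names of such a trigger; in
plpgsql, a BEGIN ... EXCEPTION block is executed inside a subtransaction so
that it can be rolled back on error:)

    CREATE OR REPLACE FUNCTION queue_insert_trigger() RETURNS trigger AS $$
    BEGIN
        BEGIN
            -- Hypothetical audit table; this insert runs in a subtransaction.
            INSERT INTO queue_audit (event_time, payload)
            VALUES (now(), row_to_json(NEW));
        EXCEPTION WHEN others THEN
            -- Only the subtransaction is rolled back; the main one continues.
            RAISE NOTICE 'audit insert failed: %', SQLERRM;
        END;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;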
Jeremy, are you able to reproduce the issue locally, using pgq? That
would be very valuable.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I don't think that's true, for two reasons.
Firstly, I don't think pgq updates catalogs directly; it simply truncates
the queue tables when rotating them (which updates the relfilenode in
pg_class, of course).
Secondly, we're occasionally seeing this on systems that do not use pgq,
but that do VACUUM FULL on custom "queue" tables. The symptoms are exactly
the same ("ERROR: could not map filenode"). It's pretty damn rare and we
don't have direct access to the systems, so investigation is difficult :-(
Our current hypothesis is that it's somewhat related to subtransactions
(because of triggers with exception blocks).
Jeremy, are you able to reproduce the issue locally, using pgq? That would
be very valuable.
We have tried but have been unable to reproduce it. If we do encounter it
again, we plan to report back and see if we can do some deep debugging.
Thanks,
Jeremy
Jeremy, are you able to reproduce the issue locally, using pgq? That would
be very valuable.
Tomas et al:
We have hit this error again, and we plan to snapshot the database so as to
be able to do whatever troubleshooting we can. If someone could provide
guidance as to what exactly you would like me to do, please let me know. I
am able to provide an xlog dump and also debugging information upon request.
This is actually a different database system that also uses skytools, and
the exact same table (pgq.event_58_1) is again the cause of the relfilenode
error. Using pg_xlogdump, I found the point at which this relfilenode
appears, did a point-in-time recovery to just after that point, and verified
that this was the table that appeared and then disappeared.
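(One possible way to verify which table owned a given filenode after such a
point-in-time recovery, assuming the filenode 2800559918 from the error in the
logs below, is to ask pg_class directly:)

    -- Which relation, if any, owns this filenode at this point in time?
    SELECT oid::regclass AS relation, relfilenode
    FROM pg_class
    WHERE relfilenode = 2800559918;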
Here is the exact issue I am hitting again:
Provider logs:
2018-08-28 08:29:30.334
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,2,"authentication",2018-08-28
08:29:30 CDT,289/78900643,0,LOG,00000,"replication connection authorized:
user=foo_logical SSL enabled (protocol=TLSv1.2,
cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)",,,,,,,,,""
2018-08-28 08:29:30.366
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,3,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,00000,"starting logical decoding for slot
""pgl_foo_prod_providerb97b25d_foo336ddc1""","streaming transactions
committing after C0A5/EEBBD4E8, reading WAL from
C0A5/EC723AC8",,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.366
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,4,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,00000,"logical decoding found consistent point at
C0A5/EC723AC8","There are no running transactions.",,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.448
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,5,"idle",2018-08-28
08:29:30 CDT,289/0,0,ERROR,XX000,"could not map filenode
""base/16418/2800559918"" to relation OID",,,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.463
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,6,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,08006,"could not receive data from client:
Connection reset by peer",,,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.463
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,7,"idle",2018-08-28
08:29:30 CDT,,0,LOG,00000,"disconnection: session time: 0:00:00.135
user=foo_logical database=foo_prod host=0.0.0.0
port=48314",,,,,,,,,"bar_foo_foo_a"
Subscriber logs:
2018-08-28 08:32:10.040 CDT,,,71810,,5b854eda.11882,1,,2018-08-28 08:32:10
CDT,7/0,0,LOG,00000,"starting apply for subscription
bar_foo_foo_a",,,,,,,,,"pglogical apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,2,,2018-08-28 08:32:10
CDT,7/242099,123886525,ERROR,XX000,"data stream ended",,,,,,,,,"pglogical
apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,3,,2018-08-28 08:32:10
CDT,7/0,0,LOG,00000,"apply worker [71810] at slot 1 generation 366 exiting
with error",,,,,,,,,"pglogical apply 16418:862837778"
2018-08-28 08:32:10.179 CDT,,,27173,,5b61d336.6a25,373,,2018-08-01 10:35:18
CDT,,0,LOG,00000,"worker process: pglogical apply 16418:862837778 (PID
71810) exited with exit code 1",,,,,,,,,""
Thanks,
Jeremy
We have hit this error again, and we plan to snapshot the database so as to
be able to do whatever troubleshooting we can.
I am happy to report that we were able to get replication working again by
running snapshots of the systems in question on servers running the latest
point release, 9.6.10; replication simply works and skips over the
relfilenodes that previously caused errors. So whatever fixes were made to
logical decoding in this point release seem to have fixed the issue.
Thanks,
Jeremy
Hi Jeremy,
On 08/28/2018 10:46 PM, Jeremy Finzel wrote:
We have hit this error again, and we plan to snapshot the database so as
to be able to do whatever troubleshooting we can.
I am happy to report that we were able to get replication working again
by running snapshots of the systems in question on servers running the
latest point release, 9.6.10; replication simply works and skips over the
relfilenodes that previously caused errors. So whatever fixes were made to
logical decoding in this point release seem to have fixed the issue.
Interesting.
So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?
I quickly went through the commits in 9.6 branch between 9.6.9 and
9.6.10, looking for stuff that might be related, and these three commits
seem possibly related (usually because of invalidations, vacuum, ...):
6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
Fix bugs in vacuum of shared rels, by keeping their relcache entries
current.
da10d6a8a94eec016fa072d007bced9159a28d39
Fix "base" snapshot handling in logical decoding
0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
Add table relcache invalidation to index builds.
But it's hard to say if/which of those commits did the trick, without
more information.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
Jeremy, are you able to reproduce the issue locally, using pgq?
That would be very valuable.
Tomas et al:
We have hit this error again, and we plan to snapshot the database so as to
be able to do whatever troubleshooting we can. If someone could provide
guidance as to what exactly you would like me to do, please let me know. I
am able to provide an xlog dump and also debugging information upon request.
This is actually a different database system that also uses skytools, and
the exact same table (pgq.event_58_1) is again the cause of the relfilenode
error. Using pg_xlogdump, I found the point at which this relfilenode
appears, did a point-in-time recovery to just after that point, and verified
that this was the table that appeared and then disappeared.
Great!
Can you attach to the decoding process using gdb, set a breakpoint at the
elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN / record it
fails?
If it fails too fast, making it difficult to attach gdb before the
crash, adding the LSN to the log message might be easier.
Once we have the LSN, it would be useful to see the pg_xlogdump
before/around that position.
Another interesting piece of information would be to know the contents
of the relmapper cache (which essentially means stepping through
RelationMapFilenodeToOid or something like that).
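(As background, the relmapper only covers mapped relations, whose
pg_class.relfilenode is stored as zero; a quick way to list them is:)

    -- Mapped relations keep their real filenode in the relation mapper,
    -- not in pg_class, so relfilenode shows up as 0 here.
    SELECT oid::regclass AS relation
    FROM pg_class
    WHERE relfilenode = 0;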
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 08/29/2018 12:01 AM, Tomas Vondra wrote:
On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
Jeremy, are you able to reproduce the issue locally, using pgq?
That would be very valuable.
Tomas et al:
We have hit this error again, and we plan to snapshot the database so as to
be able to do whatever troubleshooting we can. If someone could provide
guidance as to what exactly you would like me to do, please let me know. I
am able to provide an xlog dump and also debugging information upon request.
This is actually a different database system that also uses skytools, and
the exact same table (pgq.event_58_1) is again the cause of the relfilenode
error. Using pg_xlogdump, I found the point at which this relfilenode
appears, did a point-in-time recovery to just after that point, and verified
that this was the table that appeared and then disappeared.
Great!
Can you attach to the decoding process using gdb, set a breakpoint at the
elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN / record it
fails?
If it fails too fast, making it difficult to attach gdb before the
crash, adding the LSN to the log message might be easier.
Once we have the LSN, it would be useful to see the pg_xlogdump
before/around that position.
Another interesting piece of information would be to know the contents
of the relmapper cache (which essentially means stepping through
RelationMapFilenodeToOid or something like that).
Have you managed to get the backtrace, or investigate where exactly it
fails (which LSN etc.)? We've managed to get a backtrace for "our"
failure, and it'd be interesting to compare those.
Attached is a subset of pg_waldump, for the two relevant transactions.
2554301859 is the transaction doing VACUUM FULL on the user table (so
essentially the pgq table), and 2554301862 (with a single 2554301862
subxact, likely due to exception handled in plpgsql function) is the
transaction that fails during decoding - on the very first WAL record
after the 2554301859 commit.
In reality there are records from additional transactions intermixed,
but those are irrelevant I believe. It's 2554301862 that fails, because
it commits first (and thus gets decoded first).
At first I thought this might be related to the "could not read block 3"
issue discussed in another thread, but if you say upgrading to 9.6.10
fixes this, that seems unlikely (because that issue is still there).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Interesting.
So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?
I am reposting this to the list and not only to Tomas. Tomas, I can't
promise just yet to delve into this because, given that the patch fixes the
issue, it's obviously a much lower priority for our team. Are you hoping for
me to confirm the exact scenario in which the 9.6.10 patch fixes the bug?
Actually, there were more things changed than that, so I'm not positive it
was the last patch:
BEFORE:
Provider - 9.6.8-1.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
Subscriber - 9.6.9-2.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
AFTER:
Provider - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
Subscriber - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
I quickly went through the commits in 9.6 branch between 9.6.9 and
9.6.10, looking for stuff that might be related, and these three commits
seem possibly related (usually because of invalidations, vacuum, ...):
6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
Fix bugs in vacuum of shared rels, by keeping their relcache entries
current.
da10d6a8a94eec016fa072d007bced9159a28d39
Fix "base" snapshot handling in logical decoding
0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
Add table relcache invalidation to index builds.
But it's hard to say if/which of those commits did the trick, without
more information.
Let me know if that info gives you any more insight - it was actually a jump
of 2 point versions for the provider and 1 for the subscriber.
Thanks,
Jeremy
On 01/09/18 18:25, Jeremy Finzel wrote:
Interesting.
So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?
I am reposting this to the list and not only to Tomas. Tomas, I can't
promise just yet to delve into this because, given that the patch fixes the
issue, it's obviously a much lower priority for our team. Are you hoping for
me to confirm the exact scenario in which the 9.6.10 patch fixes the bug?
Actually, there were more things changed than that, so I'm not positive it
was the last patch:
BEFORE:
Provider - 9.6.8-1.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
Subscriber - 9.6.9-2.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
AFTER:
Provider - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
Subscriber - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
I finally managed to reproduce this, and it is indeed fixed in 9.6.10 by
commit da10d6a. It was caused by the issue with subtransactions and
SnapBuilder snapshots described there.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services