Logical decoding error

Started by Mark Fletcherover 8 years ago3 messagesgeneral
Jump to latest
#1Mark Fletcher
markf@corp.groups.io

Hello,

Running Postgres 9.6.5, we're using logical decoding to take changes to the
database and propagate them elsewhere in our system. We are using the PGX
Go Postgres library, at https://github.com/jackc/pgx, and we are using the
test_decoding plugin to format the changes. We are using 6 slots/have 6
processes streaming the changes from our database.

This setup works great, except that every 20 hours or so, some or all of
the processes encounter a problem, all at the same time. They receive an
unexpected message type 'w'. At this point the processes restart, and when
they do, they encounter another error: "ERROR: got sequence entry 0 for
toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk
number/seq number varies). This causes them to restart again. They will
encounter the sequence entry error up to 3 more times, before things
magically start to work again.

We are also doing standard streaming replication to a slave off this
database, and that has never seen a problem.

Does this ring a bell for anyone? Do you have any suggestions for how I
should go about figuring out what's happening?

Thanks,
Mark

#2Steve Atkins
steve@blighty.com
In reply to: Mark Fletcher (#1)
Re: Logical decoding error

On Nov 2, 2017, at 9:34 AM, Mark Fletcher <markf@corp.groups.io> wrote:

Hello,

Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in our system. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the test_decoding plugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our database.

This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at the same time. They receive an unexpected message type 'w'. At this point the processes restart, and when they do, they encounter another error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up to 3 more times, before things magically start to work again.

We are also doing standard streaming replication to a slave off this database, and that has never seen a problem.

Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's happening?

Where are the errors coming from - your code or pgx? If it's from pgx, what's the exact error? ('w' is regular replication payload data, so it'd be expected as a copydata payload message type, but would be an error for a replication message).

Do you capture the raw data from the replication connection when the error happens?

(If you're using pgx you might be interested in https://github.com/wttw/pgoutput - it's support for the pgoutput logical decoder in PG10, which might be a bit more robust to deal with than the test_decoding one).

Cheers,
Steve

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Mark Fletcher
markf@corp.groups.io
In reply to: Steve Atkins (#2)
Re: Logical decoding error

On Thu, Nov 2, 2017 at 9:59 AM, Steve Atkins <steve@blighty.com> wrote:

Where are the errors coming from - your code or pgx? If it's from pgx,
what's the exact error? ('w' is regular replication payload data, so it'd
be expected as a copydata payload message type, but would be an error for a
replication message).

The errors are coming from PGX. Best I can tell (and I'm not very familiar

with the PGX code, nor with the internals of Postgres replication), it's
waiting for a replication message from the backend, so it's expecting
things like CopyData/CopyInResponse/CopyOutResponse/CopyBothResponse. It
gets a 'w' msg type there, which it doesn't handle, so it errors back to me
(the copydata message type it expects there is 'd').

I'm only using PGX to open a replication connection, start replication,
read replication messages and send heartbeats.

I did open an issue on GitHub about this a couple of days ago, but haven't
heard back from the PGX developer, so I thought I should try to help out
and debug it myself. So I've been trying to educate myself on replication
internals and such. Hence my email here.

Do you capture the raw data from the replication connection when the error

happens?

I was not, but I have added that and some other logging, and will report

back when I get more info.

Thanks,
Mark