Critical errors during logical decoding

Started by Colin Morelliabout 8 years ago5 messagesgeneral

colin.morelli@gmail.com

about 8 years ago

List,

For seemingly no reason at all, my logical replication slot has just
started bailing out with errors every time my client tries to connect to
it. Particularly - at this moment - I'm getting:

2018-02-07 19:14:31 UTC [3985-1] svc_app@app LOG: 00000: received
replication command: START_REPLICATION SLOT event_stream LOGICAL 0/0
("startup_params_format" '1', "no_txinfo" 'true', "expected_encoding"
'UTF8', "max_proto_version" '1', "proto_format" 'json', "min_proto_version"
'1')
2018-02-07 19:14:31 UTC [3985-2] svc_app@app LOCATION:
exec_replication_command, walsender.c:1341
2018-02-07 19:14:31 UTC [3985-3] svc_app@app LOG: 00000: starting logical
decoding for slot "event_stream"
2018-02-07 19:14:31 UTC [3985-4] svc_app@app DETAIL: streaming
transactions committing after 0/8DDFD48, reading WAL from 0/8DDDC00
2018-02-07 19:14:31 UTC [3985-5] svc_app@app LOCATION:
CreateDecodingContext, logical.c:408
2018-02-07 19:14:31 UTC [3985-6] svc_app@app LOG: 00000: logical decoding
found consistent point at 0/8DDDC00
2018-02-07 19:14:31 UTC [3985-7] svc_app@app DETAIL: There are no running
transactions.
2018-02-07 19:14:31 UTC [3985-8] svc_app@app LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1245
2018-02-07 19:14:31 UTC [3985-9] svc_app@app ERROR: XX000: no known
snapshots

Other errors have included missing toast entries, as well as missing base/
files on disk. I blew away the entire data directory, and restored the
database from a pg_dump to a fresh directory after checking filesystem
consistency. The logical replication stream operates for a few minutes, and
then eventually bails out again. This is Postgres 9.6.6 on Ubuntu installed
from packages.

Does anyone have any insight into what could be happening here? Other steps
to try and rectify the problem? The database itself does not appear to have
any issues that I can see.

Best,
Colin

Colin Morelli

colin.morelli@gmail.com

about 8 years ago

In reply to: Colin Morelli (#1)

Re: Critical errors during logical decoding

Appears to be a problem with pglogical_output when de-toasting columns.
Switched to wal2json and the problem went away. I believe I may have found
the issue anyway. Apologies for the noise.

Best,
Colin

On Wed, Feb 7, 2018 at 2:20 PM, Colin Morelli <colin.morelli@gmail.com>
wrote:

Show quoted text

List,

For seemingly no reason at all, my logical replication slot has just
started bailing out with errors every time my client tries to connect to
it. Particularly - at this moment - I'm getting:

2018-02-07 19:14:31 UTC [3985-1] svc_app@app LOG: 00000: received
replication command: START_REPLICATION SLOT event_stream LOGICAL 0/0
("startup_params_format" '1', "no_txinfo" 'true', "expected_encoding"
'UTF8', "max_proto_version" '1', "proto_format" 'json', "min_proto_version"
'1')
2018-02-07 19:14:31 UTC [3985-2] svc_app@app LOCATION:
exec_replication_command, walsender.c:1341
2018-02-07 19:14:31 UTC [3985-3] svc_app@app LOG: 00000: starting
logical decoding for slot "event_stream"
2018-02-07 19:14:31 UTC [3985-4] svc_app@app DETAIL: streaming
transactions committing after 0/8DDFD48, reading WAL from 0/8DDDC00
2018-02-07 19:14:31 UTC [3985-5] svc_app@app LOCATION:
CreateDecodingContext, logical.c:408
2018-02-07 19:14:31 UTC [3985-6] svc_app@app LOG: 00000: logical
decoding found consistent point at 0/8DDDC00
2018-02-07 19:14:31 UTC [3985-7] svc_app@app DETAIL: There are no
running transactions.
2018-02-07 19:14:31 UTC [3985-8] svc_app@app LOCATION:
SnapBuildFindSnapshot, snapbuild.c:1245
2018-02-07 19:14:31 UTC [3985-9] svc_app@app ERROR: XX000: no known
snapshots

Other errors have included missing toast entries, as well as missing base/
files on disk. I blew away the entire data directory, and restored the
database from a pg_dump to a fresh directory after checking filesystem
consistency. The logical replication stream operates for a few minutes, and
then eventually bails out again. This is Postgres 9.6.6 on Ubuntu installed
from packages.

Does anyone have any insight into what could be happening here? Other
steps to try and rectify the problem? The database itself does not appear
to have any issues that I can see.

Best,
Colin

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 8 years ago

In reply to: Colin Morelli (#2)

Re: Critical errors during logical decoding

On 02/07/2018 09:07 PM, Colin Morelli wrote:

Appears to be a problem with pglogical_output when de-toasting columns.
Switched to wal2json and the problem went away. I believe I may have
found the issue anyway. Apologies for the noise.

What pglogical version are you using? Can you share an example of data
causing this issue (or even better a script reproducing it)?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Colin Morelli

colin.morelli@gmail.com

about 8 years ago

In reply to: Tomas Vondra (#3)

Re: Critical errors during logical decoding

Hey Tomas,

We're using v1.1.0 of pglogical. I can't honestly say what could reproduce
this issue as I'm not entirely sure and haven't tested it yet.

I had thought it was caused when a toasted row is deleted before the
pglogical_output plugin processes the event (so it fails when going to
de-toast the row), but I would imagine that postgres internally should
prevent deleting data that would be needed by a replication stream. I must
admit I'm not well versed in this area, though. My understanding is
wal2json doesn't detoast the data at all, and thus doesn't have an issue
(for our case, we really just need the json data with an identifier telling
us that the row changed, we're not doing full logical replication).

Best,
Colin

On Wed, Feb 7, 2018 at 4:18 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Show quoted text

On 02/07/2018 09:07 PM, Colin Morelli wrote:

Appears to be a problem with pglogical_output when de-toasting columns.
Switched to wal2json and the problem went away. I believe I may have
found the issue anyway. Apologies for the noise.

What pglogical version are you using? Can you share an example of data
causing this issue (or even better a script reproducing it)?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 8 years ago

In reply to: Colin Morelli (#4)

Re: Critical errors during logical decoding

On 02/07/2018 10:22 PM, Colin Morelli wrote:

Hey Tomas,

We're using v1.1.0 of pglogical. I can't honestly say what could
reproduce this issue as I'm not entirely sure and haven't tested it yet.

Well, consider updating - there's 1.1.2 in that exact branch, and
there's also 2.1.1 now. I don't recall if there were bugs with such
symptoms, though.

I had thought it was caused when a toasted row is deleted before the
pglogical_output plugin processes the event (so it fails when going to
de-toast the row), but I would imagine that postgres internally should
prevent deleting data that would be needed by a replication stream. I
must admit I'm not well versed in this area, though. My understanding is
wal2json doesn't detoast the data at all, and thus doesn't have an issue
(for our case, we really just need the json data with an identifier
telling us that the row changed, we're not doing full logical replication).

That seems really unlikely, for a couple of reasons.

1) As you've correctly guessed, toasted values are protected by the same
MVCC mechanism as regular tables, so rows should not disappear when
transactions might still need them (including replication).

2) The detoasting happens in reorderbuffer, which is shared both by
pglogical and wal2json. So both should have been affected.

3) You've also mentioned missing files in base/ which suggests there's
something far more serious going on. Not sure what, though.

I suggest you do one or both of these things:

a) update to newer pglogical version and see if it fixes the issue

b) try preparing a self-contained reproducer

c) share a more complete version of the server logs (hard to say what's
the issue from just a few lines)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services