Timeline following for logical slots
Hi all
Per discussion on the failover slots thread (
https://commitfest.postgresql.org/9/488/) I'm splitting timeline following
for logical slots into its own separate patch.
The attached patch fixes an issue I found while testing the prior revision:
it would read WAL from WAL segments on the old timeline up until the
timeline switch boundary, but this doesn't work if the last WAL segment on
the timeline has been renamed to append the .partial suffix.
Instead it's necessary to eagerly switch to reading the WAL segment from
the newest timeline on that segment. We'll still be reading WAL records
from the correct timeline since the partial WAL segment from the old
timeline gets copied to a new name on promotion, but we're reading it from
the newest copy of that segment, which is either complete and archived or
is still being written to by the current timeline.
For example, if the old master was on timeline 1 and writing
to 000000010000000000000003 when it dies and we promote a streaming
replica, the replica will copy 000000010000000000000003
to 000000020000000000000003 and append its recovery checkpoint to the copy.
It renames 000000010000000000000003 to 000000010000000000000003.partial,
which means the xlogreader won't find it. If we're reading the record at
0/3000000 then even though 0/3000000 is on timeline 1, we have to read it
from the segment on timeline 2.
Fun, eh?
(I'm going to write a README.timelines to document some of this stuff soon,
since it has some pretty hairy corners and some of the code paths are a bit
special.)
I've written some initial TAP tests for timeline following that exploit the
fact that replication slots are preserved on a replica if the replica is
created with a filesystem level copy that includes pg_replslot, rather than
using pg_basebackup. They are not included here because they rely on TAP
support improvements (filesystem backup support, psql enhancements, etc)
that I'll submit separately, but they're how I found the .partial issue.
A subsequent patch can add testing of slot creation and advance on replicas
using a C test extension to prove that this approach can be used to achieve
practical logical failover for extensions.
I think this is ready to go as-is.
I don't want to hold it up waiting for test framework enhancements unless
those can be committed fairly easily because I think we need this in 9.6
and the tests demonstrate that it works when run separately.
See for a git tree containing the timeline following patch, TAP
enhancements and the tests for timeline following.
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Allow-logical-slots-to-follow-timeline-switches.patchtext/x-patch; charset=US-ASCII; name=0001-Allow-logical-slots-to-follow-timeline-switches.patchDownload+323-36
Here are the tests.
They depend on https://commitfest.postgresql.org/9/569/#
I've also rebased the git tree for failover slots on top of the tree for
TAP test improvements.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Tests-for-logical-decoding-timeline-following.patchtext/x-patch; charset=US-ASCII; name=0001-Tests-for-logical-decoding-timeline-following.patchDownload+98-1
On 1 March 2016 at 21:00, Craig Ringer <craig@2ndquadrant.com> wrote:
Hi all
Per discussion on the failover slots thread (
https://commitfest.postgresql.org/9/488/) I'm splitting timeline
following for logical slots into its own separate patch.
I've updated the logical decoding timeline following patch to fix a bug
found as a result of test development related to how Pg renames the last
WAL seg on the old timeline to suffix it with .partial on promotion. The
xlogreader must switch to reading from the newest-timeline version of a
given segment eagerly, for the first page of the segment, since that's the
only one guaranteed to actually exist.
I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading to
it is not.
I've also attached an updated version of the tests posted a few days ago.
The tests depend on the remaining patches from the TAP enhancements tree so
it's easiest to just get the whole tree from
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
(subject to regular rebases and force pushes, do not use as a base).
The tests now include a test module that exposes some slots guts to SQL to
allow the client to sync slot state from master to replica(s) without
needing failover slots and the use of extra WAL as transport. It's very
much for-testing-only.
The new test module is used by a second round of tests to demonstrate the
practicality of failover of a logical replication client to a physical
replica using a base backup taken by pg_basebackup and without the presence
of failover slots. I won't pretend it's pretty.
This proves that the approach works barring unforseen showstoppers. It also
proves it's pretty ugly - failover slots provide a much, MUCH simpler and
safer way for clients to achieve this with way less custom code needed by
each client to sync slot state.
I've got a bit of cleanup to do in the test suite and a few more tests to
write for cases where the slot on the replica is allowed to fall behind the
slot on the master but this is mostly waiting on the remaining two TAP test
patches before it can be evaluated for possible push.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Allow-logical-slots-to-follow-timeline-switches.patchtext/x-patch; charset=US-ASCII; name=0001-Allow-logical-slots-to-follow-timeline-switches.patchDownload+323-36
0002-Tests-for-logical-decoding-timeline-following.patchtext/x-patch; charset=US-ASCII; name=0002-Tests-for-logical-decoding-timeline-following.patchDownload+491-1
On 04/03/16 17:08, Craig Ringer wrote:
I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading
to it is not.
I think this will have to be work for committer at this point. I can't
find any flaws in the logic myself so I unless somebody protests I am
going to mark this as ready for committer.
I've also attached an updated version of the tests posted a few days
ago. The tests depend on the remaining patches from the TAP
enhancements tree so it's easiest to just get the whole tree from
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
(subject to regular rebases and force pushes, do not use as a base).The tests now include a test module that exposes some slots guts to SQL
to allow the client to sync slot state from master to replica(s) without
needing failover slots and the use of extra WAL as transport. It's very
much for-testing-only.The new test module is used by a second round of tests to demonstrate
the practicality of failover of a logical replication client to a
physical replica using a base backup taken by pg_basebackup and without
the presence of failover slots. I won't pretend it's pretty.
Well for testing purposes it's quite fine I think. The TAP framework
enhancements needed for this are now in and it works correctly against
current master.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Petr Jelinek wrote:
On 04/03/16 17:08, Craig Ringer wrote:
I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading
to it is not.I think this will have to be work for committer at this point. I can't find
any flaws in the logic myself so I unless somebody protests I am going to
mark this as ready for committer.
Great, thanks. I've studied this to the point where I'm confident that
it makes sense, so I'm about to push it. I didn't change any logic,
only updated comments here and there, both in the patch and in some
preexisting code. I also changed the "List *timelineHistory" to be
#ifndef FRONTEND, which seems cleaner than having it and insist that it
couldn't be used.
Also, hopefully you don't have any future callers for the functions I
marked static (I checked the failover slots series and couldn't see
anything). I will push this early tomorrow.
One thing I'm not quite sure about is why this is said to apply to
"logical slots" and not just to replication slots in general. In what
sense does it not apply to physical replication slots?
(I noticed that we have this new line,
- randAccess = true;
+ randAccess = true; /* allow readPageTLI to go backwards */
in which now the comment is an identical copy of an identical line
elsewhere; however, randAccess doesn't actually affect readPageTLI in
any way, so AFAICS both comments are now wrong.)
Well for testing purposes it's quite fine I think. The TAP framework
enhancements needed for this are now in and it works correctly against
current master.
Certainly the new src/test/recovery/t/006whatever.pl file is going to be
part of the commit. What I'm not so sure about is
src/test/modules/decoding_failover: is there any reason we don't just
put the new functions in contrib/test_decoding?
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Allow-logical-slots-to-follow-timeline-switches-2.patchtext/x-diff; charset=us-asciiDownload+852-68
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
Great, thanks. I've studied this to the point where I'm confident that
it makes sense, so I'm about to push it. I didn't change any logic,
only updated comments here and there, both in the patch and in some
preexisting code. I also changed the "List *timelineHistory" to be
#ifndef FRONTEND, which seems cleaner than having it and insist that it
couldn't be used.
Could you perhaps wait till tomorrow? I'd like to do a pass over it.
Thanks
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund wrote:
Could you perhaps wait till tomorrow? I'd like to do a pass over it.
Sure thing. I'll wait for your review.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 15 March 2016 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Petr Jelinek wrote:
On 04/03/16 17:08, Craig Ringer wrote:
I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading
to it is not.I think this will have to be work for committer at this point. I can't
find
any flaws in the logic myself so I unless somebody protests I am going to
mark this as ready for committer.Great, thanks. I've studied this to the point where I'm confident that
it makes sense, so I'm about to push it.
Thanks, though I'm happy enough to wait for Andres's input as he requested.
Also, hopefully you don't have any future callers for the functions I
marked static (I checked the failover slots series and couldn't see
anything). I will push this early tomorrow.
I don't see it being overly useful for an extension, so that sounds fine.
One thing I'm not quite sure about is why this is said to apply to
"logical slots" and not just to replication slots in general. In what
sense does it not apply to physical replication slots?
It's only useful for logical slots currently because physical slot
timeline following is done client side by the walreceiver. When the
walsender hits the end of the timeline it sends CopyDone and returns to
command mode. The client is expected to request the timeline history file,
parse it, work out which timeline to request next and specify that in its
next START_REPLICATION call.
None of that happens for logical slots. Andres was quite deliberate about
keeping timelines out of the interface for logical slots and logical
replication. I tried to retain that when adding the ability to follow
physical failover, keeping things simple for the client. Given that logical
replication slots are intended to be consumed by more than just a postgres
downstream it IMO makes sense to keep as much internal complexity hidden,
especially when dealing with something as surprisingly convoluted as
timelines.
This does mean that we can't tell if we replayed past the timeline switch
point on the old master before we switched to the new master, but I don't
think sending a timeline history file is the solution there. I'd rather
expose a walsender command and SQL function to let the client say, after
connecting, "I last replayed from timeline <x> up to point <y>, if I start
replaying from you will I get a consistent stream?". That can be done
separately to this patch and IMO isn't crucial since clients can currently
work it out, just with more difficulty.
Maybe physical rep can use the same logic, but I'm really not convinced. It
already knows how to follow timelines client-side and handles that as part
of walreceiver and archive recovery. Because recovery handles following the
timeline switches and walreceiver just fetches the WAL as an alternative to
archive fetches I'm not sure it makes sense; we still have to have all that
logic for archive recovery from restore_command, so doing it differently
for streaming just makes things more complicated for no clear benefit.
I certainly didn't want to address it in the same patch, much like I
intentionally avoided trying to teach pg_xlogdump to be able to follow
timelines. Partly to keep it simpler and focused, partly because it'd
require timeline.c to be made frontend-clean which means no List, no elog,
etc...
(I noticed that we have this new line, - randAccess = true; + randAccess = true; /* allow readPageTLI to go backwards */ in which now the comment is an identical copy of an identical line elsewhere; however, randAccess doesn't actually affect readPageTLI in any way, so AFAICS both comments are now wrong.)
Yeah, that's completely bogus. I have a clear image in my head of
randAccess being tested by ValidXLogPageHeader where it sanity
checks state->latestPageTLI, the TLI of the *prior* page, against the TLI
of the most recent page read, to make sure the TLI advances. But nope, I'm
apparently imagining things 'cos it just isn't there, it just tests to see
if we read a LSN later than the prior page instead.
Well for testing purposes it's quite fine I think. The TAP framework
enhancements needed for this are now in and it works correctly against
current master.Certainly the new src/test/recovery/t/006whatever.pl file is going to be
part of the commit. What I'm not so sure about is
src/test/modules/decoding_failover: is there any reason we don't just
put the new functions in contrib/test_decoding?
Because contrib/test_decoding isn't an extension. It is a sample logical
decoding output plugin. I didn't want to mess it up by adding an extension
control file, extension script and some extension functions. Sure, you
*can* use the same shared library as an extension and as a logical decoding
output plugin but I think that makes it rather more confusing as an
example.
Also, because I don't want people reading test_decoding to find those
functions and think it's a good idea to actually use them. It's hidden away
in src/test/modules for a reason ;)
I have no particular objection to stripping the test module and the parts
of the t/ script associated with it if you don't think it's worthwhile.
I'd be happy to have it as an ongoing test showing that "hot" timeline
following on a logical slot does work, but the tests that use only base
backups are probably sufficient to detect obvious breakage/regressions.
If Michael Paquier's decoder_raw and receiver_raw were in contrib/ I'd add
support for doing it into reciever_raw without exposing SQL functions. They
aren't in core and can't form part of the core test suite, so I needed
something minimalist to prove the concept, and figured it might as well
serve as a regression test as well.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index fcb0872..7b60b8f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -10,9 +10,11 @@ * * NOTES * See xlogreader.h for more notes on this facility. + * + * This file is compiled as both front-end and backend code, so it + * may not use ereport, server-defined static variables, etc. *------------------------------------------------------------------------- */ -
Huh?
#include "postgres.h"
#include "access/transam.h"
@@ -116,6 +118,11 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
return NULL;
}+#ifndef FRONTEND + /* Will be loaded on first read */ + state->timelineHistory = NIL; +#endif + return state; }@@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state) pfree(state->errormsg_buf); if (state->readRecordBuf) pfree(state->readRecordBuf); +#ifndef FRONTEND + if (state->timelineHistory) + list_free_deep(state->timelineHistory); +#endif
Hm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
pfree(state->readBuf);
pfree(state);
}
@@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)if (RecPtr == InvalidXLogRecPtr)
{
+ /* No explicit start point; read the record after the one we just read */
RecPtr = state->EndRecPtr;if (state->ReadRecPtr == InvalidXLogRecPtr) - randAccess = true; + randAccess = true; /* allow readPageTLI to go backwards */
randAccess is doing more than that, so I'm doubtful that comment is an
improvment.
/* * RecPtr is pointing to end+1 of the previous WAL record. If we're @@ -223,6 +235,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) else { /* + * Caller supplied a position to start at. + * * In this case, the passed-in record pointer should already be * pointing to a valid record starting position. */ @@ -309,8 +323,10 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) /* XXX: more validation should be done here */ if (total_len < SizeOfXLogRecord) { - report_invalid_record(state, "invalid record length at %X/%X", - (uint32) (RecPtr >> 32), (uint32) RecPtr); + report_invalid_record(state, + "invalid record length at %X/%X: wanted %lu, got %u", + (uint32) (RecPtr >> 32), (uint32) RecPtr, + SizeOfXLogRecord, total_len); goto err; } gotheader = false; @@ -466,9 +482,7 @@ err: * Invalidate the xlog page we've cached. We might read from a different * source after failure. */ - state->readSegNo = 0; - state->readOff = 0; - state->readLen = 0; + XLogReaderInvalCache(state);
I don't think that "cache" is the right way to describe this.
#include <unistd.h>
-#include "miscadmin.h"
-
spurious change imo.
/* - * TODO: This is duplicate code with pg_xlogdump, similar to walsender.c, but - * we currently don't have the infrastructure (elog!) to share it. + * Read 'count' bytes from WAL into 'buf', starting at location 'startptr' + * in timeline 'tli'. + * + * Will open, and keep open, one WAL segment stored in the static file + * descriptor 'sendFile'. This means if XLogRead is used once, there will + * always be one descriptor left open until the process ends, but never + * more than one. + * + * XXX This is very similar to pg_xlogdump's XLogDumpXLogRead and to XLogRead + * in walsender.c but for small differences (such as lack of elog() in + * frontend). Probably these should be merged at some point. */ static void XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count) @@ -648,8 +657,12 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count) XLogRecPtr recptr; Size nbytes;+ /* + * Cached state across calls. + */
One line?
static int sendFile = -1;
static XLogSegNo sendSegNo = 0;
+ static TimeLineID sendTLI = 0;
static uint32 sendOff = 0;p = buf;
@@ -664,11 +677,12 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)startoff = recptr % XLogSegSize;
- if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo)) + /* Do we need to open a new xlog segment? */ + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) || + sendTLI != tli) {
s/open a new/open a different/? New imo has connotations that we don't
really want here.
char path[MAXPGPATH];
- /* Switch to another logfile segment */
if (sendFile >= 0)
close(sendFile);
E.g. you could just have moved the above comment.
/* Need to seek in the file? */ if (sendOff != startoff) { if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0) - { - char path[MAXPGPATH]; - - XLogFilePath(path, tli, sendSegNo); - ereport(ERROR, (errcode_for_file_access(), errmsg("could not seek in log segment %s to offset %u: %m", - path, startoff))); - } + XLogFileNameP(tli, sendSegNo), startoff))); sendOff = startoff; }
Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.
I'm also getting the feeling that the patch is bordering on doing some
relatively random cleanups mixed in with architectural changes. Makes
things a bit harder to review.
+static void +XLogReadDetermineTimeline(XLogReaderState *state) +{ + /* Read the history on first time through */ + if (state->timelineHistory == NIL) + state->timelineHistory = readTimeLineHistory(ThisTimeLineID); + + /* + * Are we reading the record immediately following the one we read last + * time? If not, then don't use the cached timeline info. + */ + if (state->currRecPtr != state->EndRecPtr) + { + state->currTLI = 0; + state->currTLIValidUntil = InvalidXLogRecPtr; + }
Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().
+ if (state->currTLI == 0) + { + /* + * Something changed; work out what timeline this record is on. We + * might read it from the segment on this TLI or, if the segment is + * also contained by newer timelines, the copy from a newer TLI. + */ + state->currTLI = tliOfPointInHistory(state->currRecPtr, + state->timelineHistory); + + /* + * Look for the most recent timeline that's on the same xlog segment + * as this record, since that's the only one we can assume is still + * readable. + */ + while (state->currTLI != ThisTimeLineID && + state->currTLIValidUntil == InvalidXLogRecPtr) + { + XLogRecPtr tliSwitch; + TimeLineID nextTLI; + + tliSwitch = tliSwitchPoint(state->currTLI, state->timelineHistory, + &nextTLI); + + /* round ValidUntil down to start of seg containing the switch */ + state->currTLIValidUntil = + ((tliSwitch / XLogSegSize) * XLogSegSize); + + if (state->currRecPtr >= state->currTLIValidUntil) + { + /* + * The new currTLI ends on this WAL segment so check the next + * TLI to see if it's the last one on the segment. + * + * If that's the current TLI we'll stop searching.
I don't really understand how we're stopping searching here?
+ */ + state->currTLI = nextTLI; + state->currTLIValidUntil = InvalidXLogRecPtr; + } + } +}
XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.
+/* + * XLogPageReadCB callback for reading local xlog files * * Public because it would likely be very helpful for someone writing another * output method outside walsender, e.g. in a bgworker. * - * TODO: The walsender has it's own version of this, but it relies on the + * TODO: The walsender has its own version of this, but it relies on the * walsender's latch being set whenever WAL is flushed. No such infrastructure * exists for normal backends, so we have to do a check/sleep/repeat style of * loop for now. @@ -754,46 +897,88 @@ int read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI) { - XLogRecPtr flushptr, + XLogRecPtr read_upto, loc; int count;loc = targetPagePtr + reqLen; + + /* Make sure enough xlog is available... */ while (1) { /* - * TODO: we're going to have to do something more intelligent about - * timelines on standbys. Use readTimeLineHistory() and - * tliOfPointInHistory() to get the proper LSN? For now we'll catch - * that case earlier, but the code and TODO is left in here for when - * that changes. + * Check which timeline to get the record from. + * + * We have to do it each time through the loop because if we're in + * recovery as a cascading standby, the current timeline might've + * become historical. */ - if (!RecoveryInProgress()) + XLogReadDetermineTimeline(state); + + if (state->currTLI == ThisTimeLineID) { - *pageTLI = ThisTimeLineID; - flushptr = GetFlushRecPtr(); + /* + * We're reading from the current timeline so we might have to + * wait for the desired record to be generated (or, for a standby, + * received & replayed) + */ + if (!RecoveryInProgress()) + { + *pageTLI = ThisTimeLineID; + read_upto = GetFlushRecPtr(); + } + else + read_upto = GetXLogReplayRecPtr(pageTLI); + + if (loc <= read_upto) + break; + + CHECK_FOR_INTERRUPTS(); + pg_usleep(1000L); } else - flushptr = GetXLogReplayRecPtr(pageTLI); + { + /* + * We're on a historical timeline, so limit reading to the switch + * point where we moved to the next timeline. + */ + read_upto = state->currTLIValidUntil;
Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
so, how come?
- if (loc <= flushptr) + /* + * Setting pageTLI to our wanted record's TLI is slightly wrong; + * the page might begin on an older timeline if it contains a + * timeline switch, since its xlog segment will have been copied + * from the prior timeline. This is pretty harmless though, as + * nothing cares so long as the timeline doesn't go backwards. We + * should read the page header instead; FIXME someday. + */ + *pageTLI = state->currTLI; + + /* No need to wait on a historical timeline */ break; - - CHECK_FOR_INTERRUPTS(); - pg_usleep(1000L); + } }- /* more than one block available */ - if (targetPagePtr + XLOG_BLCKSZ <= flushptr) + if (targetPagePtr + XLOG_BLCKSZ <= read_upto) + { + /* + * more than one block available; read only that block, have caller + * come back if they need more. + */ count = XLOG_BLCKSZ; - /* not enough data there */ - else if (targetPagePtr + reqLen > flushptr) + } + else if (targetPagePtr + reqLen > read_upto) + { + /* not enough data there */ return -1; - /* part of the page available */ + } else - count = flushptr - targetPagePtr; + { + /* enough bytes available to satisfy the request */ + count = read_upto - targetPagePtr; + }- XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ); + XLogRead(cur_page, *pageTLI, targetPagePtr, count);
When are we reading less than a page? That should afaik never be required.
+ /* + * We start reading xlog from the restart lsn, even though in + * CreateDecodingContext we set the snapshot builder up using the + * slot's candidate_restart_lsn. This means we might read xlog we + * don't actually decode rows from, but the snapshot builder might + * need it to get to a consistent point. The point we start returning + * data to *users* at is the candidate restart lsn from the decoding + * context. + */
Uh? Where are we using candidate_restart_lsn that way? I seriously doubt
it is - candidate_restart_lsn is about a potential future restart_lsn,
which we can set once we get reception confirmation from the client.
@@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
CHECK_FOR_INTERRUPTS();
}+ /* Make sure timeline lookups use the start of the next record */ + startptr = ctx->reader->EndRecPtr;
Huh? startptr isn't used after this, so I'm not sure what this even
means?
+ /* + * The XLogReader will read a page past the valid end of WAL because + * it doesn't know about timelines. When we switch timelines and ask + * it for the first page on the new timeline it will think it has it + * cached, but it'll have the old partial page and say it can't find + * the next record. So flush the cache. + */ + XLogReaderInvalCache(ctx->reader); +
dito.
diff --git a/src/test/modules/decoding_failover/decoding_failover.c b/src/test/modules/decoding_failover/decoding_failover.c new file mode 100644 index 0000000..669e6c4 --- /dev/null +++ b/src/test/modules/decoding_failover/decoding_failover.c
+ +/* + * Create a new logical slot, with invalid LSN and xid, directly. This does not + * use the snapshot builder or logical decoding machinery. It's only intended + * for creating a slot on a replica that mirrors the state of a slot on an + * upstream master. + * + * You should immediately decoding_failover_advance_logical_slot(...) it + * after creation. + */
Uh. I doubt we want this, even if it's formally located in
src/test/modules. These comments make it appear not to be only intended
for that, and I have serious doubts about the validity of the concept as
is.
This seems to need some more polishing.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 15 March 2016 at 17:12, Andres Freund <andres@anarazel.de> wrote:
Hi
Thanks very much for the review.
This patch was split out from failover slots, which its self underwent
quite a few revisions, so I'm really happy to have fresh eyes on it.
Especially more experienced ones.
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
diff --git a/src/backend/access/transam/xlogreader.cb/src/backend/access/transam/xlogreader.c
index fcb0872..7b60b8f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -10,9 +10,11 @@ * * NOTES * See xlogreader.h for more notes on this facility. + * + * This file is compiled as both front-end and backend code,so it
+ * may not use ereport, server-defined static variables, etc.
*-------------------------------------------------------------------------
*/
-Huh?
I'm not sure what the concern here is. xlogreader *is* compiled as frontend
code - the file gets linked into the tree for pg_xlogdump and pg_rewind, at
least.
I found that really confusing when working on it and thought it merited a
comment.
@@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state) pfree(state->errormsg_buf); if (state->readRecordBuf) pfree(state->readRecordBuf); +#ifndef FRONTEND + if (state->timelineHistory) + list_free_deep(state->timelineHistory); +#endifHm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
Yes, it would. I don't want to address that in the same patch though. It'd
require making timeline.c frontend-clean, dealing with the absence of List
on the frontend, etc, and I don't want to complicate this patch with that.
I've intentionally written the timeline logic so it can pretty easily be
moved into xlogreader.c as a self-contained unit and used for those
utilities once timeline.c can be compiled for frontend too.
pfree(state->readBuf);
pfree(state);
}
@@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtrRecPtr, char **errormsg)
if (RecPtr == InvalidXLogRecPtr)
{
+ /* No explicit start point; read the record after the onewe just read */
RecPtr = state->EndRecPtr;
if (state->ReadRecPtr == InvalidXLogRecPtr) - randAccess = true; + randAccess = true; /* allow readPageTLI to gobackwards */
randAccess is doing more than that, so I'm doubtful that comment is an
improvment.
Yeah, I have no idea what I was on about there, per response to Álvaro's
post.
@@ -466,9 +482,7 @@ err:
* Invalidate the xlog page we've cached. We might read from adifferent
* source after failure. */ - state->readSegNo = 0; - state->readOff = 0; - state->readLen = 0; + XLogReaderInvalCache(state);I don't think that "cache" is the right way to describe this.
Isn't that what it is? It reads a page, caches it, and reuses it for
subsequent requests on the same page. The pre-existing comment even calls
it a cache above.
I don't mind changing it, but don't have any better ideas.
#include <unistd.h>
-#include "miscadmin.h"
-spurious change imo.
Added in Álvaro's rev; it puts the header in the correct sort order, but
I'm not sure it should be bundled with this patch.
- if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo)) + /* Do we need to open a new xlog segment? */ + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) || + sendTLI != tli) {s/open a new/open a different/? New imo has connotations that we don't
really want here.
In my original patch this was:
/* Do we need to switch to a new xlog segment? */
but yeah, "open a different" is better than either.
/* Need to seek in the file? */
if (sendOff != startoff)
{
if (lseek(sendFile, (off_t) startoff, SEEK_SET) <0)
- {
- char path[MAXPGPATH];
-
- XLogFilePath(path, tli, sendSegNo);
-
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not seek in log segment %sto offset %u: %m",
- path, startoff))); - } + XLogFileNameP(tli,sendSegNo), startoff)));
sendOff = startoff;
}Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.
I was getting pretty confused, since I was sure I didn't write that. My
memory's bad enough that sometimes I go "huh, ok, guess I did"... but in
this case it wasn't in my patch so I think Álvaro's added it.
Agree that it's unrelated and probably better how it was.
+static void +XLogReadDetermineTimeline(XLogReaderState *state) +{ + /* Read the history on first time through */ + if (state->timelineHistory == NIL) + state->timelineHistory =readTimeLineHistory(ThisTimeLineID);
+ + /* + * Are we reading the record immediately following the one we readlast
+ * time? If not, then don't use the cached timeline info. + */ + if (state->currRecPtr != state->EndRecPtr) + { + state->currTLI = 0; + state->currTLIValidUntil = InvalidXLogRecPtr; + }Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().
Yeah, and in an earlier version of this patch that's where it lived.
I landed up moving it into its own self-contained function mainly because
the xlog read callback needs to be able to re-read the timeline info and
re-check the timeline end if the current timeline becomes historical while
it's waiting for new WAL, which could happen if it's a cascading standby
and its parent got promoted. Hence the call to XLogReadDetermineTimeline
from within read_local_xlog_page(...). That can't actually happen right now
since logical decoding can't be done on a standby yet, but I didn't want to
introduce new problems to fix for that when adding timeline following
support.
XLogReadRecord(...) could clear this info instead of doing it in
XLogReadDetermineTimeline(...), but I thought it made more sense to keep
use of that state local to XLogReadDetermineTimeline(...) rather than
scatter it.
+ if (state->currTLI == 0) + { + /* + * Something changed; work out what timeline this recordis on. We
+ * might read it from the segment on this TLI or, if the
segment is
+ * also contained by newer timelines, the copy from a
newer TLI.
+ */ + state->currTLI = tliOfPointInHistory(state->currRecPtr, +state->timelineHistory);
+ + /* + * Look for the most recent timeline that's on the samexlog segment
+ * as this record, since that's the only one we can assume
is still
+ * readable. + */ + while (state->currTLI != ThisTimeLineID && + state->currTLIValidUntil == InvalidXLogRecPtr) + { + XLogRecPtr tliSwitch; + TimeLineID nextTLI; + + tliSwitch = tliSwitchPoint(state->currTLI,state->timelineHistory,
+
&nextTLI);
+ + /* round ValidUntil down to start of segcontaining the switch */
+ state->currTLIValidUntil = + ((tliSwitch / XLogSegSize) * XLogSegSize); + + if (state->currRecPtr >= state->currTLIValidUntil) + { + /* + * The new currTLI ends on this WALsegment so check the next
+ * TLI to see if it's the last one on the
segment.
+ * + * If that's the current TLI we'll stopsearching.
I don't really understand how we're stopping searching here?
What I'm doing here is looking for the newest timeline that exists in this
WAL segment. Each time through we advance currTLI if we find that there's a
newer timeline on this segment then look again. The loop ends if we find
that the newest timeline on the segment is the current timeline being
written/replayed by the server (in which case we know that segment must
still exist and not have been renamed to .partial) or until we find that
currTLI's validity extends past this segment.
There is some explanation for this in the comments at the start of the
function since it affects the overall logic.
On reading it again I think that testing against state->currRecPtr is
confusing here. It relies on the fact that currTLIValidUntil has been
rounded down to the LSN of the start of the segment, so if the current
record pointer is greater than it we know the timeline ends somewhere on
this segment.
I guess it could be clearer (but verbose) to define an XLogSegNoFromLSN
macro then
if ( XLogSegNoFromLSN(state->currRecPtr) >=
XLogSegNoFromLSN(state->currTLIValidUntil))
but ... eh.
The alternative seems to be to search the timeline history info directly to
find the most recent timeline in the segment, starting from the current
timeline being read.
+ */ + state->currTLI = nextTLI; + state->currTLIValidUntil =InvalidXLogRecPtr;
+ } + } +}XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.
If there is I'd be quite happy. It took me some time to figure out the
wrinkles here.
You can't do this where you'd expect, in XLogReadPage. Partly because it
gets used by frontend code too, as discussed above. Partly because the
xlogreader callback is responsible for waiting for new WAL rather than
XLogReadPage, and as noted above that would cause issues when a cascading
standby gets promoted while we're waiting for WAL. There's no way to pass
the wanted timeline to the callback anyway, but if that were the only issue
I'd have just added it (and did, in earlier versions of this patch).
Additionally, xlogreader and XLogReadPage is used by the physical
replication walsender which neither needs nor wants timeline following - it
expects to return failure when it runs out of data on the timeline instead,
so the client can manage the timeline switch after the CopyBoth data stream
ends.
It's not desirable to do the timeline switch for logical decoding at a
higher level, before calling into the walsender, because then it has to be
done separately in the SQL interface and walsender interface for logical
decoding, similarly to how the client does it for physical replication. I
actually did it this way in the proof of concept version and it works fine,
it's just more intrusive and ugly and duplicates logic in multiple places.
There's more to it than that but I'm tired after a long day. I'll try to
write that timelines readme after I review my notes so I can explain better.
As for the actual mechanism by which XLogReadDetermineTimeline operates,
the main thing I wonder is whether it can be usefully simplified by having
it directly scan the loaded timeline history and determine the last
timeline for a segment. I'm not convinced there's much of a way around
needing the rest of the logic.
+/*
+ * XLogPageReadCB callback for reading local xlog files
*
* Public because it would likely be very helpful for someone writinganother
* output method outside walsender, e.g. in a bgworker.
*
- * TODO: The walsender has it's own version of this, but it relies onthe
+ * TODO: The walsender has its own version of this, but it relies on the
* walsender's latch being set whenever WAL is flushed. No suchinfrastructure
* exists for normal backends, so we have to do a check/sleep/repeat
style of
* loop for now.
@@ -754,46 +897,88 @@ int
read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID*pageTLI)
{
- XLogRecPtr flushptr,
+ XLogRecPtr read_upto,
loc;
int count;loc = targetPagePtr + reqLen; + + /* Make sure enough xlog is available... */ while (1) { /* - * TODO: we're going to have to do something moreintelligent about
- * timelines on standbys. Use readTimeLineHistory() and
- * tliOfPointInHistory() to get the proper LSN? For nowwe'll catch
- * that case earlier, but the code and TODO is left in
here for when
- * that changes. + * Check which timeline to get the record from. + * + * We have to do it each time through the loop because ifwe're in
+ * recovery as a cascading standby, the current timeline
might've
+ * become historical. */ - if (!RecoveryInProgress()) + XLogReadDetermineTimeline(state); + + if (state->currTLI == ThisTimeLineID) { - *pageTLI = ThisTimeLineID; - flushptr = GetFlushRecPtr(); + /* + * We're reading from the current timeline so wemight have to
+ * wait for the desired record to be generated
(or, for a standby,
+ * received & replayed) + */ + if (!RecoveryInProgress()) + { + *pageTLI = ThisTimeLineID; + read_upto = GetFlushRecPtr(); + } + else + read_upto = GetXLogReplayRecPtr(pageTLI); + + if (loc <= read_upto) + break; + + CHECK_FOR_INTERRUPTS(); + pg_usleep(1000L); } else - flushptr = GetXLogReplayRecPtr(pageTLI); + { + /* + * We're on a historical timeline, so limitreading to the switch
+ * point where we moved to the next timeline. + */ + read_upto = state->currTLIValidUntil;Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
so, how come?
We're reading from a segment where the newest timeline on that segment is
historical, i.e. the server has since replayed WAL from a newer timeline on
a more recent segment. We therefore know that there must be a complete
segment and we can't ever need to wait for new WAL.
An assertion that loc <= GetFlushRecPtr() wouldn't hurt.
- /* more than one block available */ - if (targetPagePtr + XLOG_BLCKSZ <= flushptr) + if (targetPagePtr + XLOG_BLCKSZ <= read_upto) + { + /* + * more than one block available; read only that block,have caller
+ * come back if they need more. + */ count = XLOG_BLCKSZ; - /* not enough data there */ - else if (targetPagePtr + reqLen > flushptr) + } + else if (targetPagePtr + reqLen > read_upto) + { + /* not enough data there */ return -1; - /* part of the page available */ + } else - count = flushptr - targetPagePtr; + { + /* enough bytes available to satisfy the request */ + count = read_upto - targetPagePtr; + }- XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ); + XLogRead(cur_page, *pageTLI, targetPagePtr, count);When are we reading less than a page? That should afaik never be required.
Because pages are pre-allocated and zeroed it's safe to read from the last
page past the point we've actually written WAL to. But in that case why do
we bother determining 'count' in the first place, only to ignore it?
This is largely cosmetic TBH.
+ /*
+ * We start reading xlog from the restart lsn, even though
in
+ * CreateDecodingContext we set the snapshot builder up
using the
+ * slot's candidate_restart_lsn. This means we might read
xlog we
+ * don't actually decode rows from, but the snapshot
builder might
+ * need it to get to a consistent point. The point we
start returning
+ * data to *users* at is the candidate restart lsn from
the decoding
+ * context.
+ */Uh? Where are we using candidate_restart_lsn that way?
That's a major brain-fart - it should've been confirmed_flush. I'm glad you
caught that, since it took me some time to figure out why logical decoding
was trying to read WAL I hadn't asked for, but adding an explanatory
comment that's *wrong* sure doesn't help the next person.
I'm quite impressed and disturbed that I managed to write out as "candidate
restart lsn" instead of "confirmed lsn" in full as well. (Now, where were
my glasses, I swear I had them a minute ago... oh, they're on my face!)
The important bit is that where we start reading xlog is the restart_lsn of
the slot, even though we won't return anything we decoded to the user until
we reach confirmed_flush, which is looked up by CreateDecodingContext
completely independently of what pg_logical_slot_get_changes_guts does,
because we passed InvalidXLogRecPtr to CreateDecodingContext to tell it to
automagically look up the confirmed_lsn. Maybe it'd be clearer to just pass
the confirmed_lsn to CreateDecodingContext
from pg_logical_slot_get_changes_guts. When I was debugging this stuff I
found it pretty confusing that the LSN we started reading xlog at was
different to the LSN the user wants decoded changes from, hence the comment.
@@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo
fcinfo, bool confirm, bool binCHECK_FOR_INTERRUPTS();
}+ /* Make sure timeline lookups use the start of the next
record */
+ startptr = ctx->reader->EndRecPtr;
Huh? startptr isn't used after this, so I'm not sure what this even
means?
That snuck in from a prior revision where timeline following was done in
pg_logical_slot_get_changes_guts, before I moved it down into the
xlogreader callback so it could be shared between the walsender and SQL
interfaces for logical decoding. It's bogus.
+ /* + * The XLogReader will read a page past the valid end ofWAL because
+ * it doesn't know about timelines. When we switch
timelines and ask
+ * it for the first page on the new timeline it will think
it has it
+ * cached, but it'll have the old partial page and say it
can't find
+ * the next record. So flush the cache. + */ + XLogReaderInvalCache(ctx->reader); +dito.
Ditto ;)
I really should've caught that. The problem is I've spent way, way too long
staring at this code over too many revisions as I figured out what T.F. was
going on with timeline following and slots. That's exactly why I really
appreciate the fresh eyes.
+ * Create a new logical slot, with invalid LSN and xid, directly. This
does not
+ * use the snapshot builder or logical decoding machinery. It's only
intended
+ * for creating a slot on a replica that mirrors the state of a slot on
an
+ * upstream master. + * + * You should immediately decoding_failover_advance_logical_slot(...) it + * after creation. + */Uh. I doubt we want this, even if it's formally located in
src/test/modules. These comments make it appear not to be only intended
for that, and I have serious doubts about the validity of the concept as
is.
As I expressed to Álvaro, I don't really mind if this test module and the
associated round of t/ tests that rely on it get cut, so long as the first
round of t/ tests that use a filesystem level copy to clone the slot are
kept so there's some test coverage for timeline following in logical slots.
I wrote it as a PoC to show that it worked, and the best way to do that was
to write it as a new test suite component. Since it seemed useful to test
logical decoding timeline following for a slot created after a base backup
is made or cloned after pg_basebackup discarded the slots, I included it.
The approach isn't ideal. It's what I plan to do for pglogical if failover
slots doesn't make the cut for 9.6, though. The main problem is that
because the slot updates don't come through WAL they can lag behind the
catalog_xmin the master's using to make decisions about vacuuming the
catalogs. If the master advances catalog_xmin on a slot then starts writing
the resulting vacuum activity to WAL, and if we replay that WAL *before* we
see the slot advance and sync it to the replica, the replica's slot will
have an incorrect catalog_xmin that does not reflect the actual on-disk
state. Not great. However, if the client is keeping track of its
confirmed_lsn (as it must, for correctness) then we know it'll never ask to
replay anything older than what it already sent confirmation to the old
master for, before failover. Since that's how the slot got advanced and got
a new catalog_xmin. That means we won't be attempting to decode anything in
the range where the recorded catalog_xmin on the promoted standby after
failover would be a problem. The slot will advance to a sensible position
when the client specifies the new start lsn.
At least, that's my reading of things, and that's what my tests have shown
so far. We do start decoding from restart_lsn, so we'll be decoding WAL in
the range where catalog_xmin is lying about what's really in the heap, but
I don't see anywhere where we're looking at it. It's just collecting up
transaction information at that point, right?
This same issue will occur if we attempt to do failover slots v2 for 9.7
using non-WAL transport to allow decoding from a replica with failover to
cascading standbys, as you mentioned wanting earlier. We'd have to have a
way to make sure the slot state on the replica was updated before we
replayed past the point in WAL where that slot was updated to the same
state on the master.
To that end I've thought about proposing a hook to let plugins intercept
slot write-out. That way they can take note of the current server LSN and
slot state, then make sure they sync that over to the replica before it
replays WAL past that LSN. I was really hoping to get failover slots in
place so this wouldn't be necessary but it's not looking too promising, and
this would provide a somewhat safer way to capture slot advances than just
peeking at pg_replication_slots but without having to get the full failover
slots stuff in. Having this would at least eliminate the possibility of
catalog_xmin being wrong on the replica.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig already replied to most points. I think we can sum up as "we've
done the best we can with what we have, maybe somebody can figure how to
do better in the future but for now this works."
Andres Freund wrote:
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
+#ifndef FRONTEND + if (state->timelineHistory) + list_free_deep(state->timelineHistory); +#endifHm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
Exactly. I had the same comment, but the argument that we would need to
rewrite a lot of timeline.c won me. That can wait for a future patch;
we certainly don't want to be rewriting these things during the last CF.
if (state->ReadRecPtr == InvalidXLogRecPtr) - randAccess = true; + randAccess = true; /* allow readPageTLI to go backwards */randAccess is doing more than that, so I'm doubtful that comment is an
improvment.
As I said, it's only a copy of what's already a few lines above. I
would be happier removing both, and instead adding an explanatory
comment about that variable elsewhere.
#include <unistd.h>
-#include "miscadmin.h"
-spurious change imo.
Yeah, in a way this is cleanup after previous patches that have messed
up things somewhat. I wouldn't have posted it unless I was close to
committing this ... I think it'd be better to push these cleanup changes
separately.
/* Need to seek in the file? */ if (sendOff != startoff) { if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0) - { - char path[MAXPGPATH]; - - XLogFilePath(path, tli, sendSegNo); - ereport(ERROR, (errcode_for_file_access(), errmsg("could not seek in log segment %s to offset %u: %m", - path, startoff))); - } + XLogFileNameP(tli, sendSegNo), startoff))); sendOff = startoff; }Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.
Yeah, this code was copied from elsewhere in 422a55a68784f but there is
already another version of this routine in walsender which is using
XLogFileNameP instead of the stack-allocated one. I just made this one
more similar to the walsender's (which is in backend environment just
like this one) instead of continuing to use the frontend-limited
version (which is where this one was copied from).
I'm having a hard time getting concerned about using palloc there,
considering that every single call of XLogFileNameP occurs in an error
message. If we want to reject this one, surely we should change all the
others too.
I'm also getting the feeling that the patch is bordering on doing some
relatively random cleanups mixed in with architectural changes. Makes
things a bit harder to review.
Yes.
+static void +XLogReadDetermineTimeline(XLogReaderState *state) +{ + /* Read the history on first time through */ + if (state->timelineHistory == NIL) + state->timelineHistory = readTimeLineHistory(ThisTimeLineID); + + /* + * Are we reading the record immediately following the one we read last + * time? If not, then don't use the cached timeline info. + */ + if (state->currRecPtr != state->EndRecPtr) + { + state->currTLI = 0; + state->currTLIValidUntil = InvalidXLogRecPtr; + }Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().
Craig's argument against that seems reasonable to me.
XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.
Feel free to suggest something :-)
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
Thanks very much for the review.
Are you planning to update the patch?
- Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 22 March 2016 at 19:48, Andres Freund <andres@anarazel.de> wrote:
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
Thanks very much for the review.
Are you planning to update the patch?
Yes. I just spoke to Álvaro earlier. I'll pick up his modified version of
my patch, remove the unrelated cleanup stuff he added, amend the other
issues and re-post today.
Then I'll try to get an updated pglogical_output up, now that I've knocked
off a variety of urgent work elsewhere...
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22 March 2016 at 19:48, Andres Freund <andres@anarazel.de> wrote:
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
Thanks very much for the review.
Are you planning to update the patch?
Updated patch attached.
It removes the questionable cleanups, fixes the randAccess comment (IMO),
changes the "open a new xlog segment" wording and explains why we don't
need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline.
I removed the change that did less than a whole page at the XLogRead call
and instead added a comment explaining it. Fixed the brainfrart with
candidate_restart_lsn, removed the outdated startptr comment and update and
the unnecessary XLogReaderInvalCache call after it.
Also renamed src/test/modules/decoding_failover to
src/test/modules/test_slot_timelines . Best name I could come up with that
wasn't 1000chars long.
Passes main 'check', contrib/test_decoding check,
src/test/modules/test_slot_timelines check and src/test/recovery check.
Available attached or at
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Timeline-following-for-logical-decoding.patchtext/x-patch; charset=US-ASCII; name=0001-Timeline-following-for-logical-decoding.patchDownload+865-52
Craig Ringer wrote:
It removes the questionable cleanups, fixes the randAccess comment (IMO),
I pushed cleanup separately, including the addition of a few comments
that were documenting the original code. I actually made a mistake in
extracting those, so there's one comment that's actually bogus in that
commit (the candidate_restart_lsn one); it's fixed in the second commit.
Sorry about that. I also introduced XLogReaderInvalReadState here,
called XLogReaderInvalCache originally, since Andres objected to the
"cache" terminology (though you will note that the word "cache" was used
in the original code too.)
changes the "open a new xlog segment" wording and explains why we don't
need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline.
I removed the change that did less than a whole page at the XLogRead call
and instead added a comment explaining it. Fixed the brainfrart with
candidate_restart_lsn, removed the outdated startptr comment and update and
the unnecessary XLogReaderInvalCache call after it.Also renamed src/test/modules/decoding_failover to
src/test/modules/test_slot_timelines . Best name I could come up with that
wasn't 1000chars long.Passes main 'check', contrib/test_decoding check,
src/test/modules/test_slot_timelines check and src/test/recovery check.Available attached or at
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
And pushed this too.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 31 March 2016 at 07:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Available attached or at
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
And pushed this too.
Much appreciated. Marked as committed at
https://commitfest.postgresql.org/9/568/ .
This gives us an option for failover of logical replication in 9.6, even if
it's a bit cumbersome and complex for the client, in case failover slots
don't make the cut. And, of course, it's a pre-req for failover slots,
which I'll rebase on top of it shortly.
Andres, I tried to address your comments as best I could. The main one that
I think stayed open was about the loop that finds the last timeline on a
segment. If you think that's better done by directly scanning the List* of
timeline history entries I'm happy to prep a follow-up.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2016-03-31 08:52:34 +0800, Craig Ringer wrote:
On 31 March 2016 at 07:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Available attached or at
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
And pushed this too.
Much appreciated. Marked as committed at
https://commitfest.postgresql.org/9/568/ .This gives us an option for failover of logical replication in 9.6, even if
it's a bit cumbersome and complex for the client, in case failover slots
don't make the cut. And, of course, it's a pre-req for failover slots,
which I'll rebase on top of it shortly.
FWIW, I think it's dangerous to use this that way. If people manipulate
slots that way we'll have hellishly to debug issues. The test code needs
a big disclaimer to never ever be used in production, and we should
"disclaim any warranty" if somebody does that. To the point of not
fixing issues around it in back branches.
Andres, I tried to address your comments as best I could. The main one that
I think stayed open was about the loop that finds the last timeline on a
segment. If you think that's better done by directly scanning the List* of
timeline history entries I'm happy to prep a follow-up.
Have to look again.
+ * We start reading xlog from the restart lsn, even though in
+ * CreateDecodingContext we set the snapshot builder up using the
+ * slot's confirmed_flush. This means we might read xlog we don't
+ * actually decode rows from, but the snapshot builder might need it
+ * to get to a consistent point. The point we start returning data to
+ * *users* at is the confirmed_flush lsn set up in the decoding
+ * context.
+ */
still seems pretty misleading - and pretty much unrelated to the
callsite.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 31 March 2016 at 16:09, Andres Freund <andres@anarazel.de> wrote:
This gives us an option for failover of logical replication in 9.6, even
if
it's a bit cumbersome and complex for the client, in case failover slots
don't make the cut. And, of course, it's a pre-req for failover slots,
which I'll rebase on top of it shortly.FWIW, I think it's dangerous to use this that way. If people manipulate
slots that way we'll have hellishly to debug issues.
I think it's pretty unsafe from SQL, to be sure.
Unless failover slots get in to 9.6 we'll need to do exactly that from
internal C stuff in pglogical to support following physical failover,
though. I'm not thrilled by that, especially since we have no way (in 9.6)
to insert generic WAL records in xlog to allow the slot updates to
accurately pace replay. I don't mean logical wal messages here, they
wouldn't help, it'd actually be pluggable generic WAL that we'd need to
sync slots fully consistently in the absence of failover slots.
However: It's safe for the slot state on the replica to be somewhat behind
the local replay from the master, so the slot state on the replica is older
than what it would've been at an equivalent time on the master... so long
as it's not so far behind that the replica has replayed vacuum activity
that removes rows still required by the slot's declared catalog_xmin. Even
then it should actually be OK in practice because the failing-over client
will always have replayed past that point on the master (otherwise
catalog_xmin couldn't have advanced on the master), so it'll always ask to
start replay at a point at or after the lsn where the catalog_xmin was
advanced to its most recent value on the old master before failover. It's
only safe for walsender based decoding though, since there's no way to
specify the startpoint for sql-level decoding.
My only real worry there is whether invalidations are a concern - if
internal replay starts at the restart_lsn and replays over part of history
where the catalog entries have been vacuumed before it starts collecting
rows for return to the decoding client at the requested decoding
startpoint, can that cause problems with invalidation processing? I haven't
got my head around what, exactly, logical decoding is doing with its
invalidations stuff yet. I didn't find any problems in testing, but wasn't
sure how to create the conditions under which failures might occur.
It's also OK for the slot state to be *ahead* of the replica's replay
position, i.e. from the future. restart_lsn and catalog_xmin will be higher
than they were on the master at the same point in time, but that doesn't
matter at all, since the client will always be requesting to start from
that point or later, having replayed up to there on the old master before
failover.
It's even possible for the restart_lsn and catalog_xmin to be in the
replica's future, i.e. the lsn > the current replay lsn and the
catalog_xmin greater than the next xid. In that case we know the logical
client replayed state from the old master that hasn't been applied to the
replica by the time it's promoted. If this state of affairs exists when we
promote a replica to a master we should really kill the slot, since the
client has obviously replayed from the old master past the point of
divergence with the promoted replica and there's a potentially an
unresolvable timeline fork. I'd like to add this if I can manage it,
probably by WARNing a complaint then dropping the slot.
Needless to say I'd really rather just have failover slots, where we know
the slot will be consistent with the DB state and updated in sync with it.
This is a fallback plan and I don't like it too much.
The test code needs
a big disclaimer to never ever be used in production, and we should
"disclaim any warranty" if somebody does that. To the point of not
fixing issues around it in back branches.
I agree. The README has just that, and there are comments above the
individual functions like:
* Note that this is test harness code. You shouldn't expose slot internals
* to SQL like this for any real world usage. See the README.
Andres, I tried to address your comments as best I could. The main one
thatI think stayed open was about the loop that finds the last timeline on a
segment. If you think that's better done by directly scanning the List*of
timeline history entries I'm happy to prep a follow-up.
Have to look again.
+ * We start reading xlog from the restart lsn, even though in + * CreateDecodingContext we set the snapshot builder up using the + * slot's confirmed_flush. This means we might read xlog we don't + * actually decode rows from, but the snapshot builder might need it + * to get to a consistent point. The point we start returning data to + * *users* at is the confirmed_flush lsn set up in the decoding + * context. + */ still seems pretty misleading - and pretty much unrelated to the callsite.
I think it's pretty confusing that the function uses's the slot's
restart_lsn and never refers to confirmed_flush which is where it actually
starts decoding from as far as the user's concerned. It took me a while to
figure out what was going on there.
I think the comment would be less necessary, and could be moved up to the
CreateDecodingContext call, if we passed the slot's confirmed_flush
explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
and having the CreateDecodingContext call infer the start point. That way
what's going on would be a lot clearer when reading the code.
You're thinking from the perspective of someone who knows this code
intimately. It's obvious that restart_lsn controls where the xlogreader
starts processing and feeding into the snapshot builder to you. Similarly,
it'll be obvious that the decoding context defaults to the confirmed_flush
point which is where we start returning decoded rows to users. This really
will NOT be obvious to most readers though, and I think we could use making
the logical decoding code easier to understand as it gets more and more
use. I'm trying to help with that, and while I suffer from not knowing it
as well as you, that also means I can still see some of the things that
might be confusing to newer readers. Then get them wrong when explaining
them ;)
Speaking of which, did you see the proposed README I sent for
src/backend/replication/logical ?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2016-04-01 13:16:18 +0800, Craig Ringer wrote:
I think it's pretty unsafe from SQL, to be sure.
Unless failover slots get in to 9.6 we'll need to do exactly that from
internal C stuff in pglogical to support following physical failover,
I know. And this makes me scared shitless.
I'll refuse to debug anything related to decoding, when this "feature"
has been used. And I think there'll be plenty; if you notice the issues
that is.
The more I think about it, the more I think we should rip this out
again, until we have the actual feature is in. With proper review.
However: It's safe for the slot state on the replica to be somewhat behind
the local replay from the master, so the slot state on the replica is older
than what it would've been at an equivalent time on the master... so long
as it's not so far behind that the replica has replayed vacuum activity
that removes rows still required by the slot's declared catalog_xmin. Even
then it should actually be OK in practice because the failing-over client
will always have replayed past that point on the master (otherwise
catalog_xmin couldn't have advanced on the master), so it'll always ask to
start replay at a point at or after the lsn where the catalog_xmin was
advanced to its most recent value on the old master before failover. It's
only safe for walsender based decoding though, since there's no way to
specify the startpoint for sql-level decoding.
This whole logic fails entirely flats on its face by the fact that even
if you specify a startpoint, we read older WAL and process the
records. The startpoint filters every transaction that commits earlier
than the "startpoint". But with a long-running transaction you still
will have old changes being processed, which need the corresponding
catalog state.
My only real worry there is whether invalidations are a concern - if
internal replay starts at the restart_lsn and replays over part of history
where the catalog entries have been vacuumed before it starts collecting
rows for return to the decoding client at the requested decoding
startpoint, can that cause problems with invalidation processing?
Yes.
It's also OK for the slot state to be *ahead* of the replica's replay
position, i.e. from the future. restart_lsn and catalog_xmin will be higher
than they were on the master at the same point in time, but that doesn't
matter at all, since the client will always be requesting to start from
that point or later, having replayed up to there on the old master before
failover.
I don't think that's true. See my point above about startpoint.
Andres, I tried to address your comments as best I could. The main one
thatI think stayed open was about the loop that finds the last timeline on a
segment. If you think that's better done by directly scanning the List*of
timeline history entries I'm happy to prep a follow-up.
Have to look again.
+ * We start reading xlog from the restart lsn, even though in + * CreateDecodingContext we set the snapshot builder up using the + * slot's confirmed_flush. This means we might read xlog we don't + * actually decode rows from, but the snapshot builder might need it + * to get to a consistent point. The point we start returning data to + * *users* at is the confirmed_flush lsn set up in the decoding + * context. + */ still seems pretty misleading - and pretty much unrelated to the callsite.I think it's pretty confusing that the function uses's the slot's
restart_lsn and never refers to confirmed_flush which is where it actually
starts decoding from as far as the user's concerned. It took me a while to
figure out what was going on there.
That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs). The startpoint isn't about where to start
decoding. It's about skip calling the output plugin of a transaction if
it *commits* before the startpoint. We almost always will *decode* rows
from before the startpoint. And that's pretty much unavoidable: The
consumer of a decoding slot only really can do anything with commit LSNs
wrt replay progress: But for a remembered progress LSN there can be a
lot of in-progress xacts (about which said client doesn't know
anything); and they start *earlier* than the commit LSN of the just
replayed xact. So we have to be able to re-process their state and send
it out.
I think the comment would be less necessary, and could be moved up to the
CreateDecodingContext call, if we passed the slot's confirmed_flush
explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
and having the CreateDecodingContext call infer the start point. That way
what's going on would be a lot clearer when reading the code.
How would that solve anything? We still need to process the old records,
hence the xlogreader needs to instructed to read them. Hence
logicalfuncs.c needs to know about it.
You're thinking from the perspective of someone who knows this code
intimately.
Maybe. But that's not addressed by adding half-true comments to places
they only belong to peripherally. And not to all the relevant places,
since you've not added the same comment to StartLogicalReplication().
Speaking of which, did you see the proposed README I sent for
src/backend/replication/logical ?
I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2016-04-01 08:46:01 +0200, Andres Freund wrote:
That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs).
Speaking of which, did you see the proposed README I sent for
src/backend/replication/logical ?I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago.
For that purpos
WRT design readme, it might be interesting to look at 0009 in
http://archives.postgresql.org/message-id/20140127162006.GA25670%40awork2.anarazel.de
That's not up2date obviously, but it still might help.
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers