Rework the way multixact truncations work
Hi,
As discussed on list, over IM and in person at pgcon I want to make
multixact truncations be WAL logged to address various bugs.
Since that's a comparatively large and invasive change I thought it'd be
a good idea to start a new thread instead of burying it in a already
long thread.
Here's the commit message which hopefully explains what's being changed
and why:
Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.
We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.
This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
this has gotten much easier
During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from memory the member's slru before deleting
segments. This happend to be hard or impossible to hit due to the
interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
that doesn't work for offsets that haven't yet been flushed to
disk. Flush out before running to fix. Not pretty, but it feels
slightly safer to only make decisions based on on-disk state.
To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before this now happens in the startup process, when replaying a
checkpoint record, instead of the checkpointer. Doing this in the
restartpoint was incorrect, they can happen much later than the original
checkpoint, thereby leading to wraparound. It's also more in line to how
the WAL logging now works.
To avoid "multixact_redo: unknown op code 48" errors standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.
Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.
I've tested this a bunch, including using a newer standby against a
older master and such. What I have yet to test is that the concurrency
protections against multiple backends truncating at the same time are
correct.
It'd be very welcome to see some wider testing and review on this.
I've attached three commits:
0001: Add functions to burn through multixacts - that should get its own file.
0002: Lower the lower bound limits for *_freeze_max_age - I think we should
just do that. There really is no reason for the current limits
and they make testing hard and force space wastage.
0003: The actual truncation patch.
Greetings,
Andres Freund
Attachments:
0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patchtext/x-patch; charset=us-asciiDownload+61-9
0002-Lower-_freeze_max_age-minimum-values.patchtext/x-patch; charset=us-asciiDownload+3-4
0003-Rework-the-way-multixact-truncations-work.patchtext/x-patch; charset=us-asciiDownload+423-303
Andres Freund wrote:
Rework the way multixact truncations work.
I spent some time this morning reviewing this patch and had some
comments that I relayed over IM to Andres. The vast majority is
cosmetic, but there are two larger things:
1. I think this part of PerformMembersTruncation() is very confusing:
/* verify whether the current segment is to be deleted */
if (segment != startsegment && segment != endsegment)
SlruDeleteSegment(MultiXactMemberCtl, segment);
I think this works correctly in that it preserves both endpoint files,
but the files in between are removed ... which is a confusing interface,
IMO. I think this merits a longer explanation.
2. We set PGXACT->delayChkpt while the truncation is executed. This
seems reasonable, and there's a good reason for it, but all the other
users of this facility only do small operations with this thing grabbed,
while the multixact truncation could take a long time because a large
number of files might be deleted. Maybe it's not a problem to have
checkpoints be delayed by several seconds, or who knows maybe even a
minute in a busy system. (We will have checkpointer sleeping in 10ms
intervals until the truncation is complete).
Maybe this is fine, not sure.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-06-26 14:48:35 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
Rework the way multixact truncations work.
I spent some time this morning reviewing this patch and had some
comments that I relayed over IM to Andres.
Thanks for that!
2. We set PGXACT->delayChkpt while the truncation is executed. This
seems reasonable, and there's a good reason for it, but all the other
users of this facility only do small operations with this thing grabbed,
while the multixact truncation could take a long time because a large
number of files might be deleted. Maybe it's not a problem to have
checkpoints be delayed by several seconds, or who knows maybe even a
minute in a busy system. (We will have checkpointer sleeping in 10ms
intervals until the truncation is complete).
I don't think this is a problem. Consider that we're doing all this in
the checkpointer today, blocking much more than just the actual xlog
insertion. That's a bigger problem, as we'll not do the paced writing
during that and such. The worst thatthis can cause is a bunch of sleeps,
that seems fairly harmless.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:
It'd be very welcome to see some wider testing and review on this.
I started looking at this and doing some testing. Here is some
initial feedback:
Perhaps vac_truncate_clog needs a new name now that it does more,
maybe vac_truncate_transaction_logs?
MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?
In the struct xl_multixact_truncate, and also the function
WriteMTruncateXlogRec and other places, I think you have confused the
typedefs MultiXactOffset and MultiXactId. If I'm not mistaken,
startMemb and endMemb have the correct type, but startOff and endOff
should be of type MultiXactId despite their names (the *values* stored
inside pg_multixact/offsets are indeed offsets (into
pg_multixact/members), but their *location* is what a multixact ID
represents).
I was trying to understand if there could be any problem caused by
setting latest_page_number to the pageno that holds (or will hold)
xlrec.endOff in multixact_redo. The only thing that jumps out at me
is that the next call to SlruSelectLRUPage will no longer be prevented
from evicting the *real* latest page (the most recently created page).
In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
control lock?
In find_multixact_start you call SimpleLruFlush before calling
SimpleLruDoesPhysicalPageExist. Should we do something like this
instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a
I think saw some extra autovacuum activity that I didn't expect in a
simple scenario, but I'm not sure and will continue looking tomorrow.
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-06-29 23:54:40 +1200, Thomas Munro wrote:
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:
It'd be very welcome to see some wider testing and review on this.
I started looking at this and doing some testing. Here is some
initial feedback:Perhaps vac_truncate_clog needs a new name now that it does more,
maybe vac_truncate_transaction_logs?
It has done more before, so I don't really see a connection to this
patch...
MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?
Oops.
In the struct xl_multixact_truncate, and also the function
WriteMTruncateXlogRec and other places, I think you have confused the
typedefs MultiXactOffset and MultiXactId. If I'm not mistaken,
startMemb and endMemb have the correct type, but startOff and endOff
should be of type MultiXactId despite their names (the *values* stored
inside pg_multixact/offsets are indeed offsets (into
pg_multixact/members), but their *location* is what a multixact ID
represents).
IIRC I did it that way to make clear this is just 'byte' type offsets,
without other meaning. Wasn't a good idea.
I was trying to understand if there could be any problem caused by
setting latest_page_number to the pageno that holds (or will hold)
xlrec.endOff in multixact_redo. The only thing that jumps out at me
is that the next call to SlruSelectLRUPage will no longer be prevented
from evicting the *real* latest page (the most recently created page).
That hasn't changed unless I miss something?
In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
control lock?
I think it's safer than not doing it, but don't particularly care.
In find_multixact_start you call SimpleLruFlush before calling
SimpleLruDoesPhysicalPageExist. Should we do something like this
instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a
I'm currently slightly inclined to do it "my way". They way these
functions are used it doesn't seem like a bad property to ensure things
are on disk.
I think saw some extra autovacuum activity that I didn't expect in a
simple scenario, but I'm not sure and will continue looking tomorrow.
Cool, thanks!
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
I'm still working on this patch. I found a bunch of issues. Non of them
super critical and some pre-existing; but nonetheless I don't feel like
it's ready to push yet. So we'll have the first alpha without those
fixes :(
The biggest issues so far are:
* There's, both in the posted patch and as-is in all the branches, a lot
of places that aren't actually safe against a concurrent truncation. A
bunch of places grab oldestMultiXactId with an lwlock held, release
it, and then make decisions based on that.
A bunch of places (including the find_multixact_start callsites) is
actually vulnerable to that, for a bunch of others it's less likely to
be a problem. All callers of GetMultiXactIdMembers() are vulnerable,
but with the exception of pg_get_multixact_members() they'll never
pass in a value that's older than the new oldest member value.
That's a problem for the current branches. Afaics that can lead to a
useless round of emergency autovacs via
SetMultiXactIdLimit()->SetOffsetVacuumLimit().
SetOffsetVacuumLimit() can protect easily agains that by taking the
new MultiXactTruncationLock lock. We could do the same for
pg_get_multixact_members() - afaics the only caller that'll look up too
old values otherwise - but I don't think it matters, you'll get a
slightly obscure error if you access a too old xact that's just being
truncated away and that is that.
* There was no update of the in-memory oldest* values when replaying a
truncation. That's problematic if a standby is promoted after
replaying a truncation record, but before a following checkpoint
record. This would be fixed by an emergency autovacuum, but that's
obviously not nice. Trivial to fix.
* The in-memory oldest values were updated *after* the truncation
happened. It's unlikely to matter in reality, but it's safer to
update them before, so a concurrent GetMultiXactIdMembers() of stuff
from before the truncation will get the proper error.
* PerformMembersTruncation() probably confused Alvaro because it wasn't
actually correct - there's no need to have the segment containing old
oldestOffset (in contrast to oldestOffsetAlive) survive. Except
leaking a segment that's harmless, but obviously not desirable.
Additionally I'm changing some stuff, some requested by review:
* xl_multixact_truncate's members are now called
(start|end)Trunc(Off|Memb)
* (start|end)TruncOff have the appropriate type now
* typo fixes
* comment improvements
* pgindent
New version attached.
Greetings,
Andres Freund
Attachments:
0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patchtext/x-patch; charset=us-asciiDownload+61-9
0002-Lower-_freeze_max_age-minimum-values.patchtext/x-patch; charset=us-asciiDownload+3-4
0003-Rework-the-way-multixact-truncations-work.patchtext/x-patch; charset=us-asciiDownload+471-308
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
New version attached.
0002 looks good, but the commit message should perhaps mention the
comment fix. Or commit that separately.
Will look at 0003 next.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Will look at 0003 next.
+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
I don't think we typically use this style for notating intervals.
case XLOG_MULTIXACT_CREATE_ID:
id = "CREATE_ID";
break;
+ case XLOG_MULTIXACT_TRUNCATE_ID:
+ id = "TRUNCATE";
+ break;
If XLOG_MULTIXACT_CREATE_ID -> "CREATE_ID", then why not
XLOG_MULTIXACT_TRUNCATE_ID -> "TRUNCATE_ID"?
+ * too old to general truncation records.
s/general/generate/
+ MultiXactId oldestMXactDB;
Data type should be OID.
+ * Recompute limits as we are now fully started, we now can correctly
+ * compute how far a members wraparound is away.
s/,/:/ or something. This isn't particularly good English as written.
+ * Computing the actual limits is only possible once the data directory is
+ * in a consistent state. There's no need to compute the limits while
+ * still replaying WAL as no new multis can be created anyway. So we'll
+ * only do further checks after TrimMultiXact() has been called.
Multis can be and are created during replay. What this should really
say is that we have no choice about whether to create them or not: we
just have to replay whatever's there.
+ (errmsg("performing legacy multixact truncation,
upgrade master")));
This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.
I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact(). The checkpoint hasn't really happened at that
point yet; you might truncate away stuff, then crash before the
checkpoint is complete, and then we you restart recovery, you've got
trouble. I think you should be very, very cautious about rejiggering
the order of operations here. The current situation is not good, but
casually rejiggering it can make things much worse.
- * If no multixacts exist, then oldestMultiXactId will be the next
- * multixact that will be created, rather than an existing multixact.
+ * Determine the offset of the oldest multixact. Normally, we can read
+ * the offset from the multixact itself, but there's an important special
+ * case: if there are no multixacts in existence at all, oldestMXact
+ * obviously can't point to one. It will instead point to the multixact
+ * ID that will be assigned the next time one is needed.
There's no need to change this; it means the same thing either way.
Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here. The existing code is careful never to set
oldestOffsetKnown false when it was previously true. Your rewrite
removes that property. Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.
I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush(). I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background. If somebody called
find_multixact_start() with any frequency, you'd be sad. You're just
doing it this way because of the context *in which you expect
find_multixact_start* to be called, which does not seem very
future-proof. I prefer Thomas's approach.
If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?
+ * Make sure to only attempt truncation if there's values to truncate
+ * away. In normal processing values shouldn't go backwards, but there's
+ * some corner cases (due to bugs) where that's possible.
I think this comment should be more detailed. Is that talking about
the same thing as this comment:
- * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
- * oldestMXact might point to a multixact that does not exist.
- * Autovacuum will eventually advance it to a value that does exist,
- * and we want to set a proper offsetStopLimit when that happens,
- * so call DetermineSafeOldestOffset here even if we're not actually
- * truncating.
This comment seems to be saying there's a race condition:
+ * XXX: It's also possible that the page that oldestMXact is on has
+ * already been truncated away, and we crashed before updating
+ * oldestMXact.
But why is that an XXX? I think this is just a case of recovery
needing tolerate redo of an action already redone.
I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:
- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.
The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.
Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused. We might be
kind of hosed in that situation anyway, because TruncateMultiXact()
and some other places assume that circular comparisons will return
sensible values. But that could be fixed, and probably should be
fixed eventually.
+ (errmsg("supposedly still alive MultiXact %u not
found, skipping truncation",
Maybe "cannot truncate MultiXact %u because it does not exist on disk,
skipping truncation"?
I think "frozenMulti" is a slightly confusing variable name and
deserves a comment. AUIU, that's the oldest multiXact we need to
keep. So it's actually the oldest multi that is NOT guaranteed to be
frozen. minMulti might be a better variable name, but a comment is
justified either way.
+ * Update in-memory limits before performing the truncation, while inside
+ * the critical section: Have to do it before truncation, to prevent
+ * concurrent lookups of those values. Has to be inside the critical
+ * section asotherwise a future call to this function would error out,
+ * while looking up the oldest member in offsets, if our caller crashes
+ * before updating the limits.
Missing space: asotherwise.
Who else might be concurrently looking up those values? Nobody else
can truncate while we're truncating, because we hold
MultiXactTruncationLock. And nobody else should be getting here from
looking up tuples, because if they are, we truncated too soon.
- * Startup MultiXact. We need to do this early for two reasons: one is
- * that we might try to access multixacts when we do tuple freezing, and
- * the other is we need its state initialized because we attempt
- * truncation during restartpoints.
+ * Startup MultiXact, we need to do this early, to be able to replay
+ * truncations.
The period after "Startup MultiXact" was more correct than the comma
you've replaced it with.
Phew. That's all I see on a first read-through, but I've only spent a
couple of hours on this, so I might easily have missed some things.
But let me stop here, hit send, and see what you think of these
comments.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Will look at 0003 next.
+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
I don't think we typically use this style for notating intervals.
I don't think we really have a very consistent style for xlog messages -
this seems to describe the meaning accurately?
[several good points]
+ (errmsg("performing legacy multixact truncation,
upgrade master")));This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().
But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?
The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then we you restart recovery, you've got trouble.
We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum revovery
location?
I think you should
be very, very cautious about rejiggering the order of operations here.
The current situation is not good, but casually rejiggering it can
make things much worse.
The current placement - as part of the restartpoint - is utterly broken
and unpredictable. There'll frequently be no restartpoints performed at
all (due to different checkpoint segments, slow writeout, or pending
actions). Because there's no careful timing of when this happens it's
much harder to understand the exact state in which the truncation
happens - I think moving it to a specific location during replay makes
things considerably easier.
Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here. The existing code is careful never to set
oldestOffsetKnown false when it was previously true. Your rewrite
removes that property. Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.
A lot of that has to do that the whole stuff about truncations happening
during checkpoints is gone and that thus the split with
DetermineSafeOldestOffset() doesn't make sense anymore.
That oldestOffsetKnown is unset is wrong - the if (prevOldestOffsetKnown
&& !oldestOffsetKnown) block should be a bit earlier.
I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush(). I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background. If somebody called
find_multixact_start() with any frequency, you'd be sad. You're just
doing it this way because of the context *in which you expect
find_multixact_start* to be called, which does not seem very
future-proof. I prefer Thomas's approach.
I don't strongly care, but I do think it has some value to be sure about
the on-disk state for the current callers. I think it'd be a pretty odd
thing to call find_multixact_start() frequently.
If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?
That seems like premature optimization to me. And one that's not that
easy to do correctly - what if the current caller actually has a new,
lower, minimum mxid?
+ * XXX: It's also possible that the page that oldestMXact is on has + * already been truncated away, and we crashed before updating + * oldestMXact.But why is that an XXX? I think this is just a case of recovery
needing tolerate redo of an action already redone.
Should rather have been NB.
I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:
- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.
Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused.
Unless I miss something the stop limits will prevent that from
happening? SetMultiXactIdLimit() is called only *after* the truncation
has finished?
+ * Update in-memory limits before performing the truncation, while inside + * the critical section: Have to do it before truncation, to prevent + * concurrent lookups of those values. Has to be inside the critical + * section asotherwise a future call to this function would error out, + * while looking up the oldest member in offsets, if our caller crashes + * before updating the limits.Missing space: asotherwise.
Who else might be concurrently looking up those values? Nobody else
can truncate while we're truncating, because we hold
MultiXactTruncationLock. And nobody else should be getting here from
looking up tuples, because if they are, we truncated too soon.
pg_get_multixact_members(), a concurrent call to SetMultiXactIdLimit()
(SetOffsetLimit()->find_multixact_start()) from vac_truncate_clog().
Phew. That's all I see on a first read-through, but I've only spent a
couple of hours on this, so I might easily have missed some things.
But let me stop here, hit send, and see what you think of these
comments.
Thanks for the look so far!
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Will look at 0003 next.
+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
I don't think we typically use this style for notating intervals.
I don't think we really have a very consistent style for xlog messages -
this seems to describe the meaning accurately?
Although I realize this is supposed to be interval notation, I'm not
sure everyone will immediately figure that out. I believe it has
created some confusion in the past. I'm not going to spend a lot of
time arguing with you about it, but I'd do something else, like
offsets from %u stop before %u, members %u stop before %u.
[several good points]
+ (errmsg("performing legacy multixact truncation,
upgrade master")));This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?
I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.
The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then we you restart recovery, you've got trouble.We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum revovery
location?
You're still going to have to read the SLRU for as long as you are
doing legacy truncations, at least.
If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?That seems like premature optimization to me. And one that's not that
easy to do correctly - what if the current caller actually has a new,
lower, minimum mxid?
Doesn't the next truncation just catch up? But sure, I agree this is
inessential (and maybe better left alone for now).
I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused.Unless I miss something the stop limits will prevent that from
happening? SetMultiXactIdLimit() is called only *after* the truncation
has finished?
Hmm, that might be, I'd have to reread the patch. The reason we
originally had it this way was because VACUUM was updating the limit
and then checkpoint was truncating, but now I guess vacuum + truncate
happen so close together that you might only need one value.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
(quick answer, off now)
On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.
The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.
That's a currently existing bug that's easy to reproduce.
The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then we you restart recovery, you've got trouble.We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum revovery
location?You're still going to have to read the SLRU for as long as you are
doing legacy truncations, at least.
I'm not following. Sure, we read the SLRUs as we do today. But, in
contrast to the current positioning in recovery, with the patch they're
done at pretty much the same point on the standby as on the primary
today?
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
(quick answer, off now)
On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.That's a currently existing bug that's easy to reproduce.
You might be right; I haven't tested that.
On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On July 5, 2015 8:50:57 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de>
wrote:(quick answer, off now)
On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de>
wrote:
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
I seriously, seriously doubt that it is a good idea to perform
the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().But where should TruncateMultiXact() be called from? I mean, we
could
move the logic from inside MultiXactAdvanceOldest() to some
special case
in the replay routine, but what'd be the advantage?
I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due different checkpointsegments
settings), may not happen at all (pending incomplete actions), and
can
just be slowed down.
That's a currently existing bug that's easy to reproduce.
You might be right; I haven't tested that.
On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.
The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.
I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:
On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.
I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.
So, where are we with this patch?
In my opinion, we ought to do something about master and 9.5 before
beta, so that we're doing *yet another* major release with unfixed
multixact bugs. Let's make the relevant truncation changes in master
and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
be used with a 9.5beta master. Then, we don't need any of this legacy
truncation stuff at all, and 9.5 is hopefully in a much better state
than 9.4 and 9.3.
Now, that still potentially leaves 9.4 and 9.3 users hanging out to
dry. But we don't have a tremendous number of those people clamoring
about this, and if we get 9.5+ correct, then we can go and change the
logic in 9.4 and 9.3 later when, and if, we are confident that's the
right thing to do. I am still not altogether convinced that it's a
good idea, nor am I altogether convinced that this code is right.
Perhaps it is, and if we consensus on it, fine. But regardless of
that, we should not send a third major release to beta with the
current broken system unless there is really no viable alternative.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:
On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.
I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.
So, where are we with this patch?
Uh. I'd basically been waiting on further review and then forgot about
it.
In my opinion, we ought to do something about master and 9.5 before
beta, so that we're doing *yet another* major release with unfixed
multixact bugs. Let's make the relevant truncation changes in master
and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
be used with a 9.5beta master. Then, we don't need any of this legacy
truncation stuff at all, and 9.5 is hopefully in a much better state
than 9.4 and 9.3.
Hm.
Now, that still potentially leaves 9.4 and 9.3 users hanging out to
dry. But we don't have a tremendous number of those people clamoring
about this, and if we get 9.5+ correct, then we can go and change the
logic in 9.4 and 9.3 later when, and if, we are confident that's the
right thing to do. I am still not altogether convinced that it's a
good idea, nor am I altogether convinced that this code is right.
Perhaps it is, and if we consensus on it, fine.
To me the current logic is much worse than what's in the patch, so I
don't think that's the best way to go. But I'm not not absolutely gung
ho on that.
But regardless of that, we should not send a third major release to
beta with the current broken system unless there is really no viable
alternative.
Agreed. I'll update the patch.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/21/2015 07:36 AM, Andres Freund wrote:
On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
So, where are we with this patch?
Uh. I'd basically been waiting on further review and then forgot about
it.
Does the current plan to never expire XIDs in 9.6 affect multixact
truncation at all?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Import Notes
Reply to msg id not found: WMb5e3b355c05ab6f6dd4914ec4f4f520ed2de236421db14eb6913a998e6cf22acd2cf78c7e58756f337724e769a1b0dc3@asav-3.01.com
On 2015-09-21 10:30:59 -0700, Josh Berkus wrote:
On 09/21/2015 07:36 AM, Andres Freund wrote:
On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
So, where are we with this patch?
Uh. I'd basically been waiting on further review and then forgot about
it.Does the current plan to never expire XIDs in 9.6 affect multixact
truncation at all?
I doubt that it'd in a meaningful manner. Truncations will still need to
happen to contain space usage.
Besides, I'm pretty sceptical of shaping the design of bug fixes to suit
some unwritten feature we only know the highest level design of as of
yet.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-07-02 11:52:04 -0400, Robert Haas wrote:
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
New version attached.
0002 looks good, but the commit message should perhaps mention the
comment fix. Or commit that separately.
I'm inclined to backpatch the applicable parts to 9.0 - seems pointless
to have differing autovacuum_freeze_max_age values and the current value
sucks for testing and space consumption there as well.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-21 16:36:03 +0200, Andres Freund wrote:
Agreed. I'll update the patch.
Here's updated patches against master. These include the "legacy"
truncation support. There's no meaningful functional differences in this
version except addressing the review comments that I agreed with, and a
fair amount of additional polishing.
I've not:
* removed legacy truncation support
* removed SimpleLruFlush() from find_multixact_start() - imo it's easier
to reason about the system when the disk state is in sync with the in
memory state.
* removed the interval syntax from debug messages and xlogdump - they're
a fair bit more concise and the target audience of those will be able
to figure it out.
* unsplit DetermineSafeOldestOffset & SetOffsetVacuumLimit - imo the
separate functions don't make sense anymore now that limits and
truncations aren't as separate anymore.
What I've tested is the following:
* continous burning of multis, both triggered via members and offsets
* a standby keeping up when the primary is old
* a standby keeping up when the primary is new
* basebackups made while a new primary is under load
* verified that we properly PANIC when a truncation record is replayed
in an old standby.
Does anybody have additional tests in mind?
I plan to push 0002 fairly soon, it seemed to be pretty
uncontroversial. I'll then work tomorrow afternoon on producing branch
specific versions of 0003 and on producing 0004 removing all the legacy
stuff for 9.5 & master.
Greetings,
Andres Freund
Attachments:
0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patchtext/x-patch; charset=us-asciiDownload+61-9
0002-Lower-_freeze_max_age-minimum-values.patchtext/x-patch; charset=us-asciiDownload+3-4
0003-Rework-the-way-multixact-truncations-work.patchtext/x-patch; charset=us-asciiDownload+507-321
On Tue, Sep 22, 2015 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-21 16:36:03 +0200, Andres Freund wrote:
Agreed. I'll update the patch.
Here's updated patches against master. These include the "legacy"
truncation support. There's no meaningful functional differences in this
version except addressing the review comments that I agreed with, and a
fair amount of additional polishing.
0002 looks fine.
Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4. This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances. I admit that I
don't see a specific way that that can go wrong, but there are so many
different old versions with slightly different multixact truncation
behaviors that it seems very hard to be sure that we're not going to
make things worse rather than better by introducing yet another
approach to the problem. I realize that you disagree and will
probably commit this to those branches anyway. But I want it to be
clear that I don't endorse that.
I wish more people were paying attention to these patches. These are
critical data-corrupting bugs, the code in question is very tricky,
it's been majorly revised multiple times, and we're revising it again.
And nobody except me and Andres is looking at this, and I'm definitely
not smart enough to get this all right.
Other issues:
- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.
Assorted minor nitpicking:
- "happend" is misspelled in the commit message for 0003
- "in contrast to before" should have a comma after it, also in that
commit message
- "how far the next members wraparound is away" -> "how far away the
next members wraparound is"
- "seing" -> "seeing"
- "Upgrade the primary," -> "Upgrade the primary;"
- "toMultiXact" -> "to MultiXact"
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers