Summary and Plan for Hot Standby
After some time thinking about the best way forward for Hot Standby, I
have some observations and proposals.
First, the project is very large. We have agreed ways to trim the patch,
yet it remains large. Trying to do everything in one lump is almost
always a bad plan, so we need to phase things.
Second, everybody is keen that HS hits the tree, so we can have alpha
code etc. There are a few remaining issues that should *not* be rushed.
The only way to remove this dependency is to decouple parts of the
project.
Third, testing the patch is difficult and continuous change makes it
harder to guarantee everything is working.
There are two remaining areas of significant thought/effort:
* Issues relating to handling of prepared transactions
* How fast Hot Standby mode is enabled in the standby
I propose that we stabilise and eventually commit a version of HS that
circumvents/defers those issues and then address the issues with
separate patches afterwards. This approach will allow us to isolate the
areas of further change so we can have a test blitz to remove silly
mistakes, then follow it with a commit to CVS, and then release as Alpha
to allow further testing.
Let's look at the two areas of difficulty in more detail:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock. It is straightforward to say, as an interim measure,
"Hot Standby will not work with max_prepared_transactions > 0". I see
that this has a fiddly, yet fairly clear solution.
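For illustration, the interim restriction could be as small as a check
at recovery start. This is a sketch, not the patch's actual code;
"standby_connections_enabled" is a hypothetical flag, while
max_prepared_xacts is the real backing variable of the
max_prepared_transactions GUC:

	if (standby_connections_enabled && max_prepared_xacts > 0)
		ereport(FATAL,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("Hot Standby is not supported when max_prepared_transactions > 0")));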
* How fast Hot Standby mode is enabled in the standby
We need to have full snapshot information on the standby before we can
allow connections and queries. There are two basic approaches: i) we
wait until we *know* we have full info or ii) we try to collect data and
inject a correct starting condition. Waiting (i) may take a while, but
is clean and requires only a few lines of code. Injecting the starting
condition (ii) requires boatloads of hectic code and we have been unable
to agree a way forwards. If we did have that code, all it would give us
is a faster/more reliable starting point for connections on the standby.
Until we can make approach (ii) work, we should just rely on the easy
approach (i). In many cases, the starting point is very similar. (In
some cases we can actually make (i) faster because the overhead of data
collection forces us to derive the starting conditions minutes apart.)
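For illustration, approach (i) is conceptually just a check each time a
running-xacts record is replayed. A sketch, in which every identifier
is illustrative rather than the patch's:

	/*
	 * Sketch only: once a replayed running-xacts record shows that no
	 * running transaction can have subtransactions we don't know about,
	 * the standby's snapshot information is complete and read-only
	 * connections can be allowed.
	 */
	static void
	CheckRecoverySnapshotComplete(RunningXactsRecord *rec)
	{
		if (rec->subxidOverflow)
			return;				/* not safe yet; keep waiting */

		EnableHotStandbyConnections();
	}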
Phasing the commit seems like the only way.
Please can we agree a way forwards?
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 09:06, Simon Riggs <simon@2ndquadrant.com> wrote:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock. It is straightforward to say, as an interim measure,
"Hot Standby will not work with max_prepared_transactions > 0". I see
that this has a fiddly, yet fairly clear solution.
If that restriction solves the issues we have now, I find it a
perfectly reasonable restriction, even if it were to still be there
past release and only get fixed in a future release. The vast
majority of our users don't use 2PC at all. Most cases where people
had it enabled were because it was enabled by default, and in the
large majority of cases where I've seen people increase it, it was
actually because they thought it meant prepared statements, not
prepared transactions.
So definitely +1.
* How fast Hot Standby mode is enabled in the standby
We need to have full snapshot information on the standby before we can
allow connections and queries. There are two basic approaches: i) we
wait until we *know* we have full info or ii) we try to collect data and
inject a correct starting condition. Waiting (i) may take a while, but
is clean and requires only a few lines of code. Injecting the starting
condition (ii) requires boatloads of hectic code and we have been unable
to agree a way forwards. If we did have that code, all it would give us
is a faster/more reliable starting point for connections on the standby.
Until we can make approach (ii) work, we should just rely on the easy
approach (i). In many cases, the starting point is very similar. (In
some cases we can actually make (i) faster because the overhead of data
collection forces us to derive the starting conditions minutes apart.)
That also seems perfectly reasonable, depending on how long the
waiting on (i) will be :-) What does the time depend on?
Phasing the commit seems like the only way.
Yeah, we usually recommend that in other cases, so I don't see why it
shouldn't apply to HS. Seems like a good way forward.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:
What does the time depend on?
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.
The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.
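For reference, the 64 is the size of the per-backend subxid cache in
the proc array. A sketch of the overflow, where PGPROC_MAX_CACHED_SUBXIDS
is the real constant and the field names are approximations:

	/*
	 * Sketch only: when a subtransaction is assigned an xid, the
	 * backend tries to cache it in its shared proc array entry; once
	 * the cache is full it can only mark itself overflowed, and the
	 * parent-child mapping then exists solely in pg_subtrans.
	 */
	if (proc->nsubxids < PGPROC_MAX_CACHED_SUBXIDS)
		proc->subxids[proc->nsubxids++] = subxid;
	else
		proc->subxidsOverflowed = true;	/* snapshot no longer complete */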
There are various ways around this, and I'm certain we'll come up with
something ingenious, but my main point is that we don't need to wait for
this issue to be solved in order for HS to be usable.
--
Simon Riggs www.2ndQuadrant.com
On Sunday, November 15, 2009, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:
What does the time depend on?
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.

The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.

There are various ways around this, and I'm certain we'll come up with
something ingenious, but my main point is that we don't need to wait for
this issue to be solved in order for HS to be usable.
Yeah, with that explanation (thanks for clearing it up) I agree - it
will definitely still be hugely useful even with this restriction, so
we really don't need to delay an initial (or the alpha at least)
commit.
Thus, +1 on the second one as well :)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Simon Riggs wrote:
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.

The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.
WAL-logging every new subtransaction wouldn't actually help. The problem
with subtransactions is that if the subxid cache overflows in the proc
array in the master, the information about the parent-child
relationship is only stored in pg_subtrans, not in the proc array. So
when we take the running-xacts snapshot, we can't include that
information, because there's no easy and fast way to scan pg_subtrans
for it. Because that information is not included in the snapshot, the
standby doesn't have all the information it needs until after it has
seen that all the transactions that had an overflowed xid cache have
finished.
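To make the constraint concrete: pg_subtrans supports point lookups
only. SubTransGetParent() is the real interface; the commentary around
it is illustrative:

	#include "access/subtrans.h"

	/* pg_subtrans can answer "who is the parent of this xid?"... */
	TransactionId parent = SubTransGetParent(subxid);

	/*
	 * ...but there is no inverse lookup. Listing all subxids of the
	 * currently running transactions would mean scanning every
	 * pg_subtrans page between the oldest and newest xid, far too slow
	 * to do at the moment the running-xacts snapshot is taken.
	 */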
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock.
Can you describe in more detail what problem this is again? We had
various problems with prepared transactions, but I believe what's in the
git repository now handles all those cases (although I just noticed and
fixed a bug in it, so it's not very well tested or reviewed yet).
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, Nov 15, 2009 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Please can we agree a way forwards?
I don't have a strong position on the technical issues, but I am very
much in favor of getting something committed, even something with
limitations, as soon as we practically can. Getting this feature into
the tree will get a lot more eyeballs on it, and it's much better to
do that now, while we still have several months remaining before beta,
so those eyeballs can be looking at it for longer - and testing it as
part of the regular alpha release process. It will also eliminate the
need to repeatedly merge with the main tree, etc.
...Robert
Simon Riggs wrote:
There are two remaining areas of significant thought/effort:
Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.
- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.
- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.
- The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum. That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will then
need to fix b-tree vacuum records accordingly. We'd probably need to do
something about the previous item first to keep performance acceptable.
- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages that don't need any other work to be done are not
currently pinned in the buffer cache.
- Do we do the b-tree page pinning explained in the previous point correctly
at the end of index vacuum? ISTM we're not visiting any pages after the
last page that had dead tuples on it.
- code structure. I moved much of the added code to a new standby.c
module that now takes care of replaying standby related WAL records. But
there's code elsewhere too. I'm not sure if this is a good division but
seems better than the original ad hoc arrangement where e.g. lock-related
WAL handling was in inval.c.
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.
- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
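For concreteness, here is a sketch of the "standby delay" measurement
flagged above. GetCurrentTimestamp() and TimestampDifferenceExceeds()
are the real timestamp helpers; everything else, including the
assumption that max_standby_delay is in seconds, is illustrative:

	#include "utils/timestamp.h"

	/*
	 * Sketch only: on an idle master, lastReplayedCommitTime can be
	 * hours old, so the whole max_standby_delay budget is spent before
	 * any conflict even arises.
	 */
	static bool
	StandbyDelayExceeded(TimestampTz lastReplayedCommitTime)
	{
		return TimestampDifferenceExceeds(lastReplayedCommitTime,
										  GetCurrentTimestamp(),
										  max_standby_delay * 1000);
	}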
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
There are two remaining areas of significant thought/effort:
Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:

- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.
Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.
- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.
That was the issue I mentioned.
- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.
Exactly my point. There are already parts of the patch that may cause
usage problems and need further thought. The earlier we get this to
people, the earlier we will find out what they all are and can begin
doing something about them.
- The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum. That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will then
need to fix b-tree vacuum records accordingly. We'd probably need to do
something about the previous item first to keep performance acceptable.

- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages that don't need any other work to be done are not
currently pinned in the buffer cache.
Yes, it's an optimization. Not one I consider critical, yet cool and
interesting.
- Do we do the b-tree page pinning explained in previous point correctly
at the end of index vacuum? ISTM we're not visiting any pages after the
last page that had dead tuples on it.
Looks like a new bug, not previously mentioned.
- code structure. I moved much of the added code to a new standby.c
module that now takes care of replaying standby related WAL records. But
there's code elsewhere too. I'm not sure if this is a good division but
seems better than the original ad hoc arrangement where e.g. lock-related
WAL handling was in inval.c.
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.
An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.
- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
Nice?
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.

An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.
I don't think you need a heartbeat to solve this particular case. You
just need to define the "standby delay" to be "current timestamp -
timestamp of the conflicting candidate commit record".
--
greg
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.
There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 14:47 +0000, Greg Stark wrote:
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.

An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.

I don't think you need a heartbeat to solve this particular case. You
just need to define the "standby delay" to be "current timestamp -
timestamp of the conflicting candidate commit record".
That's not possible, unfortunately.
We only have times for commits and aborts. So there could be untimed WAL
records ahead of the last timed record.
The times of events we know from the log records give us no clue as to
when the last non-commit/abort record arrived. We can only learn that by
(i) specifically augmenting the log with regular, timed WAL records, or
(ii) asking WALreceiver directly when it last spoke with the master.

(ii) is the obvious way to do this when we have streaming replication,
and HS assumes this will be available. It need not be, though, and in
that case we can do (i).
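For illustration, option (i) amounts to no more than a timestamp-only
record. This is entirely a sketch; no such record type exists in the
patch as posted:

	typedef struct xl_heartbeat
	{
		TimestampTz masterTime;	/* GetCurrentTimestamp() on the master */
	} xl_heartbeat;

	/*
	 * Master side: a background process inserts one of these every few
	 * seconds. Standby side: redo for this record simply advances the
	 * reference timestamp that the max_standby_delay comparison is made
	 * against, so an idle master no longer looks infinitely far behind.
	 */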
Heikki's case is close to one I would expect to see in many cases: a
database that is only active during the day feeds a system that runs
queries 24x7. Run a VACUUM on the master at night and you could get
conflicts that follow the pattern described.
--
Simon Riggs www.2ndQuadrant.com
On Sun, 2009-11-15 at 16:50 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.

There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
It can still fail even after we kick everybody off. So why bother? Most
people run nowhere near the size limit of their lock tables, and on the
standby we only track AccessExclusiveLocks in the Startup process. We
gain little by spending time on partial protection against an unlikely
issue.
(BTW, I'm not suggesting you commit HS immediately. Only that we split
into phases, stabilise and test phase 1 soon, then fix the remaining
issues later.)
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 9:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.

There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
OK, but... there will always be things that will bring down the
stand-by, just as there will always be things that can bring down the
primary. A bucket of ice-water will probably do it, for example. I
mean, it would be great to make it better, but is it so bad that we
can't postpone that improvement to a follow-on patch? It's not clear
to me that it is. I think we should really focus in on things that
are likely to (1) make this give wrong answers or (2) be impossible to
fix in a follow-on patch if they're not right in the original one.
Only one or two of the items on your list of additional TODOs
seem like they might fall into category (2), and none of them appear
to fall into category (1).
I predict that if we commit a basic version of this with some annoying
limitations for 8.5, people who need the feature will start writing
patches to address some of the limitations. No one else is going to
undertake any serious development work as long as this remains outside
the main tree, for fear of everything changing under them and all
their work being wasted. I would like this feature to be as good as
possible, but I would like to have it at all more.
...Robert
Robert Haas wrote:
OK, but... there will always be things that will bring down the
stand-by, just as there will always be things that can bring down the
primary. A bucket of ice-water will probably do it, for example. I
mean, it would be great to make it better, but is it so bad that we
can't postpone that improvement to a follow-on patch?
We're not talking about a bucket of ice-water. We're talking about
issuing SELECTs to a lot of different tables in a single transaction.
Only one or two of the items on your list of additional TODOs
seem like they might fall into category (2), and none of them appear
to fall into category (1).
At least the b-tree vacuum bug could cause incorrect answers, even
though it would be extremely hard to run into it in practice.
I predict that if we commit a basic version of this with some annoying
limitations for 8.5, people who need the feature will start writing
patches to address some of the limitations. No one else is going to
undertake any serious development work as long as this remains outside
the main tree, for fear of everything changing under them and all
their work being wasted. I would like this feature to be as good as
possible, but I would like to have it at all more.
Agreed. Believe me, I'd like to have this committed as much as everyone
else. But once I do that, I'm also committing myself to fix all the
remaining issues before the release. The criterion for committing is: is
it good enough that we could release it tomorrow with no further
changes? Nothing more, nothing less.
We have *already* postponed a lot of nice-to-have stuff like the
functions to control recovery. And yes, many of the things I listed in
the TODO are not must-haves and we could well release without them.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.
That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.
We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records. If you do this the
straightforward way, then it will just generate a stream of btree
deletion records that will ruin usability.
You spotted this issue only this morning??
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.

That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.
Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)
You spotted this issue only this morning??
Yesterday evening.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.

That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.

Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)
Remember that blocks are marked with an LSN? When we insert a WAL record
it has an LSN also. So we can tell which btree blocks might have been
written to after the HeapInfo record is generated. So if a block
hasn't been recently updated or it doesn't have any killed tuples then
we need not generate a record to handle a possible conflict.
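A sketch of the test being proposed here. PageGetLSN() and XLByteLE()
are real; PageHasKilledItems() is a hypothetical stand-in for a loop
over the page's line pointers checking ItemIdIsDead():

	/*
	 * Sketch only: skip generating a conflict-bearing deletion record
	 * for a block that hasn't changed since the cleanup-info record
	 * was written and has nothing marked killed on it.
	 */
	static bool
	BlockNeedsConflictRecord(Page page, XLogRecPtr cleanupInfoLsn)
	{
		if (XLByteLE(PageGetLSN(page), cleanupInfoLsn) &&
			!PageHasKilledItems(page))	/* hypothetical helper */
			return false;
		return true;
	}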
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Agreed. Believe me, I'd like to have this committed as much as everyone
else. But once I do that, I'm also committing myself to fix all the
remaining issues before the release. The criterion for committing is: is
it good enough that we could release it tomorrow with no further
changes? Nothing more, nothing less.
I agree with the criterion, but I think its application to the present
set of facts is debatable. If the b-tree vacuum bug can cause
incorrect answers, then it is a bug and we have to fix it. But a
query getting canceled because it touches a lot of tables sounds more
like a limitation than an outright bug, and I'm not sure you should
feel like you're on the hook for that, especially if the problem can
be mitigated by adjusting settings. Of course, on the flip side, if
the problem is likely to occur frequently enough to make the whole
system unusable in practice, then maybe it does need to be fixed. I
don't know. It's not my place and I don't intend to question your
technical judgment on what does or does not need to be fixed, the
more so since I haven't read or thought deeply about the latest patch.
I'm just throwing it out there.
The other problem is that we have another big patch sitting right
behind this one waiting for your attention as soon as you get this one
off your chest. I know Simon has said that he feels that the effort
to finish the HS and SR patches for 9/15 was somewhat of an artificial
deadline, but ISTM that with only 3 months remaining until the close
of the final CommitFest for this release, and two major patches to be
merged, we're starting to get tight on time. Presumably there will be
problems with both patches that are discovered only after committing
them, and we need some time for those to shake out. If not enough of
that shaking out happens during the regular development cycle, it will
either prolong beta and therefore delay the release, or the release
will be buggy.
All that having been said, the possibility that I'm a pessimistic
worry-wart certainly can't be ruled out. :-)
...Robert
Simon Riggs wrote:
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.
That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.

Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)

Remember that blocks are marked with an LSN? When we insert a WAL record
it has an LSN also. So we can tell which btree blocks might have been
written to after the HeapInfo record is generated. So if a block
hasn't been recently updated or it doesn't have any killed tuples then
we need not generate a record to handle a possible conflict.
Hmm, perhaps we're talking about the same thing. What I'm seeing is that
we could easily do this:
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
*************** restart:
*** 843,855 ****
  			 offnum <= maxoff;
  			 offnum = OffsetNumberNext(offnum))
  		{
  			IndexTuple	itup;
  			ItemPointer htup;

! 			itup = (IndexTuple) PageGetItem(page,
! 											PageGetItemId(page, offnum));
  			htup = &(itup->t_tid);
! 			if (callback(htup, callback_state))
  				deletable[ndeletable++] = offnum;
  		}
  	}
--- 843,856 ----
  			 offnum <= maxoff;
  			 offnum = OffsetNumberNext(offnum))
  		{
+ 			ItemId		itemid;
  			IndexTuple	itup;
  			ItemPointer htup;

! 			itemid = PageGetItemId(page, offnum);
! 			itup = (IndexTuple) PageGetItem(page, itemid);
  			htup = &(itup->t_tid);
! 			if (callback(htup, callback_state) || ItemIdIsDead(itemid))
  				deletable[ndeletable++] = offnum;
  		}
  	}
But if we do that, b-tree vacuum records are going to need conflict
resolution, just like the b-tree non-vacuum deletion records. The LSN
doesn't help there, because when an itemid is marked as dead, the LSN is
not updated.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com