Summary and Plan for Hot Standby
After some time thinking about the best way forward for Hot Standby, I
have some observations and proposals.
First, the project is very large. We have agreed ways to trim the patch,
yet it remains large. Trying to do everything in one lump is almost
always a bad plan, so we need to phase things.
Second, everybody is keen that HS hits the tree, so we can have alpha
code etc. There are a few remaining issues that should *not* be rushed.
The only way to remove this dependency is to decouple parts of the
project.
Third, testing the patch is difficult and continuous change makes it
harder to guarantee everything is working.
There are two remaining areas of significant thought/effort:
* Issues relating to handling of prepared transactions
* How fast Hot Standby mode is enabled in the standby
I propose that we stabilise and eventually commit a version of HS that
circumvents/defers those issues and then address the issues with
separate patches afterwards. This approach will allow us to isolate the
areas of further change so we can have a test blitz to remove silly
mistakes, then follow it with a commit to CVS, and then release as Alpha
to allow further testing.
Let's look at the two areas of difficulty in more detail:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock. It is straightforward to say, as an interim measure,
"Hot Standby will not work with max_prepared_transactions > 0". I see
that this has a fiddly, yet fairly clear solution.
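For illustration, the interim restriction could be as small as a check
at recovery start. This is a sketch, not the patch's actual code;
"standby_connections_enabled" is a hypothetical flag, while
max_prepared_xacts is the real backing variable of the
max_prepared_transactions GUC:

	if (standby_connections_enabled && max_prepared_xacts > 0)
		ereport(FATAL,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("Hot Standby is not supported when max_prepared_transactions > 0")));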
* How fast Hot Standby mode is enabled in the standby
We need to have full snapshot information on the standby before we can
allow connections and queries. There are two basic approaches: i) we
wait until we *know* we have full info or ii) we try to collect data and
inject a correct starting condition. Waiting (i) may take a while, but
is clean and requires only a few lines of code. Injecting the starting
condition (ii) requires boatloads of hectic code and we have been unable
to agree a way forwards. If we did have that code, all it would give us
is a faster/more reliable starting point for connections on the standby.
Until we can make approach (ii) work, we should just rely on the easy
approach (i). In many cases, the starting point is very similar. (In
some cases we can actually make (i) faster because the overhead of data
collection forces us to derive the starting conditions minutes apart.)
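For illustration, approach (i) is conceptually just a check each time a
running-xacts record is replayed. A sketch, in which every identifier
is illustrative rather than the patch's:

	/*
	 * Sketch only: once a replayed running-xacts record shows that no
	 * running transaction can have subtransactions we don't know about,
	 * the standby's snapshot information is complete and read-only
	 * connections can be allowed.
	 */
	static void
	CheckRecoverySnapshotComplete(RunningXactsRecord *rec)
	{
		if (rec->subxidOverflow)
			return;				/* not safe yet; keep waiting */

		EnableHotStandbyConnections();
	}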
Phasing the commit seems like the only way.
Please can we agree a way forwards?
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 09:06, Simon Riggs <simon@2ndquadrant.com> wrote:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock. It is straightforward to say, as an interim measure,
"Hot Standby will not work with max_prepared_transactions > 0". I see
that this has a fiddly, yet fairly clear solution.
If that restriction solves the issues we have now, I find it a
perfectly reasonable restriction, even if it were to still be there
past release and only get fixed in a future release. The vast
majority of our users don't use 2PC at all. Most cases where people
had it enabled were because it was enabled by default, and in the
large majority of cases where I've seen people increase it, it was
actually because they thought it meant prepared statements, not
prepared transactions.
So definitely +1.
* How fast Hot Standby mode is enabled in the standby
We need to have full snapshot information on the standby before we can
allow connections and queries. There are two basic approaches: i) we
wait until we *know* we have full info or ii) we try to collect data and
inject a correct starting condition. Waiting (i) may take a while, but
is clean and requires only a few lines of code. Injecting the starting
condition (ii) requires boatloads of hectic code and we have been unable
to agree a way forwards. If we did have that code, all it would give us
is a faster/more reliable starting point for connections on the standby.
Until we can make approach (ii) work, we should just rely on the easy
approach (i). In many cases, the starting point is very similar. (In
some cases we can actually make (i) faster because the overhead of data
collection forces us to derive the starting conditions minutes apart.)
That also seems perfectly reasonable, depending on how long the
waiting on (i) will be :-) What does the time depend on?
Phasing the commit seems like the only way.
Yeah, we usually recommend that in other cases, so I don't see why it
shouldn't apply to HS. Seems like a good way forward.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:
What does the time depend on?
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.
The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.
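For reference, the 64 is the size of the per-backend subxid cache in
the proc array. A sketch of the overflow, where PGPROC_MAX_CACHED_SUBXIDS
is the real constant and the field names are approximations:

	/*
	 * Sketch only: when a subtransaction is assigned an xid, the
	 * backend tries to cache it in its shared proc array entry; once
	 * the cache is full it can only mark itself overflowed, and the
	 * parent-child mapping then exists solely in pg_subtrans.
	 */
	if (proc->nsubxids < PGPROC_MAX_CACHED_SUBXIDS)
		proc->subxids[proc->nsubxids++] = subxid;
	else
		proc->subxidsOverflowed = true;	/* snapshot no longer complete */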
There are various ways around this, and I'm certain we'll come up with
something ingenious, but my main point is that we don't need to wait for
this issue to be solved in order for HS to be usable.
--
Simon Riggs www.2ndQuadrant.com
On Sunday, November 15, 2009, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:
What does the time depend on?
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.

The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.

There are various ways around this, and I'm certain we'll come up with
something ingenious, but my main point is that we don't need to wait for
this issue to be solved in order for HS to be usable.
Yeah, with that explanation (thanks for clearing it up) I agree - it
will definitely still be hugely useful even with this restriction, so
we really don't need to delay an initial (or the alpha at least)
commit.
Thus, +1 on the second one as well :)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Simon Riggs wrote:
We need to wait for all current transactions to complete, i.e. any
backend that has taken (or could take) an xid or an AccessExclusiveLock
before it commits. Similar-ish to the wait for a CREATE INDEX
CONCURRENTLY.

The standby already performs this wait in the case where we overflow the
snapshot, i.e. when there are >64 subtransactions on *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.
WAL-logging every new subtransaction wouldn't actually help. The problem
with subtransactions is that if the subxid cache overflows in the proc
array in the master, the information about the parent-child
relationship is only stored in pg_subtrans, not in the proc array. So
when we take the running-xacts snapshot, we can't include that
information, because there's no easy and fast way to scan pg_subtrans
for it. Because that information is not included in the snapshot, the
standby doesn't have all the information it needs until after it has
seen that all the transactions that had an overflowed xid cache have
finished.
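To make the constraint concrete: pg_subtrans supports point lookups
only. SubTransGetParent() is the real interface; the commentary around
it is illustrative:

	#include "access/subtrans.h"

	/* pg_subtrans can answer "who is the parent of this xid?"... */
	TransactionId parent = SubTransGetParent(subxid);

	/*
	 * ...but there is no inverse lookup. Listing all subxids of the
	 * currently running transactions would mean scanning every
	 * pg_subtrans page between the oldest and newest xid, far too slow
	 * to do at the moment the running-xacts snapshot is taken.
	 */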
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock.
Can you describe in more detail what problem this is again? We had
various problems with prepared transactions, but I believe what's in the
git repository now handles all those cases (although I just noticed and
fixed a bug in it, so it's not very well tested or reviewed yet).
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, Nov 15, 2009 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Please can we agree a way forwards?
I don't have a strong position on the technical issues, but I am very
much in favor of getting something committed, even something with
limitations, as soon as we practically can. Getting this feature into
the tree will get a lot more eyeballs on it, and it's much better to
do that now, while we still have several months remaining before beta,
so those eyeballs can be looking at it for longer - and testing it as
part of the regular alpha release process. It will also eliminate the
need to repeatedly merge with the main tree, etc.
...Robert
Simon Riggs wrote:
There are two remaining areas of significant thought/effort:
Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.
- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.
- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.
- The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum. That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will then
need to fix b-tree vacuum records accordingly. We'd probably need to do
something about the previous item first to keep performance acceptable.
- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages that don't need any other work to be done are not
currently pinned in the buffer cache.
- Do we do the b-tree page pinning explained in the previous point correctly
at the end of index vacuum? ISTM we're not visiting any pages after the
last page that had dead tuples on it.
- code structure. I moved much of the added code to a new standby.c
module that now takes care of replaying standby related WAL records. But
there's code elsewhere too. I'm not sure if this is a good division but
seems better than the original ad hoc arrangement where e.g. lock-related
WAL handling was in inval.c.
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.
- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
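For concreteness, here is a sketch of the "standby delay" measurement
flagged above. GetCurrentTimestamp() and TimestampDifferenceExceeds()
are the real timestamp helpers; everything else, including the
assumption that max_standby_delay is in seconds, is illustrative:

	#include "utils/timestamp.h"

	/*
	 * Sketch only: on an idle master, lastReplayedCommitTime can be
	 * hours old, so the whole max_standby_delay budget is spent before
	 * any conflict even arises.
	 */
	static bool
	StandbyDelayExceeded(TimestampTz lastReplayedCommitTime)
	{
		return TimestampDifferenceExceeds(lastReplayedCommitTime,
										  GetCurrentTimestamp(),
										  max_standby_delay * 1000);
	}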
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
There are two remaining areas of significant thought/effort:
Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:

- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.
Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.
- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.
That was the issue I mentioned.
- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.
Exactly my point. There are already parts of the patch that may cause
usage problems and need further thought. The earlier we get this to
people, the earlier we will find out what they all are and can begin
doing something about them.
- The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum. That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will then
need to fix b-tree vacuum records accordingly. We'd probably need to do
something about the previous item first to keep performance acceptable.

- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages that don't need any other work to be done are not
currently pinned in the buffer cache.
Yes, it's an optimization. Not one I consider critical, yet cool and
interesting.
- Do we do the b-tree page pinning explained in previous point correctly
at the end of index vacuum? ISTM we're not visiting any pages after the
last page that had dead tuples on it.
Looks like a new bug, not previously mentioned.
- code structure. I moved much of the added code to a new standby.c
module that now takes care of replaying standby related WAL records. But
there's code elsewhere too. I'm not sure if this is a good division but
seems better than the original ad hoc arrangement where e.g. lock-related
WAL handling was in inval.c.
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.
An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.
- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
Nice?
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.

An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.
I don't think you need a heartbeat to solve this particular case. You
just need to define the "standby delay" to be "current timestamp -
timestamp of the conflicting candidate commit record".
--
greg
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.
There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 14:47 +0000, Greg Stark wrote:
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed.

An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.

I don't think you need a heartbeat to solve this particular case. You
just need to define the "standby delay" to be "current timestamp -
timestamp of the conflicting candidate commit record".
That's not possible, unfortunately.
We only have times for commits and aborts. So there could be untimed WAL
records ahead of the last timed record.
The times of events we know from the log records give us no clue as to
when the last non-commit/abort record arrived. We can only learn that by
(i) specifically augmenting the log with regular, timed WAL records, or
(ii) asking WALreceiver directly when it last spoke with the master.

(ii) is the obvious way to do this when we have streaming replication,
and HS assumes this will be available. It need not be, though, and in
that case we can do (i).
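For illustration, option (i) amounts to no more than a timestamp-only
record. This is entirely a sketch; no such record type exists in the
patch as posted:

	typedef struct xl_heartbeat
	{
		TimestampTz masterTime;	/* GetCurrentTimestamp() on the master */
	} xl_heartbeat;

	/*
	 * Master side: a background process inserts one of these every few
	 * seconds. Standby side: redo for this record simply advances the
	 * reference timestamp that the max_standby_delay comparison is made
	 * against, so an idle master no longer looks infinitely far behind.
	 */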
Heikki's case is close to one I would expect to see in many cases: a
database that is only active during the day feeds a system that runs
queries 24x7. Run a VACUUM on the master at night and you could get
conflicts that follow the pattern described.
--
Simon Riggs www.2ndQuadrant.com
On Sun, 2009-11-15 at 16:50 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.

There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
It can still fail even after we kick everybody off. So why bother? Most
people run nowhere near the size limit of their lock tables, and on the
standby we only track AccessExclusiveLocks in the Startup process. We
gain little by spending time on partial protection against an unlikely
issue.
(BTW, I'm not suggesting you commit HS immediately. Only that we split
into phases, stabilise and test phase 1 soon, then fix the remaining
issues later.)
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 9:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.

There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.
OK, but... there will always be things that will bring down the
stand-by, just as there will always be things that can bring down the
primary. A bucket of ice-water will probably do it, for example. I
mean, it would be great to make it better, but is it so bad that we
can't postpone that improvement to a follow-on patch? It's not clear
to me that it is. I think we should really focus in on things that
are likely to (1) make this give wrong answers or (2) be impossible to
fix in a follow-on patch if they're not right in the original one.
Only one or two of the items on your list of additional TODOs
seem like they might fall into category (2), and none of them appear
to fall into category (1).
I predict that if we commit a basic version of this with some annoying
limitations for 8.5, people who need the feature will start writing
patches to address some of the limitations. No one else is going to
undertake any serious development work as long as this remains outside
the main tree, for fear of everything changing under them and all
their work being wasted. I would like this feature to be as good as
possible, but I would like to have it at all more.
...Robert
Robert Haas wrote:
OK, but... there will always be things that will bring down the
stand-by, just as there will always be things that can bring down the
primary. A bucket of ice-water will probably do it, for example. I
mean, it would be great to make it better, but is it so bad that we
can't postpone that improvement to a follow-on patch?
We're not talking about a bucket of ice-water. We're talking about
issuing SELECTs to a lot of different tables in a single transaction.
Only one or two of the items on your list of additional TODOs
seem like they might fall into category (2), and none of them appear
to fall into category (1).
At least the b-tree vacuum bug could cause incorrect answers, even
though it would be extremely hard to run into it in practice.
I predict that if we commit a basic version of this with some annoying
limitations for 8.5, people who need the feature will start writing
patches to address some of the limitations. No one else is going to
undertake any serious development work as long as this remains outside
the main tree, for fear of everything changing under them and all
their work being wasted. I would like this feature to be as good as
possible, but I would like to have it at all more.
Agreed. Believe me, I'd like to have this committed as much as everyone
else. But once I do that, I'm also committing myself to fix all the
remaining issues before the release. The criterion for committing is: is
it good enough that we could release it tomorrow with no further
changes? Nothing more, nothing less.
We have *already* postponed a lot of nice-to-have stuff like the
functions to control recovery. And yes, many of the things I listed in
the TODO are not must-haves and we could well release without them.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.
That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.
We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records. If you do this the
straightforward way, then it will just generate a stream of btree
deletion records that will ruin usability.
You spotted this issue only this morning??
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.

That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.
Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)
You spotted this issue only this morning??
Yesterday evening.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.

That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.

Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)
Remember that blocks are marked with an LSN? When we insert a WAL record
it has an LSN also. So we can tell which btree blocks might have been
written to after the HeapInfo record is generated. So if a block
hasn't been recently updated or it doesn't have any killed tuples then
we need not generate a record to handle a possible conflict.
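A sketch of the test being proposed here. PageGetLSN() and XLByteLE()
are real; PageHasKilledItems() is a hypothetical stand-in for a loop
over the page's line pointers checking ItemIdIsDead():

	/*
	 * Sketch only: skip generating a conflict-bearing deletion record
	 * for a block that hasn't changed since the cleanup-info record
	 * was written and has nothing marked killed on it.
	 */
	static bool
	BlockNeedsConflictRecord(Page page, XLogRecPtr cleanupInfoLsn)
	{
		if (XLByteLE(PageGetLSN(page), cleanupInfoLsn) &&
			!PageHasKilledItems(page))	/* hypothetical helper */
			return false;
		return true;
	}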
--
Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Agreed. Believe me, I'd like to have this committed as much as everyone
else. But once I do that, I'm also committing myself to fix all the
remaining issues before the release. The criterion for committing is: is
it good enough that we could release it tomorrow with no further
changes? Nothing more, nothing less.
I agree with the criterion, but I think its application to the present
set of facts is debatable. If the b-tree vacuum bug can cause
incorrect answers, then it is a bug and we have to fix it. But a
query getting canceled because it touches a lot of tables sounds more
like a limitation than an outright bug, and I'm not sure you should
feel like you're on the hook for that, especially if the problem can
be mitigated by adjusting settings. Of course, on the flip side, if
the problem is likely to occur frequently enough to make the whole
system unusable in practice, then maybe it does need to be fixed. I
don't know. It's not my place and I don't intend to question your
technical judgment on what does or does not need to be fixed, the
more so since I haven't read or thought deeply about the latest patch.
I'm just throwing it out there.
The other problem is that we have another big patch sitting right
behind this one waiting for your attention as soon as you get this one
off your chest. I know Simon has said that he feels that the effort
to finish the HS and SR patches for 9/15 was somewhat of an artificial
deadline, but ISTM that with only 3 months remaining until the close
of the final CommitFest for this release, and two major patches to be
merged, we're starting to get tight on time. Presumably there will be
problems with both patches that are discovered only after committing
them, and we need some time for those to shake out. If not enough of
that shaking out happens during the regular development cycle, it will
either prolong beta and therefore delay the release, or the release
will be buggy.
All that having been said, the possibility that I'm a pessimistic
worry-wart certainly can't be ruled out. :-)
...Robert
Simon Riggs wrote:
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum.
That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will
then need to fix b-tree vacuum records accordingly. We'd probably need
to do something about the previous item first to keep performance
acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records.

Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)

Remember that blocks are marked with an LSN? When we insert a WAL record
it has an LSN also. So we can tell which btree blocks might have been
written to after the HeapInfo record is generated. So if a block
hasn't been recently updated or it doesn't have any killed tuples then
we need not generate a record to handle a possible conflict.
Hmm, perhaps we're talking about the same thing. What I'm seeing is that
we could easily do this:
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
*************** restart:
*** 843,855 ****
  			 offnum <= maxoff;
  			 offnum = OffsetNumberNext(offnum))
  		{
  			IndexTuple	itup;
  			ItemPointer htup;

! 			itup = (IndexTuple) PageGetItem(page,
! 											PageGetItemId(page, offnum));
  			htup = &(itup->t_tid);
! 			if (callback(htup, callback_state))
  				deletable[ndeletable++] = offnum;
  		}
  	}
--- 843,856 ----
  			 offnum <= maxoff;
  			 offnum = OffsetNumberNext(offnum))
  		{
+ 			ItemId		itemid;
  			IndexTuple	itup;
  			ItemPointer htup;

! 			itemid = PageGetItemId(page, offnum);
! 			itup = (IndexTuple) PageGetItem(page, itemid);
  			htup = &(itup->t_tid);
! 			if (callback(htup, callback_state) || ItemIdIsDead(itemid))
  				deletable[ndeletable++] = offnum;
  		}
  	}
But if we do that, b-tree vacuum records are going to need conflict
resolution, just like the b-tree non-vacuum deletion records. The LSN
doesn't help there, because when an itemid is marked as dead, the LSN is
not updated.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com