Recovery Test Framework
Recovery doesn't have a test framework as yet. I would like to add one
for this release, especially since we have so much recovery-related code
being added to the release (and manual testing is so time consuming).
Testing Hot Standby will also test sync rep, PITR etc, and could easily
uncover a few problems hiding in the background that have lain dormant.
The current regression tests are all self-contained tests that create
objects, insert data, run tests and then cleanup again. Almost every
single test case is read-write.
This gives a few problems for recovery & Hot Standby:
* tests cannot easily be split so that read/write happens on master and
test execution happens on standby (or on both master and standby)
* there is no easy way to synchronise object creation on master and test
execution on standby
So I propose to set up two test schedules
* rep_master_schedule
* rep_standby_schedule
to be executed using pg_regress concurrently on separate database
servers running on different ports on the same system.
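As a rough illustration of the two-schedule idea, the sketch below builds one pg_regress command line per server and starts both at once. The port numbers, the helper names and the `pg_regress` path are assumptions made for illustration; `--schedule` and `--port` are ordinary pg_regress options, but the two schedule files are only proposed here, and a real run against a read-only standby would probably need further options.

```python
import subprocess

# Hypothetical ports for the two servers running on the same machine.
MASTER_PORT = 5432
STANDBY_PORT = 5433


def build_regress_cmd(schedule, port, pg_regress="pg_regress"):
    """Return the argv for one pg_regress run against an already-running server."""
    return [
        pg_regress,
        f"--schedule={schedule}",
        f"--port={port}",
    ]


def run_concurrently(cmds):
    """Start every command at once, then wait for all of them; return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    cmds = [
        build_regress_cmd("rep_master_schedule", MASTER_PORT),
        build_regress_cmd("rep_standby_schedule", STANDBY_PORT),
    ]
    print(run_concurrently(cmds))
```

Starting both processes before waiting on either is what makes the runs concurrent; the synchronisation between them is a separate concern.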
A test table would keep track of which tests have had their
prerequisites met, and rep_standby_schedule would wait until a test was
correctly set up before running the test. This would be achieved using
the attached test framework code.
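The waiting step could look something like the sketch below. It is not the attached framework code: the function name, the timeout values and the `is_ready` callback are all assumptions. On a real standby, `is_ready` would query the tracking table; here it is a stand-in so the polling logic can be shown on its own.

```python
import time


def wait_until_ready(test_name, is_ready, timeout=60.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready(test_name) until it returns True or the timeout elapses.

    is_ready is a caller-supplied callable; on a standby it would check the
    (hypothetical) tracking table that the master updates after test setup.
    Returns True if the test became ready, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_ready(test_name):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Stand-in for the tracking table: setup "completes" on the third poll.
    polls = {"count": 0}

    def fake_is_ready(name):
        polls["count"] += 1
        return polls["count"] >= 3

    print(wait_until_ready("hs_basic", fake_is_ready, interval=0.01))
```

A timeout matters here because a failed setup on the master would otherwise leave the standby schedule waiting forever.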
We would then include newly written tests, rather than attempt to use
existing tests - but use the same framework of /sql /out /expected. Some
of these have already been written for HS.
Is this something the community would like to see included within the
distribution, or should we just keep it private and publish test results
using it? I would prefer the former since it would allow us to prove
that the code works and to be able to check for specific regressions as
bugs appear. It may also help the community to work together on the
various aspects of recovery code that are being included in 8.4.
It would be massively cool to be able to add this to the build farm.
There would be few blockers because with two servers running on same
system there are few OS specific aspects to this.
If people can discuss what would be required we should be able to get it
done in the near term, i.e. over next 2-3 weeks.
Comments?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs.testframework.v1.sql
Simon Riggs <simon@2ndQuadrant.com> writes:
Recovery doesn't have a test framework as yet. I would like to add one
for this release, especially since we have so much recovery-related code
being added to the release (and manual testing is so time consuming).
I've been thinking for some time that putting replication into 8.4
has proven to be an unreasonably optimistic goal. Seeing new
requirements like this one pop up two months after feature freeze
kind of drives the point home.
I think it's time to back off and agree that we should target all this
stuff for 8.5. I don't want our first release of replication to be
flaky, but I hardly see how it will be anything else if it ships in 8.4.
regards, tom lane
On Sun, Jan 11, 2009 at 12:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Recovery doesn't have a test framework as yet. I would like to add one
for this release, especially since we have so much recovery-related code
being added to the release (and manual testing is so time consuming).
I've been thinking for some time that putting replication into 8.4
has proven to be an unreasonably optimistic goal. Seeing new
requirements like this one pop up two months after feature freeze
kind of drives the point home.
I don't have a strong feeling as to which of the replication-related
patches are ready to commit, but I don't think that the fact that
Simon has an idea for improving the test framework is a reason for
rejecting them. Referring to an idea for a new test framework for
recovery as "a new requirement for replication" is quite a stretch.
It might also be pointed out that the "Infrastructure Changes for
Recovery" patch was originally submitted for the September CommitFest,
but since review stopped for about 6 weeks beginning just after the
first of October, picked up briefly again on November 17th with a few
messages from Heikki, and then died again until late December, it's
perhaps not surprising that not a lot of progress has been made....
particularly since no reviewers were assigned for a ridiculously long
time. "Infrastructure Changes for Recovery" was moved from the
September Commitfest with your name and Heikki's name already on it,
and no one else was ever assigned (nor did you provide any more
review, at least as far as I can remember seeing on -hackers).
"Hot Standby" finally had a reviewer assigned on November 26th, when
Pavan Deolasee was added - but I'm not even sure that was an official
RRR assignment, I think he may have just picked it up. At any rate,
when a reviewer isn't assigned for almost four weeks, and it's at that
point the day before Thanksgiving, well, don't expect a lot to get
done before Christmas. SE-Postgresql was even more egregious - it
certainly never had a round robin reviewer and was listed as being
reviewed by "Nobody" for over a month.
All of this is particularly mysterious to me in light of the fact that
it was you yourself who suggested that we should make sure to get
feedback out to the authors of major patches early.
http://archives.postgresql.org/message-id/3555.1225336370@sss.pgh.pa.us
I personally reviewed at least 10 patches for this CommitFest. I
thought the point of that was to take some of the load off of the
committers so that they could focus on major new features like
replication. Otherwise, what is the point of having round robin
reviewers? And what is the point of saying that we want replication?
...Robert
On Sun, 2009-01-11 at 12:07 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Recovery doesn't have a test framework as yet. I would like to add one
for this release, especially since we have so much recovery-related code
being added to the release (and manual testing is so time consuming).
I've been thinking for some time that putting replication into 8.4
has proven to be an unreasonably optimistic goal. Seeing new
requirements like this one pop up two months after feature freeze
kind of drives the point home.
I think it's time to back off and agree that we should target all this
stuff for 8.5. I don't want our first release of replication to be
flaky, but I hardly see how it will be anything else if it ships in 8.4.
I understand that. As you know, I have been concerned and disappointed
by the progress of sync rep in particular, though I salute Fujii-san's
personal effort and skill.
Which patches were you thinking of when you say "all this stuff"?
As a main reviewer of Sync Rep, I can say it's shaping up nicely. I
don't have any reason now to veto it for architectural reasons and it
covers many subtle, second level issues very well that would be easily
missed in re-designs. It has some innovative features that make it best
in class. Is it flaky? Not fundamentally; code wise I see it more as a
question of time. Does it do everything? No, some advanced features
(multiple streaming standbys, single command setup for small installs)
have been deferred to later releases.
Now that it's time to discuss such things I personally think we have run
out of time though for WAL I/O read-ahead ("Proposal of PITR performance
improvement") especially since tests show it has little advantage with
FPW enabled. If we really need to we could lose most of my
Infrastructure patch, since that adds fast failover and additional
performance with bgwriter.
If we insist upon cuts, we can lose some patches and code and yet still
maintain the popular headline features of both Sync Rep and Hot Standby.
Realistically, we need your attention if we are to include them. I can
list points where your attention would be especially welcome since the
patches are relatively large. Let's look at the detail of what we need
to do rather than the broad brush.
Back to the test framework: this is not relevant to replication. PITR
and crash recovery are all manually tested and have been for years.
Testing Hot Standby *revealed* a bug in visibility maps that went
through otherwise unnoticed and I think there are others, new and old.
Even if we reject replication entirely, a recovery test framework is
going to increase the robustness of what we *currently* have. It's not a
new requirement; what is new is I now have some sponsorship money
explicitly earmarked for this, but as part of the HS project and indeed
the test framework relies upon HS to operate at all.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
Is it flaky? Not fundamentally; code wise I see it more as a
question of time.
"a question of time" indeed.
If we insist upon cuts, ...
Even if we reject replication entirely ...
There's a clear difference between how you're thinking about this and how I am.
The way I see it nobody suggested cutting or rejecting anything, just
committing it into a different branch for a different release date. It would
give us a year of experience seeing the code in action before releasing it on
the world.
I'm not sure whether it's too immature to commit, I haven't read the patch;
from what I see in the mailing list it seems about as ready as other large
patches in the past which were committed. But from my point of view it would
just always be better to commit large patches immediately after forking a
release instead of just before the beta to give them a whole release cycle of
exposure to developers before beta testers.
I'm not sure if this is the right release cycle to start this policy, but I
would like to convince people of this at some point so we can start having a
flood of big commits at the *beginning* of the release cycle and then a whole
release cycle of incremental polishing to those features rather than always
having freshly committed features in our releases that none of us has much
experience with.
a recovery test framework is going to increase the robustness of what we
*currently* have. It's not a new requirement; what is new is I now have some
sponsorship money explicitly earmarked for this, but as part of the HS
project and indeed the test framework relies upon HS to operate at all.
I agree with you that additional tests don't represent any immaturity in the
patch. They don't affect the run-time behaviour and I love the fact that they
might turn up any problems with our existing recovery process.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes:
... But from my point of view it would
just always be better to commit large patches immediately after forking a
release instead of just before the beta to give them a whole release cycle of
exposure to developers before beta testers.
I'm in favor of such an approach for this work, but it'll never fly as a
general project policy. People already dislike the fact that it takes
up to a year before their work gets reflected into a public release.
With such a policy we'd be telling developers "whatever you submit won't
see the light of day for one to two years". Not good for a project that
depends on the willingness of developers to scratch their own itches.
However, we are getting off onto a tangent. I wasn't trying to start
a discussion about general project policies, but about the specific
status of this particular group of patches.
regards, tom lane
On Mon, 2009-01-12 at 09:04 -0500, Tom Lane wrote:
I wasn't trying to start
a discussion about general project policies, but about the specific
status of this particular group of patches.
Which ones exactly?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
On Mon, 2009-01-12 at 09:04 -0500, Tom Lane wrote:
I wasn't trying to start
a discussion about general project policies, but about the specific
status of this particular group of patches.
Which ones exactly?
Well, one of the things that makes me uncomfortable is that it's not
even clear exactly which set of patches is currently proposed for
inclusion. We've seen a whole lot of URLs fly back and forth, many
of them pointing at pages that aren't there a few days later.
I've been too busy with non-replication-related patches to pay really
close attention, but I certainly don't get the impression that there's
a stable set of patches waiting to be applied. (And for the record,
there is nothing I like even a little bit about the practice of posting
a URL instead of an actual patch.)
regards, tom lane
On Mon, 2009-01-12 at 09:55 -0500, Tom Lane wrote:
(And for the record,
there is nothing I like even a little bit about the practice of posting
a URL instead of an actual patch.)
I don't like it either.
The patchsets are too big to post to the list directly, at least that is
the reason in my case and with Fujii-san and KaiGai's cases.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Mon, 2009-01-12 at 09:55 -0500, Tom Lane wrote:
(And for the record,
there is nothing I like even a little bit about the practice of posting
a URL instead of an actual patch.)
I don't like it either.
The patchsets are too big to post to the list directly, at least that is
the reason in my case and with Fujii-san and KaiGai's cases.
yeah - afaik we still have a 100k limit on -hackers.
Stefan
On Mon, Jan 12, 2009 at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, we are getting off onto a tangent. I wasn't trying to start
a discussion about general project policies, but about the specific
status of this particular group of patches.
I concur with Gregory on this one.
IM(Very)HO, it's really too late in the cycle to commit these features
(ie sync rep and hot standby). They are supposed to guarantee high
availability and data security and they must be rock solid. Having
them committed just before the release seems to me like a very
dangerous way to publish them.
A lot of users are waiting for these features so they really should be
usable and rock solid before they get released to the public. One more
year without them is perhaps better than causing problems on critical
databases.
Apart from the features themselves, what people expect the most (at
least the ones I met) is a replication feature which is simple to set
up and integrated. A polished "user interface" is probably what is the
most important from the user point of view (correctness and stability
are a minimum). That's what is going to make a difference with what
already existed (for the users I know).
I'm just handwaving but I think there's probably need for at least one
more month to get these patches reviewed and ready to commit
(considering there are very few people able to review them and to fix
problem in this set of patches).
Note that I don't question the quality of the patches, just that there
will be very little time to test the final code committed before the
release.
--
Guillaume
Simon Riggs wrote:
Recovery doesn't have a test framework as yet.
I have been having these concerns as well. In fact, I recall
discussions at least 8 years back about how pg_dump doesn't really have
any organized testing, and we also have little regular testing of PITR
aside from specific exercises that users or developers occasionally run.
The question remains how to do it. Running read-only queries on a slave
doesn't show anything about how well the write-relevant parts of WAL
archiving work. That's not to say it's not interesting to test that,
but there is really a lot more to having a full test suite for our
backup and recovery facilities.
On 1/12/09, Guillaume Smet <guillaume.smet@gmail.com> wrote:
On Mon, Jan 12, 2009 at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, we are getting off onto a tangent. I wasn't trying to start
a discussion about general project policies, but about the specific
status of this particular group of patches.
I concur with Gregory on this one.
IM(Very)HO, it's really too late in the cycle to commit these features
(ie sync rep and hot standby). They are supposed to guarantee high
availability and data security and they must be rock solid. Having
them committed just before the release seems to me like a very
dangerous way to publish them.
I disagree at least with hot standby. I've been using/testing (as
have others) it under a variety of workloads for several months now
with no issues outside of corrected issues in the very early patches.
Also, relatively few people update/build from CVS
frequently so being committed late in the release cycle isn't as
important as you are claiming...the real 'wider net' testing happens
when the beta period begins.
IMO, Simon needs to produce a patch (quickly), have it be reviewed,
and get it included/excluded based on its merits.
merlin
On Mon, Jan 12, 2009 at 4:56 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
I disagree at least with hot standby. I've been using/testing (as
have others) it under a variety of workloads for several months now
with no issues outside of corrected issues in the very early patches.
Also, relatively few people update/build from CVS
frequently so being committed late in the release cycle isn't as
important as you are claiming...the real 'wider net' testing happens
when the beta period begins.
Update/build from CVS != Update/build from CVS + apply the replication
patches + test them explicitly.
That said, I didn't have the time to test them myself so I feel also
responsible for that.
My point is that what Simon currently has (and so what you tested) is
different from what is going to be committed (note the "final" in what
I wrote) and I suspect there will be a certain number of non
negligible adjustments (see the last discussions between Simon and
Heikki and I don't think Tom has taken a look at these patches yet).
I'm not sure that the beta/rc testing cycle is sufficient for such a
critical feature and that we probably need some time to polish it.
But once again, it's just MVHO.
--
Guillaume
On Mon, Jan 12, 2009 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Mon, 2009-01-12 at 09:04 -0500, Tom Lane wrote:
Well, one of the things that makes me uncomfortable is that it's not
even clear exactly which set of patches is currently proposed for
inclusion. We've seen a whole lot of URLs fly back and forth, many
of them pointing at pages that aren't there a few days later.
I've been too busy with non-replication-related patches to pay really
close attention, but I certainly don't get the impression that there's
a stable set of patches waiting to be applied.
See this is one of the things which bothers me. I don't see any
advantage in forcing Simon to stop making improvements -- and there
are always improvements to be made -- just to make his code seem more
stable.
Obviously we want to avoid having people actively stepping on each
others' toes, but as long as the code isn't actively being worked on
by anyone else it would be silly to ask Simon to just sit on his
hands when he sees further things that can be done.
--
greg
"Guillaume Smet" <guillaume.smet@gmail.com> writes:
On Mon, Jan 12, 2009 at 4:56 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
I disagree at least with hot standby. I've been using/testing (as
have others) it under a variety of workloads for several months now
with no issues outside of corrected issues in the very early patches.
My point is that what Simon currently has (and so what you tested) is
different from what is going to be committed (note the "final" in what
I wrote) and I suspect there will be a certain number of non
negligible adjustments (see the last discussions between Simon and
Heikki and I don't think Tom has taken a look at these patches yet).
The thing that's disturbing me is that (to judge by what I've been
seeing on the mailing list) there's been a steady stream of "non
negligible adjustments" for the past two months. That's good from
the standpoint that problems are getting found and fixed, but it's
not giving me any warm fuzzies about the code being ready to go.
Basically I think we are up against the same type of project management
decision we've had several times before: are we willing to slip the
8.4 release schedule for however long it will take for hot standby
and the other replication-related features to be ready? At this point
I think there can be no question that it will not be a small slip;
in fact I'm not even prepared to guess at how long it will take.
regards, tom lane
On Mon, Jan 12, 2009 at 11:07 AM, Guillaume Smet
<guillaume.smet@gmail.com> wrote:
On Mon, Jan 12, 2009 at 4:56 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
I disagree at least with hot standby. I've been using/testing (as
have others) it under a variety of workloads for several months now
with no issues outside of corrected issues in the very early patches.
Also, relatively few people update/build from CVS
frequently so being committed late in the release cycle isn't as
important as you are claiming...the real 'wider net' testing happens
when the beta period begins.
Update/build from CVS != Update/build from CVS + apply the replication
patches + test them explicitly.
That said, I didn't have the time to test them myself so I feel also
responsible for that.
In the general case I think plenty of people update and build from CVS
regularly. It's great that the FSM has been in for a couple months
before the beta; we've uncovered a couple of problems which could easily
have slipped through the betas for example.
In the case of hot standby and replication I'm not really sure that
logic applies. It takes quite a lot of work to test these features and
they don't turn up problems in other areas when you're not running
them. So I doubt it would really have helped in this case.
--
greg
On Mon, Jan 12, 2009 at 11:11:20AM -0500, Greg Stark wrote:
On Mon, Jan 12, 2009 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Mon, 2009-01-12 at 09:04 -0500, Tom Lane wrote:
Well, one of the things that makes me uncomfortable is that it's
not even clear exactly which set of patches is currently proposed
for inclusion. We've seen a whole lot of URLs fly back and forth,
many of them pointing at pages that aren't there a few days later.
I've been too busy with non-replication-related patches to pay
really close attention, but I certainly don't get the impression
that there's a stable set of patches waiting to be applied.
See this is one of the things which bothers me. I don't see any
advantage in forcing Simon to stop making improvements -- and there
are always improvements to be made -- just to make his code seem
more stable.
Two things to fix this, and several other problems:
1. Remove the message size limits on -hackers. They serve no useful
purpose, and they interfere with our development process. If -hackers
isn't already subscriber-only, now would be the time to make it so.
2. Start using more git, as many hackers and committers have already
started to do. This is the kind of situation where CVS just plain
falls down because branching and merging are unmanageably difficult in
it, where in git, they're many-times-a-day operations.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
David Fetter <david@fetter.org> writes:
Two things to fix this, and several other problems:
1. Remove the message size limits on -hackers. They serve no useful
purpose, and they interfere with our development process.
Agreed, or at least boost it up a good bit more.
If -hackers
isn't already subscriber-only, now would be the time to make it so.
Not sure how that's relevant?
2. Start using more git, as many hackers and committers have already
started to do. This is the kind of situation where CVS just plain
falls down because branching and merging are unmanageably difficult in
it, where in git, they're many-times-a-day operations.
This is a red herring, unless your proposal also includes making the
master CVS^H^H^Hgit repository world-writable. The complaint I have
about people posting URLs is that there's no stable archive of what the
patches really were, and just because it came out of someone's local git
repository doesn't help that.
regards, tom lane
On Mon, 2009-01-12 at 11:18 -0500, Tom Lane wrote:
My point is that what Simon currently has (and so what you tested) is
different from what is going to be committed (note the "final" in what
I wrote) and I suspect there will be a certain number of non
negligible adjustments (see the last discussions between Simon and
Heikki and I don't think Tom has taken a look at these patches yet).
The thing that's disturbing me is that (to judge by what I've been
seeing on the mailing list) there's been a steady stream of "non
negligible adjustments" for the past two months. That's good from
the standpoint that problems are getting found and fixed, but it's
not giving me any warm fuzzies about the code being ready to go.
This is the same thing that makes me nervous. The feature appears to be
"Under heavy development". As I understand the development model the
heavy development is supposed to happen before commit fest.
Basically I think we are up against the same type of project management
decision we've had several times before: are we willing to slip the
8.4 release schedule for however long it will take for hot standby
and the other replication-related features to be ready?
I would certainly not like to see 8.4 slip.
At this point
I think there can be no question that it will not be a small slip;
in fact I'm not even prepared to guess at how long it will take.
Not a comforting thought.
Sincerely,
Joshua D Drake
regards, tom lane
--
PostgreSQL
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997