snapshot too old, configured by time
As discussed when the "proof of concept" patch was submitted during
9.5 development, here is a version intended to be considered for
commit to 9.6, with the following changes:
1. It is configured using time rather than number of transactions.
Not only was there unanimous agreement here that this was better,
but the EDB customer who had requested this feature and who had
been testing it independently made the same request.
2. The "proof of concept" patch only supported heap and btree
checking; this supports all index types.
3. Documentation has been added.
4. Tests have been added. They are currently somewhat minimal,
since this uses a whole new testing technique, different from any
existing committed tests -- I wanted to make sure that this
approach to testing was OK with everyone before expanding it. If
it is, I assume we will want to move some of the more generic
portions to a .pm file to make it available for other tests.
Basically, this patch aims to limit bloat when there are snapshots
that are kept registered for prolonged periods. The immediate
reason for this is a customer application that keeps read-only
cursors against fairly static data open for prolonged periods, and
automatically responds to SQLSTATE 72000 by re-opening them if necessary.
When used, it should also provide some protections against extreme
bloat from forgotten "idle in transaction" connections which are
left holding a snapshot.
Once a snapshot reaches the age threshold, it can be terminated if
it reads data modified after the snapshot was built. It is expected
that useful ranges will normally be somewhere from a few hours to a
few days.
By default old_snapshot_threshold is set to -1, which disables the
new behavior.
The customer has been testing a preliminary version of this
time-based patch for several weeks, and is happy with the results
-- it is preventing bloat for them and not generating "snapshot too
old" errors at unexpected times.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 08/31/2015 10:07 AM, Kevin Grittner wrote:
Kevin,
I've started to do a review on this patch but I am a bit confused with
some of what I am seeing.
The attached testcase fails if I replace the cursor in your test case with
direct selects from the table.
I would have expected this to generate the snapshot too old error as
well but it doesn't.
# Failed test 'expect "snapshot too old" error'
# at t/002_snapshot_too_old_select.pl line 64.
# got: ''
# expected: '72000'
# Looks like you failed 1 test of 9.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/9 subtests
Am I misunderstanding something or is the patch not working as expected?
I'm starting to read through this and have a few questions.
Kevin Grittner wrote:
4. Tests have been added. They are currently somewhat minimal,
since this is using a whole new technique for testing from any
existing committed tests -- I wanted to make sure that this
approach to testing was OK with everyone before expanding it. If
it is, I assume we will want to move some of the more generic
portions to a .pm file to make it available for other tests.
I think your test module is a bit unsure about whether it's called tso
or sto. I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in src/test.
(This is the whole point of src/test/modules, after all). I haven't
written or reviewed our TestLib stuff so I can't comment on whether the
test code is good or not. How long is the test supposed to last?
It bothers me a bit to add #include rel.h to snapmgr.h because of the
new macro definition. It seems out of place. I would instead move the
macro closer to bufmgr headers or rel.h itself perhaps. Also, the
definition isn't accompanied by an explanatory comment (other than why
is it a macro instead of a function).
So if I understand correctly, every call to BufferGetPage needs to have
a TestForOldSnapshot immediately afterwards? It seems easy to later
introduce places that fail to test for old snapshots. What happens if
they do? Does vacuum remove tuples anyway and then the query returns
wrong results? That seems pretty dangerous. Maybe the snapshot could
be an argument of BufferGetPage?
Please have the comments be clearer on what "align to minute boundaries"
means. Is it floor or ceil? Also, in the OldSnapshotControlData big
comment you talk about "length" and then the variable is called "used".
Maybe that should be more consistent, for clarity.
How large is the struct kept in shmem? I don't think the size is
configurable, is it? I would have guessed that the size would depend on
the GUC param ...
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks to both Steve and Álvaro for review and comments. I won't
be able to address all of those within the time frame of the
current CF, so I have moved it to the next CF and flagged it as
"Waiting on Author". I will post a patch to address all comments
before that CF starts. I will discuss now, though.
Steve Singer <steve@ssinger.info> wrote:
The attached testcase fails I replace the cursor in your test
case with direct selects from the table.
I would have expected this to generate the snapshot too old error
as well but it doesn't.
Good idea for an additional test; it does indeed show a case where
the patch could do better. I've reviewed the code and see how I
can adjust the calculations to fix this.
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Kevin Grittner wrote:
4. Tests have been added. They are currently somewhat minimal,
since this is using a whole new technique for testing from any
existing committed tests -- I wanted to make sure that this
approach to testing was OK with everyone before expanding it.
If it is, I assume we will want to move some of the more generic
portions to a .pm file to make it available for other tests.
I think your test module is a bit unsure about whether it's
called tso or sto.
Oops. Yeah, will fix. They all should be sto for "snapshot too
old". There is heavy enough use of timestamp fields name ts in the
code that my fingers got confused.
I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in
src/test.
As it stands now, the test is getting run as part of `make
check-world`, and it seems like src/test/modules is about testing
separate executables, so I don't think it makes sense to move the
tests -- but I could be convinced that I'm missing something.
I haven't written or reviewed our TestLib stuff so I can't
comment on whether the test code is good or not. How long is the
test supposed to last?
Well, I can't see how to get a good test of some code with a
setting of 0 (snapshot can become "too old" immediately). That
setting may keep parts of the test fast, but to exercise much of
the important code you need to configure to '1min' and have the
time pass the 0 second mark for a minute at various points in the
test; so it will be hard to avoid having this test take a few
minutes.
It bothers me a bit to add #include rel.h to snapmgr.h because of
the new macro definition. It seems out of place.
Yeah, I couldn't find anywhere to put the macro that I entirely
liked.
I would instead move the macro closer to bufmgr headers or rel.h
itself perhaps.
I'll try that, or something similar.
Also, the definition isn't accompanied by an explanatory comment
(other than why is it a macro instead of a function).
Will fix.
So if I understand correctly, every call to BufferGetPage needs
to have a TestForOldSnapshot immediately afterwards?
No, every call that is part of a scan. Access for internal
purposes (such as adding an index entry or vacuuming) does not need
this treatment.
It seems easy to later introduce places that fail to test for old
snapshots. What happens if they do?
That could cause incorrect results for those using the "snapshot
too old" feature, because a query might fail to recognize that one
or more rows it should see are missing.
Does vacuum remove tuples anyway and then the query returns
wrong results?
Right. That.
That seems pretty dangerous. Maybe the snapshot could be an
argument of BufferGetPage?
It would need that, the Relation (or go find it from the Buffer),
and an indication of whether this access is due to a scan.
Please have the comments be clearer on what "align to minute
boundaries" means. Is it floor or ceil?
ok
Also, in the OldSnapshotControlData big comment you talk about
"length" and then the variable is called "used". Maybe that
should be more consistent, for clarity.
Will work on that comment.
How large is the struct kept in shmem?
To allow a setting up to 60 days, 338kB.
I don't think the size is configurable, is it?
Not in the posted patch.
I would have guessed that the size would depend on the GUC param ...
I had been trying to make this GUC PGC_SIGHUP, but found a lot of
devils hiding in the details of that. When I gave up on that and
went back to PGC_POSTMASTER I neglected to change this allocation
back. I agree that it should be changed, and that will make the
size depend on the configuration. If the feature is not used, no
significant RAM will be allocated. If, for example, someone wants
to configure to '5h' they would need only about 1.2kB for this
structure.
Again, thanks to both of you for the review!
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in
src/test.
As it stands now, the test is getting run as part of `make
check-world`, and it seems like src/test/modules is about testing
separate executables, so I don't think it makes sense to move the
tests -- but I could be convinced that I'm missing something.
It's not conceived just as a way to test separate executables; maybe it
is at the moment (though I don't think it is?) but if so that's just an
accident. The intention is to have modules that get tested without them
being installed, which wasn't the case when they were in contrib.
The problem with check-world is that buildfarm doesn't run it. We don't
want to set up separate buildfarm modules for each subdir in src/test;
that would be pretty tedious.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tuesday, September 15, 2015 12:07 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Kevin Grittner wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in
src/test.
As it stands now, the test is getting run as part of `make
check-world`, and it seems like src/test/modules is about testing
separate executables, so I don't think it makes sense to move the
tests -- but I could be convinced that I'm missing something.
It's not conceived just as a way to test separate executables; maybe it
is at the moment (though I don't think it is?) but if so that's just an
accident. The intention is to have modules that get tested without them
being installed, which wasn't the case when they were in contrib.
The problem with check-world is that buildfarm doesn't run it. We don't
want to set up separate buildfarm modules for each subdir in src/test;
that would be pretty tedious.
OK, moved.
All other issues raised by Álvaro and Steve have been addressed,
except for this one, which I will argue against:
So if I understand correctly, every call to BufferGetPage needs to have
a TestForOldSnapshot immediately afterwards? It seems easy to later
introduce places that fail to test for old snapshots. What happens if
they do? Does vacuum remove tuples anyway and then the query returns
wrong results? That seems pretty dangerous. Maybe the snapshot could
be an argument of BufferGetPage?
There are 486 occurrences of BufferGetPage in the source code, and
this patch follows 36 of them with TestForOldSnapshot. This only
needs to be done when a page is going to be used for a scan which
produces user-visible results. That is, it is *not* needed for
positioning within indexes to add or vacuum away entries, for heap
inserts or deletions (assuming the row to be deleted has already
been found). It seems wrong to modify about 450 BufferGetPage
references to add a NULL parameter; and if we do want to make that
noop change as insurance, it seems like it should be a separate
patch, since the substance of this patch would be buried under the
volume of that.
I will add this to the November CF.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
snapshot-too-old-v3.diff (+907 -64)
On 10/15/2015 05:47 PM, Kevin Grittner wrote:
All other issues raised by Álvaro and Steve have been addressed,
except for this one, which I will argue against:
I've been looking through the updated patch
In snapmgr.c
+ * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
+ * spinlock here.
+ */
+int64
+GetOldSnapshotThresholdTimestamp(void)
Was your intent with the XXX for it to be a TODO to only acquire the lock
on platforms without the atomic 64bit operations?
On a more general note:
I've tried various manual tests of this feature and it sometimes works
as expected and sometimes doesn't.
I'm getting the feeling that how I expect it to work isn't quite in sync
with how it does work.
I'd expect the following to be sufficient to generate the test
T1: Obtains a snapshot that can see some rows
T2: Waits 60 seconds and performs an update on those rows
T2: Performs a vacuum
T1: Waits 60 seconds, tries to select from the table. The snapshot
should be too old
For example it seems that in test 002 the select_ok on conn1 following
the vacuum but right before the final sleep is critical to the snapshot
too old error showing up (ie if I remove that select_ok but leave in the
sleep I don't get the error)
Is this intended and if so is there a better way we can explain how
things work?
Also is 0 intended to be an acceptable value for old_snapshot_threshold
and if so what should we expect to see then?
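For what it's worth, the T1/T2 scenario above could be written as an isolation-tester-style spec along these lines. The table and step names are made up, and the one-minute waits between steps are elided since the spec grammar has no sleep primitive:

```
setup
{
    CREATE TABLE sto_t (id int);
    INSERT INTO sto_t SELECT generate_series(1, 100);
}
teardown { DROP TABLE sto_t; }

session "s1"
step "s1begin"  { BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; }
step "s1read"   { SELECT count(*) FROM sto_t; }
step "s1commit" { COMMIT; }

session "s2"
step "s2update" { UPDATE sto_t SET id = id; }
step "s2vacuum" { VACUUM sto_t; }

# the second "s1read", after the update and vacuum, is the one expected
# to fail with SQLSTATE 72000 once the threshold has passed
permutation "s1begin" "s1read" "s2update" "s2vacuum" "s1read" "s1commit"
```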
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:
There has been a review but no replies for more than 1 month. Returned
with feedback?
--
Michael
On Wed, Dec 2, 2015 at 12:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:
In snapmgr.c:

+ * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
+ * spinlock here.
+ */
+int64
+GetOldSnapshotThresholdTimestamp(void)

Was your intent with the XXX for it to be a TODO to only acquire the lock
on platforms without the atomic 64bit operations?
I'm not sure whether we can safely assume a read of an int64 to be
atomic on any platform; if we actually can on some platforms, and
we have a #define to tell us whether we are in such an environment,
we could condition the spinlock calls on that. Are we there yet?
On a more general note:
I've tried various manual tests of this feature and it sometimes works as
expected and sometimes doesn't.
I'm getting the feeling that how I expect it to work isn't quite in sync
with how it does work.
I'd expect the following to be sufficient to generate the test
T1: Obtains a snapshot that can see some rows
T2: Waits 60 seconds and performs an update on those rows
T2: Performs a vacuum
T1: Waits 60 seconds, tries to select from the table. The snapshot should
be too old
For example it seems that in test 002 the select_ok on conn1 following the
vacuum but right before the final sleep is critical to the snapshot too old
error showing up (ie if I remove that select_ok but leave in the sleep I
don't get the error)
Is this intended and if so is there a better way we can explain how things
work?
At every phase I took a conservative approach toward deferring
pruning of tuples still visible to any snapshot -- often reducing
the overhead of tracking by letting things go to the next minute
boundary. The idea is that an extra minute or two probably isn't
going to be a big deal in terms of bloat, so if we can save any
effort on the bookkeeping by letting things go a little longer, it
is a worthwhile trade-off. That does make it hard to give a
precise statement of exactly when a transaction *will* be subject
to cancellation based on this feature, so I have emphasized the
minimum guaranteed time that a transaction will be *safe*. In
reviewing what you describe, I think I still don't have it as
aggressive as I can (and probably should). My biggest concern is
that a long-running transaction which gets a transaction ID
matching the xmin on a snapshot it will hold for a long time may
not be subject to cancellation. That's probably not too hard to
fix, but the bigger problem is the testing.
People have said that issuing SQL commands directly from a TAP test
via DBD::Pg is not acceptable for a core feature, and (despite
assertions to the contrary) I see no way to test this feature with
existing testing mechanisms. The bigger set of work here, if we
don't want this feature to go in without any testing scripts (which
is not acceptable IMO), is to enhance the isolation tester or
hybridize TAP testing with the isolation tester.
Also is 0 intended to be an acceptable value for old_snapshot_threshold and
if so what should we expect to see then?
The docs in the patch say this:
+ <para>
+ A value of <literal>-1</> disables this feature, and is the default.
+ Useful values for production work probably range from a small number
+ of hours to a few days. The setting will be coerced to a granularity
+ of minutes, and small numbers (such as <literal>0</> or
+ <literal>1min</>) are only allowed because they may sometimes be
+ useful for testing. While a setting as high as <literal>60d</> is
+ allowed, please note that in many workloads extreme bloat or
+ transaction ID wraparound may occur in much shorter time frames.
+ </para>
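In postgresql.conf terms, the documented settings come down to this (a sketch restating the excerpt above, not new behavior):

```
old_snapshot_threshold = -1       # default: feature disabled
#old_snapshot_threshold = '2h'    # coerced to minute granularity
#old_snapshot_threshold = 0       # allowed, mainly useful for testing
#old_snapshot_threshold = '60d'   # maximum; beware bloat/wraparound risks
```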
Once we can agree on a testing methodology I expect that I will be
adding a number of tests based on a cluster started with
old_snapshot_threshold = 0, but as I said in my initial post of the
patch I was keeping the tests in the patch thin until it was
confirmed whether this testing methodology was acceptable. Since
it isn't, that was just as well. The time put into learning enough
about perl and TAP tests to create those tests already exceeds the
time to develop the actual patch, and it looks like even more will
be needed for a test methodology that doesn't require adding a
package or downloading a CPAN module. C'est la vie. I did expand
my perl and TAP knowledge considerably, for what that's worth.
There has been a review but no replies for more than 1 month. Returned
with feedback?
I do intend to post another version of the patch to tweak the
calculations again, after I can get a patch in to expand the
testing capabilities to allow an acceptable way to test the patch
-- so I put it into the next CF instead.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 3, 2015 at 5:48 AM, Kevin Grittner wrote:
There has been a review but no replies for more than 1 month. Returned
with feedback?
I do intend to post another version of the patch to tweak the
calculations again, after I can get a patch in to expand the
testing capabilities to allow an acceptable way to test the patch
-- so I put it into the next CF instead.
OK, thanks.
--
Michael
On 2015-12-02 14:48:24 -0600, Kevin Grittner wrote:
On Wed, Dec 2, 2015 at 12:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:
In snapmgr.c:

+ * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
+ * spinlock here.
+ */
+int64
+GetOldSnapshotThresholdTimestamp(void)

Was your intent with the XXX for it to be a TODO to only acquire the lock
on platforms without the atomic 64bit operations?
I'm not sure whether we can safely assume a read of an int64 to be
atomic on any platform; if we actually can on some platforms, and
we have a #define to tell us whether we are in such an environment,
we could condition the spinlock calls on that. Are we there yet?
We currently don't assume it's atomic. And there are platforms, e.g. 32-bit
ARM, where that's not the case
(c.f. https://wiki.postgresql.org/wiki/Atomics). It'd be rather useful
to abstract that knowledge into a macro...
Andres
Kevin Grittner wrote:
People have said that issuing SQL commands directly from a TAP test
via DBD::Pg is not acceptable for a core feature, and (despite
assertions to the contrary) I see no way to test this feature with
existing testing mechanisms. The bigger set of work here, if we
don't want this feature to go in without any testing scripts (which
is not acceptable IMO), is to enhance the isolation tester or
hybridize TAP testing with the isolation tester.
Is it possible to use the PostgresNode stuff to test this? If not,
perhaps if you restate what additional capabilities you need we could
look into adding them there. I suspect that what you need is the
ability to keep more than one session open and feed them commands;
perhaps we could have the framework have a function that opens a psql
process and returns a FD to which the test program can write, using the
IPC::Run stuff (start / pump / finish).
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Kevin Grittner wrote:
There has been a review but no replies for more than 1 month. Returned
with feedback?
I do intend to post another version of the patch to tweak the
calculations again, after I can get a patch in to expand the
testing capabilities to allow an acceptable way to test the patch
-- so I put it into the next CF instead.
Two months passed since this, and no activity. I'm closing this as
returned-with-feedback now; you're of course free to resubmit to
2016-03.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 8, 2016 at 5:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
People have said that issuing SQL commands directly from a TAP test
via DBD::Pg is not acceptable for a core feature, and (despite
assertions to the contrary) I see no way to test this feature with
existing testing mechanisms. The bigger set of work here, if we
don't want this feature to go in without any testing scripts (which
is not acceptable IMO), is to enhance the isolation tester or
hybridize TAP testing with the isolation tester.
Is it possible to use the PostgresNode stuff to test this? If not,
perhaps if you restate what additional capabilities you need we could
look into adding them there. I suspect that what you need is the
ability to keep more than one session open and feed them commands;
perhaps we could have the framework have a function that opens a psql
process and returns a FD to which the test program can write, using the
IPC::Run stuff (start / pump / finish).
Resubmitting for the March CF.
The main thing that changed is that I can now run all the
regression and isolation tests using installcheck with
old_snapshot_threshold = 0 and get a clean run. That probably gets
better overall coverage than specific tests to demonstrate the
"snapshot too old" error, but of course we need those, too. While
I can do that with hand-run psql sessions or through connectors
from different languages, I have not been able to wrangle the
testing tools we support through the build system into working for
this purpose. (I had been hoping that the recent improvements to
the TAP testing libraries would give me the traction to get there,
but either it's still not there or my perl-fu is just too weak to
figure out how to use those features -- suggestions welcome.)
Basically, a connection needs to remain open and interleave
commands with other connections, which the isolation tester does
just fine; but it needs to do that using a custom postgresql.conf
file, which TAP does just fine. I haven't been able to see the
right way to get a TAP test to set up a customized installation to
run isolation tests against. If I can get that working, I have
additional tests I can drop into that.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
snapshot-too-old-v4.patch (+675 -61)
On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
Basically, a connection needs to remain open and interleave
commands with other connections, which the isolation tester does
just fine; but it needs to do that using a custom postgresql.conf
file, which TAP does just fine. I haven't been able to see the
right way to get a TAP test to set up a customized installation to
run isolation tests against. If I can get that working, I have
additional tests I can drop into that.
Check contrib/test_decoding's makefile. It does just that with
isolationtester.
Andres
On Tue, Mar 1, 2016 at 9:35 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
Basically, a connection needs to remain open and interleave
commands with other connections, which the isolation tester does
just fine; but it needs to do that using a custom postgresql.conf
file, which TAP does just fine. I haven't been able to see the
right way to get a TAP test to set up a customized installation to
run isolation tests against. If I can get that working, I have
additional tests I can drop into that.
Launching psql from PostgresNode does not hold the connection open; we
would need to reinvent/integrate the logic of existing drivers to hold
the connection context properly, but that's utterly complicated for
not much gain.
Check contrib/test_decoding's makefile. It does just that with
isolationtester.
pg_isolation_regress --temp-config is the key item here, you can
enforce a test to run on a server with a wanted configuration set.
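For reference, the contrib/test_decoding approach described above boils
down to a Makefile rule along these lines. This is only a hedged sketch
of the pattern, not the contents of any attached patch: the spec name
(sto_basic), the config file name (sto.conf), and its path are
illustrative placeholders.

```makefile
# Hypothetical Makefile fragment modeled on contrib/test_decoding.
# ISOLATIONCHECKS lists spec files (without the .spec extension);
# --temp-config points pg_isolation_regress at a custom postgresql.conf
# snippet (e.g. one setting old_snapshot_threshold) applied to the
# temporary installation the tests run against.
ISOLATIONCHECKS = sto_basic

isolationcheck: all | submake-isolation
	$(pg_isolation_regress_check) \
	    --temp-config $(srcdir)/sto.conf \
	    $(ISOLATIONCHECKS)
```

The key point is that the temp-config mechanism lets the isolation
tester run against a server with the wanted GUC already set, which is
exactly the combination described upthread.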
--
Michael
On Tue, Mar 1, 2016 at 12:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Mar 1, 2016 at 9:35 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
Basically, a connection needs to remain open and interleave
commands with other connections, which the isolation tester does
just fine; but it needs to do that using a custom postgresql.conf
file, which TAP does just fine. I haven't been able to see the
right way to get a TAP test to set up a customized installation to
run isolation tests against. If I can get that working, I have
additional tests I can drop into that.
Check contrib/test_decoding's makefile. It does just that with
isolationtester.
pg_isolation_regress --temp-config is the key item here, you can
enforce a test to run on a server with a wanted configuration set.
Thanks for the tips. Attached is a minimal set of isolation tests.
I can expand on it if needed, but wanted:
(1) to confirm that this is the right way to do this, and
(2) how long people were willing to tolerate these tests running.
Since we're making this time-based (by popular demand), there must
be delays to see the new behavior. This very minimal pair of tests
runs in just under one minute on my i7. Decent coverage of all the
index AMs would probably require tests which run for at least 10
minutes, and probably double that. I don't recall any satisfactory
resolution to prior discussions about long-running tests.
This is a follow-on patch, just to add isolation testing; the prior
patch must be applied, too.
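To make the shape of such a time-based test concrete, an isolation spec
for this feature might look roughly like the following. This is a
hypothetical sketch, not the attached patch: the table name, step names,
and sleep duration are illustrative, and it assumes the temp config sets
old_snapshot_threshold to one minute.

```
# sto_basic.spec -- illustrative sketch only.
setup
{
    CREATE TABLE sto1 (c int NOT NULL);
    INSERT INTO sto1 SELECT generate_series(1, 1000);
}

teardown
{
    DROP TABLE sto1;
}

# s1 holds a cursor (and thus a registered snapshot) past the threshold.
session "s1"
step "s1decl"  { DECLARE cursor1 CURSOR FOR SELECT c FROM sto1; }
step "s1f1"    { FETCH FIRST FROM cursor1; }
step "s1sleep" { SELECT pg_sleep(65); }
step "s1f2"    { FETCH FIRST FROM cursor1; }

# s2 modifies data the old snapshot would need to read.
session "s2"
step "s2u"     { UPDATE sto1 SET c = 1001 WHERE c = 1; }

# The final fetch should fail with SQLSTATE 72000 ("snapshot too old").
permutation "s1decl" "s1f1" "s2u" "s1sleep" "s1f2"
```

The built-in delay is what makes these tests slow: the sleep must
outlive the configured threshold, which is why even a minimal pair of
tests takes on the order of a minute.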
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: snapshot-too-old-v4a.patch
On Thu, Mar 3, 2016 at 2:40 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
Thanks for the tips. Attached is a minimal set of isolation tests.
I can expand on it if needed, but wanted:
(1) to confirm that this is the right way to do this, and
(2) how long people were willing to tolerate these tests running.
Since we're making this time-based (by popular demand), there must
be delays to see the new behavior. This very minimal pair of tests
runs in just under one minute on my i7. Decent coverage of all the
index AMs would probably require tests which run for at least 10
minutes, and probably double that. I don't recall any satisfactory
resolution to prior discussions about long-running tests.
This is a follow-on patch, just to add isolation testing; the prior
patch must be applied, too.
Michael, any chance that you could take a look at what Kevin did here
and see if it looks good?
I'm sure the base patch could use more review too, if anyone can find the time.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 11, 2016 at 2:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 3, 2016 at 2:40 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
Thanks for the tips. Attached is a minimal set of isolation tests.
I can expand on it if needed, but wanted:
(1) to confirm that this is the right way to do this, and
(2) how long people were willing to tolerate these tests running.
Since we're making this time-based (by popular demand), there must
be delays to see the new behavior. This very minimal pair of tests
runs in just under one minute on my i7. Decent coverage of all the
index AMs would probably require tests which run for at least 10
minutes, and probably double that. I don't recall any satisfactory
resolution to prior discussions about long-running tests.This is a follow-on patch, just to add isolation testing; the prior
patch must be applied, too.Michael, any chance that you could take a look at what Kevin did here
and see if it looks good?
OK, I am marking this email. Just don't expect any updates from my
side until mid/end of next week.
I'm sure the base patch could use more review too, if anyone can find the time.
I guess I am going to need to look at the patch if feedback for the
tests is needed. There is no point in looking at the tests without
poking at the patch.
--
Michael
New patch just to merge in recent commits -- it was starting to
show some bit-rot. Tests folded in with main patch.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company