pgsql: Modify the isolation tester so that multiple sessions can wait.

Started by Robert Haasalmost 10 years ago9 messages
#1Robert Haas
rhaas@postgresql.org

Modify the isolation tester so that multiple sessions can wait.

This allows testing of deadlock scenarios. Scenarios that would
previously have been considered invalid are now simply taken as a
scenario in which more than one backend will wait.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/38f8bdcac4982215beb9f65a19debecaf22fd470

Modified Files
--------------
src/test/isolation/README | 9 +-
src/test/isolation/isolationtester.c | 330 ++++++++++++++++++++---------------
2 files changed, 194 insertions(+), 145 deletions(-)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#1)
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

Robert Haas wrote:

Modify the isolation tester so that multiple sessions can wait.

Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and
it's been failing in isolationtester. (I kindly accepted my suggestion
to put it to run even though it is failing so that we could look at the
logs). I think this commit is the problem; see the backtrace he sent me
on private email.

#0 0x00001e18c62af87a in thrkill () at <stdin>:2
2 <stdin>: No such file or directory.
in <stdin>
(gdb) bt
#0 0x00001e18c62af87a in thrkill () at <stdin>:2
#1 0x00001e18c62aaf39 in *_libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:52
#2 0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at
/usr/src/lib/libc/string/memcpy.c:65
#3 0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20,
nsteps=11, steps=Variable "steps" is not available.
) at isolationtester.c:617
#4 0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at
isolationtester.c:369
#5 0x00001e16a79053ce in main (argc=Variable "argc" is not available.
) at isolationtester.c:251

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#2)
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

On Wed, Apr 27, 2016 at 1:51 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

Modify the isolation tester so that multiple sessions can wait.

Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and
it's been failing in isolationtester. (I kindly accepted my suggestion
to put it to run even though it is failing so that we could look at the
logs). I think this commit is the problem; see the backtrace he sent me
on private email.

#0 0x00001e18c62af87a in thrkill () at <stdin>:2
2 <stdin>: No such file or directory.
in <stdin>
(gdb) bt
#0 0x00001e18c62af87a in thrkill () at <stdin>:2
#1 0x00001e18c62aaf39 in *_libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:52
#2 0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at
/usr/src/lib/libc/string/memcpy.c:65
#3 0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20,
nsteps=11, steps=Variable "steps" is not available.
) at isolationtester.c:617
#4 0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at
isolationtester.c:369
#5 0x00001e16a79053ce in main (argc=Variable "argc" is not available.
) at isolationtester.c:251

That looks like malloc() returned NULL. I noticed when writing that
patch that isolationtester has never had any checks for malloc
returning NULL, which is bad, and probably worth fixing, but I didn't
choose to stop and fix it at that time.

I don't know off-hand why you see that problem starting at this commit
and not before, or why it doesn't show up on other machines.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#3)
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

Robert Haas wrote:

That looks like malloc() returned NULL. I noticed when writing that
patch that isolationtester has never had any checks for malloc
returning NULL, which is bad, and probably worth fixing, but I didn't
choose to stop and fix it at that time.

I didn't actually check closely but I wondered whether the pointer
arithmetic is actually correct. Note that the memcpy length is zero.
I doubt malloc returning null is the problem; how could it happen
exactly at the same spot every time the suite has run?

I don't know off-hand why you see that problem starting at this commit
and not before, or why it doesn't show up on other machines.

Perhaps it's only a problem for OpenBSD's libc and not for glibc which
is the most common. The only other openbsd machine in buildfarm doesn't
run the isolation tests.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Alvaro Herrera (#4)
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

That looks like malloc() returned NULL. I noticed when writing that
patch that isolationtester has never had any checks for malloc
returning NULL, which is bad, and probably worth fixing, but I didn't
choose to stop and fix it at that time.

I didn't actually check closely but I wondered whether the pointer
arithmetic is actually correct. Note that the memcpy length is zero.
I doubt malloc returning null is the problem; how could it happen
exactly at the same spot every time the suite has run?

I don't know off-hand why you see that problem starting at this commit
and not before, or why it doesn't show up on other machines.

Perhaps it's only a problem for OpenBSD's libc and not for glibc which
is the most common. The only other openbsd machine in buildfarm doesn't
run the isolation tests.

Also happens on OpenBSD 5.8. Isn't this a classic case where memmove
is called for? Replacing the memcpy at line 617 with memmove makes
the tests run successfully, but at first glance the other two
instances of memcpy in run_permutation should also be changed to
memmove, no?

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#5)
1 attachment(s)
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

On Thu, Apr 28, 2016 at 8:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

That looks like malloc() returned NULL. I noticed when writing that
patch that isolationtester has never had any checks for malloc
returning NULL, which is bad, and probably worth fixing, but I didn't
choose to stop and fix it at that time.

I didn't actually check closely but I wondered whether the pointer
arithmetic is actually correct. Note that the memcpy length is zero.
I doubt malloc returning null is the problem; how could it happen
exactly at the same spot every time the suite has run?

I don't know off-hand why you see that problem starting at this commit
and not before, or why it doesn't show up on other machines.

Perhaps it's only a problem for OpenBSD's libc and not for glibc which
is the most common. The only other openbsd machine in buildfarm doesn't
run the isolation tests.

Also happens on OpenBSD 5.8. Isn't this a classic case where memmove
is called for? Replacing the memcpy at line 617 with memmove makes
the tests run successfully, but at first glance the other two
instances of memcpy in run_permutation should also be changed to
memmove, no?

That abort at memcpy.c:65 is indeed a check for overlapping regions:

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memcpy.c?rev=1.2&amp;content-type=text/x-cvsweb-markup

That leaves the question of why the backtrace shows arguments that
wouldn't trigger it (for example memcpy exits early for length == 0).
But it appears that those stack arguments are mangled by the time gdb
dumps them, like this:

#include <string.h>
int main(int argc, char *argv[])
{
char buffer[] = "hello world";
memcpy(&buffer[0], &buffer[1], 2);
return 0;
}

#0 0x00000bce27eab90a in kill () at <stdin>:2
#1 0x00000bce27ee5b19 in abort () at /usr/src/lib/libc/stdlib/abort.c:53
#2 0x00000bce27ebcde8 in memcpy (dst0=0xf7ec5, src0=0x6, length=0) at
/usr/src/lib/libc/string/memcpy.c:65
#3 0x00000bcb5fc00c7a in main (argc=1, argv=0x7f7fffff6818) at foo.c:6

Suggested patch attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

memmove.patchapplication/octet-stream; name=memmove.patchDownload
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 2969ce9..908a7ce 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -554,8 +554,8 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps)
 
 				/* Remove that step from the waiting[] array. */
 				if (w + 1 < nwaiting)
-					memcpy(&waiting[w], &waiting[w + 1],
-						   (nwaiting - (w + 1)) * sizeof(Step *));
+					memmove(&waiting[w], &waiting[w + 1],
+							(nwaiting - (w + 1)) * sizeof(Step *));
 				nwaiting--;
 
 				break;
@@ -582,8 +582,8 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps)
 					/* This one finished, too! */
 					errorstep[nerrorstep++] = waiting[w];
 					if (w + 1 < nwaiting)
-						memcpy(&waiting[w], &waiting[w + 1],
-							   (nwaiting - (w + 1)) * sizeof(Step *));
+						memmove(&waiting[w], &waiting[w + 1],
+								(nwaiting - (w + 1)) * sizeof(Step *));
 					nwaiting--;
 				}
 			}
@@ -614,8 +614,8 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps)
 			{
 				errorstep[nerrorstep++] = waiting[w];
 				if (w + 1 < nwaiting)
-					memcpy(&waiting[w], &waiting[w + 1],
-						   (nwaiting - (w + 1)) * sizeof(Step *));
+					memmove(&waiting[w], &waiting[w + 1],
+							(nwaiting - (w + 1)) * sizeof(Step *));
 				nwaiting--;
 			}
 		}
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#6)
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Also happens on OpenBSD 5.8. Isn't this a classic case where memmove
is called for? Replacing the memcpy at line 617 with memmove makes
the tests run successfully, but at first glance the other two
instances of memcpy in run_permutation should also be changed to
memmove, no?

Yeah, that's clearly busted. Surprising that it has not failed on
any other platforms.

Suggested patch attached.

Will push in a moment.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#7)
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

On 2016-04-28 00:15, Tom Lane wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Also happens on OpenBSD 5.8. Isn't this a classic case where memmove
is called for? Replacing the memcpy at line 617 with memmove makes
the tests run successfully, but at first glance the other two
instances of memcpy in run_permutation should also be changed to
memmove, no?

Yeah, that's clearly busted. Surprising that it has not failed on
any other platforms.

Suggested patch attached.

Will push in a moment.

And that seems to have done the trick. I started a manual run of HEAD
on curculio and now it's green on the build farm.

Thanks!

/Mikael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikael Kjellström (#8)
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

=?UTF-8?Q?Mikael_Kjellstr=c3=b6m?= <mikael.kjellstrom@mksoft.nu> writes:

And that seems to have done the trick. I started a manual run of HEAD
on curculio and now it's green on the build farm.

Thanks!

Thank *you*, for putting up a buildfarm member that caught this bug.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers