infinite loop in parallel hash joins / DSA / get_best_segment

Started by Tomas Vondra over 7 years ago - 5 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
2 attachment(s)

Hi,

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

1) A parallel worker process is consuming 100% CPU, with perf
reporting a profile like this:

34.66% postgres [.] get_segment_by_index
29.44% postgres [.] get_best_segment
29.22% postgres [.] unlink_segment.isra.2
6.66% postgres [.] fls
0.02% [unknown] [k] 0xffffffffb10014b0

So all the time seems to be spent within get_best_segment.

2) The backtrace looks like this (full backtrace attached):

#0 0x0000561a748c4f89 in get_segment_by_index
#1 0x0000561a748c5653 in get_best_segment
#2 0x0000561a748c67a9 in dsa_allocate_extended
#3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
#4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
#5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
#6 ExecHashJoinImpl
#7 ExecParallelHashJoin
#8 ExecProcNode
...

3) The infinite loop seems to be pretty obvious - after setting a
breakpoint on get_segment_by_index we get this:

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

That is, we call the function with the same index over and over.

Why is that? Well:

(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}

The header's prev and next links both point back at slot 3 itself, so
the walk over the segment bin's list never advances past this segment.
So, we loop forever.
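
To illustrate what that means, here's a minimal standalone sketch
(deliberately not the real dsa.c code) of the kind of list walk
get_best_segment does - once a header's next points back at its own
slot, the walk can never reach the end-of-list marker:

#include <stdio.h>

#define SEGMENT_INDEX_NONE ((unsigned) -1)	/* hypothetical end-of-list marker */

typedef struct segment_header
{
	unsigned	prev;
	unsigned	next;
} segment_header;

int
main(void)
{
	/* slot 3 ended up linked to itself, as in the gdb output above */
	segment_header headers[4] = {
		{SEGMENT_INDEX_NONE, SEGMENT_INDEX_NONE},
		{SEGMENT_INDEX_NONE, SEGMENT_INDEX_NONE},
		{SEGMENT_INDEX_NONE, SEGMENT_INDEX_NONE},
		{3, 3}
	};
	unsigned	index = 3;		/* head of the (broken) bin list */
	int			steps = 0;

	/* same shape as a bin walk: follow next until the sentinel */
	while (index != SEGMENT_INDEX_NONE && steps++ < 10)
	{
		printf("visiting segment %u\n", index);
		index = headers[index].next;	/* 3 -> 3 -> 3 -> ... */
	}
	/* without the artificial step limit, this loop would never exit */
	return 0;
}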

I don't know exactly what the triggering conditions are here. I've only
ever observed the issue on TPC-H query #4 at scale 16GB, with a
partitioned lineitem table and work_mem set to 8MB. And it seems I can
reproduce it pretty reliably.
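
For reference, here's a sketch of the setup - not a verbatim copy of my
benchmark script; the query is the standard TPC-H Q4 template with
example substitution parameters, and the parallel settings are just
example values that make the planner choose a Parallel Hash Join:

-- settings used for the runs described above
SET work_mem = '8MB';
SET max_parallel_workers_per_gather = 4;

-- standard TPC-H Q4 (order priority checking), example parameters
SELECT o_orderpriority, count(*) AS order_count
FROM orders
WHERE o_orderdate >= DATE '1993-07-01'
  AND o_orderdate < DATE '1993-07-01' + INTERVAL '3' month
  AND EXISTS (SELECT 1
              FROM lineitem
              WHERE l_orderkey = o_orderkey
                AND l_commitdate < l_receiptdate)
GROUP BY o_orderpriority
ORDER BY o_orderpriority;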

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

explain.log (text/x-log)
backtrace.txt (text/plain; charset=UTF-8)
#2 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tomas Vondra (#1)
Re: infinite loop in parallel hash joins / DSA / get_best_segment

On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

1) A parallel worker process is consuming 100% CPU, with perf
reporting a profile like this:

34.66% postgres [.] get_segment_by_index
29.44% postgres [.] get_best_segment
29.22% postgres [.] unlink_segment.isra.2
6.66% postgres [.] fls
0.02% [unknown] [k] 0xffffffffb10014b0

So all the time seems to be spent within get_best_segment.

2) The backtrace looks like this (full backtrace attached):

#0 0x0000561a748c4f89 in get_segment_by_index
#1 0x0000561a748c5653 in get_best_segment
#2 0x0000561a748c67a9 in dsa_allocate_extended
#3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
#4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
#5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
#6 ExecHashJoinImpl
#7 ExecParallelHashJoin
#8 ExecProcNode
...

3) The infinite loop seems to be pretty obvious - after setting a
breakpoint on get_segment_by_index we get this:

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

That is, we call the function with the same index over and over.

Why is that? Well:

(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}

So, we loop forever.

I don't know exactly what the triggering conditions are here. I've only
ever observed the issue on TPC-H query #4 at scale 16GB, with a
partitioned lineitem table and work_mem set to 8MB. And it seems I can
reproduce it pretty reliably.

Urgh. Thanks Tomas. I will investigate.

--
Thomas Munro
http://www.enterprisedb.com

#3 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#2)
1 attachment(s)
Re: infinite loop in parallel hash joins / DSA / get_best_segment

On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

...

Urgh. Thanks Tomas. I will investigate.

Thanks very much to Tomas for giving me access to his benchmarking
machine where this could be reproduced. Tomas was doing performance
testing with no assertions, but with a cassert build I was able to hit
an assertion failure after a while and eventually figure out what was
going wrong. The problem is that the 'segment bins' (linked lists
that group segments by the largest contiguous run of free pages) can
become corrupted when segments become completely free and are returned
to the operating system and then the same segment slot (index number)
is recycled, with the right sequence of allocations and frees and
timing. There is an LWLock that protects segment slot and bin
manipulations, but there is a kind of ABA problem where one backend
can finish up looking at the defunct former inhabitant of a slot that
another backend has recently created a new segment in. There is
handling for that in the form of freed_segment_counter, a kind of
generation/invalidation signalling, but there are a couple of paths
that fail to check it at the right times.
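
To sketch the protocol with hypothetical names (this is deliberately
not the real dsa.c structures, just the shape of the idea): the shared
area keeps a counter that is bumped whenever a segment slot is freed,
and each backend compares it against a locally cached copy after
acquiring the area lock and before resolving any slot index, dropping
any stale local mappings if it has fallen behind:

#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 64

/* state shared by all backends (would live in shared memory) */
typedef struct shared_area_control
{
	unsigned long freed_segment_counter;	/* bumped whenever a slot is freed */
	bool		slot_freed[MAX_SLOTS];		/* which slots are defunct */
} shared_area_control;

/* each backend's private view of the area */
typedef struct local_area
{
	shared_area_control *control;
	unsigned long freed_segment_counter;	/* locally cached generation */
	void	   *mapped_address[MAX_SLOTS];	/* this backend's mapping per slot */
	int			high_slot_index;
} local_area;

/*
 * Call with the area lock held, before translating any slot index into a
 * local mapping (the analogue of check_for_freed_segments_locked() in the
 * attached patch).
 */
void
forget_freed_slots_locked(local_area *area)
{
	unsigned long counter = area->control->freed_segment_counter;

	if (area->freed_segment_counter != counter)
	{
		for (int i = 0; i <= area->high_slot_index; i++)
		{
			/* slot was freed (and may be recycled): drop the stale mapping */
			if (area->mapped_address[i] != NULL && area->control->slot_freed[i])
				area->mapped_address[i] = NULL; /* stands in for munmap() */
		}
		area->freed_segment_counter = counter;
	}
}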

With the attached draft patch, Tomas's benchmark script runs happily
for long periods. A bit more study required with fresh eyes,
tomorrow.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-dsa-segment-free-bug.patch (application/octet-stream)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 488678191b..f89048c823 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -405,6 +405,7 @@ static dsa_area *create_internal(void *place, size_t size,
 static dsa_area *attach_internal(void *place, dsm_segment *segment,
 				dsa_handle handle);
 static void check_for_freed_segments(dsa_area *area);
+static void check_for_freed_segments_locked(dsa_area *area);
 
 /*
  * Create a new shared area in a new DSM segment.  Further DSM segments will
@@ -1778,8 +1779,6 @@ destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
 	int			size_class = span->size_class;
 	dsa_segment_map *segment_map;
 
-	segment_map =
-		get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
 
 	/* Remove it from its fullness class list. */
 	unlink_span(area, span);
@@ -1790,6 +1789,9 @@ destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
 	 * could deadlock.
 	 */
 	LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+	check_for_freed_segments_locked(area);
+	segment_map =
+		get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
 	FreePageManagerPut(segment_map->fpm,
 					   DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
 					   span->npages);
@@ -1944,6 +1946,7 @@ get_best_segment(dsa_area *area, Size npages)
 	Size		bin;
 
 	Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+	check_for_freed_segments_locked(area);
 
 	/*
 	 * Start searching from the first bin that *might* have enough contiguous
@@ -2220,10 +2223,27 @@ check_for_freed_segments(dsa_area *area)
 	freed_segment_counter = area->control->freed_segment_counter;
 	if (unlikely(area->freed_segment_counter != freed_segment_counter))
 	{
-		int			i;
-
 		/* Check all currently mapped segments to find what's been freed. */
 		LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+		check_for_freed_segments_locked(area);
+		LWLockRelease(DSA_AREA_LOCK(area));
+	}
+}
+
+/*
+ * Workhorse for check_for_freed_segments(), for use when the area lock is
+ * already held.
+ */
+static void
+check_for_freed_segments_locked(dsa_area *area)
+{
+	Size		freed_segment_counter;
+	int		i;
+
+	Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+	freed_segment_counter = area->control->freed_segment_counter;
+	if (unlikely(area->freed_segment_counter != freed_segment_counter))
+	{
 		for (i = 0; i <= area->high_segment_index; ++i)
 		{
 			if (area->segment_maps[i].header != NULL &&
@@ -2235,7 +2255,6 @@ check_for_freed_segments(dsa_area *area)
 				area->segment_maps[i].mapped_address = NULL;
 			}
 		}
-		LWLockRelease(DSA_AREA_LOCK(area));
 		area->freed_segment_counter = freed_segment_counter;
 	}
 }
#4 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#3)
1 attachment(s)
Re: infinite loop in parallel hash joins / DSA / get_best_segment

On Mon, Sep 17, 2018 at 9:12 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

...

With the attached draft patch, Tomas's benchmark script runs happily
for long periods. A bit more study required with fresh eyes,
tomorrow.

Here is a tidied up version that I propose to back-patch to 10 & 11.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Fix-segment_bins-corruption-in-dsa.c.patch (application/octet-stream)
From 9e5db70291ab85ce04a77401683a549964dfe9dc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Tue, 18 Sep 2018 09:00:23 +1200
Subject: [PATCH] Fix segment_bins corruption in dsa.c.

If a segment has been freed by dsa.c because it is entirely empty, other
backends must make sure to unmap it before following links to new
segments that might happen to have the same index number, or they could
inadvertently finish up looking at a defunct segment and then corrupt
the segment_bins lists.  The correct protocol requires checking
freed_segment_counter after acquiring the area lock and before resolving
any index number to a segment.  Add the missing checks and an assertion.

Back-patch to 10, where dsa.c first arrived.

Author: Thomas Munro
Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/CAEepm%3D0thg%2Bja5zGVa7jBy-uqyHrTqTm8HGhEOtMmigGrAqTbw%40mail.gmail.com
---
 src/backend/utils/mmgr/dsa.c | 49 ++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 488678191b..ec7c475ca2 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -405,6 +405,7 @@ static dsa_area *create_internal(void *place, size_t size,
 static dsa_area *attach_internal(void *place, dsm_segment *segment,
 				dsa_handle handle);
 static void check_for_freed_segments(dsa_area *area);
+static void check_for_freed_segments_locked(dsa_area *area);
 
 /*
  * Create a new shared area in a new DSM segment.  Further DSM segments will
@@ -1065,6 +1066,7 @@ dsa_dump(dsa_area *area)
 	 */
 
 	LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+	check_for_freed_segments_locked(area);
 	fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
 	fprintf(stderr, "  max_total_segment_size: %zu\n",
 			area->control->max_total_segment_size);
@@ -1762,6 +1764,22 @@ get_segment_by_index(dsa_area *area, dsa_segment_index index)
 			   (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
 	}
 
+	/*
+	 * Callers of dsa_get_address() or dsa_free() don't hold the area lock,
+	 * but it's a bug in the calling code and undefined behavior if
+	 * the address is not live (ie if the segment might possibly have been
+	 * freed, they're trying to use a dangling pointer).
+	 *
+	 * For code that manipulates segment_bins lists, it would be a bug in
+	 * dsa.c if we ever reach a freed segment that way.  After it's marked
+	 * as freed, the only thing any backend should do with it is unmap it,
+	 * and it should always do that via check_for_freed_segments_locked()
+	 * before arriving here to resolve an index to a segment_map.
+	 *
+	 * Either way we can assert that we didn't arrive at a freed segment.
+	 */
+	Assert(!area->segment_maps[index].header->freed);
+
 	return &area->segment_maps[index];
 }
 
@@ -1778,8 +1796,6 @@ destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
 	int			size_class = span->size_class;
 	dsa_segment_map *segment_map;
 
-	segment_map =
-		get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
 
 	/* Remove it from its fullness class list. */
 	unlink_span(area, span);
@@ -1790,6 +1806,9 @@ destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
 	 * could deadlock.
 	 */
 	LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+	check_for_freed_segments_locked(area);
+	segment_map =
+		get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
 	FreePageManagerPut(segment_map->fpm,
 					   DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
 					   span->npages);
@@ -1944,6 +1963,7 @@ get_best_segment(dsa_area *area, Size npages)
 	Size		bin;
 
 	Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+	check_for_freed_segments_locked(area);
 
 	/*
 	 * Start searching from the first bin that *might* have enough contiguous
@@ -2220,10 +2240,30 @@ check_for_freed_segments(dsa_area *area)
 	freed_segment_counter = area->control->freed_segment_counter;
 	if (unlikely(area->freed_segment_counter != freed_segment_counter))
 	{
-		int			i;
-
 		/* Check all currently mapped segments to find what's been freed. */
 		LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+		check_for_freed_segments_locked(area);
+		LWLockRelease(DSA_AREA_LOCK(area));
+	}
+}
+
+/*
+ * Workhorse for check_for_freed_segments(), and also used directly in paths
+ * where the area lock is already held.  This should be called after acquiring
+ * the lock but before looking up any segment by index number, to make sure we
+ * unmap any stale segments that might have previously had the same index as a
+ * current segment.
+ */
+static void
+check_for_freed_segments_locked(dsa_area *area)
+{
+	Size		freed_segment_counter;
+	int		i;
+
+	Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+	freed_segment_counter = area->control->freed_segment_counter;
+	if (unlikely(area->freed_segment_counter != freed_segment_counter))
+	{
 		for (i = 0; i <= area->high_segment_index; ++i)
 		{
 			if (area->segment_maps[i].header != NULL &&
@@ -2235,7 +2275,6 @@ check_for_freed_segments(dsa_area *area)
 				area->segment_maps[i].mapped_address = NULL;
 			}
 		}
-		LWLockRelease(DSA_AREA_LOCK(area));
 		area->freed_segment_counter = freed_segment_counter;
 	}
 }
-- 
2.17.0

#5 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#4)
Re: infinite loop in parallel hash joins / DSA / get_best_segment

On Tue, Sep 18, 2018 at 11:35 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 9:12 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

With the attached draft patch, Tomas's benchmark script runs happily
for long periods. A bit more study required with fresh eyes,
tomorrow.

Here is a tidied up version that I propose to back-patch to 10 & 11.

Done.

I figured reviews would be unlikely to show up soon for a dsa.c patch,
but of course post-commit review is very welcome. I thought it best
to have this fix in REL_11_STABLE being tested for as long as possible
before any potential RC1 tag, especially if Tomas plans to continue
his "big join" testing on that branch.

--
Thomas Munro
http://www.enterprisedb.com