Atomics for heap_parallelscan_nextpage()

Started by David Rowley over 8 years ago, 32 messages
#1 David Rowley
david.rowley@2ndquadrant.com
2 attachment(s)

Hi,

A while back I did some benchmarking on a big 4 socket machine to
explore a bit around the outer limits of parallel aggregates. I
discovered along the way that, given enough workers, and a simple
enough task, that seq-scan workers were held up waiting for the lock
to be released in heap_parallelscan_nextpage().

I've since done a little work in this area to try to improve things. I
ended up posting about it yesterday in [1]. My original patch used
batching to solve the issue; instead of allocating 1 block at a time,
the batching patch allocated a range of 10 blocks for the worker to
process. However, the implementation still needed a bit of work around
reporting sync-scan locations.

Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.

To keep things a bit more simple and streamlined, I ended up pulling
out the logic for setting the startblock into another function, which
we call only once, before the first call to
heap_parallelscan_nextpage(). I also ended up replacing phs_cblock
with a counter that always starts at zero. The actual block is
calculated based on that plus the startblock, modulo nblocks.
This makes it a good bit simpler to detect when we've allocated all
the blocks to the workers, and also works nicely for wrapping back to
the start of the relation when we started somewhere in the middle due
to piggybacking on a synchronous scan.
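
As a worked illustration of the wrap-around arithmetic (a standalone
sketch with invented numbers, not code from the patch): with nblocks =
10 and a sync-scan startblock of 7, successive counter values visit
blocks 7, 8, 9, 0, 1, ... and the scan is exhausted once the counter
reaches 10.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t	nblocks = 10;		/* stands in for rs_nblocks */
	uint64_t	startblock = 7;		/* stands in for phs_startblock */

	/*
	 * Each fetch-add hands out the next counter value; mapping it to a
	 * block wraps past the end of the relation back to block 0.
	 */
	for (uint64_t counter = 0; counter < nblocks; counter++)
		printf("counter %u -> block %u\n",
			   (unsigned) counter,
			   (unsigned) ((counter + startblock) % nblocks));

	return 0;			/* prints blocks 7 8 9 0 1 2 3 4 5 6 */
}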

Performance:

With parallel_workers=71, it looks something like:

Query 1: 881 GB, ~6 billion row TPC-H lineitem table.

tpch=# select count(*) from lineitem;
count
------------
5999989709
(1 row)

-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)

-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)

Query 2: Single int column table with 2 billion rows.

tpch=# select count(*) from a;
count
------------
2000000000
(1 row)

-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)

-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)

I've also attached a text file with the perf report for the lineitem
query. You'll notice that heap_parallelscan_nextpage() is very
visible on master, but not with either of the two patches.

With the 2nd query, heap_parallelscan_nextpage() is fairly insignificant
in master's profile, showing up at only 0.48%. This is likely
because more tuples are read from each page, and more aggregation
work gets done before the next page is needed. I'm uncertain why I
previously saw a speed-up in this case in [1].

I've also noticed that both the atomics patch and unpatched master do
something that looks a bit weird with synchronous seq-scans. If the
parallel seq-scan piggybacked on another scan, then subsequent
parallel scans will start at the same non-zero block location, even
when no other concurrent scans exist. I'd have expected this to go
back to block 0 again, but maybe I'm just failing to understand the
reason for reporting the startblock to ss_report_location() at the end
of the scan.

I'll now add this to the first commitfest of pg11. I just wanted to
note that I've done this, so that it's less likely someone else goes
and repeats the same work.

[1]: /messages/by-id/CAKJS1f-XhfQ2-=85wgYo5b3WtEs=ys=2Rsq=NuvnmaV4ZsM1XQ@mail.gmail.com

[2]: /messages/by-id/20170505023646.3uhnmf2hbwtm63lc@alap3.anarazel.de

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

parallel_next_page_perf.txt (text/plain)
parallel_nextpage_atomics.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..1bbaca9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,6 +58,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -89,6 +90,7 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static void heap_parallelscan_startblock_init(HeapScanDesc scan);
 static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
@@ -510,6 +512,8 @@ heapgettup(HeapScanDesc scan,
 			}
 			if (scan->rs_parallel != NULL)
 			{
+				heap_parallelscan_startblock_init(scan);
+
 				page = heap_parallelscan_nextpage(scan);
 
 				/* Other processes might have already finished the scan. */
@@ -812,6 +816,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 			}
 			if (scan->rs_parallel != NULL)
 			{
+				heap_parallelscan_startblock_init(scan);
+
 				page = heap_parallelscan_nextpage(scan);
 
 				/* Other processes might have already finished the scan. */
@@ -1535,14 +1541,10 @@ heap_rescan(HeapScanDesc scan,
 
 		/*
 		 * Caller is responsible for making sure that all workers have
-		 * finished the scan before calling this, so it really shouldn't be
-		 * necessary to acquire the mutex at all.  We acquire it anyway, just
-		 * to be tidy.
+		 * finished the scan before calling this.
 		 */
 		parallel_scan = scan->rs_parallel;
-		SpinLockAcquire(&parallel_scan->phs_mutex);
-		parallel_scan->phs_cblock = parallel_scan->phs_startblock;
-		SpinLockRelease(&parallel_scan->phs_mutex);
+		pg_atomic_write_u64(&parallel_scan->phs_blockcounter, 0);
 	}
 }
 
@@ -1635,7 +1637,7 @@ heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
 		!RelationUsesLocalBuffers(relation) &&
 		target->phs_nblocks > NBuffers / 4;
 	SpinLockInit(&target->phs_mutex);
-	target->phs_cblock = InvalidBlockNumber;
+	pg_atomic_write_u64(&target->phs_blockcounter, 0);
 	target->phs_startblock = InvalidBlockNumber;
 	SerializeSnapshot(snapshot, target->phs_snapshot_data);
 }
@@ -1660,20 +1662,17 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
- *		heap_parallelscan_nextpage - get the next page to scan
+ *		heap_parallelscan_startblock_init - find and set the scan's startblock
  *
- *		Get the next page to scan.  Even if there are no pages left to scan,
- *		another backend could have grabbed a page to scan and not yet finished
- *		looking at it, so it doesn't follow that the scan is done when the
- *		first backend gets an InvalidBlockNumber return.
+ *		Determine where the parallel seq scan should start.  This function may
+ *		be called many times, once by each parallel worker. We must be careful
+ *		only to set the startblock once.
  * ----------------
  */
-static BlockNumber
-heap_parallelscan_nextpage(HeapScanDesc scan)
+static void
+heap_parallelscan_startblock_init(HeapScanDesc scan)
 {
-	BlockNumber page = InvalidBlockNumber;
 	BlockNumber sync_startpage = InvalidBlockNumber;
-	BlockNumber report_page = InvalidBlockNumber;
 	ParallelHeapScanDesc parallel_scan;
 
 	Assert(scan->rs_parallel);
@@ -1705,46 +1704,64 @@ retry:
 			sync_startpage = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 			goto retry;
 		}
-		parallel_scan->phs_cblock = parallel_scan->phs_startblock;
 	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Get the next page to scan.  Even if there are no pages left to scan,
+ *		another backend could have grabbed a page to scan and not yet finished
+ *		looking at it, so it doesn't follow that the scan is done when the
+ *		first backend gets an InvalidBlockNumber return.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan)
+{
+	BlockNumber page;
+	ParallelHeapScanDesc parallel_scan;
+	uint64 counter;
+
+	Assert(scan->rs_parallel);
+	parallel_scan = scan->rs_parallel;
 
 	/*
-	 * The current block number is the next one that needs to be scanned,
-	 * unless it's InvalidBlockNumber already, in which case there are no more
-	 * blocks to scan.  After remembering the current value, we must advance
-	 * it so that the next call to this function returns the next block to be
-	 * scanned.
+	 * We make use of phs_blockcounter to track how many pages have been
+	 * allocated to workers already.  We're able to easily determine if we've
+	 * allocated all blocks out already by testing if counter has reached or
+	 * exceeded rs_nblocks.  Even after the first worker has found counter to
+	 * have exceeded rs_nblocks, subsequent workers might report in here
+	 * looking for another page, which will further increment
+	 * phs_blockcounter.  For this reason the block counter must be 64bit. We
+	 * don't want to risk an overflow when scanning very large relations.
+	 *
+	 * The actual page to return must be calculated using the counter.  We
+	 * simply add on the startblock number and modulo rs_nblocks to handle
+	 * the scanning of any blocks earlier than phs_startblock.
 	 */
-	page = parallel_scan->phs_cblock;
-	if (page != InvalidBlockNumber)
-	{
-		parallel_scan->phs_cblock++;
-		if (parallel_scan->phs_cblock >= scan->rs_nblocks)
-			parallel_scan->phs_cblock = 0;
-		if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
-		{
-			parallel_scan->phs_cblock = InvalidBlockNumber;
-			report_page = parallel_scan->phs_startblock;
-		}
-	}
+	counter = pg_atomic_fetch_add_u64(&parallel_scan->phs_blockcounter, 1);
 
-	/* Release the lock. */
-	SpinLockRelease(&parallel_scan->phs_mutex);
+	/* Have we already allocated all blocks of this relation? */
+	if (counter >= scan->rs_nblocks)
+		page = InvalidBlockNumber;
+	else
+		page = (counter + parallel_scan->phs_startblock) % scan->rs_nblocks;
 
 	/*
 	 * Report scan location.  Normally, we report the current page number.
 	 * When we reach the end of the scan, though, we report the starting page,
 	 * not the ending page, just so the starting positions for later scans
 	 * doesn't slew backwards.  We only report the position at the end of the
-	 * scan once, though: subsequent callers will have report nothing, since
-	 * they will have page == InvalidBlockNumber.
+	 * scan once, though: subsequent callers will report nothing.
 	 */
 	if (scan->rs_syncscan)
 	{
-		if (report_page == InvalidBlockNumber)
-			report_page = page;
-		if (report_page != InvalidBlockNumber)
-			ss_report_location(scan->rs_rd, report_page);
+		if (page != InvalidBlockNumber)
+			ss_report_location(scan->rs_rd, page);
+		else if (counter == scan->rs_nblocks)
+			ss_report_location(scan->rs_rd, parallel_scan->phs_startblock);
 	}
 
 	return page;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3fc726d..bf19b88 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -35,9 +35,10 @@ typedef struct ParallelHeapScanDescData
 	Oid			phs_relid;		/* OID of relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	BlockNumber phs_nblocks;	/* # blocks in relation at start of scan */
-	slock_t		phs_mutex;		/* mutual exclusion for block number fields */
+	slock_t		phs_mutex;		/* mutual exclusion for setting startblock */
 	BlockNumber phs_startblock; /* starting block number */
-	BlockNumber phs_cblock;		/* current block number */
+	pg_atomic_uint64 phs_blockcounter;	/* number of blocks allocated to
+										 * workers so far. */
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
 }	ParallelHeapScanDescData;
 
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: David Rowley (#1)
Re: Atomics for heap_parallelscan_nextpage()

On Sat, May 6, 2017 at 7:27 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

I've also noticed that both the atomics patch and unpatched master do
something that looks a bit weird with synchronous seq-scans. If the
parallel seq-scan piggybacked on another scan, then subsequent
parallel scans will start at the same non-zero block location, even
when no other concurrent scans exist.

That is what we expect. The same is true for non-parallel scans as
well, refer below code for non-parallel scans.

heapgettup_pagemode()
{
..
	finished = (page == scan->rs_startblock) ||
		(scan->rs_numblocks != InvalidBlockNumber ?
		 --scan->rs_numblocks == 0 : false);

	/*
	 * Report our new scan position for synchronization purposes. We
	 * don't do that when moving backwards, however. That would just
	 * mess up any other forward-moving scanners.
	 *
	 * Note: we do this before checking for end of scan so that the
	 * final state of the position hint is back at the start of the
	 * rel. That's not strictly necessary, but otherwise when you run
	 * the same query multiple times the starting position would shift
	 * a little bit backwards on every invocation, which is confusing.
	 * We don't guarantee any specific ordering in general, though.
	 */
	if (scan->rs_syncscan)
		ss_report_location(scan->rs_rd, page);
..
}

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: David Rowley (#1)
Re: Atomics for heap_parallelscan_nextpage()

On 05/06/2017 04:57 PM, David Rowley wrote:

Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.

To keep things a bit more simple and streamlined, I ended up pulling
out the logic for setting the startblock into another function, which
we call only once, before the first call to
heap_parallelscan_nextpage(). I also ended up replacing phs_cblock
with a counter that always starts at zero. The actual block is
calculated based on that plus the startblock, modulo nblocks.
This makes it a good bit simpler to detect when we've allocated all
the blocks to the workers, and also works nicely for wrapping back to
the start of the relation when we started somewhere in the middle due
to piggybacking on a synchronous scan.

Looks reasonable. I edited the comments and the variable names a bit, to
my liking, and committed. Thanks!

- Heikki


#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#3)
Re: Atomics for heap_parallelscan_nextpage()

On 08/16/2017 04:20 PM, Heikki Linnakangas wrote:

On 05/06/2017 04:57 PM, David Rowley wrote:

Andres mentioned in [2] that it might be worth exploring using
atomics to do the same job. So I went ahead and did that, and came
up with the attached, which is a slight variation on what he
mentioned in the thread.

To keep things a bit more simple and streamlined, I ended up
pulling out the logic for setting the startblock into another
function, which we call only once, before the first call to
heap_parallelscan_nextpage(). I also ended up replacing phs_cblock
with a counter that always starts at zero. The actual block is
calculated based on that plus the startblock, modulo nblocks.
This makes it a good bit simpler to detect when we've allocated
all the blocks to the workers, and also works nicely for wrapping
back to the start of the relation when we started somewhere in
the middle due to piggybacking on a synchronous scan.

Looks reasonable. I edited the comments and the variable names a bit,
to my liking, and committed. Thanks!

A couple of 32-bit x86 buildfarm members don't seem to be happy with
this. I'll investigate, but if anyone has a clue, I'm all ears...

- Heikki


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#4)
Re: Atomics for heap_parallelscan_nextpage()

Heikki Linnakangas <hlinnaka@iki.fi> writes:

A couple of 32-bit x86 buildfarm members don't seem to be happy with
this. I'll investigate, but if anyone has a clue, I'm all ears...

dromedary's issue seems to be alignment:

TRAP: UnalignedPointer("(((uintptr_t) ((uintptr_t)(ptr)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))) != (uintptr_t)(ptr)", File: "../../../../src/include/port/atomics.h", Line: 452)
2017-08-16 11:11:38.558 EDT [75693:3] LOG: server process (PID 76277) was terminated by signal 6: Abort trap
2017-08-16 11:11:38.558 EDT [75693:4] DETAIL: Failed process was running: select count(*) from a_star;

Not sure if this is your bug or if it's exposing a pre-existing
deficiency in the atomics code, viz, failure to ensure that
pg_atomic_uint64 is actually a 64-bit-aligned type. Andres?

regards, tom lane


#6 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 11:16:58 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

A couple of 32-bit x86 buildfarm members don't seem to be happy with
this. I'll investigate, but if anyone has a clue, I'm all ears...

dromedary's issue seems to be alignment:

TRAP: UnalignedPointer("(((uintptr_t) ((uintptr_t)(ptr)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))) != (uintptr_t)(ptr)", File: "../../../../src/include/port/atomics.h", Line: 452)
2017-08-16 11:11:38.558 EDT [75693:3] LOG: server process (PID 76277) was terminated by signal 6: Abort trap
2017-08-16 11:11:38.558 EDT [75693:4] DETAIL: Failed process was running: select count(*) from a_star;

Not sure if this is your bug or if it's exposing a pre-existing
deficiency in the atomics code, viz, failure to ensure that
pg_atomic_uint64 is actually a 64-bit-aligned type. Andres?

I suspect it's the former. Suspect that the shared memory that holds
the "parallel desc" isn't properly aligned:

void
ExecSeqScanInitializeDSM(SeqScanState *node,
						 ParallelContext *pcxt)
{
	...
	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
	heap_parallelscan_initialize(pscan,
								 node->ss.ss_currentRelation,
								 estate->es_snapshot);

/*
 * ...
 * We allocate backwards from the end of the segment, so that the TOC entries
 * can grow forward from the start of the segment.
 */
extern void *
shm_toc_allocate(shm_toc *toc, Size nbytes)
{
	volatile shm_toc *vtoc = toc;
	Size		total_bytes;
	Size		allocated_bytes;
	Size		nentry;
	Size		toc_bytes;

	/* Make sure request is well-aligned. */
	nbytes = BUFFERALIGN(nbytes);
	...
	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
}

/*
 * Initialize a region of shared memory with a table of contents.
 */
shm_toc *
shm_toc_create(uint64 magic, void *address, Size nbytes)
{
	shm_toc    *toc = (shm_toc *) address;

	Assert(nbytes > offsetof(shm_toc, toc_entry));
	toc->toc_magic = magic;
	SpinLockInit(&toc->toc_mutex);
	toc->toc_total_bytes = nbytes;
	toc->toc_allocated_bytes = 0;
	toc->toc_nentry = 0;

	return toc;
}

Afaict shm_toc_create/shm_toc_allocate don't actually guarantee that the end
of the toc's memory is suitably aligned. But I didn't yet have any
coffee, so ...

- Andres


#7 Peter Geoghegan

In reply to: Andres Freund (#6)
Re: Atomics for heap_parallelscan_nextpage()

On Wed, Aug 16, 2017 at 9:55 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-16 11:16:58 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

A couple of 32-bit x86 buildfarm members don't seem to be happy with
this. I'll investigate, but if anyone has a clue, I'm all ears...

dromedary's issue seems to be alignment:

TRAP: UnalignedPointer("(((uintptr_t) ((uintptr_t)(ptr)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))) != (uintptr_t)(ptr)", File: "../../../../src/include/port/atomics.h", Line: 452)
2017-08-16 11:11:38.558 EDT [75693:3] LOG: server process (PID 76277) was terminated by signal 6: Abort trap
2017-08-16 11:11:38.558 EDT [75693:4] DETAIL: Failed process was running: select count(*) from a_star;

Not sure if this is your bug or if it's exposing a pre-existing
deficiency in the atomics code, viz, failure to ensure that
pg_atomic_uint64 is actually a 64-bit-aligned type. Andres?

I suspect it's the former. Suspect that the shared memory that holds
the "parallel desc" isn't properly aligned:

Clang has an -fsanitize=alignment option that may be useful here.

--
Peter Geoghegan


#8 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Re: Atomics for heap_parallelscan_nextpage()

Robert, Tom,

On 2017-08-16 09:55:15 -0700, Andres Freund wrote:

Not sure if this is your bug or if it's exposing a pre-existing
deficiency in the atomics code, viz, failure to ensure that
pg_atomic_uint64 is actually a 64-bit-aligned type. Andres?

I suspect it's the former. Suspect that the shared memory that holds
the "parallel desc" isn't properly aligned:

Or, well, a mixture of both, because it seems like a deficiency in the
shm_toc code, rather than the atomics code.

Afaict shm_toc_create/shm_toc_allocate don't actually guarantee that the end
of the toc's memory is suitably aligned. But I didn't yet have any
coffee, so ...

Robert, I'm not quite sure what the intended behaviour of shm_toc is wrt
alignment. I see that individual chunks are BUFFERALIGNed (both during
estimation, and allocation). But I don't see how the size of the entire
toc is aligned, which seems a requirement, given we allocate from the
end.
Seems like we'd just have to add alignment of the total size to
shm_toc_estimate()?
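
To make the failure mode concrete, here's a standalone sketch; the base
address and request size are invented, but 93268 is the unaligned total
Tom reports downthread:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uintptr_t	toc = 0x10000;			/* assume an aligned segment base */
	size_t		total_bytes = 93268;	/* 0x16c54, not a multiple of 8 */
	size_t		allocated_bytes = 0;
	size_t		nbytes = 64;			/* request, already BUFFERALIGNed */

	/*
	 * Mirrors the return expression of shm_toc_allocate(): allocating
	 * from the end means the result is only as aligned as total_bytes.
	 */
	uintptr_t	chunk = toc + (total_bytes - allocated_bytes - nbytes);

	printf("chunk %% 8 = %u\n", (unsigned) (chunk % 8));	/* prints 4 */
	return 0;
}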

Greetings,

Andres Freund


#9 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#7)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 09:57:35 -0700, Peter Geoghegan wrote:

On Wed, Aug 16, 2017 at 9:55 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-16 11:16:58 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

A couple of 32-bit x86 buildfarm members don't seem to be happy with
this. I'll investigate, but if anyone has a clue, I'm all ears...

dromedary's issue seems to be alignment:

TRAP: UnalignedPointer("(((uintptr_t) ((uintptr_t)(ptr)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))) != (uintptr_t)(ptr)", File: "../../../../src/include/port/atomics.h", Line: 452)
2017-08-16 11:11:38.558 EDT [75693:3] LOG: server process (PID 76277) was terminated by signal 6: Abort trap
2017-08-16 11:11:38.558 EDT [75693:4] DETAIL: Failed process was running: select count(*) from a_star;

Not sure if this is your bug or if it's exposing a pre-existing
deficiency in the atomics code, viz, failure to ensure that
pg_atomic_uint64 is actually a 64-bit-aligned type. Andres?

I suspect it's the former. Suspect that the shared memory that holds
the "parallel desc" isn't properly aligned:

Clang has an -fsanitize=alignment option that may be useful here.

Given that we're crashing in an assert that does nothing but check for
alignment, before the memory is actually used in a "dangerous" manner, I
don't quite see how?

Greetings,

Andres Freund


#10 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#8)
1 attachment(s)
Re: Atomics for heap_parallelscan_nextpage()

On 08/16/2017 08:10 PM, Andres Freund wrote:

Afaict shm_toc_create/shm_toc_allocate don't actually guarantee that the end
of the toc's memory is suitably aligned. But I didn't yet have any
coffee, so ...

Robert, I'm not quite sure what the intended behaviour of shm_toc is wrt
alignment. I see that individual chunks are BUFFERALIGNed (both during
estimation, and allocation). But I don't see how the size of the entire
toc is aligned, which seems a requirement, given we allocate from the
end.
Seems like we'd just have to add alignment of the total size to
shm_toc_estimate()?

Yeah, that's the gist of it.

The attached patch seems to fix this. I'm not too familiar with this DSM
stuff, but this seems right to me. Unless someone has a better idea
soon, I'll commit this to make the buildfarm happy.

- Heikki

Attachments:

fix-shm_toc-alignment-1.patch (text/x-patch)
diff --git a/src/backend/storage/ipc/shm_toc.c b/src/backend/storage/ipc/shm_toc.c
index 9f259441f0..121d5a1ec9 100644
--- a/src/backend/storage/ipc/shm_toc.c
+++ b/src/backend/storage/ipc/shm_toc.c
@@ -44,7 +44,7 @@ shm_toc_create(uint64 magic, void *address, Size nbytes)
 	Assert(nbytes > offsetof(shm_toc, toc_entry));
 	toc->toc_magic = magic;
 	SpinLockInit(&toc->toc_mutex);
-	toc->toc_total_bytes = nbytes;
+	toc->toc_total_bytes = BUFFERALIGN_DOWN(nbytes);
 	toc->toc_allocated_bytes = 0;
 	toc->toc_nentry = 0;
 
@@ -252,7 +252,8 @@ shm_toc_lookup(shm_toc *toc, uint64 key, bool noError)
 Size
 shm_toc_estimate(shm_toc_estimator *e)
 {
-	return add_size(offsetof(shm_toc, toc_entry),
-					add_size(mul_size(e->number_of_keys, sizeof(shm_toc_entry)),
-							 e->space_for_chunks));
+	return BUFFERALIGN(
+		add_size(offsetof(shm_toc, toc_entry),
+				 add_size(mul_size(e->number_of_keys, sizeof(shm_toc_entry)),
+						  e->space_for_chunks)));
 }
diff --git a/src/include/c.h b/src/include/c.h
index 9066e3c578..af799dc1df 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -598,6 +598,7 @@ typedef NameData *Name;
 #define LONGALIGN_DOWN(LEN)		TYPEALIGN_DOWN(ALIGNOF_LONG, (LEN))
 #define DOUBLEALIGN_DOWN(LEN)	TYPEALIGN_DOWN(ALIGNOF_DOUBLE, (LEN))
 #define MAXALIGN_DOWN(LEN)		TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
+#define BUFFERALIGN_DOWN(LEN)	TYPEALIGN_DOWN(ALIGNOF_BUFFER, (LEN))
 
 /*
  * The above macros will not work with types wider than uintptr_t, like with
#11 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: Atomics for heap_parallelscan_nextpage()

On Wed, Aug 16, 2017 at 1:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, that's the gist of it.

The attached patch seems to fix this. I'm not too familiar with this DSM
stuff, but this seems right to me. Unless someone has a better idea soon,
I'll commit this to make the buildfarm happy.

Mmm, clever. Yeah, that looks reasonable at first glance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#10)
Re: Atomics for heap_parallelscan_nextpage()

Hi,

On 2017-08-16 20:19:29 +0300, Heikki Linnakangas wrote:

diff --git a/src/backend/storage/ipc/shm_toc.c b/src/backend/storage/ipc/shm_toc.c
index 9f259441f0..121d5a1ec9 100644
--- a/src/backend/storage/ipc/shm_toc.c
+++ b/src/backend/storage/ipc/shm_toc.c
@@ -44,7 +44,7 @@ shm_toc_create(uint64 magic, void *address, Size nbytes)
 	Assert(nbytes > offsetof(shm_toc, toc_entry));
 	toc->toc_magic = magic;
 	SpinLockInit(&toc->toc_mutex);
-	toc->toc_total_bytes = nbytes;
+	toc->toc_total_bytes = BUFFERALIGN_DOWN(nbytes);
 	toc->toc_allocated_bytes = 0;
 	toc->toc_nentry = 0;

Don't think we require BUFFERALIGN - MAXALIGN ought to be
sufficient. The use of BUFFERALIGN presumably is to space out things
into different cachelines, but that doesn't really seem to be important
with this. Then we can just avoid defining the new macro...

@@ -252,7 +252,8 @@ shm_toc_lookup(shm_toc *toc, uint64 key, bool noError)
 Size
 shm_toc_estimate(shm_toc_estimator *e)
 {
-	return add_size(offsetof(shm_toc, toc_entry),
-					add_size(mul_size(e->number_of_keys, sizeof(shm_toc_entry)),
-							 e->space_for_chunks));
+	return BUFFERALIGN(
+		add_size(offsetof(shm_toc, toc_entry),
+				 add_size(mul_size(e->number_of_keys, sizeof(shm_toc_entry)),
+						  e->space_for_chunks)));
 }

I think splitting this into separate statements would be better. I think
we also should rather *ALIGN the size without e->space_for_chunks. The
latter is already aligned properly, and the important bit is that we
have enough space for e->space_for_chunks afterwards, and only padding
is in the way of that...
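
For concreteness, a sketch of the shape being suggested (untested,
assuming the usual shm_toc.c context): align the header-and-entries
portion, then add the chunk space, which shm_toc_estimate_chunk() has
already BUFFERALIGNed:

Size
shm_toc_estimate(shm_toc_estimator *e)
{
	Size		sz;

	sz = offsetof(shm_toc, toc_entry);
	sz = add_size(sz, mul_size(e->number_of_keys, sizeof(shm_toc_entry)));
	sz = BUFFERALIGN(sz);	/* pad so the chunk area starts aligned */
	sz = add_size(sz, e->space_for_chunks);		/* already aligned */
	return sz;
}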

Greetings,

Andres Freund


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#10)
Re: Atomics for heap_parallelscan_nextpage()

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/16/2017 08:10 PM, Andres Freund wrote:

Seems like we'd just have to add alignment of the total size to
shm_toc_estimate()?

Yeah, that's the gist of it.

I can confirm that on dromedary, that regression test case is attempting
to create a TOC with a not-well-aligned size: 93268 = 0x16c54 bytes.

regards, tom lane


#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: Atomics for heap_parallelscan_nextpage()

I wrote:

I can confirm that on dromedary, that regression test case is attempting
to create a TOC with a not-well-aligned size: 93268 = 0x16c54 bytes.

... although, on closer look, it still seems like we have a fundamental
bit of schizophrenia here, because on this machine

$ grep ALIGN pg_config.h
#define ALIGNOF_DOUBLE 4
#define ALIGNOF_INT 4
#define ALIGNOF_LONG 4
#define ALIGNOF_LONG_LONG_INT 4
#define ALIGNOF_SHORT 2
#define MAXIMUM_ALIGNOF 4

Basically, therefore, ISTM that it is not a good thing that the atomics
code thinks it can rely on 8-byte-aligned data when the entire rest of
the system believes that 4-byte alignment is enough for anything.

I was wondering why the shm_toc code was using BUFFERALIGN and not
MAXALIGN, and I now suspect that the answer is "it's an entirely
undocumented kluge to make the atomics code not crash on 32-bit
machines, so long as nobody puts a pg_atomic_uint64 anywhere except
in a shm_toc".

I'm not sure that that's good enough, and I'm damn sure that it
shouldn't be undocumented.

regards, tom lane


#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Atomics for heap_parallelscan_nextpage()

Andres Freund <andres@anarazel.de> writes:

Don't think we require BUFFERALIGN - MAXALIGN ought to be
sufficient.

Uh, see my other message just now.

The use of BUFFERALIGN presumably is to space out things
into different cachelines, but that doesn't really seem to be important
with this. Then we can just avoid defining the new macro...

I was feeling a bit uncomfortable with the BUFFERALIGN_DOWN() for a
different reason: if the caller has specified the exact amount of space it
needs, having shm_toc_create discard some could lead to an unexpected
failure. I wonder whether maybe shm_toc_create should just error out if
the number it's handed isn't aligned already.

+	return BUFFERALIGN(
+		add_size(offsetof(shm_toc, toc_entry),
+				 add_size(mul_size(e->number_of_keys, sizeof(shm_toc_entry)),
+						  e->space_for_chunks)));

I think splitting this into separate statements would be better.

+1, it was too complicated already.

regards, tom lane


#16 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#14)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 13:40:09 -0400, Tom Lane wrote:

I wrote:

I can confirm that on dromedary, that regression test case is attempting
to create a TOC with a not-well-aligned size: 93268 = 0x16c54 bytes.

... although, on closer look, it still seems like we have a fundamental
bit of schizophrenia here, because on this machine

$ grep ALIGN pg_config.h
#define ALIGNOF_DOUBLE 4
#define ALIGNOF_INT 4
#define ALIGNOF_LONG 4
#define ALIGNOF_LONG_LONG_INT 4
#define ALIGNOF_SHORT 2
#define MAXIMUM_ALIGNOF 4

Basically, therefore, ISTM that it is not a good thing that the atomics
code thinks it can rely on 8-byte-aligned data when the entire rest of
the system believes that 4-byte alignment is enough for anything.

That's a hardware requirement; we can't do much about it. Several
[micro-]architectures don't support unaligned atomic 8-byte writes.

I was wondering why the shm_toc code was using BUFFERALIGN and not
MAXALIGN, and I now suspect that the answer is "it's an entirely
undocumented kluge to make the atomics code not crash on 32-bit
machines, so long as nobody puts a pg_atomic_uint64 anywhere except in
a shm_toc".

I don't think there were any atomics in affected code until earlier
today... And given it didn't work for shm_toc anyway, I'm not quite
following.

I'm not sure that that's good enough, and I'm damn sure that it
shouldn't be undocumented.

8 byte alignment would be good enough, so BUFFERALIGN ought to be
sufficient. But it'd be nicer to have a separate more descriptive knob.

Greetings,

Andres Freund


#17 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: Atomics for heap_parallelscan_nextpage()

On Wed, Aug 16, 2017 at 1:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was wondering why the shm_toc code was using BUFFERALIGN and not
MAXALIGN, and I now suspect that the answer is "it's an entirely
undocumented kluge to make the atomics code not crash on 32-bit
machines, so long as nobody puts a pg_atomic_uint64 anywhere except
in a shm_toc".

Well, shm_toc considerably predates 64-bit atomics, so I think the
causality cannot run in that direction. shm_toc.c first appeared in
the tree in January of 2014. src/include/port/atomics didn't show up
until September of that year, and 64-bit atomics weren't actually
usable in practice until e8fdbd58fe564a29977f4331cd26f9697d76fc40 in
April of 2017.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 13:44:28 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Don't think we require BUFFERALIGN - MAXALIGN ought to be
sufficient.

Uh, see my other message just now.

Yup, you're right.

The use of BUFFERALIGN presumably is to space out things
into different cachelines, but that doesn't really seem to be important
with this. Then we can just avoid defining the new macro...

I was feeling a bit uncomfortable with the BUFFERALIGN_DOWN() for a
different reason: if the caller has specified the exact amount of space it
needs, having shm_toc_create discard some could lead to an unexpected
failure.

Well, that's why shm_toc_estimate() increases the size appropriately.

I wonder whether maybe shm_toc_create should just error out if the
number it's handed isn't aligned already.

I think that's going to be harder atm, because it's not the size shm_toc
computes; it's what the caller of shm_toc_estimate_chunk() provides.
And that size is already guaranteed to be upsized by BUFFERALIGN in
shm_toc_estimate_chunk(). It's just that the base offset from which the
allocations start isn't aligned.

Greetings,

Andres Freund


#19 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Atomics for heap_parallelscan_nextpage()

On Wed, Aug 16, 2017 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was feeling a bit uncomfortable with the BUFFERALIGN_DOWN() for a
different reason: if the caller has specified the exact amount of space it
needs, having shm_toc_create discard some could lead to an unexpected
failure.

Well, that's why Heikki also patched shm_toc_estimate. With the
patch, if the size being used in shm_toc_create comes from
shm_toc_estimate, it will always be aligned and nothing bad will
happen. If the user invents another size out of whole cloth, then
they might get a few bytes less than they expect, but that's what you
get for not using shm_toc_estimate().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#17)
Re: Atomics for heap_parallelscan_nextpage()

On August 16, 2017 10:47:23 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 16, 2017 at 1:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was wondering why the shm_toc code was using BUFFERALIGN and not
MAXALIGN, and I now suspect that the answer is "it's an entirely
undocumented kluge to make the atomics code not crash on 32-bit
machines, so long as nobody puts a pg_atomic_uint64 anywhere except
in a shm_toc".

Well, shm_toc considerably predates 64-bit atomics, so I think the
causality cannot run in that direction. shm_toc.c first appeared in
the tree in January of 2014. src/include/port/atomics didn't show up
until September of that year, and 64-bit atomics weren't actually
usable in practice until e8fdbd58fe564a29977f4331cd26f9697d76fc40 in
April of 2017.

Well, not for core code. I certainly know about production code using it, because crusty platforms are considered irrelevant...

Independent of that, a comment explaining what the BUFFERALIGN is intended to achieve would be good.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#19)
Re: Atomics for heap_parallelscan_nextpage()

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 16, 2017 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was feeling a bit uncomfortable with the BUFFERALIGN_DOWN() for a
different reason: if the caller has specified the exact amount of space it
needs, having shm_toc_create discard some could lead to an unexpected
failure.

Well, that's why Heikki also patched shm_toc_estimate. With the
patch, if the size being used in shm_toc_create comes from
shm_toc_estimate, it will always be aligned and nothing bad will
happen.

Sure. So the point is entirely about what should happen if someone
doesn't use shm_toc_estimate.

If the user invents another size out of whole cloth, then
they might get a few bytes less than they expect, but that's what you
get for not using shm_toc_estimate().

I don't buy that argument. A caller might think "Why do I need
shm_toc_estimate, when I can compute the *exact* size I need?".
And it would have worked, up till this proposed patch.

I think the new API spec for such cases ought to be "if you supply an
unaligned size, we'll error out", not "if you supply an unaligned size,
we'll silently lose some of it". The former is going to behave a lot
more predictably.

regards, tom lane


#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#16)
Re: Atomics for heap_parallelscan_nextpage()

Andres Freund <andres@anarazel.de> writes:

On 2017-08-16 13:40:09 -0400, Tom Lane wrote:

I was wondering why the shm_toc code was using BUFFERALIGN and not
MAXALIGN, and I now suspect that the answer is "it's an entirely
undocumented kluge to make the atomics code not crash on 32-bit
machines, so long as nobody puts a pg_atomic_uint64 anywhere except in
a shm_toc".

I don't think there were any atomics in affected code until earlier
today... And given it didn't work for shm_toc anyway, I'm not quite
following.

Right, Robert pointed out that it's pre-existing code. My point should
be read as "it's just blind luck that shm_toc is using bigger than
MAXALIGN alignment, or this would never work on 32-bit machines".

I'm not sure that that's good enough, and I'm damn sure that it
shouldn't be undocumented.

8 byte alignment would be good enough, so BUFFERALIGN ought to be
sufficient. But it'd be nicer to have a separate more descriptive knob.

What I meant by possibly not good enough is that pg_atomic_uint64 used
in other places isn't going to be very safe.

We might be effectively all right as long as we have a coding rule that
pg_atomic_uint64 can only be placed in memory handed out by ShmemAlloc
or shm_toc_allocate, which both have bigger-than-MAXALIGN alignment
practices. But this needs to be documented.

regards, tom lane


#23 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#21)
Re: Atomics for heap_parallelscan_nextpage()

On 08/16/2017 09:00 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 16, 2017 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was feeling a bit uncomfortable with the BUFFERALIGN_DOWN() for a
different reason: if the caller has specified the exact amount of space it
needs, having shm_toc_create discard some could lead to an unexpected
failure.

Well, that's why Heikki also patched shm_toc_estimate. With the
patch, if the size being used in shm_toc_create comes from
shm_toc_estimate, it will always be aligned and nothing bad will
happen.

Sure. So the point is entirely about what should happen if someone
doesn't use shm_toc_estimate.

If the user invents another size out of whole cloth, then
they might get a few bytes less than they expect, but that's what you
get for not using shm_toc_estimate().

I don't buy that argument. A caller might think "Why do I need
shm_toc_estimate, when I can compute the *exact* size I need?".
And it would have worked, up till this proposed patch.

Well, no. The size of the shm_toc struct is subtracted from the size
that you give to shm_toc_create. As well as the sizes of the TOC
entries. And those sizes are private to shm_toc.c, so a caller has no
way to know what size it should pass to shm_toc_create(), in order to
have N bytes of space actually usable. You really need to use
shm_toc_estimate() if you want any guarantees on what will fit.
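
For reference, the intended pattern looks something like this sketch
(the snapshot variable is assumed in scope for illustration; the real
callers go through the ParallelContext machinery):

shm_toc_estimator estimator;
Size		pscan_len = heap_parallelscan_estimate(snapshot);
Size		segsize;

shm_toc_initialize_estimator(&estimator);
shm_toc_estimate_chunk(&estimator, pscan_len);
shm_toc_estimate_keys(&estimator, 1);
segsize = shm_toc_estimate(&estimator);		/* header + entries + chunks */

The segsize computed this way is what eventually reaches
shm_toc_create(), so with the fix it is guaranteed to be well-aligned.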

I've pushed the patch to fix this, with some extra comments and some
reformatting of shm_toc_estimate.

8 byte alignment would be good enough, so BUFFERALIGN ought to be
sufficient. But it'd be nicer to have a separate more descriptive knob.

What I meant by possibly not good enough is that pg_atomic_uint64 used
in other places isn't going to be very safe.

We might be effectively all right as long as we have a coding rule that
pg_atomic_uint64 can only be placed in memory handed out by ShmemAlloc
or shm_toc_allocate, which both have bigger-than-MAXALIGN alignment
practices. But this needs to be documented.

Yeah. We are implicitly also relying on ShmemAlloc() to return
sufficiently-aligned memory. Palloc() too, although you probably
wouldn't use atomic ops on a palloc'd struct. I think we should
introduce a new ALIGNOF macro for that, and document that those
functions return memory with enough alignment. GENUINEMAX_ALIGNOF?
MAXSTRUCT_ALIGNOF?

- Heikki


#24 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 14:09:08 -0400, Tom Lane wrote:

I'm not sure that that's good enough, and I'm damn sure that it
shouldn't be undocumented.

8 byte alignment would be good enough, so BUFFERALIGN ought to be
sufficient. But it'd be nicer to have a separate more descriptive knob.

What I meant by possibly not good enough is that pg_atomic_uint64 used
in other places isn't going to be very safe.

Well, it's not used otherwise in core so far, leaving test code
aside. It's correctly aligned if it's part of an aligned struct; the
atomics code itself can't really do anything about it if the struct
itself isn't aligned.

We might be effectively all right as long as we have a coding rule that
pg_atomic_uint64 can only be placed in memory handed out by ShmemAlloc
or shm_toc_allocate, which both have bigger-than-MAXALIGN alignment
practices. But this needs to be documented.

Well, one could argue the alignment checks in every function are that
:). But yea, we probably should mention it more than that.

Greetings,

Andres Freund


#25 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#3)
Re: Atomics for heap_parallelscan_nextpage()

On 2017-08-16 16:20:28 +0300, Heikki Linnakangas wrote:

On 05/06/2017 04:57 PM, David Rowley wrote:

Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.

To keep things a bit more simple and streamlined, I ended up pulling
out the logic for setting the startblock into another function, which
we call only once, before the first call to
heap_parallelscan_nextpage(). I also ended up replacing phs_cblock
with a counter that always starts at zero. The actual block is
calculated based on that plus the startblock, modulo nblocks.
This makes it a good bit simpler to detect when we've allocated all
the blocks to the workers, and also works nicely for wrapping back to
the start of the relation when we started somewhere in the middle due
to piggybacking on a synchronous scan.

Looks reasonable. I edited the comments and the variable names a bit, to my
liking, and committed. Thanks!

Brief post-commit review:

+	 * phs_nallocated tracks how many pages have been allocated to workers
+	 * already.  When phs_nallocated >= rs_nblocks, all blocks have been
+	 * allocated.

allocated seems a bit of a confusing terminology.

@@ -1635,8 +1637,8 @@ heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
 		!RelationUsesLocalBuffers(relation) &&
 		target->phs_nblocks > NBuffers / 4;
 	SpinLockInit(&target->phs_mutex);
-	target->phs_cblock = InvalidBlockNumber;
 	target->phs_startblock = InvalidBlockNumber;
+	pg_atomic_write_u64(&target->phs_nallocated, 0);
 	SerializeSnapshot(snapshot, target->phs_snapshot_data);
 }

It's not ok to initialize an atomic with pg_atomic_write_u64; that works
well enough for "plain" atomics, but the fallback implementation isn't
ok with it. You're probably going to get a failure on the respective
buildfarm animal soon.
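
The fix should just be to swap in the init function at that spot;
pg_atomic_init_u64() also sets up the spinlock that the fallback
implementation uses to emulate 64-bit atomics:

-	pg_atomic_write_u64(&target->phs_nallocated, 0);
+	pg_atomic_init_u64(&target->phs_nallocated, 0);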

- Andres


#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#23)
Re: Atomics for heap_parallelscan_nextpage()

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/16/2017 09:00 PM, Tom Lane wrote:

I don't buy that argument. A caller might think "Why do I need
shm_toc_estimate, when I can compute the *exact* size I need?".
And it would have worked, up till this proposed patch.

Well, no. The size of the shm_toc struct is subtracted from the size
that you give to shm_toc_create. As well as the sizes of the TOC
entries. And those sizes are private to shm_toc.c, so a caller has no
way to know what size it should pass to shm_toc_create(), in order to
have N bytes of space actually usable. You really need to use
shm_toc_estimate() if you want any guarantees on what will fit.

Good point --- objection withdrawn.

regards, tom lane


#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#25)
Re: Atomics for heap_parallelscan_nextpage()

Andres Freund <andres@anarazel.de> writes:

On 2017-08-16 16:20:28 +0300, Heikki Linnakangas wrote:
+ pg_atomic_write_u64(&target->phs_nallocated, 0);

It's not ok to initialize an atomic with pg_atomic_write_u64 - works
well enough for "plain" atomics, but the fallback implementation isn't
ok with it. You're probably going to get a failure on the respective
buildfarm animal soon.

Indeed, gaur fails with

2017-08-16 17:09:38.315 EDT [13043:11] PANIC: stuck spinlock detected at pg_atomic_compare_exchange_u64_impl, atomics.c:196
2017-08-16 17:09:38.315 EDT [13043:12] STATEMENT: select count(*) from a_star;
2017-08-16 17:09:40.350 EDT [12437:3] LOG: server process (PID 13043) was terminated by signal 6
2017-08-16 17:09:40.350 EDT [12437:4] DETAIL: Failed process was running: select count(*) from a_star;

and I'm sure pademelon will fail once it gets to that. I thought we
had other buildfarm animals testing the fallback path, though?

regards, tom lane


#28 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#27)
Re: Atomics for heap_parallelscan_nextpage()

On 08/17/2017 12:20 AM, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2017-08-16 16:20:28 +0300, Heikki Linnakangas wrote:
+ pg_atomic_write_u64(&target->phs_nallocated, 0);

It's not ok to initialize an atomic with pg_atomic_write_u64; that works
well enough for "plain" atomics, but the fallback implementation isn't
ok with it. You're probably going to get a failure on the respective
buildfarm animal soon.

Indeed, gaur fails with

2017-08-16 17:09:38.315 EDT [13043:11] PANIC: stuck spinlock detected at pg_atomic_compare_exchange_u64_impl, atomics.c:196
2017-08-16 17:09:38.315 EDT [13043:12] STATEMENT: select count(*) from a_star;
2017-08-16 17:09:40.350 EDT [12437:3] LOG: server process (PID 13043) was terminated by signal 6
2017-08-16 17:09:40.350 EDT [12437:4] DETAIL: Failed process was running: select count(*) from a_star;

and I'm sure pademelon will fail once it gets to that.

Fixed.

I thought we had other buildfarm animals testing the fallback path,
though?

Yes, at least piculet is building with --disable-atomics.

I was able to reproduce this locally, with --disable-atomics, but only
after hacking it to fill the struct with garbage, before initializing
it. IOW, it failed to fail, because the spinlock happened to be
initialized correctly by accident. Perhaps that's happening on piculet, too.
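
For the record, the hack was along these lines (a sketch, not the exact
diff used): poison the struct in ExecSeqScanInitializeDSM() before it
gets initialized, so an uninitialized emulation spinlock no longer
happens to hold zeroes:

	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
	memset(pscan, 0x7f, node->pscan_len);	/* fill with junk first */
	heap_parallelscan_initialize(pscan,
								 node->ss.ss_currentRelation,
								 estate->es_snapshot);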

- Heikki


#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#28)
Re: Atomics for heap_parallelscan_nextpage()

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/17/2017 12:20 AM, Tom Lane wrote:

Indeed, gaur fails with
2017-08-16 17:09:38.315 EDT [13043:11] PANIC: stuck spinlock detected at pg_atomic_compare_exchange_u64_impl, atomics.c:196

I was able to reproduce this locally, with --disable-atomics, but only
after hacking it to fill the struct with garbage, before initializing
it. IOW, it failed to fail, because the spinlock happened to be
initialized correctly by accident. Perhaps that's happening on piculet, too.

Oh, right. HPPA is unique among our platforms, I think, in that the
"unlocked" state of a spinlock is not "all zeroes". So if you're dealing
with pre-zeroed memory, which shmem generally would be, failing to
initialize a spinlock does not cause visible errors except on HPPA.

I wonder whether it's sensible to have --enable-cassert have the effect
of filling memory allocated by ShmemAlloc or the DSA code with junk (as
palloc does) instead of leaving it at zeroes. It's not modeling the
same kind of effect, since we have no shmem-freeing primitives, but
it might be useful for this sort of thing.

regards, tom lane


#30 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#29)
Re: Atomics for heap_parallelscan_nextpage()

On August 16, 2017 3:09:27 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/17/2017 12:20 AM, Tom Lane wrote:

Indeed, gaur fails with
2017-08-16 17:09:38.315 EDT [13043:11] PANIC: stuck spinlock detected at pg_atomic_compare_exchange_u64_impl, atomics.c:196

I was able to reproduce this locally, with --disable-atomics, but only
after hacking it to fill the struct with garbage, before initializing
it. IOW, it failed to fail, because the spinlock happened to be
initialized correctly by accident. Perhaps that's happening on
piculet, too.

Oh, right. HPPA is unique among our platforms, I think, in that the
"unlocked" state of a spinlock is not "all zeroes". So if you're dealing
with pre-zeroed memory, which shmem generally would be, failing to
initialize a spinlock does not cause visible errors except on HPPA.

I wonder whether it's sensible to have --enable-cassert have the effect
of filling memory allocated by ShmemAlloc or the DSA code with junk (as
palloc does) instead of leaving it at zeroes. It's not modeling the
same kind of effect, since we have no shmem-freeing primitives, but
it might be useful for this sort of thing.

We kind of do - crash restarts... So yes, that's probably a good idea.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#30)
Re: Atomics for heap_parallelscan_nextpage()

Andres Freund <andres@anarazel.de> writes:

On August 16, 2017 3:09:27 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's sensible to have --enable-cassert have the effect
of filling memory allocated by ShmemAlloc or the DSA code with junk (as
palloc does) instead of leaving it at zeroes. It's not modeling the
same kind of effect, since we have no shmem-freeing primitives, but
it might be useful for this sort of thing.

We kind of do - crash restarts... So yes, that's probably a good idea.

Crash restart releases the shmem segment and acquires a new one,
doesn't it? Or am I misremembering? I thought that it did do so,
if only to make darn sure that no old processes remain connected
to shmem.

regards, tom lane


#32 David Rowley
david.rowley@2ndquadrant.com
In reply to: Heikki Linnakangas (#3)
Re: Atomics for heap_parallelscan_nextpage()

On 17 August 2017 at 01:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Looks reasonable. I edited the comments and the variable names a bit, to my
liking, and committed. Thanks!

Thanks for committing. I've just been catching up on all that went on
while I was sleeping. Thanks for handling the cleanup too.

I'll feel better once pademelon goes green again. From looking at the
latest failure on it, it appears that your swapping of
pg_atomic_write_u64 for pg_atomic_init_u64 should fix this. My
apologies for that mistake.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
