pgsql: Compute XID horizon for page level index vacuum on primary.
Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to by the index looks like.
That was easy enough before the introduction of tableam (we knew it had to be
heap, although some trickery around logging the heap relfilenodes
was required). But to properly handle table AMs we need
per-database catalog access to look up the AM handler, which
recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
decoding on standbys. When on a catalog table, we need to be able
to conflict with slots that have an xid horizon that's too old. But
computing the horizon by visiting the heap only works once
consistency is reached, whereas we always need to be able to detect
conflicts.
There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there are no connected backends).
Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.
To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers. As this is more of a
maintenance task, rather than something routinely done in every
read-only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency even when it's not globally enabled. That's
possibly not the perfect formula, but seems good enough for now.
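To make that access pattern concrete, here is a minimal standalone sketch of
the idea described above - sort the item pointers by block number, touch each
heap block only once, and keep a prefetch window a fixed distance ahead. This
is not the committed heapam.c code; the my_tid type and the prefetch_block()
and read_block() helpers are invented stand-ins for illustration only.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	unsigned int   block;	/* heap block number */
	unsigned short offset;	/* line pointer offset within the block */
} my_tid;

static int
tid_cmp(const void *a, const void *b)
{
	const my_tid *ta = a;
	const my_tid *tb = b;

	if (ta->block != tb->block)
		return ta->block < tb->block ? -1 : 1;
	if (ta->offset != tb->offset)
		return ta->offset < tb->offset ? -1 : 1;
	return 0;
}

static void prefetch_block(unsigned int blk) { printf("prefetch block %u\n", blk); }
static void read_block(unsigned int blk)     { printf("read block %u\n", blk); }

int
main(void)
{
	my_tid	tids[] = {{7, 2}, {3, 1}, {7, 5}, {3, 4}, {9, 1}};
	int		ntids = (int) (sizeof(tids) / sizeof(tids[0]));
	int		prefetch_distance = 10;	/* stands in for io_concurrency + 10, capped */
	int		i;
	int		pf = 0;

	qsort(tids, ntids, sizeof(my_tid), tid_cmp);

	for (i = 0; i < ntids; i++)
	{
		/* keep the prefetch window prefetch_distance entries ahead of i */
		while (pf < ntids && pf < i + prefetch_distance)
		{
			if (pf == 0 || tids[pf].block != tids[pf - 1].block)
				prefetch_block(tids[pf].block);
			pf++;
		}

		/* only touch the buffer when the block number changes */
		if (i == 0 || tids[i].block != tids[i - 1].block)
			read_block(tids[i].block);

		/* ... examine tids[i] here and fold its xmax into the horizon ... */
	}
	return 0;
}

Sorting first means duplicate blocks collapse into a single read, which is the
main saving; the prefetch window only helps on top of that when the storage can
actually service requests concurrently.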
Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.
Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
/messages/by-id/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
/messages/by-id/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
/messages/by-id/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/558a9165e081d1936573e5a7d576f5febd7fb55a
Modified Files
--------------
src/backend/access/hash/hash_xlog.c | 153 +--------------------
src/backend/access/hash/hashinsert.c | 17 ++-
src/backend/access/heap/heapam.c | 221 +++++++++++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 1 +
src/backend/access/index/genam.c | 37 ++++++
src/backend/access/nbtree/nbtpage.c | 8 +-
src/backend/access/nbtree/nbtxlog.c | 156 +---------------------
src/backend/access/rmgrdesc/hashdesc.c | 5 +-
src/backend/access/rmgrdesc/nbtdesc.c | 3 +-
src/include/access/genam.h | 5 +
src/include/access/hash_xlog.h | 2 +-
src/include/access/heapam.h | 4 +
src/include/access/nbtxlog.h | 3 +-
src/include/access/tableam.h | 19 +++
src/include/access/xlog_internal.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 316 insertions(+), 321 deletions(-)
On Wed, Mar 27, 2019 at 1:06 PM Andres Freund <andres@anarazel.de> wrote:
Compute XID horizon for page level index vacuum on primary.
Hi Andres,
I have a virtual machine running FreeBSD 12.0 on i386 on which
contrib/test_decoding consistently self-deadlocks in the "rewrite"
test, with the stack listed below. You can see that we wait for a
share lock that we already hold exclusively. Peter Geoghegan spotted
the problem: this code path shouldn't access syscache, or at least not
for a catalog table. He suggested something along these lines:
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6977,7 +6977,10 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
 	 * simplistic, but at the moment there is no evidence of that or any idea
 	 * about what would work better.
 	 */
-	io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
+	if (IsCatalogRelation(rel))
+		io_concurrency = 1;
+	else
+		io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
 	prefetch_distance = Min((io_concurrency) + 10, MAX_IO_CONCURRENCY);

 	/* Start prefetching. */
Indeed that seems to fix the problem for me.
* frame #0: 0x28c04ca1 libc.so.7`__sys__umtx_op + 5
frame #1: 0x28bed0ab libc.so.7`sem_clockwait_np + 283
frame #2: 0x28bed1ae libc.so.7`sem_wait + 62
frame #3: 0x0858d837 postgres`PGSemaphoreLock(sema=0x290141a8) at
pg_sema.c:316
frame #4: 0x08678b94 postgres`LWLockAcquire(lock=0x295365a4,
mode=LW_SHARED) at lwlock.c:1244
frame #5: 0x08639d8f postgres`LockBuffer(buffer=129, mode=1) at
bufmgr.c:3565
frame #6: 0x08187f7d postgres`_bt_getbuf(rel=0x31c3e95c, blkno=1,
access=1) at nbtpage.c:806
frame #7: 0x081887f9 postgres`_bt_getroot(rel=0x31c3e95c,
access=1) at nbtpage.c:323
frame #8: 0x081932aa postgres`_bt_search(rel=0x31c3e95c,
key=0xffbfad00, bufP=0xffbfb31c, access=1, snapshot=0x08b27c58) at
nbtsearch.c:99
frame #9: 0x08195bfc postgres`_bt_first(scan=0x31db73a0,
dir=ForwardScanDirection) at nbtsearch.c:1246
frame #10: 0x08190f96 postgres`btgettuple(scan=0x31db73a0,
dir=ForwardScanDirection) at nbtree.c:245
frame #11: 0x0817d3fa postgres`index_getnext_tid(scan=0x31db73a0,
direction=ForwardScanDirection) at indexam.c:550
frame #12: 0x0817d6a8 postgres`index_getnext_slot(scan=0x31db73a0,
direction=ForwardScanDirection, slot=0x31df2320) at indexam.c:642
frame #13: 0x0817b4c9
postgres`systable_getnext(sysscan=0x31df242c) at genam.c:450
frame #14: 0x0887e3a3 postgres`ScanPgRelation(targetRelId=1213,
indexOK=true, force_non_historic=false) at relcache.c:365
frame #15: 0x088742e1 postgres`RelationBuildDesc(targetRelId=1213,
insertIt=true) at relcache.c:1055
frame #16: 0x0887356a
postgres`RelationIdGetRelation(relationId=1213) at relcache.c:2030
frame #17: 0x080d7ac5 postgres`relation_open(relationId=1213,
lockmode=1) at relation.c:59
frame #18: 0x081cc2b6 postgres`table_open(relationId=1213,
lockmode=1) at table.c:43
frame #19: 0x0886597b
postgres`SearchCatCacheMiss(cache=0x31c2b200, nkeys=1,
hashValue=1761185739, hashIndex=3, v1=1663, v2=0, v3=0, v4=0) at
catcache.c:1357
frame #20: 0x088622db
postgres`SearchCatCacheInternal(cache=0x31c2b200, nkeys=1, v1=1663,
v2=0, v3=0, v4=0) at catcache.c:1299
frame #21: 0x08862354 postgres`SearchCatCache1(cache=0x31c2b200,
v1=1663) at catcache.c:1167
frame #22: 0x0888406a postgres`SearchSysCache1(cacheId=61,
key1=1663) at syscache.c:1119
frame #23: 0x088834de postgres`get_tablespace(spcid=1663) at spccache.c:136
frame #24: 0x08883617
postgres`get_tablespace_io_concurrency(spcid=0) at spccache.c:217
frame #25: 0x08155a82
postgres`heap_compute_xid_horizon_for_tuples(rel=0x31cbee40,
tids=0x31df146c, nitems=3) at heapam.c:6980
frame #26: 0x0817b09d
postgres`table_compute_xid_horizon_for_tuples(rel=0x31cbee40,
items=0x31df146c, nitems=3) at tableam.h:708
frame #27: 0x0817b03a
postgres`index_compute_xid_horizon_for_tuples(irel=0x31c3e95c,
hrel=0x31cbee40, ibuf=129, itemnos=0xffbfbb8c, nitems=3) at
genam.c:306
frame #28: 0x0818ae92 postgres`_bt_delitems_delete(rel=0x31c3e95c,
buf=129, itemnos=0xffbfbb8c, nitems=3, heapRel=0x31cbee40) at
nbtpage.c:1111
frame #29: 0x0818405b postgres`_bt_vacuum_one_page(rel=0x31c3e95c,
buffer=129, heapRel=0x31cbee40) at nbtinsert.c:2270
frame #30: 0x08180a4f postgres`_bt_findinsertloc(rel=0x31c3e95c,
insertstate=0xffbfcce0, checkingunique=true, stack=0x00000000,
heapRel=0x31cbee40) at nbtinsert.c:736
frame #31: 0x0817f40c postgres`_bt_doinsert(rel=0x31c3e95c,
itup=0x31db69f4, checkUnique=UNIQUE_CHECK_YES, heapRel=0x31cbee40) at
nbtinsert.c:281
frame #32: 0x08190416 postgres`btinsert(rel=0x31c3e95c,
values=0xffbfce54, isnull=0xffbfce34, ht_ctid=0x31db42e4,
heapRel=0x31cbee40, checkUnique=UNIQUE_CHECK_YES,
indexInfo=0x31db67dc) at nbtree.c:203
frame #33: 0x0817c173
postgres`index_insert(indexRelation=0x31c3e95c, values=0xffbfce54,
isnull=0xffbfce34, heap_t_ctid=0x31db42e4, heapRelation=0x31cbee40,
checkUnique=UNIQUE_CHECK_YES, indexInfo=0x31db67dc) at indexam.c:212
frame #34: 0x0823cca4
postgres`CatalogIndexInsert(indstate=0x31db9228, heapTuple=0x31db42e0)
at indexing.c:140
frame #35: 0x0823cd72
postgres`CatalogTupleUpdate(heapRel=0x31cbee40, otid=0x31db42e4,
tup=0x31db42e0) at indexing.c:215
frame #36: 0x088768ed
postgres`RelationSetNewRelfilenode(relation=0x31c2fb38,
persistence='p', freezeXid=0, minmulti=0) at relcache.c:3508
frame #37: 0x0823b5df postgres`reindex_index(indexId=2672,
skip_constraint_checks=true, persistence='p', options=0) at
index.c:3700
frame #38: 0x0823bf00 postgres`reindex_relation(relid=1262,
flags=18, options=0) at index.c:3946
frame #39: 0x08320063 postgres`finish_heap_swap(OIDOldHeap=1262,
OIDNewHeap=16580, is_system_catalog=true, swap_toast_by_content=true,
check_constraints=false, is_internal=true, frozenXid=673,
cutoffMulti=1, newrelpersistence='p') at cluster.c:1673
frame #40: 0x0831f5a3
postgres`rebuild_relation(OldHeap=0x31c2ff68, indexOid=0,
verbose=false) at cluster.c:629
frame #41: 0x0831eecd postgres`cluster_rel(tableOid=1262,
indexOid=0, options=0) at cluster.c:435
frame #42: 0x083f7c1d postgres`vacuum_rel(relid=1262,
relation=0x28b4b9dc, params=0xffbfd670) at vacuum.c:1743
frame #43: 0x083f6f87 postgres`vacuum(relations=0x31d6c1cc,
params=0xffbfd670, bstrategy=0x31d6c090, isTopLevel=true) at
vacuum.c:372
frame #44: 0x083f6837 postgres`ExecVacuum(pstate=0x31ca2c90,
vacstmt=0x28b4ba54, isTopLevel=true) at vacuum.c:175
frame #45: 0x0869f145
postgres`standard_ProcessUtility(pstmt=0x28b4bb18, queryString="VACUUM
FULL pg_database;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x00000000, queryEnv=0x00000000, dest=0x28b4bc90,
completionTag="") at utility.c:670
frame #46: 0x0869e68e postgres`ProcessUtility(pstmt=0x28b4bb18,
queryString="VACUUM FULL pg_database;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x00000000,
queryEnv=0x00000000, dest=0x28b4bc90, completionTag="") at
utility.c:360
frame #47: 0x0869ddfb postgres`PortalRunUtility(portal=0x31bee090,
pstmt=0x28b4bb18, isTopLevel=true, setHoldSnapshot=false,
dest=0x28b4bc90, completionTag="") at pquery.c:1175
frame #48: 0x0869ce02 postgres`PortalRunMulti(portal=0x31bee090,
isTopLevel=true, setHoldSnapshot=false, dest=0x28b4bc90,
altdest=0x28b4bc90, completionTag="") at pquery.c:1321
frame #49: 0x0869c363 postgres`PortalRun(portal=0x31bee090,
count=2147483647, isTopLevel=true, run_once=true, dest=0x28b4bc90,
altdest=0x28b4bc90, completionTag="") at pquery.c:796
frame #50: 0x08696e68
postgres`exec_simple_query(query_string="VACUUM FULL pg_database;") at
postgres.c:1215
frame #51: 0x08695eec postgres`PostgresMain(argc=1,
argv=0x31be6658, dbname="contrib_regression", username="munro") at
postgres.c:4247
frame #52: 0x085aedb0 postgres`BackendRun(port=0x31be1000) at
postmaster.c:4399
frame #53: 0x085adf9a postgres`BackendStartup(port=0x31be1000) at
postmaster.c:4090
frame #54: 0x085accd5 postgres`ServerLoop at postmaster.c:1703
frame #55: 0x085a9d95 postgres`PostmasterMain(argc=8,
argv=0xffbfe608) at postmaster.c:1376
frame #56: 0x0849ec92 postgres`main(argc=8, argv=0xffbfe608) at main.c:228
frame #57: 0x080bf5eb postgres`_start1(cleanup=0x28b1e540, argc=8,
argv=0xffbfe608) at crt1_c.c:73
frame #58: 0x080bf4b8 postgres`_start at crt1_s.S:49
(lldb) print num_held_lwlocks
(int) $0 = 1
(lldb) print held_lwlocks[0]
(LWLockHandle) $1 = {
lock = 0x295365a4
mode = LW_EXCLUSIVE
}
--
Thomas Munro
https://enterprisedb.com
On 2019-03-28 17:34:52 +1300, Thomas Munro wrote:
On Wed, Mar 27, 2019 at 1:06 PM Andres Freund <andres@anarazel.de> wrote:
Compute XID horizon for page level index vacuum on primary.
Hi Andres,
I have a virtual machine running FreeBSD 12.0 on i386 on which
contrib/test_decoding consistently self-deadlocks in the "rewrite"
test, with the stack listed below. You can see that we wait for a
share lock that we already hold exclusively. Peter Geoghegan spotted
the problem: this code path shouldn't access syscache, or at least not
for a catalog table. He suggested something along these lines:

--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6977,7 +6977,10 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
 	 * simplistic, but at the moment there is no evidence of that or any idea
 	 * about what would work better.
 	 */
-	io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
+	if (IsCatalogRelation(rel))
+		io_concurrency = 1;
+	else
+		io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
 	prefetch_distance = Min((io_concurrency) + 10, MAX_IO_CONCURRENCY);
Hm, good catch. I don't like this fix very much (even if it were
commented), but I don't have a great idea right now. I'm mildly
inclined to take effective_io_concurrency into account even if we can't
use get_tablespace_io_concurrency - that should be doable from a
locking POV.
Do you want to apply the above?
- Andres
On Thu, Mar 28, 2019 at 5:28 AM Andres Freund <andres@anarazel.de> wrote:
Hm, good catch. I don't like this fix very much (even if it were
commented), but I don't have a great idea right now.
That was just a POC, to verify the problem. Not a proposal.
--
Peter Geoghegan
Hi,
On March 28, 2019 11:24:46 AM EDT, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Mar 28, 2019 at 5:28 AM Andres Freund <andres@anarazel.de> wrote:
Hm, good catch. I don't like this fix very much (even if it were
commented), but I don't have a great idea right now.
That was just a POC, to verify the problem. Not a proposal.
I'm mildly inclined to push a commented version of this. And add an open items entry. Alternatively I'm thinking of just but taking the tablespace setting into account.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Mar 28, 2019 at 8:27 AM Andres Freund <andres@anarazel.de> wrote:
I'm mildly inclined to push a commented version of this. And add an open items entry. Alternatively I'm thinking of just but taking the tablespace setting into account.
I would just hard code something if there needs to be a temporary band-aid fix.
--
Peter Geoghegan
On Thu, Mar 28, 2019 at 8:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
I would just hard code something if there needs to be a temporary band-aid fix.
I also suggest that you remove the #include for heapam_xlog.h from
both nbtxlog.c and hash_xlog.c.
--
Peter Geoghegan
On Wed, 27 Mar 2019 at 00:06, Andres Freund <andres@anarazel.de> wrote:
Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay.
This commit message was quite confusing. It took me a while to realize this
relates to btree index deletes and that what you mean is that we are
calculating the latestRemovedXid for index entries. That is related to but
not the same thing as the horizon itself. So now I understand the "was computed
only during WAL replay" since it seemed obvious that the xmin horizon was
calculated regularly on the master, but as you say the latestRemovedXid
was not.
Now I understand, I'm happy that you've moved this from redo into mainline.
And you've optimized it, which is also important (since performance was the
original objection and why it was placed in redo). I can see you've removed
duplicate code in hash indexes as well, which is good.
The term "xid horizon" was only used once in the code in PG11. That usage
appears to be a typo, since in many other places the term "xmin horizon" is
used to mean the point at which we can finally mark tuples as dead. Now we
have some new, undocumented APIs that use the term "xid horizon" yet still
code that refers to "xmin horizon", with neither term being defined. I'm
hoping you'll do some later cleanup of that to avoid confusion.
While trying to understand this, I see there is an even better way to
optimize this. Since we are removing dead index tuples, we could alter the
killed index tuple interface so that it returns the xmax of the tuple being
marked as killed, rather than just a boolean to say it is dead. Indexes can
then mark the killed tuples with the xmax that killed them rather than just
a hint bit. This is possible since the index tuples are dead and cannot be
used to follow the htid to the heap, so the htid is redundant and so the
block number of the tid could be overwritten with the xmax, zeroing the
itemid. Each killed item we mark with its xmax means one less heap fetch we
need to perform when we delete the page - it's possible we optimize that
away completely by doing this.
Since this point of the code is clearly going to be a performance issue it
seems like something we should do now.
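For readers following along, here is a small standalone sketch of the layout
this idea would reuse, with simplified stand-in structs rather than the real
PostgreSQL definitions; the sizes it prints match the pahole output shown
later in this thread.

#include <stdint.h>
#include <stdio.h>

/* simplified stand-ins for ItemPointerData and IndexTupleData */
typedef struct
{
	uint16_t	bi_hi;		/* high half of the 4-byte block number */
	uint16_t	bi_lo;		/* low half of the 4-byte block number */
	uint16_t	ip_posid;	/* offset number within the block */
} item_pointer;				/* 6 bytes */

typedef struct
{
	item_pointer t_tid;		/* heap tid this index entry points at */
	uint16_t	 t_info;	/* flags plus tuple length */
} index_tuple;				/* 8 bytes */

int
main(void)
{
	printf("sizeof(item_pointer) = %zu\n", sizeof(item_pointer));
	printf("sizeof(index_tuple)  = %zu\n", sizeof(index_tuple));

	/*
	 * The proposal above: once an entry is known dead, the 4 block-number
	 * bytes could be overwritten with the 4-byte xid that killed it.  The
	 * rest of the thread debates whether that, plus the flag saying "this
	 * now holds an xid", can be kept consistent without WAL logging.
	 */
	return 0;
}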
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-03-29 09:37:11 +0000, Simon Riggs wrote:
This commit message was quite confusing. It took me a while to realize this
relates to btree index deletes and that what you mean is that we are
calculating the latestRemovedXid for index entries. That is related to but
not the same thing as the horizon itself.
Well, it's the page level horizon...
While trying to understand this, I see there is an even better way to
optimize this. Since we are removing dead index tuples, we could alter the
killed index tuple interface so that it returns the xmax of the tuple being
marked as killed, rather than just a boolean to say it is dead.
Wouldn't that quite possibly result in additional and unnecessary
conflicts? Right now the page level horizon is computed whenever the
page is actually reused, rather than when an item is marked as
deleted. As it stands right now, the computed horizons are commonly very
"old", because of that delay, leading to lower rates of conflicts.
Indexes can then mark the killed tuples with the xmax that killed them
rather than just a hint bit. This is possible since the index tuples
are dead and cannot be used to follow the htid to the heap, so the
htid is redundant and so the block number of the tid could be
overwritten with the xmax, zeroing the itemid. Each killed item we
mark with its xmax means one less heap fetch we need to perform when
we delete the page - it's possible we optimize that away completely by
doing this.
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
And I'm also doubtful it's worth it because:
Since this point of the code is clearly going to be a performance issue it
seems like something we should do now.
I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching I was
unable to do so. What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away. I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.
Greetings,
Andres Freund
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 09:37:11 +0000, Simon Riggs wrote:
While trying to understand this, I see there is an even better way to
optimize this. Since we are removing dead index tuples, we could alter the
killed index tuple interface so that it returns the xmax of the tuple being
marked as killed, rather than just a boolean to say it is dead.
Wouldn't that quite possibly result in additional and unnecessary
conflicts? Right now the page level horizon is computed whenever the
page is actually reused, rather than when an item is marked as
deleted. As it stands right now, the computed horizons are commonly very
"old", because of that delay, leading to lower rates of conflicts.
I wasn't suggesting we change when the horizon is calculated, so no change
there.
The idea was to cache the data for later use, replacing the hint bit with a
hint xid.
That won't change the rate of conflicts, up or down - but it does avoid I/O.
Indexes can then mark the killed tuples with the xmax that killed them
rather than just a hint bit. This is possible since the index tuples
are dead and cannot be used to follow the htid to the heap, so the
htid is redundant and so the block number of the tid could be
overwritten with the xmax, zeroing the itemid. Each killed item we
mark with its xmax means one less heap fetch we need to perform when
we delete the page - it's possible we optimize that away completely by
doing this.
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
Don't see that.
I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData,
which is the first field of an index tuple. Index tuples are MAXALIGNed, so
I can't see how that would ever cross a page boundary.
And I'm also doubtful it's worth it because:
Since this point of the code is clearly going to be a performance issue it
seems like something we should do now.
I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching I was
unable to do so. What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away. I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.
The code can do literally hundreds of random I/Os in an 8192 blocksize.
What happens with 16 or 32kB?
"Small regression" ?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
Don't see that.
I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData,
which is the first field of an index tuple. Index tuples are MAXALIGNed, so
I can't see how that would ever cross a page boundary.
They're 8 bytes, and MAXALIGN often is 4 bytes:
struct ItemPointerData {
	BlockIdData            ip_blkid;             /*     0     4 */
	OffsetNumber           ip_posid;             /*     4     2 */

	/* size: 6, cachelines: 1, members: 2 */
	/* last cacheline: 6 bytes */
};
struct IndexTupleData {
	ItemPointerData        t_tid;                /*     0     6 */
	short unsigned int     t_info;               /*     6     2 */

	/* size: 8, cachelines: 1, members: 2 */
	/* last cacheline: 8 bytes */
};
So as a whole they definitely can cross sector boundaries. You might be
able to argue your way out of that by saying that the blkid is going to
be aligned, but that's not that trivial, as t_info isn't guaranteed
that.
But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.
And I'm also doubtful it's worth it because:
Since this point of the code is clearly going to be a performance issue it
seems like something we should do now.
I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching I was
unable to do so. What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away. I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.
The code can do literally hundreds of random I/Os in an 8192 blocksize.
What happens with 16 or 32kB?
It's really hard to construct such cases after the sorting changes, but
obviously not impossible. But to make it actually painful you need a
workload where the implied randomness of accesses isn't already a major
bottleneck - and that's hard.
This has been discussed publicly for a few months...
Greetings,
Andres Freund
On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
Don't see that.
I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData, which is the first field of an index tuple. Index tuples
are MAXALIGNed, so I can't see how that would ever cross a page boundary.
They're 8 bytes, and MAXALIGN often is 4 bytes:
xids are 4 bytes, so we're good.
If MAXALIGN could ever be 2 bytes, we'd have a problem.
So as a whole they definitely can cross sector boundaries. You might be
able to argue your way out of that by saying that the blkid is going to
be aligned, but that's not that trivial, as t_info isn't guaranteed
that.
But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.
You can have unlogged changes that you rely on - that is exactly how hints
work.
If the hint is lost, we do the I/O. Worst case it would be the same as what
you have now.
I'm talking about saving many I/Os - this doesn't need to provably avoid
all I/Os to work, it's incremental benefit all the way.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 29, 2019 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.
It might be possible to WAL-log the _bt_check_unique() item killing.
That seems to be much more effective than the similar and better known
kill_prior_tuple optimization in practice. I don't think that that
should be in scope for v12, though. I for one am satisfied with your
explanation.
The code can do literally hundreds of random I/Os in an 8192 blocksize.
What happens with 16 or 32kB?
It's really hard to construct such cases after the sorting changes, but
obviously not impossible. But to make it actually painful you need a
workload where the implied randomness of accesses isn't already a major
bottleneck - and that's hard.
There is also the fact that in many cases you'll just have accessed
the same buffers from within _bt_check_unique() anyway.
--
Peter Geoghegan
On 2019-03-29 16:20:54 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
Don't see that.
I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData, which is the first field of an index tuple. Index tuples
are MAXALIGNed, so I can't see how that would ever cross a page boundary.
They're 8 bytes, and MAXALIGN often is 4 bytes:
xids are 4 bytes, so we're good.
I literally went on to explain why that's not sufficient? You can't
*just* replace the block number with an xid. You *also* need to set a
flag denoting that it's an xid and dead now. Which can't fit in the same
4 bytes. You either have to set it in the IndexTuple's t_info, or or in
the ItemIdData's lp_flags. And both can be on a different sectors. If
the flag persists, and the xid doesn't you're going to interpret a block
number as an xid - not great; but even worse, if the xid survives, but
the flag doesn't, you're going to access the xid as a block - definitely
not ok, because you're going to return wrong results.
Greetings,
Andres Freund
On Fri, Mar 29, 2019 at 4:27 AM Andres Freund <andres@anarazel.de> wrote:
On March 28, 2019 11:24:46 AM EDT, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Mar 28, 2019 at 5:28 AM Andres Freund <andres@anarazel.de> wrote:
Hm, good catch. I don't like this fix very much (even if it were
commented), but I don't have a great idea right now.
That was just a POC, to verify the problem. Not a proposal.
I'm mildly inclined to push a commented version of this. And add an open items entry. Alternatively I'm thinking of just but taking the tablespace setting into account.
I didn't understand that last sentence.
Here's an attempt to write a suitable comment for the quick fix. And
I suppose effective_io_concurrency is a reasonable default.
It's pretty hard to think of a good way to get your hands on the real
value safely from here. I wondered if there was a way to narrow this
to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
but that doesn't work; we access other catalogs too in that path.
Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
of asynchronous I/O requests" according to config.sgml, but here 0
will prefetch 10 buffers.
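As a quick illustration of that point, a tiny standalone program using the
same formula (assuming MAX_IO_CONCURRENCY is 1000, and a Min macro like the
one in c.h) shows that a setting of 0 still yields a prefetch distance of 10:

#include <stdio.h>

#define MAX_IO_CONCURRENCY 1000	/* assumed value; see bufmgr.h for the real one */
#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
	int		settings[] = {0, 1, 10, 200};

	for (int i = 0; i < 4; i++)
	{
		int		io_concurrency = settings[i];
		int		prefetch_distance = Min(io_concurrency + 10, MAX_IO_CONCURRENCY);

		printf("io_concurrency=%3d -> prefetch_distance=%d\n",
			   io_concurrency, prefetch_distance);
	}
	return 0;
}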
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Fix-deadlock-in-heap_compute_xid_horizon_for_tuples.patch (text/x-patch)
From 835d94dc1e37473a7bc93cee095aa3c123c8e614 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 30 Mar 2019 22:40:10 +1300
Subject: [PATCH] Fix deadlock in heap_compute_xid_horizon_for_tuples().
We can't call code that uses syscache while we hold buffer locks
on a catalog. If called for a catalog, just fall back to the
effective_io_concurrency GUC rather than trying to look up the
tablespace's IO concurrency setting.
Diagnosed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGLCwPF0S4Mk7S8qw%2BDK0Bq65LueN9rofAA3HHSYikW-Zw%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index de5bb9194e..8fefd2a3fe 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6976,8 +6976,15 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
 	 * more prefetching in this case, too. It may be that this formula is too
 	 * simplistic, but at the moment there is no evidence of that or any idea
 	 * about what would work better.
+	 *
+	 * Since the caller holds a buffer lock somewhere in rel, we'd better make
+	 * sure that isn't a catalog before we call code that does syscache
+	 * lookups, or it could deadlock.
 	 */
-	io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
+	if (IsCatalogRelation(rel))
+		io_concurrency = effective_io_concurrency;
+	else
+		io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
 	prefetch_distance = Min((io_concurrency) + 10, MAX_IO_CONCURRENCY);

 	/* Start prefetching. */
--
2.21.0
On Sat, Mar 30, 2019 at 6:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I didn't understand that last sentence.
Here's an attempt to write a suitable comment for the quick fix. And
I suppose effective_io_concurrency is a reasonable default.
It's pretty hard to think of a good way to get your hands on the real
value safely from here. I wondered if there was a way to narrow this
to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
but that doesn't work; we access other catalogs too in that path.
Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
of asynchronous I/O requests" according to config.sgml, but here 0
will prefetch 10 buffers.
Mmmph. I'm starting to think we're not going to get a satisfactory
result here unless we make this controlled by something other than
effective_io_concurrency. There's just no reason to suppose that the
same setting that we use to control prefetching for bitmap index scans
is also going to be right for what's basically a bulk operation.
Interestingly, Dilip Kumar ran into similar issues recently while
working on bulk processing for undo records for zheap. In that case,
you definitely want to prefetch the undo aggressively, because you're
reading it front to back and backwards scans suck without prefetching.
And you possibly also want to prefetch the data pages to which the
undo that you are prefetching applies, but maybe not as aggressively
because you're going to be doing a WAL write for each data page and
flooding the system with too many reads could be counterproductive, at
least if pg_wal and the rest of $PGDATA are not on separate spindles.
And even if they are, it's possible that as you suck in undo pages and
the zheap pages that they need to update, you might evict dirty pages,
generating write activity against the data directory.
Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK. Let's just create a new GUC for this and default it to 10
or something and go home.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 30, 2019 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK. Let's just create a new GUC for this and default it to 10
or something and go home.
I agree. If you invent a new GUC, then everybody notices, and it
usually has to be justified quite rigorously. There is a strong
incentive to use an existing GUC, if only because the problem that
this creates is harder to measure than the supposed problem that it
avoids. This can perversely work against the goal of making the system
easy to use. Stretching the original definition of a GUC is bad.
I take issue with the general assumption that not adding a GUC at
least makes things easier for users. In reality, it depends entirely
on the situation at hand.
--
Peter Geoghegan
On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Mar 30, 2019 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK. Let's just create a new GUC for this and default it to 10
or something and go home.
I agree. If you invent a new GUC, then everybody notices, and it
usually has to be justified quite rigorously. There is a strong
incentive to use an existing GUC, if only because the problem that
this creates is harder to measure than the supposed problem that it
avoids. This can perversely work against the goal of making the system
easy to use. Stretching the original definition of a GUC is bad.
I take issue with the general assumption that not adding a GUC at
least makes things easier for users. In reality, it depends entirely
on the situation at hand.
I'm not sure I understand why this is any different from the bitmap
heapscan case though, or in fact why we are adding 10 in this case.
In both cases we will soon be reading the referenced buffers, and it
makes sense to queue up prefetch requests for the blocks if they
aren't already in shared buffers. In both cases, the number of
prefetch requests we want to send to the OS is somehow linked to the
number of IO requests we think the OS can handle concurrently
(since that's one factor determining how fast it drains them), but
it's not necessarily the same as that number, AFAICS. It's useful to
queue some number of prefetch requests even if you have no IO
concurrency at all (a single old school spindle), just because the OS
will chew on that queue in the background while we're also doing
stuff, which is probably what that "+ 10" is expressing. But that
seems to apply to bitmap heapscan too, doesn't it?
--
Thomas Munro
https://enterprisedb.com
Hi,
On March 30, 2019 5:33:12 PM EDT, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Mar 30, 2019 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK. Let's just create a new GUC for this and default it to 10
or something and go home.
I agree. If you invent a new GUC, then everybody notices, and it
usually has to be justified quite rigorously. There is a strong
incentive to use an existing GUC, if only because the problem that
this creates is harder to measure than the supposed problem that it
avoids. This can perversely work against the goal of making the system
easy to use. Stretching the original definition of a GUC is bad.
I take issue with the general assumption that not adding a GUC at
least makes things easier for users. In reality, it depends entirely
on the situation at hand.
I'm not sure I understand why this is any different from the bitmap
heapscan case though, or in fact why we are adding 10 in this case.
In both cases we will soon be reading the referenced buffers, and it
makes sense to queue up prefetch requests for the blocks if they
aren't already in shared buffers. In both cases, the number of
prefetch requests we want to send to the OS is somehow linked to the
amount of IO requests we think the OS can handle concurrently at once
(since that's one factor determining how fast it drains them), but
it's not necessarily the same as that number, AFAICS. It's useful to
queue some number of prefetch requests even if you have no IO
concurrency at all (a single old school spindle), just because the OS
will chew on that queue in the background while we're also doing
stuff, which is probably what that "+ 10" is expressing. But that
seems to apply to bitmap heapscan too, doesn't it?
The index page deletion code does work on behalf of multiple backends, bitmap scans don't. If your system is busy it makes sense to limit resource usage of per-backend work, but not really of work on shared resources like page reuse. A bit like work_mem vs maintenance_work_mem.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, 29 Mar 2019 at 16:32, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 16:20:54 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.
Don't see that.
I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData, which is the first field of an index tuple. Index tuples
are MAXALIGNed, so I can't see how that would ever cross a page boundary.
They're 8 bytes, and MAXALIGN often is 4 bytes:
xids are 4 bytes, so we're good.
I literally went on to explain why that's not sufficient? You can't
*just* replace the block number with an xid. You *also* need to set a
flag denoting that it's an xid and dead now. Which can't fit in the same
4 bytes. You either have to set it in the IndexTuple's t_info, or in
the ItemIdData's lp_flags. And both can be on different sectors. If
the flag persists, and the xid doesn't, you're going to interpret a block
number as an xid - not great; but even worse, if the xid survives, but
the flag doesn't, you're going to access the xid as a block - definitely
not ok, because you're going to return wrong results.
Yes, I agree, I was thinking the same thing after my last comment, but was
afk. The idea requires the atomic update of at least 4 bytes plus at least
1 bit and so would require at least 8-byte MAXALIGN to be useful. Your other
points suggesting that multiple things all need to be set accurately for
this to work aren't correct. The idea was that we would write a hint that
would avoid later I/O, so the overall idea is still viable.
Anyway, thinking some more, I think the whole idea of generating
latestRemovedXid is moot and there are better ways in the future of doing
this that would avoid a performance issue altogether. Clearly not PG12.
The main issue relates to the potential overhead of moving this to the
master. I agree it's the right thing to do, but we should have some way of
checking it is not a performance issue in practice.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 30, 2019 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's an attempt to write a suitable comment for the quick fix. And
I suppose effective_io_concurrency is a reasonable default.
Pushed.
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-03-30 11:44:36 -0400, Robert Haas wrote:
On Sat, Mar 30, 2019 at 6:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I didn't understand that last sentence.
Here's an attempt to write a suitable comment for the quick fix. And
I suppose effective_io_concurrency is a reasonable default.
It's pretty hard to think of a good way to get your hands on the real
value safely from here. I wondered if there was a way to narrow this
to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
but that doesn't work; we access other catalogs too in that path.
Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
of asynchronous I/O requests" according to config.sgml, but here 0
will prefetch 10 buffers.
Mmmph. I'm starting to think we're not going to get a satisfactory
result here unless we make this controlled by something other than
effective_io_concurrency. There's just no reason to suppose that the
same setting that we use to control prefetching for bitmap index scans
is also going to be right for what's basically a bulk operation.
Interestingly, Dilip Kumar ran into similar issues recently while
working on bulk processing for undo records for zheap. In that case,
you definitely want to prefetch the undo aggressively, because you're
reading it front to back and backwards scans suck without prefetching.
And you possibly also want to prefetch the data pages to which the
undo that you are prefetching applies, but maybe not as aggressively
because you're going to be doing a WAL write for each data page and
flooding the system with too many reads could be counterproductive, at
least if pg_wal and the rest of $PGDATA are not on separate spindles.
And even if they are, it's possible that as you suck in undo pages and
the zheap pages that they need to update, you might evict dirty pages,
generating write activity against the data directory.
I'm not yet convinced it's necessary to create a new GUC, but also not
strongly opposed. I've created an open items issue for it, so we don't
forget.
Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK. Let's just create a new GUC for this and default it to 10
or something and go home.
I agree that we needed to split work_mem, but a) that was far less clear
for many years b) there was no logic to use more work_mem in
maintenance-y cases...
Greetings,
Andres Freund
Hi,
On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
I'm not yet convinced it's necessary to create a new GUC, but also not
strongly opposed. I've created an open items issue for it, so we don't
forget.
My current inclination is to not do anything for v12. Robert, do you
disagree?
Greetings,
Andres Freund
On Wed, May 1, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
I'm not yet convinced it's necessary to create a new GUC, but also not
strongly opposed. I've created an open items issue for it, so we don't
forget.
My current inclination is to not do anything for v12. Robert, do you
disagree?
Not strongly enough to argue about it very hard. The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 1, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
My current inclination is to not do anything for v12. Robert, do you
disagree?
Not strongly enough to argue about it very hard. The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.
I think there's nothing that fails to suck about a hardwired "+ 10".
We should either remove that and use effective_io_concurrency as-is,
or decide that it's worth having a separate GUC for bulk operations.
At this stage of the cycle I'd incline to the former, but if somebody
is excited enough to prepare a patch for a new GUC, I wouldn't push
back on that solution.
regards, tom lane
On Wed, May 1, 2019 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not strongly enough to argue about it very hard. The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.
I think there's nothing that fails to suck about a hardwired "+ 10".
It avoids a performance regression without adding another GUC.
That may not be enough reason to keep it like that, but it is one
thing that does fail to suck.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 2, 2019 at 5:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 1, 2019 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not strongly enough to argue about it very hard. The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.
I think there's nothing that fails to suck about a hardwired "+ 10".
It avoids a performance regression without adding another GUC.
That may not be enough reason to keep it like that, but it is one
thing that does fail to suck.
This is listed as an open item to resolve for 12. IIUC the options on
the table are:
1. Do nothing, and ship with effective_io_concurrency + 10.
2. Just use effective_io_concurrency without the hardwired boost.
3. Switch to a new GUC maintenance_io_concurrency (or some better name).
The rationale for using a different number is that this backend is
working on behalf of multiple sessions, so you might want to give it
some more juice, much like maintenance_work_mem.
I vote for option 3. I have no clue how to set it, but at least users
have a fighting chance of experimenting and figuring it out that way.
I volunteer to write the patch if we get a consensus.
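Purely to illustrate option 3, here is a rough standalone sketch (not a
patch): the maintenance_io_concurrency variable, its assumed default of 10,
and the MAX_IO_CONCURRENCY value are all placeholders for the example.

#include <stdio.h>

#define MAX_IO_CONCURRENCY 1000	/* assumed value; see bufmgr.h for the real one */
#define Min(x, y) ((x) < (y) ? (x) : (y))

static int	maintenance_io_concurrency = 10;	/* hypothetical GUC */

int
main(void)
{
	/* option 3: no hardwired "+ 10", just the dedicated setting, capped */
	int		prefetch_distance = Min(maintenance_io_concurrency, MAX_IO_CONCURRENCY);

	printf("prefetch_distance = %d\n", prefetch_distance);
	return 0;
}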
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
On Thu, May 2, 2019 at 5:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 1, 2019 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not strongly enough to argue about it very hard. The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.
I think there's nothing that fails to suck about a hardwired "+ 10".
It avoids a performance regression without adding another GUC.
That may not be enough reason to keep it like that, but it is one
thing that does fail to suck.
This is listed as an open item to resolve for 12. IIUC the options on
the table are:
1. Do nothing, and ship with effective_io_concurrency + 10.
2. Just use effective_io_concurrency without the hardwired boost.
3. Switch to a new GUC maintenance_io_concurrency (or some better name).
The rationale for using a different number is that this backend is
working on behalf of multiple sessions, so you might want to give it
some more juice, much like maintenance_work_mem.
I vote for option 3. I have no clue how to set it, but at least users
have a fighting chance of experimenting and figuring it out that way.
I volunteer to write the patch if we get a consensus.
I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
also a good option - it's easy to imagine later using it for
VACUUM, ANALYZE and the like. I think 2 is a bad idea.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
This is listed as an open item to resolve for 12. IIUC the options on
the table are:
1. Do nothing, and ship with effective_io_concurrency + 10.
2. Just use effective_io_concurrency without the hardwired boost.
3. Switch to a new GUC maintenance_io_concurrency (or some better name).
I vote for option 3. I have no clue how to set it, but at least users
have a fighting chance of experimenting and figuring it out that way.
I volunteer to write the patch if we get a consensus.
I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
also a good option - it's easy to imagine later using it for VACUUM,
ANALYZE and the like. I think 2 is a bad idea.
FWIW, I also agree with settling for #1 at this point. A new GUC would
make more sense if we have multiple use-cases for it, which we probably
will at some point, but not today. I'm concerned that if we invent a
GUC now, we might find out that it's not really usable for other cases
in future (e.g., default value is no good for other cases). It's the
old story that inventing an API with only one use-case in mind leads
to a bad API.
So yeah, let's leave this be for now, ugly as it is. Improving it
can be future work.
regards, tom lane
On Thu, May 16, 2019 at 3:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
This is listed as an open item to resolve for 12. IIUC the options on
the table are:
1. Do nothing, and ship with effective_io_concurrency + 10.
2. Just use effective_io_concurrency without the hardwired boost.
3. Switch to a new GUC maintenance_io_concurrency (or some better name).
I vote for option 3. I have no clue how to set it, but at least users
have a fighting chance of experimenting and figuring it out that way.
I volunteer to write the patch if we get a consensus.
I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
also a good option - it's easy to imagine later using it for VACUUM,
ANALYZE and the like. I think 2 is a bad idea.
FWIW, I also agree with settling for #1 at this point. A new GUC would
make more sense if we have multiple use-cases for it, which we probably
will at some point, but not today. I'm concerned that if we invent a
GUC now, we might find out that it's not really usable for other cases
in future (e.g., default value is no good for other cases). It's the
old story that inventing an API with only one use-case in mind leads
to a bad API.
So yeah, let's leave this be for now, ugly as it is. Improving it
can be future work.
Cool, I moved it to the resolved section.
--
Thomas Munro
https://enterprisedb.com