Allow "snapshot too old" error, to prevent bloat
This patch is related to the "Reduce pinning in btree indexes"
patch submitted here:
/messages/by-id/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com
That describes how they evolved and how they relate; I won't
duplicate that here.
Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes. In spite of that, I have thrown the worst test
cases I could think of at it (and only succeeded in uncovering a
bug which was already out there in production), and it has shown
its value in a two-day test simulating a 300-user load with
complex real-world applications (although the only indexes it used
were btree indexes). Without the patches the database growth was
39GB per day; with the patches it was 28.5GB per day. (The test
does involve more inserts than deletes, so some growth is
expected.) At the end of the tests, pgstattuple reported eight
times as many dead tuples in the database without the patches.
More importantly, without the patches the CPU load started at 60%
and showed linear growth to 92% over the course of the first day;
with the patches it stayed at a stable 60% throughout the test.
What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed. It also
saves the current insertion LSN into every snapshot when it is
created. When reading from the heap or any index, if the snapshot
is vulnerable to showing incorrect data because the threshold has
been crossed since it was generated, reading any page with an LSN
past the snapshot LSN causes a "snapshot too old" error to be
thrown. Since this is LSN-based, the new logic is not used for any
relation which is not WAL-logged.
Note that if you don't read data from a page which has been
modified after your snapshot was taken, the threshold doesn't
matter.
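In other words, every page a scan reads gets a cheap check before its
tuples are examined. Restated from the TestForOldSnapshot() macro in
the attached patch (error recovery and locking context omitted;
snapshot, relation, and page are whatever the caller already has in
hand):

    if (old_snapshot_threshold >= 0 &&
        snapshot != NULL &&
        snapshot->satisfies == HeapTupleSatisfiesMVCC &&
        !XLogRecPtrIsInvalid(snapshot->lsn) &&
        PageGetLSN(page) > snapshot->lsn &&
        NormalTransactionIdFollows(
            TransactionIdLimitedForOldSnapshots(snapshot->xmin, relation),
            snapshot->xmin))
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));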
All `make installcheck` tests succeed with any setting. With a
setting of 0 (the most extreme), `make installcheck-world` sees
four isolation tests fail. Those all pass if you raise the
setting to 2. The postgres_fdw test needs a setting of 4 to
succeed. I would expect most shops would want to tune this to
something in the six-digit to eight-digit range. In the tests
mentioned above it was set to 150000 (which corresponded to just
under 4 minutes of txid consumption) and there were no "snapshot
too old" errors, even though some cursors were left open for the
entire two-day run.
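For illustration only, the two-day test above ran with the equivalent
of this in postgresql.conf (with the patch as posted the setting is
postmaster-only, so changing it takes a restart):

    old_snapshot_threshold = 150000    # in transaction IDs; -1 (default) disables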
The patch still lacks (as mentioned above) support for index AMs
other than btree, and lacks documentation for the new GUC. I'm
sure that there are some comments and README files that need
adjustment, too. As I said, this is still POC.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
snapshot-too-old-v1.patch (text/x-diff)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 366,371 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 366,372 ----
LockBuffer(buffer, BUFFER_LOCK_SHARE);
dp = (Page) BufferGetPage(buffer);
+ TestForOldSnapshot(snapshot, scan->rs_rd, dp);
lines = PageGetMaxOffsetNumber(dp);
ntup = 0;
***************
*** 496,501 **** heapgettup(HeapScanDesc scan,
--- 497,503 ----
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
dp = (Page) BufferGetPage(scan->rs_cbuf);
+ TestForOldSnapshot(snapshot, scan->rs_rd, dp);
lines = PageGetMaxOffsetNumber(dp);
/* page and lineoff now reference the physically next tid */
***************
*** 538,543 **** heapgettup(HeapScanDesc scan,
--- 540,546 ----
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
dp = (Page) BufferGetPage(scan->rs_cbuf);
+ TestForOldSnapshot(snapshot, scan->rs_rd, dp);
lines = PageGetMaxOffsetNumber(dp);
if (!scan->rs_inited)
***************
*** 696,701 **** heapgettup(HeapScanDesc scan,
--- 699,705 ----
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
dp = (Page) BufferGetPage(scan->rs_cbuf);
+ TestForOldSnapshot(snapshot, scan->rs_rd, dp);
lines = PageGetMaxOffsetNumber((Page) dp);
linesleft = lines;
if (backward)
***************
*** 1573,1578 **** heap_fetch(Relation relation,
--- 1577,1583 ----
*/
LockBuffer(buffer, BUFFER_LOCK_SHARE);
page = BufferGetPage(buffer);
+ TestForOldSnapshot(snapshot, relation, page);
/*
* We'd better check for out-of-range offnum in case of VACUUM since the
***************
*** 1902,1907 **** heap_get_latest_tid(Relation relation,
--- 1907,1913 ----
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&ctid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
page = BufferGetPage(buffer);
+ TestForOldSnapshot(snapshot, relation, page);
/*
* Check for bogus item number. This is not treated as an error
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 92,103 **** heap_page_prune_opt(Relation relation, Buffer buffer)
* need to use the horizon that includes slots, otherwise the data-only
* horizon can be used. Note that the toast relation of user defined
* relations are *not* considered catalog relations.
*/
if (IsCatalogRelation(relation) ||
RelationIsAccessibleInLogicalDecoding(relation))
OldestXmin = RecentGlobalXmin;
else
! OldestXmin = RecentGlobalDataXmin;
Assert(TransactionIdIsValid(OldestXmin));
--- 92,112 ----
* need to use the horizon that includes slots, otherwise the data-only
* horizon can be used. Note that the toast relation of user defined
* relations are *not* considered catalog relations.
+ *
+ * It is OK to apply the old snapshot limit before acquiring the cleanup
+ * lock because the worst that can happen is that we are not quite as
+ * aggressive about the cleanup (by however many transaction IDs are
+ * consumed between this point and acquiring the lock). This allows us to
+ * save significant overhead in the case where the page is found not to be
+ * prunable.
*/
if (IsCatalogRelation(relation) ||
RelationIsAccessibleInLogicalDecoding(relation))
OldestXmin = RecentGlobalXmin;
else
! OldestXmin =
! TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
! relation);
Assert(TransactionIdIsValid(OldestXmin));
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 118,124 **** _bt_doinsert(Relation rel, IndexTuple itup,
top:
/* find the first page containing this key */
! stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);
offset = InvalidOffsetNumber;
--- 118,124 ----
top:
/* find the first page containing this key */
! stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE, NULL);
offset = InvalidOffsetNumber;
***************
*** 134,140 **** top:
* precise description.
*/
buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
! true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
--- 134,140 ----
* precise description.
*/
buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
! true, stack, BT_WRITE, NULL);
/*
* If we're not allowing duplicates, make sure the key isn't already in
***************
*** 1654,1660 **** _bt_insert_parent(Relation rel,
elog(DEBUG2, "concurrent ROOT page split");
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
/* Find the leftmost page at the next level up */
! pbuf = _bt_get_endpoint(rel, lpageop->btpo.level + 1, false);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
--- 1654,1661 ----
elog(DEBUG2, "concurrent ROOT page split");
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
/* Find the leftmost page at the next level up */
! pbuf = _bt_get_endpoint(rel, lpageop->btpo.level + 1, false,
! NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 1254,1260 **** _bt_pagedel(Relation rel, Buffer buf)
itup_scankey = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page containing this key */
stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey,
! false, &lbuf, BT_READ);
/* don't need a pin on the page */
_bt_relbuf(rel, lbuf);
--- 1254,1260 ----
itup_scankey = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page containing this key */
stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey,
! false, &lbuf, BT_READ, NULL);
/* don't need a pin on the page */
_bt_relbuf(rel, lbuf);
*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 22,27 ****
--- 22,28 ----
#include "storage/predicate.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+ #include "utils/snapmgr.h"
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
***************
*** 29,35 **** static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
! static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
--- 30,36 ----
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
! static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
***************
*** 48,53 **** static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
--- 49,58 ----
* address of the leaf-page buffer, which is read-locked and pinned.
* No locks are held on the parent pages, however!
*
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
***************
*** 56,62 **** static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
! Buffer *bufP, int access)
{
BTStack stack_in = NULL;
--- 61,67 ----
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
! Buffer *bufP, int access, Snapshot snapshot)
{
BTStack stack_in = NULL;
***************
*** 65,71 **** _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
--- 70,79 ----
/* If index is empty and access = BT_READ, no root page is created. */
if (!BufferIsValid(*bufP))
+ {
+ /* FIXME: old snapshot checking special case here? */
return (BTStack) NULL;
+ }
/* Loop iterates once per level descended in the tree */
for (;;)
***************
*** 93,99 **** _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
*/
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
(access == BT_WRITE), stack_in,
! BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
--- 101,107 ----
*/
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
(access == BT_WRITE), stack_in,
! BT_READ, snapshot);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
***************
*** 166,171 **** _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
--- 174,183 ----
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
*/
Buffer
_bt_moveright(Relation rel,
***************
*** 175,181 **** _bt_moveright(Relation rel,
bool nextkey,
bool forupdate,
BTStack stack,
! int access)
{
Page page;
BTPageOpaque opaque;
--- 187,194 ----
bool nextkey,
bool forupdate,
BTStack stack,
! int access,
! Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
***************
*** 201,206 **** _bt_moveright(Relation rel,
--- 214,220 ----
for (;;)
{
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque))
***************
*** 937,943 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
! stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ);
/* don't need to keep the stack around... */
_bt_freestack(stack);
--- 951,958 ----
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
! stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
! scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
***************
*** 1308,1313 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1323,1329 ----
so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
***************
*** 1344,1350 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
/* Step to next physical page */
! so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
--- 1360,1367 ----
}
/* Step to next physical page */
! so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
! scan->xs_snapshot);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
***************
*** 1356,1361 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1373,1379 ----
* and do it all again.
*/
page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
***************
*** 1386,1392 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* again if it's important.
*/
static Buffer
! _bt_walk_left(Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
--- 1404,1410 ----
* again if it's important.
*/
static Buffer
! _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
***************
*** 1416,1421 **** _bt_walk_left(Relation rel, Buffer buf)
--- 1434,1440 ----
CHECK_FOR_INTERRUPTS();
buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
***************
*** 1442,1453 **** _bt_walk_left(Relation rel, Buffer buf)
--- 1461,1474 ----
blkno = opaque->btpo_next;
buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
/* Return to the original page to see what's up */
buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISDELETED(opaque))
{
***************
*** 1465,1470 **** _bt_walk_left(Relation rel, Buffer buf)
--- 1486,1492 ----
blkno = opaque->btpo_next;
buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_ISDELETED(opaque))
break;
***************
*** 1501,1507 **** _bt_walk_left(Relation rel, Buffer buf)
* The returned buffer is pinned and read-locked.
*/
Buffer
! _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
{
Buffer buf;
Page page;
--- 1523,1530 ----
* The returned buffer is pinned and read-locked.
*/
Buffer
! _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
! Snapshot snapshot)
{
Buffer buf;
Page page;
***************
*** 1524,1529 **** _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
--- 1547,1553 ----
return InvalidBuffer;
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
for (;;)
***************
*** 1543,1548 **** _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
--- 1567,1573 ----
RelationGetRelationName(rel));
buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
***************
*** 1595,1601 **** _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
! buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
if (!BufferIsValid(buf))
{
--- 1620,1626 ----
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
! buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
if (!BufferIsValid(buf))
{
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***************
*** 441,447 **** vacuum_set_xid_limits(Relation rel,
* working on a particular table at any time, and that each vacuum is
* always an independent transaction.
*/
! *oldestXmin = GetOldestXmin(rel, true);
Assert(TransactionIdIsNormal(*oldestXmin));
--- 441,448 ----
* working on a particular table at any time, and that each vacuum is
* always an independent transaction.
*/
! *oldestXmin =
! TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
Assert(TransactionIdIsNormal(*oldestXmin));
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 59,64 ****
--- 59,65 ----
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_rusage.h"
+ #include "utils/snapmgr.h"
#include "utils/timestamp.h"
#include "utils/tqual.h"
***************
*** 267,273 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
! possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION))
lazy_truncate_heap(onerel, vacrelstats);
/* Vacuum the Free Space Map */
--- 268,275 ----
possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
! possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
! old_snapshot_threshold < 0)
lazy_truncate_heap(onerel, vacrelstats);
/* Vacuum the Free Space Map */
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1609,1614 **** GetSnapshotData(Snapshot snapshot)
--- 1609,1620 ----
snapshot->regd_count = 0;
snapshot->copied = false;
+ /*
+ * Capture the current WAL stream location in case this snapshot becomes
+ * old enough to need to fall back on the special "old snapshot" logic.
+ */
+ snapshot->lsn = GetXLogInsertRecPtr();
+
return snapshot;
}
*** a/src/backend/utils/errcodes.txt
--- b/src/backend/utils/errcodes.txt
***************
*** 410,415 **** Section: Class 58 - System Error (errors external to PostgreSQL itself)
--- 410,419 ----
58P01 E ERRCODE_UNDEFINED_FILE undefined_file
58P02 E ERRCODE_DUPLICATE_FILE duplicate_file
+ Section: Class 72 - Snapshot Failure
+ # (class borrowed from Oracle)
+ 72000 E ERRCODE_SNAPSHOT_TOO_OLD snapshot_too_old
+
Section: Class F0 - Configuration File Error
# (PostgreSQL-specific error class)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2448,2453 **** static struct config_int ConfigureNamesInt[] =
--- 2448,2463 ----
},
{
+ {"old_snapshot_threshold", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
+ gettext_noop("The number of transaction IDs which must be consumed before a snapshot can be considered too old."),
+ gettext_noop("A value of -1 disables this feature.")
+ },
+ &old_snapshot_threshold,
+ -1, -1, 2000000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"tcp_keepalives_idle", PGC_USERSET, CLIENT_CONN_OTHER,
gettext_noop("Time between issuing TCP keepalives."),
gettext_noop("A value of 0 uses the system default."),
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 54,59 ****
--- 54,60 ----
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
+ #include "utils/rel.h"
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
***************
*** 61,66 ****
--- 62,73 ----
/*
+ * GUC parameters
+ */
+ int old_snapshot_threshold;
+
+
+ /*
* CurrentSnapshot points to the only snapshot taken in transaction-snapshot
* mode, and to the latest one taken in a read-committed transaction.
* SecondarySnapshot is a snapshot that's always up-to-date as of the current
***************
*** 1065,1070 **** pg_export_snapshot(PG_FUNCTION_ARGS)
--- 1072,1104 ----
/*
+ * TransactionIdLimitedForOldSnapshots -- apply old snapshot limit, if any
+ */
+ TransactionId
+ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+ Relation relation)
+ {
+ if (TransactionIdIsNormal(recentXmin)
+ && old_snapshot_threshold >= 0
+ && RelationNeedsWAL(relation)
+ && !IsCatalogRelation(relation)
+ && !RelationIsAccessibleInLogicalDecoding(relation))
+ {
+ TransactionId xlimit;
+
+ xlimit = ShmemVariableCache->latestCompletedXid;
+ Assert(TransactionIdIsNormal(xlimit));
+ xlimit -= old_snapshot_threshold;
+ TransactionIdAdvance(xlimit);
+ if (NormalTransactionIdFollows(xlimit, recentXmin))
+ return xlimit;
+ }
+
+ return recentXmin;
+ }
+
+
+ /*
* Parsing subroutines for ImportSnapshot: parse a line with the given
* prefix followed by a value, and advance *s to the next line. The
* filename is provided for use in error messages.
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 669,685 **** extern int _bt_pagedel(Relation rel, Buffer buf);
*/
extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
! Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
! int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
! extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
/*
* prototypes for functions in nbtutils.c
--- 669,686 ----
*/
extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
! Buffer *bufP, int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
! int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
! extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
! Snapshot snapshot);
/*
* prototypes for functions in nbtutils.c
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 15,20 ****
--- 15,21 ----
#define REL_H
#include "access/tupdesc.h"
+ #include "access/xlog.h"
#include "catalog/pg_am.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
***************
*** 14,21 ****
--- 14,41 ----
#define SNAPMGR_H
#include "fmgr.h"
+ #include "catalog/catalog.h"
#include "utils/resowner.h"
#include "utils/snapshot.h"
+ #include "utils/tqual.h"
+
+
+ #define TestForOldSnapshot(snapshot, relation, page) \
+ do { \
+ if (old_snapshot_threshold >= 0 \
+ && ((snapshot) != NULL) \
+ && (snapshot)->satisfies == HeapTupleSatisfiesMVCC \
+ && !XLogRecPtrIsInvalid((snapshot)->lsn) \
+ && PageGetLSN(page) > (snapshot)->lsn \
+ && NormalTransactionIdFollows(TransactionIdLimitedForOldSnapshots((snapshot)->xmin, relation), (snapshot)->xmin)) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_SNAPSHOT_TOO_OLD), \
+ errmsg("snapshot too old"))); \
+ } while (0)
+
+
+ /* GUC variables */
+ extern int old_snapshot_threshold;
extern bool FirstSnapshotSet;
***************
*** 54,59 **** extern void ImportSnapshot(const char *idstr);
--- 74,81 ----
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
extern bool ThereAreNoPriorRegisteredSnapshots(void);
+ extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+ Relation relation);
extern char *ExportSnapshot(Snapshot snapshot);
*** a/src/include/utils/snapshot.h
--- b/src/include/utils/snapshot.h
***************
*** 14,19 ****
--- 14,20 ----
#define SNAPSHOT_H
#include "access/htup.h"
+ #include "access/xlogdefs.h"
#include "lib/pairingheap.h"
#include "storage/buf.h"
***************
*** 95,100 **** typedef struct SnapshotData
--- 96,103 ----
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+
+ XLogRecPtr lsn; /* position in the WAL stream */
} SnapshotData;
/*
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
***************
*** 15,20 ****
--- 15,21 ----
#ifndef TQUAL_H
#define TQUAL_H
+ #include "utils/hsearch.h"
#include "utils/snapshot.h"
Kevin Grittner <kgrittn@ymail.com> writes:
What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.
TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.
Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes.
But, having said that, why would the index AMs care? Seems like what
you are describing should be strictly a matter for VACUUM's removal
rules. If we're going to have something as ugly as this, I would much
rather it had a very small code footprint.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.
TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.
I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.
Note that the default is to maintain legacy PostgreSQL behavior;
the postgresql.conf file must be modified to see this error.
Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes.
But, having said that, why would the index AMs care? Seems like what
you are describing should be strictly a matter for VACUUM's removal
rules. If we're going to have something as ugly as this, I would much
rather it had a very small code footprint.
If a tuple is vacuumed away, you would not notice that if you were
running an index scan because you would not visit the page it used
to be on to see that the LSN had changed. We tried to sell the
idea that *any* scan using a snapshot past the threshold should
error out (which is the only way I can see to avoid making the
index AMs aware of this), but they insisted that there was no
reason to have scans fail when reading an unmodified table using an
old snapshot, or even an unmodified page of a modified table.
Due to the cost to this company of modifying their software to deal
with different behavior, they will not move without the behavior
I'm shooting for with this patch. Due to the value of this
customer to us, we will put this (or something to accomplish this
behavior) into our fork. We're happy to share it with the
community. If the community does not feel they need this behavior,
or wants to generate the error more aggressively to avoid touching
the index AMs, I understand.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.
TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.
I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.
No doubt, preventing bloat is a good thing, but that doesn't mean this
is the best API we could create for the issue. The proposition this
patch offers to DBAs is: "You can turn this knob to reduce bloat by some
hard-to-quantify factor. The cost is that some long-running transactions
might fail. You won't know which ones are at risk, the failures won't be
the same from time to time, and you won't be able to do anything to spare
high-value transactions from that fate except by turning that knob back
again globally (which requires a database restart)." Maybe refugees from
Oracle will think that sounds good, but nobody else will.
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.
regards, tom lane
On 2/15/15 10:36 AM, Tom Lane wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.
TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.
I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.
No doubt, preventing bloat is a good thing, but that doesn't mean this
is the best API we could create for the issue. The proposition this
patch offers to DBAs is: "You can turn this knob to reduce bloat by some
hard-to-quantify factor. The cost is that some long-running transactions
might fail. You won't know which ones are at risk, the failures won't be
the same from time to time, and you won't be able to do anything to spare
high-value transactions from that fate except by turning that knob back
again globally (which requires a database restart)." Maybe refugees from
Oracle will think that sounds good, but nobody else will.I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.
A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.
An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
On 2/15/15 10:36 AM, Tom Lane wrote:
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.
A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.
Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.
An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.
There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
On 2/15/15 10:36 AM, Tom Lane wrote:
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.
We suggested this to our customer and got out of the meeting with
it looking like it *might* fly. In the next meeting, however, they
said they had run it by others and reviewed the code and it was
completely unacceptable -- they would not consider pg with this as
the solution.
A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.
Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.
These they were comfortable with, and did *not* consider to be
unpredictable or something they could not do something about.
I really don't feel I can say more than that, though, without
disclosing more than I should.
An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.
While this option was not specifically suggested, based on the
reasons they gave for finding numerous other options we suggested
unacceptable, I feel sure that this would not be acceptable.
I think Tom hit the nail on the head when he said "Maybe refugees
from Oracle will think that sounds good..." It is precisely those
with very large code bases which have been modified over long
periods of time to work with Oracle that would find this solution
attractive, or perhaps *necessary* to consider a move to Postgres.
That's a potential market we don't care to write off.
There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.
What the customer most doesn't want to be "nondeterministic" is
whether the error is generated only when the snapshot has been used
to read a page which has been modified since the snapshot was
taken. If tables or materialized views are set up before a query
and then not subsequently modified during the execution of the
query, that query must not be canceled even if it runs for days,
but it must not cause unconstrained bloat during the run. So far I
don't see any way to avoid including the LSN with the snapshot or
modifying the index AMs. Let's be clear on the footprint for that;
for the btree implementation it is:
src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)
What this discussion has made me reconsider is the metric for
considering a transaction "too old". The number of transaction IDs
consumed seems inferior to LSN or time as the unit of measure for
that.
It looks to me to be pretty trivial (on the order of maybe 30 lines
of code) to specify this GUC in minutes rather than transaction
IDs. At first glance this seems like it would be vulnerable to the
usual complaints about mismanaged clocks, but those are easily
answered by using a cached time in shared memory that we populate
in such a way that it can never move backward. Done right, this
could even allow the GUC to be changeable on reload rather than
only at restart. A badly mismanaged system clock could not cause a
query to generate incorrect results; the worst that could happen is
that this feature would fail to control bloat as much as expected
or reads of modified data could generate the "snapshot too old"
error around the time of the clock adjustment.
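A minimal sketch of the shared-memory time source I have in mind
(hypothetical code, not part of the attached patch; the struct and
function names here are made up for illustration):

    /* Hypothetical sketch only -- not part of the attached patch. */
    typedef struct OldSnapshotTimeState
    {
        slock_t     mutex;              /* protects current_timestamp */
        TimestampTz current_timestamp;  /* latest value handed out */
    } OldSnapshotTimeState;

    static OldSnapshotTimeState *oldSnapshotTime;   /* lives in shared memory */

    static TimestampTz
    GetOldSnapshotCurrentTimestamp(void)
    {
        TimestampTz now = GetCurrentTimestamp();

        SpinLockAcquire(&oldSnapshotTime->mutex);
        if (now < oldSnapshotTime->current_timestamp)
            now = oldSnapshotTime->current_timestamp;   /* clock stepped back */
        else
            oldSnapshotTime->current_timestamp = now;
        SpinLockRelease(&oldSnapshotTime->mutex);

        return now;                 /* never moves backward */
    }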
As before, this would default to a magic value to mean that you
want the historical PostgreSQL behavior.
If that makes the idea more palatable to anyone, I can submit a
patch to that effect within the next 24 hours.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-02-16 05:19:11 +0000, Kevin Grittner wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
On 2/15/15 10:36 AM, Tom Lane wrote:
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.
We suggested this to our customer and got out of the meeting with
it looking like it *might* fly. In the next meeting, however, they
said they had run it by others and reviewed the code and it was
completely unacceptable -- they would not consider pg with this as
the solution.
A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.
Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.
These they were comfortable with, and did *not* consider to be
unpredictable or something they could not do something about.
I really don't feel I can say more than that, though, without
disclosing more than I should.
I agree that we need to do something about the dangers of long
snapshots, but I'm not sure this is it. I'm not sure of the contrary either.
What I'm definitely not a fan of, though, is this implementation. Having
to add checks to a large number of places is a recipe for silently wrong
query results.
One thing I was wondering about recently was introducing an optimization
for repeatedly updated rows into the vacuum code: A row that has xmin =
xmax where these have committed can be removed, even if the xid is above
the xmin horizon - no other transaction is ever going to see it. While
there's some hairy-ness implementing that, it doesn't seem too hard. And
there's a fair number of cases where that'd allow less bloat to
accumulate. Obviously it'd be better if we had logic to do that for
other patterns as well (where the updates aren't in the same xact), but
it seems like a good start.
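Roughly -- and glossing over the hairy parts (multixacts, tuple locks,
hint bits) -- the extra rule in HeapTupleSatisfiesVacuum() would be
something like this sketch, where tuple is the tuple header being
examined:

    /*
     * Sketch only: a tuple inserted and deleted by the same committed
     * transaction can never be visible to any other snapshot, so it can
     * be reported as dead even though it is above the xmin horizon.
     */
    if (TransactionIdIsNormal(HeapTupleHeaderGetXmin(tuple)) &&
        TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),
                            HeapTupleHeaderGetUpdateXid(tuple)) &&
        TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
        return HEAPTUPLE_DEAD;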
There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.
What the customer most doesn't want to be "nondeterministic" is
whether the error is generated only when the snapshot has been used
to read a page which has been modified since the snapshot was
taken. If tables or materialized views are set up before a query
and then not subsequently modified during the execution of the
query, that query must not be canceled even if it runs for days,
but it must not cause unconstrained bloat during the run. So far I
don't see any way to avoid including the LSN with the snapshot or
modifying the index AMs. Let's be clear on the footprint for that;
for the btree implementation it is:
IIUC you never would want cancellations when accessing the tables
these longrunning backends actually access. The errors about too old
snapshots are just a stopgap because we can't compute the xmin per table
in a meaningful way atm. Right?
Is the bloat caused by rows these transactions actually can see or are
the not-removed rows newer than the transaction's xmax?
Since you actually don't want cancellations for the long running
reporting queries it very much might be sensible to switch to a more
complicated version of HeapTupleSatisfiesVacuum() if there's longrunning
queries. One that can recognize if rows are actually visible to any
backend, or if there are just snapshots that see older rows. I've
previously wondered how hard this would be, but I don't think it's
*that* hard. And it'd be a much more general solution.
Greetings,
Andres Freund
On 2/16/15 2:41 AM, Andres Freund wrote:
Since you actually don't want cancellations for the long running
reporting queries it very much might be sensible to switch to a more
complicated version of HeapTupleSatisfiesVacuum() if there's longrunning
queries. One that can recognize if rows are actually visible to any
backend, or if there are just snapshots that see older rows. I've
previously wondered how hard this would be, but I don't think it's
*that* hard. And it'd be a much more general solution.
Josh Berkus mentioned on IRC that they're working on improving vacuum,
and I think it was something along these lines.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Feb 15, 2015 at 11:36:50AM -0500, Tom Lane wrote:
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.
While we have prided ourselves on not generating query cancel messages
due to snapshot-too-old, there is the downside of bloat. I understand
the need to allow users to make the trade-off between bloat and query
cancellation.
It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.
FWIW, I think that's a somewhat skewed perception. I think it was right to
introduce those, because we didn't really have any alternatives.
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for questions
and complaints that I/we see.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.
FWIW, I think that's a somewhat skewed perception. I think it was right to
introduce those, because we didn't really have any alternatives.
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for questions
and complaints that I/we see.
Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because of
confusing units amongst other things. Which in turn seems fairly close to
Kevin's suggestions, unfortunately.
/Magnus
Magnus Hagander <magnus@hagander.net> wrote:
On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
It seems we already have a mechanism in place that allows
tuning of query cancel on standbys vs. preventing standby
queries from seeing old data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We
obsessed about how users were going to react to these odd
variables, but there has been little negative feedback.
FWIW, I think that's a somewhat skewed perception. I think it
was right to introduce those, because we didn't really have any
alternatives.
As far as I can see, the "alternatives" suggested so far are all
about causing heap bloat to progress more slowly, but still without
limit. I suggest, based on a lot of user feedback (from the
customer I've talked about at some length on this thread, as well
as numerous others), that unlimited bloat based on the activity of
one connection is a serious deficiency in the product; and that
there is no real alternative to something like a "snapshot too old"
error if we want to fix that deficiency. Enhancements to associate
a snapshot with a database and using a separate vacuum xmin per
database, not limiting vacuum of a particular object by snapshots
that cannot see that snapshot, etc., are all Very Good Things and I
hope those changes are made, but none of that fixes a very
fundamental flaw.
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for
questions and complaints that I/we see.
Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because
of confusing units amongst other things. Which in turn seems
fairly close to Kevin's suggestions, unfortunately.
Particularly my initial suggestion, which was to base snapshot
"age" it on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
Unless this idea gets some +1 votes Real Soon Now, I will mark the
patch as Rejected and move on.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin,
* Kevin Grittner (kgrittn@ymail.com) wrote:
Magnus Hagander <magnus@hagander.net> wrote:
On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
It seems we already have a mechanism in place that allows
tuning of query cancel on standbys vs. preventing standby
queries from seeing old data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We
obsessed about how users were going to react to these odd
variables, but there has been little negative feedback.
FWIW, I think that's a somewhat skewed perception. I think it
was right to introduce those, because we didn't really have any
alternatives.
As far as I can see, the "alternatives" suggested so far are all
about causing heap bloat to progress more slowly, but still without
limit. I suggest, based on a lot of user feedback (from the
customer I've talked about at some length on this thread, as well
as numerous others), that unlimited bloat based on the activity of
one connection is a serious deficiency in the product; and that
there is no real alternative to something like a "snapshot too old"
error if we want to fix that deficiency. Enhancements to associate
a snapshot with a database and using a separate vacuum xmin per
database, not limiting vacuum of a particular object by snapshots
that cannot see that snapshot, etc., are all Very Good Things and I
hope those changes are made, but none of that fixes a very
fundamental flaw.
I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).
The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.
In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started. Another way of thinking about it is with this
timeline:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
     /\                  /\                /\               /\
Long running         row created       row deleted        vacuum
transaction starts
Now, if all transactions currently running started either before the row
was created or after the row was deleted, then the row is dead and
vacuum can reclaim it. Finding those gaps in the transaction space for
all of the currently running processes might not be cheap, but for this
kind of a use-case, it might be worth the effort. On a high-churn
system where the actual set of rows being copied over constantly isn't
very high, you could end up with some amount of duplication of rows -
those to satisfy the ancient transaction and the current working set,
but there wouldn't be any further bloat and it'd almost certainly be a
much better situation than what is being seen now.
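As a purely hypothetical illustration of the per-tuple test vacuum would
need for a tuple whose inserting and deleting transactions have both
committed (XidInMVCCSnapshot() is static to tqual.c today, so treat this
as pseudocode):

    /*
     * Hypothetical illustration only.  Such a tuple stays interesting only
     * to a snapshot that can see the committed insert but not the committed
     * delete; if no running snapshot falls in that window, the tuple is
     * reclaimable even though it is newer than the global xmin horizon.
     */
    static bool
    TupleDeadToAllSnapshots(TransactionId xmin, TransactionId xmax,
                            Snapshot *snaps, int nsnaps)
    {
        int         i;

        for (i = 0; i < nsnaps; i++)
        {
            bool    insert_visible = !XidInMVCCSnapshot(xmin, snaps[i]);
            bool    delete_visible = !XidInMVCCSnapshot(xmax, snaps[i]);

            if (insert_visible && !delete_visible)
                return false;   /* some snapshot still needs this tuple */
        }
        return true;            /* dead to every running snapshot */
    }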
Particularly my initial suggestion, which was to base snapshot
"age" it on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
Unless this idea gets some +1 votes Real Soon Now, I will mark the
patch as Rejected and move on.
I'm not against having a knob like this, which is defaulted to off,
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up. If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> wrote:
I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).
The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.
In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.
I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.
This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:
(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.
(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.
IMO all of these changes people are working on are very valuable, and
complement each other. This particular approach is likely to be
especially appealing to those moving from Oracle because it is a
familiar mechanism, and one which some of them have written their
software to be aware of and deal with gracefully.
I'm not against having a knob like this, which is defaulted to off,
Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.
+1
But they are not mutually exclusive; I see them as complementary.
If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).
That's how I see it, too. I would not suggest making the default
anything other than "off", but there are situations where it would
be a nice tool to have in the toolbox.
Thanks for the feedback!
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.
But I think Kevin would agree with you there. That's why he's interested in
having the errors not occur if you don't read from the volatile tables. Ie,
your reporting query reading from read-only tables would be guaranteed to
succeed even if some other table had had some rows vacuumed away.
I'm not sure that's really going to work because of things like hint bit
updates or hot pruning. That is, say your table is read-only now but last
week some of the records were updated. You might reasonably expect those
rows to be safe and indeed the rows themselves will still be in your
report. But the earlier versions of the rows could still be sitting on some
pages invisible to every query but not vacuumable quite yet and then
suddenly they get vacuumed away and the LSN on the page gets bumped.
Fwiw I would strongly suggest that instead of having a number of
transactions you have a time difference. I haven't been following the
"commit timestamp" thread but I'm hoping that patch has the infrastructure
needed to look up an lsn and find out the timestamp and vice versa. So when
you take a snapshot you could look up the effective cut-off LSN based on
the time difference specified in the GUC. As a DBA I would find it
infinitely more comforting specifying "old_snapshot_threshold=4h" than
specifying some arbitrary large number of transactions and hoping I
calculated the transaction rate reasonably accurately.
--
greg
* Kevin Grittner (kgrittn@ymail.com) wrote:
Stephen Frost <sfrost@snowman.net> wrote:
I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).
The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.
In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.
I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.
This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:
(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.
This strikes me as a bit off-the-cuff; was an analysis done which
determined that would be the result? If there is overlap between the
long-running cursors then there would be less bloat, and most systems
which I'm familiar with don't turn the entire database over in 20
minutes, 20 hours, or even 20 days except in pretty specific cases.
Perhaps this is one of those, and if so then I'm all wet, but the
feeling I get is that this is a way to dismiss this solution because
it's not what's wanted, which is "what Oracle did."
(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.
That, really, is the crux here- they've already got support for dealing
with it the way Oracle did and they'd like PG to do that too.
Unfortunately, that, by itself, isn't a good reason for a particular
capability (we certainly aren't going to be trying to duplicate PL/SQL
in PG any time soon). That said, there are capabilities in other
RDBMS's which are valuable and which we *do* want, so the fact that
Oracle does this also isn't a reason to not include it.
IMO all of these changes people are working on are very valuable, and
complement each other. This particular approach is likely to be
especially appealing to those moving from Oracle because it is a
familiar mechanism, and one which some of them have written their
software to be aware of and deal with gracefully.
For my 2c, I'd much rather provide them with a system where they don't
have to deal with broken snapshots than give them a way to have them the
way Oracle provided them. :) That said, even the approach Andres
outlined will cause bloat and it may be beyond what's acceptable in some
environments, and it's certainly more complicated and unlikely to get
done in the short term.
I'm not against having a knob like this, which is defaulted to off,
Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)
So, at the time I wrote that, I wasn't sure if it was a +1 or not
myself. I've been thinking about it since then, however, and I'm
leaning more towards having the capability than not, so perhaps that's a
+1, but it doesn't excuse the need to come up with an implementation
that everyone can be happy with and what you've come up with so far
doesn't have a lot of appeal, based on the feedback (I've only glanced
through it myself, but I agree with Andres and Tom that it's a larger
footprint than we'd want for this and the number of places having to be
touched is concerning as it could lead to future bugs).
A lot of that would go away if there was a way to avoid having to mess
with the index AMs, I'd think, but I wonder if we'd actually need more
done there- it's not immediately obvious to me how an index-only scan is
safe with this. Whenever an actual page is visited, we can check the
LSN, and the relation can't be truncated by vacuum since the transaction
will still have a lock on the table which prevents it, but does the
visibility-map update check make sure to never mark pages all-visible
when one of these old transactions is running around? On what basis?
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.
+1
But they are not mutually exclusive; I see them as complementary.
I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)
If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).
That's how I see it, too. I would not suggest making the default
anything other than "off", but there are situations where it would
be a nice tool to have in the toolbox.
Agreed.
Thanks!
Stephen
On Wed, Feb 18, 2015 at 10:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Magnus Hagander <magnus@hagander.net> wrote:
On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com>
wrote:
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for
questions and complaints that I/we see.
Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because
of confusing units amongst other things. Which in turn seems
fairly close to Kevin's suggestions, unfortunately.
Particularly my initial suggestion, which was to base snapshot
"age" on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
Yes, it would definitely look much better. My reference per above was
exactly that - having a setting in the unit "number of xids" confused a lot
of users and made it really hard to tune. Having something in time units is
a lot easier to understand and tune for most people.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Greg Stark <stark@mit.edu> wrote:
On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There might be something in that, but again it's not much like
this patch. The key point I think we're both making is that
nondeterministic failures are bad, especially when you're
talking about long-running, expensive-to-retry transactions.
But I think Kevin would agree with you there. That's why he's
interested in having the errors not occur if you don't read from
the volatile tables. Ie, your reporting query reading from
read-only tables would be guaranteed to succeed even if some
other table had had some rows vacuumed away.
I'm not sure that's really going to work because of things like
hint bit updates or hot pruning. That is, say your table is
read-only now but last week some of the records were updated. You
might reasonably expect those rows to be safe and indeed the rows
themselves will still be in your report. But the earlier versions
of the rows could still be sitting on some pages invisible to
every query but not vacuumable quite yet and then suddenly they
get vacuumed away and the LSN on the page gets bumped.
That's a very interesting point. In the case that the reporting
tables are like that (versus being created right before the run,
for the report), it would argue for either very aggressive
autovacuum settings or an explicit vacuum on those tables before
starting the report.
Fwiw I would strongly suggest that instead of having a number of
transactions you have a time difference.
On the 15th I said this:
| What this discussion has made me reconsider is the metric for
| considering a transaction "too old". The number of transaction IDs
| consumed seems inferior as the unit of measure for that to LSN or
| time.
|
| It looks to me to be pretty trivial (on the order of maybe 30 lines
| of code) to specify this GUC in minutes rather than transaction
| IDs. At first glance this seems like it would be vulnerable to the
| usual complaints about mismanaged clocks, but those are easily
| answered by using a cached time in shared memory that we populate
| in such a way that it can never move backward. Done right, this
| could even allow the GUC to be changeable on reload rather than
| only at restart. A badly mismanaged system clock could not cause a
| query to generate incorrect results; the worst that could happen is
| that this feature would fail to control bloat as much as expected
| or reads of modified data could generate the "snapshot too old"
| error around the time of the clock adjustment.
|
| As before, this would default to a magic value to mean that you
| want the historical PostgreSQL behavior.
|
| If that makes the idea more palatable to anyone, I can submit a
| patch to that effect within the next 24 hours.
Until yesterday I didn't get any feedback suggesting that such a
change would make the patch more palatable. Now that I have had,
I'll try to post a patch to that effect today.
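A minimal sketch of the never-moves-backward cached time idea quoted
above, assuming a small shared-memory struct; all names here are
hypothetical and this is not code from any posted patch:

typedef struct OldSnapshotTimeState
{
    slock_t     mutex;          /* protects current_ts */
    TimestampTz current_ts;     /* latest time handed out; never decreases */
} OldSnapshotTimeState;

static OldSnapshotTimeState *oldSnapshotTimeState;  /* allocated in shared memory */

/*
 * Return the current time for "snapshot too old" purposes, clamped so it
 * can never move backward even if the system clock does.
 */
static TimestampTz
GetOldSnapshotCurrentTimestamp(void)
{
    TimestampTz now = GetCurrentTimestamp();
    TimestampTz result;

    SpinLockAcquire(&oldSnapshotTimeState->mutex);
    if (now > oldSnapshotTimeState->current_ts)
        oldSnapshotTimeState->current_ts = now;     /* clock advanced: follow it */
    result = oldSnapshotTimeState->current_ts;      /* clock went backward: reuse last value */
    SpinLockRelease(&oldSnapshotTimeState->mutex);

    return result;
}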
Thanks!
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Stephen Frost <sfrost@snowman.net> wrote:
Kevin Grittner (kgrittn@ymail.com) wrote:
Stephen Frost <sfrost@snowman.net> wrote:
In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.
I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.
This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:
(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.
This strikes me as a bit off-the-cuff; was an analysis done which
determined that would be the result?
If it sounds like off-the-cuff it is because I am summarizing the
mountain of data which has been accumulated over weeks of
discussions and many rounds of multi-day tests using their actual
software driven by a software test driver that simulates users,
with "think time" and all.
To be usable, we need to run on a particular set of hardware with a
simulated load of 3000 users. In a two-day test without the
patches I'm proposing, their actual, working production software
running against PostgreSQL bloated the database by 37% and was on a
linear rate of increase. Unpatched, CPU time consumption grew at a
linear rate due to the bloat until, in the second day, it saturated the
CPUs and the driver was unable to get acceptable user performance.
Because of that we haven't moved on to test what the bloat rate
would be like with 3000 users, but I assume it would be higher than
with 300 users.
With the two patches I submitted, bloat stabilized at less than 5%
except for some btree indexes which followed a pattern of inserting
at the end and deleting most (but not all) of the entries over
time. I've been working on a separate patch for that, but it's not
ready to present, so I probably won't post that until the first CF
of the next release -- unless someone's interested enough to want
to discuss it before then.
(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.
That, really, is the crux here- they've already got support for dealing
with it the way Oracle did and they'd like PG to do that too.
Unfortunately, that, by itself, isn't a good reason for a particular
capability (we certainly aren't going to be trying to duplicate PL/SQL
in PG any time soon).
I understand that, and if the community doesn't want this style of
limitation on bloat we can make it part of PPAS (which *does*
duplicate PL/SQL, among other things). We are offering it to the
community first.
That said, there are capabilities in other RDBMS's which are
valuable and which we *do* want, so the fact that Oracle does this
also isn't a reason to not include it.
:-)
I'm not against having a knob like this, which is defaulted to off,
Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)
So, at the time I wrote that, I wasn't sure if it was a +1 or not
myself. I've been thinking about it since then, however, and I'm
leaning more towards having the capability than not, so perhaps that's a
+1, but it doesn't excuse the need to come up with an implementation
that everyone can be happy with and what you've come up with so far
doesn't have a lot of appeal, based on the feedback (I've only glanced
through it myself, but I agree with Andres and Tom that it's a larger
footprint than we'd want for this and the number of places having to be
touched is concerning as it could lead to future bugs).
This conversation has gotten a little confusing because some of my
posts seem to have been held up in the email systems for some
reason, and are arriving out-of-order (and at least one I don't see
yet). But I have been responding to these points. For one thing I
stated four days ago that I now feel that the metric for
determining that a snapshot is "old" should be *time*, not
transaction IDs consumed, and offered to modify that patch in that
direction if people agreed. I now see one person agreeing and
several people arguing that I should do just that (as though they
had not seen me propose it), so I'll try to get a new version out
today like that. In any event, I am sure it is possible and feel
that it would be better. (Having posted that now at least 4 times,
hopefully people will pick up on it.)
For another thing, as mentioned earlier, this is the btree footprint:
src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)
That consists mainly of two things. First, there are 13 places
(all in nbtsearch.c), immediately following a call to
BufferGetPage(), where a line like this was added:
TestForOldSnapshot(snapshot, rel, page);
Second, some functions that were not previously passed a snapshot
had that added to the function signature and the snapshot was
passed from the caller where needed. C files other than nbtsearch
were only modified to pass a NULL in to calls to the functions with
modified signatures -- in a total of four places.
That is it.
To be sure I had to initially spend some time figuring out which
BufferGetPage() calls needed to be followed with this test, but
it's not rocket science.
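As a rough illustration of what such a check might look like (this is a
sketch based on the description in this thread, not the patch's actual
code; the SnapshotIsOlderThanThreshold() helper, the snapshot's lsn
field, and the error code are assumptions):

static inline void
TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
{
    /*
     * Only snapshots that have crossed the "old snapshot" threshold are
     * vulnerable, and the LSN comparison is only meaningful for
     * WAL-logged relations.  If the page has been modified since the
     * snapshot was taken (its LSN is newer than the LSN saved in the
     * snapshot), data the snapshot should see may already have been
     * pruned or vacuumed away, so error out rather than risk a wrong
     * answer.
     */
    if (old_snapshot_threshold >= 0 &&
        snapshot != NULL &&
        IsMVCCSnapshot(snapshot) &&
        RelationNeedsWAL(relation) &&
        SnapshotIsOlderThanThreshold(snapshot) &&   /* hypothetical helper */
        PageGetLSN(page) > snapshot->lsn)           /* lsn field added by the patch */
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD), /* assumed error code */
                 errmsg("snapshot too old")));
}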
A lot of that would go away if there was a way to avoid having to mess
with the index AMs, I'd think, but I wonder if we'd actually need more
done there- it's not immediately obvious to me how an index-only scan is
safe with this. Whenever an actual page is visited, we can check the
LSN, and the relation can't be truncated by vacuum since the transaction
will still have a lock on the table which prevents it, but does the
visibility-map update check make sure to never mark pages all-visible
when one of these old transactions is running around? On what basis?
I will review that; it may indeed need some attention. The index
entry is removed before the line pointer can be freed, but the
question is whether there is a window where pruning or vacuum has
removed the tuple and changed the page visibility map bit to say
that everything on the page is visible to all before the index
entry is removed. There may be a need to, as you say, suppress
setting the all-visible flag in some particular cases.
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.
+1
But they are not mutually exclusive; I see them as complementary.
I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)
I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason. If you
have any ideas about how to improve that, I'm all ears. But do
consider the actual scope of what was needed for the btree coverage
(above). (Note that the index-only scan issue doesn't look like
anything AM-specific; if something is needed for that it would be
in the pruning/vacuum area.)
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/19/2015 09:44 AM, Kevin Grittner wrote:
On the 15th I said this:
| What this discussion has made me reconsider is the metric for
| considering a transaction "too old". The number of transaction IDs
| consumed seems inferior as the unit of measure for that to LSN or
| time.
|
| It looks to me to be pretty trivial (on the order of maybe 30 lines
| of code) to specify this GUC in minutes rather than transaction
| IDs. At first glance this seems like it would be vulnerable to the
| usual complaints about mismanaged clocks, but those are easily
| answered by using a cached time in shared memory that we populate
| in such a way that it can never move backward. Done right, this
| could even allow the GUC to be changeable on reload rather than
| only at restart. A badly mismanaged system clock could not cause a
| query to generate incorrect results; the worst that could happen is
| that this feature would fail to control bloat as much as expected
| or reads of modified data could generate the "snapshot too old"
| error around the time of the clock adjustment.
|
| As before, this would default to a magic value to mean that you
| want the historical PostgreSQL behavior.
|
| If that makes the idea more palatable to anyone, I can submit a
| patch to that effect within the next 24 hours.
Until yesterday I didn't get any feedback suggesting that such a
change would make the patch more palatable. Now that I have had,
I'll try to post a patch to that effect today.
Thanks!
I understand why this makes people nervous. I wonder if it might be more
palatable if there were a per-table setting that could enable it? If we
could ensure that this was only applied to high churn queue tables, say,
while tables touched by the report writer would not have it applied,
that would calm a lot of my fears.
I'm also interested in handling the case Stephen Frost described, where
a tuple is effectively dead but we don't currently have the means of
discovering the fact, because there is an older long running transaction
which is never in fact going to be able to see the tuple.
cheers
andrew
On Wed, Feb 18, 2015 at 4:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for
questions and complaints that I/we see.
Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because
of confusing units amongst other things. Which in turn seems
fairly close to Kevin's suggestions, unfortunately.
Particularly my initial suggestion, which was to base snapshot
"age" on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
I like this thought. One of the first things I do in a new Pg environment
is set up a cron job that watches pg_stat_activity and terminates most
backends over N minutes in age (about 5x the length of normal work) with an
exception for a handful of accounts doing backups and other maintenance
operations. This prevents a stuck client from jamming up the database.
Would pg_dump be able to opt-out of such a restriction?
regards,
Rod
On Thu, Feb 19, 2015 at 6:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)
I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason. If you
have any ideas about how to improve that, I'm all ears. But do
consider the actual scope of what was needed for the btree coverage
(above). (Note that the index-only scan issue doesn't look like
anything AM-specific; if something is needed for that it would be
in the pruning/vacuum area.)
If I understood the issue correctly, you have long running snapshots
accessing one part of the data, while you have high churn on a
disjoint part of data. We would need to enable vacuum on the high
churn data while still being able to run those long queries. The
"snapshot too old" solution limits the maximum age of global xmin at
the cost of having to abort transactions that are older than this and
happen to touch a page that was modified after they were started.
What about having the long running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the high
churn tables doesn't have to wait for the old transaction delay to
expire and we avoid a hard to tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort).
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Andrew Dunstan <andrew@dunslane.net> wrote:
On 02/19/2015 09:44 AM, Kevin Grittner wrote:
I understand why this makes people nervous. I wonder if it might be more
palatable if there were a per-table setting that could enable it? If we
could ensure that this was only applied to high churn queue tables, say,
while tables touched by the report writer would not have it applied,
that would calm a lot of my fears.
That's an interesting idea. I think I should switch the unit of
measure for "too old" to a time-based value first, since there
seems to be universal agreement that it would be better than number
of transactions. Once I've cleared that hurdle I'll see what this
would take.
I'm also interested in handling the case Stephen Frost described, where
a tuple is effectively dead but we don't currently have the means of
discovering the fact, because there is an older long running transaction
which is never in fact going to be able to see the tuple.
Absolutely. That's one of several other issues that I've been
looking at over the last few weeks. It sounds like people are
already working on that one, which is great. My personal priority
list included that, but after the two I submitted here and a patch
to allow combining near-empty btree pages so that btree bloat is
constrained without having to reindex periodically for the cases
where index tuples are added in index order (at one or a few
insertion points) and most-but-not-all are deleted. You can
currently wind up with a worst-case of one index tuple per page
with no action to reduce that bloat by vacuum processes.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 19, 2015 at 3:44 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
I'm also interested in handling the case Stephen Frost described, where
a tuple is effectively dead but we don't currently have the means of
discovering the fact, because there is an older long running transaction
which is never in fact going to be able to see the tuple.
Absolutely. That's one of several other issues that I've been
looking at over the last few weeks. It sounds like people are
already working on that one, which is great. My personal priority
list included that, but after the two I submitted here and a patch
to allow combining near-empty btree pages so that btree bloat is
constrained without having to reindex periodically for the cases
where index tuples are added in index order (at one or a few
insertion points) and most-but-not-all are deleted. You can
currently wind up with a worst-case of one index tuple per page
with no action to reduce that bloat by vacuum processes.
I'd be willing to test that patch.
I have a big database that does that, and a script that fills the disk
with said bloat. That's forced me to do periodic reindexing, which
sucks.
Rod Taylor <rod.taylor@gmail.com> wrote:
Would pg_dump be able to opt-out of such a restriction?
I don't see how, since vacuum would be removing recently dead
tuples that are still visible; the alternative to getting a
"snapshot too old" error when reading a page which could be
affected is to return incorrect results, and nobody wants that.
The best you could do, if you wanted to run pg_dump (or similar) and
it might take more time than your old_snapshot_threshold, would be
to increase the threshold, reload, dump, set it back to the
"normal" setting, and reload again.
Andrew's suggestion of setting this per table would not help here.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ants Aasma <ants@cybertec.at> wrote:
If I understood the issue correctly, you have long running snapshots
accessing one part of the data, while you have high churn on a
disjoint part of data. We would need to enable vacuum on the high
churn data while still being able to run those long queries. The
"snapshot too old" solution limits the maximum age of global xmin at
the cost of having to abort transactions that are older than this and
happen to touch a page that was modified after they were started.
Pretty much. It's not quite as "black and white" regarding high
churn and stable tables, though -- at least for the customer whose
feedback drove my current work on containing bloat. They are
willing to tolerate an occasional "snapshot too old" on a cursor,
and they can automatically pick up again by restarting the cursor
with a new "lower limit" and a new snapshot.
What about having the long running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the high
churn tables doesn't have to wait for the old transaction delay to
expire and we avoid a hard to tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort).
Let me make sure I understand your suggestion. You are suggesting
that within a transaction you can declare a list of tables which
should get the "snapshot too old" error (or something like it) if a
page is read which was modified after the snapshot was taken?
Snapshots within that transaction would not constrain the effective
global xmin for purposes of vacuuming those particular tables?
If I'm understanding your proposal, that would help some of the
cases the "snapshot too old" case addresses, but it would not
handle the accidental "idle in transaction" case (e.g., start a
repeatable read transaction, run one query, then sit idle
indefinitely). I don't think it would work quite as well for some
of the other cases I've seen, but perhaps it could be made to fit
if we could figure out the accidental "idle in transaction" case.
I also think it might be hard to develop a correct global xmin for
vacuuming a particular table with that solution. How do you see
that working?
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/19/2015 03:31 PM, Kevin Grittner wrote:
What about having the long running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the high
churn tables doesn't have to wait for the old transaction delay to
expire and we avoid a hard to tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort).
Let me make sure I understand your suggestion. You are suggesting
that within a transaction you can declare a list of tables which
should get the "snapshot too old" error (or something like it) if a
page is read which was modified after the snapshot was taken?
Snapshots within that transaction would not constrain the effective
global xmin for purposes of vacuuming those particular tables?
I thought it meant that the declared tables would only be vacuumed
conservatively, so the transaction would expect not to see "snapshot too
old" from them.
cheers
andrew
On Feb 19, 2015 10:31 PM, "Kevin Grittner" <kgrittn@ymail.com> wrote:
What about having the long running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the high
churn tables doesn't have to wait for the old transaction delay to
expire and we avoid a hard to tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort).
Let me make sure I understand your suggestion. You are suggesting
that within a transaction you can declare a list of tables which
should get the "snapshot too old" error (or something like it) if a
page is read which was modified after the snapshot was taken?
Snapshots within that transaction would not constrain the effective
global xmin for purposes of vacuuming those particular tables?
Sorry, I should have been clearer. I'm proposing that a transaction can
declare what tables it will access. After that the transaction will
constrain xmin for only those tables. Accessing any other table would give
an error immediately.
If I'm understanding your proposal, that would help some of the
cases the "snapshot too old" case addresses, but it would not
handle the accidental "idle in transaction" case (e.g., start a
repeatable read transaction, run one query, then sit idle
indefinitely). I don't think it would work quite as well for some
of the other cases I've seen, but perhaps it could be made to fit
if we could figure out the accidental "idle in transaction" case.
Accidental idle in transaction seems better handled by just terminating
transactions older than some timeout. This is already doable externally,
but an integrated solution would be nice if we could figure out how the
permissions for setting such a timeout work.
I also think it might be hard to develop a correct global xmin for
vacuuming a particular table with that solution. How do you see
that working?
I'm imagining the long running transaction would register its xmin in a
shared map keyed by relation for each referenced relation and remove the
xmin from procarray. Vacuum would access this map by relation, determine
the minimum and use that if it's earlier than the global xmin. I'm being
deliberately vague here about the datastructure in shared memory as I don't
have a great idea what to use there. It's somewhat similar to the lock
table in that in theory the size is unbounded, but in practice it's
expected to be relatively tiny.
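A rough sketch of what an entry in such a shared map might look like;
all names here are hypothetical and this is purely illustrative, not
code from any posted patch:

typedef struct WorkingSetXminKey
{
    Oid         dbOid;              /* database containing the relation */
    Oid         relOid;             /* relation declared in the working set */
} WorkingSetXminKey;

typedef struct WorkingSetXminEntry
{
    WorkingSetXminKey key;
    TransactionId oldest_declared_xmin;  /* oldest xmin among declaring transactions */
    int         refcount;                /* number of declaring transactions still running */
} WorkingSetXminEntry;

/*
 * Vacuum of a given relation would take the minimum of this entry's
 * oldest_declared_xmin (if an entry exists) and the usual global xmin
 * computed while ignoring transactions whose declared working set does
 * not include that relation.
 */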
Regards,
Ants Aasma
Kevin,
* Kevin Grittner (kgrittn@ymail.com) wrote:
Stephen Frost <sfrost@snowman.net> wrote:
Kevin Grittner (kgrittn@ymail.com) wrote:
(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.
This strikes me as a bit off-the-cuff; was an analysis done which
determined that would be the result?
If it sounds like off-the-cuff it is because I am summarizing the
mountain of data which has been accumulated over weeks of
discussions and many rounds of multi-day tests using their actual
software driven by a software test driver that simulates users,
with "think time" and all.
My apologies for, unintentionally, implying that you hadn't done
sufficient research or testing. That certainly was *not* my intent and
it's absolutely clear to me that you've spent quite a bit of time
analyzing this problem.
What I was trying to get at is that the approach which I outlined was,
perhaps, not closely analyzed. Now, analyzing a problem based on a
solution which doesn't actually exist isn't exactly trivial, and while
it might be possible to do in this case it'd probably be easier to
simply write the code and then review the results than perform an
independent analysis. I do feel that the solution might have been
dismissed on the basis of "doubling the database for each long-running
transaction is not acceptable" or even "having bloat because of a
long-running transaction is not acceptable," but really analyzing what
would actually happen is a much more difficult exercise.
With the two patches I submitted, bloat stabilized at less than 5%
except for some btree indexes which followed a pattern of inserting
at the end and deleting most (but not all) of the entries over
time. I've been working on a separate patch for that, but it's not
ready to present, so I probably won't post that until the first CF
of the next release -- unless someone's interested enough to want
to discuss it before then.
Understood. I'd encourage you to post your thoughts on it- I'm
certainly interested in it, even if it isn't something we'll be able to
really work on for 9.5.
That, really, is the crux here- they've already got support for dealing
with it the way Oracle did and they'd like PG to do that too.
Unfortunately, that, by itself, isn't a good reason for a particular
capability (we certainly aren't going to be trying to duplicate PL/SQL
in PG any time soon).
I understand that, and if the community doesn't want this style of
limitation on bloat we can make it part of PPAS (which *does*
duplicate PL/SQL, among other things). We are offering it to the
community first.
That is certainly appreciated and I'd encourage you to continue to do
so. I do feel that the community, in general, is interested in these
kinds of knobs- it's just a question of implementation and making sure
that it's a carefully considered solution and not a knee-jerk reaction
to match what another RDBMS does.
I'm not against having a knob like this, which is defaulted to off,
Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)
So, at the time I wrote that, I wasn't sure if it was a +1 or not
myself. I've been thinking about it since then, however, and I'm
leaning more towards having the capability than not, so perhaps that's a
+1, but it doesn't excuse the need to come up with an implementation
that everyone can be happy with and what you've come up with so far
doesn't have a lot of appeal, based on the feedback (I've only glanced
through it myself, but I agree with Andres and Tom that it's a larger
footprint than we'd want for this and the number of places having to be
touched is concerning as it could lead to future bugs).
This conversation has gotten a little confusing because some of my
posts seem to have been held up in the email systems for some
reason, and are arriving out-of-order (and at least one I don't see
yet). But I have been responding to these points. For one thing I
stated four days ago that I now feel that the metric for
determining that a snapshot is "old" should be *time*, not
transaction IDs consumed, and offered to modify that patch in that
direction if people agreed. I now see one person agreeing and
several people arguing that I should do just that (as though they
had not seen me propose it), so I'll try to get a new version out
today like that. In any event, I am sure it is possible and feel
that it would be better. (Having posted that now at least 4 times,
hopefully people will pick up on it.)
I agree with others that having it be time-based would be better. I
did see that but it's really an independent consideration from the
footprint of the patch.
For another thing, as mentioned earlier, this is the btree footprint:
src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)
Understood- but that's not really the issue here, it's...
That consists mainly of two things. First, there are 13 places
(all in nbtsearch.c), immediately following a call to
BufferGetPage(), where a line like this was added:
TestForOldSnapshot(snapshot, rel, page);
This. That's quite a few different places which had to be updated and
while they are obvious now, it might not be obvious in the future which
BufferGetPage() calls need a following TestForOldSnapshot() call.
Second, some functions that were not previously passed a snapshot
had that added to the function signature and the snapshot was
passed from the caller where needed. C files other than nbtsearch
were only modified to pass a NULL in to calls to the functions with
modified signatures -- in a total of four places.
This part doesn't bother me nearly as much as having to add those tests
into a lot of different places in the code, and having to get the index
AMs involved.
A lot of that would go away if there was a way to avoid having to mess
with the index AMs, I'd think, but I wonder if we'd actually need more
done there- it's not immediately obvious to me how an index-only scan is
safe with this. Whenever an actual page is visited, we can check the
LSN, and the relation can't be truncated by vacuum since the transaction
will still have a lock on the table which prevents it, but does the
visibility-map update check make sure to never mark pages all-visible
when one of these old transactions is running around? On what basis?
I will review that; it may indeed need some attention. The index
entry is removed before the line pointer can be freed, but the
question is whether there is a window where pruning or vacuum has
removed the tuple and changed the page visibility map bit to say
that everything on the page is visible to all before the index
entry is removed. There may be a need to, as you say, suppress
setting the all-visible flag in some particular cases.
Right, there may be a period of time where the flag is set to
all-visible and the index entry still exists.
Isn't there a concern about the index entry disappearing also though?
If there is a specific query against a particular value in the index and
the index no longer has that value then the query would return zero
results- but this could happen without visiting the heap, right?
Therefore, how do you know that the value was, or wasn't, visible at the
time that the original long-running snapshot was taken?
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.
+1
But they are not mutually exclusive; I see them as complementary.
I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)
I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason. If you
have any ideas about how to improve that, I'm all ears. But do
consider the actual scope of what was needed for the btree coverage
(above). (Note that the index-only scan issue doesn't look like
anything AM-specific; if something is needed for that it would be
in the pruning/vacuum area.)
Maintenance is certainly one concern among many but I'm not sure that
we're quite ready to write this capability off due to the
maintainability of this particular patch- it needs to be considered
against the alternatives proposed and the risk of breaking this
particular capability in the future. I agree that the index issue is
not AM-specific but it's still something to consider.
Thanks!
Stephen
On 2/19/15 1:54 PM, Kevin Grittner wrote:
Rod Taylor <rod.taylor@gmail.com> wrote:
Would pg_dump be able to opt-out of such a restriction?
I don't see how, since vacuum would be removing recently dead
tuples that are still visible; the alternative to getting a
"snapshot too old" error when reading a page which could be
affected is to return incorrect results, and nobody wants that.
The best you could do if you wanted to run pg_dump (or similar) and
it might take more time than your old_snapshot_threshold would be
to increase the threshold, reload, dump, set it back to the
"normal" setting, and reload again.
While I think pg_dump is a great solution for small to medium
installations, there are a number of excellent file-based backup options
available. Anyone who is seriously worried about bloat (or locking)
should be looking to those solutions.
--
- David Steele
david@pgmasters.net
Stephen Frost <sfrost@snowman.net> wrote:
Kevin Grittner (kgrittn@ymail.com) wrote:
With the two patches I submitted, bloat stabilized at less than 5%
except for some btree indexes which followed a pattern of inserting
at the end and deleting most (but not all) of the entries over
time. I've been working on a separate patch for that, but it's not
ready to present, so I probably won't post that until the first CF
of the next release -- unless someone's interested enough to want
to discuss it before then.
Understood. I'd encourage you to post your thoughts on it- I'm
certainly interested in it, even if it isn't something we'll be able to
really work on for 9.5.
OK, if you're up for discussing it even while it's in early
development, I'll start a separate thread to outline the idea. As
a preliminary "teaser" in very abbreviated form, the idea is to
basically use the "page split" logic, but splitting all tuples from
multiple adjacent pages onto a new page to the right. The old tuples
maintain their pointers (to support scans positioned in that range),
but all pointers *to* those pages will be redirected to the new
page. Let's not bifurcate this thread by discussing it further
here; please wait until I start the new thread, which will be after
I post the next version of this "snapshot too old" patch.
Isn't there a concern about the index entry disappearing also though?
If there is a specific query against a particular value in the index and
the index no longer has that value then the query would return zero
results- but this could happen without visiting the heap, right?
Right; that's why I don't see how to avoid touching the index AMs.
The index pages also contain the LSN, and if you are not visiting
the heap it is because some page in the index was changed, and you
will see the "too recent" LSN there.
Therefore, how do you know that the value was, or wasn't, visible at the
time that the original long-running snapshot was taken?
Strictly speaking, you don't know anything that specific. You know
that changes were made that *might* have removed tuples that would
have been visible to the snapshot. Once a snapshot passes the
threshold for being considered "old", then viewing any page changed
after the snapshot was built has to trigger the error. That means
some false positives, but as long as you can't get false negatives
there is never a silently incorrect query result.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 16, 2015 at 10:49 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
What this discussion has made me reconsider is the metric for
considering a transaction "too old". The number of transaction IDs
consumed seems inferior as the unit of measure for that to LSN or
time.
It looks to me to be pretty trivial (on the order of maybe 30 lines
of code) to specify this GUC in minutes rather than transaction
IDs.
It seems to me that SQL Server also uses a similar mechanism to
avoid bloat in the version store (the place to store previous versions
of records).
The mechanism used by them is that during the shrink process, the
longest running transactions that have not yet generated row
versions are marked as victims. If a transaction is marked as a
victim, it can no longer read the row versions in the version store.
When it attempts to read row versions, an error message is
generated and the transaction is rolled back.
I am not sure how much weight it adds to the usefulness of this
patch, however I think if other leading databases provide a way to
control the bloat, it indicates that most of the customers having
write-intesive workload would like to see such an option.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems to me that SQL Server also uses a similar mechanism to
avoid bloat in the version store (the place to store previous
versions of records).
I think if other leading databases provide a way to control the
bloat, it indicates that most of the customers having
write-intensive workloads would like to see such an option.
Yeah, I think many users are surprised by the damage that can be
done to a database by an idle connection or long-running read-only
query, especially when coming from other databases which have
protections against that.
The bad news is that changing to specifying the limit in time
rather than transaction IDs is far more complicated than I thought.
Basically, we need to associate times in the past with the value of
latestCompletedXid (or derived values like snapshot xmin) at those
times. I was initially thinking that by adding a timestamp to the
snapshot we could derive this from active snapshots. For any one
connection, it's not that hard for it to scan the pairingheap of
snapshots and pick out the right values; but the problem is that it
is process-local memory and this needs to work while the connection
is sitting idle for long periods of time. I've got a couple ideas
on how to approach it, but neither seems entirely satisfying:
(1) Use a new timer ID to interrupt the process whenever its
oldest snapshot which hasn't yet crossed the "old snapshot"
threshold does so. The problem is that this seems as though it
would need to do an awful lot more work (scanning the pairing heap
for the best match) than anything else is currently doing in a
timeout handler. One alternative would be to keep a second pairing
heap to track snapshots which have not crossed the threshold,
removing items as they get old -- that would make the work needed
in the timeout handler closer to other handlers.
(2) Use a coarse enough granularity on time and a short enough
maximum for the GUC to just keep a circular buffer of the mappings
in memory. We might be able to make this dense enough that one
minute resolution for up to 60 days could fit in 338kB. Obviously
that could be reduced with coarser granularity or a shorter
maximum.
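A sketch of what option (2) might look like in shared memory, assuming
one slot per minute and hypothetical names (60 * 24 * 60 = 86,400 slots
at 4 bytes per TransactionId is roughly the 338kB mentioned above):

#define OLD_SNAPSHOT_TIME_MAP_ENTRIES   (60 * 24 * 60)  /* one per minute for 60 days */

typedef struct OldSnapshotTimeMap
{
    int           head_offset;      /* array index of the newest minute */
    TimestampTz   head_timestamp;   /* minute (truncated) that head_offset represents */
    int           count_used;       /* number of slots currently holding valid data */
    TransactionId xid_by_minute[OLD_SNAPSHOT_TIME_MAP_ENTRIES];
} OldSnapshotTimeMap;

/*
 * To find the xmin that vacuum may use, look up the slot for
 * (current time - old_snapshot_threshold); tuples dead to that xmin can
 * be removed even though an older snapshot still exists, because such a
 * snapshot will get a "snapshot too old" error if it later reads a page
 * modified after it was taken.
 */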
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/22/2015 11:48 AM, Kevin Grittner wrote:
(2) Use a coarse enough granularity on time and a short enough
maximum for the GUC to just keep a circular buffer of the mappings
in memory. We might be able to make this dense enough that one
minute resolution for up to 60 days could fit in 338kB. Obviously
that could be reduced with coarser granularity or a shorter
maximum.
This doesn't sound too bad to me. Presumably these would be tunable. I
think one minute granularity would be fine for most purposes, but I can
imagine people who would want it finer, like 10 seconds, say.
cheers
andrew
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems to me that SQL Server also uses a similar mechanism to
avoid bloat in the version store (the place to store previous
versions of records).
I think if other leading databases provide a way to control the
bloat, it indicates that most of the customers having
write-intensive workloads would like to see such an option.
Yeah, I think many users are surprised by the damage that can be
done to a database by an idle connection or long-running read-only
query, especially when coming from other databases which have
protections against that.
The bad news is that changing to specifying the limit in time
rather than transaction IDs is far more complicated than I thought.
Basically, we need to associate times in the past with the value of
latestCompletedXid (or derived values like snapshot xmin) at those
times. I was initially thinking that by adding a timestamp to the
snapshot we could derive this from active snapshots. For any one
connection, it's not that hard for it to scan the pairingheap of
snapshots and pick out the right values; but the problem is that it
is process-local memory and this needs to work while the connection
is sitting idle for long periods of time.
Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch, in
the function TestForOldSnapshot()?
I've got a couple ideas
on how to approach it, but neither seems entirely satisfying:
(1) Use a new timer ID to interrupt the process whenever its
oldest snapshot which hasn't yet crossed the "old snapshot"
threshold does so. The problem is that this seems as though it
would need to do an awful lot more work (scanning the pairing heap
for the best match) than anything else is currently doing in a
timeout handler. One alternative would be to keep a second pairing
heap to track snapshots which have not crossed the threshold,
removing items as they get old -- that would make the work needed
in the timeout handler closer to other handlers.
(2) Use a coarse enough granularity on time and a short enough
maximum for the GUC to just keep a circular buffer of the mappings
in memory. We might be able to make this dense enough that one
minute resolution for up to 60 days could fit in 338kB. Obviously
that could be reduced with coarser granularity or a shorter
maximum.
How exactly will this allow returning the "snapshot too old" error? Do we
need a check on the page (page LSN) as in your current patch?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> wrote:
Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch, in
the function TestForOldSnapshot()?
It works fine for the additional visibility checking in scans, but
it doesn't cover the vacuuming -- that needs to use a transaction ID
for its cutoff.
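To spell that out a bit (building on the circular-buffer sketch from
earlier in the thread, with names that are again purely illustrative), the
idea is that vacuum would translate "now minus the threshold" back into a
transaction ID through that map, and adopt the result only if it is newer
than the xmin it would otherwise have computed -- so the cutoff can be
advanced but never moved backward:

/*
 * Illustrative sketch only: derive a transaction-ID cutoff for vacuum from
 * the time-based threshold, using the OldSnapshotTimeMap sketched upthread.
 * The caller would use the result only if it follows the normally computed
 * xmin, so this can only ever make vacuum more aggressive.
 */
static TransactionId
OldSnapshotVacuumCutoff(OldSnapshotTimeMap *map, int64 now_minutes,
                        int threshold_minutes, TransactionId normal_xmin)
{
    int64       target = now_minutes - threshold_minutes;
    int64       back;
    int         offset;

    if (map->count_used == 0)
        return normal_xmin;     /* nothing recorded yet; no early cutoff */

    back = map->head_timestamp - target;
    if (back < 0)
        back = 0;                       /* target is newer than newest bucket */
    if (back >= map->count_used)
        back = map->count_used - 1;     /* clamp to the oldest bucket kept */

    offset = (int) ((map->head_offset - back + OLD_SNAPSHOT_TIME_MAP_ENTRIES)
                    % OLD_SNAPSHOT_TIME_MAP_ENTRIES);
    return map->xid_by_minute[offset];
}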
How exactly will this allow returning the "snapshot too old" error? Do we
need a check on the page (page LSN) as in your current patch?
The change to the part where it actually throws the error is very
small, just checking time instead of xmin.
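For illustration, that check could end up looking something like the sketch
below. The snapshot fields and GUC used here (whenTaken, a per-snapshot
lsn, old_snapshot_threshold_ms, and a new ERRCODE_SNAPSHOT_TOO_OLD) are
placeholders for whatever we settle on, not the identifiers in the current
patch; the LSN comparison itself is unchanged, so we still only throw the
error when the page has actually been modified since the snapshot was taken.

/* Illustrative sketch only -- not the code in the current patch. */
#include "postgres.h"
#include "storage/bufpage.h"    /* PageGetLSN */
#include "utils/rel.h"          /* RelationNeedsWAL */
#include "utils/snapshot.h"
#include "utils/timestamp.h"    /* GetCurrentTimestamp */

/* hypothetical GUC: threshold in milliseconds, -1 disables the feature */
static int  old_snapshot_threshold_ms = -1;

static inline void
TestForOldSnapshotByTime(Snapshot snapshot, Relation relation, Page page)
{
    /*
     * "Too old" is decided by the age of the snapshot rather than by how
     * many transaction IDs have been consumed since it was taken.
     * whenTaken and lsn are assumed new snapshot fields (TimestampTz and
     * XLogRecPtr); GetCurrentTimestamp() returns microseconds.
     */
    if (old_snapshot_threshold_ms >= 0 &&
        RelationNeedsWAL(relation) &&
        GetCurrentTimestamp() - snapshot->whenTaken >
            (int64) old_snapshot_threshold_ms * 1000 &&
        PageGetLSN(page) > snapshot->lsn)
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),    /* new errcode to add */
                 errmsg("snapshot too old")));
}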
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17 February 2015 at 06:35, Magnus Hagander <magnus@hagander.net> wrote:
On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.
FWIW, I think that's a somewhat skewed perception. I think it was right to
introduce those, because we didn't really have any alternatives.
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for questions
and complaints that I/we see.
Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because of
confusing units amongst other things. Which in terms of units seems fairly
close to Kevin's suggestions, unfortunately.
I agree with Bruce that we already have a mechanism similar to this
for Hot Standby, so it should therefore be OK to have it for normal
running when users desire it. The key difference here is that in this
patch we use the LSN to look for changed data blocks, allowing queries
to proceed for longer than they would do on Hot Standby where they
currently just get blown away regardless. I like the proposal in this
patch but it is more extreme than the mechanism Hot Standby provides.
(If we implement this then I would want to see it work for Hot Standby
also, so we can keep the mechanisms in step; though I think that is too late
in the day to add that for 9.5.)
I also agree with Andres and Magnus in saying that in the current
definition of the tunable parameter it would be hard to use.
In response to Tom's perspective that it is the single most annoying
feature of Oracle, I agree. It just happens we have a similar problem:
bloat.
Having a parameter to define "maximum acceptable bloat" would be a
very useful thing in PostgreSQL. IIRC it took other DBMSs a lot of work to
make "snapshot too old" occur in controllable circumstances. So I
would call having a very crap parameter like "old_snapshot_limit" a
good start, but clearly not the end point of an initiative to improve
our control of bloat.
+1 to include this patch in 9.5, as long as there is agreement to
improve this in the future.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Would a parameter to auto-terminate long-running transactions be a better solution? Terminating the long-running transaction would allow the normal vacuuming process to clean up the deleted records, thus avoiding database bloat without introducing new semantics to Postgres's MVCC. I would also recommend that the default be disabled.
On 16/03/15 13:04, Rui DeSousa wrote:
Would a parameter to auto-terminate long-running transactions be a better solution? Terminating the long-running transaction would allow the normal vacuuming process to clean up the deleted records, thus avoiding database bloat without introducing new semantics to Postgres's MVCC. I would also recommend that the default be disabled.
Hmm...
Better per transaction, or even per statement.
If I fire off a query that I expect to complete within 30 seconds,
then I probably don't mind it aborting after 5 minutes. However, if I
fire off a query that I expect to take between an hour and 3 hours, then
I definitely don't want it aborted after 5 minutes!
It could be that 99.99% of transactions will complete within a minute, but
a small minority might reasonably be expected to take much longer.
Cheers,
Gavin
On Sun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <rui@crazybean.net> wrote:
Would a parameter to auto-terminate long-running transactions be a better solution? Terminating the long-running transaction would allow the normal vacuuming process to clean up the deleted records, thus avoiding database bloat without introducing new semantics to Postgres's MVCC. I would also recommend that the default be disabled.
An advantage of Kevin's approach is that you don't have to abort *all*
long-running transactions, only those that touch data which has been
modified since then. If your system is read-mostly, that could
dramatically reduce the number of aborts.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Mar 16, 2015 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <rui@crazybean.net> wrote:
Would a parameter to auto-terminate long-running transactions be a
better solution? Terminating the long-running transaction would allow
the normal vacuuming process to clean up the deleted records, thus avoiding
database bloat without introducing new semantics to Postgres's MVCC. I
would also recommend that the default be disabled.
An advantage of Kevin's approach is that you don't have to abort *all*
long-running transactions, only those that touch data which has been
modified since then. If your system is read-mostly, that could
dramatically reduce the number of aborts.
Indeed. The suggestion of auto-terminating long-running transactions misses
the point, ISTM. Most of the use cases I can see for this involve vacuuming
tables that the long-running transaction will have no interest in at all
(which is why I suggested being able to set it on a per-table basis). I
certainly don't want to be terminating my long-running report transaction
so that my queue table which is only ever touched by very short-lived
transactions can be reasonably vacuumed. But that's what we have to do now.
cheers
andrew
On Sun, Nov 8, 2020 at 5:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
On 2/15/15 10:36 AM, Tom Lane wrote:
I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.
A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.
Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.
An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.
There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad,
Based on my experience, long transactions caused by long-running queries
are not that common. A more common case I have met is a user starting a
transaction but forgetting to close it; I am wondering if we can do
something for this situation.
Since most users will use the Read Committed isolation level, after a user
completes a query the next query will use a fresh snapshot, so there is no
need to hold back the oldest xmin because of this. Would it be correct to
advance the oldest xmin in this case? If so, what would be the blocker in
implementing this feature?
--
Best Regards
Andy Fan