BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Started by PG Bug reporting formover 5 years ago28 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16722
Logged by: Sergey Koposov
Email address: skoposov@ed.ac.uk
PostgreSQL version: 11.9
Operating system: debian
Description:

Hi,

When ingesting a billion or so rows in the table that has some array columns
(see schema below), at some point all the backends doing the ingestion hang
with 100% CPU usage. When investigating, I think I can trace this to the
limit of 2^32 toasted records per table. See the gdb full backtrace of
hanged backends in the bottom. When the problem occurs, it has 272 mill
records.
Basically the hanging happens in GetNewOidWithIndex called by
toast_save_datum.
While I understand the limit for the toast number is there to stay, but the
behaviour of PG next to the limit is IMO a bug (or at least non-trivial) .

I would rather prefer to see an error-message as opposed to backends
hanging.
I see a similar report BUG #14137, but it didn't seem to receive any reply.

If it matters, the ingestion inside the table is done by 10 parallel
backends doing COPY in the table. (I don't know if it affects the way the
objids are allocated
for toast objects)

Thanks,
Sergey Koposov

GDB output
*****************
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
No locals.
#1 0x00000000006b3f42 in PGSemaphoreLock (sema=0x7f607fe1cbb8)
at pg_sema.c:316
errStatus = <optimized out>
__func__ = "PGSemaphoreLock"
#2 0x000000000071ea14 in LWLockAcquire (lock=0x7f689257bf20,
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1244
mustwait = <optimized out>
proc = 0x7f68adeb46b0
result = true
extraWaits = 0
__func__ = "LWLockAcquire"
#3 0x0000000000700440 in StartBufferIO (buf=buf@entry=0x7f6083a2e640,
forInput=forInput@entry=true) at bufmgr.c:3887
buf_state = <optimized out>
#4 0x000000000070204d in BufferAlloc (foundPtr=0x7fff36ee18db,
strategy=0x0,
blockNum=7605075, forkNum=MAIN_FORKNUM, relpersistence=<optimized out>,

smgr=0x130f380) at bufmgr.c:1251
newPartitionLock = 0x7f607fe24080
oldTag = {rnode = {spcNode = 20206528, dbNode = 0, relNode =
8688143},
forkNum = MAIN_FORKNUM, blockNum = 8469712}
oldPartitionLock = <optimized out>
buf = 0x7f6083a2e640
oldHash = <optimized out>
oldFlags = <optimized out>
buf_id = <optimized out>
newTag = {rnode = {spcNode = 1663, dbNode = 16818,
relNode = 2229524624}, forkNum = MAIN_FORKNUM, blockNum =
7605075}
newHash = 4182877260
valid = false
buf_state = 0
#5 ReadBuffer_common (smgr=0x130f380, relpersistence=<optimized out>,
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=7605075,
mode=RBM_NORMAL, strategy=0x0, hit=0x7fff36ee198f) at bufmgr.c:745
bufHdr = <optimized out>
bufBlock = <optimized out>
found = true
isExtend = false
isLocalBuf = <optimized out>
__func__ = "ReadBuffer_common"
#6 0x00000000007023e1 in ReadBufferExtended (reln=0x7f607fd9bc08,
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=7605075,
mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at
bufmgr.c:664
hit = false
buf = <optimized out>
__func__ = "ReadBufferExtended"
#7 0x00000000007024eb in ReadBuffer (blockNum=<optimized out>,
reln=<optimized out>) at bufmgr.c:596
No locals.
#8 ReleaseAndReadBuffer (buffer=buffer@entry=20373756,
relation=<optimized out>, blockNum=<optimized out>) at bufmgr.c:1545
bufHdr = <optimized out>
#9 0x00000000004c6db2 in index_fetch_heap (scan=0x0,
scan@entry=0x136e098)
at indexam.c:595
prev_buf = 20373756
tid = 0x136e0fc
all_dead = false
got_heap_tuple = <optimized out>
#10 0x00000000004c6e00 in index_getnext (scan=0x136e098,
direction=direction@entry=ForwardScanDirection) at indexam.c:691
heapTuple = <optimized out>
tid = <optimized out>
#11 0x00000000004c6299 in systable_getnext
(sysscan=sysscan@entry=0x136dec8)
at genam.c:425
htup = <optimized out>
__func__ = "systable_getnext"
#12 0x000000000050cd09 in GetNewOidWithIndex (relation=0x7f607fd9bc08,
indexId=310933857, oidcolumn=1) at catalog.c:371
newOid = 2744075972
scan = 0x136dec8
key = {sk_flags = 0, sk_attno = 1, sk_strategy = 3, sk_subtype = 0,

sk_collation = 100, sk_func = {fn_addr = 0x7b6320 <oideq>,
fn_oid = 184, fn_nargs = 2, fn_strict = true, fn_retset = false,

fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x13453c0,
fn_expr = 0x0}, sk_argument = 2744075972}
collides = <optimized out>
#13 0x00000000004c3063 in toast_save_datum (value=20557421, oldexternal=0x0,

options=0, rel=<optimized out>, rel=<optimized out>) at
tuptoaster.c:1574
toastrel = 0x7f607fd9bc08
toastidxs = 0x136bba0
toasttup = 0x7f607fd98540
t_values = {300189817, 0, 140734114962192}
t_isnull = {false, false, false}
result = 0xfffffffffffffe00
chunk_data = {hdr = {vl_len_ = "\340\000\000",
vl_dat = 0x7fff36ee1b14 "\001"},
data =
"\340\000\000\000\001\000\000\000\000\000\000\000\274\002\000\000\b\000\000\000\001\000\000\000̥\203@\225\002\365@z\210,A/<gA\a\233}AYf\210Aɀ\215AД\226A.U)?\320\325.?\242aA?~\311:?i\215\211?\334\020\203?\000\000\000\000\272
\354@\222\202ي\321\070\354@\356\274\001ԀM\354@\"\272o\016nf\354@Ӯ\327N\275z\354@\242\250\"K\365\223\354@\372k\252\305D\250\354@",
'\000' <repeats 1696 times>..., align_it = 224}
chunk_size = 0
num_indexes = 1
validIndex = 0
#14 0x00000000004c45a2 in toast_insert_or_update (rel=0x7f607fd983a8,
newtup=0x0, oldtup=0xb5, options=-1) at tuptoaster.c:867
old_value = 20557421
tupleDesc = 0x12beb88
numAttrs = 181
need_change = 69
need_free = 109
toast_action = "pp pppp ppppppxppp ", 'p' <repeats 51 times>,
"xxxxxxxxxx ", 'x' <repeats 12 times>, 'p' <repeats 61 times>, "
ppppppxxppppppxxpppppppp", '\000' <repeats 1418 times>
toast_isnull = {false <repeats 1600 times>}
toast_oldisnull = {false <repeats 1600 times>}
toast_values = {9011, 508863, 20557040, 2663, 1, 2048, 130,
20557062,
4642831835879822414, 4629766747987641402, 1478912696, 1475380501,

1151322629, 1152363744, 20557104, 1018397949, 4678179332087438334,

4678191406000876033, 20557168, 1325489961962492288,
18446744072652356428, 18446744072655684425, 1052577735,
1094195074,
1088406363, 1106636035, 1157361664, 1100416689, 1139960709, 319,

1100668600, 1108163953, 29, 1100071362, 1111486228, 33, 0, 0, 0,

300, 297, 1051787629, 0, 2143289344, 2143289344, 1067470643,
1052143608, 31, 1099325834, 1107053637, 1109951334, 1096537497,
1083978235, 1077223218, 18446744072661571125, 1115025818,
1101955695, 1117711694, 1081872323, 1066791225, 987449489,
930850946, 1095769653, 1103322737, 1107459414, 1095769653,
1103322737, 1107459414, 20374504, 20557421, 20557474, 20557527,
20557580, 20557633, 20557686, 20557739, 20557792, 20557845,
20557898, 20557951, 20373112, 20373032, 20573080, 20558163,
20558204, 20558245, 20558286, 20558327, 20558368, 20558409,
20558450, 20558491, 20558532, 20558573, 20558614, 1064299422,
1064635873, 1064948113, 1065291023, 1065314998, 1065345062,
1065350129, 1, 1, 4, 278, 278, 31, 16, 1074988052, 1076692416,
1076469703, 1076226148, 1066023726, 1057373083, 1041813708,
910775196, 897988541, 958014467, 1073709436, 1080820995,
1087871082,
1105555510, 996563946, 990516632, 1007439877, 1065342546,
1065343854, 1065326389, 0, 0, 0, 0, 0, 0, 0, 0, 1070595106,
1067663254, 1068637012, 1132005523, 1120866882, 1124400175,
1126997579, 1115152811, 1117702755, 1086250728, 1080808387,
1082966116, 1086250960, 1080808777, 1082966483, 1089896366,
1069203394, 987853215, 932092729, 20558872, 1123232006,
1124468680,
20374216, 20374168, 20374120, 20374072, 20559216, 20559267,
20374024, 20373976, 20373928, 20373464, 20373416, 20313024,
20559936, 20559987, 0 <repeats 1427 times>}
toast_oldvalues = {0 <repeats 1245 times>, 8599425, 0, 256,
140734115001632, 0, 0, 0, 0, 8602692, 0, 19089440,
140734115001632,
19356882, 140734115001634, 13444600, 3, 19356880, 13444600,
8796705,
16967, 8796705, 16967, 0 <repeats 23 times>, 140734115003392,
140734115003376, 10204964, 10204968, 0, 140087678209897, 0, 0,
140734115002240, 0, 0, 0, 0, 0, 140734115002192,
0 <repeats 11 times>, 32, 0, 0, 0, 0, 4294967295, 140087678216602,

0, 0, 4294967296, 18446744073709551615, 5, 0, 140087678216602, 0,

10204964, 0, 6, 10204968, 0, 0, 0, 0, 0, 0, 0, 0, 206158430232,
140734115003968, 140734115003776, 0 <repeats 125 times>,
140087698075212, 0, 1037797632, 7, 9, 140087700155240,
140087698075212, 10204964, 2090588023, 1, 9, 140087700155240,
140087698077630, 140734115003664, 140734115003456,
140087695826704,
140087695833544, 140734115003728, 32665437, 140734115003712,
4277816, 0, 140087700155384, 140087700264280, 4607254,
140087695834744, 4274192, 4294967296, 4294967581, 1,
140087700155384, 140734115003872, 140087700276488,
140734115003912,
140087700275632, 1, 140087698078061, 0, 140087700155384, 1, 0, 1,

140087700275632, 140734115003680, 140734115003664, 1037797632,
4495133, 0, 140087700276488, 140734115003728, 140734115003712,
2090588023, 4607254, 4294967295, 140087678425001, 140087695833544,

140087700264280, 165776, 140087681730168, 131072, 140087678425001,

18446744073709547520, 140087678400681, 0, 19492864,
140734115005392,
19327088, 165776, 13428184, 0, 27, 27, 0, 19327104,
140087698096119,
4294967297, 0, 18446744073709551504, 140087695833544, 28,
19327104,
140734115004064, 140087698122005, 0, 0, 0, 0, 140734115004064,
19328986, 140087678216602, 140087700275632, 108, 8796505...}
toast_oldexternal = {0x0 <repeats 181 times>,
0x833781 <hash_search_with_hash_value+193>, 0x3c, 0x100,
0x7fff36eec5e0, 0x0, 0x0, 0x0, 0x0, 0x834444 <string_hash+36>,
0x7b,
0x1234820, 0x7fff36eec5e0, 0x124eee3, 0x7fff36000000,
0xcd25f8 <ConfigureNamesString+4408>, 0x0, 0x0, 0x6aa00000000,
0x1100000009, 0xffffffff00540100, 0xffffffff1a5d09cb,
0xffffffff9b26ada0, 0xffffffff9bd60520, 0xffffffff9ccf30a0,
0xffffffff9da4c3a0, 0xffffffff9e9c9da0, 0xffffffff9f971aa0,
0xffffffffa085ba20, 0xffffffffa176fca0, 0xffffffffa2659c20,
0xffffffffa37bc8a0, 0xffffffffa44eb8a0, 0xffffffffa53ffb20,
0xffffffffa6256020, 0xffffffffa727c620, 0xffffffffa82a2c20,
0xffffffffa8ebf8a0, 0xffffffffaa00d3a0, 0xffffffffaad51520,
0xffffffffabe9f020, 0xffffffffacc76c20, 0xffffffffadc9d220,
0xffffffffaea74e20, 0xffffffffafa079a0, 0xffffffffb0873020,
0xffffffffb192d0a0, 0xffffffffb2704ca0, 0xffffffffb372b2a0,
0xffffffffb4502ea0, 0xffffffffb5495a20, 0xffffffffb63010a0,
0xffffffffb73276a0, 0xffffffffb80ff2a0, 0xffffffffb91258a0,
0xffffffffb9efd4a0, 0xffffffffbae90020, 0xffffffffbbd8f120,
0xffffffffbcdb5720, 0xffffffffbdb8d320, 0xffffffffbeb1fea0,
0xffffffffbf98b520, 0xffffffffc09b1b20, 0xffffffffc1789720,
0xffffffffc27afd20, 0xffffffffc3587920, 0xffffffffc451a4a0,
0xffffffffc5385b20, 0xffffffffc63ac120, 0xffffffffc758d6a0,
0xffffffffc7da09a0, 0xffffffffca162690, 0xffffffffca975990,
0xffffffffcbd11e90, 0xffffffffcc773b90, 0xffffffffcdb10090,
0xffffffffce605810, 0xffffffffcf90e290, 0xffffffffd06e5e90,
0xffffffffd1721610, 0xffffffffd1fb3210, 0xffffffffd269fe20,
0xffffffffd36329a0, 0xffffffffd449e020, 0xffffffffd51e21a0,
0xffffffffd542fd90, 0xffffffffd5dfe010, 0xffffffffd64eac20,
0xffffffffd6fe03a0, 0xffffffffd82e8e20, 0xffffffffd8f99520,
0xffffffffda0e7020, 0xffffffffdaebec20, 0xffffffffdbe517a0,
0xffffffffdccbce20, 0xffffffffddc4f9a0, 0xffffffffdeb4eaa0,
0xffffffffdfae1620, 0xffffffffe094cca0, 0xffffffffe17248a0,
0xffffffffe26b7420, 0xffffffffe3522aa0, 0xffffffffe45490a0,
0xffffffffe5320ca0, 0xffffffffe63dad20, 0xffffffffe71b2920,
0xffffffffe81454a0, 0xffffffffe8fb0b20, 0xffffffffe9fd7120,
0xffffffffeadaed20, 0xffffffffebdd5320, 0xffffffffecbacf20,
0xffffffffedb3faa0, 0xffffffffee9ab120, 0xffffffffef8167a0,
0xfffffffff09f7d20, 0xfffffffff16149a0, 0xfffffffff27f5f20,
0xfffffffff34a6620, 0xfffffffff45f4120, 0xfffffffff5210da0,
0xfffffffff63f2320, 0xfffffffff700efa0, 0xfffffffff81f0520,
0xfffffffff8e0d1a0, 0xfffffffff9fee720, 0xfffffffffac0b3a0,
0xfffffffffbe803a0, 0xfffffffffc7baba0, 0xfffffffffdc7bb70,
0x370c620, 0x4295820, 0x550a820, 0x6093a20, 0x7308a20, 0x7e91c20,

0x9106c20, 0x9c8fe20, 0xaf04e20, 0xbb21aa0, 0xcd03020, 0xd91fca0,

0xeb01220, 0xf71dea0, 0x10992ea0, 0x1151c0a0, 0x127910a0,
0x1331a2a0, 0x1458f2a0, 0x1523eb90, 0x1638c690, 0x1703cd90,
0x1818a890, 0x18e3af90, 0x19f88a90, 0x1ac39190, 0x1be1a710,
0x1cacae10, 0x1dc18910, 0x1e8c9010, 0x1fa16b10, 0x206c7210,
0x21814d10, 0x224c5410, 0x23612f10, 0x242c3610, 0x254a4b90,
0x260c1810, 0x272a2d90, 0x27f53490, 0x290a0f90, 0x29d51690,
0x2ae9f190, 0x2bb4f890, 0x2cc9d390, 0x2d94da90, 0x2ea9b590,
0x2f74bc90, 0x30899790, 0x315dd910, 0x3272b410, 0x333dbb10,
0x34529610, 0x351d9d10, 0x36327810, 0x36fd7f10, 0x381b9490,
0x38dd6110, 0x39fb7690, 0x3abd4310, 0x3bdb5890...}
toast_sizes = {0, 0, 9, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 41, 0, 0,
0,
3, 0 <repeats 49 times>, 53 <repeats 12 times>, 27, 27, 27,
41 <repeats 12 times>, 0 <repeats 61 times>, 9, 0, 0, 81, 81, 81,

81, 51, 51, 81, 81, 81, 81, 144, 144, 51, 51, 0 <repeats 1427
times>}
toast_free = {false <repeats 68 times>, true,
false <repeats 11 times>, true, true, true,
false <repeats 76 times>, true, true, true, true, false, false,
true, true, true, true, true, true, false <repeats 1429 times>}
toast_delold = {false <repeats 1600 times>}
#15 0x00000000004b8a87 in heap_multi_insert (relation=0x7f607fe1cbb8,
tuples=0x0, tuples@entry=0x1341428, ntuples=0, ntuples@entry=22, cid=0,

options=0, bistate=0x7fffffff, bistate@entry=0x12c9710) at
heapam.c:2730
xid = 1708296130
heaptuples = 0x136dc08
scratch = {
data =
"\002/\004\000\311\a\265\000\006\b\030\000\063#\000\000\277\303\a\000\023\062\064\065\061p325\000\000\000b\n\000\000\001\000\000\000\000\000\tPSF\000\000\000\000\000\000g\022\240\276\361\246n@\342\273\327E^1@@\372\036\231N\a\"\256N7A\240DF\355dC\226\000\000\000(\000\000\000\004\001\000\003\001\274\002\000\000\005\b\000\000\000\001\020L\264CB0V\313>B\004\033\002\001\000\000\000J\n\254<\000\000\000\000\230a\311\357p;\354@\001\252\367\061LF\354@\003",
'\000' <repeats 119 times>..., force_align_d = 2.9948299028864589e-305,
force_align_i64 = 50955530654723842}
page = 0x1 <error: Cannot access memory at address 0x1>
#16 0x0000000000586edf in CopyFromInsertBatch
(cstate=cstate@entry=0x12c92d0,
estate=estate@entry=0x12d0e80, mycid=mycid@entry=0,
hi_options=hi_options@entry=0,
resultRelInfo=resultRelInfo@entry=0x128fde8,
myslot=myslot@entry=0x12d2db8, bistate=0x12c9710, nBufferedTuples=22,
bufferedTuples=0x1341428, firstBufferedLineNo=2663) at copy.c:2973
oldcontext = 0x128c110
i = <optimized out>
save_cur_lineno = 2684
#17 0x000000000058c77e in CopyFrom (cstate=cstate@entry=0x12c92d0)
at copy.c:2815
slot = 0x12d2db8
skip_tuple = false
loaded_oid = 0
tuple = 0x136cbf8
tupDesc = 0x12beb88
values = 0x12d34b8
nulls = 0x1329c78
resultRelInfo = <optimized out>
saved_resultRelInfo = 0x0
estate = 0x12d0e80
mtstate = 0x12c9500
econtext = 0x12d1090
myslot = 0x12d2db8
oldcontext = 0x128c110
errcallback = {previous = 0x0,
callback = 0x586d00 <CopyFromErrorCallback>, arg = 0x12c92d0}
mycid = 0
hi_options = 0
bistate = 0x12c9710
processed = 2683
useHeapMultiInsert = true
nBufferedTuples = <optimized out>
prev_leaf_part_index = -1
bufferedTuples = 0x1341428
bufferedTuplesSize = 67408
firstBufferedLineNo = 2663
__func__ = "CopyFrom"
#18 0x000000000058cb53 in DoCopy (pstate=0x128c220, stmt=<optimized out>,
stmt_location=0, stmt_len=2354, processed=0x7fff36ef14b8) at
copy.c:996
cstate = 0x12c92d0
is_from = true
pipe = <optimized out>
rel = 0x7f607fd983a8
relid = 310933852
query = 0x0
__func__ = "DoCopy"
#19 0x0000000000731258 in standard_ProcessUtility (pstmt=0x1234560,
queryString=0x1233030 "COPY decals9
(release,brickid,brickname,objid,brick_primary,maskbits,fitbits,type,ra,dec,ra_ivar,dec_ivar,bx,by,dchisq,ebv,mjd_min,mjd_max,ref_cat,ref_id,pmra,pmdec,parallax,pmra_ivar,pmdec_ivar,paral"...,

context=PROCESS_UTILITY_TOPLEVEL, params=0xffffffffffffffff,
queryEnv=0x0,
dest=0x7fffffff, completionTag=0x7fff36ef1740 "") at utility.c:551
processed = 19943856
pstate = 0x128c220
__func__ = "standard_ProcessUtility"
#20 0x000000000072e5ba in PortalRunUtility (portal=0x12f4150,
pstmt=0x1234560,
isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>,
dest=0x1234670, completionTag=<optimized out>) at pquery.c:1178
utilityStmt = <optimized out>
snapshot = <optimized out>
#21 0x000000000072f027 in PortalRunMulti (portal=portal@entry=0x12f4150,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x1234670,

altdest=altdest@entry=0x1234670,
completionTag=completionTag@entry=0x7fff36ef1740 "") at pquery.c:1331
pstmt = 0x1234560
active_snapshot_set = false
stmtlist_item = 0x12bc400
#22 0x000000000072fbd7 in PortalRun (portal=0x12f4150,
count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimized out>, dest=0x1234670, altdest=0x1234670,
completionTag=0x7fff36ef1740 "") at pquery.c:799
save_exception_stack = 0x7fff36ef19b0
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {1, -6064393298424732441, 19645008,

140734115026752, 19874128, 2, 6064270506639647975,
-6064395009656331033}, __mask_was_saved = 0, __saved_mask =
{
__val = {0, 140734115026655, 0, 9981816, 8602692, 1, 19882336,

10328646, 19874128, 10328646, 1, 1, 112, 19644960, 19080960,

9895294}}}}
result = <optimized out>
nprocessed = <optimized out>
saveTopTransactionResourceOwner = 0x1285838
saveTopTransactionContext = 0x13050a0
saveActivePortal = 0x0
saveResourceOwner = 0x1285838
savePortalContext = 0x0
saveMemoryContext = 0x13050a0
__func__ = "PortalRun"
#23 0x000000000072be17 in exec_simple_query (
query_string=0x7f607fe1cbb8 "\001") at postgres.c:1145
parsetree = 0x12bc220
portal = 0x12f4150
commandTag = 0x7fff36ef1740 ""
completionTag =
"\000\335'\001\000\000\000\000\b\375'\001\002\000\000\000\000\000\000\000\002\000\000\000Q\000\000\000\000\000\000\000\060\060#\001\000\000\000\000\b\375'\001\000\000\000\000\b\375'\001",
'\000' <repeats 11 times>
plantree_list = 0x1234670
receiver = 0x1234670
format = 0
dest = DestRemote
parsetree_item = 0x12bc250
msec_str =
"\000\335'\001\000\000\000\000\b\375'\001\002\000\000\000\000\000\000\000\002\000\000\000Q\000\000\000\000\000\000"
__func__ = "exec_simple_query"
#24 0x000000000072dade in PostgresMain (argc=1, argv=0x1233030,
dbname=0x127fd20 "wsdb", username=0x127fd08 "koposov") at
postgres.c:4193
query_string = 0x1233030 "COPY decals9
(release,brickid,brickname,objid,brick_primary,maskbits,fitbits,type,ra,dec,ra_ivar,dec_ivar,bx,by,dchisq,ebv,mjd_min,mjd_max,ref_cat,ref_id,pmra,pmdec,parallax,pmra_ivar,pmdec_ivar,paral"...
firstchar = 19083312
input_message = {
data = 0x1233030 "COPY decals9
(release,brickid,brickname,objid,brick_primary,maskbits,fitbits,type,ra,dec,ra_ivar,dec_ivar,bx,by,dchisq,ebv,mjd_min,mjd_max,ref_cat,ref_id,pmra,pmdec,parallax,pmra_ivar,pmdec_ivar,paral"...,

len = 2356, maxlen = 4096, cursor = 2356}
local_sigjmp_buf = {{__jmpbuf = {2523489864, -6064395710111914777,

19397896, 19397896, 0, 19382096, 6064270506719339751,
-6064395007743597337}, __mask_was_saved = 1, __saved_mask =
{
__val = {0, 19397920, 140734115028992, 47, 19080960, 1024,
19398080, 19397896, 8711789, 0, 140734115027568,
140734115027664, 6479486, 13500160, 8610716, 19086720}}}}
send_ready_for_query = false
__func__ = "PostgresMain"
#25 0x000000000047e812 in BackendRun (port=0x128b780) at postmaster.c:4356
ac = 1
secs = 658859760
usecs = 545108
i = 1
av = 0x127fdc0
maxac = <optimized out>
#26 BackendStartup (port=0x128b780) at postmaster.c:4028
bn = 0x127bf50
pid = 0
#27 ServerLoop () at postmaster.c:1707
rmask = {fds_bits = {128, 0 <repeats 15 times>}}
selres = <optimized out>
now = <optimized out>
readmask = {fds_bits = {248, 0 <repeats 15 times>}}
last_lockfile_recheck_time = <optimized out>
last_touch_time = <optimized out>
__func__ = "ServerLoop"
#28 0x00000000006c39a3 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x122d3c0) at postmaster.c:1380
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = true
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#29 0x000000000047f748 in main (argc=3, argv=0x122d3c0) at main.c:228
No locals.
quit
A debugging session is active.

Inferior 1 [process 115811] will be detached.

Quit anyway? (y or n) Detaching from program: /opt/pgsql11/bin/postgres,
process 115811
***************************************
SCHEMA:
************************************

Table "koposov.decals9"
Column | Type | Collation |
Nullable | Default
-----------------------------------+--------------------+-----------+----------+---------
release | smallint | |
|
brickid | integer | |
|
brickname | text | |
|
objid | integer | |
|
brick_primary | smallint | |
|
maskbits | smallint | |
|
fitbits | smallint | |
|
type | text | |
|
ra | double precision | |
|
dec | double precision | |
|
ra_ivar | real | |
|
dec_ivar | real | |
|
bx | real | |
|
by | real | |
|
dchisq | real[] | |
|
ebv | real | |
|
mjd_min | double precision | |
|
mjd_max | double precision | |
|
ref_cat | text | |
|
ref_id | bigint | |
|
pmra | real | |
|
pmdec | real | |
|
parallax | real | |
|
pmra_ivar | real | |
|
pmdec_ivar | real | |
|
parallax_ivar | real | |
|
ref_epoch | real | |
|
gaia_phot_g_mean_mag | real | |
|
gaia_phot_g_mean_flux_over_error | real | |
|
gaia_phot_g_n_obs | smallint | |
|
gaia_phot_bp_mean_mag | real | |
|
gaia_phot_bp_mean_flux_over_error | real | |
|
gaia_phot_bp_n_obs | smallint | |
|
gaia_phot_rp_mean_mag | real | |
|
gaia_phot_rp_mean_flux_over_error | real | |
|
gaia_phot_rp_n_obs | smallint | |
|
gaia_phot_variable_flag | smallint | |
|
gaia_astrometric_excess_noise | real | |
|
gaia_astrometric_excess_noise_sig | real | |
|
gaia_astrometric_n_obs_al | smallint | |
|
gaia_astrometric_n_good_obs_al | smallint | |
|
gaia_astrometric_weight_al | real | |
|
gaia_duplicated_source | smallint | |
|
gaia_a_g_val | real | |
|
gaia_e_bp_min_rp_val | real | |
|
gaia_phot_bp_rp_excess_factor | real | |
|
gaia_astrometric_sigma5d_max | real | |
|
gaia_astrometric_params_solved | smallint | |
|
flux_g | real | |
|
flux_r | real | |
|
flux_z | real | |
|
flux_w1 | real | |
|
flux_w2 | real | |
|
flux_w3 | real | |
|
flux_w4 | real | |
|
flux_ivar_g | real | |
|
flux_ivar_r | real | |
|
flux_ivar_z | real | |
|
flux_ivar_w1 | real | |
|
flux_ivar_w2 | real | |
|
flux_ivar_w3 | real | |
|
flux_ivar_w4 | real | |
|
fiberflux_g | real | |
|
fiberflux_r | real | |
|
fiberflux_z | real | |
|
fibertotflux_g | real | |
|
fibertotflux_r | real | |
|
fibertotflux_z | real | |
|
apflux_g | real[] | |
|
apflux_r | real[] | |
|
apflux_z | real[] | |
|
apflux_resid_g | real[] | |
|
apflux_resid_r | real[] | |
|
apflux_resid_z | real[] | |
|
apflux_blobresid_g | real[] | |
|
apflux_blobresid_r | real[] | |
|
apflux_blobresid_z | real[] | |
|
apflux_ivar_g | real[] | |
|
apflux_ivar_r | real[] | |
|
apflux_ivar_z | real[] | |
|
apflux_masked_g | real[] | |
|
apflux_masked_r | real[] | |
|
apflux_masked_z | real[] | |
|
apflux_w1 | real[] | |
|
apflux_w2 | real[] | |
|
apflux_w3 | real[] | |
|
apflux_w4 | real[] | |
|
apflux_resid_w1 | real[] | |
|
apflux_resid_w2 | real[] | |
|
apflux_resid_w3 | real[] | |
|
apflux_resid_w4 | real[] | |
|
apflux_ivar_w1 | real[] | |
|
apflux_ivar_w2 | real[] | |
|
apflux_ivar_w3 | real[] | |
|
apflux_ivar_w4 | real[] | |
|
mw_transmission_g | real | |
|
mw_transmission_r | real | |
|
mw_transmission_z | real | |
|
mw_transmission_w1 | real | |
|
mw_transmission_w2 | real | |
|
mw_transmission_w3 | real | |
|
mw_transmission_w4 | real | |
|
nobs_g | smallint | |
|
nobs_r | smallint | |
|
nobs_z | smallint | |
|
nobs_w1 | smallint | |
|
nobs_w2 | smallint | |
|
nobs_w3 | smallint | |
|
nobs_w4 | smallint | |
|
rchisq_g | real | |
|
rchisq_r | real | |
|
rchisq_z | real | |
|
rchisq_w1 | real | |
|
rchisq_w2 | real | |
|
rchisq_w3 | real | |
|
rchisq_w4 | real | |
|
fracflux_g | real | |
|
fracflux_r | real | |
|
fracflux_z | real | |
|
fracflux_w1 | real | |
|
fracflux_w2 | real | |
|
fracflux_w3 | real | |
|
fracflux_w4 | real | |
|
fracmasked_g | real | |
|
fracmasked_r | real | |
|
fracmasked_z | real | |
|
fracin_g | real | |
|
fracin_r | real | |
|
fracin_z | real | |
|
anymask_g | smallint | |
|
anymask_r | smallint | |
|
anymask_z | smallint | |
|
allmask_g | smallint | |
|
allmask_r | smallint | |
|
allmask_z | smallint | |
|
wisemask_w1 | smallint | |
|
wisemask_w2 | smallint | |
|
psfsize_g | real | |
|
psfsize_r | real | |
|
psfsize_z | real | |
|
psfdepth_g | real | |
|
psfdepth_r | real | |
|
psfdepth_z | real | |
|
galdepth_g | real | |
|
galdepth_r | real | |
|
galdepth_z | real | |
|
nea_g | real | |
|
nea_r | real | |
|
nea_z | real | |
|
blob_nea_g | real | |
|
blob_nea_r | real | |
|
blob_nea_z | real | |
|
psfdepth_w1 | real | |
|
psfdepth_w2 | real | |
|
psfdepth_w3 | real | |
|
psfdepth_w4 | real | |
|
wise_coadd_id | text | |
|
wise_x | real | |
|
wise_y | real | |
|
lc_flux_w1 | real[] | |
|
lc_flux_w2 | real[] | |
|
lc_flux_ivar_w1 | real[] | |
|
lc_flux_ivar_w2 | real[] | |
|
lc_nobs_w1 | smallint[] | |
|
lc_nobs_w2 | smallint[] | |
|
lc_fracflux_w1 | real[] | |
|
lc_fracflux_w2 | real[] | |
|
lc_rchisq_w1 | real[] | |
|
lc_rchisq_w2 | real[] | |
|
lc_mjd_w1 | double precision[] | |
|
lc_mjd_w2 | double precision[] | |
|
lc_epoch_index_w1 | smallint[] | |
|
lc_epoch_index_w2 | smallint[] | |
|
sersic | real | |
|
sersic_ivar | real | |
|
shape_r | real | |
|
shape_r_ivar | real | |
|
shape_e1 | real | |
|
shape_e1_ivar | real | |
|
shape_e2 | real | |
|
shape_e2_ivar | real | |
|

#2Fujii Masao
masao.fujii@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

On 2020/11/17 3:15, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16722
Logged by: Sergey Koposov
Email address: skoposov@ed.ac.uk
PostgreSQL version: 11.9
Operating system: debian
Description:

Hi,

When ingesting a billion or so rows in the table that has some array columns
(see schema below), at some point all the backends doing the ingestion hang
with 100% CPU usage. When investigating, I think I can trace this to the
limit of 2^32 toasted records per table.

Yes, this is the limitation of the number of out-of-line values in toast.
https://wiki.postgresql.org/wiki/TOAST#Total_table_size_limit

See the gdb full backtrace of
hanged backends in the bottom. When the problem occurs, it has 272 mill
records.
Basically the hanging happens in GetNewOidWithIndex called by
toast_save_datum.
While I understand the limit for the toast number is there to stay, but the
behaviour of PG next to the limit is IMO a bug (or at least non-trivial) .

I would rather prefer to see an error-message as opposed to backends
hanging.

To emit an error, we need to check that there is no unused OID for
the toast and it would take very long to do that. So I'm not sure
if to emit an error message really improves the current situation or not.

OTOH it might be good idea to emit a warning message (every time we
cannot find unused OID in recent 1 million OIDs, for example) or report
the issue as wait event, or something while the record insertion is hanging
because of toast limit, so that we can easily detect the issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#3Magnus Hagander
magnus@hagander.net
In reply to: Fujii Masao (#2)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

On Wed, Nov 18, 2020 at 2:18 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/11/17 3:15, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16722
Logged by: Sergey Koposov
Email address: skoposov@ed.ac.uk
PostgreSQL version: 11.9
Operating system: debian
Description:

Hi,

When ingesting a billion or so rows in the table that has some array columns
(see schema below), at some point all the backends doing the ingestion hang
with 100% CPU usage. When investigating, I think I can trace this to the
limit of 2^32 toasted records per table.

Yes, this is the limitation of the number of out-of-line values in toast.
https://wiki.postgresql.org/wiki/TOAST#Total_table_size_limit

See the gdb full backtrace of
hanged backends in the bottom. When the problem occurs, it has 272 mill
records.
Basically the hanging happens in GetNewOidWithIndex called by
toast_save_datum.
While I understand the limit for the toast number is there to stay, but the
behaviour of PG next to the limit is IMO a bug (or at least non-trivial) .

I would rather prefer to see an error-message as opposed to backends
hanging.

To emit an error, we need to check that there is no unused OID for
the toast and it would take very long to do that. So I'm not sure
if to emit an error message really improves the current situation or not.

OTOH it might be good idea to emit a warning message (every time we
cannot find unused OID in recent 1 million OIDs, for example) or report
the issue as wait event, or something while the record insertion is hanging
because of toast limit, so that we can easily detect the issue.

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#3)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On 2020-11-18 11:57:05 +0100, Magnus Hagander wrote:

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

I don't think this is likely to work well as a wait event. All the index
traversals etc will do IO, acquire locks, etc, which will overwrite the
wait event and reset it to nothing once done.

Greetings,

Andres Freund

#5Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#4)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

On Wed, Nov 18, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-11-18 11:57:05 +0100, Magnus Hagander wrote:

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

I don't think this is likely to work well as a wait event. All the index
traversals etc will do IO, acquire locks, etc, which will overwrite the
wait event and reset it to nothing once done.

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#6Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#5)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On 2020-11-18 23:12:10 +0100, Magnus Hagander wrote:

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I'd rather spend effort to solve the underlying issue. It's a bit more
work, but not crazily so, I think. Even the pg_upgrade pieces aren't
that hard to solve.

Greetings,

Andres Freund

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#6)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Andres Freund <andres@anarazel.de> writes:

On 2020-11-18 23:12:10 +0100, Magnus Hagander wrote:

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I'd rather spend effort to solve the underlying issue. It's a bit more
work, but not crazily so, I think. Even the pg_upgrade pieces aren't
that hard to solve.

Yeah. If we're going to put work into this, widening the IDs used
to identify toast values seems like the right work to be doing.

Having said that, I had imagined that we might never have to fix it,
because if your table is big enough that it has a problem of this
ilk then likely you want to partition it anyway. And partitioning
solves the problem since each partition has its own toast table.

regards, tom lane

#8Sergey Koposov
Sergey.Koposov@ed.ac.uk
In reply to: Tom Lane (#7)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

On Wed, 2020-11-18 at 19:02 -0500, Tom Lane wrote:

This email was sent to you by someone outside the University.
You should only click on links or attachments if you are certain that the email is genuine and the content is safe.

Andres Freund <andres@anarazel.de> writes:

On 2020-11-18 23:12:10 +0100, Magnus Hagander wrote:

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I'd rather spend effort to solve the underlying issue. It's a bit more
work, but not crazily so, I think. Even the pg_upgrade pieces aren't
that hard to solve.

Yeah. If we're going to put work into this, widening the IDs used
to identify toast values seems like the right work to be doing.

Having said that, I had imagined that we might never have to fix it,
because if your table is big enough that it has a problem of this
ilk then likely you want to partition it anyway. And partitioning
solves the problem since each partition has its own toast table.

I can say that for my case of very large static clustered tables the
partitioning doesn't really offer much benefits, only overhead (To the best
of my understanding at least), so I'm not using it.

Sergey

PS And the current workaround that I used for this particular dataset that I reported
was to just expand a few arrays that I have in the table into columns
(as all arrays had static lengths) to not have any toasts at all.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

#9Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On 2020-11-18 19:02:34 -0500, Tom Lane wrote:

Having said that, I had imagined that we might never have to fix it,
because if your table is big enough that it has a problem of this
ilk then likely you want to partition it anyway. And partitioning
solves the problem since each partition has its own toast table.

A few billion rows isn't that much anymore, and partitioning has a fair
number of restrictions (plus this only needs 4 billion toasted fields,
not 4 billion rows). More importantly, to hit problems around this, one
doesn't even have to have all those rows in a table at once -
performance will suffer once the oid counter has wrapped around,
especially if there's longer sequences of assigned values.

Yeah. If we're going to put work into this, widening the IDs used
to identify toast values seems like the right work to be doing.

To outline, here's what I think the two major pieces to get there are:

1) Make toast oid assignment independent of the oid counter. The easiest
way likely is to also create a serial and use that. That alone
improves the situation considerably, because it takes much longer to
to wrap around in each toast table.

The overhead of the additional WAL records isn't nothing, but
compared to maintaining a btree it's not likely to be measurable.

2) Widen non pg-upgraded toast tables to have 64bit chunk_id field. To
reference chunks >= 2^32, add VARTAG_ONDISK64, which is only used
when needed.

Greetings,

Andres Freund

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#9)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Andres Freund <andres@anarazel.de> writes:

On 2020-11-18 19:02:34 -0500, Tom Lane wrote:

Yeah. If we're going to put work into this, widening the IDs used
to identify toast values seems like the right work to be doing.

To outline, here's what I think the two major pieces to get there are:

1) Make toast oid assignment independent of the oid counter. The easiest
way likely is to also create a serial and use that. That alone
improves the situation considerably, because it takes much longer to
to wrap around in each toast table.

I'm not really on board with creating an actual sequence object for each
toast table. I don't think they are efficient enough.

I had imagined internally widening the OID counter to 64 bits, of which
ordinary OID usage would just discard the high 32 bits, but for TOAST
chunk IDs we could use the whole value. Now, compared to your idea
that would result in using the wider toast-pointer format more often,
but I think there are ways around that. For example, we could do

(1) Grab the next 64-bit OID.
(2) Probe the toast table to see if the low 32 bits correspond to
a value in use in that toast table. If not, just use the low 32 bits.
(3) Otherwise, use the whole 64 bits and the wider pointer format.
(We could do another uniqueness probe here, but I think it's unnecessary,
since the 64-bit counter will never wrap around.)

Since the existing OID assignment mechanism requires a uniqueness
probe already, this isn't any more expensive than what we do now.

regards, tom lane

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On 2020-11-18 20:17:55 -0500, Tom Lane wrote:

I'm not really on board with creating an actual sequence object for each
toast table. I don't think they are efficient enough.

I'm not convinced that's true. Sure, nextval() could stand to be cheaper
- but on the other hand, this would allow us to entirely avoid doing
ahead-of-time uniqueness probes. Needing one extra WAL record every
SEQ_LOG_VALS/32 toast insertions seems likely to be win over doing
at least SEQ_LOG_VALS-1 additional uniqueness index probes.

It's not like the oid counter is free of contention - one benefit of
what I suggest would be that the potential contention point would move
from system wide to table wide.

I think reducing the rate of oid counter advancement is a significant
benefit, moving the point of where we have to care about efficiency
problems around system oid assignment costs quite a bit further into the
future.

I think there's also some small potential benefits around indexing in
having densely allocated toast chunk ids for each toast table, over the
sparser allocation we have right now.

Greetings,

Andres Freund

#12tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Magnus Hagander (#5)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi

Oh meh. Yeah, I didn't think about the actual implementation of doing it

:/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I think the statistics on OID assignments per-table is a good idea, but how
about simply logging the event to notify users as the first step?
I think it's easy to implement and helpful for users.

First of all, I would like to provide a way for the user to know that it is
taking a long time to generate the OID, and provide an opportunity for the
user to deal with this problem with statement_timeout etc.
This is an interim measure, but I think it's better than it is.

Since it is complicated to handle WaitEvent and statistics for each table,
I first considered a patch that simply outputs logs, giving priority to
avoiding a hang state without outputting logs.

* v01-0001-OID_log_output.patch
In this patch, GetNewOidWithIndex outputs a log when it loops too many
times to assign a new OID.
Also, if the log is output even once, it will be output to the log when
the OID is assigned.

* This patch does not cause an error even if it loops too many times.
How much time can be tolerated in the OID search loop depends on the
user, so I think that the user should decide the time to make an error with
statement_timeout.

* The log output interval threshold increases by "* = 2"(exponential
backoff) and the maximum number of log outputs is set to 5.
The reason for setting this threshold is to reduce the number of log
outputs and output the log before stopping at statement_timeout etc. I
think the appropriate value for this threshold is controversial.
Since it is not necessary to keep logging, the upper limit is set to 5
times.

* The log output interval threshold is set to 1 million for the first
log output.
In my environment, it takes about 4 seconds to loop 1 million times,
and the 5th log is output after about 2 minutes.I think this threshold is
just right considering statements_timeout, but it's controversial.

Log output does not need to continue output during OID assignment. I think
the first few and the last log are enough.

I will add this patch to the commitfest. I look forward to your feedback
about the patch.

By the way
How about distributing OIDs so that the GetNewOidWithIndex function doesn't
take long to find an available OID?
For example, skip one and assign an OID.

As a side effect of this method, the OID progresses faster.
(Maybe the btree index will be fragmented faster)

It may be better to include skips only for TOAST chunk_id, as this method
may have other side effects.

regards.
--
Tomohiro Hiramitsu
NTT Open Source Software Center

2020年12月21日(月) 16:44 Magnus Hagander <magnus@hagander.net>:

Show quoted text

On Wed, Nov 18, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-11-18 11:57:05 +0100, Magnus Hagander wrote:

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

I don't think this is likely to work well as a wait event. All the index
traversals etc will do IO, acquire locks, etc, which will overwrite the
wait event and reset it to nothing once done.

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Attachments:

v1-0001-GetNewOidWithIndex_log_output.patchapplication/octet-stream; name=v1-0001-GetNewOidWithIndex_log_output.patchDownload+27-1
#13Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: tomohiro hiramitsu (#12)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On Mon, Dec 28, 2020 at 4:31 PM tomohiro hiramitsu <hiramit.tm@gmail.com> wrote:

Hi

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I think the statistics on OID assignments per-table is a good idea, but how about simply logging the event to notify users as the first step?
I think it's easy to implement and helpful for users.

+1
It is a simple and quick way for users to know if the OID assignment
is taking a long time or not.
At least it will be better than now.

First of all, I would like to provide a way for the user to know that it is taking a long time to generate the OID, and provide an opportunity for the user to deal with this problem with statement_timeout etc.
This is an interim measure, but I think it's better than it is.

Since it is complicated to handle WaitEvent and statistics for each table, I first considered a patch that simply outputs logs, giving priority to avoiding a hang state without outputting logs.

* v01-0001-OID_log_output.patch
In this patch, GetNewOidWithIndex outputs a log when it loops too many times to assign a new OID.
Also, if the log is output even once, it will be output to the log when the OID is assigned.

* This patch does not cause an error even if it loops too many times.
How much time can be tolerated in the OID search loop depends on the user, so I think that the user should decide the time to make an error with statement_timeout.

* The log output interval threshold increases by "* = 2"(exponential backoff) and the maximum number of log outputs is set to 5.
The reason for setting this threshold is to reduce the number of log outputs and output the log before stopping at statement_timeout etc. I think the appropriate value for this threshold is controversial.
Since it is not necessary to keep logging, the upper limit is set to 5 times.

* The log output interval threshold is set to 1 million for the first log output.
In my environment, it takes about 4 seconds to loop 1 million times, and the 5th log is output after about 2 minutes.I think this threshold is just right considering statements_timeout, but it's controversial.

Log output does not need to continue output during OID assignment. I think the first few and the last log are enough.

I will add this patch to the commitfest. I look forward to your feedback about the patch.

By the way
How about distributing OIDs so that the GetNewOidWithIndex function doesn't take long to find an available OID?
For example, skip one and assign an OID.

As a side effect of this method, the OID progresses faster.
(Maybe the btree index will be fragmented faster)

I think the idea of assigning OIDs to sparse is interesting.
However, I think it needs to find out how much it will affect the performance.
It could be a non-negligible overhead, especially for large amounts of
data insertion/updating.

Best regards,

It may be better to include skips only for TOAST chunk_id, as this method may have other side effects.

regards.
--
Tomohiro Hiramitsu
NTT Open Source Software Center

2020年12月21日(月) 16:44 Magnus Hagander <magnus@hagander.net>:

On Wed, Nov 18, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-11-18 11:57:05 +0100, Magnus Hagander wrote:

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

I don't think this is likely to work well as a wait event. All the index
traversals etc will do IO, acquire locks, etc, which will overwrite the
wait event and reset it to nothing once done.

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#14tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Kasahara Tatsuhito (#13)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi

On Wed, Jan 6, 2021 at 11:13 AM Kasahara Tatsuhito <
kasahara.tatsuhito@gmail.com> wrote:

By the way
How about distributing OIDs so that the GetNewOidWithIndex function

doesn't take long to find an available OID?

For example, skip one and assign an OID.

As a side effect of this method, the OID progresses faster.
(Maybe the btree index will be fragmented faster)

I think the idea of assigning OIDs to sparse is interesting.
However, I think it needs to find out how much it will affect the
performance.
It could be a non-negligible overhead, especially for large amounts of
data insertion/updating.

I agree with you. I can't predict the impact of this method on performance.
I may need to do some performance tests to find out the impact of this
method.

Thank you for your reply.

regards.
--
Tomohiro Hiramitsu
NTT Open Source Software Center

#15Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kasahara Tatsuhito (#13)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On Wed, Jan 6, 2021 at 11:13 AM Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

Hi,

On Mon, Dec 28, 2020 at 4:31 PM tomohiro hiramitsu <hiramit.tm@gmail.com> wrote:

Hi

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

I think the statistics on OID assignments per-table is a good idea, but how about simply logging the event to notify users as the first step?
I think it's easy to implement and helpful for users.

+1
It is a simple and quick way for users to know if the OID assignment
is taking a long time or not.
At least it will be better than now.

First of all, I would like to provide a way for the user to know that it is taking a long time to generate the OID, and provide an opportunity for the user to deal with this problem with statement_timeout etc.
This is an interim measure, but I think it's better than it is.

Since it is complicated to handle WaitEvent and statistics for each table, I first considered a patch that simply outputs logs, giving priority to avoiding a hang state without outputting logs.

* v01-0001-OID_log_output.patch
In this patch, GetNewOidWithIndex outputs a log when it loops too many times to assign a new OID.
Also, if the log is output even once, it will be output to the log when the OID is assigned.

* This patch does not cause an error even if it loops too many times.
How much time can be tolerated in the OID search loop depends on the user, so I think that the user should decide the time to make an error with statement_timeout.

* The log output interval threshold increases by "* = 2"(exponential backoff) and the maximum number of log outputs is set to 5.
The reason for setting this threshold is to reduce the number of log outputs and output the log before stopping at statement_timeout etc. I think the appropriate value for this threshold is controversial.
Since it is not necessary to keep logging, the upper limit is set to 5 times.

* The log output interval threshold is set to 1 million for the first log output.
In my environment, it takes about 4 seconds to loop 1 million times, and the 5th log is output after about 2 minutes.I think this threshold is just right considering statements_timeout, but it's controversial.

Log output does not need to continue output during OID assignment. I think the first few and the last log are enough.

I will add this patch to the commitfest. I look forward to your feedback about the patch.

I tested the patch.
The patch can be applied with the current HEAD
(eb42110d952f8d1ad4049b8f2491e9dfba75ffed) and the regression test
passed.

I actually tried to reproduce OID assignment conflicts using the
following procedure.
================================================================================================
DROP TABLE IF EXISTS t1;

-- Create a table. SET TOAST_TUPLE_TARGET for generate much TOAST records.
SELECT 'CREATE UNLOGGED TABLE t1 ( c1 int,' || string_agg(str, ',') ||
') WITH (TOAST_TUPLE_TARGET = 128);'
FROM (SELECT ' c' || i || ' text' as str FROM generate_series(2,21)i )t;\gexec

-- Generate data and insert. This query generate 20 TOAST records per-rows.
-- So this results in 16 million TOAST rows (same as
GETNEWOID_NOTIFICATION_LIMIT).
WITH r AS (SELECT string_agg(str, '') as d FROM (SELECT
md5(clock_timestamp()::text) AS str
FROM generate_series(1,4)t)s)
INSERT INTO t1 SELECT generate_series(1,800000)
,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d FROM r;

-- Check min(chunk_id) to rewind OID following test
SELECT 'SELECT min(chunk_id) FROM pg_toast.' || relname
FROM pg_class WHERE oid = (SELECT reltoastrelid FROM pg_class WHERE
relname = 't1');\gexec

-- Check normal(non OID conflicting) insert's duration
\timing
WITH r AS (SELECT string_agg(str, '') as d FROM (SELECT
md5(clock_timestamp()::text) AS str
FROM generate_series(1,4)t)s)
INSERT INTO t1 SELECT 1,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d FROM r;

INSERT 0 1
Time: 0.899 ms

-- Reset OID to the value of min(chunk_id) by using pg_resetwal (-o)
-- Check problematic(OID conflicting) insert's duration
\timing
WITH r AS (SELECT string_agg(str, '') as d FROM (SELECT
md5(clock_timestamp()::text) AS str
FROM generate_series(1,4)t)s)
INSERT INTO t1 SELECT 1,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d,d FROM r;

INSERT 0 1
Time: 36399.478 ms (00:36.399)
================================================================================================

And I got following log messages.
(The STATEMENT section is long and has been cut)
===================================================================================================
2021-02-18 19:04:59.055 JST [32153] LOG: failed to assign new OID in
relation "pg_toast_32018451" after 1000000 retries
2021-02-18 19:04:59.055 JST [32153] STATEMENT: WITH r ...
2021-02-18 19:05:01.299 JST [32153] LOG: failed to assign new OID in
relation "pg_toast_32018451" after 2000000 retries
2021-02-18 19:05:01.299 JST [32153] STATEMENT: WITH r ...
2021-02-18 19:05:05.802 JST [32153] LOG: failed to assign new OID in
relation "pg_toast_32018451" after 4000000 retries
2021-02-18 19:05:05.802 JST [32153] STATEMENT: WITH r ...
2021-02-18 19:05:14.853 JST [32153] LOG: failed to assign new OID in
relation "pg_toast_32018451" after 8000000 retries
2021-02-18 19:05:14.853 JST [32153] STATEMENT: WITH r ...
2021-02-18 19:05:33.185 JST [32153] LOG: failed to assign new OID in
relation "pg_toast_32018451" after 16000000 retries
2021-02-18 19:05:33.185 JST [32153] STATEMENT: WITH r ...
2021-02-18 19:05:33.185 JST [32153] LOG: the new OID has been
assigned in relation "pg_toast_32018451" after 16000021 retries
2021-02-18 19:05:33.185 JST [32153] STATEMENT: WITH r ...
===================================================================================================

It works as expected, and I think the message is useful.

Here are some comments about the code.

+#define GETNEWOID_NOTIFICATION_INTERVAL_FACTOR 1000000

It seems to be more of a minimum value for conflict detection rather
than an interval factor.
So, it would be better to use another name, such as
GETNEWOID_NOTIFICATION_MINVAL or others.

+        /* retry count and notification limit check */
+        if (retry_count == next_notify && next_notify <=
GETNEWOID_NOTIFICATION_LIMIT)
+        {
+            ereport(LOG,
+                (errmsg("failed to assign new OID in relation \"%s\"
after "UINT64_FORMAT" retries",
+                    RelationGetRelationName(relation), retry_count)));
+            next_notify *= 2; /* double it for the next notification */
+        }
+        retry_count++;

I think that we should add more comments about why output this LOG message.

Best regards,

By the way
How about distributing OIDs so that the GetNewOidWithIndex function doesn't take long to find an available OID?
For example, skip one and assign an OID.

As a side effect of this method, the OID progresses faster.
(Maybe the btree index will be fragmented faster)

I think the idea of assigning OIDs to sparse is interesting.
However, I think it needs to find out how much it will affect the performance.
It could be a non-negligible overhead, especially for large amounts of
data insertion/updating.

Best regards,

It may be better to include skips only for TOAST chunk_id, as this method may have other side effects.

regards.
--
Tomohiro Hiramitsu
NTT Open Source Software Center

2020年12月21日(月) 16:44 Magnus Hagander <magnus@hagander.net>:

On Wed, Nov 18, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-11-18 11:57:05 +0100, Magnus Hagander wrote:

This definitely sounds like something that's worth putting out as a
wait event. Even before you start traversing millions of OIDs it might
gradually start to show up, and being able to monitor that would
definitely be useful.

I don't think this is likely to work well as a wait event. All the index
traversals etc will do IO, acquire locks, etc, which will overwrite the
wait event and reset it to nothing once done.

Oh meh. Yeah, I didn't think about the actual implementation of doing it :/

That said, I'd really prefer to see *some* kind of counter that could
get people noticing this a bit earlier, rather than just bugging them
in the logfile once it's gone over a threshold. A statistics counter
maybe, but I guess that'd have to be tracked at a per-table level to
be really useful, which would make it potentially fairly expensive to
keep around...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#16tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Kasahara Tatsuhito (#15)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Hi,

On Wed, Jan 6, 2021 at 11:13 AM Kasahara Tatsuhito
<kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:

Here are some comments about the code.

+#define GETNEWOID_NOTIFICATION_INTERVAL_FACTOR 1000000

It seems to be more of a minimum value for conflict detection rather
than an interval factor.
So, it would be better to use another name, such as
GETNEWOID_NOTIFICATION_MINVAL or others.

+        /* retry count and notification limit check */
+        if (retry_count == next_notify && next_notify <=
GETNEWOID_NOTIFICATION_LIMIT)
+        {
+            ereport(LOG,
+                (errmsg("failed to assign new OID in relation \"%s\"
after "UINT64_FORMAT" retries",
+                    RelationGetRelationName(relation), retry_count)));
+            next_notify *= 2; /* double it for the next notification */
+        }
+        retry_count++;

I think that we should add more comments about why output this LOG

message.

Thank you for your comment.
I reflected your comment in the patch.

* v2-0001-GetNewOidWithIndex_log_output.patch

Best regards,

Attachments:

v2-0001-GetNewOidWithIndex_log_output.patchapplication/octet-stream; name=v2-0001-GetNewOidWithIndex_log_output.patchDownload+32-1
#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: tomohiro hiramitsu (#16)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

At Mon, 1 Mar 2021 10:56:50 +0900, tomohiro hiramitsu <hiramit.tm@gmail.com> wrote in

Hi,

On Wed, Jan 6, 2021 at 11:13 AM Kasahara Tatsuhito
<kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:

Here are some comments about the code.

+#define GETNEWOID_NOTIFICATION_INTERVAL_FACTOR 1000000

It seems to be more of a minimum value for conflict detection rather
than an interval factor.
So, it would be better to use another name, such as
GETNEWOID_NOTIFICATION_MINVAL or others.

+        /* retry count and notification limit check */
+        if (retry_count == next_notify && next_notify <=
GETNEWOID_NOTIFICATION_LIMIT)
+        {
+            ereport(LOG,
+                (errmsg("failed to assign new OID in relation \"%s\"
after "UINT64_FORMAT" retries",
+                    RelationGetRelationName(relation), retry_count)));
+            next_notify *= 2; /* double it for the next notification */
+        }
+        retry_count++;

I think that we should add more comments about why output this LOG

message.

Thank you for your comment.
I reflected your comment in the patch.

* v2-0001-GetNewOidWithIndex_log_output.patch

+ 			 * The query waits until the OID generation is successful.
+ 			 * If the number of retries for OID generation is large,
+ 			 * the number of retries will be notified.
+ 			 */

Perhaps we need a explanation about the logarithmic reduction of the
frequency. It would be something like to avoid showing that message
too often. (And I don't get what the "The query" points to..)

+			ereport(LOG,
+				(errmsg("failed to assign new OID in relation \"%s\" after "UINT64_FORMAT" retries", 

The embedded UINT64_FORMAT makes the message untranslatable on some
platforms. (Note that errmsg() is used for user-facing messages and
the message format string has translations.) The format string should
be handled separately from the message body. You will find an example
in WalReceiverMain().

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#17)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Mon, 1 Mar 2021 10:56:50 +0900, tomohiro hiramitsu <hiramit.tm@gmail.com> wrote in 
+			ereport(LOG,
+				(errmsg("failed to assign new OID in relation \"%s\" after "UINT64_FORMAT" retries", 

The embedded UINT64_FORMAT makes the message untranslatable on some
platforms.

Right, we can't put platform-dependent substrings there.

The format string should
be handled separately from the message body. You will find an example
in WalReceiverMain().

Lately we've been using "%llu" or "%lld" for 64-bit values, and explicitly
casting the argument to [unsigned] long long. That seems to work
everywhere now that we always use our own snprintf code, and it's less
notationally painful than other ways.

regards, tom lane

#19tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Tom Lane (#18)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

Perhaps we need a explanation about the logarithmic reduction of the
frequency. It would be something like to avoid showing that message
too often. (And I don't get what the "The query" points to..)

I changed the patch comment.

Lately we've been using "%llu" or "%lld" for 64-bit values, and explicitly
casting the argument to [unsigned] long long. That seems to work
everywhere now that we always use our own snprintf code, and it's less
notationally painful than other ways.

Thank you for your comment.
I modified the patch to use "% llu".

In addition, I thought the variable name was long, so I shortened it a
little.

* v3-0001-GetNewOidWithIndex_log_output.patch

Best regards,

Attachments:

v3-0001-GetNewOidWithIndex_log_output.patchapplication/octet-stream; name=v3-0001-GetNewOidWithIndex_log_output.patchDownload+31-1
#20Fujii Masao
masao.fujii@gmail.com
In reply to: tomohiro hiramitsu (#19)
Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.

On 2021/03/11 14:10, tomohiro hiramitsu wrote:

Perhaps we need a explanation about the logarithmic reduction of the
frequency. It would be something like to avoid showing that message
too often. (And I don't get what the "The query" points to..)

I changed the patch comment.

Lately we've been using "%llu" or "%lld" for 64-bit values, and explicitly
casting the argument to [unsigned] long long.  That seems to work
everywhere now that we always use our own snprintf code, and it's less
notationally painful than other ways.

Thank you for your comment.
I modified the patch to use "% llu".

In addition, I thought the variable name was long, so I shortened it a little.

 * v3-0001-GetNewOidWithIndex_log_output.patch

+		if (retry_count == next_notify && next_notify <= GETNEWOID_NOTIFY_LIMIT)
+		{
+			ereport(LOG,
+				(errmsg("failed to assign new OID in relation \"%s\" after \"%llu\" retries",
+					RelationGetRelationName(relation), (unsigned long long) retry_count)));
+			next_notify *= 2; /* double it for the next notification */
+		}

With the patch, if the number of retries exceeds GETNEWOID_NOTIFY_LIMIT,
we stop logging notification messages. Why do we need to do that?
This behavior may cause user to get confused? For example, they may think
that OID has been successfully assigned because there are no notification
messages in the log for a while. So isn't it better to log notification
messages every GETNEWOID_NOTIFY_LIMIT after the number of retries
exceeds that?

Since users basically don't know the logic of GetNewOidWithIndex(),
they cannot understand what "XXX retries" part in the message means?
I have no good idea for the message. But what about something like
the following?

LOG: still finding an unused OID within relation \"%s\"
DETAIL: \"%llu\" OID candidates were checked, but no unused OID is yet found.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Fujii Masao (#20)
#22tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: tomohiro hiramitsu (#21)
#23Fujii Masao
masao.fujii@gmail.com
In reply to: tomohiro hiramitsu (#22)
#24tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Fujii Masao (#23)
#25tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: tomohiro hiramitsu (#24)
#26Fujii Masao
masao.fujii@gmail.com
In reply to: tomohiro hiramitsu (#25)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#26)
#28tomohiro hiramitsu
hiramit.tm@gmail.com
In reply to: Fujii Masao (#27)